How does the failover mechanism work?
RAC relies on the cluster services for failure detection. The cluster services are a distributed kernel component that monitors whether cluster members (nodes) can communicate with each other and, through this process, enforces the rules of cluster membership. In Oracle Database 10g, this function is performed by Cluster Synchronization Services (CSS), through the CSSD process.
The functions performed by CSS can be broadly listed as follows:
- Forms a cluster, adds members to a cluster, and removes members from a cluster
- Tracks which members in a cluster are active
- Maintains a cluster membership list that is consistent on all member nodes
- Provides timely notification of membership changes
- Detects and handles possible cluster partitions
- Monitors group membership
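To make the membership duties above more concrete, here is a minimal sketch in Python. It is a toy, in-memory illustration only, not Oracle's implementation or API: the class `MembershipList` and its methods are hypothetical names invented for this example, and real CSS keeps the list consistent across all nodes, which a single-process sketch cannot show.

```python
from typing import Callable, List, Set


class MembershipList:
    """Toy stand-in for a cluster membership list (hypothetical, not Oracle's API)."""

    def __init__(self) -> None:
        self._active: Set[str] = set()
        self._listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        # Register a callback that is told about every membership change.
        self._listeners.append(callback)

    def add_member(self, node: str) -> None:
        # Adding a node that is already active is a no-op (no duplicate event).
        if node not in self._active:
            self._active.add(node)
            self._notify("joined", node)

    def remove_member(self, node: str) -> None:
        # Removing an unknown node is a no-op.
        if node in self._active:
            self._active.remove(node)
            self._notify("left", node)

    def active_members(self) -> List[str]:
        # Sorted so that every caller sees the same ordering.
        return sorted(self._active)

    def _notify(self, event: str, node: str) -> None:
        for callback in self._listeners:
            callback(event, node)


# Usage: record each membership change as it happens.
events = []
members = MembershipList()
members.subscribe(lambda event, node: events.append((event, node)))
members.add_member("node1")
members.add_member("node2")
members.remove_member("node2")
print(members.active_members())
print(events)
```

The callback list stands in for the "timely notification" item: anything that needs to react to a join or a departure subscribes once and is informed of each change.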
CSS determines whether a member of the cluster is available by polling it:
- When a node polls another node (the target) in the cluster and the target has not responded successfully after repeated attempts, a timeout occurs after approximately 60 seconds.
- Among the responding nodes, the one that was started first and is still alive declares that the target is not responding and has failed.
- This node becomes the new MASTER and starts evicting the nonresponding node from the cluster.
- Once eviction is complete, cluster reformation begins.
- The reorganization process regroups accessible nodes and removes the failed ones.
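The sequence above (timeout, choice of the new MASTER, eviction, reformation) can be sketched as a small Python simulation. This is only an illustration under stated assumptions, not how CSSD is actually implemented: the `Node` fields and the functions `find_failed`, `elect_master`, and `reform_cluster` are hypothetical names, time is a plain number of seconds, and the 60-second constant is the approximate figure quoted in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

TIMEOUT_SECONDS = 60.0  # approximate timeout from the text (assumed exact here)


@dataclass(frozen=True)
class Node:
    name: str
    started_at: float        # lower value means the node was started earlier
    last_response_at: float  # time of the node's last successful poll response


def find_failed(nodes: List[Node], now: float,
                timeout: float = TIMEOUT_SECONDS) -> List[Node]:
    # A node counts as failed once it has been silent for the whole timeout.
    return [n for n in nodes if now - n.last_response_at >= timeout]


def elect_master(responding: List[Node]) -> Node:
    # Among the responding nodes, the one started first becomes the master.
    return min(responding, key=lambda n: n.started_at)


def reform_cluster(nodes: List[Node], now: float) -> Tuple[Node, List[Node], List[Node]]:
    # Returns (master, surviving members, evicted members).
    failed = find_failed(nodes, now)
    failed_names = {n.name for n in failed}
    responding = [n for n in nodes if n.name not in failed_names]
    if not responding:
        raise RuntimeError("no responding nodes left to reform the cluster")
    master = elect_master(responding)
    # Eviction: the failed nodes are simply left out of the new member list.
    return master, responding, failed


# Usage: node3 has been silent for 70 seconds at time 100.
cluster = [
    Node("node1", started_at=0.0, last_response_at=100.0),
    Node("node2", started_at=10.0, last_response_at=95.0),
    Node("node3", started_at=20.0, last_response_at=30.0),
]
master, survivors, evicted = reform_cluster(cluster, now=100.0)
print(master.name, [n.name for n in survivors], [n.name for n in evicted])
```

The sketch deliberately leaves out everything the text does not describe, such as how polls travel between nodes or how a partitioned cluster is resolved; it only shows the order of decisions once a timeout has been observed.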