Failover and Data Recovery in the Activity Engine

Activity Ingress

In a clustered Jive configuration, when a Jive web application node sends an activity, it selects an appropriate Activity Engine node to which to send the activity. The web app node selects an Activity Engine node based on the following criteria:

User affinity: All Jive web apps in a cluster attempt to route activity from a particular user to the same Activity Engine node.
Sensitive ordering: User affinity is based on node ordering. We recommend keeping consistent the list of Activity Engine nodes defined in the Activity Engine setup.
Failover behavior: If an Activity Engine node becomes unavailable, the web application bans it and reroutes applicable users to the next node in the list. The new routing becomes "sticky" for a small number of subsequent requests (30) before it is discarded and the original route is again attempted.

Activity Engine Architecture Diagram

After the web application selects an available Activity Engine node, the web application attempts the delivery. If it fails for any reason, the activity is journaled and attempted again later (potentially against a different node). Delivery failure may occur due to the following:

Unavailable node: The node is unreachable or otherwise unavailable between its selection and activity delivery.
Activity not written to the Activity Engine database: The node is unable to write the activity to the database for any reason.
Activity not queued for processing: The node is unable to add the activity to its processing queue for any reason.

Streams Durability

Each Activity Engine node is backed by its own Lucene stream service. While ingress remains unchanged, each node is responsible for replicating its processed data to all other nodes in the Activity Engine cluster. Each node is made aware of its siblings during the web application's registration process.

The replication procedure works as follows:

Disk queue: Fully-processed activities, as well as events (for example, reads, hides, moves), are queued to disk in a simple pipeline destined for all siblings.
Connection pool: Each Activity Engine node establishes up to 10 connections to each of its siblings on the same port used by the web app. Note: Each node will therefore accept and establish up to 10 x #-of-siblings additional connections to handle the replication requirement.

Activity Engine Failover and Recovery

If an Activity Engine node goes down:

Unprocessed activity is in limbo: There is no way to reclaim activities that are still queued on disk and/or activities/events in the replication queue. You must bring the failed node back online for its queue to be processed.
Queues lost to disk failure are recoverable: In the event of a complete disk failure, you can recover unprocessed activities by running a stream rebuild on all nodes. While this guarantees that the unprocessed activities are correctly reflected in all indexes, it is a resource-intensive procedure. Therefore, Jive Software recommends avoiding this scenario if possible.
Lucene is completely mirrored: Because each node has its own complete Lucene index, no stream data goes missing or becomes unreachable. All other data (e.g., follows and email notifications) are persisted to the database and are unaffected by an Activity Engine node failure.
Recoverable cross-communication: If an Activity Engine node is unable to reach one of its siblings during replication, activities/events destined for that sibling are pushed to a "retry" queue where they are reattempted at a later time. After a failed node recovers, replication to that node should resume within approximately 1 minute; so there will be a brief period of stream inaccuracy on the affected node.

After an Activity Engine node recovers:

Processing resumes: The Activity Engine node will continue processing its disk queue, as well as activities/events in the replication queue. No further action is required.
Corrupted queues: In the event that the failure has corrupted one or more disk queues and/or activities/events in the replication queue, those queues are copied off with the suffix ".corrupt" and can potentially be copied and analyzed. But there is no way to recover corrupt queues. If your replication queue is corrupt, we strongly recommend that you run a stream rebuild on all Activity Engine nodes.

If you decommission an Activity Engine node:

Unprocessed activity can be recovered: If you decommission an Activity Engine node with unprocessed activity in its queues, then we recommend you run a stream rebuild on all nodes. This is a resource-intensive procedure, but we recommend it over attempting to move the unprocessed queue data to another node.
Cross-communication adjusts accordingly: When you remove an Activity Engine node from the web app's Activity Engine configuration, all remaining nodes are made aware of the change. Replication to the decommissioned node ceases immediately.

If you add or re-add a node:

Cross-communication adjusts accordingly: When you add an Activity Engine node to the web app's Activity Engine configuration, all current nodes are made aware of the change. Replication to the commissioned node begins immediately.
Stream rebuild is initiated: When an Activity Engine node is registered with the web app, it is able to detect whether it was a new addition (or re-addition). The node flags itself for a stream rebuild to ensure that its index contains all required stream data.

Note that, while we have made extra efforts to ensure that consistency is maintained automatically during the management of an Activity Engine cluster, in the event of any unexpected inconsistency or missing data, you can correct the issue by initiating a stream rebuild on the affected node(s). The Activity Engine node remains completely accessible while performing a rebuild, during which time streams will fill with users' newest activity first. In addition, Jive supports complex "in/out" configurations which play a large part in activity ingress.

Activity Egress

Because each node maintains its own Lucene index, if an Activity Engine node is in the process of performing a stream rebuild, it is possible for users routed to a particular Activity Engine node to have partly-diminished streams. This state is temporary and resolves itself automatically (similar to a search or browse index rebuild). The most recent stream activity will always be available first, and in no event will any user activity ever be permanently lost. There are some background tasks which are only performed by an elected node. However, as of Jive 6.0, the "per-user stream rebuild" task is no longer applicable and has been removed. Therefore, the only tasks requiring node election are upgrade migrations. In addition, Jive 6.0+ supports complex "in/out" configurations which play a large part in activity egress.