Troubleshooting Caching and Clustering

This topic lists caching- or clustering-related problems that can arise, as well as tools and best practices.

Log Files Related to Caching

If a cache server machine name or IP address is invalid, you'll see verbose error messages on the command line, as well as in the log files found in $JIVE_HOME/var/logs/.

  • cache-gc.log -- Output from garbage collection of the cache process.
  • cache-service.out -- Cache startup messages, output from the cache processes, showing start flags, restarts, and general errors.
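
A small script can make a quick pass over these logs for you. The following is a minimal sketch in Python, not part of the product; it assumes the JIVE_HOME environment variable is set, and the error markers it searches for are illustrative, so adjust them to whatever your cache logs actually emit.

    import os

    # Assumes JIVE_HOME is set; the log directory comes from the path above.
    log_dir = os.path.join(os.environ["JIVE_HOME"], "var", "logs")

    # Illustrative markers only; adjust to match your logs.
    markers = ("ERROR", "SEVERE", "Exception")

    for name in ("cache-gc.log", "cache-service.out"):
        path = os.path.join(log_dir, name)
        if not os.path.exists(path):
            continue
        with open(path, errors="replace") as f:
            for line in f:
                if any(m in line for m in markers):
                    print(f"{name}: {line.rstrip()}")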

Misconfiguration Through Mismatched Cache Address Lists

If you have multiple cache servers, the list of cache addresses configured on each must be identical. A mismatched configuration will show up in the cache-service.out file. For example, if two servers share the same list but a third one doesn't, the log will include messages indicating that the third server knows about one server but not another, or that a key expected to be on one server is found on another instead.
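
One way to catch a mismatch before it surfaces in cache-service.out is to compare each server's configured address list directly. The Python sketch below is a hypothetical illustration: it assumes you've already gathered each server's list into a dictionary (how you retrieve it depends on your configuration), and it reports any server whose list differs from the first one.

    # Hypothetical data: each server's configured cache address list,
    # gathered however your deployment stores it.
    cache_address_lists = {
        "cache-1": ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
        "cache-2": ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
        "cache-3": ["10.0.0.11", "10.0.0.12"],   # missing an address
    }

    # Every server's list should be identical (order aside).
    reference_name, reference = next(iter(cache_address_lists.items()))
    for server, addresses in cache_address_lists.items():
        missing = set(reference) - set(addresses)
        extra = set(addresses) - set(reference)
        if missing or extra:
            print(f"{server} differs from {reference_name}: "
                  f"missing {sorted(missing) or None}, extra {sorted(extra) or None}")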

For more on adding a cache server to a cluster, see Adding a Cache Server Machine. If you're setting up cache servers for high-availability, then also take a look at Configuring the Cache Servers for High-Availability.

Cache Server Banned Under Heavy Load

Under extreme load, an application server node may become so overwhelmed that it bans a remote cache server for a short period because responses from the cache server are taking too long. If this occurs, you'll see entries related to the ThresholdFailureDetector in the application log.

This is usually a transient failure. However, if it persists, reduce the load on the application server to reasonable levels by adding more nodes to the cluster. You might also see this behavior when a single under-provisioned cache server (for example, one allocated just a single CPU core) is overwhelmed by caching requests. To remedy this, ensure that the cache server has an adequate number of CPU cores. For more on hardware requirements, see Cache Server Hardware Requirements.
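
To judge whether bans are a one-off blip or a recurring load problem, you can count the ThresholdFailureDetector entries in the application log over time. The sketch below is a rough example: the log path is a placeholder, and the timestamp slice assumes lines begin with a date and hour, which will vary with your logging configuration.

    from collections import Counter

    # Placeholder path; point this at your application server's log file.
    APP_LOG = "/path/to/application.log"

    by_hour = Counter()
    with open(APP_LOG, errors="replace") as f:
        for line in f:
            if "ThresholdFailureDetector" in line:
                # Assumes lines start with a timestamp like "2024-05-01 13:42:07";
                # adjust the slice to match your log format.
                by_hour[line[:13]] += 1

    for hour, count in sorted(by_hour.items()):
        print(f"{hour}:00  {count} failure-detector entries")

Entries clustered around peak-traffic hours point to sustained load rather than a one-off network problem.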

Banned Node Can Result in Near Cache Mismatches

While the failure of a node won't typically cause caching to fail across the cluster (cache data lives in a separate cache server), the banning of an unresponsive node can adversely affect near caches. This will show up as a mismatch visible in the application user interface.

An unresponsive node will be removed from the cluster to help ensure that it doesn't disrupt the rest of the application (other nodes will ignore it until it's reinstated). Generally, this situation will resolve itself, with the interim downside of increased database access.

If this happens, recent content lists can become mismatched between nodes in the cluster. That's because near cache changes -- which represent the most recent content -- are batched and communicated across the cluster, and if the cluster relationship is broken, that communication fails between the banned node and the other nodes.
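
The effect can be illustrated with a toy model (this is not the product's actual API). In the sketch below, each node keeps a local near cache and update batches are delivered only to nodes still in the cluster, so a banned node keeps serving stale entries until it's reinstated and refreshed.

    class Node:
        def __init__(self, name):
            self.name = name
            self.near_cache = {}      # local, per-node copy of recent content
            self.banned = False

    def broadcast(nodes, batch):
        """Deliver a batch of near cache updates to every reachable node."""
        for node in nodes:
            if not node.banned:
                node.near_cache.update(batch)

    nodes = [Node("A"), Node("B"), Node("C")]
    broadcast(nodes, {"recent-content": ["doc-1"]})

    # Node C becomes unresponsive and is banned; later batches never reach it.
    nodes[2].banned = True
    broadcast(nodes, {"recent-content": ["doc-1", "doc-2"]})

    for node in nodes:
        print(node.name, node.near_cache["recent-content"])
    # A and B show doc-1 and doc-2; C still shows only doc-1 -- the mismatch
    # users would notice in recent content lists.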

After First Startup, Node Unable to Leave, Then Rejoin, Cluster

After the first run of a cluster -- the first time you start up all of the nodes -- nodes that are banned (due to being unresponsive, for example) might appear not to rejoin the cluster when they become available again. That's because when each node registers itself in the database, it also retrieves the list of other nodes from the database. If one of the earlier nodes is the cluster coordinator -- responsible for merging a banned cluster node back into the cluster -- it will be unaware of a problem if a later-started node becomes unreachable, because that node wasn't yet in the database when the coordinator retrieved its list.

To avoid this problem, after you start every node for the first time, bounce the entire cluster. That way, each node will be able to read node information about all of the others.

For example, imagine you start nodes A, B, and C in succession for the first time. The database contained no entries for them until you started them, and each enters its address in the database as it starts. Node A starts, registering itself. Node B starts, seeing A in the database. Node C starts, seeing A and B. However, because node C wasn't in the database when A and B started, they don't know to check on node C -- if it becomes unreachable, they won't know and won't inform the cluster coordinator. (Note that the coordinator might have changed since startup.)
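
The sequence is easier to see in a small simulation. The Python sketch below models only the registration order described above: each node records itself in a simulated database registry and remembers whichever other nodes were already registered at the moment it started.

    database = []          # simulated cluster registry (persists across restarts)
    known_by = {}          # which other nodes each node saw when it started

    def start_node(name):
        if name not in database:
            database.append(name)                              # register yourself
        known_by[name] = [n for n in database if n != name]    # read the others

    # First start: A, B, and C come up in succession.
    for node in ("A", "B", "C"):
        start_node(node)
    print(known_by)   # {'A': [], 'B': ['A'], 'C': ['A', 'B']}
    # A and B never read an entry for C, so neither will notice if C drops out.

    # Bounce the cluster: on the second start, the registry already lists
    # all three nodes, so every node now knows about every other node.
    for node in ("A", "B", "C"):
        start_node(node)
    print(known_by)   # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}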

If a node leaves the cluster, the coordinator needs to have the full list at hand to re-merge membership after the node becomes reachable again.