Troubleshooting caching and clustering
This topic lists caching- or clustering-related problems that can arise, as well as tools and best practices.
Log files related to caching
If a cache server machine name or IP address is invalid, you get verbose messages on the command line. You also get the messages in the log files found in $JIVE_HOME/var/logs/:
- cache-gc.log — Output from garbage collection of the cache process.
- cache-service.out — Cache startup messages, output from the cache processes, showing start flags, restarts, and general errors.
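If you want to skim these logs quickly for problems, a short script can help. The following is a minimal Python sketch, assuming JIVE_HOME is set in your environment; the keywords it flags are illustrative assumptions, not an exhaustive list of messages the cache service emits.

```python
import os

# Build the path to the cache service log described above.
log_dir = os.path.join(os.environ["JIVE_HOME"], "var", "logs")
log_file = os.path.join(log_dir, "cache-service.out")

# Keywords worth flagging in cache startup output (illustrative, not exhaustive).
keywords = ("error", "exception", "restart", "refused", "timeout")

with open(log_file, errors="replace") as log:
    for number, line in enumerate(log, start=1):
        if any(keyword in line.lower() for keyword in keywords):
            print(f"{number}: {line.rstrip()}")
```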
Misconfiguration through mismatched cache address lists
If you have multiple cache servers, the configuration list of cache addresses for each must be the same. A mismatched configuration shows up in the cache-service.out file.
For example, if two servers have the same list but a third one doesn't, the log includes messages indicating that the third server's list contains one cache server but not another, or that a key expected to be on one cache server is found on another instead.
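To confirm the lists match, you can compare them after normalizing order and whitespace. The following Python sketch is illustrative only: the node names and addresses are placeholders, and how you collect each node's configured list is up to you.

```python
# Placeholder data: node name -> the cache address list configured on that node.
configured_addresses = {
    "app-node-1": "10.0.0.10,10.0.0.11,10.0.0.12",
    "app-node-2": "10.0.0.10,10.0.0.11,10.0.0.12",
    "app-node-3": "10.0.0.10,10.0.0.11",  # missing a cache server
}

# Normalize each list so ordering and whitespace differences don't matter.
normalized = {
    node: frozenset(addr.strip() for addr in raw.split(",") if addr.strip())
    for node, raw in configured_addresses.items()
}

reference_node, reference_set = next(iter(normalized.items()))
for node, address_set in normalized.items():
    missing = reference_set - address_set
    extra = address_set - reference_set
    if missing or extra:
        print(f"{node} differs from {reference_node}: "
              f"missing={sorted(missing)} extra={sorted(extra)}")
```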
For more information on adding a cache server to a cluster, see Adding cache server machines. For more information on setting up cache servers for high-availability, see Configuring Cache servers for high-availability.
Cache server banned under heavy load
Under extreme load, an application server node can become so overwhelmed that it bans a remote cache server for a short period because responses from the cache server are taking too long. If this occurs, you see it in the application log as entries related to the ThresholdFailureDetector.
This is usually a transient failure. However, if it continues, you should take steps to reduce the load on the application server nodes to reasonable levels, for example by adding more nodes to the cluster. You might also see this in situations where a single under-provisioned cache server (for example, a cache server allocated just a single CPU core) is being overwhelmed by caching requests. To remedy this, ensure that the cache server has an adequate number of CPU cores. For more information on hardware requirements, see Cache Server machine.
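To judge whether bans are an isolated spike or an ongoing problem, you can count ThresholdFailureDetector entries over time. The Python sketch below assumes a log file path and a timestamp format that begins each line with the date and hour; both are placeholders you would adjust for your installation.

```python
from collections import Counter

app_log = "application.log"  # placeholder: point this at your application log

entries_per_hour = Counter()
with open(app_log, errors="replace") as log:
    for line in log:
        if "ThresholdFailureDetector" in line:
            # Assumes lines start with "YYYY-MM-DD HH:MM:SS"; keep date + hour.
            entries_per_hour[line[:13]] += 1

for hour, count in sorted(entries_per_hour.items()):
    print(f"{hour}:00  {count} entries")
```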
Banned node can result in near cache mismatches
While the failure of a node doesn't typically cause caching to fail across the cluster (cache data lives in a separate cache server), the banning of an unresponsive node can adversely affect near caches. This shows up as a mismatch visible in the application user interface.
An unresponsive node is removed from the cluster to help ensure that it doesn't disrupt the rest of the application (other nodes will ignore it until it's reinstated). Generally, this situation resolves itself, with the intermediate downside of an increase in database access.
If this happens, recent content lists can become mismatched between nodes in the cluster. That's because near cache updates, which carry the most recent changes, are batched and communicated across the cluster; while the cluster relationship is broken, that communication fails between the banned node and the other nodes.
After first startup, a node is unable to leave and then rejoin the cluster
After the first run of a cluster (the first time you start up all of the nodes), nodes that are banned (due to being unresponsive, for example) might appear not to rejoin the cluster when they become available again. That's because when each node registers itself in the database, it retrieves only the list of nodes that have already registered there. If one of the earlier nodes is the cluster coordinator, which is responsible for merging a banned node back into the cluster, it doesn't know about nodes that started after it, so it is unaware of a problem if the last started node becomes unreachable.
To avoid this problem, after you have started every node for the first time, bounce the entire cluster. That way, each node is able to read information about all of the others from the database.
For example, imagine you start nodes A, B, and C in succession for the first time. The database contained no entries for them until you started them, and each enters its address in the database as it starts. Node A starts and registers itself. Node B starts and sees A in the database. Node C starts and sees A and B. However, because node C wasn't in the database when A and B started, they don't know to check on node C; if it becomes unreachable, they won't know and won't inform the cluster coordinator. Note that the coordinator might have changed since startup.
If a node leaves the cluster, the coordinator needs the full list of cluster members at hand to re-merge the node into the cluster membership after it becomes reachable again.
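The following toy Python model (not Jive's actual cluster code) illustrates the scenario above: each node records only the peers already registered when it starts, so A and B never learn about C until the whole cluster is bounced.

```python
database = []      # rows in the shared cluster table
known_peers = {}   # node -> peers it read from the database at startup

def start_node(name):
    """Register a node and record the peers it could see at startup."""
    known_peers[name] = list(database)  # snapshot taken before registering
    database.append(name)

# First-time startup of A, B, and C in succession.
for node in ("A", "B", "C"):
    start_node(node)

print(known_peers)  # {'A': [], 'B': ['A'], 'C': ['A', 'B']}

# After bouncing the cluster, every node re-reads the full membership list.
for node in list(known_peers):
    known_peers[node] = [peer for peer in database if peer != node]

print(known_peers)  # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}
```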