This article provides some practical suggestions for debugging Geode Applications.
- cache.xml - provided by the application developer to configure the caches and regions for Geode client and server processes.
- Properties files - provided by the application developer to configure Geode system properties and membership discovery.
- System logs - logs generated by Geode clients, servers, and locators. The logs contain information about membership, client connections, warnings about outstanding message requests and errors. The log also displays both Geode and Java system properties.
- Statistics - archive files generated by Geode clients and/or servers containing information about the Geode application. GemFire VSD (Visual Statistics Display) is used to graph the Geode and system metrics recorded in these archives.
Geode gfsh provides a command-line interface from which you can launch, manage, and monitor Geode processes, data, and applications. The shell provides commands that are useful in debugging and that bring all the information to a single location for analysis. The following gfsh commands require first executing the gfsh connect command to establish the connection to the locator or JMX Manager of the distributed system. Please refer to the Geode documentation for more details.
- export logs
- export stack-traces
- show log
- show dead-locks
Use Geode MergeLogFiles (com.gemstone.gemfire.internal.logging.MergeLogFiles) to merge your log files based on timestamps.
Check your environment: machine (e.g. ulimit settings), JDK, JVM properties (-Xmx, -Xms), and GC parameters.
Draw out a diagram of your system topology (servers, clients) and make a note of Listeners, Writers and other plug-ins.
Verify your cache and region configuration.
Confirm your system properties (review the properties files and the values displayed in the system log).
On your system topology diagram, add notes on the initialization and processing being done in various members or classes of members.
If you are debugging a specific interaction, draw a sequence diagram.
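Part of the environment check above (JDK, heap limits, JVM arguments, GC) can be gathered from inside a running JVM. The following is a minimal sketch using only standard JDK management APIs. The class name EnvReport and the output keys are hypothetical, and machine-level settings such as ulimit are not covered.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Geode.
final class EnvReport {
    /** Collects JDK, heap, JVM-argument and GC details for the current JVM. */
    static List<String> collect() {
        List<String> lines = new ArrayList<>();
        lines.add("java.version=" + System.getProperty("java.version"));
        lines.add("java.vm.name=" + System.getProperty("java.vm.name"));
        // maxMemory() reflects the effective maximum heap (-Xmx) for this JVM.
        lines.add("maxHeapBytes=" + Runtime.getRuntime().maxMemory());
        // Input arguments include flags such as -Xmx, -Xms and GC options.
        lines.add("jvmArgs=" + ManagementFactory.getRuntimeMXBean().getInputArguments());
        // One entry per garbage collector active in this JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add("gc=" + gc.getName());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : collect()) {
            System.out.println(line);
        }
    }
}
```

Running this in each member (or logging the result at startup) makes it easier to spot members started with different heap or GC settings.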
If possible, bring all the system logs and stack dumps together into a single directory for inspection (use gfsh commands above). Here's a simple script which will search for specific strings in the logs.
Search the system logs for warning, error or severe messages.
Search the system logs for any underlying Exceptions. For example: ConcurrentModificationException, NullPointerException, SerializationException.
Search the system logs for warnings about resources causing delays in statistics sampling. If found, use VSD to investigate further.
Verify there are no HotSpot error files (hs_err_pid*.log), which indicate a fatal error in the JVM. Refer to the Oracle Troubleshooting Guide for more details.
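The log searches above can be sketched as a small scanner. This is an illustration only: the class name LogScan is hypothetical, and the severity markers assume log lines that begin with a bracketed lowercase level such as [warning, so adjust the patterns to match your own logs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Geode.
final class LogScan {
    // Severity markers (assumed format) and the exception names from the checklist.
    static final List<String> PATTERNS = Arrays.asList(
        "[warning", "[error", "[severe",
        "ConcurrentModificationException", "NullPointerException", "SerializationException");

    /** Returns "lineNumber: text" for every line containing at least one pattern. */
    static List<String> scan(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            for (String pattern : PATTERNS) {
                if (line.contains(pattern)) {
                    hits.add((i + 1) + ": " + line);
                    break; // report each line once
                }
            }
        }
        return hits;
    }

    // Usage: java LogScan server1.log server2.log
    public static void main(String[] args) throws IOException {
        for (String file : args) {
            for (String hit : scan(Files.readAllLines(Paths.get(file)))) {
                System.out.println(file + ":" + hit);
            }
        }
    }
}
```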
Search the stack dumps for "Java-level deadlock". Dumping the stacks using jstack or the Linux command kill -3 <pid> will highlight any Java-level deadlocks, including the threads involved in the deadlock as well as the stack dumps for each of those threads. When debugging, it is best to get stack dumps for all JVMs. To determine if progress is being made, execute multiple thread dumps several seconds apart for comparison.
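The same Java-level deadlock detection is available programmatically through the standard JDK ThreadMXBean, which can be handy in a test harness or a diagnostic function. This is a minimal sketch; the class name DeadlockCheck is hypothetical, and it only inspects the JVM it runs in, so it does not replace collecting stack dumps from every member.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Geode.
final class DeadlockCheck {
    /** Returns one description per deadlocked thread, or an empty list if there are none. */
    static List<String> findDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            result.add(info.getThreadName() + " waiting on " + info.getLockName()
                + " held by " + info.getLockOwnerName());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> deadlocks = findDeadlocks();
        System.out.println(deadlocks.isEmpty() ? "no Java-level deadlock" : deadlocks.toString());
    }
}
```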
You can also search for state=BLOCKED threads and trace "waiting to lock" threads (e.g. waiting to lock java.lang.Object@16ce6f90) to whichever thread locked the object (e.g. locked java.lang.Object@16ce6f90). Follow this pattern until you find the root thread.
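Following the chain by hand gets tedious in large dumps, so the tracing pattern above can be sketched in code. The parser below is an assumption-laden illustration: it assumes each thread's section starts with a line beginning with the quoted thread name, and that monitors appear after the text "waiting to lock" or "locked" as a single token. The class name LockChain is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Geode.
final class LockChain {
    /** Returns the first whitespace-delimited token after marker, or null if absent. */
    private static String tokenAfter(String line, String marker) {
        int at = line.indexOf(marker);
        if (at < 0) {
            return null;
        }
        String rest = line.substring(at + marker.length()).trim();
        return rest.isEmpty() ? null : rest.split("\\s+")[0];
    }

    /** Follows waiting-to-lock -> locked edges from startThread to the root holder. */
    static String findRoot(List<String> dumpLines, String startThread) {
        Map<String, String> waitingFor = new HashMap<>(); // thread -> monitor it wants
        Map<String, String> holder = new HashMap<>();     // monitor -> thread holding it
        String current = null;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("\"")) { // assumed thread header: "name" ...
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : line.substring(1);
                continue;
            }
            if (current == null) {
                continue;
            }
            String wanted = tokenAfter(line, "waiting to lock ");
            if (wanted != null) {
                waitingFor.put(current, wanted);
                continue;
            }
            String held = tokenAfter(line, "locked ");
            if (held != null) {
                holder.put(held, current);
            }
        }
        Set<String> seen = new HashSet<>();
        String thread = startThread;
        while (seen.add(thread)) {
            String monitor = waitingFor.get(thread);
            String owner = monitor == null ? null : holder.get(monitor);
            if (owner == null) {
                return thread; // not blocked, or the owner is not in this dump
            }
            thread = owner;
        }
        return thread; // a cycle was found, i.e. a Java-level deadlock
    }
}
```

The returned thread is the one to inspect: it either holds the lock that everything else is queued behind, or it is part of a lock cycle.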
Search the system logs for any "15 seconds have elapsed" messages which don't have corresponding "wait for replies has completed" logs. You can match these log messages together via the thread id or native thread id. Note that these messages are only logged between peers in the Distributed System. See "Special Considerations for Clients" for messages specific to Geode clients.
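The pairing of these two messages can be sketched as follows. This is an illustration under assumptions: it expects the thread id to appear on the same line as the message in the form tid=0x..., and it matches the completion message loosely. The class name ReplyWaits is hypothetical. Because thread ids are only unique within one JVM, run it against one member's log at a time.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Geode.
final class ReplyWaits {
    // Assumed thread id format in the log line header, e.g. tid=0x4a.
    private static final Pattern TID = Pattern.compile("tid=(0x[0-9a-fA-F]+)");

    /** Returns thread ids that logged an elapsed-wait warning with no later completion. */
    static Set<String> unmatched(List<String> logLines) {
        Set<String> pending = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = TID.matcher(line);
            if (!m.find()) {
                continue;
            }
            String tid = m.group(1);
            if (line.contains("15 seconds have elapsed")) {
                pending.add(tid);
            } else if (line.contains("wait for replies") && line.contains("completed")) {
                pending.remove(tid);
            }
        }
        return pending;
    }
}
```

Any thread id left in the result is a request that never completed and is a candidate for the stack dump analysis described next.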
In this example, we can see that the request did complete, so while we should be concerned (and possibly check stats in vsd to see what system resources are causing this delay), it will not be the cause of our hang.
If the request is never satisfied (there is no corresponding "wait for replies completed"), look at the stack dumps for the non-responsive member. There could be a Java-level deadlock within that JVM.
There can also be distributed deadlocks between members. This requires following the "15 seconds have elapsed" warnings to the remote members and looking at the stack dumps. Searching for "waiting to lock" in the stack dumps can also help to identify the problematic JVM. Once found in a non-responsive member, find the thread in that JVM that holds the lock and determine what prevents it from releasing the lock.
This example shows the outstanding request from the system log and the relevant stack dumps from the non-responding JVM.
The system log shows that vm 12659 is still awaiting a response from vm 12706
The stack dumps from 12706 show the "waiting to lock <monitor>" and "locked <monitor>" entries.
Geode clients can fail with ServerConnectivityExceptions when the servers are too busy to handle client requests, for example during long GC pauses or distributed deadlocks.
Search for "is being terminated because its client timeout" messages in the server system logs to determine whether or not this is occurring in your application. If so, review the server-side system logs, stack dumps and statistics to determine the cause.
To trace function execution from the initiator to the member executing the function, pass the initiating thread id to the function using withArgs and log it in both places. Of course, this could easily be a string containing the pid, DistributedMemberId or any other identifying information for your application.
Log these values within the function to help with tracing during development.
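One way to build such an identifying string with only standard JDK calls is sketched below. The class name TraceToken and the token layout are hypothetical, and the Geode calls themselves are left out so the sketch stays self-contained: the initiator would log the token and pass it as the function argument, and the function would log the argument it receives.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Geode.
final class TraceToken {
    /** Builds an identifying string for the calling thread and JVM. */
    static String forCurrentThread() {
        Thread t = Thread.currentThread();
        // getName() is typically "pid@hostname" on HotSpot, but the format is JVM-specific.
        String jvm = ManagementFactory.getRuntimeMXBean().getName();
        // The thread id is written in hex (an arbitrary choice for this sketch).
        return "jvm=" + jvm + " thread=" + t.getName() + " tid=0x" + Long.toHexString(t.getId());
    }

    public static void main(String[] args) {
        // Log this on the initiator, then pass the same string as the function argument.
        System.out.println("initiating function with " + forCurrentThread());
    }
}
```

Searching all member logs for the same token then links the initiating call to the execution on the remote member.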
Since Geode supports multiple CacheListeners, consider adding a LogListener which simply logs the relevant portion of the events as they are processed. This provides another way to enable traceability in your application during the early stages of development. For client/server applications, it can help to identify the originating member of an operation and the server that forwarded that event to a specific client.
If your application does not otherwise use the callbackArgument, use it to encode information about the caller or the data, which you can then log in your LogListener.
Events for operations initiated in the local JVM are logged by the calling thread as shown below. In this case vm_1_thr_10_edge4_w1-gst-dev18_10648.
Events fired in remote members are fired on asynchronous threads. In the case of clients, this asynchronous thread provides the identity of the server hosting the client's HARegionQueue. In this case bridgegemfire5_w1_gst_dev18_79056.