
This page provides a collection of tips for debugging Trafodion code.


Debugging an mxosrvr Process

For debugging compiler and executor code, running sqlci under gdb is the simplest debugging environment.

However, occasionally you may be debugging an issue that occurs only via ODBC/JDBC and cannot be reproduced via sqlci. For these issues, you may need to debug in an mxosrvr process, one of the persistent server processes on the Trafodion cluster that service ODBC/JDBC connections.

Finding the Right mxosrvr Process

You can debug an mxosrvr process with gdb by starting gdb on the node where the process runs and attaching to it. To do this, though, you need to know which mxosrvr process your client is connected to and its Linux pid. If you are using trafci (the type-4 JDBC interactive client for Trafodion), you can find the process with the "show remoteprocess" command, as in the following example:

SQL>show remoteprocess;
REMOTE PROCESS \venkatsentry-2.novalocal:1.$Z0112LJ

In the output above, the node hosting the mxosrvr process is venkatsentry-2.novalocal, and the Trafodion process name is $Z0112LJ.

If you now start a shell on that node, you can run "sqps | grep Z0112LJ" to see the Linux pid of the process.
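
For example, assuming the pid turns out to be 12345 (a made-up value), the sequence would look something like this:

   sqps | grep Z0112LJ    # note the Linux pid in the output
   gdb -p 12345           # attach gdb to the running mxosrvr process

Once attached, set breakpoints as usual and use "continue" to let the process run.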

Dealing with Timeouts

The mxosrvr process is designed to be persistent and relies on ZooKeeper and the DCS Master process to monitor its liveness. Timeout logic determines whether an mxosrvr process is still alive; if an mxosrvr is unresponsive for longer than the timeout, it may kill itself or be killed (if it still exists), and a new mxosrvr process is created. This can be a problem when debugging: slowly stepping through code in gdb can cause one or another of these timeouts to be exceeded. To mitigate this, you can set the timeouts to higher values. For example, add the following to the conf/dcs-site.xml file on each node in the cluster:

   <property>
      <name>dcs.server.user.program.zookeeper.session.timeout</name>
      <value>3600</value>
   </property>
   <property>
      <name>zookeeper.session.timeout</name>
      <value>3600000</value>
   </property>

After changing conf/dcs-site.xml, you will need to stop and restart DCS (use the "dcsstop" and "dcsstart" scripts) in order for the change to take effect.
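
As a sketch, from a trafodion logon the restart is just the two scripts in sequence (assuming a standard installation where they are already on your PATH):

   dcsstop     # stop DCS
   dcsstart    # start DCS again so the new timeout values take effect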

For more detailed information about mxosrvr configuration parameters, see the Trafodion Data Connectivity Services Reference Guide at http://trafodion.apache.org/docs/dcs_reference/index.html.

Turning off Repository Writes

If you are debugging a compiler or executor issue in an mxosrvr process, you may find that your breakpoints are also being hit by writes to the Trafodion Repository tables. A separate thread in mxosrvr periodically flushes statistical data to the Repository using SQL DML statements, which can be annoying when you are trying to debug something else. You can turn off Repository writes by adding the following to conf/dcs-site.xml:

   <property>
      <name>dcs.server.user.program.statistics.enabled</name>
      <value>false</value>
   </property>

Debugging Mixed C++/Java Processes

Many Trafodion processes (such as sqlci and mxosrvr) have a C++ main and substantial amounts of Java code invoked via JNI.

You can debug the C++ parts using a debugger such as gdb. One gotcha is that JVM threads often raise signals such as SIGSEGV as part of their normal processing. (The HotSpot JVM, for example, is reputed to use SIGSEGV to trigger lazy compilation of Java methods.) Unfortunately, gdb catches these signals first, which can be quite annoying.

A way to work around this is to enter the following command into gdb:

handle SIGSEGV pass noprint nostop

Alternatively, place this command in your .gdbinit file.
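
For example, a minimal ~/.gdbinit entry (assuming you want this behavior in every gdb session) looks like this:

   # let the JVM handle SIGSEGV itself rather than stopping in gdb
   handle SIGSEGV pass noprint nostop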

Navigating Around a Cluster

Many developers do their work on workstations, in an HBase standalone environment. Sometimes, though, you may be doing development work on a cluster. Here are some tips for getting around on a cluster.

Finding out which nodes are part of a cluster

If you are logged onto one node as the trafodion user, there are several environment variables that contain a list of the node names; $MY_NODES is one example. You can simply run “env | grep <node name>”, where <node name> is the node you’re logged onto, to find these environment variables.
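
For example, if you are logged onto a node named abc031 (a made-up name):

   env | grep abc031    # lists variables, such as $MY_NODES, that mention the node names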

Logs

Logs for SeaQuest processes are found in $MY_SQROOT/logs (soon to be renamed $TRAF_HOME/logs). You can get there using the “cdl” command from a trafodion logon.

If you are debugging UPDATE STATISTICS by using the "update statistics log on" SQL statement, the log for that will be written to the ~/sqllogs directory (where ~ is the home directory of the "trafodion" user) on the node where the master executor is running. If you are using trafci, you’ll have to hunt for the node where your mxosrvr is running (see Finding the Right mxosrvr Process above). If you are using sqlci, it is on the node where you are running sqlci.
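
For example, a minimal sketch (the table name T1 is hypothetical, and the exact log file names will vary):

   -- in your SQL session, before running the statistics operation
   update statistics log on;
   update statistics for table t1 on every column;

Then, from a shell on the relevant node, look for the newest file:

   ls -ltr ~trafodion/sqllogs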

Scripts

Scripts can be found in $MY_SQROOT/sql/scripts (soon to be $TRAF_HOME/sql/scripts). You can get there using the “cds” command from a trafodion logon.

Figuring out why a node went down

In your testing, you might suddenly discover that the node you’re logged into went down. (I had this happen when doing a control-C in sqlci while a massive UPDATE STATISTICS command was running.) Look at the SeaQuest logs to discover why. For example, do “cdl” to get to the logs directory, then do “ls -g” to see which logs were most recently updated. In my case, the watchdog timer killed the node, and this was revealed in the watchdog timer log. One caveat: “ls -g” shows times in local time, while the log messages in the log files themselves bear timestamps in UTC.
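
For example, from a trafodion logon on the affected node (the grep pattern is only a guess at what to look for; adjust it as needed):

   cdl                      # go to the SeaQuest logs directory
   ls -g                    # check the modification times (local time) for recently updated logs
   grep -li "watchdog" *    # hypothetical: list files that mention the watchdog timer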

Bringing a node back up

In your testing, you might find that the instance seems to have gone down. If you are logged on as the trafodion user, you can use the "trafcheck" script to discover which nodes in the cluster are up or down.

If just one node is down, you can use “sqshell” to bring the node back up. When bringing it back up, use the full DNS name of the node (e.g. “abc031.yourcompany.local”). Use the help command in “sqshell” to get details on the command to bring a node up.
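
A rough sketch of the sequence (the node name is made up; the exact syntax for bringing a node up is listed by the help command):

   trafcheck    # shows which nodes are up or down
   sqshell      # start the shell
   help         # inside sqshell: lists commands, including the one to bring a node up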

HBase Status

HBase provides a GUI that gives information about its status. The URL to access this GUI looks like http://abc031.yourcompany.local:60010/master-status. Port number 60010 is typical on Cloudera(TM) clusters; port 16010 is used instead on Hortonworks(TM) clusters. From this tool you can obtain information about the number of region servers, region server state, the set of regions per table, and so on.

Vendors often provide additional management GUIs in their distributions. For example, Hortonworks distributions package Ambari; its URL looks like http://abc031.yourcompany.local:8080. For Cloudera, it is Cloudera Manager, with a URL like http://abc031.yourcompany.local:7180/cmf/services/11/status. Both of these products typically require a login; get the user name and password from your cluster administrator.

Figuring out which pid is a RegionServer

On a workstation, the “jps” command shows you which process is the HMaster (which is also the RegionServer on standalone HBase installations). On a cluster, however, “jps” only shows information about java processes associated with your logon ID, which is usually “trafodion”. Unfortunately, the HBase processes run under a different ID, typically “hbase”. Even so, you can still figure it out by using “ps -fu hbase”. You’ll need a wide terminal window to see it (or you can pipe the output to a text file), but a java process with “-Dproc_regionserver” in its command-line parameters identifies a region server.
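
For example (run this on the node in question; the exact java command line will vary by distribution):

   ps -fu hbase | grep -- -Dproc_regionserver                       # full command line of the RegionServer
   ps -fu hbase | grep -- -Dproc_regionserver | awk '{print $2}'    # just its pid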

Figuring out what a RegionServer is up to

If you see in “top” that a java process is very busy, and you identify that process as a RegionServer, you might want to know what the RegionServer is doing. The HBase status GUI can tell you. Use the URL http://abc031.yourcompany.local:60030/dump (substituting your node name; the port number is 16030 on a Hortonworks cluster). This gives you a list of all threads in the process, along with their stack traces.

Another URL that might be interesting is http://abc031.yourcompany.local:60030/jmx (port number 16030 on a Hortonworks cluster), which gives JMX statistics for the process.
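
If you prefer the command line, both pages can also be fetched with curl (substitute your own node name and port):

   curl http://abc031.yourcompany.local:60030/dump    # thread dump with stack traces
   curl http://abc031.yourcompany.local:60030/jmx     # JMX statistics in JSON form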
