The Network Server is capable of managing multiple simultaneous JDBC connections with its clients. We call this "Session Management", and this page discusses some of the intricacies of the implementation.
This page was constructed as part of the work on DERBY-1326, DERBY-1219, DERBY-1338, DERBY-1020, and DERBY-51 (and possibly others?), although the page is intended to endure after the bugs are fixed, providing background information about the implementation.
Introduction
The principal objects involved in session management are the Session and the DRDAConnThread.
The server considers threads to be more expensive than socket connections. The server is generally willing to accept an arbitrarily large number of socket connections from clients at the same time, but has a limit on the number of threads that it is willing to create. So there may be more Session instances than DRDAConnThread instances.
Also, the server is willing to discard the Session instance when the client disconnects, whereas thread creation and destruction are considered expensive, so the server caches and re-uses threads.
So the server tracks threads and sessions separately, and uses a queueing model to associate work requests (Session instances) with workers (DRDAConnThread instances).
Session
There is a Session instance for each physical socket connection that the server has with a client. Via the Session object, the Network Server can access the various Database objects that the session is using.
Session instances are created by the Client Thread, and are tracked in the Network Server's sessionTable.
DRDAConnThread
There are typically several DRDAConnThread instances in the Network Server. These threads carry out work on behalf of Session instances.
DRDAConnThread instances are created by the Client Thread, and are tracked in the Network Server's thread list.
Dispatching work to threads
The server uses a very simple mechanism, called the RunQueue, to rendezvous and dispatch work to threads.
Any DRDAConnThread instance can service any Session instance, so the Network Server uses a very simple first-come, first-served policy for matching up sessions and threads. To avoid starvation, there is a time-slicing mechanism as well.
Conceptually, the RunQueue may be in any of three states at any given time:
- If there are more Session instances needing work than there are available DRDAConnThread instances, then the RunQueue holds the sessions which are awaiting a thread.
- If there are more DRDAConnThread instances than there are Session instances needing work, then the idle threads wait on the RunQueue for work to become ready.
- If there are exactly as many active Session instances as there are DRDAConnThread instances, then the RunQueue is empty and inactive.
Thus, depending on the current balance of work-needing-a-thread versus threads-needing-work-to-do, the RunQueue at any given instant may be viewed either as a queue of work (Session instances), or as a queue of threads (DRDAConnThread instances).
Of course, this is strictly a conceptual viewpoint, because at the physical level, the RunQueue as a collection object (it is a Java Vector) only ever actually holds Session instances. When the server is in the state of "there are idle threads waiting for work to do", there is no explicit collection object which contains that list of idle threads; instead there is merely a "virtual" collection defined as "the threads which are currently waiting on the RunQueue in NetworkServerImpl.getNextSession()".
Session Lifetime
A Session instance is created by the Client Thread when a new socket connection is accepted from a client. The Session object is added to the Network Server sessionTable at that time.
The Session object is held until the client disconnects, which is typically handled by throwing a disconnect exception on the server side. The DRDAConnThread catches the disconnect exception, removes the session from the Network Server sessionTable, closes it, and discards it.
(I think that if the client disconnects at a point where there are currently more Session instances than DRDAConnThread instances, and the client's Session is currently on the RunQueue, then the Network Server will not detect the client disconnect until the next time that the Session moves to the head of the RunQueue and is dispatched to a thread for processing. But I haven't yet set up this situation in the debugger, so I'm not sure about this part.)
Session objects are also closed and discarded when the Network Server shuts down or is restarted.
Session closing and DERBY-1020
Bug DERBY-1020 concerns the proper handling of exceptions that occur while closing a session, especially after the database has been shut down.
There may be some synchronization impact as well, because the exceptions that occur when the database is intentionally shut down are treated as unexpected exceptions, and so the session does not go through the normal session shutdown code path.
Note that DERBY-1020 is the root cause of DERBY-273 (now fixed) and DERBY-803, so there appear to be a number of problems with exceptions that occur during shutdown processing.
DRDAConnThread Lifetime
A DRDAConnThread instance is created by the Client Thread when it detects that a new Session has just been created, and there aren't any free threads currently, and the server has not yet reached its maximum number of threads.
Note that since there is no explicit collection object which contains the list of idle threads, the server code maintains a simple counter of the number of idle threads.
DRDAConnThread instances are only destroyed when the Network Server shuts down or is restarted. The mechanism for shutting down a DRDAConnThread instance is cooperative (that is, we don't call Thread.stop()): the Network Server calls the thread's close() method, which sets a flag on the thread, and the thread checks the flag periodically and exits its run() method if it discovers that the closed() flag is set.
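The cooperative handshake can be sketched as follows; this is a schematic illustration of the flag-based protocol just described, with simplified names, not the real DRDAConnThread:

```java
// Schematic cooperative shutdown: close() just sets a flag, and the
// thread itself decides when to exit (we never call Thread.stop()).
class WorkerThreadSketch extends Thread {
    private volatile boolean closed = false;

    void close() {
        closed = true; // request shutdown; takes effect at the next check
    }

    boolean closed() {
        return closed;
    }

    @Override
    public void run() {
        while (!closed()) {
            // ... obtain a Session via getNextSession() and service it ...
        }
        // flag noticed: fall out of run() and let the thread die naturally
    }
}
```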
Note specifically here, with respect to DERBY-1326, that when the DRDAConnThread notices that it has been closed, immediately upon returning from the getNextSession() call in its run() method, it does not perform any shutdown processing for the Session that it was told to process. This can cause a "hang", as discussed in DERBY-1219.
Network Server startup
When the Network Server starts up, it starts the embedded Derby engine, but does not start any Session or DRDAConnThread instances. Those objects will be created on demand when new client connections are made, as described above.
Network Server shutdown
When the Network Server shuts down, it makes a concerted effort to clean up its resources:
- It shuts down the Client Thread
- It goes through the sessionTable and closes all the Session objects
- It goes through the Thread list and closes and interrupts each DRDAConnThread
- It closes the master socket on which it accepts connections
There are several observations worth making here:
- It seems like it might be better to close the master socket first, rather than at the end, since otherwise there is the chance that new connections will continue to accumulate during the shutdown processing (though since the Client Thread has been interrupted, no new work should get started for such connection attempts).
- Note that we do not close down the embedded Derby engine here. This is the essence of bug DERBY-51. Basically, the Network Server does not know whether or not it should shut down the embedded engine, because it is possible to have an application in which there are both local threads accessing the database directly, and client threads accessing the database over the network, and just because we are shutting down the Network Server does not necessarily mean that we no longer want to access the database from the local threads.
- Lastly, note that in addition to closing each DRDAConnThread, we also call interrupt() on those threads. This awakens the thread if it was blocked in a wait() call, so that it can notice the fact that it's been closed.
The implications of interrupting the DRDAConnThread instances
Thread.interrupt() is a very powerful call, and the fact that we make this call during server shutdown can have unforeseen consequences. Consider, for example, DERBY-1338, in which the Thread.interrupt() call made during server shutdown has the unwanted side effect of interrupting the loading of a class file, causing the DRDAConnThread instance to report a ClassNotFound exception.
In general, it seems like it would be preferable to design a protocol in which we do not need to interrupt the daemon thread instances, or, if we do, that we only do so at a point where we are certain that it is safe to interrupt the threads.
Network Server restart
The Network Server has a feature by which it can restart itself, without needing to shut down and restart the entire application which contains the Network Server. The Network Server appears to do this automatically when it detects that the underlying Derby engine has been shut down. The checkDataSource test contains some code which triggers this restart processing.
When the Network Server restarts itself, it does the following:
- It goes through the RunQueue and closes all sessions in the RunQueue.
- It goes through each DRDAConnThread and closes the thread.
- It releases its pointer to the Derby engine, to allow it to be collected.
- It then starts up a new instance of the Derby engine.
There are several observations worth making here.
- Note that while Network Server shutdown closes every session in the sessionTable, Network Server restart only closes the sessions in the RunQueue.
- Note that while Network Server shutdown both closes() and interrupts() each DRDAConnThread instance, Network Server restart just closes() the threads. This means that the threads may not become aware that they have been closed until some later time, when they awaken from whatever wait() call they were blocked in.
- Note that Network Server shutdown terminates the Client Thread, while Network Server restart leaves the Client Thread alone. This means that new Session instances may be coming in during the restart processing.
Synchronization
There are several monitors used by the Network Server session management code for synchronization:
- sessionTable
- threadList
- runQueue
- serverStartSync
serverStartSync synchronization
The serverStartSync object has exactly one use, which is to single-thread the flow through the startNetworkServer() method so that the network server is always started (or restarted) exactly once, no matter how many threads call this routine at the same time.
The serverStartSync monitor is used in conjunction with the restartFlag boolean, and there is a comment at the head of the startNetworkServer() method regarding the use of these two variables which I don't understand.
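For what it's worth, a common shape for this kind of guard looks roughly like the sketch below. This is only my guess at the intent, since the code comment in question is unclear; the sketch assumes restartFlag simply records that a (re)start has already been claimed:

```java
// A generic single-start guard (illustrative only; the real
// startNetworkServer() does more, and restartFlag's exact meaning is
// documented in a code comment this page admits to not understanding).
class SingleStartSketch {
    private final Object serverStartSync = new Object();
    private boolean restartFlag = false; // has this (re)start been claimed?

    void startNetworkServer() {
        synchronized (serverStartSync) {
            if (restartFlag) {
                return; // another thread already performed this (re)start
            }
            restartFlag = true; // claim the work before doing it
        }
        // ... perform the actual start/restart work, exactly once ...
    }
}
```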
sessionTable synchronization
The sessionTable object is a Java Hashtable, which is an automatically-synchronized collection object. Therefore calls to sessionTable.put(), sessionTable.get(), and sessionTable.remove() do not need to be synchronized by the caller; the Hashtable will perform internal synchronization automatically.
However, when traversing an enumeration of the elements in the Hashtable, the automatic synchronization does not help, so in these cases the Network Server code explicitly encloses these traversals in synchronized blocks.
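The idiom looks like this (illustrative names): Hashtable serializes each individual call, but a multi-step traversal spans many calls, so the caller must hold the lock for the whole walk:

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Holding the Hashtable's own monitor across the traversal keeps other
// threads from adding or removing sessions mid-walk.
class SessionTableSketch {
    private final Hashtable<Integer, Object> sessionTable = new Hashtable<>();

    void closeAllSessions() {
        synchronized (sessionTable) {
            for (Enumeration<Object> e = sessionTable.elements(); e.hasMoreElements();) {
                Object session = e.nextElement();
                // ... close the session ...
            }
        }
    }
}
```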
threadList synchronization
The threadList object is a Java Vector, which is also an automatically-synchronized collection object. So, again, calls to threadList.addElement(), threadList.get(), threadList.size(), etc. do not need to be synchronized by the caller.
However, when traversing the entire threadList, the automatic synchronization does not help, so in these cases the Network Server code explicitly encloses the traversals in synchronized blocks.
Unfortunately, the Network Server code is inconsistent about these traversals. These traversals are enclosed in synchronized blocks:
- in blockingStart()
- in setLogConnections
However, these traversals are not enclosed in synchronized blocks:
- in startNetworkServer()
- in buildRuntimeInfo()
- in the ClientThread logic which decides whether to add a thread
This inconsistent synchronization is not explained in the code.
runQueue synchronization
The runQueue object is also a Java Vector, so again it is an automatically-synchronized object, and again the collection traversals need to be explicitly synchronized.
Furthermore, the runQueue object is also used for wait() and notify() calls, so some additional synchronization is needed to perform these operations.
And the runQueue synchronization is also used to protect the freeThreads variable, which is the count of the number of threads currently waiting on the runQueue object.
Unfortunately, the Network Server code is again inconsistent in its use of the explicit synchronization. There is inconsistent or missing synchronization in at least these locations:
- in startNetworkServer()
- in buildRuntimeInfo()
- in the ClientThread logic which decides whether to add a thread
And it seems to me that the getFreeThreads() method is inherently dangerous, since even though it is internally synchronized, the instant that the method returns to its caller the value could change.
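The danger is the classic check-then-act race, sketched below with hypothetical caller code; the point is that the returned count is stale the moment getFreeThreads() unlocks and returns:

```java
// Hypothetical illustration of the check-then-act gap; not actual
// Network Server code.
class FreeThreadsRace {
    private int freeThreads;

    synchronized int getFreeThreads() {
        return freeThreads; // correct while the lock is held ...
    }

    void maybeStartThread() {
        if (getFreeThreads() == 0) {   // check ...
            // ... but another thread may go idle (or exit) right here,
            // so the decision below is based on out-of-date information
            startNewThread();          // ... then act, possibly wrongly
        }
    }

    private void startNewThread() { /* ... */ }
}
```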
Possible ways to simplify/repair the synchronization
Here are some ideas I've had about possible changes to the synchronization:
- I think that the open-coded logic in ClientThread.java which handles creating new Session and DRDAConnThread objects should be moved into the NetworkServerImpl class so that it can be uniformly synchronized with the other logic in that class.
- I think that it is confusing to use the runQueue as both a collection object and as a thread wait queue, and it would be simpler and less error-prone to use two separate objects for that.
- I think that primitive methods like getThreadList() and getFreeThreads() should be avoided, as they risk exposing too much of the implementation to callers, and allowing race conditions to be accidentally introduced.
freeThreads counter maintenance
As discussed earlier, there is no explicit collection object which contains the set of threads which are currently idle, waiting for work to do.
Instead, the Network Server maintains an integer variable, called freeThreads, to count the number of such threads. Immediately before calling runQueue.wait(), the thread increments this counter, and immediately after returning from runQueue.wait(), the thread decrements this counter.
However, it is also possible for the thread to be interrupted, in which case it doesn't return normally from runQueue.wait(), but rather throws an InterruptedException from that method.
The getNextSession() method handles this possibility by catching the InterruptedException and decrementing the counter in this case as well.
However, when the Network Server restarts, it also unconditionally sets the counter to 0. This means that there is a latent problem, which I uncovered when I experimented with having the Network Server interrupt the threads during restart processing. What happens is as follows:
- The Network Server restarts.
- During restart processing, it interrupts the DRDAConnThreads.
- The Network Server then sets the freeThreads counter to 0.
- The threads which have been interrupted then awaken, catch the InterruptedException, and decrement the counter.
The counter then goes negative, and becomes completely useless. The Client Thread thinks that there are plenty of free threads, because the free threads counter is not 0, so it never starts any new threads, but in fact there aren't any free threads (there are -1 or -2 or some number like that).
The symptom is that, after the Network Server is restarted, assuming that it interrupts the threads during restart processing, it then never starts any new threads, although it continues to accept new TCP/IP connections and creates new Session objects.
The fix, I believe, is simply to remove the unconditional setting of the freeThreads counter to 0 in the restart processing, and let the incrementing and decrementing logic in getNextSession() handle all the maintenance of this counter.
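In other words, the counter has a single invariant: freeThreads equals the number of threads currently blocked in runQueue.wait(). Below is a sketch of the discipline that maintains it (illustrative structure, not the exact Derby source); any write to the counter from outside this method, such as the unconditional reset to 0 during restart, breaks the invariant:

```java
import java.util.Vector;

// Only the waiting thread itself touches freeThreads, in lock-step with
// its own entry into and exit from runQueue.wait().
class FreeThreadsSketch {
    private final Vector<Object> runQueue = new Vector<>();
    private int freeThreads; // invariant: number of threads blocked in wait()

    Object getNextSession() {
        synchronized (runQueue) {
            while (runQueue.isEmpty()) {
                freeThreads++;             // about to block: count ourselves idle
                try {
                    runQueue.wait();
                    freeThreads--;         // normal wakeup
                } catch (InterruptedException e) {
                    freeThreads--;         // interrupted wakeup: still decrement
                }
            }
            return runQueue.remove(0);
        }
    }
}
```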
Benign wakeups and notifyAll
There is a standard convention in thread programming for handling the waiting and waking up of threads, which dates back to Andrew Birrell's classic paper, "An Introduction to Programming with Threads".
As Birrell explains (pages 14-15), threads which are waiting on a resource should always be coded as:
WHILE NOT expression DO Thread.Wait(m,c) END;
You will note that the getNextSession() method is already coded in the correct style, which is good.
When the wait loop is coded in this fashion, Birrell explains, one benefit is (p. 15):
A final advantage of this convention is that it allows for simple programming of calls to “Signal” or “Broadcast”—extra wake-ups are benign. Carefully coding to ensure that only the correct threads are awoken is now only a performance question, not a correctness one (but of course you must ensure that at least the correct threads are awoken).
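Rendered in Java terms, Birrell's convention is the familiar guarded-wait loop; here is a small self-contained sketch with invented names:

```java
// The guarded-wait idiom: always re-test the condition after waking up,
// so extra (or spurious) wakeups are harmless, as the quote says.
class GuardedWaitSketch {
    private final Object monitor = new Object();
    private boolean ready; // stands in for Birrell's "expression"

    void awaitReady() throws InterruptedException {
        synchronized (monitor) {
            while (!ready) {      // WHILE NOT expression DO Wait
                monitor.wait();
            }
            // here: ready is true, and we still hold the monitor
        }
    }

    void signalReady() {
        synchronized (monitor) {
            ready = true;
            monitor.notifyAll();  // safe precisely because waiters re-check
        }
    }
}
```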
Here is where there is a slight issue in the NetworkServer code. In the runQueueAdd() method, note that we currently issue runQueue.notify(). This means that we are waking up at most one thread to go handle the newly-queued Session.
Unfortunately, at this point, runQueueAdd() does not have the proper information to be certain that it is waking up the correct thread. In particular, in the case of a NetworkServer restart, what if some of the waiting threads happened to be "poisoned" (threads that had been closed by the restart code)?
If we simply change runQueueAdd() so that it calls notifyAll() rather than notify(), then everything will still function properly even if the current mix of threads waiting on the runQueue contains both closed threads and non-closed threads. All of those threads will be awoken; one non-closed thread will eventually grab the session and go run it; and all the other threads will either detect that they have been closed and break out of getNextSession(), or detect that there are no more Sessions to run and go back to sleep.
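Sketched against the RunQueue example earlier on this page, the proposed change is a one-liner:

```java
import java.util.Vector;

// The proposed fix: wake every waiter, because some of them may be
// "poisoned" (already closed) and must get a chance to notice that and
// exit getNextSession(). (Illustrative; mirrors the sketch above.)
class RunQueueAddFix {
    private final Vector<Object> runQueue = new Vector<>();

    void runQueueAdd(Object session) {
        synchronized (runQueue) {
            runQueue.addElement(session);
            runQueue.notifyAll(); // was: runQueue.notify()
        }
    }
}
```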
Changing the Session Management code
I think that there are a number of problems in the Session Management code which need to be addressed, as part of the bugs DERBY-1326 and DERBY-51, and possibly others.
However, before we proceed to changing the code, I want to make sure that we correctly understand the current operation of the code. So I'm only taking the wiki page this far right now; I intend to return to this page at a later date and start to propose changes to the code, once we've reviewed the material so far.