Meeting notes - ZooKeeper use in HBase
10/23/2009 phunt, mahadev, jdcryans, stack
Reviewed issues identified by HBase team. Also discussed how they might better troubleshoot when problems occur. Mentioned that they (hbase) should feel free to ping the zk user list if they have questions. JIRAs are great too if they have specific areas of ideas/concerns.
HBase team is going to create some JIRAs they identified (improve logging esp)
Phunt also asked the HBase team to fill in some usecases for how they are currently using zk, this will allow us (zk) to better understand how they are using, allowing us to provide better advice, vet their choices, and in some cases we may even test to verify under real conditions.
Some questions identified:
- Stack - impression is that zk is too flakey and sensitive at the moment in hbase deploys (i'm talking particularly about the new-user experience). I'd think a single server with its Nk reads/second with fairly rare updates should be able to host a big hbase cluster with a session timeout of 10/20 seconds but we are finding that we are recommending quorums of 3-5 on dedicated machines with sessions of 60 seconds because too often masters and regionservers timeout against zk. GC on RS can be a killer; pauses of tens of seconds. CMS helps some. What is the experience of other users of zk? Do you have recommendations for us regards deploy? When fellas are having trouble in the field with session timeouts, how do you debug/approach this issue?
zk: biggest issues we typically see are; gc causing vm to stall (client or server, we have no control over the jvm unfort, in hadoop i've heard they can see gc pauses of over 60 seconds), server disk io causing stalls (sequential consistency means that reads have to wait for writes to complete), network connectivity problems. Perhaps use of JVM tools such as gc monitoring and jvisualvm may help to track down. Over-provising a host will obviously slow down processing (running high cpu/disk processes on the same box(es) as zookeeper).
zk: Mahadev mentioned that zk team commonly sees the types of issues hbase experiences at least on first deploy but that after tuning and research, all settles down
- hbase - We'd like to know what kind of limits a 3 nodes quorum have. What's the reasonable amount of znodes, reads and writes per sec it can support? How many watches?
zk: this is the easiest to see http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html#Performance however this is for optimum conditions (dedicated server class host with sufficient memory and dedicated spindle for the log) and a large number of clients. With smaller number of clients using synchronous operations you are probably limited more by network latency than anything.
- hbase - How expensive is a WatchEvent? What if we had a watcher per region hosted by a regionserver? Lets say up to 1k regions on a regionserver. Thats 1k watchers. Lets say cluster of 1-200 regionservers. Thats 200k watchers the zk cluster needs to control.
zk: very inexpensive by design. the intent was to support large numbers of watchers. I've tested a single client creating 100k ephemeral nodes on session a, then session b setting sync watches on all nodes, then session c sessing async watches on all nodes, then closing session a. This was all done on a 1core 5yrs old laptop, 200k watches were delivered in < 10 sec. Granted that's all the zk cluster was doing, but it gives you some idea.
- hbase - Would like to have queues up in zk. Queues of regions moving through state changes from unassigned to opening to open. Would like to keep total list of all regions and then list them by regionserver assigned to? What would you suggest?
Below is stale now, excerpted from another document. For latest see http://wiki.apache.org/hadoop/Hbase/MasterRewrite
Below are some notes from our master rewrite design doc. Its from the section where we talk of moving all state and schemas to zk:
- Tables are offlined, onlined, made read-only, and dropped (Add freeze of flushes and compactions state to facilitate snapshotting). Currently HBase Master does this by messaging regionservers. Instead move state to zookeeper. Let regionservers watch for changes and react. Allow that a cluster may have up to 100 tables. Tables are made of regions. There may be thousands of regions per table. A regionserver could be carrying a region from each of the 100 tables. TODO: Should regionserver have a table watcher or a watcher per region?
- Tables have schema. Tables are made of column families. Column families have schema/attributes. Column families can be added and removed. Currently the schema is written into a column in the .META. catalog family. Move all schema to zookeeper. Regionservers would have watchers on schema and would react to changes. TODO: A watcher per column family or a watcher per table or a watcher on the parent directory for schema?
- Run region state transitions – i.e. opening, closing – by changing state in zookeeper rather than in Master maps as is currently done.
- Keep up a region transition trail; regions move through states from unassigned to opening to open, etc. A region can't jump states as in going from unassigned to open.
- Master (or client) moves regions between states. Watchers on RegionServers notice changes and act on it. Master (or client) can do transitions in bulk; e.g. assign a regionserver 50 regions to open on startup. Effect is that Master "pushes" work out to regionservers rather than wait on them to heartbeat.
- A problem we have in current master is that states do not make a circle. Once a region is open, master stops keeping account of a regions' state; region state is now kept out in the .META. catalog table with its condition checked periodically by .META. table scan. State spanning two systems currently makes for confusion and evil such as region double assignment because there are race condition potholes as we move from one system – internal state maps in master – to the other during update to state in .META. Current thinking is to keep region lifecycle all up in zookeeper but that won't scale. Postulate 100k regions – 100TB at 1G regions – each with two or three possible states each with watchers for state change. My guess is that this is too much to put in zk. TODO: how to manage transition from zk to .META.?
- State and Schema are distinct in zk. No interactions.