HBase Prototype

Introduction

This page describes a partition which stores LDAP data in HBase.

The HBase partition is implemented as XDBM Partition. The main advantage compared to an implementation of the simple Partition interface is that the powerful search engine can be used. However instead of storing entries by their DN an hierarchical approach is used, inspired by http://cwiki.apache.org/confluence/display/DIRxSRVx11/Xdbm+fast+modifyDn+proposal and http://www.openldap.org/conf/odd-sfo-2003/howard-dev.pdf.

In contrast to the JDBM (or the LDIF) partition, the directory server (ApacheDS) and the storage engine (HBase) don't run within the same JVM but are distributed over the network and communicate via RPC calls. Hence it is important to reduce the communication between ApacheDS and HBase to a minimum.

The following format is used to illustrate the HBase table layout, timestamps are omitted:

------------------------------------------------------------
| table name | family1              | family2              |
------------------------------------------------------------
| row1       | qualifier1 -> value1 | qualifierA -> valueA |
|            | qualifier2 -> value2 | qualifierB -> valueB |
------------------------------------------------------------
| row2       | qualifier1 -> value1 | qualifierA -> valueA |
|            | qualifier2 -> value2 | qualifierB -> valueB |
------------------------------------------------------------

HBase Schema Design

Master Table

The master table stores the entries, one entry per row:

------------------------------------------------------------------------------------------------------------------------------------------------
| master                               | treeInfo                                         | upAttributes                                       |
------------------------------------------------------------------------------------------------------------------------------------------------
| 00000000-0000-0000-0000-000000000000 |                                                  |                                                    |
------------------------------------------------------------------------------------------------------------------------------------------------
| 11111111-1111-1111-1111-111111111111 | parentId -> 00000000-0000-0000-0000-000000000000 | objectClass0 -> top                                |
|                                      | upRdn -> o=sevenSeas                             | objectClass1 -> organization                       |
|                                      | normRdn -> 2.5.4.10=sevenseas                    | o0 -> sevenSeas                                    |
|                                      |                                                  | entryUUID0 -> 11111111-1111-1111-1111-111111111111 |
------------------------------------------------------------------------------------------------------------------------------------------------
| 22222222-2222-2222-2222-222222222222 | parentId -> 11111111-1111-1111-1111-111111111111 | objectClass0 -> top                                |
|                                      | upRdn -> ou=people                               | objectClass1 -> organizationalUnit                 |
|                                      | normRdn -> 2.5.4.11=people                       | ou0 -> people                                      |
|                                      |                                                  | entryUUID0 -> 22222222-2222-2222-2222-222222222222 |
------------------------------------------------------------------------------------------------------------------------------------------------
| 33333333-3333-3333-3333-333333333333 | parentId -> 11111111-1111-1111-1111-111111111111 | objectClass0 -> top                                |
|                                      | upRdn -> ou=groups                               | objectClass1 -> organizationalUnit                 |
|                                      | normRdn -> 2.5.4.11=groups                       | ou0 -> groups                                      |
|                                      |                                                  | entryUUID0 -> 33333333-3333-3333-3333-333333333333 |
------------------------------------------------------------------------------------------------------------------------------------------------
| 66666666-6666-6666-6666-666666666666 | parentId -> 22222222-2222-2222-2222-222222222222 | objectClass0 -> top                                |
|                                      | upRdn -> cn=Horatio Hornblower                   | objectClass1 -> person                             |
|                                      | normRdn -> 2.5.4.3=horatio hornblower            | objectClass2 -> organizationalPerson               |
|                                      |                                                  | objectClass3 -> inetOrgPerson                      |
|                                      |                                                  | cn0 -> Horatio Hornblower                          |
|                                      |                                                  | description0 -> Capt. Horatio Hornblower, R.N      |
|                                      |                                                  | givenName0 -> Horatio                              |
|                                      |                                                  | sn0 -> Hornblower                                  |
|                                      |                                                  | uid0 -> hhornblo                                   |
|                                      |                                                  | mail0 -> hhornblo@royalnavy.mod.uk                 |
|                                      |                                                  | userPassword0 -> <bytes>                           |
|                                      |                                                  | entryUUID0 -> 66666666-6666-6666-6666-666666666666 |
------------------------------------------------------------------------------------------------------------------------------------------------
| ...                                  |                                                  |                                                    |
------------------------------------------------------------------------------------------------------------------------------------------------

The row key is an UUID, the entryUUID attribute value of the entry is used. The advantage of using UUIDs is that they are random, random keys are preferred in HBase as it allows optimal distribution across data nodes. (Note: XDBM was adjusted to use an generic type parameter for the entry ID)

The treeInfo column family stores hierarchical information:

  • Column parentId contains the row key of the parent entry.
  • Column upRdn contains the user provided local name, relative to the parent (normally the RDN; the suffix DN for the context entry). It is used to reconstruct the entry's DN.
  • Column normRdn contains the normalized local name, relative to the parent (normally the RDN; DN for the context entry). This is just an optimization for constructing the key of the tree table in order to avoid RDN normalization (see below).

The upAttributes column family contains a map with all the attributes.

  • The attribute description (type+options) is used as column qualifier.
  • HBase stores one value per column. There are several workarounds how to store multiple values. When using serialization or JSON format it won't be possible to access one value at a time. Hence each value is stored in its own column and an additional index is added to the column qualifier. This way each value can be read and written separately.
  • The additional index is a zero-based 4-byte signed integer. To reconstruct the user provided attribute description the last 4 bytes needs to be removed from the column qualifier.
  • The values are stored as byte[].

The row with key '00000000-0000-0000-0000-000000000000' is the virtual root.

To retrieve the DN of an entry by its ID the entry's RDN and parent ID must be fetched. As long as the parent ID differs from '00000000-0000-0000-0000-000000000000' this step must be repeated for all parents. The DN is the result of all concatenated RDNs.

It would be also possible to determine the ID for an DN, however a full table scan would be necessary. For this reason a second table is available.

The master table contains all information needed to restore the data: reference to the parent, user provided RDN, and user provided attributes.

Alternatives and Improvements:

  • It would be possible to store the serialized form of the server entry instead of the attributes.
  • It would be possible to store the serialized form of the RDN to avoid parsing.
  • Different kind of attributes could be stored in separate column families (e.g. binary attributes).
  • Usage of entry UUID as row key. This helps to distribute the entries over all clusters and may avoid hot spots.
  • Add oneLevelCount and subLevelCount (currently stored in tree table) for faster lookup of counts.
  • Add normAttributes (currently stored in tree table) for faster lookup of reverse index and evaluator.

Tree Table

The tree table stores parent to child relationships.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| tree                                                            | treeInfo                                   | normAttributes                                           |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 00000000-0000-0000-0000-000000000000,2.5.4.10=sevenseas         | id -> 11111111-1111-1111-1111-111111111111 | 2.5.4.0=organization -> 1                                |
|                                                                 | oneLevelCount -> 4                         | 2.5.4.0=top -> 0                                         |
|                                                                 | subLevelCount -> 1583                      | 2.5.4.10=sevenseas -> 0                                  |
|                                                                 |                                            | 1.3.6.1.1.16.4=11111111-1111-1111-1111-111111111111 -> 0 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 11111111-1111-1111-1111-111111111111,2.5.4.11=people            | id -> 22222222-2222-2222-2222-222222222222 | 2.5.4.0=organizationalunit -> 1                          |
|                                                                 | oneLevelCount -> 1254                      | 2.5.4.0=top -> 0                                         |
|                                                                 | subLevelCount -> 1254                      | 2.5.4.11=people -> 0                                     |
|                                                                 |                                            | 1.3.6.1.1.16.4=22222222-2222-2222-2222-222222222222 -> 0 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 11111111-1111-1111-1111-111111111111,2.5.4.11=groups            | id -> 33333333-3333-3333-3333-333333333333 | 2.5.4.0=organizationalunit -> 1                          |
|                                                                 | oneLevelCount -> 2                         | 2.5.4.0=top -> 0                                         |
|                                                                 | subLevelCount -> 56                        | 2.5.4.11=groups -> 0                                     |
|                                                                 |                                            | 1.3.6.1.1.16.4=33333333-3333-3333-3333-333333333333 -> 0 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 22222222-2222-2222-2222-222222222222,2.5.4.3=horatio hornblower | id -> 66666666-6666-6666-6666-666666666666 | 0.9.2342.19200300.100.1.1=hhornblo -> 0                  |
|                                                                 | oneLevelCount -> 0                         | 0.9.2342.19200300.100.1.3=hhornblo@royalnavy.mod.uk -> 0 |
|                                                                 | subLevelCount -> 0                         | 2.5.4.0=inetorgperson -> 3                               |
|                                                                 |                                            | 2.5.4.0=organizationalperson -> 2                        |
|                                                                 |                                            | 2.5.4.0=person -> 1                                      |
|                                                                 |                                            | 2.5.4.0=top -> 0                                         |
|                                                                 |                                            | 2.5.4.13=capt. horatio hornblower, r.n -> 0              |
|                                                                 |                                            | 2.5.4.3=horatio hornblower -> 0                          |
|                                                                 |                                            | 2.5.4.35=<bytes> -> 0                                    |
|                                                                 |                                            | 2.5.4.4=hornblower -> 0                                  |
|                                                                 |                                            | 2.5.4.42=horatio -> 0                                    |
|                                                                 |                                            | 1.3.6.1.1.16.4=66666666-6666-6666-6666-666666666666 -> 0 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ...                                                             |                                            |                                                          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The row key is composed of the parent entry ID (UUID), a comma, and the normalized RDN of an entry.

The treeInfo column family stores hierarchical information:

  • Column id contains the row key of the entry in the master table
  • Column oneLevelCount tracks the number of immediate children. It is used by the one level index. When adding or deleting an entry the oneLevelCounter of the parent entry is incremented or decremented.
  • Column subLevelCount tracks the number of all descendants. It is used by the sub level index. When adding or deleting an entry the subLevelCount counters of all parent entries are incremented or decremented.

The normAttributes column family stores a map with all attributes (indexed as well as unindexed) in normalized form. It is used for server-side filtering while scanning. The qualifier is composed of the attribute OID, an equals character, and the attribute value. The numeric values represent the 4-byte attribute index in the master table.

Using this table is easy to calculate the ID for an DN. The start row key can always be composed using the partion's suffix: 00000000-0000-0000-0000-000000000000,<suffix>. From that row key the suffix entry ID can be found in the treeInfo:id column. This ID and the next name component from the DN are used to compose the next row key. This is reapeated till all name components of the DN are processed.

The table is also used for one-level and sub-level index cursors. To iterate over all children of an entry 'X' a HBase scanner with start key 'X' and stop key 'X+1' is used. For sub-level scans column column treeInfo:id can be used to setup the next scanner's start and stop key. While walking the sub-level index the column treeInfo:oneLevelCount can be used to determine if it is necessary to scan the next level.

While scanning it is also possible to use column family normAttributes for server-side filtering. This is essential for unindexed searches as it is very expensive to load all entries from the HBase cluster into ApacheDS and evaluate the filter there. Instead the LDAP filter can be translated to an HBase filter and evaluated in the HBase cluster.

The table is also used by reverse indices, e.g. by evaluators.

Alternatives and Improvements:

  • Row keys may become long if custom AT or long values are used. Even a simple RDN like '2.5.4.11=users' has 14 bytes. As the key is always calculated and never parsed back it would be possible to shorten it. Possible strategies were to use some hash (MD5 fixed 16 bytes) or to substitute the OID with the short name.
  • It isn't necessary to store objectClass:top

Index Tables

An HBase table always has exactly one index: the row key. It is used to access the data in constant time. For value indices HBase provides two solutions:

  • Indexed Transactional HBase 'ITH' uses secondary tables but handles that transparently for the user.
  • Indexed HBase 'IH' is new in 0.20.3 https://issues.apache.org/jira/browse/HBASE-2037 compares both approaches. Both approaches don't support multi-valued attributes, so custom secondary tables are used for partition indices.

The characteristics of indices in LDAP are very different:

  • There are attributes with only a few different values but used by many entries. Example: 'objectClass: person'
  • There are (unique) attributes with many different values used by only one or few entries. Example 'uid'

To ensure optimal performance two different index implementations are used:

  • For the first type of attributes a row-based approach is used. For each value and candidate a separate row is created. This approach is the default as it scales and can handle values with millions of candidates. To process all the candidates a scanner over all the values is used and the creation of a scanner generates some overhead which hurts when only less candidates are available.
  • For the second type of attributes a column-based approach is used. For each value only one row exists and the candidate IDs are stored in columns. This approach doesn't scale well, however for a less number of candidates it is the right choice. All candidates for a value can be fetch with one read operation which is much faster than creating a scanner.

Row Index Table

-------------------------------------------------------------------------
| index_objectClass                                       | info        |
-------------------------------------------------------------------------
| =inetorgperson<\00>66666666-6666-6666-6666-666666666666 | id -> <\00> |
-------------------------------------------------------------------------
| =inetorgperson<\00>99999999-9999-9999-9999-999999999999 | id -> <\00> |
-------------------------------------------------------------------------
| =person<\00>66666666-6666-6666-6666-666666666666        | id -> <\00> |
-------------------------------------------------------------------------
| =person<\00>99999999-9999-9999-9999-999999999999        | id -> <\00> |
-------------------------------------------------------------------------

There is one row for each combination of value and entry. The key is composed the following way:

  • it starts with with an equals character '=' (0x3D)
  • followed by the normalized value
  • a null (0x00) character is used as delimiter between value and entry ID
  • finally the entry ID

The info column just holds a dummy value as the entry ID can be extracted from the row key.

The index table can be used for equality, substring, greater-than and less-than searches and can be filtered at the server-side.

  • For equality matches the scan starts at "=value\00" and ends at "=value\01". The trailing null byte and one byte bound the scan to the exact value.
  • For greater-than matches the scan starts at "=value" without an upper bound.
  • For less-than matches the scan stops at "=value" incremented by one bit and without an lower bound.
  • For substring matches a server-side filter is used: "^=<value pattern>\00[A-Fa-f0-9]{8}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{4}-[A-Fa-f0-9]{12}$"
    If the filter contains an initial pattern the lower bound "=value" and upper bound "=value" incremented by one bit can be set.

It is not possible to obtain a candidate count from that type of index table in constant time. Instead the table must be scanned.

Alternatives and Improvements:

  • It may be possible to maintain candidate counts for each value. However the maintanance of those counters significantly decreases write performance.

Column Index Table

------------------------------------------------------------------------
| index_givenname      | info                                          |
------------------------------------------------------------------------
| =cornelius           | 77777777-7777-7777-7777-777777777777 -> <\00> |
------------------------------------------------------------------------
| =horatio             | 66666666-6666-6666-6666-666666666666 -> <\00> |
|                      | CCCCCCCC-CCCC-CCCC-CCCC-CCCCCCCCCCCC -> <\00> |
------------------------------------------------------------------------
| =john                | 99999999-9999-9999-9999-999999999999 -> <\00> |
|                      | DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD -> <\00> |
------------------------------------------------------------------------
| =william             | AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAAA -> <\00> |
------------------------------------------------------------------------

Unlike the row index, the column index contains only one row per indexed value.
The row key is composed of an equals character '=' (0x3D) and the normalized value.
The candidate IDs are stored as column qualifier in the info family.

The index table can be used for equality, substring, greater-than and less-than searches and can be filtered at the server-side.

  • For equality matches exactly one row "=value" must be fetched.
  • For greater-than matches the scan starts at "=value" without an upper bound.
  • For less-than matches the scan stops at "=value" incremented by one bit and without an lower bound.
  • For substring matches a server-side filter is used: "^=<value pattern>$"
    If the filter contains an initial pattern the lower bound "=value" and upper bound "=value" incremented by one bit can be set.

This type of index table allows to count candidate of a value by fetching the whole row including all candidates using a single RPC call.

Existence Index

-------------------------------------------------------
| index_objectClass                     | info        |
-------------------------------------------------------
| *66666666-6666-6666-6666-666666666666 | id -> <\00> |
-------------------------------------------------------
| *99999999-9999-9999-9999-999999999999 | id -> <\00> |
-------------------------------------------------------

Additional to values the index tables store information about attribute existence.

The row key is composed of an asterisk character '*' (0x2A) and candidate ID.
The info column just holds a dummy value as the entry ID can be extracted from the row key.

Additional Notes

HBase tables can only be scanned by row key in ascending order. If scanning in descending order is required it is possible to create a second index table with bit-wise inverted row keys.

HBase sorts rows lexicographical by row key. To use the indices for greater-than and lesser-than filters it is important that the byte representation of the normalized values follows that rule. http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/ provides a good overview.

  • No labels