Blog

LDIF File Backed Partition

Introduction & Requirements

This document describes a new Partition implementation which backs its entry information in files using LDIFs. The LDIF Partition like it's JDBM based counterpart will be configured to store these LDIF files under some base directory. Zero or more LDIF files could be placed under this base directory in a flat or hierarchical directory structure. There should be no requirement to specify a load order for the LDIF files or the entries in the files. Using deferred construction of indices the Partition should be able to infer the entry hierarchy by the distinguished names of the entries under the suffix.

The LDIF Partition should do everything it can to prevent alteration of these files by users and other processes. File locking and other techniques can be leveraged to minimize the chance of concurrent alteration through the LDAP protocol and external means such as editors. Avoiding this situation will remove the need to have complex conflict resolution strategies.

Initialization

This LDIF Partition like the JDBM partition will maintain various hashes and btree structures to enable rapid searching. The only difference is that the LDIF partition will maintain these structures in memory. To build the proper indices on attributeTypes we still need access to the schema to perform matching and normalization. This should not be a problem and can be provided to the Partition at object creation time. At initialization time the partition will:

  1. Recursively descend down the base directory searching for files ending with the ldif file extension.
  2. Load each LDIF file discovered by:
    1. Parsing all LDIF entries to ServerEntry objects
    2. Putting ServerEntry objects into an in memory HashMap: a Long ID (the key) to each LDIF's ServerEntry is assigned.
    3. Add a link in a HashMap to the file the LDIF entry was found in: Long ID to LDIF file URI
    4. Populate the ID<->NDN btree in both directions: this is the NDN index. If overlaps are discovered report the presence of LDIF entries with the same DN.
    5. Populate the existence btree and other user indices deferring hierarchy index build for 2nd pass.
  3. Take a second pass off the NDN index to build the hierarchy index while reporting and removing orphaned entries.

Write Operations

When LDAP entries corresponding to LDIF entries are altered, the changes, what ever they may be, need to be pushed back into the LDIF file. The entire file will be written out while leveraging the ID to file URI index. All the entries in the LDIF file are written out regardless of whether or not they changed. This applies to modify and modifyDn operations as well as deletes where the entry in question would be removed instead of being altered.

The add operation however presents an interesting situation since the added entry never existed before in any LDIF file. Which LDIF file do we add it to? One solution would be to add it to the LDIF file of this new entry's parent. As an optimization we can add the new entry to a new LDIF file if the file containing the parent has more than a some specified number of entries. This might reduce the time to write out the LDIF files.

Although programatic transformation of ServerEntry objects back to LDIFs might take some time, it is a few degrees of magnitude less than the time taken to write the LDIF file to disk.
However when writing entries back into an LDIF file, there is little difference whether you write 1 or 100 entries. The primary cost in the disk IO comes from disk seek time which happens once to start writing. There's probably a cut off at which this is not true. Rather than deal with the number of entries one could look at the current size of the file to determine this. Since entries can vary in size the size of the file might be a better indicator. Experiments should be performed to understand what the relative point for this cut off is. I say relative because the nature of the disk subsystem will shift this value somewhat.