Introduction

Parsing DN is one of the most expensive task in a LDAP server. It's not only consumming CPU, but also memory, which cause the garbage collector to be called more often (some measures show that GC represent 30% of all the time consummed during an intensive searchRequest treatment).

Moreover, DN parsing is a complex operation, due to the numerous special cases, plus the normalization that should occurs in two places, and because we have to store the DN in more than one form to speed up the server by avoiding some useless parsing and normalization.

We will see in this page the way we are parsing DN in the server

LdapDN structure

External structure

From the user point of view, a DN is a composition :

DN => n * RDN
RDN => m * AVA
AVA => AttributeType + AttributeValue
AttributeType => non sensitive ascii String or an OID
AttributeValue => a String

Check if attributeType can have options.

Each RDN is separated from the other ones by a ',' or a ';' character.
Each AVA is separated from the other ones by a '+' character.
Attribute and values are separated by the '=' character.

At this point, we have said nothing about the charset used by DN. DN are supposed to be transmitted as UTF-8 chars to the server, so we might have up to 6 bytes to store a char. The RFCs specified that a user can escape some values by using either '\xx' sequence in a value (where xx is the hexadecimal value of this char), or octothrops strings (a string composed of a starting '#' character followed by hexadecimal pairs, corresponding of the byte values of the value part). We have to deal with those cases.

The last point is that spaces are not meaningfull around the ',', ';', '+' and '=' character, nor they are at the beginning and end of each RDN, AVA, type and value. To be able to store a leading or trailing space into a value, a user must embed this value between two '"' characters.

Of course, some other characters must be escaped : '\', ' ', '#', '=', '', ',', '"', '', ';', and we will have to deal with those cases too, but oly if the value is not surrounded by '"'...

Last, not least, as DNs contain values which are potentially stored into the backend as attribute values, wee must go through the PrepString process for each of this strings, if they are H/R of course (a DN can contain binary values, even if it eems strange, as soon as their value are correctly escaped -using either the hexadecimal notation or the starting # -)

Internal structure

The internal DN structure we have chose depends on different considerations :

speed : we must be able to avoid costly operations, by keeping normalized and parsed forms of the DN
memory consumption : as we said, garbage collecting is a costly operation. We should keep the number of objects necessary to hold a DN as small as possible. We must also have small objects because the smaller they are, the more of them we can store in cache.
serialization : DN are stored in the backend, which means we must serialize and deserialize those DNs. The operation must be fast, as disk access are slower than memory access by 2 orders of magnitude.
CPU consumption : we should avoid as much as possible doing twice the same opreation (like parsing, normalization), because the more operation we do, the more latency we generate due to synchronized portions of code. We also have to release the IOExecutor threads which are used to process the incoming requests.

As everyhting is a balance between those elements, we should favor the most frequent cases, which are pure ASCII DNs where RDN contains only one AVA, and where value are not escaped. If we don't fall into this simple case, then we fallback to the complex parsing.

DN cycle of life

DNs are send by users to identify entries. When an entry is first stored, its DN is created in the DN table. When a user is searching for a specific entry using its DN, it is compared whith the stored DNs. When a user does a ModifyDN operation, the DN might be modified for one or more entry (either replaced completely, or renamed). This last cas has an important impact on the internal data structure we chose.

What is important is that we have two forms for a DN : UP (User Provided) and Norm (normalized DN). UP DNs are used when we return this info to the user. Norm form is used internally to uniquely identify entries. It is very important to understand that DNs are manipulated in their Norm form into the server.

We have one special case : when moving data from a place to another one with the ModifyRDN request, which can modify the DN (its UP and Norm form), so we must be able to construct a new UP and Norm form of the modified DN.

So we will keep a Normalized form (Norm) and a user provided form (UP). What about storing the bytes ?

As every DNs are received by the server as an array of bytes, and will be returned to the client as an array of bytes, we might want to store this byte array in the internal DN structure.
The main benefit is that we will be able to encode the DN very quickly, avoiding a costly call to the method dn.getBytes( "UTF-8" ). The main drawback is that we will have to store more data in memory, and serialization will cost more.

We have to check if this is a valid optimization or not, by running some benchmarks.

Internally, we will use the normalized form, which is a form where the values have been processed by the PrepString algorithm, special chars have been unescaped, surrounding double quote have been removed and value has been normalized accordingly to its associated attribuyeType normalizer (assuming that we have such an attributeType normalizer available).

Normalization

We have five kinds of normalization to apply :

Attribute types are lower cased and spaces are removed around ',', ';', '=', and '+'. Options are kept.
Then attribute type are transformed to their OID counterpart to avoid having to deal with multiple form of the same attributeType (AT can have aliases, like CN, CommonNmae, and 2.5.4.4 which are all the same attributeType)
Attribute values are unescaped and transformed to Strings, if the attributeType is H/R
Attribute values are transformed applying the PrepString algorithm if their AttributeType is H/R
Attribute values are normalized accordingly to their attributeType syntax.
At the end, into a RDN, AVA will be ordered following the alphabetic increase order.

PrepString process hould be executed at the right moment. As we may have escaped chars, this could not occur before the unescaping process, but as soon as we have a String, we can do it (so it's before the normalization)

For instance, the following RDN :
"ou=" Some People " + dc = + Some anImAls,dommainComponent = eXample,dc= cOm"
will be transformed into as the Norm string:
"0.9.2342.19200300.100.1.25=+ some animals+2.5.4.11=\ some people\ \ ,0.9.2342.19200300.100.1.25=example,0.9.2342.19200300.100.1.25=com"
and the internal storage of UP and normalized values will be :

(here, quotes are just used to expose the leading and trailing spaces)

DN
  UP : 'ou=" Some   People  " + dc =  \+   Some anImAls,dommainComponent = eXample,dc= cOm'
  Norm : '0.9.2342.19200300.100.1.25=\+ some animals+2.5.4.11=\ some people\ \ ,0.9.2342.19200300.100.1.25=example,0.9.2342.19200300.100.1.25=com'
  RDN 1
    UP : 'ou=" Some   People  " + dc =  \+   Some anImAls'
    Norm : '0.9.2342.19200300.100.1.25=\+ some animals+2.5.4.11=\ some people\ \ '
    AVA 1
      UP AT : dc
      UP val : '\+   Some anImAls'
      Norm AT : 0.9.2342.19200300.100.1.25
      Norm val = '+ some animals'
    AVA 2 
      UP AT : ou
      UP val : '" Some   People  "'
      Norm AT : 2.5.4.11
      Norm val = ' some people  ' 
  RDN 2
    UP : 'dommainComponent = eXample'
    Norm : '0.9.2342.19200300.100.1.25=example'
    AVA 1
      UP AT : dommainComponent
      UP val : 'eXample'
      Norm AT : 0.9.2342.19200300.100.1.25
      Norm val = 'example'
  RDN 3
    UP : 'dommainComponent = cOm'
    Norm : '0.9.2342.19200300.100.1.25=com'
    AVA 1
      UP AT : dommainComponent
      UP val : 'cOm'
      Norm AT : 0.9.2342.19200300.100.1.25
      Norm val = 'com'

As we can see, types are replaced by their OID, useless spaces are removed, RDN are reordered and values are normalized (lowercased, and multiple inner spaces are replaced by a simple space, accordingly with OU and DC normalizer).

Another important thing is that we store different kind of UP values depending on the level of storage (DN, RDN or AVA). This is necessary as we manipulate those informations at different levels tooo, depending on which operation we are dealing with : values comparisons, DN modification, DN comparisons...

A special case is when we can't find the AttributeType in the schema. They are three cases where it can happen :

the entry is a referral (ie the referral OC is present into this entry)
the entry has the extensible ObjectClass
we are facing an error

In the two first cases, we won't be able to do a prepSTring nor a normalization as we don't know if the AttributeType is H/R or not, and we don't know either anything about the normalizer. We will simply track the fact that the AttributeType is unknown by not attaching a reference to the associated AttributeType object (for ServerDNs) and by not filling the normalized form for the AVA (keeping it null: an empty tring won't be enough, as we may have empty values after normalization - a very weird and twisted posibility, but ...-).

Internal structure

We have five objects to describe :

DN
RDN
AVA
Type
Value

The following paragraphs describe those ojects internal structure.

DN internal structure

A DN should not be empty, but we have to accept such a DN (rootDSE is the empty DN). We will store three forms for a DN

user provided form
byte array of this user provided form
normalized form

RDN internal structure

We should keep two forms of a RDN :

a User Provided form
a normalized form

This is important because we will often manipulate those RDN, for instance while searching the partition a DN is associated with.

A RDN may contains more than one AVA, but usually contains only one. An optimization could be not to create an array to store more than one ATV, but instead keeping it in a single member. If we have more than one AVA, then we will create an array instead. The extra cost of creating this array is totally acceptable regarding the frequency of such multiple AVAs in a RDN (which is very low). On the other hand, it forces the access method to deal with the number of AVA, but this cost is obviously less than access an AVA through an ArrayList (to be double checked)

What about a byte[] to store this Rdn? It does not seems we need it, even if we need the byte arrays to create a new DN when dealing with a ModifyDN operation. As it is not a frequent operation, we can accept the extra cost of a conversion from String to a byte array.

  class RDN
    
    // The user provided form
    String upRdn;

    // The normalized form
    String normRdn;

    // The AVA if we have only one
    AVA ava;

    // An array of AVAs if we have more than one
    List<AVA> avas;

    // The RDN position in the UP DN
    int start;

    // The RDN length in the UP DN
    int length;

The extra two fields (start and lngth) are usefull when dealing with a ModifyDN operation, in order to keep the UP RDN ordering. If the user wants to replace a RDN by another, we have to subtitute the RDN in the original position, creating a new UP DN.

AVAs are stored in alphabetic order

AVA internal structure

An AVA contains an attributeType and an attribute value. We could keep the String representation of an AVA internally, but as those members already have a String representation, and as we rarelly manipulate an AVA (except when creating it while decoding a DN), there is no need to store this duplicated information.

class AVA
  Attribute type;
  Value<?> value;

Attribute internal structure

We will store two informations :

the User provided form
the reference to the asociated AttributeType object if we have access to the schema

A the second form already contains the firt one, normalized, we don't need to keep both of them. It's better to define an interface and two implementing clases : UnknownAttribute and SchemAttribute. The interface will define common accessors for thos two classes. We will also define an intermediate Abstract class, carrying the common methods.

  interface Attribute
    // Tells if the attributeType is known
    boolean isSchemaAvailable();

    // Tells if the attributeType is binary
    boolean isBinary();


  absstract class AbstractAttribute
    // Impement the common methods


  class UnknownAttribute
    String        upType;


  class SchemaAttribute
    AttributeType attributeType;

AttributeValue internal structure

We will ue the Value<?> class we have already defined.

Child pages

DN Parsing

Introduction

LdapDN structure

External structure

Internal structure

DN cycle of life

Normalization

Internal structure

DN internal structure

RDN internal structure

AVA internal structure

Attribute internal structure

AttributeValue internal structure