Introduction

Parsing DN is one of the most expensive task in a LDAP server. It's not only consumming CPU, but also memory, which cause the garbage collector to be called more often (some measures show that GC represent 30% of all the time consummed during an intensive searchRequest treatment).

Moreover, DN parsing is a complex operation, due to the numerous special cases, plus the normalization that should occurs in two places, and because we have to store the DN in more than one form to speed up the server by avoiding some useless parsing and normalization.

We will see in this page the way we are parsing DN in the server

LdapDN structure

External structure

From the user point of view, a DN is a composition :

DN => n * RDN
RDN => m * AVA
AVA => AttributeType + AttributeValue
AttributeType => non sensitive String or an OID
AttributeValue => a String

Each RDN is separated from the other ones by a ',' or a ';' character.
Each AVA is separated from the other ones by a '+' character.
Attribute and values are separated by the '=' character.

At this point, we have said nothing about the charset used by DN. DN are supposed to be transmitted as UTF-8 chars to the server, so we might have up to 6 bytes to store a char. The RFCs specified that a user can escape some values by using either '\xx' sequence in a value (where xx is the hexadecimal value of this char), or octothrops strings (a string composed of a starting '#' character followed by hexadecimal pairs, corresponding of the byte values of the value part). We have to deal with those cases.

The last point is that spaces are not meaningfull around the ',', ';', '+' and '=' character, nor they are at the beginning and end of each RDN, AVA, type and value. To be able to store a leading or trailing space into a value, a user must embed this value between two '"' characters.

Of course, some other characters must be escaped : '\', ' ', '#', '=', '', ',', '"', '', ';', and we will have to deal with those cases too...

Internal structure

The internal DN structure we have chose depends on different considerations :

speed : we must be able to avoid costly operations, by keeping normalized and parsed forms of the DN
memory consumption : as we said, garbage collecting is a costly operation. We should keep the number of objects necessary to hold a DN as small as possible. We must also have small objects because the smaller they are, the more of them we can store in cache.
serialization : DN are stored in the backend, which means we must serialize and deserialize those DNs. The operation must be fast, as disk access are slower than memory access by 2 orders of magnitude.
CPU consumption : we should avoid as much as possible doing twice the same opeation (like parsing, normalization), because the more operation we do, the more latency we generate due to synchronized portions of code. We also have to release the IOExecutor threads which are used to process the incoming requests.

As everyhting is a balance between those elements, we should favor the most frequent cases, which are pure ASCII DNs where RDN contains only one AVA, where value are not escaped.

DN cycle of life

DNs are send by users to identify entries. When a entry is first stored, its DN is created in the DN table. When a user is searching for a specific entry using its DN, it is compared whith the stored DNs.

What is important is that we have two forms for a DN : UP (User Provided) and Norm (normalized DN). UP DNs are used when we return this info to the user. Norm form is used internally to uniquely identify entries. It is very important to understand that DNs are manipulated in their Norm form into the server.

We have one special case : when moving data from a place to another one with the ModifyRDN request, which can modify the DN (its UP and Norm form), so we must be able to construct a new UP and Norm form of the modified DN.

So we will keep a Norm form and an UP form. What about storing the bytes ?

As every DNs are received by the server as an array of bytes, and will be returned to the client as an array of bytes, we might wangt to store this byte array in the internal DN structure.
The main benefit is that we will be able to encode the DN very quickly, avoiding a costly call to the method dn.getBytes( "UTF-8" ). The main drawback is that we will have to store more data in memory, and serialization will cost more.

We have to check if this is a valid optimization or not, by running some benchmarks.

Internally, we will use the normalized form

Normalization

We have for kinds of normalization to apply :

attribute types are lower cased and spaces are removed around ',', ';', '=', and '+'
then attribute type are transformed to their OID counterpart to avoid having to deal with multiple form of the same attributeType (AT can have aliases, like CN, CommonNmae, and 2.5.4.4 which are all the same attributeType)
attribute values are normalized accordingly to their attributeType suyntax.
at the end, into a RDN, AVA will be ordered following the alphabetic increase order.

For instance, the following RDN :
"ou= Some People + dc = And Some anImAls,dommainComponent = eXample,dc= cOm"
will be transformed into :
"0.9.2342.19200300.100.1.25=and some animals+2.5.4.11=some people,0.9.2342.19200300.100.1.25=example,0.9.2342.19200300.100.1.25=com"

As we can see, types are replaced by their OID, useless spaces are removed, RDN are reordered and values are normalized (lowercased, and multiple inner spaces are replaced by a simple space).

Internal structure

We have five objects to describe :

DN
RDN
AVA
Type
Value

The following paragraphs describe those ojects internal structure.

DN internal structure

A DN should not be empty, but we have to accept such a DN (rootDSE is the empty DN). We will store three forms for a DN

user provided form
byte array of this user provided form
normalized form

RDN internal structure

We should keep two forms of a RDN :

a User Provided form
a normalized form

This is important because we will often manipulate those RDN, for instance while searching the partition a DN is associated with.

A RDN may contains more than one AVA, but usually contains only one. An optimization could be not to create an array to store more than one ATV, but instead keeping it in a single member. If we have more than one AVA, then we will create an array instead. The extra cost of creating this array is totally acceptable regarding the frequency of such multiple AVAs in a RDN (which is very low). On the osther side, it forces the access metho to deal with the number of AVA, but this cost is obviously less than access an AVA through an ArrayList (to be double checked)

What about a byte[] to store this Rdn? It does not seems we need it.

	class RDN
		// The user provided form
		String upRdn;

		// The AVA if we have only one
		AVA ava;

		// A flag set to true if we have only one AVA
		boolean isSingle;

		// An array of AVAs if we have more than one
		List<AVA> avas;

AVAs are stored in alphabetic order ?

AVA internal structure

An AVA contains an attributeType and an attribute value. We could keep the String representation of an AVA internally, but as those members already have a String representation, and as we rarelly manipulate an AVA (except when creating it while decoding a DN), there is no need to store this duplicated information.

	class AVA
		AttributeType type;
		AttributeValue value;

AttributeType internal structure

We will store only two informations :

the User provided form
the attributeType OID

we can keep the OID in two forms : as a String, or as an OID object. The OID object is smaller, and allows faster comparisons, as an OID String will be at least 2 times longer than an OID object:
"1.2.840.48018.1.2.2" is a 19 chars long string, while its equivalent OID :
0x2A, 0x86, 0x48, 0x82, 0xF7, 0x12, 0x01, 0x02, 0x02, which is a 9 bytes long array

	class AttributeType
		String 	upType;
		OID	normType;

AttributeValue internal structure

This element should be kept as a String (UP form) and as a byte array if the corresponding AT is not H/R, or as a String if the AT is H/R (H/R : Human Readable)

It could be good to keep it as a byte array to avoid a costly conversion when sending back the data to the user (to be checked)

The structure will be :

	class AttributeValue
		// a flag which tells if this value is H/R or not
		boolean isHR; 

		// The user povided value
  		String upValue; 

		// The structure which hold the value, either a String ro a byte[]
  		NormValue normValue;

with the interface NormValue and the implementing classes :

	
	interface NormValue

	class NormStringValue implements NormValue
		String value; // The value if it's a String

	class NormBytesValue implements NormValue
		byte[] value; // The value if it's a byte array

We might also create an interface to handle big values (above 1024 bytes or char, for instance), called StreamedValue :
interface StreamedValue

This interface will be implemented by StreamedNormStringValue and StreamedNormBytesValue

Child pages

DN Parsing

Introduction

LdapDN structure

External structure

Internal structure

DN cycle of life

Normalization

Internal structure

DN internal structure

RDN internal structure

AVA internal structure

AttributeType internal structure

AttributeValue internal structure