Introduction

Normalizers are used to normalize attribute values before we store them in the backend, when they are part of an indexed attribute, or when they are used in a DN.

We normalize those values because we need to compare the request value with the stored value when searching, adding, comparing or deleting them. If we did that using the user-provided value, we might store duplicate values in the backend, or we might not find some entries.

A value is normalized with respect to its specific matching rule, as described in RFC 4517, section 4.2 and following.

List of normalizers

Relation between MatchingRules and Normalizers

The following table lists all the matching rules used in ADS, with the normalizer associated with each of them:

MatchingRule	Normalizer
bitStringMatch	NoOp
booleanMatch	DeepTrimToLowerCase
caseExactIA5Match	PreparedIA5String
caseExactMatch	PreparedString
caseExactOrderingMatch	PreparedString
caseExactSubstringsMatch	PreparedString
caseIgnoreIA5Match	PreparedToLowerIA5String
caseIgnoreIA5SubstringsMatch	PreparedToLowerIA5String
caseIgnoreListMatch	PreparedToLowerString
caseIgnoreListSubstringsMatch	PreparedToLowerString
caseIgnoreMatch	PreparedToLowerString
caseIgnoreOrderingMatch	PreparedToLowerString
caseIgnoreSubstringsMatch	PreparedToLowerString
directoryStringFirstComponentMatch	DN
distinguishedNameMatch	DN
generalizedTimeMatch	GeneralizedTime
generalizedTimeOrderingMatch	GeneralizedTime
integerFirstComponentMatch	PreparedString
integerMatch	PreparedString
integerOrderingMatch	PreparedString
keywordMatch	Keyword
numericStringMatch	Numeric
numericStringOrderingMatch	Numeric
numericStringSubstringsMatch	Numeric
objectIdentifierFirstComponentMatch	NoOp
objectIdentifierMatch	NoOp
octetStringMatch	NoOp
octetStringOrderingMatch	NoOp
presentationAddressMatch	DeepTrimPreparedToLowerString
protocolInformationMatch	Protocol
telephoneNumberMatch	Telephone
telephoneNumberSubstringsMatch	Telephone
uniqueMemberMatch	DN
wordMatch	Word

Normalizers

The following table lists the Normalizers we have to implement:

Normalizer	Description	Class
DeepTrim	Remove spaces at the beginning and at the end of the value. Replace all consecutive spaces by a single space inside the value.	DeepTrimNormalizer
DeepTrimPreparedToLowerString	-	DeepTrimToLowerNormalizer
DeepTrimToLowerCase	Suppress spaces at the beginning and at the end, and lowercase	DeepTrimToLowerNormalizer
DN	-	-
GeneralizedTime	-	-
Keyword	-	-
NoOp	Do nothing.	NoOpNormalizer
Numeric	Prepare the numeric string according to RFC 4518	NumericNormalizer
PreparedIA5String	Prepare the IA5 string according to RFC 4518	DeepTrimNormalizer
PreparedString	Prepare the string according to RFC 4518. The string must be a DirectoryString, such as PrintableString.	DeepTrimNormalizer
PreparedToLowerIA5String	-	DeepTrimToLowerNormalizer
PreparedToLowerString	-	DeepTrimToLowerNormalizer
Protocol	-	-
Telephone	Prepare the telephone number string according to RFC 4518	TelephoneNumberNormalizer
Word	-	-
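
To make the DeepTrim and DeepTrimToLowerCase descriptions above more concrete, here is a minimal sketch of what such a transformation could look like. The class and method names (DeepTrimSketch, deepTrim, deepTrimToLower) are hypothetical and are not the actual DeepTrimNormalizer or DeepTrimToLowerNormalizer classes; only the plain space character U+0020 is considered.

```java
import java.util.Locale;

// Hypothetical illustration, not the ADS DeepTrimNormalizer implementation.
final class DeepTrimSketch {

    /** Collapses runs of spaces to one space, then drops a leading or trailing space. */
    static String deepTrim(String value) {
        String collapsed = value.replaceAll(" +", " ");
        return collapsed.replaceAll("^ | $", "");
    }

    /** Same as deepTrim, followed by locale-independent lowercasing. */
    static String deepTrimToLower(String value) {
        return deepTrim(value).toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, "  John   Smith " and "john smith" both end up as the same value once passed through deepTrimToLower, which is what lets the server compare a request value with a stored one.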

String 'preparation'

When we have to normalize a String in ADS, we have to prepare it using the following six steps, as described in RFC 4518:

Transcode
Map
Normalize
Prohibit
Check Bidi (Bidirectional)
Insignificant Character Handling

This preparation is necessary to compare assertion values with attribute values stored in the backend. Sadly, RFC 4518 was written with Unicode in mind, and we are supposed to support the full set of Unicode chars. This is obviously very difficult and will imply a lot of changes in the server, as we support 16-bit chars only. Normalizing, for instance, will be supported by Java 6, but is not supported in Java 5.

Transcoding

We won't have to transcode the values in ADS, as every string is already transcoded to UTF-8. The RFC specifies that we should transcode all PrintableStrings to Unicode strings, and as UTF-8 is a way to encode a Unicode string, this is OK.
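
As a small illustration of the relation between a Unicode string and its UTF-8 bytes, the following sketch uses only the standard JDK charset API. The class name TranscodeSketch and its two helpers are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, shown only to illustrate the String <-> UTF-8 relation.
final class TranscodeSketch {

    /** Decodes UTF-8 bytes into a Java (Unicode) String. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes a Java String as UTF-8 bytes. */
    static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
```

For instance, the single char U+00F6 is encoded as the two bytes C3 B6, and decoding those two bytes gives back the same char.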

Mapping

Mapping is the action of transforming some Unicode chars into other ones. The following table shows all the needed transformations:

Unicode	UTF-8 bytes	Transformation
U+0000-0008	[00-08]	deleted
U+0009 (CHARACTER TABULATION)	09	U+0020 (SPACE)
U+000A (LINE FEED)	0A	U+0020 (SPACE)
U+000B (LINE TABULATION)	0B	U+0020 (SPACE)
U+000C (FORM FEED)	0C	U+0020 (SPACE)
U+000D (CARRIAGE RETURN)	0D	U+0020 (SPACE)
U+000E-001F	[0E-1F]	deleted
U+007F-0084	7F, C2 [80-84]	deleted
U+0085 (NEXT LINE)	C2 85	U+0020 (SPACE)
U+0086-009F	C2 [86-9F]	deleted
U+00A0	C2 A0	U+0020 (SPACE)
U+00AD (SOFT HYPHEN)	C2 AD	deleted
U+034F (COMBINING GRAPHEME JOINER)	CD 8F	deleted
U+06DD	DB 9D	deleted
U+070F	DC 8F	deleted
U+1680	E1 9A 80	U+0020 (SPACE)
U+1806 (MONGOLIAN TODO SOFT HYPHEN)	E1 A0 86	deleted
U+180B-180D (VARIATION SELECTOR)	E1 A0 [8B-8D]	deleted
U+180E	E1 A0 8E	deleted
U+2000-200A	E2 80 [80-8A]	U+0020 (SPACE)
U+200B (ZERO WIDTH SPACE)	E2 80 8B	deleted
U+200C-200F	E2 80 [8C-8F]	deleted
U+2028-2029	E2 80 [A8-A9]	U+0020 (SPACE)
U+202A-202E	E2 80 [AA-AE]	deleted
U+202F	E2 80 AF	U+0020 (SPACE)
U+205F	E2 81 9F	U+0020 (SPACE)
U+2060-2063	E2 81 [A0-A3]	deleted
U+206A-206F	E2 81 [AA-AF]	deleted
U+3000	E3 80 80	U+0020 (SPACE)
U+FE00-FE0F (VARIATION SELECTOR)	EF B8 [80-8F]	deleted
U+FEFF	EF BB BF	deleted
U+FFF9-FFFB	EF BF [B9-BB]	deleted
U+FFFC (OBJECT REPLACEMENT CHARACTER)	EF BF BC	deleted
U+1D173-1D17A	F0 9D 85 [B3-BA]	not handled
U+E0001	F3 A0 80 81	not handled
U+E0020-E007F	F3 A0 80 [A0-BF] - F3 A0 81 [80-BF]	not handled

Characters above 0xFFFF are not handled in the current versions of ADS (1.0.x and 1.5.x)
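
The rows of the table above that fall below 0x10000 could be sketched as follows. This is only an illustration under stated assumptions: the names MapSketch, map, mapsToSpace and isDeleted are hypothetical and are not the API of the PrepareString class, the code works on Java chars (code points, not UTF-8 bytes), characters above 0xFFFF are left untouched, and lowercasing is not included.

```java
// Hypothetical sketch of the Map step for 16-bit chars only.
final class MapSketch {

    /** Replaces some chars by SPACE, deletes others, keeps the rest. */
    static String map(String value) {
        StringBuilder target = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (mapsToSpace(c)) {
                target.append(' ');
            } else if (!isDeleted(c)) {
                target.append(c);
            }
        }
        return target.toString();
    }

    /** Chars mapped to U+0020 in the table. */
    static boolean mapsToSpace(char c) {
        return (c >= 0x0009 && c <= 0x000D) || c == 0x0085 || c == 0x00A0
            || c == 0x1680 || (c >= 0x2000 && c <= 0x200A)
            || c == 0x2028 || c == 0x2029 || c == 0x202F
            || c == 0x205F || c == 0x3000;
    }

    /** Chars marked as deleted in the table. */
    static boolean isDeleted(char c) {
        return c <= 0x0008 || (c >= 0x000E && c <= 0x001F)
            || (c >= 0x007F && c <= 0x0084) || (c >= 0x0086 && c <= 0x009F)
            || c == 0x00AD || c == 0x034F || c == 0x06DD || c == 0x070F
            || c == 0x1806 || (c >= 0x180B && c <= 0x180E)
            || (c >= 0x200B && c <= 0x200F) || (c >= 0x202A && c <= 0x202E)
            || (c >= 0x2060 && c <= 0x2063) || (c >= 0x206A && c <= 0x206F)
            || (c >= 0xFE00 && c <= 0xFE0F) || c == 0xFEFF
            || (c >= 0xFFF9 && c <= 0xFFFC);
    }
}
```

With this sketch, a tabulation between two letters becomes a single space, while a SOFT HYPHEN or a ZERO WIDTH SPACE simply disappears from the result.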

We also have to lowercase all the uppercase chars. The transformations we must apply include those which produce more than one character.

The Prepare Map, Lowercasing page contains all the mappings to apply.

This mapping is implemented in the class PrepareString.

Normalizing

The normalization step transforms complex characters into several simpler ones: for instance, 'Schön' is transformed into 'Scho\u0308n', in order to ease comparisons. Sadly, Java versions before Java 6 do not have the tools to do this normalization (Java 6 has them). The question is: why do we have to normalize?

The current version (1.5) of ADS will not support Unicode chars above 0xFFFF (they are encoded as 2 successive chars, using surrogates). The Normalizing step is not applied.
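
For reference, here is a minimal sketch of what the Java 6 API offers through java.text.Normalizer. It is not part of ADS, where this step is not applied, and the class name NormalizeSketch is hypothetical. Note that the decomposition shown in the 'Scho\u0308n' example corresponds to the NFD form, while RFC 4518 asks for the compatibility composed form (NFKC); both are shown.

```java
import java.text.Normalizer;

// Hypothetical illustration of the JDK normalization API (Java 6 and later).
final class NormalizeSketch {

    /** Canonical decomposition: a precomposed letter becomes base letter + combining mark. */
    static String decompose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD);
    }

    /** Compatibility composition, the form named by RFC 4518. */
    static String toNfkc(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFKC);
    }
}
```

Whatever form is chosen, the important point is that the assertion value and the stored value go through the same one, so that a precomposed and a decomposed spelling of the same word compare as equal.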

Prohibiting

Some chars are not allowed. The prohibit method checks for each invalid character, and throws an exception if one of them is met.

The prohibited characters are listed in tables A.1, C.3, C.4, C.5 and C.8 in RFC 3454. The character U+FFFD is also prohibited.
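
A rough sketch of such a check is given below. It is an approximation, not the ADS prohibit method: the names ProhibitSketch, checkProhibited and isProhibited are hypothetical, the unassigned test relies on the Unicode version of the running JVM rather than on table A.1 of RFC 3454, and since it works on 16-bit chars it rejects every surrogate, which matches the fact that chars above 0xFFFF are not supported.

```java
// Hypothetical approximation of the Prohibit step for 16-bit chars.
final class ProhibitSketch {

    /** Throws IllegalArgumentException on the first prohibited char. */
    static void checkProhibited(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isProhibited(c)) {
                throw new IllegalArgumentException(
                    String.format("Prohibited character U+%04X at index %d", (int) c, i));
            }
        }
    }

    static boolean isProhibited(char c) {
        int type = Character.getType(c);
        return c == 0xFFFD                                    // replacement character
            || type == Character.UNASSIGNED                   // approximates table A.1
            || type == Character.PRIVATE_USE                  // table C.3
            || (c >= 0xFDD0 && c <= 0xFDEF)                   // table C.4 (non-characters)
            || c == 0xFFFE || c == 0xFFFF
            || type == Character.SURROGATE                    // table C.5
            || c == 0x0340 || c == 0x0341                     // table C.8
            || c == 0x200E || c == 0x200F
            || (c >= 0x202A && c <= 0x202E)
            || (c >= 0x206A && c <= 0x206F);
    }
}
```

Throwing an unchecked IllegalArgumentException is just a choice made for this sketch; a server would more likely use its own exception type.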

Checking Bidi

Bidirectional characters are ignored, and thus eliminated from the String. Those bidirectional characters are listed in tables D.1 and D.2 in RFC 3454.

Insignificant character handling

We have three different kinds of Strings to handle:

Numeric Strings
Telephone Numbers
Printable Strings with case ignore and exact string matching

Numeric Strings :

All the spaces will be removed. For instance, "12 34 56 7" is transformed to "1234567"

Telephone Number Strings :

All spaces and hyphens are deleted: "(33)1 12 34-56-78" is transformed to "(33)112345678"
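
The two simple cases, numeric strings and telephone numbers, can be sketched in a few lines. The names InsignificantCharsSketch, numeric and telephone are hypothetical, and the sketch only considers the ASCII space and the ASCII hyphen-minus; it is not the NumericNormalizer or TelephoneNumberNormalizer code.

```java
// Hypothetical sketch of insignificant character removal for the two simple cases.
final class InsignificantCharsSketch {

    /** Numeric strings: every space is removed. */
    static String numeric(String value) {
        return value.replace(" ", "");
    }

    /** Telephone numbers: every space and every hyphen is removed. */
    static String telephone(String value) {
        StringBuilder target = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c != ' ' && c != '-') {
                target.append(c);
            }
        }
        return target.toString();
    }
}
```

With this sketch, numeric("12 34 56 7") gives "1234567" and telephone("+1 408 555-1234") gives "+14085551234".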

Printable Strings :

This is much more complex. We must remove spaces at the beginning and at the end of the string, and we also have to replace each run of consecutive spaces inside the String with a single space, except for a space which is followed by a combining mark character.

The following finite state machine is implemented to cover all those special cases :

The (S) box represents the start of the PrintableString. The black (e) box represents the end of this String. ' ' represents a space, '©' represents a combining mark, and 'c' a simple Unicode character.

Actions :

From	To	char	Action
S	1	' '	Switch to state 1
S	2	any	add the char to target; Switch to state 2
1	1	' '	-
1	2	any	add the char to target; Switch to state 2
1	3	'©'	add a space and the char to target; Switch to state 3
1	e	end	end
2	2	any	add the char to target
2	3	'©'	add the char to target; Switch to state 3
2	4	' '	Switch to state 4
2	e	end	end
3	2	any	add the char to target; Switch to state 2
3	3	'©'	add the char to target
3	4	' '	Switch to state 4
3	e	end	end
4	2	any	add a space and the char to target; Switch to state 2
4	3	'©'	add a space and the char to target; Switch to state 3
4	e	end	end

The state engine is implemented in this form :

remove all spaces at the beginning of the string, except if there is a combining mark after a space
then remove the spaces at the end of the string
if there are some non space characters, process them as if they were a list of chars and spaces : (char* space*)*, where spaces are replaced by a single space
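
The transitions of the Actions table can be sketched directly as a small state machine. This is an illustration, not the ADS code: the names PrintableSpacesSketch, normalize and isCombining are hypothetical, combining marks are detected through the general categories Mn, Mc and Me of the JVM, and since the table has no row for a space read in state 4, the sketch assumes the machine stays in state 4 in that case.

```java
// Hypothetical sketch of the state machine described in the Actions table.
final class PrintableSpacesSketch {

    // START = S, LEADING = 1, CHAR = 2, COMBINING = 3, SPACES = 4
    private enum State { START, LEADING, CHAR, COMBINING, SPACES }

    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    static String normalize(String value) {
        StringBuilder target = new StringBuilder(value.length());
        State state = State.START;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean space = (c == ' ');
            State next = isCombining(c) ? State.COMBINING : State.CHAR;
            switch (state) {
                case START:
                    if (space) {
                        state = State.LEADING;
                    } else {
                        target.append(c);
                        state = State.CHAR;       // S -> 2 for any non-space char
                    }
                    break;
                case LEADING:
                    if (!space) {
                        if (next == State.COMBINING) {
                            target.append(' ');   // keep one space before the mark
                        }
                        target.append(c);
                        state = next;
                    }
                    break;
                case CHAR:
                case COMBINING:
                    if (space) {
                        state = State.SPACES;
                    } else {
                        target.append(c);
                        state = next;
                    }
                    break;
                case SPACES:
                    if (!space) {                 // assumption: a space keeps us in state 4
                        target.append(' ').append(c);
                        state = next;
                    }
                    break;
            }
        }
        // Reaching the end in LEADING or SPACES drops the pending spaces.
        return target.toString();
    }
}
```

Pending spaces are only written when a following char arrives, which is why trailing spaces vanish without any special handling at the end of the input.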
