Introduction
Normalizers are used to normalize attribute values before we store them in the backend, when they belong to an indexed attribute, or when they are used in a DN.
The reason we normalize those values is that we need to compare the request value with the stored value when searching, adding, comparing or deleting them. If we did that using the user-provided value, we might store duplicate values in the backend, or we might not find some entries.
Normalizing a value is done with respect to its specific matching rule, as described in RFC 4517, section 4.2 and following.
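The duplicate-value problem can be illustrated with a minimal sketch. The `normalize` method below is a hypothetical, heavily simplified stand-in for a case-ignore normalizer (trim, collapse spaces, lowercase); it is not the ADS implementation.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class WhyNormalize {
    // Hypothetical simplified normalizer: trim, collapse runs of spaces, lowercase.
    static String normalize(String value) {
        return value.trim().replaceAll(" +", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] userValues = { "John Smith", "john  smith ", "JOHN SMITH" };

        Set<String> raw = new LinkedHashSet<>();
        Set<String> normalized = new LinkedHashSet<>();
        for (String v : userValues) {
            raw.add(v);                   // stored as provided by the user
            normalized.add(normalize(v)); // stored after normalization
        }

        System.out.println(raw.size());        // 3: the same name is stored three times
        System.out.println(normalized.size()); // 1: all three collapse to "john smith"
    }
}
```

Without normalization, a search for "john smith" would also miss the two other spellings.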
List of normalizers
Relation between MatchingRules and Normalizers
The following table lists all the matching rules used in ADS, with the normalizer associated with each of them:
| MatchingRule | Normalizer |
|---|---|
| bitStringMatch | NoOp |
| booleanMatch | DeepTrimToLowerCase |
| caseExactIA5Match | PreparedIA5String |
| caseExactMatch | PreparedString |
| caseExactOrderingMatch | PreparedString |
| caseExactSubstringsMatch | PreparedString |
| caseIgnoreIA5Match | PreparedToLowerIA5String |
| caseIgnoreIA5SubstringsMatch | PreparedToLowerIA5String |
| caseIgnoreListMatch | PreparedToLowerString |
| caseIgnoreListSubstringsMatch | PreparedToLowerString |
| caseIgnoreMatch | PreparedToLowerString |
| caseIgnoreOrderingMatch | PreparedToLowerString |
| caseIgnoreSubstringsMatch | PreparedToLowerString |
| directoryStringFirstComponentMatch | DN |
| distinguishedNameMatch | DN |
| generalizedTimeMatch | GeneralizedTime |
| generalizedTimeOrderingMatch | GeneralizedTime |
| integerFirstComponentMatch | PreparedString |
| integerMatch | PreparedString |
| integerOrderingMatch | PreparedString |
| keywordMatch | Keyword |
| numericStringMatch | Numeric |
| numericStringOrderingMatch | Numeric |
| numericStringSubstringsMatch | Numeric |
| objectIdentifierFirstComponentMatch | NoOp |
| objectIdentifierMatch | NoOp |
| octetStringMatch | NoOp |
| octetStringOrderingMatch | NoOp |
| presentationAddressMatch | DeepTrimPreparedToLowerString |
| protocolInformationMatch | Protocol |
| telephoneNumberMatch | Telephone |
| telephoneNumberSubstringsMatch | Telephone |
| uniqueMemberMatch | DN |
| wordMatch | Word |
Normalizers
The following table lists the Normalizers we have to implement:
| Normalizer | Description | Implemented | Class |
|---|---|---|---|
| DeepTrim | Remove spaces at the beginning and at the end of the value | | DeepTrimNormalizer |
| DeepTrimPreparedToLowerString | - | | DeepTrimToLowerNormalizer |
| DeepTrimToLowerCase | Remove spaces at the beginning and at the end, and lowercase | | DeepTrimToLowerNormalizer |
| DN | - | | - |
| GeneralizedTime | - | | - |
| Keyword | - | | - |
| NoOp | Do nothing. | | NoOpNormalizer |
| Numeric | Prepare the numeric string according to RFC 4518 | | NumericNormalizer |
| PreparedIA5String | Prepare the IA5 string according to RFC 4518 | | DeepTrimNormalizer |
| PreparedString | Prepare the string according to RFC 4518 | | DeepTrimNormalizer |
| PreparedToLowerIA5String | - | | DeepTrimToLowerNormalizer |
| PreparedToLowerString | - | | DeepTrimToLowerNormalizer |
| Protocol | - | | - |
| Telephone | Prepare the telephone number string according to RFC 4518 | | TelephoneNumberNormalizer |
| Word | - | | - |
String 'preparation'
When we have to normalize a String in ADS, we have to prepare it using the following six steps, as described in RFC 4518:
- Transcode
- Map
- Normalize
- Prohibit
- Check Bidi (Bidirectional)
- Insignificant Character Handling
This preparation is necessary to be able to compare assertion values with attribute values stored in the backend. Sadly, RFC 4518 was written with Unicode in mind, and we are supposed to support the full set of Unicode characters. This is obviously very difficult and will imply a lot of changes in the server, as we support 16-bit chars only. Normalizing, for instance, is supported by Java 6 but not by Java 5.
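The six steps can be sketched as a chain of String transformations. This is only an illustration of the ordering, assuming a recent Java version: the class name `PreparePipeline` and every step body are hypothetical and deliberately reduced to a single rule each, and the Bidi step is left as a no-op.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.List;
import java.util.function.UnaryOperator;

final class PreparePipeline {
    // Steps 2 to 6, in the order given by RFC 4518. Each body is a placeholder.
    static final List<UnaryOperator<String>> STEPS = List.of(
        s -> s.replace('\t', ' '),                          // Map (only TAB -> SPACE here)
        s -> Normalizer.normalize(s, Normalizer.Form.NFKC), // Normalize
        PreparePipeline::prohibit,                          // Prohibit
        s -> s,                                             // Check Bidi (no-op in this sketch)
        s -> s.trim().replaceAll(" +", " ")                 // Insignificant characters (simplified)
    );

    // Rejects the REPLACEMENT CHARACTER only; the real list is much longer.
    static String prohibit(String s) {
        if (s.indexOf('\uFFFD') >= 0) {
            throw new IllegalArgumentException("prohibited character U+FFFD");
        }
        return s;
    }

    static String prepare(byte[] utf8Value) {
        // Step 1, Transcode: decode the UTF-8 bytes into a Java String.
        String s = new String(utf8Value, StandardCharsets.UTF_8);
        for (UnaryOperator<String> step : STEPS) {
            s = step.apply(s);
        }
        return s;
    }

    public static void main(String[] args) {
        byte[] value = "  a\tb  ".getBytes(StandardCharsets.UTF_8);
        System.out.println("[" + prepare(value) + "]"); // [a b]
    }
}
```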
Transcoding
We don't have to transcode the values in ADS, as every string is already transcoded to UTF-8. The RFC specifies that we should transcode all PrintableStrings to Unicode strings, and as UTF-8 is one way to encode a Unicode string, this is fine.
Mapping
Mapping is the action of transforming some Unicode characters into another character, or deleting them. The following table shows all the needed transformations:
| Unicode | UTF-8 bytes | Transformation |
|---|---|---|
| U+0000-0008 | [00-08] | deleted |
| U+0009 (CHARACTER TABULATION) | 09 | U+0020 (SPACE) |
| U+000A (LINE FEED) | 0A | U+0020 (SPACE) |
| U+000B (LINE TABULATION) | 0B | U+0020 (SPACE) |
| U+000C (FORM FEED) | 0C | U+0020 (SPACE) |
| U+000D (CARRIAGE RETURN) | 0D | U+0020 (SPACE) |
| U+000E-001F | [0E-1F] | deleted |
| U+007F-0084 | 7F, C2 [80-84] | deleted |
| U+0085 (NEXT LINE) | C2 85 | U+0020 (SPACE) |
| U+0086-009F | C2 [86-9F] | deleted |
| U+00A0 | C2 A0 | U+0020 (SPACE) |
| U+00AD (SOFT HYPHEN) | C2 AD | deleted |
| U+034F (COMBINING GRAPHEME JOINER) | CD 8F | deleted |
| U+06DD | DB 9D | deleted |
| U+070F | DC 8F | deleted |
| U+1680 | E1 9A 80 | U+0020 (SPACE) |
| U+1806 (MONGOLIAN TODO SOFT HYPHEN) | E1 A0 86 | deleted |
| U+180B-180D (VARIATION SELECTOR) | E1 A0 [8B-8D] | deleted |
| U+180E | E1 A0 8E | deleted |
| U+2000-200A | E2 80 [80-8A] | U+0020 (SPACE) |
| U+200B (ZERO WIDTH SPACE) | E2 80 8B | deleted |
| U+200C-200F | E2 80 [8C-8F] | deleted |
| U+2028-2029 | E2 80 [A8-A9] | U+0020 (SPACE) |
| U+202A-202E | E2 80 [AA-AE] | deleted |
| U+202F | E2 80 AF | U+0020 (SPACE) |
| U+205F | E2 81 9F | U+0020 (SPACE) |
| U+2060-2063 | E2 81 [A0-A3] | deleted |
| U+206A-206F | E2 81 [AA-AF] | deleted |
| U+3000 | E3 80 80 | U+0020 (SPACE) |
| U+FE00-FE0F (VARIATION SELECTOR) | EF B8 [80-8F] | deleted |
| U+FEFF | EF BB BF | deleted |
| U+FFF9-FFFB | EF BF [B9-BB] | deleted |
| U+FFFC (OBJECT REPLACEMENT CHARACTER) | EF BF BC | deleted |
| U+1D173-1D17A | F0 9D 85 [B3-BA] | not handled |
| U+E0001 | F3 A0 80 81 | not handled |
| U+E0020-E007F | F3 A0 80 [A0-BF] - F3 A0 81 [80-BF] | not handled |
Characters above 0xFFFF are not handled in the current versions of ADS (1.0.x and 1.5.x).
We also have to lowercase all the uppercase chars, including those whose lowercase form is more than one character.
The Prepare Map, Lowercasing page contains all the mappings to apply.
This mapping is implemented in the class PrepareString.
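As an illustration of the space/delete part of this step, here is a minimal sketch restricted to the 16-bit rows of the table. The class `MapStep` and its methods are hypothetical names and do not reproduce the PrepareString class. Lowercasing is left out, and chars belonging to surrogate pairs pass through untouched.

```java
final class MapStep {
    // Code points the table maps to U+0020 (SPACE).
    static boolean mapsToSpace(char c) {
        return (c >= 0x0009 && c <= 0x000D) || c == 0x0085 || c == 0x00A0
            || c == 0x1680 || (c >= 0x2000 && c <= 0x200A)
            || c == 0x2028 || c == 0x2029 || c == 0x202F
            || c == 0x205F || c == 0x3000;
    }

    // Code points the table marks as deleted (adjacent rows are merged into ranges).
    static boolean isDeleted(char c) {
        return c <= 0x0008 || (c >= 0x000E && c <= 0x001F)
            || (c >= 0x007F && c <= 0x0084) || (c >= 0x0086 && c <= 0x009F)
            || c == 0x00AD || c == 0x034F || c == 0x06DD || c == 0x070F
            || c == 0x1806 || (c >= 0x180B && c <= 0x180E)
            || (c >= 0x200B && c <= 0x200F) || (c >= 0x202A && c <= 0x202E)
            || (c >= 0x2060 && c <= 0x2063) || (c >= 0x206A && c <= 0x206F)
            || (c >= 0xFE00 && c <= 0xFE0F) || c == 0xFEFF
            || (c >= 0xFFF9 && c <= 0xFFFC);
    }

    static String map(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (mapsToSpace(c)) {
                out.append(' ');
            } else if (!isDeleted(c)) {
                out.append(c); // everything else is copied unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // TAB -> space, SOFT HYPHEN and ZERO WIDTH SPACE deleted, NO-BREAK SPACE -> space
        System.out.println(map("a\tb\u00ADc\u200Bd\u00A0e")); // a bcd e
    }
}
```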
Normalizing
The normalization step is the process of transforming complex characters into several simpler ones: for instance, 'Schön' is transformed to 'Scho\u0308n', in order to ease comparisons. Sadly, Java 4 does not have the tools to do this normalization (Java 6 has them). The question is: why do we have to normalize?
The current version (1.5) of ADS does not support Unicode chars above 0xFFFF (they are encoded as 2 successive chars, using surrogates). The Normalizing step is not applied.
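For reference, the Java 6 tool mentioned above is `java.text.Normalizer`. The sketch below shows the decomposition used in the 'Schön' example, next to form KC (compatibility composed), which is the form RFC 4518 names for this step. The class and method names are hypothetical; only `Normalizer.normalize` and `Normalizer.Form` come from the JDK.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    // Canonical decomposition: 'ö' becomes 'o' followed by U+0308.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility composition, the form named by RFC 4518 for this step.
    static String toFormKC(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String composed = "Sch\u00F6n";
        String decomposed = decompose(composed);

        System.out.println(decomposed.equals("Scho\u0308n"));       // true
        System.out.println(decomposed.length());                    // 6
        System.out.println(toFormKC(decomposed).equals(composed));  // true
    }
}
```

Whichever form is chosen, the point is that both the stored value and the assertion value go through the same one, so 'ö' typed as one character and as 'o' plus a combining diaeresis compare equal.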
Prohibiting
We have to reject some characters. The prohibit method checks each character and throws an exception as soon as an invalid one is met.
The prohibited characters are listed in tables A.1, C.3, C.4, C.5 and C.8 of RFC 3454. The character U+FFFD is also prohibited.
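A minimal sketch of such a check follows. It is an assumption-laden subset: only U+FFFD and a few 16-bit ranges taken from tables C.3, C.4 and C.8 are tested, table A.1 (unassigned code points) is not covered at all, and the class name and exception type are hypothetical.

```java
final class ProhibitStep {
    // Subset of the prohibited code points, limited to 16-bit chars.
    static boolean isProhibited(char c) {
        return c == 0xFFFD                        // REPLACEMENT CHARACTER
            || (c >= 0xE000 && c <= 0xF8FF)       // private use (table C.3, 16-bit part)
            || (c >= 0xFDD0 && c <= 0xFDEF)       // non-characters (table C.4)
            || c == 0xFFFE || c == 0xFFFF         // non-characters (table C.4)
            || c == 0x0340 || c == 0x0341;        // deprecated tone marks (table C.8)
    }

    // Throws on the first prohibited character; returns normally otherwise.
    static void prohibit(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isProhibited(c)) {
                throw new IllegalArgumentException(
                    String.format("prohibited character U+%04X at index %d", (int) c, i));
            }
        }
    }

    public static void main(String[] args) {
        prohibit("abc"); // passes silently
        try {
            prohibit("a\uFFFDb");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prohibited character U+FFFD at index 1
        }
    }
}
```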
Checking Bidi
Bidirectional characters are ignored, and thus eliminated from the String. Those bidirectional characters are listed in tables D.1 and D.2 of RFC 3454.
Insignificant character handling
We have three different kinds of Strings to handle:
- Numeric Strings
- Telephone Numbers
- Printable Strings with case ignore and exact string matching
Numeric Strings :
All the spaces will be removed. For instance, "12 34 56 7" is transformed to "1234567"
Telephone Number Strings :
All spaces and hyphens are deleted: "(33)1 12 340-56-78" is transformed to "(33)1123405678"
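Both rules are simple enough to sketch in a few lines. The class and method names below are hypothetical, and only the ASCII space and hyphen are handled.

```java
final class SimpleStrings {
    // Numeric Strings: every space is insignificant.
    static String normalizeNumeric(String value) {
        return value.replace(" ", "");
    }

    // Telephone Numbers: every space and every hyphen is insignificant.
    static String normalizeTelephone(String value) {
        return value.replace(" ", "").replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(normalizeNumeric("12 34 56 7"));        // 1234567
        System.out.println(normalizeTelephone("+1 408 555-1234")); // +14085551234
    }
}
```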
Printable Strings :
This is much more complex. We must remove the spaces at the beginning and at the end of the string, and we also have to replace every run of consecutive spaces inside the string with a single space. The exception is a space followed by a combining mark character, which must be kept.
The following finite state machine is implemented to cover all those special cases:
The (S) box represents the start of the PrintableString. The black (e) box represents the end of this String. ' ' represents a space, '©' represents a combining mark, and 'c' a simple Unicode character.
Actions:
| From | To | char | Action |
|---|---|---|---|
| S | 1 | ' ' | Switch to state 1 |
| S | 2 | any | add the char to target; Switch to state 2 |
| 1 | 1 | ' ' | - |
| 1 | 2 | any | add the char to target; Switch to state 2 |
| 1 | 3 | '©' | add a space and the char to target; Switch to state 3 |
| 1 | e | end | end |
| 2 | 2 | any | add the char to target |
| 2 | 3 | '©' | add the char to target; Switch to state 3 |
| 2 | 4 | ' ' | Switch to state 4 |
| 2 | e | end | end |
| 3 | 2 | any | add the char to target; Switch to state 2 |
| 3 | 3 | '©' | add the char to target |
| 3 | 4 | ' ' | Switch to state 4 |
| 3 | e | end | end |
| 4 | 2 | any | add a space and the char to target; Switch to state 2 |
| 4 | 3 | '©' | add a space and the char to target; Switch to state 3 |
| 4 | e | end | end |
The state engine is implemented in this form:
- remove all spaces at the beginning of the string, except a space that is followed by a combining mark
- then remove the spaces at the end of the string
- if some non-space characters remain, process them as a sequence of characters and spaces, (char* space*)*, where each run of spaces is replaced by a single space
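The transition table above can be transcribed almost literally into a small state machine. The sketch below is one possible reading, not the ADS code: the class, enum and method names are hypothetical, combining marks are detected with `Character.getType`, and a space met in state 4 is assumed to leave the machine in state 4 (the table does not list that transition).

```java
final class InsignificantSpaces {
    // START = S, LEADING_SPACES = 1, CHAR = 2, COMBINING = 3, PENDING_SPACE = 4
    private enum State { START, LEADING_SPACES, CHAR, COMBINING, PENDING_SPACE }

    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    static String normalize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        State state = State.START;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean space = c == ' ';
            boolean mark = isCombining(c);
            switch (state) {
                case START:
                    if (space) { state = State.LEADING_SPACES; }
                    else { out.append(c); state = State.CHAR; }
                    break;
                case LEADING_SPACES:
                    if (space) { break; }  // leading spaces are dropped...
                    if (mark) { out.append(' ').append(c); state = State.COMBINING; } // ...unless a mark follows
                    else { out.append(c); state = State.CHAR; }
                    break;
                case CHAR:
                case COMBINING:
                    if (space) { state = State.PENDING_SPACE; }
                    else { out.append(c); state = mark ? State.COMBINING : State.CHAR; }
                    break;
                case PENDING_SPACE:
                    if (space) { break; }  // assumed: further spaces are collapsed
                    out.append(' ').append(c);
                    state = mark ? State.COMBINING : State.CHAR;
                    break;
            }
        }
        // Reaching the end in PENDING_SPACE drops the trailing spaces.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("  ab   cd  ") + "]"); // [ab cd]
    }
}
```

States 2 and 3 share the same transitions in the table, which is why they share a single `case` body here.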