Introduction

Normalizers are used to normalize attribute values before we store them in the backend, when they are part of an indexed attribute, or when they are used in a DN.

The reason we normalize those values is that we need to compare the request value with the stored value when searching, adding, comparing or deleting them. If we did that using the user-provided value, we might end up with duplicated values in the backend, or we might not find some entries.

A value is normalized with respect to its specific matching rule, as described in RFC 4517, section 4.2 and following.

List of normalizers

Relation between MatchingRules and Normalizers

The following table lists all the matching rules used in ADS, with the normalizer each one relies on:

| MatchingRule | Normalizer |
|---|---|
| bitStringMatch | NoOp |
| booleanMatch | DeepTrimToLowerCase |
| caseExactIA5Match | PreparedIA5String |
| caseExactMatch | PreparedString |
| caseExactOrderingMatch | PreparedString |
| caseExactSubstringsMatch | PreparedString |
| caseIgnoreIA5Match | PreparedToLowerIA5String |
| caseIgnoreIA5SubstringsMatch | PreparedToLowerIA5String |
| caseIgnoreListMatch | PreparedToLowerString |
| caseIgnoreListSubstringsMatch | PreparedToLowerString |
| caseIgnoreMatch | PreparedToLowerString |
| caseIgnoreOrderingMatch | PreparedToLowerString |
| caseIgnoreSubstringsMatch | PreparedToLowerString |
| directoryStringFirstComponentMatch | DN |
| distinguishedNameMatch | DN |
| generalizedTimeMatch | GeneralizedTime |
| generalizedTimeOrderingMatch | GeneralizedTime |
| integerFirstComponentMatch | PreparedString |
| integerMatch | PreparedString |
| integerOrderingMatch | PreparedString |
| keywordMatch | Keyword |
| numericStringMatch | Numeric |
| numericStringOrderingMatch | Numeric |
| numericStringSubstringsMatch | Numeric |
| objectIdentifierFirstComponentMatch | NoOp |
| objectIdentifierMatch | NoOp |
| octetStringMatch | NoOp |
| octetStringOrderingMatch | NoOp |
| presentationAddressMatch | DeepTrimPreparedToLowerString |
| protocolInformationMatch | Protocol |
| telephoneNumberMatch | Telephone |
| telephoneNumberSubstringsMatch | Telephone |
| uniqueMemberMatch | DN |
| wordMatch | Word |
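To illustrate how such a relation could be used at runtime, here is a minimal sketch of a lookup from a matching rule name to the name of its normalizer. The class `MatchingRuleNormalizers` and its methods are hypothetical and are not part of the ADS code base; only a few rows of the table are registered, and names are compared case-insensitively as an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry, not an ADS class: maps a matching rule name
// to the normalizer name listed in the table above.
public final class MatchingRuleNormalizers {
    private static final Map<String, String> BY_RULE = new HashMap<String, String>();

    static {
        // Only an excerpt of the table is registered in this sketch.
        register("bitStringMatch", "NoOp");
        register("caseExactMatch", "PreparedString");
        register("caseIgnoreMatch", "PreparedToLowerString");
        register("numericStringMatch", "Numeric");
        register("telephoneNumberMatch", "Telephone");
        register("distinguishedNameMatch", "DN");
    }

    private static void register(String rule, String normalizer) {
        // Assumption: matching rule names are looked up case-insensitively.
        BY_RULE.put(rule.toLowerCase(Locale.ROOT), normalizer);
    }

    // Returns the normalizer name, or null when the rule is not registered.
    public static String normalizerFor(String matchingRule) {
        return BY_RULE.get(matchingRule.toLowerCase(Locale.ROOT));
    }
}
```

With this sketch, `MatchingRuleNormalizers.normalizerFor("caseIgnoreMatch")` returns `"PreparedToLowerString"`, and an unregistered rule returns `null`.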

Normalizers

The following table lists the Normalizers we have to implement:

| Normalizer | Description | Implemented | Class |
|---|---|---|---|
| DeepTrim | Remove spaces at the beginning and at the end of the value; replace all consecutive spaces by a single space inside the value | yes | DeepTrimNormalizer |
| DeepTrimPreparedToLowerString | - | yes | DeepTrimToLowerNormalizer |
| DeepTrimToLowerCase | Remove spaces at the beginning and at the end, and lowercase the value | yes | DeepTrimToLowerNormalizer |
| DN | - | no | - |
| GeneralizedTime | - | no | - |
| Keyword | - | no | - |
| NoOp | Do nothing | yes | NoOpNormalizer |
| Numeric | Prepare the numeric string according to RFC 4518 | yes | NumericNormalizer |
| PreparedIA5String | Prepare the IA5 string according to RFC 4518 | yes | DeepTrimNormalizer |
| PreparedString | Prepare the string according to RFC 4518; the string must be a DirectoryString, such as PrintableString | yes | DeepTrimNormalizer |
| PreparedToLowerIA5String | - | yes | DeepTrimToLowerNormalizer |
| PreparedToLowerString | - | yes | DeepTrimToLowerNormalizer |
| Protocol | - | no | - |
| Telephone | Prepare the telephone number string according to RFC 4518 | yes | TelephoneNumberNormalizer |
| Word | - | no | - |
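The DeepTrim and DeepTrimToLowerCase descriptions above can be sketched as follows. This is only an illustration of the described behaviour, not the code of `DeepTrimNormalizer` or `DeepTrimToLowerNormalizer`; the class name `DeepTrimSketch` is hypothetical, and only the plain space character (U+0020) is treated as a space.

```java
import java.util.Locale;

// Hypothetical sketch of the DeepTrim behaviour described in the table.
public final class DeepTrimSketch {
    // Remove leading and trailing spaces, then collapse inner runs of
    // spaces into a single space.
    public static String deepTrim(String value) {
        return value.replaceAll("^ +| +$", "").replaceAll(" {2,}", " ");
    }

    // Same as deepTrim, followed by lowercasing.
    // Assumption: Locale.ROOT lowercasing is good enough for the sketch;
    // it is not the full RFC 4518 case folding.
    public static String deepTrimToLower(String value) {
        return deepTrim(value).toLowerCase(Locale.ROOT);
    }
}
```

For instance, `DeepTrimSketch.deepTrimToLower("  John   SMITH ")` gives `"john smith"`, so two user inputs that differ only by spacing and case compare equal once normalized.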

String 'preparation'

When we have to normalize a String in ADS, we have to prepare it using the following six steps, as described in RFC 4518:

  1. Transcode
  2. Map
  3. Normalize
  4. Prohibit
  5. Check Bidi (Bidirectional)
  6. Insignificant Character Handling

This preparation is necessary to be able to compare assertion values with the attribute values stored in the backend. Sadly, RFC 4518 has been written with Unicode in mind, and we are supposed to support the full set of Unicode characters. This is obviously very difficult, and will imply a lot of changes in the server, as we support 16-bit chars only. Normalizing, for instance, is supported by Java 6, but not by Java 5.
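The six steps are applied one after the other, each one working on the output of the previous one. A minimal sketch of such a chain is given below; the class `PreparePipeline` is hypothetical, the steps are passed in by the caller, and the code assumes Java 8 or later for `UnaryOperator`, which is newer than the Java versions discussed on this page.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: runs the preparation steps in order.
public final class PreparePipeline {
    // Each step takes the current string and returns the transformed one.
    // A step that rejects its input (for instance Prohibit) is expected
    // to throw an unchecked exception.
    public static String prepare(String value, List<UnaryOperator<String>> steps) {
        String result = value;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }
}
```

The caller would register the steps in the order Transcode, Map, Normalize, Prohibit, Check Bidi, Insignificant Character Handling, leaving out the ones that are not applied.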

Transcoding

We won't have to transcode the values in ADS, as every string is already transcoded to UTF-8. The RFC specifies that we should transcode all the PrintableStrings to Unicode strings, and as UTF-8 is a way to encode a Unicode string, this is fine.
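As an illustration, going from UTF-8 bytes to a Java String (and back) needs nothing more than the standard charset support. The class name `Utf8Sketch` is hypothetical, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the UTF-8 <-> String conversion.
public final class Utf8Sketch {
    // Decode UTF-8 bytes received on the wire into a Java String.
    public static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode a Java String back to UTF-8 bytes.
    public static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the two bytes C3 B6 decode to the single character U+00F6 ('ö').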

Mapping

Mapping is the action of transforming some Unicode characters into other ones, or deleting them. The following table shows all the needed transformations:

| Unicode | UTF-8 bytes | Transformation |
|---|---|---|
| U+0000-0008 | [00-08] | deleted |
| U+0009 (CHARACTER TABULATION) | 09 | U+0020 (SPACE) |
| U+000A (LINE FEED) | 0A | U+0020 (SPACE) |
| U+000B (LINE TABULATION) | 0B | U+0020 (SPACE) |
| U+000C (FORM FEED) | 0C | U+0020 (SPACE) |
| U+000D (CARRIAGE RETURN) | 0D | U+0020 (SPACE) |
| U+000E-001F | [0E-1F] | deleted |
| U+007F-0084 | 7F, C2 [80-84] | deleted |
| U+0085 (NEXT LINE) | C2 85 | U+0020 (SPACE) |
| U+0086-009F | C2 [86-9F] | deleted |
| U+00A0 | C2 A0 | U+0020 (SPACE) |
| U+00AD (SOFT HYPHEN) | C2 AD | deleted |
| U+034F (COMBINING GRAPHEME JOINER) | CD 8F | deleted |
| U+06DD | DB 9D | deleted |
| U+070F | DC 8F | deleted |
| U+1680 | E1 9A 80 | U+0020 (SPACE) |
| U+1806 (MONGOLIAN TODO SOFT HYPHEN) | E1 A0 86 | deleted |
| U+180B-180D (VARIATION SELECTOR) | E1 A0 [8B-8D] | deleted |
| U+180E | E1 A0 8E | deleted |
| U+2000-200A | E2 80 [80-8A] | U+0020 (SPACE) |
| U+200B (ZERO WIDTH SPACE) | E2 80 8B | deleted |
| U+200C-200F | E2 80 [8C-8F] | deleted |
| U+2028-2029 | E2 80 [A8-A9] | U+0020 (SPACE) |
| U+202A-202E | E2 80 [AA-AE] | deleted |
| U+202F | E2 80 AF | U+0020 (SPACE) |
| U+205F | E2 81 9F | U+0020 (SPACE) |
| U+2060-2063 | E2 81 [A0-A3] | deleted |
| U+206A-206F | E2 81 [AA-AF] | deleted |
| U+3000 | E3 80 80 | U+0020 (SPACE) |
| U+FE00-FE0F (VARIATION SELECTOR) | EF B8 [80-8F] | deleted |
| U+FEFF | EF BB BF | deleted |
| U+FFF9-FFFB | EF BF [B9-BB] | deleted |
| U+FFFC (OBJECT REPLACEMENT CHARACTER) | EF BF BC | deleted |
| U+1D173-1D17A | F0 9D 85 [B3-BA] | not handled |
| U+E0001 | F3 A0 80 81 | not handled |
| U+E0020-E007F | F3 A0 80 [A0-BF] - F3 A0 81 [80-BF] | not handled |

Characters above 0xFFFF are not handled in the current versions of ADS (1.0.x and 1.5.x)
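The table above translates quite directly into code when working on Java chars rather than on UTF-8 bytes. The following is a minimal sketch under that assumption; the class `MapSketch` is hypothetical and is not the ADS `PrepareString` class. Like the current ADS versions, it leaves characters above 0xFFFF untouched, and it does not do any lowercasing.

```java
// Hypothetical sketch of the Map step for 16-bit chars only.
public final class MapSketch {
    // Code points the table maps to U+0020 (SPACE).
    private static boolean mapsToSpace(char c) {
        return (c >= 0x0009 && c <= 0x000D) || c == 0x0085 || c == 0x00A0
            || c == 0x1680 || (c >= 0x2000 && c <= 0x200A)
            || c == 0x2028 || c == 0x2029 || c == 0x202F
            || c == 0x205F || c == 0x3000;
    }

    // Code points the table marks as deleted.
    private static boolean isDeleted(char c) {
        return c <= 0x0008 || (c >= 0x000E && c <= 0x001F)
            || (c >= 0x007F && c <= 0x0084) || (c >= 0x0086 && c <= 0x009F)
            || c == 0x00AD || c == 0x034F || c == 0x06DD || c == 0x070F
            || c == 0x1806 || (c >= 0x180B && c <= 0x180E)
            || (c >= 0x200B && c <= 0x200F) || (c >= 0x202A && c <= 0x202E)
            || (c >= 0x2060 && c <= 0x2063) || (c >= 0x206A && c <= 0x206F)
            || (c >= 0xFE00 && c <= 0xFE0F) || c == 0xFEFF
            || (c >= 0xFFF9 && c <= 0xFFFC);
    }

    public static String map(String value) {
        StringBuilder target = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (mapsToSpace(c)) {
                target.append(' ');
            } else if (!isDeleted(c)) {
                target.append(c); // every other char is kept as is
            }
        }
        return target.toString();
    }
}
```

With this sketch, a tabulation becomes a space while a soft hyphen or a zero width space simply disappears from the result.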

We also have to lowercase all the uppercase characters, including those whose lowercase form is made of more than one character.

The Prepare Map, Lowercasing page contains all the mappings to apply.

This mapping is implemented in the class PrepareString.

Normalizing

The normalization step is the process of transforming complex characters into several simpler ones in order to ease comparisons: for instance, 'Schön' is transformed to 'Scho\u0308n'. Sadly, Java 4 does not have the tools to do this normalization (Java 6 has them). The question is: why do we have to normalize?

The current version (1.5) of ADS does not support Unicode characters above 0xFFFF (they are encoded as two successive chars, using surrogates). The normalizing step is not applied.
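For reference, on Java 6 or later the decomposition shown in the 'Schön' example can be obtained with `java.text.Normalizer`. The sketch below is not what ADS does (the step is not applied), and the class name `NormalizeSketch` is hypothetical. It uses the canonical decomposed form (NFD) only because that matches the example above; RFC 4518 itself names form KC, so the form to use would have to be checked against the RFC before relying on this.

```java
import java.text.Normalizer;

// Hypothetical sketch: canonical decomposition with the JDK normalizer.
public final class NormalizeSketch {
    // Splits composed characters into a base character followed by
    // its combining marks (Unicode form NFD).
    public static String decompose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD);
    }
}
```

Here `NormalizeSketch.decompose("Sch\u00F6n")` returns the six-char string `"Scho\u0308n"`.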

 

Prohibiting

Some characters must be rejected. The prohibit method checks every character, and throws an exception as soon as a prohibited one is met.

The prohibited characters are listed in tables A.1, C.3, C.4, C.5 and C.8 of RFC 3454. The character U+FFFD is also prohibited.
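A rough sketch of such a check is shown below. It approximates the RFC 3454 tables with the character categories known to the JVM (unassigned, private use, surrogate) plus the BMP non-characters and U+FFFD; table C.8 is left out, and the set of unassigned code points depends on the Unicode version of the running JVM. The class `ProhibitSketch` and the choice of `IllegalArgumentException` are assumptions, not the ADS implementation.

```java
// Hypothetical sketch of the Prohibit step for 16-bit chars.
public final class ProhibitSketch {
    private static boolean isProhibited(char c) {
        int type = Character.getType(c);
        // Non-characters of the BMP (approximation of table C.4).
        boolean nonCharacter = (c >= 0xFDD0 && c <= 0xFDEF) || c == 0xFFFE || c == 0xFFFF;
        return c == 0xFFFD                        // REPLACEMENT CHARACTER
            || type == Character.UNASSIGNED       // approximation of table A.1
            || type == Character.PRIVATE_USE      // approximation of table C.3
            || type == Character.SURROGATE        // approximation of table C.5
            || nonCharacter;
    }

    // Throws as soon as a prohibited character is met.
    public static void check(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isProhibited(c)) {
                throw new IllegalArgumentException(
                    "Prohibited character U+" + Integer.toHexString(c).toUpperCase()
                    + " at position " + i);
            }
        }
    }
}
```

A value such as "abc" passes silently, while a value containing U+FFFD or a private use character such as U+E000 is rejected.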

Checking Bidi

Bidirectional characters are ignored, and thus eliminated from the String. Those bidirectional characters are listed in tables D.1 and D.2 of RFC 3454.

Insignificant character handling

We have three different kinds of Strings to handle:

  • Numeric Strings
  • Telephone Numbers
  • Printable Strings with case ignore and exact string matching

Numeric Strings :

All the spaces will be removed. For instance, "12 34  56    7" is transformed to "1234567"

Telephone Number Strings :

All spaces and hyphens are deleted: "(33)1   12  340-56-78" is transformed to "(33)1123405678"
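Both rules are simple enough to be sketched in a few lines. The class `InsignificantCharsSketch` is hypothetical and is not the ADS `NumericNormalizer` or `TelephoneNumberNormalizer`; it only handles the plain space (U+0020) and the ASCII hyphen (U+002D), which is an assumption made to keep the example short.

```java
// Hypothetical sketch of the numeric and telephone number rules.
public final class InsignificantCharsSketch {
    // Numeric strings: every space is removed.
    public static String numeric(String value) {
        return value.replace(" ", "");
    }

    // Telephone numbers: every space and every hyphen is removed.
    public static String telephone(String value) {
        return value.replace(" ", "").replace("-", "");
    }
}
```

For instance, `InsignificantCharsSketch.numeric("12 34  56    7")` returns `"1234567"`, and `InsignificantCharsSketch.telephone("555 12-34")` returns `"5551234"`.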

Printable Strings :

This is much more complex. We must remove the spaces at the beginning and at the end of the string, and we also have to replace every run of consecutive spaces inside the string with a single space, except for a space which is followed by a combining mark character.

The following finite state machine is implemented to cover all those special cases:


 

The (S) box represents the start of the PrintableString, and the black (e) box represents its end. ' ' represents a space, '©' a combining mark, and 'c' a simple Unicode character.

Actions:

| From | To | char | Action |
|---|---|---|---|
| S | 1 | ' ' | Switch to state 1 |
| S | 2 | any | add the char to target; Switch to state 2 |
| 1 | 1 | ' ' | - |
| 1 | 2 | any | add the char to target; Switch to state 2 |
| 1 | 3 | '©' | add a space and the char to target; Switch to state 3 |
| 1 | e | end | end |
| 2 | 2 | any | add the char to target |
| 2 | 3 | '©' | add the char to target; Switch to state 3 |
| 2 | 4 | ' ' | Switch to state 4 |
| 2 | e | end | end |
| 3 | 2 | any | add the char to target; Switch to state 2 |
| 3 | 3 | '©' | add the char to target |
| 3 | 4 | ' ' | Switch to state 4 |
| 3 | e | end | end |
| 4 | 2 | any | add a space and the char to target; Switch to state 2 |
| 4 | 3 | '©' | add a space and the char to target; Switch to state 3 |
| 4 | e | end | end |

The state engine is implemented in this form :

  1. remove all spaces at the beginning of the string, except if there is a combining mark after a space
  2. then remove the spaces at the end of the string
  3. if there are some non space characters, process them as if they were a list of chars and spaces : (char* space*)*, where spaces are replaced by a single space
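The transitions listed in the actions table can also be coded as an explicit state machine, which may be easier to check against the table than the three-pass form. The sketch below is an illustration only, not the ADS code; the class `InsignificantSpacesSketch` is hypothetical. Two points are assumptions because the table does not cover them: a space met in state 4 leaves the machine in state 4, and a combining mark met in state S is treated as 'any'. Combining marks are detected through the Unicode categories Mn, Mc and Me.

```java
// Hypothetical sketch of the state machine for printable strings.
public final class InsignificantSpacesSketch {
    // START = S, LEADING_SPACES = 1, CHAR = 2, MARK = 3, PENDING_SPACE = 4
    private enum State { START, LEADING_SPACES, CHAR, MARK, PENDING_SPACE }

    private static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static String normalize(String value) {
        StringBuilder target = new StringBuilder(value.length());
        State state = State.START;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean space = (c == ' ');
            boolean mark = isCombiningMark(c);
            switch (state) {
                case START:
                    if (space) {
                        state = State.LEADING_SPACES;
                    } else {
                        target.append(c);           // 'any', marks included (assumption)
                        state = State.CHAR;
                    }
                    break;
                case LEADING_SPACES:
                    if (mark) {
                        target.append(' ').append(c); // keep the space carrying the mark
                        state = State.MARK;
                    } else if (!space) {
                        target.append(c);
                        state = State.CHAR;
                    }                               // further leading spaces are dropped
                    break;
                case CHAR:
                case MARK:
                    if (space) {
                        state = State.PENDING_SPACE;
                    } else {
                        target.append(c);
                        state = mark ? State.MARK : State.CHAR;
                    }
                    break;
                case PENDING_SPACE:
                    if (!space) {
                        target.append(' ').append(c); // one space for the whole run
                        state = mark ? State.MARK : State.CHAR;
                    }                               // extra spaces stay in state 4 (assumption)
                    break;
            }
        }
        // Reaching the end in any state emits nothing more, so trailing
        // spaces are dropped.
        return target.toString();
    }
}
```

With this sketch, `InsignificantSpacesSketch.normalize("  a   b  ")` returns `"a b"`, while a leading space followed by a combining mark, as in `" \u0301a"`, is kept unchanged.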