...

If the search corpus contains "bald" Latin, simply use SerbianNormalizationFilterFactory.
If the search corpus has only Cyrillic or regular Latin text, and the users can be expected to enter Cyrillic or regular Latin, use SerbianNormalizationFilterFactory with the parameter haircut="regular".

If the search corpus has only Cyrillic or regular Latin text, but users can be expected to search with "bald" Latin, there are two solutions:

Simply use SerbianNormalizationFilterFactory and accept slightly worse results.

Use two indices: one index should use SerbianNormalizationFilterFactory and the other should use SerbianNormalizationFilterFactory with haircut="regular" (you can use the copyField directive, described under Copy Fields in SchemaXml, to copy from one to the other). Then, if a user enters a query that contains a Cyrillic letter or any of 'č', 'ć', 'š', 'ž' or 'đ' (regexp: [абвгдђежзијклљмнњопрстћуфхцчџшčćđšž]), search only the regular index; otherwise (the query might be "bald"), search the "bald" index.
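The routing rule for the two-index setup could be sketched as follows. This is a minimal illustration, not part of Solr: the function name and the two index names are hypothetical placeholders, and the character class is deliberately broader than the regexp above (it accepts any character in the basic Cyrillic block and the uppercase forms of the Latin letters with diacritics), on the assumption that queries are not lowercased before routing.

```python
import re

# A query is treated as "regular" if it contains any Cyrillic character
# (basic Cyrillic block, U+0400 to U+04FF) or a Serbian Latin letter with
# a diacritic. This is an assumed, slightly broader variant of the regexp
# given in the text, which lists lowercase Serbian letters only.
REGULAR_QUERY = re.compile(r"[\u0400-\u04FFčćđšžČĆĐŠŽ]")

# Hypothetical index names; substitute the names used in your own setup.
REGULAR_INDEX = "text_sr_regular"
BALD_INDEX = "text_sr_bald"


def choose_index(query: str) -> str:
    """Return the name of the index that should serve this query."""
    if REGULAR_QUERY.search(query):
        # Cyrillic or diacritic Latin present: the query cannot be "bald".
        return REGULAR_INDEX
    # Plain ASCII Latin only: the query might be "bald".
    return BALD_INDEX
```

With this sketch, a query such as "šuma" or "шума" is sent to the regular index, while "suma" is sent to the "bald" index.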

Background

The Serbian language is unusual in that it uses two alphabets, Cyrillic and Latin; while Cyrillic is considered the primary alphabet, Latin is also common. Texts might contain both alphabets, and users might enter queries in either, so it is important to be able to search both at the same time.

...
