...
- If the search corpus contains "bald" Latin, simply use SerbianNormalizationFilterFactory.
- If the search corpus has only Cyrillic or regular Latin text, and users can be expected to enter Cyrillic or regular Latin, use SerbianNormalizationFilterFactory with the parameter haircut="regular".
- If the search corpus has only Cyrillic or regular Latin text, but users can be expected to search with "bald" Latin, there are two solutions:
  - Simply use SerbianNormalizationFilterFactory and accept slightly worse results.
  - Use two indices: one index should use SerbianNormalizationFilterFactory and the other should use SerbianNormalizationFilterFactory with haircut="regular" (you can use the copyField directive, see SchemaXml#Copy_Fields, to copy from one to the other). Then, if a user enters a query that contains a Cyrillic letter or any of 'č', 'ć', 'š', 'ž' or 'đ' (regexp: [абвгдђежзијклљмнњопрстћуфхцчџшčćđšž]), search only the regular index; otherwise (the query might be "bald"), search the "bald" index.
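The query routing used by the two-index approach can be sketched in Java as follows. This is only an illustration, not part of Solr: the class name, the method name and the two index names are hypothetical, and lowercasing the query before matching is an added assumption (the character class above lists lowercase letters only).

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class SerbianQueryRouter {
    // Hypothetical index names; substitute the names of your own cores or fields.
    static final String REGULAR_INDEX = "serbian_regular";
    static final String BALD_INDEX = "serbian_bald";

    // Lowercase Serbian Cyrillic letters plus the Latin letters with diacritics.
    private static final Pattern REGULAR_MARKER =
            Pattern.compile("[абвгдђежзијклљмнњопрстћуфхцчџшčćđšž]");

    private SerbianQueryRouter() {
    }

    // Returns the index to search for the given user query.
    static String chooseIndex(String query) {
        // Assumption: lowercase first so that uppercase input is also recognised.
        String lowered = query.toLowerCase(Locale.ROOT);
        // A Cyrillic letter or a diacritic means the query cannot be "bald".
        return REGULAR_MARKER.matcher(lowered).find() ? REGULAR_INDEX : BALD_INDEX;
    }

    public static void main(String[] args) {
        for (String query : new String[] {"шљива", "šljiva", "sljiva"}) {
            System.out.println(query + " -> " + chooseIndex(query));
        }
    }
}
```

A query without any of these characters is not necessarily "bald" (many Serbian words contain no diacritics at all), which is why such queries are sent to the "bald" index, the one that matches both spellings.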
Background
Serbian is unusual in that it uses two alphabets, Cyrillic and Latin. Cyrillic is considered the primary alphabet, but Latin is also common. Texts may contain both alphabets, and users may enter queries in either one, so it is important to be able to search both at the same time.
...