Introduction

Earlier heuristics devised were found to be useless after trying to implement them. A better mechanism must be found. If you want to the this abandoned solution please check the change log for this page. Until we find a better solution various sections have been marked as TBD.

Logical disjunctions are OR expressions with two or more sub-expressions. An OR (|) filter expression accepts the union of sets matched by all sub-expressions.

The search engine must be able to build Cursors and Evaluators over logical disjunctions for candidate generation and lookups respectively. Cursors are used to traverse candidates matching the filter expression while Evaluators check to see if candidates are matched by the filter expression. This page describes the techniques used by the previous search engine based on directional NamingEnumerations and some of their drawbacks when dealing with disjunctions. The problems specific to Cursors are discussed. New heuristics are presented, and their benefits/drawbacks analyzed for the current search engine implementation based on Cursors.

Duplicates with Disjunctions

LDAP search semantics impose a uniqueness constraint on the results returned. Multiple copies of the same entry must not be returned in a search operation. Filters based on disjunctions impose a specific candidate traversal problem associated with duplicates. Consider the following search filter to the right and it's Venn diagram.

Each sub-expression of this filter matches a set of candidates. This represents the Venn Circles A, B, and C. There are 4 intersecting regions across these sets: 1, 2, 3, and 4. These intersecting regions have the potential to produce produce duplicates.

Region

Number of Duplicates

1

2

2

2

3

3

4

2

(| (ou=engineering) (ou=sales) (l=sunnyvale) )

When traversing a set one entry at a time, duplicates must be filtered. This is not such a simple problem to solve as we'll see it gets even more difficult to do effectively for Cursors.

Previous Implementation Issues

The previous search engine implementation built NamingEnumerations on disjunctions by stuffing a DisjunctionEnumeration with an array of NamingEnumerations for it's sub-expressions. The DisjunctionEnumeration exhausted each sub-enumeration as calls to nextElement() and next() were made. Unlike Cursors used in the new implementation, Enumerations are directional, hence they can only advance in one direction. This difference does impact the techniques used since at any point in time a Cursors position can be reset or the direction of traversal may change.

One major issue with the traversal of logical disjunctions is implementing the uniqueness constraint. Certain disjunctions based on separate child expressions may cause a candidate to be returned more than once. For example consider the following expression:

(| (ou=engineering) (l=sunnyvale))

The intersection between the child expression sets represents the candidates which will be returned twice by this filter. The old implementation used an 'already seen' hash to determine if a candidate should be returned. This is preferred for small result sets in disjunction expressions however for large result sets over very large directories this could consume a lot of memory.

Cursor Specific Implementation Issues

As stated before, Cursors can be positioned at any point in time and can change their traversal direction. The concept of what was already accessed/traversed by a Cursor is a bit hazy. With directional Enumerations this was never an issue. Using an "already seen" list or a blacklist to prevent the traversal of duplicates in a Cursor on a disjunction does not make much sense.

There's also no way to maintain Cursor ordering semantics with disjunctions. Child (sub-expression) Cursors used by a disjunction Cursor will not contain elements sorted on the same index.

For these reasons I'm quickly coming to the conclusion that duplicate elimination is a task that must be performed by the code using a Cursor. I really don't like this idea because it leads to inefficiency and may potentially lead to inconsistency.

An LDAP server should not return duplicates for search operations. To solve this we could filter out duplicates in the LDAP front end which uses the resultant search Cursor. This does not solve the problem of dealing with duplicates in Cursors returned by search when embedding the core. Even if we provide supporting classes to deal with this problem, those embedding the core would need to be aware of the issue to specifically handle duplicate elimination. This is pretty lousy.

New Cursor Implementation Techniques

TBD

An Example

TBD

Algorithm Description

TBD

Benefits and Drawbacks

TBD

Changes In Evaluators

Evaluators for handling logical disjunctions have not been impacted by the changes imposed due to the Cursor redesign. They work the same way they have worked before by leveraging a system of indices if available or accessing entries. We do however have some optimizations in mind that would help product more efficient evaluators.

Sub-expression Ordering Based on Scan Counts

Emmanuel Lécharny had a brilliant optimization suggestion. He suggested ordering sub-expression evaluation so the child sub-expression with the greatest probability of accepting a candidate is evaluated first. With logical disjunctions this will short the evaluation process once one sub-expression accepts the candidate as a match. Scan counts on child sub-expressions express, at most, the worst case result set that each child can return. The child sub-expression in the disjunction with the greatest scan count has the greatest probability of matching candidates.

The same optimization can be used with logical conjunctions (AND expressions). With conjunctions, the first child sub-expression to reject a candidate shorts the evaluation with a false return. To optimize in this case, child sub-expression evaluation is ordered from smallest scan count to greatest scan count. We want the child sub-expression with the least probability of matching or the greatest probability of NOT matching evaluated first.

  • No labels