Improved authz
Goals
The current (1.9) implementation of authz is lacking in two areas:
- Performance when a large number of paths needs to be checked. Affected operations are mainly checkout / export and log.
- Support for wildcards. Subversion should support
- "*" or single, arbitrary path segments (with no "/" in them) and
- "**" for arbitrary length non-empty paths.
- Classic wildcard patterns like "*foo*.bar" shall be supported.
All wildcard usage applies to full path segments only, i.e. a '*'
never matches a '/' except for the case of "/**/" where it matches
one to many full segments.
Missing definitions:
This doc does not provide an unambiguous definition of the pattern syntax.
- Does
foo*
match "foo"? If it does, does a single*
match 0-or-1 segment, or is it an exception? - Is
/**/foo
hungry (i.e., does it match the longest subpath that ends in "foo", or the first path segment that starts with "foo")? - Does
**
match 0-or-more or 1-or-more segments? - What about wildcard escaping in patterns? Our current authz rules allow matching literal "*".
— Brane
Terminology
A path consists of segments, separated by "/". Whildcards are valid segments and their interpretation depends on the context.
A rule set contains a number of rules, each describing what users shall have which access rights. Users might be specified indirectly by using groups.
A path rule specifies what rule set applies to a given path.
Inheritance and disambiguation
These are the rules as of today, with rule 3 added to support ambiguous path patches caused by pattern matching.
- Path rules implicitly apply to all paths in the respective sub-tree.
2. If multiple path rules match a given repository path, only the path rule(s) that cover the most segments shall apply.
3. If multiple path rules cover the same number of segments, only the one specified last in the authz file shall apply.
4. If there are multiple rules in a rule set for a given user, the union of all rights granted shall apply.
5. The following rule is implicitly added as the first rule given no access to anybody by default.
[/] * =
Design
The idea is to use combine the following approaches:
- Preprocessed, tree-like data structures applied to segmented paths
- Reduction of the tree to what applies to the current user
- Pre-calculate recursive rights for early exist
In the initial implementation we may rely on the svn_config API to provide us with parser functionality. At that stage, we don't need an independent representation of the whole file contents.
General workflow
Putting caching aside, the workflow involved three data models, building on top of each other.
- Parsing an authz file (from file system or repository) yields a consolidated hash (additive sections being combined automatically) of it contents in svn_config_t.
This is the current behaviour of the authz files, but only because we happen to use an svn_config_t
to represent a parsed authz files. I suggest we should constrain the semantics for authz rules:
- No rule may appear more than once in the authz file.
- Value placeholders (
%(name)s
) are not expanded.
This implies writing a different constructor for the in-memory authz structure, but is consistent with the intent, if not the current semantics, of the authz files.
— Brane
- Filtered path rule tree
- prefix tree with one node per segment
- created on demand per user and repository
- contains only rules that apply to the respective user and repository
- multiple instances of that being cached in the svn_authz_t structure alongside the single svn_config_t
- rule sets being reduced to access rights + order ID
- each node knows min / max rights on all sub-nodes
- Lookup state
- access rights accordingly the latest matching path rule
- list of tree nodes that may match sub-paths as we may need to follow multiple patterns
- temporary data structure thats reused between queries to save on allocation and construction overhead
Data models
These are persistent in the sense that we will cache and reuse them. They do not cover transient data models that various algorithms may use e.g. during authz parsing.
Unless indicated otherwise, all collections are ordered.
Only abstract types / the interface view is given here; the actual implementation may be different.
Filtered path rule tree
filtered-tree := root : filtered-node // exists due to default path rule filtered-node := segment : string // empty for root access : ref to access-type // empty if no path ends here min-rights: none | r | w | rw // user's minimum rights on any path in the sub-tree max-rights: none | r | w | rw // user's maximum rights on any path in the sub-tree repeat : boolean // set on nodes for "**" segments sub-nodes : { map following segment => filtered-node }[0..*] // one node per non-wildcard in following segment // empty if no rules on following segments // the following will be aggregated into a sub-structure to // facilitate a quick "has pattern sub-segment?" check any : filtered-node // empty if no path rule with "*" in following segment any-var : filtered-node // empty if no path rule with "**" in following segment prefixed : { list of filtered-node }[0..*] // empty if no path rule like "bar*" in following segment suffixed : { list of filtered-node }[0..*] // empty if no path rule like "*bar" in following segment complex : { list of filtered-node }[0..*] // empty if no path rule with complex wildcard pattern // like "*for*bar" in following segment access-type := order-id : integer // increases with declaration order in the file rights : none | r | w | rw
Correction:
There is no such thing as write-only access in our authz model. Access rights can be only none
, r
or rw
. This needs to be fixed throughout this doc.
— Brane
Lookup state
lookup-state := access : ref to access-type // user's rights for this path min-rights: none | r | w | rw // user's minimum rights on any path in the sub-tree // (aggregated over RIGHTS and all nodes in CURRENT) max-rights: none | r | w | rw // user's maximum rights on any path in the sub-tree // (aggregated over RIGHTS and all nodes in CURRENT) current : { ref to filtered-node }[0..*] // sub nodes of these may match the next segment
Algorithms
Normalization
Wildcard semantics?
These normalisations are only valid if one assumes a certain set of semantics for wildcards, but we do not define these semantics anywhere; see initial comment in this doc.
— Brane
Wildcard sequences in paths in rule sets shall be normalized. This is merely done to reduce matching costs later on.
// Variable length wildcards must be the last in a sequence while path contains "/**/*/" replace occurrence with "/*/**/" // Consecutive variable length wildcards are redundant while path contains "/**/**/" replace occurrence with "/*/**/" // Trailing variable length wildcards are redundant if path ends with "/**" replace occurrence with "/*" // "**" within segment patterns are redundant while path contains "**" not immediately preceded by "/" replace occurrence with "*" while path contains "**" not immediately followed by "/" replace occurrence with "*"
Lookup
Since there is always an implicit path rule at the root for all users, lookup is always necessary.
init(tree : ref to filtered-tree): return new lookup-state .rights = tree.root.rule .current = { tree.root }, if has_subnodes(tree.root) { }, otherwise
Checking whether we need to continue in our path matching is trivial, just check whether there are any potential matches to sub-paths.
done(state : ref to lookup-state): return is_empty(state.current)
One step in the lookup iteration, i.e. matching the next segment, is relatively simple as well. For simplicity, we ignore the min / max rights optimization here.
latest(lhs : ref to access-type, rhs : ref to access-type): if is_empty(lhs) return rhs if is_empty(rhs) or lhs.order-id > rhs.order-id return lhs return rhs step(state : ref to lookup-state, segment : string): next = { } access = empty foreach node in state.current: if node.repeat: next += node access = latest(access, node.access) foreach sub-node in all-subnodes(node): // simplified if matches(sub-node, segment): // simplified next += sub-node access = latest(access, sub-node.access) if not is_empty(access) state.rights = access state.current = next
So, the full lookup without caching is
lookup(tree : ref to filtered-tree, path : string): state = init(tree) foreach segment in path if done(state) break; state = step(state, segment) return state.rights
Implementation Notes
Changes from Current Behaviour
Rules Defined Only Once
The config-based parser would merge and override access entries in redefined rules. The new parser rejects authz files that define the same rule more than once.
New Rule Syntax for Wildcard Rules
Existing rules come in two flavours; repository-specific and global:
[repos:/path] [/path]
In these rules, /path
is always matched literally. The new parser supports two new forms for rules with paths that contain wildcards:
[:glob:repos:/path] [:glob:/path]
Because a glob rule is not required to actually contain wildcards in the path, two sections with different names may represent the same rule; for example,
[/path] [:glob:/path]
the above two rules are identical. The new parser detects (and rejects) such collisions.
Write-only Access is Not Allowed
The config-based authz parser will allow access entries that grant only write access; e.g.,
[/] * = w
The new parser flags such entries as invalid; our authz implementation does not support write-only access.
Only Group Definitions in the Global Groups File
The new parser allows only group definitions (i.e., only the [groups]
section) in the global groups file. The config-based parser would allow other sections there, but they would be ignored.