Purpose
Describe the rule syntax for Apache Griffin measurements.
Measurement types covered:
- accuracy
- completeness
- timeliness
Data source types covered:
- batch
- streaming
Data format types covered:
- Structured data
  - tables, such as Hive
  - JSON, such as messages from Kafka
  - Avro
- Unstructured data
Idea
Our DSL builds on SQL, but it provides Apache Griffin's own SQL-like syntax, customized for Apache Griffin use cases.
Basically, to calculate data quality metrics, users only need to provide a comparison rule in the where clause, for example:
where : "$source.uid = $target.uid and $source.itemid = $target.itemid and $source.timestamp = $target.timestamp"
Apache Griffin then calculates the metrics for the user.
Apache Griffin measures determine which index to use to partition the data in order to increase efficiency. Users can also explicitly specify which index to use, as below.
where : "$source.uid = $target.uid and $source.itemid = $target.itemid and $source.timestamp = $target.timestamp" index : "uid, timestamp"
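To make the idea concrete, here is a minimal sketch of how an accuracy figure could be derived from such a comparison rule. It is not Griffin's implementation: the `AccuracySketch` object, the `Record` alias and the `accuracy` function are hypothetical names, records are modelled as in-memory maps, and the `index` fields are used only to bucket target records before comparison.

```scala
object AccuracySketch {
  // Assumption for this sketch: a record is a field-name -> value map.
  type Record = Map[String, Any]

  // Values of the given fields, in order (None when a field is absent).
  private def key(r: Record, fields: Seq[String]): Seq[Option[Any]] =
    fields.map(f => r.get(f))

  // Fraction of source records that have at least one target record
  // agreeing on every compared field. Target records are first bucketed
  // by the index fields, so each source record is only compared against
  // candidates that share its index values.
  def accuracy(source: Seq[Record],
               target: Seq[Record],
               compared: Seq[String],
               index: Seq[String]): Double = {
    val buckets = target.groupBy(r => key(r, index))
    val matched = source.count { s =>
      buckets.getOrElse(key(s, index), Seq.empty[Record])
        .exists(t => key(t, compared) == key(s, compared))
    }
    // An empty source is treated as fully accurate (a choice made for this sketch).
    if (source.isEmpty) 1.0 else matched.toDouble / source.size
  }
}
```

With the rule above, `compared` would be `Seq("uid", "itemid", "timestamp")` and `index` would be `Seq("uid", "timestamp")`.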
Examples
Example 1. Simple accuracy between different Hive tables
$source.user_id = $target.user_id
AND $source.first_name = $target.first_name
AND $source.last_name = $target.last_name
AND $source.address = $target.address
AND $source.email = $target.email
AND $source.phone = $target.phone
AND $source.post_code = $target.post_code
Example 2. Accuracy between different JSON strings from Kafka
IF ( $source.__time + 24h <= $target.__time )
$source.json().seeds.json().url = $target.json().groups[0].attrsList['name' = 'URL'].values[0]
AND $source.json().seeds.json().metadata.json().tracker.crawlRequestCreateTS = $target.json().groups[0].attrsList['name' = 'CRAWLMETADATA'].values[0].json().tracker.crawlRequestCreateTS
Syntax BNF
<rule> ::= <logical-statement> [WHEN <logical-statement>]
// rule: mapping-rule [WHEN when-rule]
// - mapping-rule: the first-level operator should preferably not be OR | NOT, otherwise the group-by column cannot be found automatically
// - when-rule: should contain only general information about the data source, not information specific to each data row

<logical-statement> ::= [NOT] <logical-expression> [(AND | OR) <logical-expression>]+ | "(" <logical-statement> ")"
// logical-statement: returns a boolean value
// logical-operator: "AND" | "&&", "OR" | "||", "NOT" | "!"

<logical-expression> ::= <math-expr> (<compare-opr> <math-expr> | <range-opr> <range-expr>)
// logical-expression example: $source.id = $target.id, $source.page_id IN ('3214', '4312', '60821')

<compare-opr> ::= "=" | "!=" | "<" | ">" | "<=" | ">="
<range-opr> ::= ["NOT"] "IN" | "BETWEEN"
<range-expr> ::= "(" [<math-expr>] [, <math-expr>]+ ")"
// range-expr example: ('3214', '4312', '60821'), (10, 15), ()

<math-expr> ::= [<unary-opr>] <math-factor> [<binary-opr> <math-factor>]+
// math-expr example: $source.price * $target.count, "hello" + " " + "world" + 123

<binary-opr> ::= "+" | "-" | "*" | "/" | "%"
<unary-opr> ::= "+" | "-"

<math-factor> ::= <literal> | <selection> | "(" <math-expr> ")"

<selection> ::= <selection-head> [ <field-sel> | <function-operation> | <index-field-range-sel> | <filter-sel> ]+
// selection example: $source.price, $source.json(), $source['state'], $source.numList[3], $target.json().mails['org' = 'apache'].names[*]

<selection-head> ::= $source | $target
<field-sel> ::= "." <field-string>
<function-operation> ::= "." <function-name> "(" <arg> [, <arg>]+ ")"
<function-name> ::= <name-string>
<arg> ::= <math-expr>

<index-field-range-sel> ::= "[" <index-field-range> [, <index-field-range>]+ "]"
<index-field-range> ::= <index-field> | (<index-field>, <index-field>) | "*"
// index-field-range: 2 means the 3rd item, (0, 3) means the first 4 items, * means all items, 'age' means item 'age'
<index-field> ::= <index> | <field-quote> | <all-selection>
// index: 0 ~ n means position from start, -1 ~ -n means position from end
<field-quote> ::= ' <field-string> ' | " <field-string> "

<filter-sel> ::= "[" <field-quote> <filter-compare-opr> <math-expr> "]"
<filter-compare-opr> ::= "=" | "!=" | "<" | ">" | "<=" | ">="
// filter-sel example: ['name' = 'URL'], $source.man['age' > $source.graduate_age + 5 ]
// When a <math-expr> appears inside a selection, it must not reference a different <selection-head>, for example:
// $source.tags[1+2] valid
// $source.tags[$source.first] valid
// $source.tags[$target.first] invalid
// -- This check belongs to validation, not to the parser

<literal> ::= <literal-string> | <literal-number> | <literal-time> | <literal-boolean>
<literal-string> ::= <any-string>
<literal-number> ::= <integer> | <double>
<literal-time> ::= <integer> ("d"|"h"|"m"|"s"|"ms")
<literal-boolean> ::= true | false
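As an illustration of one production in the grammar, the sketch below recognises a `<literal-time>` such as `24h` or `800ms` and converts it to milliseconds. The `LiteralTimeSketch` object and `toMillis` function are hypothetical names, and treating `<integer>` as an unsigned run of at most nine digits is an assumption made here, not something the grammar states.

```scala
object LiteralTimeSketch {
  // <literal-time> ::= <integer> ("d"|"h"|"m"|"s"|"ms")
  // Assumption: <integer> is 1 to 9 unsigned digits, so the product fits in a Long.
  private val TimePattern = """(\d{1,9})(ms|d|h|m|s)""".r

  // Returns the duration in milliseconds, or None if the text is not a time literal.
  // A regex used as a pattern must match the whole input string.
  def toMillis(literal: String): Option[Long] = literal match {
    case TimePattern(n, unit) =>
      val factor = unit match {
        case "ms" => 1L
        case "s"  => 1000L
        case "m"  => 60L * 1000L
        case "h"  => 60L * 60L * 1000L
        case "d"  => 24L * 60L * 60L * 1000L
      }
      Some(n.toLong * factor)
    case _ => None
  }
}
```

A rule such as `$source.__time + 24h <= $target.__time` could then add the converted value to a millisecond timestamp, assuming `__time` is expressed in milliseconds.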
Syntax description
a Basic elements
a-1 literal element
- literal string: 'uid', 'page_id', 'metadata'
- literal number: 0, 1, 342309, -43
- literal time: 24h, 3m, 800ms, 7d
- literal null value: null
a-2 variable element
- variable name: SLA, _test
- quote variable name: ${source}, ${target}, ${SLA}
a-3 operation element
- different level calculation operators: +, -, *, /, %, ...
- assign operator: =
- compare operator: ==, !=, >, <, >=, <=
- mapping operator: ===, !==, <<, >>, <<=, >>=
- delimiter of statements: ;
- annotation sign: @
b Selection elements
b-1 selection basic element
- field name: 'uid', 'page_id'
- single position number: 0, 1, 5
- multiple position number: 0:5, *
- filter condition: 'name'='URL', 'timeStamp'>20170914135347215, 'id'!=null
b-2 selection and function expression
- selection expression: ['seeds'], [*], ['name'='CRAWLMETADATA']
- function expression: .json()
c Calculation elements
c-1 calculation factor
- factor: ${source}['__time'], ${SLA}, ${target}['itm'], 24h, (expression)
c-2 calculation expression
- expression: ${source}['__time'] + ${SLA}, ${target}['price'] * 5 + 100
d Statement elements
d-1 assign statement
- assign statement: SLA = 24h; test_value = 12345;
d-2 condition statement
- condition statement: @Invalid ${source}['__time'] + ${SLA} > ${target}['__time'];
d-3 mapping statement
- mapping statement: @Key ${source}['uid'] === ${target}['uid']; ${source}['site_id'] === ${target}['site_id'];
Syntax Tree Parse
Statement Expression
trait StatementExpr extends Expr
case class StatementsExpr(statements: Iterable[StatementExpr]) extends StatementExpr
case class AssignExpr(expression: String, left: VariableExpr, right: ElementExpr) extends StatementExpr
case class ConditionExpr(expression: String, left: ElementExpr, right: ElementExpr, annotations: Iterable[AnnotationExpr]) extends StatementExpr
case class MappingExpr(expression: String, left: ElementExpr, right: ElementExpr, annotations: Iterable[AnnotationExpr]) extends StatementExpr
Calculation Expression
trait ElementExpr extends Expr
case class FactorExpr(self: Expr with Calculatable) extends ElementExpr
case class CalculationExpr(first: ElementExpr, others: Iterable[(String, ElementExpr)]) extends ElementExpr
Selection Expression
trait SelectExpr extends Expr
case class NumPositionExpr(expression: String) extends SelectExpr
case class StringPositionExpr(expression: String) extends SelectExpr
case class AnyPositionExpr(expression: String) extends SelectExpr
case class FilterOprExpr(expression: String, left: VariableExpr, right: ConstExpr) extends SelectExpr
case class FunctionExpr(expression: String, args: Iterable[ConstExpr]) extends SelectExpr

trait DataExpr extends Expr
case class SelectionExpr(head: QuoteVariableExpr, args: Iterable[SelectExpr]) extends DataExpr
Basic Expression
trait ConstExpr extends Expr
case class ConstStringExpr(expression: String) extends ConstExpr
case class ConstTimeExpr(expression: String) extends ConstExpr
case class ConstNumberExpr(expression: String) extends ConstExpr

trait VariableExpr extends Expr
case class VariableStringExpr(expression: String) extends VariableExpr
case class QuoteVariableExpr(expression: String) extends VariableExpr
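The shape of CalculationExpr, a first operand followed by (operator, operand) pairs, suggests a left-to-right fold when evaluating. The following is a minimal sketch of that idea using simplified stand-in types (`Elem`, `Factor`, `Calculation`); these are hypothetical and are not the Griffin classes listed above. Operator precedence is assumed to be expressed by nesting, one node per operator level.

```scala
object CalculationSketch {
  // Simplified stand-ins for ElementExpr, FactorExpr and CalculationExpr.
  sealed trait Elem
  final case class Factor(value: Double) extends Elem
  final case class Calculation(first: Elem, others: Seq[(String, Elem)]) extends Elem

  // Evaluates an element; a Calculation is folded from left to right,
  // applying each operator to the running result and the next operand.
  def eval(e: Elem): Double = e match {
    case Factor(v) => v
    case Calculation(first, others) =>
      others.foldLeft(eval(first)) { case (acc, (op, rhs)) =>
        val r = eval(rhs)
        op match {
          case "+" => acc + r
          case "-" => acc - r
          case "*" => acc * r
          case "/" => acc / r
          case "%" => acc % r
          case other => throw new IllegalArgumentException("unknown operator: " + other)
        }
      }
  }
}
```

For an expression like `${target}['price'] * 5 + 100` with a price of 20, the tree would be `Calculation(Calculation(Factor(20), Seq("*" -> Factor(5))), Seq("+" -> Factor(100)))`, with the multiplication nested inside the addition.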
Expression calculation strategy
a Parse the rule expressions to get a syntax tree of expressions.
b Connect to the data source; for each row of data, extract the selection values and store them.
example:
for rule: "@Key ${source}['uid'] === ${target}['uid']; ${source}['itm'] === ${target}['itm']; ${source}['page_id'] === ${target}['page_id'];"
source data schema: { uid: long, itm: string, name: string, time: long, page_id: long }
Parsing the rule yields the selection expressions of the source data: ${source}['uid'], ${source}['itm'], ${source}['page_id'], of which ${source}['uid'] is the key selection (because of the @Key annotation).
For each row of source data, extract the selection values and store them as a pair of a key tuple and a value map:
( (1234), Map( ("${source}['uid']" -> 1234), ("${source}['itm']" -> "some item"), ("${source}['page_id']" -> 24) ) )
( (1235), Map( ("${source}['uid']" -> 1235), ("${source}['itm']" -> "another item"), ("${source}['page_id']" -> 51) ) )
......
All source data is stored as RDD[(keys, valuesMap)]; the same operation is applied to the target data.
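The per-row extraction in step b can be sketched with plain Scala collections in place of an RDD. Everything named here (`SelectionStoreSketch`, `Row`, `selection`, `extract`) is hypothetical, and rows are assumed to be flat field-name to value maps.

```scala
object SelectionStoreSketch {
  // Assumption for this sketch: a row is a flat field-name -> value map.
  type Row = Map[String, Any]

  // Builds the selection expression text for a field, e.g. ${source}['uid'].
  def selection(head: String, field: String): String =
    "${" + head + "}['" + field + "']"

  // Returns (keys, valuesMap) for one row: the values of the @Key fields,
  // and a map from each selection expression to its value in the row.
  // Throws NoSuchElementException if the row lacks a requested field.
  def extract(head: String,
              keyFields: Seq[String],
              fields: Seq[String],
              row: Row): (Seq[Any], Map[String, Any]) = {
    val keys = keyFields.map(f => row(f))
    val values = fields.map(f => selection(head, f) -> row(f)).toMap
    (keys, values)
  }
}
```

Mapping `extract` over every source row gives a collection of (keys, valuesMap) pairs; with Spark, the same function could be applied inside an RDD `map` to obtain the pair RDD described above.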
c When mapping, for each pair of source and target rows, take their stored maps of selection values, substitute them into the statements for calculation, and determine whether the rows match.
example:
for rule: "@Key ${source}['uid'] === ${target}['uid']; ${source}['itm'] === ${target}['itm']; ${source}['page_id'] === ${target}['page_id'];"
trying to match a row of source data and target data with the same keys
source data value map: Map( ("${source}['uid']" -> 1234), ("${source}['itm']" -> "some item"), ("${source}['page_id']" -> 24) )
target data value map: Map( ("${target}['uid']" -> 1234), ("${target}['itm']" -> "some item"), ("${target}['page_id']" -> 25) )
merge the data value maps, get a whole value map: Map( ("${source}['uid']" -> 1234), ("${source}['itm']" -> "some item"), ("${source}['page_id']" -> 24), ("${target}['uid']" -> 1234), ("${target}['itm']" -> "some item"), ("${target}['page_id']" -> 25) )
substitute into the rule expressions, get:
@Key 1234 === 1234; "some item" === "some item"; 24 === 25;
calculate these expressions, get:
true; true; false;
the final result of the match is false, because of the mismatching rule "${source}['page_id'] === ${target}['page_id']"
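The matching step in this example can be sketched as follows. This is an illustration only: `MappingMatchSketch`, `evaluate` and `isMatch` are hypothetical names, each mapping statement is reduced to a pair of selection expression strings, and `===` is modelled as plain equality.

```scala
object MappingMatchSketch {
  // Evaluates each mapping statement (left selection, right selection)
  // against the merged source and target value maps. A statement whose
  // selections are missing from the merged map evaluates to false.
  def evaluate(mappings: Seq[(String, String)],
               source: Map[String, Any],
               target: Map[String, Any]): Seq[Boolean] = {
    val merged = source ++ target
    mappings.map { case (left, right) =>
      (merged.get(left), merged.get(right)) match {
        case (Some(a), Some(b)) => a == b
        case _ => false
      }
    }
  }

  // The rows match only if every mapping statement holds.
  def isMatch(mappings: Seq[(String, String)],
              source: Map[String, Any],
              target: Map[String, Any]): Boolean =
    evaluate(mappings, source, target).forall(identity)
}
```

Because source and target selection expressions carry different heads, merging the two maps cannot overwrite entries, which is why a single merged map is enough for the substitution.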