Purpose
Griffin DSL is designed for DQ measurement as a SQL-like language that describes requests in the DQ domain.
It covers several kinds of measurement types:
- accuracy
- profiling
- uniqueness
- timeliness
- ...
It covers several kinds of data source types:
- batch
  - hive
  - avro
  - ...
- streaming
  - kafka
It covers several kinds of data format types:
- structured data
  - hive table
  - json string
  - avro
- unstructured data
Idea
Griffin DSL is sql-like, customized to fit the Apache Griffin use case.
To calculate accuracy metrics, users only need to provide a comparison rule like a where clause:
source.uid = target.uid and source.itemid = target.itemid and source.timestamp = target.timestamp
Apache Griffin will calculate the metrics for users.
To calculate profiling metrics, users only need to provide a sql-like rule (in which the "select" and "from" clauses can be omitted):
source.id.count(), source.age.max() where source.age > 10
Examples
Example 1. accuracy between different source and target
source.id = target.id AND source.name = target.name AND source.age = target.age
Example 2. profiling of source
SELECT source.country, source.id.count() AS count, source.age.avg() AS avg_age FROM source GROUP BY source.country SORT BY count DESC LIMIT 5
Syntax BNF
-- literal --
<literal> ::= <literal-string> | <literal-number> | <literal-time> | <literal-boolean> | <literal-null> | <literal-nan>
<literal-string> ::= <any-string>
<literal-number> ::= <integer> | <double>
<literal-time> ::= <integer> ("d"|"h"|"m"|"s"|"ms")
<literal-boolean> ::= true | false
<literal-null> ::= null
<literal-nan> ::= nan

-- selection --
<selection> ::= <selection-head> [ <field-sel> | <index-sel> | <function-sel> ]* [<as-alias>]?
<selection-head> ::= ("data source name registered") | <function> | <field-name> | <all-selection>
<field-sel> ::= "." <field-name> | "[" <quote-field-name> "]"
<index-sel> ::= "[" <arg> "]"
<function-sel> ::= "." <function-name> "(" [<arg>]? [, <arg>]* ")"
<arg> ::= <math-expr>

-- as alias --
<as-alias> ::= <as> <field-name>

-- math expr --
<math-factor> ::= <literal> | <function> | <selection> | "(" <math-expr> ")" [<as-alias>]?
<unary-math-expr> ::= [<unary-opr>]* <math-factor>
<binary-math-expr> ::= <unary-math-expr> [<binary-opr> <unary-math-expr>]+
<math-expr> ::= <binary-math-expr>

-- logical expr --
<in-expr> ::= <math-expr> [<not>]? <in> <range-expr>
<between-expr> ::= <math-expr> [<not>]? <between> (<math-expr> <and> <math-expr> | <range-expr>)
<range-expr> ::= "(" [<math-expr>]? [, <math-expr>]+ ")"
<like-expr> ::= <math-expr> [<not>]? <like> <math-expr>
<is-null-expr> ::= <math-expr> <is> [<not>]? <null>
<is-nan-expr> ::= <math-expr> <is> [<not>]? <nan>
<logical-factor> ::= <math-expr> | <in-expr> | <between-expr> | <like-expr> | <is-null-expr> | <is-nan-expr> | "(" <logical-expr> ")" [<as-alias>]?
<unary-logical-expr> ::= [<unary-logical-opr>]* <logical-factor>
<binary-logical-expr> ::= <unary-logical-expr> [<binary-logical-opr> <unary-logical-expr>]+
<logical-expr> ::= <binary-logical-expr>

-- expression --
<expr> ::= <math-expr> | <logical-expr>

-- function expr --
<function> ::= <function-name> "(" [<arg>]? [, <arg>]* ")" [<as-alias>]?
<function-name> ::= ("function name registered")
<arg> ::= <expr>

-- clauses --
<select-clause> ::= <expr> [, <expr>]*
<where-clause> ::= <where> <expr>
<from-clause> ::= <from> ("data source name registered")
<having-clause> ::= <having> <expr>
<groupby-clause> ::= <group> <by> <expr> [ <having-clause> ]?
<orderby-item> ::= <expr> [ <DESC> ]?
<orderby-clause> ::= <order> <by> <orderby-item> [ , <orderby-item> ]*
<limit-clause> ::= <limit> <expr>

-- combined clauses --
<combined-clauses> ::= <select-clause> [ <from-clause> ]? [ <where-clause> ]? [ <groupby-clause> ]? [ <orderby-clause> ]? [ <limit-clause> ]?
Syntax description
a Supporting process
logical operation: not, and, or, in, between, like, is null, is nan, =, !=, <>, <=, >=, <, >
mathematical operation: +, -, *, /, %
sql statement: as, where, group by, having, order by, limit
b Keywords
- null, nan, true, false
- not, and, or
- in, between, like, is
- select, distinct, from, as, where, group, by, having, order, desc, asc, limit
c Operators
- !, &&, ||, =, !=, <, >, <=, >=, <>
- +, -, *, /, %
- (, )
- ., [, ]
d Literals
- string: any string surrounded by a pair of " or ', with the escape character \ if needed. e.g. "test", 'string 1', "hello \" world \" "
- number: a double or integer number. e.g. 123, 33.5
- time: an integer with a unit suffix, translated to an integer number of milliseconds. e.g. 3d, 5h, 4ms
- boolean: a boolean value literal. e.g. true, false
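As a runnable sketch of the time-literal rule above, the conversion of a literal like "3d" into milliseconds could look like this (function name and regex are illustrative, not Griffin's actual implementation):

```python
import re

# milliseconds per unit, matching the units the DSL accepts: d, h, m, s, ms
UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}

def time_literal_to_ms(literal: str) -> int:
    """Translate a DSL time literal (e.g. "3d", "5h", "4ms") to milliseconds."""
    # try "ms" before "m" so "4ms" is not parsed as minutes
    m = re.fullmatch(r"(\d+)(ms|d|h|m|s)", literal)
    if m is None:
        raise ValueError(f"not a time literal: {literal!r}")
    value, unit = m.groups()
    return int(value) * UNIT_MS[unit]

print(time_literal_to_ms("3d"))   # 259200000
print(time_literal_to_ms("4ms"))  # 4
```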
e Selections
- selection head: data source name. e.g. source, target, `my table name`
- all field selection: * or with the data source name ahead. e.g. *, source.*, target.*
- field selection: field name or with the data source name ahead. e.g. source.age, target.name, user_id
- index selection: an integer between square brackets "[]", with a field name ahead. e.g. source.attributes[3]
- function selection: function name with brackets "()", with a field name ahead or not. e.g. count(*), *.count(), source.user_id.count(), max(source.age)
- alias: declare an alias after a selection. e.g. source.user_id as id, target.user_name as name
f Math expressions
- math factor: a literal, function, selection, or math expression in brackets. e.g. 123, max(1, 2, 3, 4), source.age, (source.age + 13)
- unary math expression: unary math operator with a factor. e.g. -(100 - source.score)
- binary math expression: math factors with binary math operators. e.g. source.age + 13, score * 2 + ratio
g Logical expression
- in: in clause like sql. e.g. source.country in ("USA", "CHN", "RSA")
- between: between clause like sql. e.g. source.age between 3 and 30, source.age between (3, 30)
- like: like clause like sql. e.g. source.name like "%abc%"
- is null: is null operator like sql. e.g. source.desc is not null
- is nan: checks whether the value is not a number; the syntax is like is null. e.g. source.age is not nan
- logical factor: a math expression, one of the logical expressions above, or another logical expression in brackets. e.g. (source.user_id = target.user_id AND source.age > target.age)
- unary logical expression: unary logical operator with a factor. e.g. NOT source.has_data, !(source.age = target.age)
- binary logical expression: logical factors with binary logical operators, including and, or and comparison operators. e.g. source.age = target.age OR source.ticket = target.tck
h Expression
- expression: a logical expression or a math expression.
i Function
- argument: an expression.
- function: function name with arguments between brackets. e.g. max(source.age, target.age), count(*)
j Clause
- select clause: the result columns, like a sql select clause; the keyword "select" can be omitted in Griffin DSL. e.g. select user_id.count(), age.max() as max, source.user_id.count() as cnt, source.age.min()
- from clause: the table name, like a sql from clause; the data source name must be one of the data source names or the output table name of a former rule step. This clause can be omitted by configuring the data source name. e.g. from source, from `target`
- where clause: the filter condition, like a sql where clause; optional. e.g. where source.id = target.id and source.age = target.age
- group-by clause: like the group-by clause in sql; optional. An optional having clause may follow. e.g. group by cntry, group by gender having count(*) > 50
- order-by clause: like the order-by clause in sql; optional. e.g. order by name, order by first_name desc, age asc
- limit clause: like the limit clause in sql; optional. e.g. limit 5
k Accuracy Rule
- An accuracy rule expression in Griffin DSL is a logical expression describing the mapping relation between data sources. e.g. source.id = target.id and source.name = target.name and source.age between (target.age, target.age + 5)
l Profiling Rule
- A profiling rule expression in Griffin DSL is a sql-like expression: a select clause first, followed by optional from, where, group-by, order-by, and limit clauses in order. e.g. source.gender, source.id.count() where source.age > 20 group by source.gender; select country, max(age), min(age), count(*) as cnt from source group by country order by cnt desc limit 5
m Uniqueness Rule
- A uniqueness rule expression in Griffin DSL is a comma-separated list of selection expressions indicating the columns to measure for duplication. e.g. name, (age + 1) as next_age
n Timeliness Rule
- A timeliness rule expression in Griffin DSL is a comma-separated list of selection expressions indicating the input time and output time columns (the calculation time is used by default if the output time is not set). e.g. ts
Griffin DSL translation to SQL
Griffin DSL is defined for DQ measurement, to describe problems in the DQ domain.
In Griffin, we take Griffin DSL rules and translate them into spark-sql rules for calculation in the spark-sql engine.
The DQ domain has multiple dimensions, and we need to translate them in different ways.
a Accuracy
For accuracy, we need to get the match count between source and target; the rule describes the mapping relation between the data sources. Griffin needs to translate the dsl rule into multiple sql rules.
For example, the dsl rule "source.id = target.id and source.name = target.name" represents the match condition of accuracy. After the translation, the sql rules are as below:
- get miss items from source:
SELECT source.* FROM source LEFT JOIN target ON coalesce(source.id, '') = coalesce(target.id, '') and coalesce(source.name, '') = coalesce(target.name, '') WHERE (NOT (source.id IS NULL AND source.name IS NULL)) AND (target.id IS NULL AND target.name IS NULL)
saved as table miss_items.
- get miss count:
SELECT COUNT(*) AS miss FROM miss_items
saved as table miss_count.
- get total count from source:
SELECT COUNT(*) AS total FROM source
saved as table total_count.
- get accuracy metric:
SELECT miss_count.miss AS miss, total_count.total AS total, (total_count.total - miss_count.miss) AS matched FROM miss_count FULL JOIN total_count
saved as table accuracy.
After the translation, the metrics will be persisted in table accuracy.
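The four translated steps above can be exercised end to end with a small runnable sketch. SQLite stands in here for the Spark SQL engine, and the sample rows are illustrative; the final cross join replaces the FULL JOIN, which is safe only because both intermediate tables hold exactly one row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source(id INTEGER, name TEXT);
CREATE TABLE target(id INTEGER, name TEXT);
INSERT INTO source VALUES (1,'a'),(2,'b'),(3,'c');
INSERT INTO target VALUES (1,'a'),(2,'b');
""")

# step 1: get miss items from source
con.execute("""
CREATE TABLE miss_items AS
SELECT source.* FROM source LEFT JOIN target
ON coalesce(source.id, '') = coalesce(target.id, '')
AND coalesce(source.name, '') = coalesce(target.name, '')
WHERE (NOT (source.id IS NULL AND source.name IS NULL))
AND (target.id IS NULL AND target.name IS NULL)
""")
# step 2: get miss count
con.execute("CREATE TABLE miss_count AS SELECT COUNT(*) AS miss FROM miss_items")
# step 3: get total count from source
con.execute("CREATE TABLE total_count AS SELECT COUNT(*) AS total FROM source")
# step 4: get accuracy metric (cross join in place of FULL JOIN; both tables have one row)
row = con.execute("""
SELECT miss_count.miss, total_count.total,
       (total_count.total - miss_count.miss) AS matched
FROM miss_count, total_count
""").fetchone()
print(row)  # (1, 3, 2): one missed record, three total, two matched
```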
b Profiling
For profiling, the request is always an aggregation of the data. The rule is mostly the same as sql, supporting only the select, from, where, group-by, having, order-by, and limit clauses, which can describe most profiling requests. For more complicated requests, you can use a sql rule directly.
For example, the dsl rule "source.cntry, source.id.count(), source.age.max() group by source.cntry" represents the profiling request. After the translation, the sql rule is as below:
- profiling sql rule:
SELECT source.cntry, count(source.id), max(source.age) FROM source GROUP BY source.cntry
saved as table profiling.
After the translation, the metrics will be persisted in table profiling.
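The translated profiling rule can be tried out with a runnable sketch; SQLite stands in for Spark SQL, the sample rows are illustrative, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source(cntry TEXT, id INTEGER, age INTEGER)")
con.executemany("INSERT INTO source VALUES (?,?,?)",
                [("US", 1, 30), ("US", 2, 40), ("CN", 3, 25)])

# the translated profiling rule, grouped by country
rows = con.execute("""
SELECT source.cntry, count(source.id), max(source.age)
FROM source GROUP BY source.cntry
ORDER BY source.cntry
""").fetchall()
print(rows)  # [('CN', 1, 25), ('US', 2, 40)]
```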
c Uniqueness
Uniqueness, also called duplicate detection, finds the duplicate items in the data and rolls up the item counts grouped by the number of duplications.
For example, the dsl rule "name, age" represents the duplicate request; in this case, source and target are the same data set. After the translation, the sql rules are as below:
- get distinct items from source: SELECT DISTINCT name, age FROM source, saved as table src.
- get all items from target: SELECT name, age FROM target, saved as table tgt.
- join two tables: SELECT src.name, src.age FROM tgt RIGHT JOIN src ON coalesce(src.name, '') = coalesce(tgt.name, '') AND coalesce(src.age, '') = coalesce(tgt.age, ''), save as table joined.
- get items duplication: SELECT name, age, (count(*) - 1) AS dup FROM joined GROUP BY name, age, save as table grouped.
- get total metric: SELECT count(*) FROM source, save as table total_metric.
- get unique record: SELECT * FROM grouped WHERE dup = 0, save as table unique_record.
- get unique metric: SELECT count(*) FROM unique_record, save as table unique_metric.
- get duplicate record: SELECT * FROM grouped WHERE dup > 0, save as table dup_record.
- get duplicate metric: SELECT dup, count(*) AS num FROM dup_record GROUP BY dup, save as table dup_metric.
After the translation, the metrics will be persisted in table dup_metric.
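The uniqueness pipeline above can be sketched end to end as follows. SQLite stands in for Spark SQL, the data set is illustrative, and the document's RIGHT JOIN is rewritten as the equivalent LEFT JOIN for portability across SQLite versions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source(name TEXT, age INTEGER)")
con.executemany("INSERT INTO source VALUES (?,?)",
                [("a", 1), ("a", 1), ("b", 2)])
# source and target are the same data set in this case
con.execute("CREATE TABLE target AS SELECT * FROM source")

con.executescript("""
-- distinct items from source, all items from target
CREATE TABLE src AS SELECT DISTINCT name, age FROM source;
CREATE TABLE tgt AS SELECT name, age FROM target;
-- join the two tables (RIGHT JOIN rewritten as an equivalent LEFT JOIN)
CREATE TABLE joined AS
  SELECT src.name, src.age FROM src LEFT JOIN tgt
  ON coalesce(src.name, '') = coalesce(tgt.name, '')
  AND coalesce(src.age, '') = coalesce(tgt.age, '');
-- duplication count per distinct item
CREATE TABLE grouped AS
  SELECT name, age, (count(*) - 1) AS dup FROM joined GROUP BY name, age;
""")

total = con.execute("SELECT count(*) FROM source").fetchone()[0]
unique = con.execute("SELECT count(*) FROM grouped WHERE dup = 0").fetchone()[0]
dup_metric = con.execute(
    "SELECT dup, count(*) AS num FROM grouped WHERE dup > 0 GROUP BY dup"
).fetchall()
print(total, unique, dup_metric)  # 3 1 [(1, 1)]: one item duplicated once
```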
d Timeliness
Timeliness measures the latency of each item and computes statistics over the latencies.
For example, for the dsl rule "ts, out_ts", the first column is the input time of the item and the second is the output time; if not set, "__tmst" is used as the default output time column. After the translation, the sql rules are as below:
- get input and output time column: SELECT *, ts AS _bts, out_ts AS _ets FROM source, save as table origin_time.
- get latency: SELECT *, (_ets - _bts) AS latency FROM origin_time, save as table lat.
- get timeliness metric: SELECT CAST(AVG(latency) AS BIGINT) AS avg, MAX(latency) AS max, MIN(latency) AS min FROM lat, save as table time_metric.
After the translation, the metrics will be persisted in table time_metric.
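The three timeliness steps can be sketched with runnable code as well; SQLite stands in for Spark SQL and the timestamps are illustrative epoch milliseconds.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source(ts INTEGER, out_ts INTEGER)")
con.executemany("INSERT INTO source VALUES (?,?)",
                [(1000, 1500), (2000, 2100), (3000, 3900)])

con.executescript("""
-- get input and output time columns
CREATE TABLE origin_time AS SELECT *, ts AS _bts, out_ts AS _ets FROM source;
-- get latency per item
CREATE TABLE lat AS SELECT *, (_ets - _bts) AS latency FROM origin_time;
""")
# get timeliness metric
row = con.execute("""
SELECT CAST(AVG(latency) AS BIGINT) AS avg,
       MAX(latency) AS max, MIN(latency) AS min FROM lat
""").fetchone()
print(row)  # (500, 900, 100)
```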
Alternative Rules
You can simply use a Griffin DSL rule to describe your problem in the DQ domain; for some complicated requirements, you can also use the alternative rule types supported by Griffin.
a Spark sql
Griffin supports spark-sql directly, you can write rule in sql like this:
{
  "dsl.type": "spark-sql",
  "name": "source",
  "rule": "SELECT count(id) AS cnt, max(timestamp) AS fresh_time FROM source"
}
Griffin will calculate it in spark-sql engine directly.
b Data frame operation
Griffin supports some other operations on data frames in spark, such as converting a json string data frame into an extracted data frame with the extracted object schema. For example:
{
  "dsl.type": "df-opr",
  "name": "ext_source",
  "rule": "from_json",
  "details": {
    "df.name": "json_source"
  }
}
Griffin will do the operation to extract json strings.
Actually, you can also extend the df-opr engine and df-opr adaptor in Griffin to support more types of data frame operations.
Tips
The Griffin engine runs on spark and works in two phases: the pre-proc phase and the run phase.
a Pre-proc phase
Griffin processes the data sources directly to get an appropriate data format, as preparation for the DQ calculation. In this phase, you can use df-opr and spark-sql rules.
After preparation, to support streaming DQ calculation, a timestamp column is added to each row of data, so the data frame in the run phase contains an extra column named "__tmst".
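A minimal sketch (not Griffin's actual implementation) of what this preparation step produces: every row gains a "__tmst" column carrying the batch timestamp in milliseconds. The function name and row representation are illustrative.

```python
import time

def add_tmst(rows, tmst_ms=None):
    """Attach a "__tmst" timestamp column to each row, as the pre-proc phase does."""
    # default to the current epoch time in milliseconds if no timestamp is given
    tmst_ms = tmst_ms if tmst_ms is not None else int(time.time() * 1000)
    return [{**row, "__tmst": tmst_ms} for row in rows]

rows = [{"id": 1, "age": 30}, {"id": 2, "age": 25}]
print(add_tmst(rows, tmst_ms=1549197300000))
```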
b Run phase
Griffin calculates the DQ metrics from the prepared data. In this phase, you can use griffin-dsl and spark-sql rules, and a subset of df-opr rules.
For a griffin-dsl rule, griffin translates it into a spark-sql rule with a group-by condition on column "__tmst", which is especially useful for streaming DQ calculation. A spark-sql rule, however, is used directly, so you need to add the "__tmst" column in your spark-sql rule explicitly, or you won't get correct metric results after the calculation.
Comments
neil wang
Hi Lionel,
I have one question about the Griffin DSL translation of Accuracy. According to my understanding of the following url, you will find the unique key of the data. After this optimization, what will the sql become? Will the unique key become the join key?
Other question, if both the source and target have duplicate records, how to calculate the accuracy of the target data? This optimization makes me confused.
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/measures.md
Thanks,
Neil Wang
Lionel Liu
Hi Neil,
We don't do any other optimization for accuracy now, we just translate it as an ON clause in spark sql, and that works well, as spark sql has some optimization when joining with an ON clause. We had to find the unique keys of data in the old version, because we implemented the accuracy algorithm ourselves before, so we needed some optimization for it. That document explains the definition of accuracy based on our old algorithm, but we can compute it with much less information from users now.
About the sql translation, you can just refer to the contents in this document:
For example, the dsl rule is "source.id = target.id and source.name = target.name", which represents the match condition of accuracy. After the translation, the sql rules are as below:
SELECT source.* FROM source LEFT JOIN target ON coalesce(source.id, '') = coalesce(target.id, '') and coalesce(source.name, '') = coalesce(target.name, '') WHERE (NOT (source.id IS NULL AND source.name IS NULL)) AND (target.id IS NULL AND target.name IS NULL)
, saved as table miss_items.
SELECT COUNT(*) AS miss FROM miss_items
, saved as table miss_count.
SELECT COUNT(*) AS total FROM source
, saved as table total_count.
SELECT miss_count.miss AS miss, total_count.total AS total, (total_count.total - miss_count.miss) AS matched FROM miss_count FULL JOIN total_count
, saved as table accuracy.
For your second question, accuracy measures count(records in source that can also be found in target) / count(all records in source). Here are some examples to explain the duplicate conditions.
A. source: [1, 2, 3, 3, 3], target: [1, 2, 3], the accuracy will be 5 / 5, which means all the records in source could be found in target. There are some duplicate records in source, but accuracy doesn't assume their uniqueness; it only cares about the match percentage.
B. source: [1, 2, 3], target: [1, 2, 3, 3, 3], the accuracy will be 3 / 3, which also means all source records could be found in target.
C. source: [1, 2, 3, 3, 3], target: [1, 2], the accuracy will be 2 / 5. We don't know whether the duplicate source records are legal or not, so we need to keep their information.
There might be some unexpected duplication in source, but in accuracy, the duplication can not be found.
Thanks,
Lionel, Liu
neil wang
Hi Lionel,
Thanks, I have understood the translation of accuracy. In other words, uniqueness must be used with accuracy in the duplicate conditions.
About uniqueness, the translation steps may have some errors. Please have a check.
Thanks,
Neil Wang
Lionel Liu
Aha, got it, I missed one row, thanks for the reminder~
Lionel Liu
Btw, I think uniqueness is not necessary before accuracy, only if you want to ignore the duplicate ones. But there comes another question, how to get total count of source.
e.g.: source: [1, 2, 3, 3, 3], target: [1, 2, 3], if ignore the duplicate records from source, the accuracy should be 3 / 3 or 3 / 5 ?
3 / 3 seems to tell you all records from source are found in target, but the total count is a little strange. 3 / 5 seems to have a correct total count, but the missing 2 records can also be found in target, that destroys accuracy definition.
If you do want to keep the distinct records in source, I think 3 / 3 might be a better result in this case, if so, the distinct process is just something like a preprocess of data source, not need to be in the accuracy calculation. Which means, in accuracy, the source should be [1, 2, 3], preprocessed from [1, 2, 3, 3, 3]
Gaurav Malhotra
Hello,
I wish to integrate Apache Griffin with AWS S3 in place of HDFS (Hadoop). How can we do that? Looking at the code, the framework seems to be written specifically for HDFS/Hadoop.
Looking forward to hearing from you.
Regards,
Gaurav
Lionel Liu
Hi Gaurav,
Griffin works on HDFS in these places:
Based on this, I think you can integrate it with AWS. If you need streaming mode, you can implement temporary storage based on AWS S3, or you can skip this for batch mode only. If you want to save metrics in AWS S3, you may need such a metric persist solution.
Thanks,
Lionel
Gaurav Malhotra
Thanks Lionel. I am relatively new to Griffin. I would like to explore option 1, but I am a bit confused now. I see the Spark job for the Griffin DQ metrics is configured with HDFS. How can we change it to S3 and keep all the cool features of Griffin like the UI, REST API etc.? A polite request: if you have an S3 sample available, it would be great if you could share it with me or give me a pointer to where/how we can change the storage of Hive from HDFS to AWS S3.
Thanks again,
Gaurav
Lionel Liu
Which fields in the configuration are you concerned with? Would you please list them on our mailing list? That might be a better place for the discussion.
Gaurav Malhotra
Hello,
We have made good progress integrating Griffin with AWS S3 and Livy. Everything is running on the K8s cluster like a charm since we extended the Apache Griffin image. The change was what you suggested: we simply pointed Hive to S3. With this integration, it's able to read data and run Griffin measure jobs.
But we have hit a roadblock when we tried to recreate the same idea on the k8 cluster. We are facing the following exception and we have tried for a week to fix with various ways - with no success :-(
Below is the exception we are getting in the livy logs:
"Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///griffin/griffin-measure.jar"
Our Analysis
==========
The problem seems to be that Griffin is still trying to look for the measure jar on HDFS, but it is not there because we do not have HDFS. We know that this jar is on the Griffin docker image under the name measure-0.5.0-SNAPSHOT.jar (correct me if I'm wrong). We cannot figure out how to make use of this jar without HDFS.
Griffin is running on a docker image with only nodejs, postgresql and npm on there. This image is pointing at our spark and hive docker images.
Further, I think the above problem is caused by the default sparkProperties.json configuration.
Do I have to manually make changes to this file and build the jar, or is there another mechanism?
Could you help us make this connection?
Looking forward to your suggestion. We are really stuck.
Cheers,
Gaurav Malhotra
Lionel Liu
When livy submits a job to the spark cluster, it is submitted in cluster mode, which means any spark node might be the application driver. Therefore, each spark node should be able to access the files required by the job, including the jar package and hive-site.xml (required by spark to access your Hive). HDFS is a natural solution for global access from each spark node, and HDFS is usually supported in spark clusters; that's why we leverage it by default.
But in your case, there's no HDFS, maybe we can solve this issue like spark cluster, just put the necessary jar packages and files in the same path on each spark node, and update the path in the configuration file sparkProperties.json, then the files could be found by any node via the same path.
Gaurav Malhotra
Thanks for suggestion.
Short term approach
================
Change sparkProperties.json and also make sure griffin-measure.jar is present on all the spark worker nodes;
similarly with hive-site.xml.
Better approach
============
Add an s3 approach to Griffin by extending PropertiesConfig (the method below) and adding s3.
Please validate my thought process.
Thanks
Gaurav Malhotra
Since FileUtil uses:
I think a location like file://blah will work, or do we have to give the complete file path?
Lionel Liu
I think a complete file path would be better; I'm not sure what the relative path should be relative to, the jar package or the spark worker, maybe the latter.
Lionel Liu
The configuration looks good to me.
You can have a try about this, not sure it would work based on this document: https://community.hortonworks.com/articles/151164/how-to-submit-spark-application-through-livy-rest.html, it says that for livy:
The “application-jar” should be reachable by remote cluster manager, which means this “application-jar” should be put onto a distributed file system like HDFS. This is different from “spark-submit” because “spark-submit” also handles uploading jars from local disk, but Livy REST APIs doesn’t do jar uploading.
However, in this document: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_spark-component-guide/content/livy-batch-job.html, it seems like livy supports the same paths as spark submit shell.
You can have a try first, if not work, you can refer to the livy community.
Gaurav Malhotra
Thanks. We uploaded griffin-measure.jar to all the Spark worker nodes and gave the absolute path in sparkProperties.json.
The Spark job submit works, and in the Griffin UI it shows successful. But the Livy logs show error code 251 and no metrics.
I also see following warning in the logs:
19/02/04 12:35:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
19/02/04 12:35:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
19/02/04 12:35:26 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
data source timeRanges: source -> (1549197300000, 1549197300000], target -> (1549197300000, 1549197300000]
Also one more observation, which could be the cause of the problem: it's regarding hive-site.xml in sparkProperties.json.
I am not using yarn, so how do I configure hive-site.xml?
I seem to be a bit lost. Any pointers on how to debug this further?
Gaurav Malhotra
Last hurdle. Please validate my understanding. The griffin-measure.jar writes the metrics to the following sinks:
ConsoleSink
ElasticSearchSink
HdfsSink
MongoSink
The problem is, I don't see the metrics. I did configure ConsoleSink and was hoping to see the metrics on the console, but I don't.
I may sound a little silly; my problem is that I am not able to validate whether my integration is working or not.
Gaurav Malhotra
We got Griffin working end to end with Hive pointing to S3, running on the K8s cluster like a charm.
One last thing: currently I am hardcoding the sinks in the Measure class - sinksList = Arrays.asList("CONSOLE");
Question: how can we configure the sinks from outside?
Thanks
sujeeth
Gaurav, I am looking at doing the same thing you tried. Are you willing to open up the implementation on github, or help get us started in any way?