What Griffin should focus on

Griffin is a generic framework that enables users to measure and monitor data quality in an easy and extensible manner.

...

Griffin

...

What are the pain points we are facing in the current version of Griffin, from an architectural perspective?

...

0.7.0 problems

...

  - Incomplete and Inflexible Data Quality Definition: The current definition of data quality lacks completeness and flexibility. A comprehensive data quality rule should encompass recording metrics, anomaly detection, and actionable steps such as alerting (a sketch of such a rule follows this list).

  - Rigid Triggering Mechanism: The triggering mechanism for measures is rigid. It needs to integrate seamlessly and deeply with the schedulers used in enterprise production environments.

  - Over-Reliance on Internal Data Comparison: The measure implementation depends too heavily on its own data comparison methods and neglects the optimization capabilities inherent in the engine. We need to leverage the engine's optimizer more effectively and focus on data quality benchmarks rather than on query optimization.

  - Configurability of the Gateway: To enhance flexibility, the gateway between Apache Griffin and the engine should be configurable, ensuring compatibility with popular gateways such as Trino, Kyuubi, etc.

  - Lack of Default Alert Channels: There is currently a deficit of default alert channels. Providing default channels such as Slack, WeChat, etc. is essential to ensure timely communication of alerts.

  - Absence of an Anomaly Detection Module: An anomaly detection module is conspicuously absent. At present our thresholds are statically configured, indicating a need for dynamic anomaly detection capabilities.
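
As an illustration of the first pain point, the sketch below shows what a more complete rule definition could look like, bundling metric recording, detection, and alerting into one unit. The class and field names (DqRule, MetricSpec, DetectionSpec, AlertSpec) are hypothetical and not part of Griffin 0.7.0; this is only a sketch of the intended shape, written in Python for brevity.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class MetricSpec:
        name: str      # what to record, e.g. a null rate
        sql: str       # how to compute it on the engine

    @dataclass
    class DetectionSpec:
        method: str                        # "static_threshold" or "anomaly_detection"
        threshold: Optional[float] = None  # only used for static thresholds

    @dataclass
    class AlertSpec:
        channels: List[str] = field(default_factory=list)  # e.g. ["slack", "wechat"]

    @dataclass
    class DqRule:
        # One rule covers recording, checking, and acting, instead of metrics only.
        metric: MetricSpec
        detection: DetectionSpec
        alert: AlertSpec

    rule = DqRule(
        metric=MetricSpec(
            name="order_id_null_rate",
            sql="SELECT SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) * 1.0"
                " / COUNT(*) FROM orders",
        ),
        detection=DetectionSpec(method="static_threshold", threshold=0.01),
        alert=AlertSpec(channels=["slack"]),
    )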



Next generation Griffin architecture considerations

As one mission of Griffin is to reduce MTTD (mean time to detect), the next generation architecture should take the following into consideration:

  • During the define phase, the next generation architecture should use more expressive rules to define data quality requirements. SQL-based rules are a good candidate for defining data quality: they are abstract yet concrete. Abstract, so that we can dispatch data quality rules to different query engines; concrete, so that all data quality stakeholders can understand the rules and align easily (a sketch of a SQL-based rule flowing through the pipeline stages follows this list).
  • During the define phase, data quality should be defined uniformly across different scenarios such as batch, near real-time, and real-time.
  • During the measure phase, the next generation Griffin should standardize the data quality pipeline into distinct stages: recording/collecting, checking/evaluating, and alerting. This makes it easy for different data platform teams to integrate with Griffin at any of these stages.
  • During the measure collect phase, the next generation Griffin should not be coupled to any particular query engine; it should be able to dispatch/route requests to different query engines (Spark, Hive, Flink, Presto) based on the data quality rules or data catalogs involved.
  • During the measure evaluate phase, the next generation Griffin should support different scheduling strategies, such as event-based or time-based triggers.
  • During the analyze/evaluate phase, the next generation Griffin should provide standardized solutions such as anomaly detection algorithms, since in most cases the related stakeholders need our support to define what an anomaly is.
  • Last but not least, the next generation Griffin should provide data quality reports/scorecards for different levels of requirements.
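
To make the staged pipeline above concrete, here is a minimal sketch of one SQL-based rule flowing through the collect, evaluate, and alert stages with a pluggable engine. The engine stubs, function names (collect, evaluate, alert, run_rule), and the threshold value are assumptions for illustration only, not an existing Griffin API; a real deployment would call spark.sql, a Trino/Kyuubi gateway, and real alert channels instead of the stubs.

    from typing import Callable, Dict

    # Stage 1: record/collect -- run the rule's SQL on whichever engine the rule
    # or data catalog points to (Spark, Hive, Flink, Presto/Trino, ...).
    # Stubbed with fixed values so the sketch is self-contained.
    ENGINES: Dict[str, Callable[[str], float]] = {
        "spark": lambda sql: 0.003,   # real impl: spark.sql(sql).first()[0]
        "trino": lambda sql: 0.003,   # real impl: a JDBC/HTTP gateway call
    }

    NULL_RATE_SQL = (
        "SELECT SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) "
        "FROM orders"
    )

    def collect(engine: str, sql: str) -> float:
        return ENGINES[engine](sql)

    # Stage 2: check/evaluate -- a static threshold here; an anomaly detection
    # module would plug in at this same point.
    def evaluate(value: float, threshold: float) -> bool:
        return value <= threshold

    # Stage 3: alert -- default channels (Slack, WeChat, ...) would hang off here.
    def alert(channel: str, message: str) -> None:
        print(f"[{channel}] {message}")

    def run_rule(engine: str = "spark") -> None:
        null_rate = collect(engine, NULL_RATE_SQL)
        if not evaluate(null_rate, threshold=0.01):
            alert("slack", f"order_id null rate {null_rate:.2%} breached the threshold")

    run_rule("spark")   # the same rule would dispatch to Trino via run_rule("trino")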

Next generation Griffin architecture proposal


Griffin Cluster Architecture
