Goals
Provide a good user experience and feature set for dataflows that involve Avro-formatted data, including the ability to easily view, edit, split, combine, and route such data.
Background and strategic fit
Avro is increasingly common in and around Big Data projects. We should build a content viewer for Avro data that lets a user inspect the content of a given Avro message based on its schema. We should also provide a mechanism to manipulate the content of Avro messages, both to insert or update values and to perform schema evolution or transformation. Avro data tends to arrive in bundles, so splitting a bundle is useful for handling and routing individual messages. The reverse is also useful: merging Avro messages that share a like schema. Finally, being able to run queries against Avro data to make routing decisions is valuable and, given the JSON-based schema design, quite doable.
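As a rough illustration of the query-then-route idea, the sketch below evaluates a dotted path against a record that has already been decoded into plain Python dictionaries and lists. The path syntax, the `evaluate_path` and `route` helpers, and the sample record are all hypothetical; they are not an existing NiFi or Avro API, and a real processor would first decode the record with an Avro library.

```python
from typing import Any, Dict

# Sentinel so that a stored None is distinguishable from "no such field".
_MISSING = object()


def evaluate_path(record: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'user.address.country' or 'tags.1'.

    Dict keys are matched by name and list elements by numeric index.
    Returns `default` when any step of the path cannot be resolved.
    """
    current = record
    for part in path.split("."):
        if isinstance(current, dict):
            current = current.get(part, _MISSING)
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            current = _MISSING
        if current is _MISSING:
            return default
    return current


def route(record: Any, path: str, routes: Dict[Any, str], unmatched: str = "unmatched") -> str:
    """Return the relationship name whose expected value equals the value at `path`."""
    value = evaluate_path(record, path)
    for expected, relationship in routes.items():
        if value == expected:
            return relationship
    return unmatched


# Hypothetical decoded record, used only to exercise the helpers.
sample_record = {
    "user": {"name": "ada", "address": {"country": "UK"}},
    "tags": ["a", "b"],
}
sample_relationship = route(
    sample_record, "user.address.country", {"UK": "europe", "US": "americas"}
)
```

Keeping path evaluation separate from the routing decision mirrors the split between extracting a value and acting on it, so the extracted value could equally be written to an attribute instead of being routed on directly.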
Assumptions
Requirements
# | Title | User Story | Importance | Notes |
---|---|---|---|---|
1 | Convert to Avro | Convert common data formats, such as CSV and JSON, to Avro. | N/A | |
2 | Convert from Avro | Convert from Avro to common data formats, such as CSV, XML, and JSON. | Medium | |
3 | Convert Between Avro Schemas | Convert Avro records from their original schema to a destination schema, allowing for user-defined field mappings. | N/A | |
4 | Merge Avro Files | Merge Avro records with compatible schemas into a single file so that appropriately sized files can be delivered to downstream systems such as HDFS. Support semantics similar to the existing MergeContent processor, such as merging based on size, time, number of entries, etc. | High | |
5 | Split Avro Files | Split an Avro file with multiple records into individual files so that each record can be processed independently by downstream processors. An example of downstream processing would be routing based on the value of a field in a given record. | High | |
6 | Extract Schema Fingerprint | Extract the schema fingerprint of a given Avro file so that downstream processors can make decisions based on the schema, such as when merging together records of compatible schemas (i.e., as the correlation attribute). | Medium | |
7 | Evaluate Avro Paths | Evaluate a set of Avro paths against an incoming file, and extract the results to FlowFile attributes or to the content of the FlowFile, similar to EvaluateJson. This would allow downstream processors, such as RouteOnAttribute, to easily make decisions based on values in an Avro record. | High | |
8 | Update Avro Records | Modify Avro records by inserting, updating, or removing fields. | Medium | |
9 | Avro Content Viewer | Provide the ability to view an Avro record based on its schema when clicking to view the content from a provenance event. | Medium | |
User interaction and design
Questions
Below is a list of questions to be addressed as a result of this requirements document:
Question | Outcome |
---|---|