
Target release
Epic
Document status: DRAFT
Document owner

Joe Witt

Designer
Developers
QA

Goals

  • Provide a good user experience and feature set for dataflows involving Avro-formatted data, including the ability to easily view, edit, split, combine, and route such data.

Background and strategic fit

Usage of Avro in and around Big Data projects is increasingly common.  We should build a content viewer for Avro data that lets a user inspect the content of a given Avro message based on its schema.  We should also provide a mechanism to manipulate the content of Avro messages, both to insert or update values and to perform schema evolution or transformation.  Avro data tends to arrive in bundles, so splitting those bundles is useful for individual message handling and routing.  The reverse is also true: it is useful to merge Avro messages that share a compatible schema.  Finally, being able to run queries against Avro data to make routing decisions is valuable, and the JSON-based schema design makes this quite doable.
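As a minimal illustration of why the JSON-based schema design makes viewing and routing tractable, the sketch below parses a hypothetical Avro schema with an ordinary JSON library and makes a routing decision on a decoded record (shown here as a plain dict; field names and the routing rule are invented for the example):

```python
import json

# Avro schemas are plain JSON documents, so any JSON library can
# parse and inspect them -- no special tooling is required just to
# learn the field names and types of a record.
schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
""")

# Read field names and types directly off the parsed schema; this is
# the basis for a content viewer or a schema-aware editor.
field_types = {f["name"]: f["type"] for f in schema["fields"]}
print(field_types)  # {'name': 'string', 'favorite_number': ['int', 'null']}

# A decoded record can then drive a routing decision, analogous to a
# RouteOnAttribute-style choice between downstream relationships.
record = {"name": "alice", "favorite_number": 7}
route = "numbered" if record["favorite_number"] is not None else "unnumbered"
print(route)  # numbered
```

The same pattern generalizes: once the schema is parsed, any field named in it can be extracted from a record and used for routing or attribute extraction.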

Assumptions

Requirements

1. Convert to Avro: Convert common data formats, such as CSV and JSON, to Avro. Importance: N/A.
  • Existing functionality in kite-bundle.
2. Convert from Avro: Convert from Avro to common data formats, such as CSV, XML, and JSON. Importance: Medium.
3. Convert Between Avro Schemas: Convert Avro records from an original schema to a destination schema, allowing for user-defined field mappings. Importance: N/A.
  • Existing functionality in kite-bundle.
4. Merge Avro Files: Merge Avro records with compatible schemas into a single file so that appropriately sized files can be delivered to downstream systems such as HDFS. Support semantics similar to the existing MergeContent processor, such as merging based on size, time, number of entries, etc. Importance: High.
5. Split Avro Files: Split an Avro file with multiple records into individual files so that each record can be processed independently by downstream processors. An example of downstream processing would be routing based on the value of a field in a given record. Importance: High.
6. Extract Schema Fingerprint: Extract the schema fingerprint of a given Avro file so that downstream processors can make decisions based on the schema, such as when merging records of compatible schemas (i.e., as the correlation attribute). Importance: Medium.
7. Evaluate Avro Paths: Evaluate a set of Avro paths against an incoming file and extract the results to FlowFile attributes, or to the content of the FlowFile, similar to EvaluateJson. This would allow downstream processors, such as RouteOnAttribute, to easily make decisions based on values in an Avro record. Importance: High.
8. Update Avro Records: Modify Avro records by inserting, updating, or removing fields. Importance: Medium.
9. Avro Content Viewer: Provide the ability to view an Avro record based on its schema when clicking to view the content from a provenance event. Importance: Medium.
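To make the schema-fingerprint requirement concrete, here is a simplified sketch of deriving a fingerprint usable as a merge correlation attribute. Note this is a stand-in: the Avro specification defines "Parsing Canonical Form" (precise rules for name qualification, attribute ordering, and stripping non-structural attributes) and standard fingerprint algorithms such as CRC-64-AVRO; a real implementation should follow that spec rather than the key-sorted JSON normalization used below.

```python
import hashlib
import json

def simplified_fingerprint(schema_json: str) -> str:
    """Hash a normalized rendering of an Avro schema.

    Simplified sketch only: a JSON round-trip with sorted keys and no
    whitespace gives a deterministic rendering of the same logical
    schema, which we then hash with SHA-256. Avro's real Parsing
    Canonical Form applies stricter normalization rules.
    """
    normalized = json.dumps(
        json.loads(schema_json), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same logical schema written two different ways:
a = '{"type": "record", "name": "User", "fields": [{"name": "id", "type": "long"}]}'
b = '{"name":"User","fields":[{"type":"long","name":"id"}],"type":"record"}'

# Equivalent schemas yield the same fingerprint, so the fingerprint can
# serve as the correlation attribute when merging like-schema records.
print(simplified_fingerprint(a) == simplified_fingerprint(b))  # True
```

In a flow, such a fingerprint would be written to a FlowFile attribute by the extraction processor, and the merge processor would then group incoming records on that attribute.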

User interaction and design

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome

Not Doing
