...

Because the Avro and Pig data models differ, AvroStorage has the following limitations:

  • Limited support for "record": we do not support recursively defined records because the number of fields in such records is data dependent. For instance, _{"type":"record","name":"LinkedListElem","fields":[{"name":"data","type":"int"},{"name":"next","type":["null","LinkedListElem"]}]}_;
  • Limited support for "union": we only accept nullable unions like _["null", "some-type"]_.
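
The two restrictions above can be illustrated with a small check over an Avro schema that has been parsed from JSON into Python objects. This is only a sketch and not AvroStorage's own validation code: the function names are hypothetical, namespaces are ignored, and it assumes the "null" branch of a nullable union may appear in either position.

```python
def is_nullable_union(schema):
    """Return True for a two-branch union with exactly one "null" branch."""
    return (
        isinstance(schema, list)
        and len(schema) == 2
        and sum(1 for branch in schema if branch == "null") == 1
    )


def is_recursive(schema, enclosing=frozenset()):
    """Return True if a record refers by name to itself or to an enclosing record."""
    if isinstance(schema, str):
        # A bare string is either a primitive type or a reference to a named type.
        return schema in enclosing
    if isinstance(schema, list):
        # A union is recursive if any of its branches is.
        return any(is_recursive(branch, enclosing) for branch in schema)
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            inner = enclosing | {schema["name"]}
            return any(is_recursive(f["type"], inner) for f in schema["fields"])
        if kind == "array":
            return is_recursive(schema["items"], enclosing)
        if kind == "map":
            return is_recursive(schema["values"], enclosing)
        if isinstance(kind, str):
            # Covers forms such as {"type": "int"} or {"type": "SomeRecord"}.
            return kind in enclosing
    return False


linked_list = {
    "type": "record",
    "name": "LinkedListElem",
    "fields": [
        {"name": "data", "type": "int"},
        {"name": "next", "type": ["null", "LinkedListElem"]},
    ],
}
print(is_recursive(linked_list))
print(is_nullable_union(["null", "int"]))
```

With the LinkedListElem schema from the text, the "next" field refers back to the enclosing record, which is what makes the schema recursive.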

For simplicity, we also make the following assumption:

...

Users can choose not to provide any parameters to AvroStorage, in which case the Avro schema of the output data is derived from its Pig schema. This may result in undesirable schemas because of discrepancies between the Pig and Avro data models or problems in Pig itself:

...

  • The derived Avro schema will wrap each (nested) field with a nullable union because Pig allows NULL values for every type and Avro doesn't. For instance, when you read in Avro data of schema _"boolean"_ and store it using AvroStorage(), you will get _["null","boolean"]_.
  • The derived Avro schema may contain unwanted tuple wrappers because: 1) Pig only generates tuples; 2) items of Pig bags can only be tuples. AvroStorage can automatically get rid of such wrappers, but sometimes you will still see them, as in example B.
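
The nullable wrapping described in the first point can be sketched as follows. This is an illustration of the idea rather than AvroStorage's actual conversion code: the function names are hypothetical, the schema is assumed to be already parsed from JSON, and wrapping array items and map values in addition to record fields is an assumption made for the sketch.

```python
def wrap_nullable(schema):
    """Wrap a schema, and everything nested inside it, in a nullable union."""
    if isinstance(schema, list):
        # Already a union: Avro does not allow a union directly inside a union,
        # so keep it flat and make sure "null" is one of the branches.
        branches = [_wrap_nested(branch) for branch in schema if branch != "null"]
        return ["null"] + branches
    return ["null", _wrap_nested(schema)]


def _wrap_nested(schema):
    """Wrap the nested parts of a schema without wrapping the schema itself."""
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            fields = [dict(f, type=wrap_nullable(f["type"])) for f in schema["fields"]]
            return dict(schema, fields=fields)
        if kind == "array":
            return dict(schema, items=wrap_nullable(schema["items"]))
        if kind == "map":
            return dict(schema, values=wrap_nullable(schema["values"]))
    # Primitives, enums, fixed types and named references have nothing nested.
    return schema


print(wrap_nullable("boolean"))  # ['null', 'boolean']
```

Applied to a record, the same function makes the record itself nullable and then each of its fields, which is why a schema that passes through Pig unchanged can still come out with many more unions than it went in with.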

...

  • field<n> notnull
    This indicates that the nth field (and its nested fields) in the output tuple is notnull.
  • data path
  • field<n> def:name
    Users can provide predefined schemas in Avro files using option --data path, where _path_ points to a directory of Avro files or a single Avro file. This is used together with field parameter field<n> def:name. AvroStorage internally constructs two maps, map[typeName]=>schema and map[fieldName]=>schema, and users can specify which schema to use by providing the corresponding _name_. This option is useful when users want to do simple processing of input data (like filtering and projection) and store it using predefined schemas in input. Please refer to example C for more details.
  • field<n> str
    Users can directly specify the schema of field n where str is a string representation of Avro schema. The usage of this option is similar to schema str except that the schema is only applied to the nth field.
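
The two lookup maps mentioned for the def:name option can be pictured with the sketch below. It is not AvroStorage's internal code: the function name, the small Event schema and the dotted naming of nested fields are assumptions chosen for illustration, and the schema is assumed to be already parsed from JSON.

```python
def build_maps(schema, type_map=None, field_map=None, prefix=""):
    """Collect named types and field schemas from a parsed Avro schema.

    Returns (type_map, field_map): type name -> schema and field path -> schema.
    """
    type_map = {} if type_map is None else type_map
    field_map = {} if field_map is None else field_map
    if isinstance(schema, list):
        # Union: look inside every branch.
        for branch in schema:
            build_maps(branch, type_map, field_map, prefix)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in ("record", "enum", "fixed"):
            type_map[schema["name"]] = schema
        if kind == "record":
            for field in schema["fields"]:
                path = prefix + field["name"]
                field_map[path] = field["type"]
                build_maps(field["type"], type_map, field_map, path + ".")
        elif kind == "array":
            build_maps(schema["items"], type_map, field_map, prefix)
        elif kind == "map":
            build_maps(schema["values"], type_map, field_map, prefix)
    return type_map, field_map


# Hypothetical schema, used only to show the shape of the two maps.
event = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "page", "type": ["int", "null"]},
        {"name": "detail", "type": {
            "type": "record",
            "name": "Detail",
            "fields": [{"name": "id", "type": "int"}],
        }},
    ],
}
type_map, field_map = build_maps(event)
print(sorted(type_map))   # ['Detail', 'Event']
print(sorted(field_map))  # ['detail', 'detail.id', 'page']
```

A _name_ given through def:name would then be looked up in one of these maps; how AvroStorage resolves a name that appears in both is not covered by this sketch.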

...

|| type name || schema ||
| ImpressionSetEvent | the whole schema |
| ImpressionDetailsRecord | {"type": "record", "name": "ImpressionDetailsRecord", "fields": [{"name":"itemId", "type":"int"}, {"name":"itemType", "type":{"type":"enum", "name":"ItemType","symbols":["person", "job", "group", "company", "nus", "news", "ayn"]}}, {"name":"details","type":{"type":"map","values":"string"}}]} |
| ItemType | {"type":"enum", "name":"ItemType","symbols":["person", "job", "group", "company", "nus", "news", "ayn"]} |

The other maps field names to schemas:

|| field name || schema ||
| pageNumber | ["int", "null"] |
| impressionDetails | ImpressionDetailsRecord |
| impressionDetails.id | int |
| impressionDetails.type | ItemType |
| impressionDetails.details | {"type":"map","values":"string"} |

...

This documentation was originally written by Lin Guo, and appeared at http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data