- What is Avro?
- How can I get started quickly with Avro?
- How do I statically compile a schema or protocol into generated code?
- How are Strings represented in Java?
- More generally, how do Avro types map to Java types?
- What is the purpose of the sync marker in the object file format?
- Why isn't every value in Avro nullable?
What is Avro?
Avro is a data serialization system. Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages (see the sketch after this list).
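To illustrate that last point, here is a minimal sketch of writing data with the generic Java API, with no generated classes involved (the schema, field names, and output path are made up for the example):

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class GenericWriteExample {
        public static void main(String[] args) throws IOException {
            // Hypothetical schema declared inline as JSON; no code generation needed.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            // Build a record dynamically against that schema.
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 42);

            // Append the record to an Avro container file.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }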
How can I get started quickly with Avro?
Check out the Quick Start Guide.
How do I statically compile a schema or protocol into generated code?
- Add the avro jar and its dependency jars (e.g., the Jackson jars) to your CLASSPATH.
- Run java org.apache.avro.specific.SpecificCompiler <json file>.
- Note: this appears to be out of date; the SpecificCompiler now requires two arguments, presumably an input file and an output file, but it isn't clear that this works.
Lastly, you can also use the "avro-tools" jar which ships with an Avro release. Just use the "compile (schema|protocol)" command.
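For example, a rough sketch of the avro-tools invocation (the jar's exact name depends on the release you download, and user.avsc, mail.avpr, and generated-src/ are hypothetical input and output paths):

    java -jar avro-tools-<version>.jar compile schema user.avsc generated-src/
    java -jar avro-tools-<version>.jar compile protocol mail.avpr generated-src/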
How are Strings represented in Java?
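As a rough sketch of the usual behavior: the generic Java API commonly hands string data back as org.apache.avro.util.Utf8, a CharSequence, rather than java.lang.String, so callers convert with toString() when a String is actually required. The record and field name below are hypothetical:

    import org.apache.avro.generic.GenericRecord;

    public class StringFieldAccess {
        // A string field read through the generic API may arrive as
        // org.apache.avro.util.Utf8 (a CharSequence) rather than java.lang.String;
        // toString() converts it when a String is actually needed.
        static String fieldAsString(GenericRecord record, String fieldName) {
            Object value = record.get(fieldName);
            return value == null ? null : value.toString();
        }
    }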
More generally, how do Avro types map to Java types?
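As a sketch of the common generic-API mapping (the schema and field names are invented for illustration): Avro primitives map to the matching Java primitives, string to CharSequence, bytes to java.nio.ByteBuffer, array to a java.util.Collection, map to java.util.Map, and record/enum/fixed to the generic (or generated) classes:

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class TypeMappingSketch {
        public static void main(String[] args) {
            // Hypothetical record schema touching the common types.
            Schema schema = SchemaBuilder.record("Example").fields()
                .requiredBoolean("flag")   // Avro boolean -> Java boolean
                .requiredInt("count")      // Avro int     -> Java int
                .requiredLong("id")        // Avro long    -> Java long
                .requiredFloat("ratio")    // Avro float   -> Java float
                .requiredDouble("score")   // Avro double  -> Java double
                .requiredString("name")    // Avro string  -> CharSequence (Utf8 or String)
                .requiredBytes("payload")  // Avro bytes   -> java.nio.ByteBuffer
                .name("tags").type().array().items().stringType().noDefault()  // array -> Collection
                .name("attrs").type().map().values().stringType().noDefault()  // map -> java.util.Map
                .endRecord();

            // Populate a generic record with the corresponding Java values.
            GenericRecord r = new GenericData.Record(schema);
            r.put("flag", true);
            r.put("count", 1);
            r.put("id", 1L);
            r.put("ratio", 0.5f);
            r.put("score", 0.5);
            r.put("name", "example");
            r.put("payload", ByteBuffer.wrap(new byte[] {1, 2, 3}));
            r.put("tags", Arrays.asList("a", "b"));
            Map<String, String> attrs = new HashMap<>();
            attrs.put("k", "v");
            r.put("attrs", attrs);
        }
    }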
What is the purpose of the sync marker in the object file format?
From Doug Cutting:
HDFS splits files into blocks, and MapReduce runs a map task for each block. When the task starts, it needs to be able to seek into the file to the start of its block and process through the block's end. If the file were, e.g., a gzip file, this would not be possible, since gzip files must be decompressed from the start; one cannot seek into the middle of a gzip file and start decompressing. So Hadoop's SequenceFile places a marker periodically (~64k) in the file at record and compression boundaries, where processing can sensibly be started. Then, when a map task starts processing an HDFS block, it finds the first marker after the block's start and processes records through the first marker in the next block of the file. This requires a small amount of non-local access (~0.1%). Avro's data file uses the same method as SequenceFile.
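To make the mechanism concrete, here is a hedged sketch (the file path and split boundaries are placeholders) of restricting a reader to one byte range of an Avro data file by way of the sync markers, using DataFileReader's sync() and pastSync():

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class SplitScan {
        // Count the records "belonging" to the byte range [start, end), the way a
        // map task handles one HDFS block; path, start and end are assumptions
        // supplied by the caller.
        static long countRecordsInSplit(File path, long start, long end) throws IOException {
            long count = 0;
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(path, new GenericDatumReader<GenericRecord>())) {
                reader.sync(start);      // seek to the first sync marker after the split start
                while (reader.hasNext() && !reader.pastSync(end)) {
                    reader.next();       // process records up to the first marker past the split end
                    count++;
                }
            }
            return count;
        }
    }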
Why isn't every value in Avro nullable?
When serialized, if any value may be null then it must be noted whether that value is null or not, adding at least a bit to the size of every value stored and corresponding computational costs to create this bit on write and interpret it on read. These costs are wasted when values may not in fact be null, as is the case in many datasets. In Avro such costs are only paid when values may actually be null.
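A small sketch of how this looks in a schema (the record and field names are invented): nullability is declared per field as a union with "null", so only the fields declared that way pay the per-value cost of recording which branch was written:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class NullabilitySketch {
        public static void main(String[] args) {
            Schema schema = SchemaBuilder.record("User").fields()
                .requiredString("name")      // never null; no per-value null indicator stored
                .optionalString("nickname")  // union ["null","string"]; each value records its branch
                .endRecord();
            System.out.println(schema.toString(true));
        }
    }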
Also, allowing values to be null is a well-known source of errors. In Avro, a value declared as non-null will always be non-null; programs need not test for null values when processing it, nor will they ever fail for lack of such tests.
Tony Hoare calls his invention of null references his "Billion Dollar Mistake".
Also note that in some programming languages not all values are permitted to be null. For example, in Java, values of type boolean, byte, short, char, int, float, long, and double may not be null.