SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats. See Hive SerDe for an introduction to SerDes.
For JSON files, Amazon has provided a JSON SerDe available at:
- The owner of an object (either a row, a column, a sub field of a column, or the return value of a UDF) is the code that creates it, and the life time of an object expires when the corresponding object for the next row is created. That means several things:
- We should not directly cache any object. In both group-by and join, we copy the object and then put it into a hashmapHashMap.
- SerDe, UDF, etc can reuse the same object for the same column in different rows. That means we can get rid of most of the object creations in the data pipeline, which is a huge performance boost.