


SerDe is short for Serializer/Deserializer. A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format. Anyone can write a SerDe for their own data format. See Hive SerDe for an introduction to SerDes.

Built-in SerDes

  • Avro
  • ORC
  • RegEx
  • Thrift

Third-party SerDes

For JSON files, Amazon has provided a JSON SerDe available at:

...

  • The owner of an object (a row, a column, a subfield of a column, or the return value of a UDF) is the code that creates it, and the lifetime of an object expires when the corresponding object for the next row is created. This has two consequences:
    • We should not directly cache any object. In both group-by and join, we copy the object and then put the copy into a HashMap.
    • A SerDe, UDF, etc. can reuse the same object for the same column in different rows. That removes most of the object creations in the data pipeline, which significantly improves performance.
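
The ownership rule above can be illustrated with a minimal, self-contained Java sketch. It does not use Hive's actual classes: MutableRow, ReusingDeserializer, and both cache methods are hypothetical names invented for illustration. The sketch assumes a deserializer that overwrites one object in place for every row, and shows what a downstream cache sees with and without copying.

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseSketch {

    /** Hypothetical mutable holder standing in for an object a SerDe hands out. */
    static final class MutableRow {
        String value;

        /** Returns an independent copy that survives later reuse of this object. */
        MutableRow copy() {
            MutableRow c = new MutableRow();
            c.value = this.value;
            return c;
        }
    }

    /** Hypothetical deserializer that reuses a single MutableRow for every row. */
    static final class ReusingDeserializer {
        private final MutableRow reused = new MutableRow();

        MutableRow deserialize(String raw) {
            reused.value = raw; // overwrites in place; the previous row's contents are lost
            return reused;
        }
    }

    /** Caches the returned references directly, so every entry aliases one object. */
    static List<String> cacheWithoutCopy(String[] input) {
        ReusingDeserializer deserializer = new ReusingDeserializer();
        List<MutableRow> cache = new ArrayList<>();
        for (String raw : input) {
            cache.add(deserializer.deserialize(raw));
        }
        return values(cache);
    }

    /** Copies each object before caching it, so each entry keeps its own row. */
    static List<String> cacheWithCopy(String[] input) {
        ReusingDeserializer deserializer = new ReusingDeserializer();
        List<MutableRow> cache = new ArrayList<>();
        for (String raw : input) {
            cache.add(deserializer.deserialize(raw).copy());
        }
        return values(cache);
    }

    /** Reads out the current value held by each cached object. */
    private static List<String> values(List<MutableRow> rows) {
        List<String> out = new ArrayList<>();
        for (MutableRow row : rows) {
            out.add(row.value);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"a", "b", "c"};
        System.out.println(cacheWithoutCopy(input)); // prints [c, c, c]
        System.out.println(cacheWithCopy(input));    // prints [a, b, c]
    }
}
```

In this sketch the deserializer allocates one object no matter how many rows it processes, which is the saving the reuse rule aims for; the cost is that any code holding a row beyond the current iteration has to take its own copy first.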

...