Benefits of a native DateTime type
Currently, Pig users can only represent DateTime data as strings and must rely on UDFs that take DateTime strings. Considering the prevalence of DateTime data, a native DateTime type would be beneficial in several ways:
- Performance improvement
  - A more compact serialized format reduces the size of serialized data.
  - A dedicated comparator speeds up comparison.
  - A DateTime value is converted from a string only once, even when it is processed by multiple DateTime UDFs.
- More intuitive for users
  - UDF writers no longer need to decide on the input DateTime string format. Currently every DateTime UDF has to handle the conversion itself, which is redundant work.
  - We may overload operators (+, -, ==, !=, <, >) to make the DateTime type more convenient.
- String sort is not reliable for DateTime data, especially when DateTime values are encoded differently.
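To illustrate the last point, here is a minimal sketch of how a lexicographic sort and a chronological sort can disagree when two ISO 8601 strings carry different UTC offsets. It uses java.time from the JDK as a stand-in for Joda-Time so that it runs without extra dependencies; the class and method names are hypothetical and not part of Pig.

```java
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StringSortDemo {
    // Sort the raw strings lexicographically, as a plain string comparison would.
    static List<String> sortAsStrings(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Parse each ISO 8601 string and sort by the instant it denotes.
    static List<String> sortAsInstants(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing((String s) -> OffsetDateTime.parse(s).toInstant()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
            "2012-01-01T23:00:00-08:00", // the same moment as 2012-01-02T07:00:00Z
            "2012-01-02T01:00:00Z");     // six hours earlier than the value above
        System.out.println("string order:        " + sortAsStrings(values));
        System.out.println("chronological order: " + sortAsInstants(values));
    }
}
```

The string sort puts the -08:00 value first because its date digits are smaller, although it denotes the later moment; sorting by the parsed instant reverses the order.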
Proposal for Pig DateTime type
- Memory structure
We will use the Joda-Time DateTime class as the in-memory data structure. Joda DateTime is more powerful than java.util.Date and much easier to use than java.util.Calendar. The current Pig DateTime UDFs already use Joda DateTime.
- Serialized format
A DateTime is serialized to 8 bytes: the long value of milliseconds since 1970-01-01T00:00:00Z.
String input will be parsed as ISO 8601 by default (which is already the case in the DateTime UDFs).
The following UDFs will initially be supported as builtins (including most of the DateTime UDFs currently supported in piggybank):
- int DiffDate(DateTime d1, DateTime d2)
- int YearsBetween(DateTime d1, DateTime d2)
- int MonthsBetween(DateTime d1, DateTime d2)
- int DaysBetween(DateTime d1, DateTime d2)
- int HoursBetween(DateTime d1, DateTime d2)
- int MinutesBetween(DateTime d1, DateTime d2)
- int SecondsBetween(DateTime d1, DateTime d2)
- int GetYear(DateTime d1)
- int GetMonth(DateTime d1)
- int GetDate(DateTime d1)
- int GetHour(DateTime d1)
- int GetMinute(DateTime d1)
- int GetSecond(DateTime d1)
- DateTime DateAdd(DateTime d1)
- DateTime ToDate(String s)
- DateTime ToDate(String s, String format)
- DateTime ToDate(String s, String format, String timezone)
- DateTime ToDate(long t)
- String ToString(DateTime d)
- String ToString(DateTime d, String format)
- long ToUnixTime(DateTime d)
- Date operations
We overload the +, -, ==, !=, <, and > operators for the DateTime type.
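The serialized format described above (one long holding milliseconds since the epoch) can be sketched as follows. This is only an illustration under assumptions: it uses java.time.Instant from the JDK in place of Joda DateTime, encodes the long in big-endian order, and carries no timezone information, since the text mentions only the millisecond value. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.time.Instant;

class DateTimeSerDeSketch {
    // Encode the instant as a single big-endian long: milliseconds since the epoch.
    static byte[] serialize(Instant value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value.toEpochMilli()).array();
    }

    // Decode the 8 bytes back into an instant with millisecond precision.
    static Instant deserialize(byte[] bytes) {
        return Instant.ofEpochMilli(ByteBuffer.wrap(bytes).getLong());
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2012-03-04T05:06:07.890Z");
        byte[] bytes = serialize(original);
        System.out.println(bytes.length + " bytes, round trip ok: "
            + deserialize(bytes).equals(original));
    }
}
```

For comparison, the ISO 8601 string for the same value takes 24 characters, so the fixed 8-byte form is noticeably more compact.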
Summary of changes
Mostly described in Alan's comment on PIG-1314: https://issues.apache.org/jira/browse/PIG-1314?focusedCommentId=12848285&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12848285, with minor changes:
- Add support in the parser, both for declaring an input to be of type datetime and for datetime constants
- Don't allow implicit type cast from/to DateTime
- Change the LoadCaster/StoreCaster interfaces to include bytesToDateTime/toBytes(DateTime) methods, and add the methods to the default implementations
- Determine which builtin UDFs we want for datetime and get agreement from the community. Implement these UDFs.
- Implement any allowed cast operators for datetime (probably just string <-> datetime).
- Implement a datetime class that represents datetime in memory. This needs to implement WritableComparable so that it can be serialized and compared in Hadoop
- Implement a raw comparator for the type so it can be used as a key in group bys and joins.
- Change physical operators and builtin UDFs to handle processing of datetime types.
- Change data conversion and type discovery routines in DataType
- Change BinInterSedes and BinInterSedesTupleRawComparator to include DateTime
- And, of course, add prolific tests
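As a rough illustration of the raw comparator item above, the sketch below compares two serialized values directly from their byte ranges. It assumes the 8-byte big-endian millisecond encoding from the proposal and mirrors the parameter shape of Hadoop's RawComparator.compare without depending on Hadoop; the class and helper names are hypothetical. The long is decoded before comparing because an unsigned byte-by-byte comparison would order timestamps before 1970 (negative values) after later ones.

```java
import java.nio.ByteBuffer;

class DateTimeRawComparatorSketch {
    // Helper for the demo: encode epoch milliseconds as 8 big-endian bytes.
    static byte[] encode(long epochMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(epochMillis).array();
    }

    // Compare two serialized values given as (buffer, start offset, length) ranges.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        long left = ByteBuffer.wrap(b1, s1, l1).getLong();
        long right = ByteBuffer.wrap(b2, s2, l2).getLong();
        return Long.compare(left, right);
    }

    public static void main(String[] args) {
        byte[] before1970 = encode(-1000L);
        byte[] after1970 = encode(1000L);
        // A negative result means the first value sorts before the second.
        System.out.println(compare(before1970, 0, 8, after1970, 0, 8) < 0);
    }
}
```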
Backward-incompatible changes
- LoadCaster: add bytesToDateTime(byte[] b)
- StoreCaster: add toBytes(DateTime d)
DateTime supports millisecond precision. Is that precision appropriate?
Users will need to set the default timezone in the config file. To override it, a user can call the ToDate UDF with an explicit timezone when converting a string to DateTime.
- Input/Output format
The default will be ISO 8601. Users can call the ToDate UDF with a format string to convert a string in another format to DateTime. Do we need a way to override the default encoding?
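The interplay between a configured default timezone and an explicit override could look roughly like the sketch below. It is an assumption-laden illustration, not the proposed implementation: java.time stands in for Joda-Time (their pattern letters are similar but not identical), DEFAULT_TIMEZONE is a hypothetical placeholder for the value read from the config file, and the two toDate methods only imitate the ToDate(String, String) and ToDate(String, String, String) signatures listed earlier.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class ToDateSketch {
    // Hypothetical stand-in for the default timezone taken from the config file.
    static final String DEFAULT_TIMEZONE = "UTC";

    // Imitates ToDate(String s, String format): interpret the string in the default timezone.
    static Instant toDate(String s, String format) {
        return toDate(s, format, DEFAULT_TIMEZONE);
    }

    // Imitates ToDate(String s, String format, String timezone): the caller overrides the timezone.
    static Instant toDate(String s, String format, String timezone) {
        LocalDateTime local = LocalDateTime.parse(s, DateTimeFormatter.ofPattern(format));
        return local.atZone(ZoneId.of(timezone)).toInstant();
    }

    public static void main(String[] args) {
        String format = "yyyy-MM-dd HH:mm:ss";
        // The same wall-clock text denotes different instants depending on the timezone.
        System.out.println(toDate("2012-01-01 12:00:00", format));
        System.out.println(toDate("2012-01-01 12:00:00", format, "Asia/Tokyo"));
    }
}
```

With the default of UTC the text is read as noon UTC, while the Asia/Tokyo override yields an instant nine hours earlier.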