...

For syntax descriptions of these statements, see the Trafodion SQL Reference Manual (pdf, 3.98 MB). The comparison below highlights the differences between these statements to help you choose the appropriate one for a given requirement.

Transaction
  • INSERT: Yes.
  • UPSERT: Yes.
  • UPSERT USING LOAD: No; uses the HBase WAL for recovery.
  • LOAD: No; uses a snapshot for recovery.

Method of operation
  • INSERT: Uses the standard HBase write path through its CheckAndPut call. Rows are held in transaction coprocessor memory until the transaction is committed.
  • UPSERT: Uses the standard HBase write path through its Put call. Rows are held in transaction coprocessor memory until the transaction is committed.
  • UPSERT USING LOAD: Uses the standard HBase write path through its Put call.
  • LOAD: Uses the HBase bulk load write path and creates HFiles directly, bypassing HBase RegionServers for most of its operation.

Uniqueness constraint
  • INSERT: Enforced.
  • UPSERT: Not enforced; a new row with the same key value overwrites the previous row.
  • UPSERT USING LOAD: Not enforced; a new row with the same key value overwrites the previous row.
  • LOAD: Enforced only within the set of rows in a single statement; not enforced against rows already in the table.

Index
  • INSERT: Can be used on a table with an index.
  • UPSERT: Can be used on a table with an index.
  • UPSERT USING LOAD: When used on a table with an index, it reverts to UPSERT.
  • LOAD: Can be used on a table with an index; the index is offline during the LOAD.

Maximum size
  • INSERT: Suitable for fewer than 10,000 * n rows per invocation, where n is the number of nodes.
  • UPSERT: Suitable for fewer than 10,000 * n rows per invocation, where n is the number of nodes.
  • UPSERT USING LOAD: Suitable for fewer than 5 million * n rows per invocation, where n is the number of nodes.
  • LOAD: Suitable for fewer than 2 billion * n rows per invocation, where n is the number of nodes.

Minimum size
  • INSERT: 1 row.
  • UPSERT: 1 row.
  • UPSERT USING LOAD: 1 row.
  • LOAD: Suitable for more than 1 million * n rows.

Speed
  • INSERT: Slowest.
  • UPSERT: Faster than INSERT.
  • UPSERT USING LOAD: Faster than UPSERT.
  • LOAD: Fastest.
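
For orientation, the sketches below show the general shape of each of the four statements. The table names and the Hive staging table are illustrative placeholders, not objects from this page; the Trafodion SQL Reference Manual remains the authoritative source for syntax.

  -- placeholder target table trafodion.sales.customers and Hive source hive.hive.customers_staging
  INSERT INTO trafodion.sales.customers VALUES (1001, 'Jane', 'Berlin');

  UPSERT INTO trafodion.sales.customers VALUES (1001, 'Jane', 'Munich');

  UPSERT USING LOAD INTO trafodion.sales.customers
    SELECT * FROM hive.hive.customers_staging;

  LOAD INTO trafodion.sales.customers
    SELECT * FROM hive.hive.customers_staging;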

...

  1. Import data from the remote source into Hive tables on the Trafodion cluster. Use one of these approaches:
    • Use a tool like Apache Sqoop(TM) to move data from a remote relational database management system (RDBMS), such as MySQL or Oracle, to Hive tables on the Trafodion system. For more information, see Using Sqoop to Import Data Into Hive.
    • Copy the data to be imported into Trafodion onto HDFS on the Trafodion cluster, and use Hive external tables to make this data visible from Hive. For more information, see Creating Hive External Tables.
  2. Use the Trafodion LOAD statement to import data from Hive to Trafodion tables. For more information, see Bulk Loading Data Into Trafodion Tables.

If you want to migrate a smaller amount of data (on the order of a few million rows) from a database that supports ODBC, use the odb tool, which has parallel data loading and extracting capabilities. For more information, see Trickle Loading Data Into Trafodion Tables. You can also use ETL tools like SQuirreL SQL, Pentaho, or Informatica. These tools connect to Trafodion through ODBC or JDBC and issue INSERT statements with an appropriate rowset size; they typically achieve lower throughput than odb.

Using Sqoop to Import Data Into Hive

Use a tool like Apache Sqoop(TM) to move data from a remote relational database management system (RDBMS), such as MySQL or Oracle, to Hive tables on the Trafodion system.

Installing Required Software

By default, Sqoop is not installed on Trafodion clusters. Install and start Sqoop on the Trafodion cluster using either the Ambari or Cloudera Manager GUI. See the Sqoop installation instructions. To use Sqoop, see the Sqoop User Guide.

...

Parameter guidelines:

  • --split-by <column-name>: If not specified, Sqoop uses the primary key column as the splitting column by default, which is not optimal in most cases. If the table does not contain a primary key, you must specify the splitting column manually.
  • --null-string <null-string>: The string to be written for a null value in a string column.
  • --null-non-string <null-string>: The string to be written for a null value in a non-string column.
  • --hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing into Hive. NOTE: If the data contains \n or \r and you do not use the --hive-drop-import-delims option, the data will be truncated. In that case, use additional Sqoop options during migration to specify a field delimiter that does not occur in the data itself.
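
For orientation, a sqoop import command that uses these options might look like the following sketch. The JDBC URL, credentials, and table names are hypothetical placeholders; adjust them for your source RDBMS.

  # placeholder connection string, user, and table names
  sqoop import --connect jdbc:mysql://mysql-host:3306/salesdb \
    --username sales_user -P \
    --table customers \
    --split-by customer_id \
    --null-string '\\N' --null-non-string '\\N' \
    --hive-import --hive-table customers_staging \
    --hive-drop-import-delims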

Creating Hive External Tables

A Hive table must be created from a Hive interface such as the Hive shell. External tables in Hive are convenient to use during loading: you can load the data into Hive simply by copying the source data files into a single HDFS directory and then creating a Hive external table that points to that directory. The data type used for each column in the Hive table must match the source data. For the syntax to create external tables, see the Hive wiki. For the data types available in Hive, see Hive LanguageManual Types. When Hive tables are accessed through Trafodion, columns with integral, string, and char types are accessible. To populate an external table, use a hadoop command of this type to copy data to an HDFS directory:

...
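
As a sketch of the external-table step described above, a minimal definition might look like the following, assuming a pipe-delimited text source; the table name, columns, and HDFS path are placeholders.

  -- run from the Hive shell; names, delimiter, and path are placeholders
  CREATE EXTERNAL TABLE customers_staging (
    customer_id   INT,
    customer_name STRING,
    city          STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE
  LOCATION '/user/trafodion/customers_staging';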

When using INSERT or UPSERT USING LOAD with the objective of maximizing data ingestion throughput, increasing the HBase table attribute MEMSTORE_FLUSHSIZE from its default value helps. The actual value you use depends on the heap size allocated to each RegionServer, the concurrent query workload, and the number of tables for which simultaneous fast data ingestion is needed. With a heap size of 31 GB for each RegionServer in an environment with a heavy concurrent query workload, setting this attribute to 1 GB gives good performance. You can specify this attribute in the HBASE_OPTIONS clause when creating the table. Alternatively, you can set it from the HBase shell with an alter 'TRAFODION.<schema-name>.<table-name>', MEMSTORE_FLUSHSIZE => '1073741824' command.
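
For illustration, a table created with this attribute might look like the sketch below. The table definition is a placeholder, and the exact option name accepted by the HBASE_OPTIONS clause should be confirmed in the SQL Reference Manual.

  -- placeholder table; option spelling follows the attribute name used above
  CREATE TABLE trafodion.sales.customers
  ( customer_id   INT NOT NULL PRIMARY KEY
  , customer_name VARCHAR(64)
  , city          VARCHAR(32)
  )
  HBASE_OPTIONS ( MEMSTORE_FLUSHSIZE = '1073741824' );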

Bulk Loading Data Into Trafodion Tables

Using the TrafCI or sqlci command interface, set these required Control Query Defaults (CQDs) to improve load performance:

...

For the syntax of the LOAD statement, see the Trafodion SQL Reference Manual (pdf, 3.98 MB).

If multiple invocations of the LOAD statement are used to incrementally load sets of data to a single target table, there will be several HFiles for each partition of the target table. This will cause inefficient access during SELECT queries and may also cause a compaction to be triggered based on the policies configured in the HBase settings. To avoid this, it is good practice to perform a major compaction on a table that has been the target of more than two LOAD statements in a short period of time. To perform compaction, use this HBase shell command:

...

The command is no-waited and returns immediately. Typically, compaction of a large table takes a long time (several minutes to hours) to complete. You can monitor the progress of compaction from the HBase Master Web UI.

Trickle Loading Data Into Trafodion Tables

The odb tool can be used to add data to a Trafodion table. This tool is a client utility that interacts with Trafodion through ODBC. It may be installed on the Trafodion cluster, on the machine that contains the source data, or on an intermediate machine used for data loading. Source data can be in any RDBMS (local or remote) that supports ODBC, or in flat files local to the machine hosting the odb tool.

The odb tool adds data to a Trafodion table using three of the four approaches listed on this page: INSERT, UPSERT, and UPSERT USING LOAD. It does not use the bulk load command LOAD, so throughput with odb may be lower than what can be achieved with the bulk loader. However, when using the odb tool, source data need not be moved to the Trafodion cluster in a separate step. The odb tool's copy command copies data directly from a remote RDBMS into Trafodion tables; ODBC is used to connect to both the source and target databases. Parallelism is achieved with threads and by splitting the source data into non-overlapping ranges, and rowsets are used to improve throughput. Target Trafodion tables must exist before the copy or load commands can be used.

Installation and usage syntax for the odb tool are discussed in the Trafodion odb User Guide (pdf, 632 KB). As discussed on this page, UPSERT USING LOAD gives the best throughput of the three insert types supported by the odb tool. The default insert type used by odb is INSERT; to use UPSERT USING LOAD, specify :loadcmd=UL in odb's load or copy command. An example command line may look like:

...

For the syntax of the UPDATE STATISTICS statement, see the Trafodion SQL Reference Manual (pdf, 3.98 MB).
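
For example, a typical invocation on a placeholder table might look like the following sketch; the table name is illustrative, and the manual describes the full range of ON clauses.

  -- placeholder table name; collects statistics on all columns
  UPDATE STATISTICS FOR TABLE trafodion.sales.customers ON EVERY COLUMN;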

Troubleshooting Data Loading Problems

...

  • For INSERT and UPSERT USING LOAD statements, use the EXPLAIN statement, which is described in the Trafodion SQL Reference Manual (pdf, 3.98 MB).
  • For the LOAD statement, which is implemented as a utility operator (that is, a collection of secondary SQL statements), use these SQL statements to see the plan that it uses to add data to the target table:

...