Choosing the Appropriate Method to Add Rows

You can use these SQL statements to add rows of data to Trafodion tables:

  • INSERT
  • UPSERT
  • UPSERT USING LOAD
  • LOAD

For syntax descriptions of these statements, see the Trafodion SQL Reference Manual (pdf, 3.98 MB). The table below highlights the differences between these statements with the objective of helping you choose the appropriate statement for a given requirement.

Transaction
  • INSERT: Yes
  • UPSERT: Yes
  • UPSERT USING LOAD: No; uses the HBase WAL for recovery
  • LOAD: No; uses a snapshot for recovery

Method of operation
  • INSERT: Uses the standard HBase write path through its CheckAndPut call. Rows are held in transaction coprocessor memory until the transaction is committed.
  • UPSERT: Uses the standard HBase write path through its Put call. Rows are held in transaction coprocessor memory until the transaction is committed.
  • UPSERT USING LOAD: Uses the standard HBase write path through its Put call.
  • LOAD: Uses the HBase bulk load write path and creates HFiles directly, bypassing HBase RegionServers for most of its operation.

Uniqueness constraint
  • INSERT: Enforced.
  • UPSERT: Not enforced. A new row with the same key value overwrites the previous row.
  • UPSERT USING LOAD: Not enforced. A new row with the same key value overwrites the previous row.
  • LOAD: Enforced only within the set of rows in a single statement. Not enforced against rows already in the table.

Index
  • INSERT: Can be used on a table with an index.
  • UPSERT: Can be used on a table with an index.
  • UPSERT USING LOAD: When used on a table with an index, reverts to UPSERT.
  • LOAD: Can be used on a table with an index. The index is offline during the LOAD.

Maximum size
  • INSERT: Suitable for less than 10,000 * n rows per invocation, where n is the number of nodes.
  • UPSERT: Suitable for less than 10,000 * n rows per invocation, where n is the number of nodes.
  • UPSERT USING LOAD: Suitable for less than 5 million * n rows per invocation, where n is the number of nodes.
  • LOAD: Suitable for less than 2 billion * n rows per invocation, where n is the number of nodes.

Minimum size
  • INSERT: 1 row
  • UPSERT: 1 row
  • UPSERT USING LOAD: 1 row
  • LOAD: Suitable for greater than 1 million * n rows

Speed
  • INSERT: Slowest.
  • UPSERT: Faster than INSERT.
  • UPSERT USING LOAD: Faster than UPSERT.
  • LOAD: Fastest.
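The row-count guidance above can be condensed into a small decision helper. The sketch below is illustrative only (the function and its name are ours, not part of Trafodion); it simply encodes the maximum-size rows of the table:

```python
def suggest_statement(rows: int, nodes: int) -> str:
    """Suggest a write statement from the rough size guidance above.

    Thresholds mirror the table: INSERT/UPSERT up to 10,000 * n rows,
    UPSERT USING LOAD up to 5 million * n rows, LOAD beyond that
    (n = number of nodes).
    """
    if rows < 10_000 * nodes:
        return "INSERT or UPSERT"
    if rows < 5_000_000 * nodes:
        return "UPSERT USING LOAD"
    return "LOAD"
```

For example, on a four-node cluster a 100-million-row load is well past the 5 million * n boundary, so LOAD is the suggested choice.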

The next table lists various data sources for each of these four SQL statements and the relative merits of each data source, with particular focus on speed. The objective again is to help you determine which data source may be most suitable. Your choice of data source is often influenced by the format in which the data to be loaded is received. All data ingestion rates depend on the hardware, the row length, and the column types. The numbers shown here are for a four-node system, each node with 16 cores, 128 GB of memory, and regular disks. The row length is 150 bytes, and the column types are similar to those of the TPC-H Lineitem table. NOTE: Throughput numbers for cells marked <TBD> will be added in the future.

Single row at a time with the VALUES clause
  • INSERT: <TBD>. Slowest approach.
  • UPSERT: <TBD>
  • UPSERT USING LOAD: <TBD>
  • LOAD: No

List of rows with the VALUES clause
  • INSERT: <TBD>. Speed comparable to rowset INSERT. The single-row version of CheckAndPut is called.
  • UPSERT: <TBD>. Speed comparable to rowset UPSERT. The list version of Put is called.
  • UPSERT USING LOAD: <TBD>. Speed comparable to rowset UPSERT USING LOAD. The list version of Put is called.
  • LOAD: No

List of rows with ODBC/JDBC rowsets
  • INSERT: <TBD>
  • UPSERT: <TBD>
  • UPSERT USING LOAD: ~2.5 GB/hr
  • LOAD: No

Native HBase tables
  • INSERT: <TBD>. Speed expected to be slower than a Trafodion table as the source for INSERT, since the compiler has no statistics and the data must be encoded.
  • UPSERT: <TBD>. Speed expected to be slower than a Trafodion table as the source for UPSERT, since the compiler has no statistics and the data must be encoded.
  • UPSERT USING LOAD: <TBD>. Speed expected to be slower than a Trafodion table as the source for UPSERT USING LOAD, since the compiler has no statistics and the data must be encoded.
  • LOAD: <TBD>. Speed expected to be slower than a Trafodion table as the source for LOAD, since the compiler has no statistics and the data must be encoded.

Trafodion tables
  • INSERT: <TBD>. Speed expected to be slower than a Hive table as the source for INSERT, since a Trafodion scan is slower than a Hive scan.
  • UPSERT: <TBD>. Speed expected to be slower than a Hive table as the source for UPSERT, since a Trafodion scan is slower than a Hive scan.
  • UPSERT USING LOAD: <TBD>. Speed expected to be slower than a Hive table as the source for UPSERT USING LOAD, since a Trafodion scan is slower than a Hive scan.
  • LOAD: <TBD>. Speed expected to be slower than a Hive table as the source for LOAD, since a Trafodion scan is slower than a Hive scan.

Hive tables
  • INSERT: <TBD>. Fastest source for INSERT.
  • UPSERT: <TBD>. Fastest source for UPSERT.
  • UPSERT USING LOAD: <TBD>. Fastest source for UPSERT USING LOAD.
  • LOAD: Fastest, ~100 GB/hr.

UDF connected to other data sources
  • INSERT: Speed depends on the data source and the UDF. Expected to be slower than a Hive table as the source for INSERT.
  • UPSERT: Speed depends on the data source and the UDF.
  • UPSERT USING LOAD: Speed depends on the data source and the UDF.
  • LOAD: Speed depends on the data source and the UDF.

Importing Data From Another RDBMS

Importing significant amounts of data from another RDBMS or data source external to the Trafodion cluster is best achieved in two steps:

  1. Import data from the remote source into Hive tables on the Trafodion cluster, using one of these approaches:
    • Use a tool like Apache Sqoop(TM) to move data from a remote relational database management system (RDBMS), such as MySQL or Oracle, to Hive tables on the Trafodion system. For more information, see Using Sqoop to Import Data Into Hive
    • Copy the data to be imported into Trafodion onto HDFS on the Trafodion cluster. Use Hive external tables to make this data visible from Hive. For more information, see Create Hive External Tables.
  2. Use the Trafodion LOAD statement to import data from Hive to Trafodion tables. For more information, see Bulk Loading Data into Trafodion.

If you want to migrate a smaller amount of data (on the order of a few million rows) from a database that supports ODBC, use the odb tool, which has parallel data loading and extracting capabilities. For more information, see Trickle Loading Data Into Trafodion Tables. You can also use ETL tools like SQuirreL SQL, Pentaho, or Informatica. These tools connect to Trafodion through ODBC or JDBC and issue INSERT statements with an appropriate rowset size. They are expected to achieve a smaller throughput than odb.

Using Sqoop to Import Data Into Hive

Use a tool like Apache Sqoop(TM) to move data from a remote relational database management system (RDBMS), such as MySQL or Oracle, to Hive tables on the Trafodion system.

Installing Required Software

By default, Sqoop is not installed on Trafodion clusters. Install and start Sqoop on the Trafodion cluster using either the Ambari or Cloudera Manager GUI. See the Sqoop installation instructions. To use Sqoop, see the Sqoop User Guide.

Install JDK 1.8 and the Oracle JDBC driver on the Trafodion cluster. They are used only for importing data from the RDBMS to Hive tables. Set the JAVA_HOME environment variable:

export JAVA_HOME=/opt/java/jdk1.8.0_11

Examples of Sqoop Commands

Listing All Oracle Tables
sqoop list-tables --driver oracle.jdbc.OracleDriver --connect jdbc:oracle:thin:@<Oracle host name>:<port>/<database> 
--username <user-name> --password <password>
Importing Data to Hive
sqoop import --connect jdbc:oracle:thin:@<Oracle host name:port>/<database> --username <user-name> 
--password <password> --table <tablename> --split-by <column-name> --hive-import --create-hive-table 
--hive-table <hive-table-name> --hive-overwrite --null-string '' --null-non-string '' 
--hive-drop-import-delims --verbose
--split-by <column-name>: By default, if this option is not specified, Sqoop uses the primary key column as the splitting column, which is not optimal most of the time. If the table does not contain a primary key, you must specify the splitting column manually.

--null-string <null-string>: The string to be written for a null value in a string column.

--null-non-string <null-string>: The string to be written for a null value in a non-string column.

--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing into Hive. NOTE: If the data contains \n or \r and you do not use the hive-drop-import-delims option, data will be truncated. In that case, use additional Sqoop options during migration to specify a delimiter that does not occur in the data itself.

Creating Hive External Tables

A Hive table must be created from a Hive interface, such as the Hive shell. External tables in Hive are convenient to use during loading: you can load data into Hive simply by copying the source data files into a single HDFS directory and then creating a Hive external table that points to the directory. The data types used for each column in the Hive table must match the source data. For the syntax to create external tables, see the Hive wiki. For the data types available in Hive, see Hive LanguageManual Types. When Hive tables are accessed through Trafodion, columns with integral, string, and char types are accessible. To populate an external table, use this type of hadoop command to copy data to an HDFS directory:

hadoop fs -copyFromLocal <src-file> <target-hdfs-file>

Creating Trafodion Tables and Indexes

Create Trafodion tables using CREATE TABLE statements with the SALT USING <num> PARTITIONS clause for salting and the HBASE_OPTIONS clause for compression and encoding. For example:

create table trafodion.sch.demo
  (
    demo_sk int not null,
    name varchar(100),
    primary key (demo_sk)
  )
  hbase_options
  (
    memstore_flush_size = '1073741824'
  )
  salt using 8 partitions on (demo_sk);

Any indexes on the table may be salted or not. However, if they are salted, their salting key and number of partitions must be the same as those of the table.

create index demo_ix on sch.demo(name)
salt like table;

Choosing a Primary Key

The key for a Trafodion table must be chosen based on the workload that will access the table. Since HBase is a key-value store, keyed access to Trafodion tables is very efficient. When you know the queries that will access a table, along with their predicates and join conditions, you can choose a key such that the leading key columns have highly selective predicates applied to them. This limits the number of rows that need to be scanned in HBase. Trafodion uses MDAM (Multi Dimensional Access Method) to limit the rows scanned when predicates are present only on trailing key columns and not on the leading key column. MDAM works best when the unique entry count of the leading key columns (on which predicates are absent) is low.

Salting a Table

Trafodion tables can be salted to avoid hot-spotting. With range partitioned data in some workloads, certain key ranges of data may see more access than other key ranges. This can lead to an unbalanced usage pattern with some HBase RegionServers handling most of the load. With Native HBase tables, this is often addressed by designing appropriate keys. In Trafodion, once you choose the key to a table, as discussed in Choosing a Primary Key, you can use salting to distribute the data evenly. Salting applies a hash function to the salt keys and distributes data to partitions based on this hash value. The hash value is physically stored in the table as the leading key value. Each split of the table will have only one salt key value. The salting key can be any subset (including the whole set) of the primary key. It is a good practice to keep the salting key as small as possible. The key should provide an even distribution of data. This can be achieved when the key values have a large unique entry count and no significant skew. The number of partitions must also be specified during table creation. This depends on the size of the cluster and the expected size of the table. A salted table can split if more data is added to it than initially estimated. This will lead to more than one partition having rows with the same salt value and may result in suboptimal execution plans for the table.
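The effect of salting can be illustrated with a toy sketch. This is not Trafodion's actual hash function or key layout; it only shows how hashing a salting key spreads otherwise sequential keys evenly across a fixed number of partitions:

```python
import hashlib

def salt_bucket(salt_key: str, num_partitions: int) -> int:
    # Hash the salting key and map the hash to a partition number.
    # In a salted Trafodion table, a value computed like this is stored
    # as the leading key column, so each partition holds one salt value.
    digest = hashlib.md5(salt_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Sequential keys, which would hot-spot one region in a purely
# range-partitioned table, spread across all 8 buckets once salted.
buckets = {salt_bucket(f"order-{i}", 8) for i in range(1000)}
```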

You can also choose not to salt Trafodion tables. This is similar to range partitioning in a traditional RDBMS. The number of partitions grows with the size of the table, and range boundaries are determined by HBase based on the specified split policy.

Using Compression and Encoding

Large Trafodion tables must be encoded and compressed. Since HBase stores the key separately for every column in a row, Trafodion tables that have a large key or several columns can grow in size to 10X or more when compared to a Hive table with equivalent data. HBase provides several types of encoding to avoid storing the same key value to disk for every column in the row. HBase also supports various types of compression of the entire data block, regardless of whether it is encoded or not. For a discussion of the space savings achieved by various encoding and compression schemes, see the Apache HBase blog entry "The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size," dated April 11, 2014. The performance of an algorithm is also a relevant factor when deciding on the type of compression. For more information, see slides 6, 7, and 8 of Govind Kamat's and Sumeet Singh's presentation, "Compression Options in Hadoop -- A Tale of Tradeoffs." For an excellent comparison of various compression and encoding algorithms, see Appendix E: Compression and Data Block Encoding In HBase in the Apache HBase™ Reference Guide.

Setting Memstore Flushsize

When using INSERT or UPSERT USING LOAD with the objective of maximizing data ingestion throughput, it helps to increase the HBase table attribute MEMSTORE_FLUSHSIZE from its default value. The actual value you use depends on the heap size allocated to each RegionServer, the concurrent query workload, and the number of tables for which simultaneous fast data ingestion is needed. With a heap size of 31 GB for each RegionServer in an environment with a heavy concurrent query workload, setting this attribute to 1 GB gives good performance. You can specify this attribute in the HBASE_OPTIONS clause when creating the table. Alternatively, you can set it from the HBase shell with an alter 'TRAFODION.<schema-name>.<table-name>', MEMSTORE_FLUSHSIZE => '1073741824' command.

Bulk Loading Data Into Trafodion Tables

Using the TrafCI or sqlci command interface, set these required Control Query Defaults (CQDs) to improve load performance:

cqd hive_max_string_length '1000'; // if the widest column is 1KB

This setting is required if there are columns of time-related types in the target Trafodion table:

cqd allow_incompatible_assignment 'on'; 

Issue the LOAD statement to load data into Trafodion tables from Hive. For example:

load with no populate indexes into trafodion.sch.demo select * from hive.hive.demo;

For the syntax of the LOAD statement, see the Trafodion SQL Reference Manual (pdf, 3.98 MB).

If multiple invocations of the LOAD statement are used to incrementally load sets of data to a single target table, there will be several HFiles for each partition of the target table. This will cause inefficient access during SELECT queries and may also cause a compaction to be triggered based on the policies configured in the HBase settings. To avoid this, it is good practice to perform a major compaction on a table that has been the target of more than two LOAD statements in a short period of time. To perform compaction, use this HBase shell command:

major_compact 'TRAFODION.SCH.DEMO'

The command is non-blocking (no-wait) and returns immediately. Typically, compaction of a large table takes a long time (several minutes to hours) to complete. You can monitor the progress of compaction from the HBase Master Web UI.

Trickle Loading Data Into Trafodion Tables

The odb tool can be used to add data to a Trafodion table. This tool is a client utility that interacts with Trafodion through ODBC. It may be installed on the Trafodion cluster, on the machine that contains the source data, or on an intermediate machine used for data loading. Source data can be in any RDBMS (local or remote) that supports ODBC, or in flat files local to the machine hosting the odb tool.

The odb tool allows data to be added to a Trafodion table using three of the four approaches listed on this page: INSERT, UPSERT, and UPSERT USING LOAD. It does not use the bulk load command LOAD, so throughput with odb may be lower than what can be achieved with the bulk loader. However, when using odb, the source data need not be moved to the Trafodion cluster in a separate step. The odb tool's copy command copies data directly from a remote RDBMS into Trafodion tables. ODBC is used to connect to both the source and target databases. Parallelism is achieved with threads and by splitting the source data into non-overlapping ranges; rowsets are used to improve throughput. Target Trafodion tables must exist before the copy or load commands can be used.

Installation and usage syntax for the odb tool are discussed in the Trafodion odb User Guide (pdf, 632 KB). As discussed on this page, UPSERT USING LOAD gives the best throughput of the three insert types supported by the odb tool. The default insert type used by odb is INSERT; to use UPSERT USING LOAD, specify :loadcmd=UL in odb's load or copy command. An example command line may look like:

odb64luo -u <src_username>:<tgt_username> -p <src_pswd>:<tgt_pswd> -d <src_dsn>:<tgt_dsn> 
-cp src:<source_catalog>.<source_schema>.mytable tgt:trafodion.myschema.mytable :splitby=<col-name>:parallel=4:loadcmd=UL

Here src_username, src_pswd, and src_dsn are the username, password, and ODBC data source needed to connect to the source RDBMS. Similarly, tgt_username, tgt_pswd, and tgt_dsn are the username, password, and ODBC data source needed to connect to the target Trafodion instance. This command copies the table mytable from <source_catalog>.<source_schema> on the source RDBMS to trafodion.myschema.mytable on Trafodion. Choose a column with approximately evenly distributed values as the splitby column; a leading key column may work well for some source databases. With the parallel=4 option, odb uses four connections to extract data from the source database and another four connections to write data to the target Trafodion instance. The loadcmd=UL option specifies that odb will use UPSERT USING LOAD syntax to write data.

Updating Statistics on Trafodion Tables

To generate good plans that allow queries to execute quickly and use resources wisely, the optimizer must have a good idea about how the values of columns are distributed, the number of distinct values, and so on. Trafodion supplies this information to the optimizer in the form of histograms generated by executing the UPDATE STATISTICS statement.

Using Default Sampling

While accurate statistics are important, the time required to generate them by reading every row in the table may be prohibitive and is usually unnecessary. Random sampling of the rows of the table can give adequate results in a fraction of the time required to read all the values in the table. For most situations, the best option is to simply specify SAMPLE at the end of the UPDATE STATISTICS statement, which will use the default sampling protocol. For example, to use default sampling in the construction of histograms for each column of table T1, you would execute the following statement:

update statistics for table T1 on every column sample;

This default sampling protocol uses a high sampling rate for small tables, reducing the rate with a steep gradient until hitting 1% and capping the sample size at one million rows. The specific details of default sampling are as follows:

  • Use the full table for tables up to 10,000 rows.
  • For table sizes from 10,000 up to a million rows, 10,000 rows are randomly sampled. In effect, this causes the sampling rate to decline from 100% to 1% as a function of increasing table size.
  • For tables with one million to 100 million rows, use a 1% random sample.
  • For tables exceeding 100 million rows, the sampling rate is calculated as 1 million divided by the number of rows in the table. This limits the overall sample size to 1 million rows while ensuring uniform random sampling across the entire table.
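The rules above can be expressed as a small function. This is a sketch of the arithmetic only, not Trafodion source code:

```python
def default_sample_size(num_rows: int) -> int:
    """Approximate sample size used by the default SAMPLE option,
    following the rules listed above (illustrative sketch)."""
    if num_rows <= 10_000:
        return num_rows            # full table for small tables
    if num_rows <= 1_000_000:
        return 10_000              # fixed sample; rate falls from 100% to 1%
    if num_rows <= 100_000_000:
        return num_rows // 100     # 1% random sample
    return 1_000_000               # capped at one million rows
```

So a one-billion-row table is sampled at the 1,000,000-row cap, roughly one-tenth of one percent of its rows.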

Generating Single-Column and Multi-Column Histograms From One Statement

If you use the ON EVERY COLUMN syntax in an UPDATE STATISTICS statement, it is important to realize that multi-column histograms can be requested in the same statement. For example, if you wanted to generate a histogram for each single column of table T1, as well as multi-column histograms for the column sets (c1, c2) and (c5, c6, c7), you could use the following statement:

update statistics for table T1 on every column, (c1,c2), (c5,c6,c7) sample;

In terms of the end result, this is equivalent to the following pair of statements:

update statistics for table T1 on every column sample;
update statistics for table T1 on (c1, c2), (c5, c6, c7) sample;

However, the performance will be superior when they are combined into a single statement because a multi-column histogram depends on the single-column histograms of its component columns. Thus, separating the generation of single-column and multi-column histograms for a table into two statements will have the effect of redundantly calculating some of the single-column histograms. Even though the relevant single-column histograms already exist, they are recomputed at the time the multi-column histograms are generated.

Enabling Update Statistics Automation

If a standard set of queries is run on a regular basis, one way to generate only those histograms that are needed for efficient execution of those queries is to enable update statistics automation, and then PREPARE each of the queries:

control query default USTAT_AUTOMATION_INTERVAL '1440';
prepare s from select...;

The value of the CQD USTAT_AUTOMATION_INTERVAL is intended to determine the automation interval (in minutes) for update statistics automation. However, in Trafodion Release 1.0, that value is not used as a timing interval but must be any value greater than zero to enable automation. The PREPARE statement causes the compiler to compile and optimize a query without executing it. In the process of doing so with automation enabled, any histograms needed by the optimizer that are missing will cause those columns to be marked as needing histograms. Then, the following UPDATE STATISTICS statement can be run against each table to generate the needed histograms:

update statistics for table <table-name> on necessary columns sample;

Regenerating Histograms

Histograms can become "stale" as the underlying data changes and possibly reflects a different distribution of values, although it is possible that data turnover or accumulation can be high while maintaining the same distribution. To ensure that statistics remain accurate, you should regenerate histograms for a table once significant changes have been made to that table since its histograms were last generated. To refresh existing histograms without adding new ones, use the following statement:

update statistics for table <table-name> on existing columns sample;

The critical set of histograms that were previously generated with the ON NECESSARY COLUMNS syntax can be periodically regenerated using ON EXISTING COLUMNS. Note that ON NECESSARY COLUMNS only identifies columns whose histograms have been requested by the optimizer but do not yet exist. The current implementation of automation does not know which existing histograms might be stale.

For the syntax of the UPDATE STATISTICS statement, see the Trafodion SQL Reference Manual (pdf, 3.98 MB).

Troubleshooting Data Loading Problems

Improving Throughput

Tuplelists or Rowsets

When tuplelists or rowsets are used as the data source, performance typically increases with the number of rows in the tuplelist or rowset. Performance peaks at some number of rows and remains more or less steady after that. This peak value depends on the row size. Typically, a value in the range of 100 to a few thousand rows is reasonable.

Native HBase Tables

When native HBase tables are used as the data source, it is important to override the default value for the attribute HBASE_MAX_COLUMN_VALUE_LENGTH (columnwise mode) or HBASE_MAX_COLUMN_INFO_LENGTH (rowwise mode) and set the value to the maximum for the table being used as the source. The default values may be too large.

Hive Tables

When Hive tables are used as the data source, it is important to override the default value for the attribute HIVE_MAX_STRING_LENGTH when the Hive source table has columns of type string. Please set the value to the length of the longest string in the Hive table. To determine that length, run this query from a Hive shell:

select max(length(<col-name>)) from <hive-tab-name>;

If the query returns a value less than the current HIVE_MAX_STRING_LENGTH, you will need to increase that value and retry. If the query returns a value that is far less than the current HIVE_MAX_STRING_LENGTH, you can achieve better performance by reducing the value. An approximate value can be used, too. The Trafodion default of 32000 may be too generous in some cases.

Monitoring Progress


For an INSERT statement, rows are written to the HBase table that represents the Trafodion table when the transaction commits. It is more difficult to see query progress here.


For an UPSERT USING LOAD statement, rows added are visible in the Trafodion table after each ListOfPut call succeeds. You can use a SELECT COUNT(*) statement to monitor progress. That way, you will know how many rows are already in the table when the statement starts executing.

select count(*) from trafodion.sch.demo ;


For LOAD, query progress goes through a few phases, which sometimes overlap:

  1. Hive scan
  2. Sort
  3. Create prep HFiles in HDFS bulkload staging directory (/bulkload by default)
  4. Move HFiles into HBase

You can monitor progress in step 2, sort, with this shell command:

lsof +L1 | grep SCR | wc -l

This gives a count of the number of overflow files for sort. Each file is 2GB in size. You need to have an approximate idea of the volume of data being loaded to know how much more data needs to be sorted. On a cluster, this will have to be done on all nodes with a pdsh-like utility. Trafodion data volume can also be larger than Hive data volume by a factor of 2 or 3.
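Given the 2 GB overflow file size and the 2-3x expansion factor mentioned above, a rough expectation for the cluster-wide file count can be sketched as follows (the function and the 2.5x default are our own illustrative assumptions):

```python
import math

def estimated_overflow_files(hive_bytes: int, expansion_factor: float = 2.5) -> int:
    # Trafodion data volume is assumed to be the Hive source volume
    # times a 2-3x expansion factor; each sort overflow file is 2 GB.
    traf_bytes = hive_bytes * expansion_factor
    return math.ceil(traf_bytes / (2 * 1024**3))

# e.g. 100 GB of Hive source data at 2.5x expansion -> about 125 files
files = estimated_overflow_files(100 * 1024**3)
```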

In step 3, create prep HFiles, use this command to monitor the volume of data written out to the staging directory:

hadoop fs -dus /bulkload

If compression and encoding are used, this size should be similar to the Hive source data volume. This command must be run from one node and does not have to be repeated across the cluster. There may be some remnant data in the staging directory from previous commands, so we have to take that into account. This step will start only when sort has completed.

Step 4 is usually the shortest and typically does not exceed a few minutes.

Checking Plan Quality

It is good practice to check the quality of the plan generated by the SQL compiler before executing a data loading statement that may take a long time to complete.

  • For INSERT and UPSERT USING LOAD statements, use the EXPLAIN statement, which is described in the Trafodion SQL Reference Manual (pdf, 3.98 MB).
  • For the LOAD statement, which is implemented as a utility operator (that is, a collection of secondary SQL statements), use these SQL statements to see the plan that it uses to add data to the target table:
control query default COMP_BOOL_226 'ON' ;
prepare s1 from LOAD TRANSFORM into <target-table> <select-query-used-as-source> ;
explain options 'f' s1 ;

A typical problem with the plan is that the scan is not sufficiently parallel. For Trafodion tables, you can address this issue with the default attribute PARALLEL_NUM_ESPS. Using this attribute, a Trafodion scan can be parallelized up to the number of salt partitions defined for the table. For Hive source tables, the default attributes HIVE_NUM_ESPS_PER_DATANODE and HIVE_MIN_BYTES_PER_ESP_PARTITION can be used to adjust the degree of parallelism.

Cleaning up after a failed LOAD

When a LOAD statement fails, it will likely leave some files behind in the /bulkload directory. These files are the HFiles that were created before the failure occurred. They have not been moved under HBase control and are not yet part of the target table, so they should be removed. If they are not taking up needed space, the files can be left alone; the next LOAD statement into the same target table will clean them up before starting its actual load operation. Alternatively, these commands can be used from any SQL interface to remove the files:

control query default COMP_BOOL_226 'ON' ;
LOAD CLEANUP FOR TABLE <target-table> ;

This command should complete quickly.

Update statistics times out during sampling

Sampling in update statistics is implemented using the HBase Random RowFilter. For very large tables with several billion rows, the sampling ratio required to get a sample of 1 million rows is very small. This can result in HBase client connection timeout errors since there may be no row returned by a RegionServer for an extended period of time. You can avoid this problem by:

  1. Choosing a sample size or percentage higher than the default cap of 1 million rows for large tables. For example, suppose table T has one billion rows. The following UPDATE STATISTICS statement will sample a million rows, or approximately one-tenth of one percent of the total rows:
    update statistics for table T on every column sample;

    To sample one percent of the rows, regardless of the table size, you must explicitly state the sampling rate as follows:

    update statistics for table T on every column sample random 1 percent;
  2. Setting hbase.rpc.timeout to a higher value than currently specified in the HBase settings.
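The arithmetic behind the timeout is simple: with the default cap of one million sampled rows, the per-row pass probability for a very large table becomes tiny, so a RegionServer may scan a long stretch of rows without returning one. A sketch of that arithmetic (not the actual implementation):

```python
def default_sampling_ratio(num_rows: int, cap: int = 1_000_000) -> float:
    # Probability that the RandomRowFilter keeps any given row under
    # the default one-million-row cap.
    return min(1.0, cap / num_rows)

ratio = default_sampling_ratio(5_000_000_000)  # 5 billion rows
# Only about 1 row in 5,000 passes the filter, so the gap between
# returned rows can exceed the HBase client timeout.
```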

Index creation takes too long

When creating an index, all rows of the Trafodion table must be scanned and a subset of columns returned to the client. This can take a while to complete. If there is a Hive table with the same data as the Trafodion table being scanned, you can specify the default attribute USE_HIVE_SOURCE. This causes the Hive table to be used as the source for creating the index. NOTE: The name of the Hive table must use the Trafodion table name as its prefix. For example, if the Trafodion table is TRAFODION.SCH.DEMO, then the Hive table name can be DEMO_SRC. In this case, set the attribute as follows:

control query default USE_HIVE_SOURCE '_SRC' ;
create index demo_ix on sch.demo(name);

Large Deletes take too long or error out

If a large number of rows is updated or deleted in a single SQL statement, it is likely that the statement will not complete successfully. Deleting or updating more than 10,000 rows with a single statement is not recommended. A large delete or update should be broken up into multiple statements, each affecting less than 10,000 * n rows, if possible, where n is the number of nodes in the cluster.
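One way to break a large delete into safe batches is to split the key range so that each statement touches at most 10,000 * n rows. The sketch below assumes a dense integer primary key and uses placeholder table and column names; with a sparse or skewed key, ranges should be sized from actual row counts instead:

```python
def delete_batches(min_key: int, max_key: int, nodes: int, batch_limit: int = 10_000):
    """Yield DELETE statements, each covering at most batch_limit * nodes
    key values (placeholder table/column names; illustrative only)."""
    step = batch_limit * nodes
    for lo in range(min_key, max_key + 1, step):
        hi = min(lo + step - 1, max_key)
        yield (f"delete from trafodion.sch.demo "
               f"where demo_sk between {lo} and {hi};")

# On a 4-node cluster, keys 1..100,000 are covered by three statements.
stmts = list(delete_batches(1, 100_000, nodes=4))
```

Each generated statement can then be issued separately from TrafCI or sqlci, committing between batches.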

Large UPSERT USING LOAD on a table with index errors out

When an UPSERT USING LOAD statement is used on a table with an index, the statement automatically reverts to a transactional UPSERT. We then run into the limitation discussed previously: no more than 10,000 * n rows (where n is the number of nodes) can be affected in a single statement. As a workaround, the UPSERT USING LOAD operation can be placed in a LOAD statement, as shown below. The LOAD statement disables indexes on the table before the UPSERT USING LOAD starts; once the UPSERT USING LOAD completes, the indexes are populated by the LOAD statement.

load with upsert using load into trafodion.sch.demo select * from hive.hive.demo;
Task: LOAD             Status: Started    Object: TRAFODION.SCH.DEMO
Task:  DISABLE INDEXE  Status: Started    Object: TRAFODION.SCH.DEMO
Task:  DISABLE INDEXE  Status: Ended      Object: TRAFODION.SCH.DEMO
Task:  UPSERT USING L  Status: Started    Object: TRAFODION.SCH.DEMO
       Rows Processed: 200000 
Task:  UPSERT USING L  Status: Ended      ET: 00:01:03.715
Task:  POPULATE INDEX  Status: Started    Object: TRAFODION.SCH.DEMO
Task:  POPULATE INDEX  Status: Ended      ET: 00:03:11.323

The LOAD statement can be used here without the UPSERT USING LOAD syntax, but that would create a new HFile for each region and skip the HBase write path. This issue can be tracked at LP1417337.
