DataImportHandler is designed to stream rows one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize, which some drivers do not honor. For MySQL, add a batchSize property to the dataSource configuration with a value of -1. This passes Integer.MIN_VALUE to the driver as the fetch size, which tells the MySQL driver to stream results and keeps it from running out of memory on large tables.
The configuration should look like:
<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>
This happens frequently with MySQL databases, but it could be a problem with any database.
What's happening here is that Solr (at the Lucene layer) has at least three segment merges scheduled, and one of those merges is very large, taking long enough that the database connection is sitting idle, and either the JDBC driver or the server closes the connection.
Solr uses Lucene's ConcurrentMergeScheduler by default. The maxMergeCount setting for that scheduler controls what happens with merging and indexing. The merge scheduler will always handle the largest merges on the schedule first. As long as the number of merges currently being scheduled is below maxMergeCount, ongoing indexing is allowed to run. If the number of merges scheduled reaches that number (default 3), ongoing indexing is paused until the number drops below what's configured. The maxMergeCount in the mergeScheduler must be increased, so that more merges can be scheduled while still allowing ongoing indexing. This config needs to go at the top level of solrconfig.xml. If there is an existing indexConfig section, it would need to be incorporated into that section:
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
</indexConfig>
If the index is on standard spinning disks, leave maxThreadCount at 1. If it's on SSD, you can increase it, but probably shouldn't go higher than about 2 or 3.
There's a connection property called responseBuffering in the sqljdbc driver whose default value is "full", which causes the entire result set to be fetched. See http://msdn.microsoft.com/en-us/library/ms378988.aspx for more details. You can set this property to "adaptive" to keep the driver from reading the entire result set into memory. Connection properties like this can be set as an attribute (responseBuffering="adaptive") in the dataSource configuration, or directly in the JDBC URL specified in DataImportHandler's dataSource configuration.
Note that this is not an issue since version 1.2 of the SQL Server JDBC driver. Since version 1.2, "adaptive" is the default setting for responseBuffering.
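For example, the property can be set as a dataSource attribute; the host, port, and database name below are illustrative placeholders:

```xml
<dataSource type="JdbcDataSource"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://localhost:1433;databaseName=mydb"
            responseBuffering="adaptive"
            user="user" password="password"/>
```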
If you define the property inside your <core/> in solr.xml, then you can refer to the property in your data-config.xml file directly. However, if you define it at the <solr/> level, so that it applies to multiple cores, then you need to use the technique below:
Add the TemplateTransformer to the transformer attribute of the entity and create a field as follows:
<field column="the_static_field_name" template="any-string-can-be-put-here"/>
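Putting the two pieces together, a minimal sketch of such an entity (the entity name, query, and template value are illustrative placeholders):

```xml
<entity name="item" transformer="TemplateTransformer"
        query="select id, name from item">
  <field column="the_static_field_name" template="any-string-can-be-put-here"/>
</entity>
```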
A few reasons
It is possible to do a delta import using the full-import command. Taken from the delta-import example:
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa"/>
  <document>
    <entity name="findDelta" rootEntity="false"
            query="select id from item where
                     id in (select item_id as id from feature
                            where last_modified > '${dataimporter.last_index_time}')
                  or id in (select item_id as id from item_category
                            where item_id in (select id as item_id from category
                                              where last_modified > '${dataimporter.last_index_time}')
                               or last_modified > '${dataimporter.last_index_time}')
                  or last_modified > '${dataimporter.last_index_time}'">
      <entity name="item" query="select * from item where id='${findDelta.id}'">
        <entity name="feature" query="select description as features from feature where item_id='${item.ID}'"/>
        <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
          <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
Notice that for the entity 'findDelta', rootEntity is set to false, so a document is created for each row returned by the 'item' entity. The command has to be:
command=full-import&clean=false
If you are using multiple Solr indexes (or multiple cores) in the same Solr webapp, each core will run a separate instance of DataImportHandler and each core should have its own data-config.xml file in the conf directory. See MultipleIndexes for more details on using multiple indexes.
Solr1.4 No more memory issues with this data source configuration:
<dataSource type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://host/db" user="user" password="password" readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT"/>
This was tested with version 8.3 of the PostgreSQL server and JDBC driver.
Solr1.4 Set up your JNDI data source and specify its name in data-config.xml:
<dataSource jndiName="java:comp/env/jdbc/myDataSource" type="JdbcDataSource" user="" password=""/>
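The JNDI resource itself is defined in the servlet container. As an illustration, a Tomcat context.xml entry might look like this (the driver, URL, and credentials are placeholders; pool attribute names follow Tomcat's DBCP 2 pool):

```xml
<Resource name="jdbc/myDataSource" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost/db"
          username="user" password="password"
          maxTotal="8" maxIdle="4"/>
```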
The problem occurs because blobs are read as a byte array which does not have a useful toString method. The workaround is to:
Write a Transformer which converts the byte[] into a String
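A minimal sketch of such a transformer, using DataImportHandler's reflection-based convention (a plain class declaring transformRow(Map) needs no Solr imports). The class name is a hypothetical example, and UTF-8 blob content is assumed:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Converts every byte[] value in the row to a String; assumes UTF-8 content.
public class BlobToStringTransformer {
    public Object transformRow(Map<String, Object> row) {
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Object v = e.getValue();
            if (v instanceof byte[]) {
                e.setValue(new String((byte[]) v, StandardCharsets.UTF_8));
            }
        }
        return row;
    }
}
```

It would then be referenced in the entity's transformer attribute by its fully qualified class name, alongside any other transformers.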
Note that "convertType" attribute uses the target schema field's type name to convert the value returned by the result set. Therefore, any transformers being used on the entity should be aware of the type information.
Add the zeroDateTimeBehavior=convertToNull parameter to your JDBC connection string to avoid this.
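For example, in the dataSource configuration (host, port, and database name are placeholders):

```xml
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?zeroDateTimeBehavior=convertToNull"
            user="user" password="password"/>
```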
This is an oddity of the JDBC specification: tinyint(1) values are treated as bits, and their corresponding Java type is boolean. Use the tinyInt1isBit=false parameter in your JDBC connection string to avoid the conversion to a boolean value.
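The parameter goes on the connection URL the same way (host, port, and database name are placeholders):

```xml
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?tinyInt1isBit=false"
            user="user" password="password"/>
```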