Uploading Structured Data Store Data with the Data Import Handler

Many search applications store the content to be indexed in a structured data store, such as a relational database. The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it. In addition to relational databases, DIH can index content from HTTP based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields.

The example/example-DIH directory contains several collections that demonstrate many of the features of the Data Import Handler. To run this "dih" example:
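
From the Solr installation directory, assuming the standard bin/solr launch script:

bin/solr -e dih

This starts Solr with several pre-configured DIH collections, including the "db", "rss", "mail", and "tika" collections referenced later on this page.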

For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler.

Topics covered in this section:

  • Concepts and Terminology
  • Configuration
  • Data Import Handler Commands
  • Property Writer
  • Data Sources
  • Entity Processors
  • Transformers
  • Special Commands for the Data Import Handler

Concepts and Terminology

Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific ways, as explained in the table below.

Term

Definition

Datasource

As its name suggests, a datasource defines the location of the data of interest. For a database, it's a DSN. For an HTTP datasource, it's the base URL.

Entity

Conceptually, an entity is processed to generate a set of documents, containing multiple fields, which (after optionally being transformed in various ways) are sent to Solr for indexing. For an RDBMS data source, an entity is a view or table, which would be processed by one or more SQL statements to generate a set of rows (documents) with one or more columns (fields).

Processor

An entity processor does the work of extracting content from a data source, transforming it, and adding it to the index. Custom entity processors can be written to extend or replace the ones supplied.

Transformer

Each set of fields fetched by the entity may optionally be transformed. This process can modify the fields, create new fields, or generate multiple rows/documents from a single row. There are several built-in transformers in the DIH, which perform functions such as modifying dates and stripping HTML. It is possible to write custom transformers using the publicly available interface.

Configuration

Configuring solrconfig.xml

The Data Import Handler has to be registered in solrconfig.xml. For example:
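
A typical registration looks like the following sketch; the handler name and config file path are illustrative, and the <lib> directive is only needed if the DIH jars are not already on the classpath:

<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/my/DIHconfigfile.xml</str>
  </lst>
</requestHandler>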

The only required parameter is the config parameter, which specifies the location of the DIH configuration file that contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to generate the Solr documents to be posted to the index.

You can have multiple DIH configuration files. Each file would require a separate definition in the solrconfig.xml file, specifying a path to the file.

Configuring the DIH Configuration File

An annotated configuration file, based on the "db" collection in the dih example server, is shown below (example/example-DIH/solr/db/conf/db-data-config.xml). It extracts fields from the four tables defining a simple product database. More information about the parameters and options shown here is provided in the sections following.
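
The file itself is not reproduced here; an abridged sketch of its general shape (the actual file contains additional fields and comments) looks like this:

<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex"
              user="sa" password="secret"/>
  <document>
    <!-- the root entity: one Solr document per item row -->
    <entity name="item" query="select * from item"
            deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name"/>
      <!-- a child entity: its rows are merged into the parent document -->
      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
        <field name="features" column="DESCRIPTION"/>
      </entity>
      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
        <entity name="category"
                query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
          <field column="DESCRIPTION" name="cat"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>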

Datasources can still be specified in solrconfig.xml. These must be specified in the defaults section of the handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded.

The entire configuration itself can be passed as a request parameter using the dataConfig parameter rather than using a file. When configuration errors are encountered, the error message is returned in XML format.

A reload-config command is also supported, which is useful for validating a new configuration file, or if you want to specify a file, load it, and not have it reloaded again on import. If there is a mistake in the configuration XML, a user-friendly message is returned in XML format. You can then fix the problem and run reload-config again.

You can also view the DIH configuration in the Solr Admin UI and there is an interface to import content.

Request Parameters

Request parameters can be substituted in configuration with placeholder ${dataimporter.request.paramname}.   
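
For example, a JDBC data source can take its connection details from request parameters (the parameter names jdbcurl, jdbcuser, and jdbcpassword below are arbitrary):

<dataSource driver="org.hsqldb.jdbcDriver"
            url="${dataimporter.request.jdbcurl}"
            user="${dataimporter.request.jdbcuser}"
            password="${dataimporter.request.jdbcpassword}"/>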

Then, these parameters can be passed to the full-import command or defined in the <defaults> section in solrconfig.xml. This example shows the parameters with the full-import command:

dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

Data Import Handler Commands

DIH commands are sent to Solr via an HTTP request. The following operations are supported.

Command

Description

abort

Aborts an ongoing operation. The URL is http://<host>:<port>/solr/<collection_name>/dataimport?command=abort.

delta-import

For incremental imports and change detection. The command is of the form http://<host>:<port>/solr/<collection_name>/dataimport?command=delta-import. It supports the same clean, commit, optimize and debug parameters as the full-import command. Only the SqlEntityProcessor supports delta imports.

full-import

A Full Import operation can be started with a URL of the form http://<host>:<port>/solr/<collection_name>/dataimport?command=full-import. The command returns immediately. The operation will be started in a new thread and the status attribute in the response should be shown as busy. The operation may take some time depending on the size of the dataset. Queries to Solr are not blocked during full-imports.
When a full-import command is executed, it stores the start time of the operation in a file located at conf/dataimport.properties. This stored timestamp is used when a delta-import operation is executed.
For a list of parameters that can be passed to this command, see below.

reload-config

If the configuration file has been changed and you wish to reload it without restarting Solr, run the command

http://<host>:<port>/solr/<collection_name>/dataimport?command=reload-config

status

The URL is http://<host>:<port>/solr/<collection_name>/dataimport?command=status. It returns statistics on the number of documents created, deleted, queries run, rows fetched, status, and so on.

show-config

Responds with the current configuration.

Parameters for the full-import Command

The full-import command accepts the following parameters:

Parameter

Description

clean

Default is true. Tells whether to clean up the index before the indexing is started.

commit

Default is true. Tells whether to commit after the operation.

debug

Default is false. Runs the command in debug mode. It is used by the interactive development mode. Note that in debug mode, documents are never committed automatically. If you want to run debug mode and commit the results too, add commit=true as a request parameter.

entity

The name of an entity directly under the <document> tag in the configuration file. Use this to execute one or more entities selectively. Multiple "entity" parameters can be passed on to run multiple entities at once. If nothing is passed, all entities are executed.

optimize

Default is true. Tells Solr whether to optimize after the operation.

synchronous

Blocks the request until the import is completed. Default is false.

Property Writer

The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.
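
For example (attribute values shown are illustrative):

<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
                directory="data" filename="my_dih.properties" locale="en-US"/>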

The parameters available are:

Parameter

Description

dateFormat

A java.text.SimpleDateFormat to use when converting the date to text. The default is "yyyy-MM-dd HH:mm:ss".

type

The implementation class. Use SimplePropertiesWriter for non-SolrCloud installations. If using SolrCloud, use ZKPropertiesWriter. If this is not specified, it defaults to the appropriate class depending on whether SolrCloud mode is enabled.

directory

Used with the SimplePropertiesWriter only. The directory for the properties file. If not specified, the default is "conf".

filename

Used with the SimplePropertiesWriter only. The name of the properties file. If not specified, the default is the requestHandler name (as defined in solrconfig.xml) with ".properties" appended (i.e., "dataimport.properties").

locale

The locale. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

Data Sources

A data source specifies the origin of data and its type. Somewhat confusingly, some data sources are configured within the associated entity processor. Data sources can also be specified in solrconfig.xml, which is useful when you have multiple environments (for example, development, QA, and production) differing only in their data sources.

You can create a custom data source by writing a class that extends org.apache.solr.handler.dataimport.DataSource.

The mandatory attributes for a data source definition are its name and type. The name identifies the data source to an Entity element.

The types of data sources available are described below.

ContentStreamDataSource

This takes the POST data as the data source. This can be used with any EntityProcessor that uses a DataSource<Reader>.

FieldReaderDataSource

This can be used where a database field contains XML which you wish to process using the XPathEntityProcessor. You would set up a configuration with both JDBC and FieldReader data sources, and two entities, as follows:
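
A sketch of such a configuration, with illustrative entity, column, and data source names (the nested XPath entity reads its XML from the parent's column through the dataField attribute instead of a url):

<dataConfig>
  <dataSource name="jdbc" driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa"/>
  <dataSource name="fieldReader" type="FieldReaderDataSource"/>
  <document>
    <entity name="dbEntity" dataSource="jdbc" processor="SqlEntityProcessor"
            query="select id, xml_text from tbl">
      <entity name="xmlEntity" dataSource="fieldReader" processor="XPathEntityProcessor"
              dataField="dbEntity.xml_text" forEach="/records/record">
        <field column="name" xpath="/records/record/name"/>
      </entity>
    </entity>
  </document>
</dataConfig>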

The FieldReaderDataSource can take an encoding parameter, which will default to "UTF-8" if not specified.

FileDataSource

This can be used like an URLDataSource, but is used to fetch content from files on disk. The only difference from URLDataSource, when accessing disk files, is how a pathname is specified.

This data source accepts these optional attributes.

Optional Attribute

Description

basePath

The base path relative to which the value is evaluated if it is not absolute.

encoding

Defines the character encoding to use. If not defined, UTF-8 is used.

JdbcDataSource

This is the default data source. It's used with the SqlEntityProcessor. See the example in the FieldReaderDataSource section for details on configuration. The JdbcDataSource supports at least the following attributes: driver, url, user, password, encryptKeyFile. All of them support property substitution via ${placeholders}.

URLDataSource

This data source is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location. Here's an example:
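
A sketch (names and values are illustrative):

<dataSource name="feeds" type="URLDataSource"
            connectionTimeout="5000" readTimeout="10000" encoding="UTF-8"/>

The url attribute of the entity (see XPathEntityProcessor below) then points at the file:// or http:// resource to fetch.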

The URLDataSource type accepts these optional parameters:

Optional Parameter

Description

baseURL

Specifies a new baseURL for pathnames. You can use this to specify host/port changes between Dev/QA/Prod environments. Using this attribute isolates the changes to be made to solrconfig.xml.

connectionTimeout

Specifies the length of time in milliseconds after which the connection should time out. The default value is 5000ms.

encoding

By default the encoding in the response header is used. You can use this property to override the default encoding.

readTimeout

Specifies the length of time in milliseconds after which a read operation should time out. The default value is 10000ms.

Entity Processors

Entity processors extract data, transform it, and add it to a Solr index. Examples of entities include views or tables in a data store.

Each processor has its own set of attributes, described in its own section below. In addition, there are non-specific attributes common to all entities which may be specified.

Attribute

Use

dataSource

The name of a data source. If there are multiple data sources defined, use this attribute with the name of the data source for this entity.

name

Required. The unique name used to identify an entity.

pk

The primary key for the entity. It is optional, and required only when using delta-imports. It has no relation to the uniqueKey defined in schema.xml, but they can both be the same. When doing delta-imports, it refers to the column name used in ${dataimporter.delta.<column-name>}, which serves as the primary key.

processor

Default is SqlEntityProcessor. Required only if the datasource is not RDBMS.

onError

Permissible values are abort, skip, and continue. The default value is 'abort'. 'skip' skips the current document. 'continue' ignores the error and processing continues.

preImportDeleteQuery

Before a full-import command, use this query to clean up the index instead of deleting all documents with '*:*'. This is honored only on an entity that is an immediate sub-child of <document>.

postImportDeleteQuery

Similar to the above, but executed after the import has completed.

rootEntity

By default the entities immediately under the <document> are root entities. If this attribute is set to false, the entity directly falling under that entity will be treated as the root entity (and so on). For every row returned by the root entity, a document is created in Solr.

transformer

Optional. One or more transformers to be applied on this entity.

cacheImpl

Optional. A class (which must implement DIHCache) to use for caching this entity when doing lookups from an entity which wraps it. The provided implementation is "SortedMapBackedCache".

cacheKey

The name of a property of this entity to use as a cache key if cacheImpl is specified.

cacheLookup

An entity + property name that will be used to look up cached instances of this entity if cacheImpl is specified.

where

An alternative way to specify cacheKey and cacheLookup, concatenated with '='. For example, where="CODE=People.COUNTRY_CODE" is equivalent to cacheKey="CODE" cacheLookup="People.COUNTRY_CODE".

child="true"

Enables indexing document blocks, aka Nested Child Documents, for searching with Block Join Query Parsers. It can only be specified on an <entity> under another root entity. It switches from the default behavior (merging field values) to nesting documents as child documents. Note: the parent <entity> should add a field which is used as a parent filter at query time.

join="zipper"

Enables the merge join, aka "zipper", algorithm for joining parent and child entities without caching. It should be specified on the child (nested) <entity>. It requires that parent and child queries return results ordered by the join keys; otherwise an exception is thrown. Keys should be specified either with the where attribute or with cacheKey and cacheLookup.

Caching of entities in DIH is provided to avoid repeated lookups for the same entities. The default SortedMapBackedCache is a HashMap where the key is a field in the row and the value is the list of rows sharing that key value.

In the example below, each manufacturer entity is cached using the 'id' property as a cache key. Cache lookups will be performed for each product entity based on the product's "manu" property. When the cache has no data for a particular key, the query is run and the cache is populated.
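
A sketch of such a configuration (table and column names are illustrative):

<entity name="product" query="select description, sku, manu from product">
  <!-- manufacturer rows are cached by id and looked up via the product's manu column -->
  <entity name="manufacturer" query="select id, name from manufacturer"
          cacheKey="id" cacheLookup="product.manu"
          cacheImpl="SortedMapBackedCache"/>
</entity>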

The SQL Entity Processor

The SqlEntityProcessor is the default processor. The associated data source should be a JdbcDataSource, configured with a JDBC URL.

The entity attributes specific to this processor are shown in the table below.

Attribute

Use

query

Required. The SQL query used to select rows.

deltaQuery

SQL query used if the operation is delta-import. This query selects the primary keys of the rows which will be part of the delta-update. The pks will be available to the deltaImportQuery through the variable ${dataimporter.delta.<column-name>}.

parentDeltaQuery

SQL query used if the operation is delta-import.

deletedPkQuery

SQL query used if the operation is delta-import.

deltaImportQuery

SQL query used if the operation is delta-import. If this is not present, DIH tries to construct the import query by modifying the 'query' (after identifying the delta), which is error prone. There is a namespace ${dataimporter.delta.<column-name>} which can be used in this query. For example: select * from tbl where id=${dataimporter.delta.id}.
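
Putting these together, a minimal delta-capable entity might look like the following sketch (table and column names are illustrative):

<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dataimporter.delta.ID}'">
  <field column="NAME" name="name"/>
</entity>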

The XPathEntityProcessor

This processor is used when indexing XML formatted data. The data source is typically URLDataSource or FileDataSource. Xpath can also be used with the FileListEntityProcessor described below, to generate a document from each file.

The entity attributes unique to this processor are shown below.

Attribute

Use

processor

Required. Must be set to "XPathEntityProcessor".

url

Required. HTTP URL or file location.

stream

Optional: Set to true for a large file or download.

forEach

Required unless you define useSolrAddSchema. The Xpath expression which demarcates each record. This will be used to set up the processing loop.

xsl

Optional: Its value (a URL or filesystem path) is the name of a resource used as a preprocessor for applying the XSL transformation.

useSolrAddSchema

Set this to true if the content is in the form of the standard Solr update XML schema.

Each field element in the entity can have the following attributes as well as the default ones.

Attribute

Use

xpath

Required. The XPath expression which will extract the content from the record for this field. Only a subset of XPath syntax is supported.

commonField

Optional. If true, then when this field is encountered in a record it will be copied to future records when creating a Solr document.

flatten

Optional. If set to true, then any child text nodes are collected to form the value of the field. Note: the default value is false, meaning that if there are any sub-elements of the node pointed to by the XPath expression, they will be quietly omitted.

Here is an example from the "rss" collection in the dih example (example/example-DIH/solr/rss/conf/rss-data-config.xml):

The MailEntityProcessor

The MailEntityProcessor uses the Java Mail API to index email messages using the IMAP protocol. The MailEntityProcessor works by connecting to a specified mailbox using a username and password, fetching the email headers for each message, and then fetching the full email contents to construct a document (one document for each mail message).

Here is an example from the "mail" collection of the dih example (example/example-DIH/mail/conf/mail-data-config.xml):

The entity attributes unique to the MailEntityProcessor are shown below.

Attribute

Use

processor

Required. Must be set to "MailEntityProcessor".

user

Required. Username for authenticating to the IMAP server; this is typically the email address of the mailbox owner.

password

Required. Password for authenticating to the IMAP server.

host

Required. The IMAP server to connect to.

protocol

Required. The IMAP protocol to use, valid values are: imap, imaps, gimap, and gimaps.

fetchMailsSince

Optional. Date/time used to set a filter to import messages that occur after the specified date; expected format is: yyyy-MM-dd HH:mm:ss.

folders

Required. Comma-delimited list of folder names to pull messages from, such as "inbox".

recurse

Optional (default is true). Flag to indicate if the processor should recurse all child folders when looking for messages to import.

include

Optional. Comma-delimited list of folder patterns to include when processing folders (can be a literal value or regular expression).

exclude

Optional. Comma-delimited list of folder patterns to exclude when processing folders (can be a literal value or regular expression); excluded folder patterns take precedence over include folder patterns.

processAttachement or processAttachments

Optional (default is true). Use Tika to process message attachments.

includeContent

Optional (default is true). Include the message body when constructing Solr documents for indexing.

Importing New Emails Only

After running a full import, the MailEntityProcessor keeps track of the timestamp of the previous import so that subsequent imports can use the fetchMailsSince filter to only pull new messages from the mail server. This occurs automatically using the Data Import Handler dataimport.properties file (stored in conf). For instance, if you set fetchMailsSince=2014-08-22 00:00:00 in your mail-data-config.xml, then all mail messages that occur after this date will be imported on the first run of the importer. Subsequent imports will use the date of the previous import as the fetchMailsSince filter, so that only new emails since the last import are indexed each time.

GMail Extensions

When connecting to a GMail account, you can improve the efficiency of the MailEntityProcessor by setting the protocol to gimap or gimaps. This allows the processor to send the fetchMailsSince filter to the GMail server to have the date filter applied on the server, which means the processor only receives new messages from the server. However, GMail only supports date granularity, so the server-side filter may return previously seen messages if run more than once a day.

The TikaEntityProcessor

The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but using the DataImportHandler options instead.

Here is an example from the "tika" collection of the dih example (example/example-DIH/tika/conf/tika-data-config.xml):

The parameters for this processor are described in the table below:

Attribute

Use

dataSource

This parameter defines the data source and an optional name which can be referred to in later parts of the configuration if needed. This is the same dataSource explained in the description of general entity processor attributes above.

The available data source types for this processor are:

  • BinURLDataSource: used for HTTP resources, but can also be used for files.
  • BinContentStreamDataSource: used for uploading content as a stream.
  • BinFileDataSource: used for content on the local filesystem.

url

The path to the source file(s), as a file path or a traditional internet URL. This parameter is required.

htmlMapper

Allows control of how Tika parses HTML. The "default" mapper strips much of the HTML from documents while the "identity" mapper passes all HTML as-is with no modifications. If this parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.

format

The output format. The options are text, xml, html or none. The default is "text" if not defined. The format "none" can be used if metadata only should be indexed and not the body of the documents.

parser

The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or other parser should be used, it should be entered as the fully qualified class name.

fields

The list of fields from the input documents and how they should be mapped to Solr fields. If the attribute meta is defined as "true", the field will be obtained from the metadata of the document and not parsed from the body of the main text.

extractEmbedded

Instructs the TikaEntityProcessor to extract embedded documents or attachments when true. If false, embedded documents and attachments will be ignored.

onError

By default, the TikaEntityProcessor will stop processing documents if it finds one that generates an error. If you define onError to "skip", the TikaEntityProcessor will instead skip documents that fail processing and log a message that the document was skipped.

The FileListEntityProcessor

This processor is basically a wrapper, and is designed to generate a set of files satisfying conditions specified in the attributes, which can then be passed to another processor, such as the XPathEntityProcessor. The entity information for this processor would be nested within the FileListEntity entry. It generates five implicit fields: fileAbsolutePath, fileDir, fileSize, fileLastModified, file, which can be used in the nested processor. This processor does not use a data source.

The attributes specific to this processor are described in the table below:

Attribute

Use

fileName

Required. A regular expression pattern to identify files to be included.

baseDir

Required. The base directory (absolute path).

recursive

Whether to search directories recursively. Default is 'false'.

excludes

A regular expression pattern to identify files which will be excluded.

newerThan

A date in the format yyyy-MM-dd HH:mm:ss or a date math expression (e.g., NOW - 2YEARS).

olderThan

A date, using the same formats as newerThan.

rootEntity

This should be set to false. This ensures that each row (filepath) emitted by this processor is considered to be a document.

dataSource

Must be set to null.

The example below shows the combination of the FileListEntityProcessor with another processor which will generate a set of fields from each file found.
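
A sketch of such a combination (paths, field names, and the XML structure being parsed are illustrative):

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/documents" fileName=".*\.xml"
            newerThan="'NOW-30DAYS'" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- fileAbsolutePath is one of the implicit fields generated by FileListEntityProcessor -->
      <entity name="file" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/rootElement">
        <field column="title" xpath="/rootElement/title"/>
        <field column="body" xpath="/rootElement/body"/>
      </entity>
    </entity>
  </document>
</dataConfig>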

LineEntityProcessor

This EntityProcessor reads all content from the data source on a line by line basis and returns a field called rawLine for each line read. The content is not parsed in any way; however, you may add transformers to manipulate the data within the rawLine field, or to create other additional fields.

The lines read can be filtered by two regular expressions specified with the acceptLineRegex and omitLineRegex attributes. The table below describes the LineEntityProcessor's attributes:

Attribute

Description

url

A required attribute that specifies the location of the input file in a way that is compatible with the configured data source. If this value is relative and you are using FileDataSource or URLDataSource, it is assumed to be relative to baseLoc.

acceptLineRegex

An optional attribute that, if present, discards any line which does not match the regular expression.

omitLineRegex

An optional attribute that is applied after any acceptLineRegex and that discards any line which matches this regExp.

For example:
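
A sketch (the url, data source name, and regular expressions are illustrative):

<entity name="fileList"
        processor="LineEntityProcessor"
        url="file:///data/files-to-index.lis"
        acceptLineRegex=".*\.xml$"
        omitLineRegex="/obsolete/"
        rootEntity="false"
        dataSource="fileReader"/>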

While there are use cases where you might need to create a Solr document for each line read from a file, it is expected that in most cases the lines read by this processor will consist of a pathname, which in turn will be consumed by another EntityProcessor, such as XPathEntityProcessor.

PlainTextEntityProcessor

This EntityProcessor reads all content from the data source into a single implicit field called plainText. The content is not parsed in any way; however, you may add transformers to manipulate the data within the plainText field as needed, or to create other additional fields.

For example:
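
A sketch (the url and data source name are illustrative):

<entity name="doc" processor="PlainTextEntityProcessor"
        url="http://example.com/a.txt" dataSource="urlSource">
  <!-- the implicit plainText field is mapped to a Solr field named "text" -->
  <field column="plainText" name="text"/>
</entity>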

Ensure that the dataSource is of type DataSource<Reader> (FileDataSource, URLDataSource).

SolrEntityProcessor

Uses a Solr instance as a data source; see https://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor. In addition, SolrEntityProcessor supports the following parameters:

cursorMark="true"

Specify this to enable cursors for efficient result set scrolling.

sort="id asc"

When cursors are enabled, a sort parameter referencing the uniqueKey field is usually required. See Pagination of Results for details.

Transformers

Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or modify existing ones. You must tell the entity which transformers your import operation will be using, by adding a transformer attribute containing a comma-separated list of transformer names to the <entity> element.

Specific transformation rules are then added to the attributes of a <field> element, as shown in the examples below. The transformers are applied in the order in which they are specified in the transformer attribute.

The Data Import Handler contains several built-in transformers. You can also write your own custom transformers, as described in the Solr Wiki (see http://wiki.apache.org/solr/DIHCustomTransformer). The ScriptTransformer (described below) offers an alternative method for writing your own transformers.

Solr includes the following built-in transformers:

Transformer Name

Use

ClobTransformer

Used to create a String out of a Clob type in the database.

DateFormatTransformer

Parse date/time instances.

HTMLStripTransformer

Strip HTML from a field.

LogTransformer

Used to log data to log files or a console.

NumberFormatTransformer

Uses the NumberFormat class in Java to parse a string into a number.

RegexTransformer

Use regular expressions to manipulate fields.

ScriptTransformer

Write transformers in Javascript or any other scripting language supported by Java.

TemplateTransformer

Transform a field using a template.

These transformers are described below.

ClobTransformer

You can use the ClobTransformer to create a string out of a CLOB in a database. A CLOB is a character large object: a collection of character data typically stored in a separate location that is referenced in the database. See http://en.wikipedia.org/wiki/Character_large_object. Here's an example of invoking the ClobTransformer.
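
A sketch (the query and column names are illustrative):

<entity name="e" transformer="ClobTransformer" query="select id, description from item">
  <!-- the CLOB column is converted to a plain string -->
  <field column="description" clob="true"/>
</entity>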

The ClobTransformer accepts these attributes:

Attribute

Description

clob

Boolean value to signal if ClobTransformer should process this field or not. If this attribute is omitted, then the corresponding field is not transformed.

sourceColName

The source column to be used as input. If this is absent, the source and target are the same.

The DateFormatTransformer

This transformer converts dates from one format to another. This would be useful, for example, in a situation where you wanted to convert a field with a fully specified date/time into a less precise date format, for use in faceting.

DateFormatTransformer applies only to fields with an attribute dateTimeFormat. Other fields are not modified.

This transformer recognizes the following attributes:

Attribute

Description

dateTimeFormat

The format used for parsing this field. This must comply with the syntax of the Java SimpleDateFormat class.

sourceColName

The column on which the dateFormat is to be applied. If this is absent, the source and target are the same.

locale

The locale to use for date transformations. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

Here is example code that returns the date rounded up to the month "2007-JUL":
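
A sketch, with illustrative query and column names (the fulldate column holds values such as "2007-JUL"):

<entity name="en" pk="id" transformer="DateFormatTransformer"
        query="select id, fulldate from tbl">
  <field column="date" sourceColName="fulldate" dateTimeFormat="yyyy-MMM" locale="en-US"/>
</entity>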

The HTMLStripTransformer

You can use this transformer to strip HTML out of a field. For example:
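
A sketch (the query and column names are illustrative):

<entity name="e" transformer="HTMLStripTransformer" query="select id, html_body from docs">
  <field column="html_body" stripHTML="true"/>
</entity>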

There is one attribute for this transformer, stripHTML, which is a boolean value (true/false) to signal if the HTMLStripTransformer should process the field or not.

The LogTransformer

You can use this transformer to log data to the console or log files. For example:
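
A sketch (the query, template text, and log level are illustrative):

<entity name="e" transformer="LogTransformer"
        logTemplate="The name is ${e.name}" logLevel="info"
        query="select id, name from docs">
</entity>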

Unlike other transformers, the LogTransformer does not apply to any field, so the attributes are applied on the entity itself.

The NumberFormatTransformer

Use this transformer to parse a number from a string, converting it into the specified format, and optionally using a different locale.

NumberFormatTransformer will be applied only to fields with an attribute formatStyle.

This transformer recognizes the following attributes:

Attribute

Description

formatStyle

The format used for parsing this field. The value of the attribute must be one of (number|percent|integer|currency). This uses the semantics of the Java NumberFormat class.

sourceColName

The column on which the NumberFormat is to be applied. If this attribute is absent, the source column and the target column are the same.

locale

The locale to be used for parsing the strings. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

For example:
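
A sketch (the query, column name, and locale are illustrative):

<entity name="en" pk="id" transformer="NumberFormatTransformer"
        query="select id, price from item">
  <!-- parse the price string as a currency value using a UK locale -->
  <field column="price" formatStyle="currency" locale="en-GB"/>
</entity>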

The RegexTransformer

The regex transformer helps in extracting or manipulating values from fields (from the source) using regular expressions. The actual class name is org.apache.solr.handler.dataimport.RegexTransformer, but as it belongs to the default package, the package name can be omitted.

The table below describes the attributes recognized by the regex transformer.

Attribute

Description

regex

The regular expression that is used to match against the column or sourceColName's value(s). If replaceWith is absent, each regex group is taken as a value and a list of values is returned.

sourceColName

The column on which the regex is to be applied. If not present, then the source and target are identical.

splitBy

Used to split a string. It returns a list of values. note: this is a regular expression – it may need to be escaped (e.g. via back-slashes)

groupNames

A comma-separated list of field column names, used where the regex contains groups and each group is to be saved to a different field. If some groups are not to be named, leave a space between commas.

replaceWith

Used along with regex. It is equivalent to the method new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>).

Here is an example of configuring the regex transformer:
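
A sketch, using the full_name and emailids columns discussed below (the query and regular expressions are illustrative):

<entity name="person" transformer="RegexTransformer"
        query="select full_name, emailids from people">
  <field column="full_name"/>
  <!-- derive two new fields from full_name -->
  <field column="firstName" regex="Mr(\w*)\b.*" sourceColName="full_name"/>
  <field column="lastName" regex="Mr.*?\b(\w*)" sourceColName="full_name"/>
  <!-- split the comma-separated emailids column into a multivalued mailId field -->
  <field column="mailId" splitBy="," sourceColName="emailids"/>
</entity>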

In this example, regex and sourceColName are custom attributes used by the transformer. The transformer reads the field full_name from the resultset and transforms it to two new target fields, firstName and lastName. Even though the query returned only one column, full_name, in the result set, the Solr document gets two extra fields firstName and lastName which are "derived" fields. These new fields are only created if the regexp matches.

The emailids field in the table can be a comma-separated value. It ends up producing one or more email IDs, and we expect the mailId to be a multivalued field in Solr.

Note that this transformer can either be used to split a string into tokens based on a splitBy pattern, or to perform a string substitution as per replaceWith, or it can assign groups within a pattern to a list of groupNames. It decides what to do based upon the attributes splitBy, replaceWith and groupNames, which are checked in that order. The first one found is acted upon and the other unrelated attributes are ignored.

The ScriptTransformer

The script transformer allows arbitrary transformer functions to be written in any scripting language supported by Java, such as Javascript, JRuby, Jython, Groovy, or BeanShell. Javascript is integrated into Java 8; you'll need to integrate other languages yourself.

Each function you write must accept a row variable (which corresponds to a Java Map<String, Object>, thus permitting get, put, and remove operations). You can thus modify the value of an existing field or add new fields. The function must return the (possibly modified) row.

The script is inserted into the DIH configuration file at the top level and is called once for each row.

Here is a simple example.
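
A minimal sketch (the function name, entity, and query are illustrative):

<dataConfig>
  <script><![CDATA[
    function addCategory(row) {
      // row is a java.util.Map; put() adds or replaces a field
      row.put('category', 'default');
      return row;
    }
  ]]></script>
  <document>
    <entity name="e" transformer="script:addCategory" query="select id, name from item"/>
  </document>
</dataConfig>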

The TemplateTransformer

You can use the template transformer to construct or modify a field value, perhaps using the value of other fields. You can insert extra text into the template.
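
A sketch (the entity, query, and template are illustrative); the template concatenates two source columns with literal text:

<entity name="en" transformer="TemplateTransformer"
        query="select firstName, lastName from people">
  <field column="displayName" template="Name: ${en.firstName} ${en.lastName}"/>
</entity>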

Special Commands for the Data Import Handler

You can pass special commands to the DIH by adding any of the variables listed below to any row returned by any component:

Variable

Description

$skipDoc

Skip the current document; that is, do not add it to Solr. The value can be the string true|false.

$skipRow

Skip the current row. The document will be added with rows from other entities. The value can be the string true|false.

$docBoost

Boost the current document. The boost value can be a number or the toString conversion of a number.

$deleteDocById

Delete a document from Solr with this ID. The value has to be the uniqueKey value of the document.

$deleteDocByQuery

Delete documents from Solr using this query. The value must be a Solr Query.


27 Comments

  1. The sample data-config.xml file is another place where in the PDF version of the page about half a page of blank space is being inserted in the code block and pushing the end of the example to the next page. Again, not sure why that's happening or how to fix it, but warrants a bit more investigation for next Ref Guide release.

    1. yeah ... i tried to edit the config down to remove a lot of whitespace but it still doesn't fit on one page, so it inserts that big gap in the middle.

      nothing left to cut out w/o affecting the utility of the example, so best just to live with it for now.

  2. Format error in the preImportDeleteQuery definition: "… this to cleanup the index instead of using ':' …". The stars of the match-all-docs query are missing. Somehow Confluence does not allow writing *:* without spaces.

    1. I was finally able to get it to take. I probably only tricked it temporarily, though.

    1. CachedSqlEntityProcessor was deprecated a long time ago, so we probably shouldn't be bringing it up here – but a very good point Doug made on IRC was that the "replacement" option of specifying a cacheImpl on entities doesn't seem to be documented at all.

    2. noble has added some info about using the cacheImpl + cacheKey + cacheLookup attributes on entities

  3. Edit: I pasted the link to the JIRA URL and this editor did something automatically to turn it into an object, which looked cool when editing. But then it shows as an error when not in edit mode. The JIRA I'm trying to point to is SOLR-3076.

    I think adding support for Block Join parent / child documents may have been added to DIH in SOLR-3076; see the dih-config.xml patch from April 14th, 2013. A couple issues:

    1: I'm not 100% sure that this was the final supported syntax and that everything made it in.  I don't have a similar database setup to just "try it".

    2: The example XML doesn't have much in the way of comments, so it's not really clear what it's doing.

    3: If we do really now support creating full block join parent child docs, it should be explained in this page (I realize that's a lot of work)

    1. there is no direct block join support in DIH, it looks like the patch you're referring to was pulled out of SOLR-3076 and spun off into SOLR-5147

  4. In the example that is given in DateFormatTransformer, 

    <entity name="en" pk="id" transformer="DateTimeTransformer" ... >

    should be

    <entity name="en" pk="id" transformer="DateFormatTransformer" ... >
    1. Thanks Sandesh, that example is fixed now.

  5. FieldReaderDataSource example is very helpful. Thanks!

    processor="SQLEntityProcessor"

    should be 

    processor="SqlEntityProcessor"

    https://lucene.apache.org/solr/4_10_2/solr-dataimporthandler/org/apache/solr/handler/dataimport/SqlEntityProcessor.html

  6. I think under Entity Processors, the first table entry datasource should actually be dataSource (capital S). DIH is case-sensitive yet ignores unknown attributes.

    1. Fixed, thanks! I also edited the description for the entry to be more clear.

  7. There's some inconsistency in spelling between "SortedMapBackedCache" and "SortedMapBachedCache".

    1. Thanks for the report.  Fixed.

  8. Missing SolrEntityProcessor, one of the strongest capabilities dataimport has.

    I use it to reindex after version or schema changes, works perfect for me.

    https://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

  9. There is an error on pk setting in the JDBC Delta import example:  https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example 

    If there is a deltaQuery, we cannot set more than one column as pk:

    <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"

                         query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
                         deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dih.last_index_time}'"
                         parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">

    I have tested this example, and this was the error info :

    2015-11-16 15:59:56,096 ERROR org.apache.solr.handler.dataimport.DataImporter: Delta Import Failed
    java.lang.RuntimeException: java.lang.IllegalArgumentException: deltaQuery has no column to resolve to declared primary key pk='ITEM_ID,CATEGORY_ID'
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
    at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:444)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:482)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)

     

    I changed the pk as  pk="ITEM_ID" , and it finished delta import successfully.

    1. Multiple column PK seems implemented, but not well covered by tests. wangyanbin, can you try to remove space between them, i.e. pk="ITEM_ID,CATEGORY_ID"?

      1. Mikhail Khludnev, in my test there is no space between the two columns. We can see it from the log: java.lang.IllegalArgumentException: deltaQuery has no column to resolve to declared primary key pk='ITEM_ID,CATEGORY_ID'

        I have checked the code: in the collectDelta function, for each ModifiedRowKey, it will get the value of the pk column:

        Object pkValue = row.get(pk);

        But it doesn't consider the case of a multi-column pk.

        And the processing of deletedSet also doesn't support a multi-column pk.

        In the deltaQuery or deletedPkQuery SQL, we can concat two column values into one pk value, which matches the uniqueKey of the Solr document.

        So we may not need support for a multi-column pk.

  10. What does "refer to the column ....primary key" in the sentence "It is mandatory if you do delta-imports and then refers to the column name in ${dataimporter.delta.<column-name>} which is used as the primary key." mean? I mean, I know we can use dih.delta.<column-name> to access a field name returned by deltaQuery in deltaImportQuery, but why explain it in the description of the 'pk' attribute? What has it got to do with the "pk" attribute? And what does setting "pk" actually influence when processing the entity?

  11. If you use the "pk" parameter, then every SQL query (query, deltaQuery, and deltaImportQuery) must have that column name in its result set, or the import will fail.  It's a validation parameter and I am really not sure why it's required for delta functionality.  There's probably a good reason, but I do not know what that reason is.

    At one point I was actually using delta-import, though I am not now.  Back then, my "query" and "deltaImportQuery" were identical, but I had problems when I tried to use "SELECT 1" as my deltaQuery.  It didn't work until I changed that to "SELECT 1 AS did" ... because my pk value is "did".  I was controlling the delta query using externally supplied parameters, so I did not need the deltaQuery to return anything useful, just a success value indicating "yes, there are records to import."

  12. Hi,

    I'm still very much a newbie in the realms of Solr.

    I'm working on a web application in which I have used Apache Solr and a MySQL database.

    I'm using the Data Import Handler to index data from my database,

    and I'm wondering if you can help me find a solution to my problem:

    my database schema looks like this

     

    one book is associated with one category (or discipline), which can have many subcategories (or subdisciplines), and each subcategory can have other subcategories, and so on...

    1/ how can Solr index each book with its associated categories, subcategories, sub-subcategories, and so on?

    2/ I tried to configure a data import handler in this way:

     

    <entity name="Book" dataSource="testDatabase_Source" query="SELECT
    `bookRecordId`,
    `bookRecordAuthor`,
    `bookRecordEdition`,
    `bookRecordIsbn`
    FROM `BookRecord` where bookRecordId='${Nomenclature.bookRecordId}'">
    <field column="bookRecordId" name="bookRecordId" />
    <field column="bookRecordAuthor" name="bookRecordAuthor" />
    <field column="bookRecordEdition" name="bookRecordEdition" />
    <field column="bookRecordIsbn" name="bookRecordIsbn" />

    <entity name="Discipline_BookRecord" dataSource="testDatabase_Source" query="SELECT
    `Discipline_disciplineId`
    FROM Discipline_BookRecord where bookRecords_bookRecordId='${BookRecord.bookRecordId}'">
    <entity name="Discipline" dataSource="BbpDatabase_Source" query="SELECT
    `disciplineId`,
    `disciplineName`,
    `disciplineDescription`
    FROM Discipline where disciplineId='${Discipline_BookRecord.Discipline_disciplineId}'">
    <field name="SubDisciplineId" column="disciplineId" />
    <field name="SubDisciplineName" column="disciplineName" />
    <field name="SubDisciplineDescription" column="disciplineDescription" />
    </entity>
    <entity name="SubDiscipline" dataSource="testDatabase_Source" query="SELECT
    `SubDiscipline_disciplineId`
    FROM SubDiscipline where SubDiscipline_subDisciplineId='${Discipline_BookRecord.Discipline_disciplineId}'">
    <field name="DisciplineId" column="SubDiscipline_disciplineId" />
    <entity name="Discipline" dataSource="BbpDatabase_Source" query="SELECT
    `disciplineName`,
    `disciplineDescription`
    FROM Discipline where disciplineId='${SubDiscipline.SubDiscipline_disciplineId}'">
    <field name="DisciplineName" column="disciplineName" />
    <field name="DisciplineDescription" column="disciplineDescription" />
    </entity>

    </entity>
    </entity>
    </entity>

    but in this way I can only index book info with a discipline and a subdiscipline, without getting the sub-subdiscipline indexed. What should I modify in order to index all subdisciplines and sub-subdisciplines and so on?


    thank you so much

     

     

    1. The comments on the reference guide are not the proper place to get support.  Please use the mailing list or IRC channel.  The mailing list has a larger audience.

      http://lucene.apache.org/solr/resources.html#community

  13. Can schemaless mode with the updateRequestProcessorChain be used for this in some way? If it could, how to do that? Thanks very much!

    1. Please use the mailing list or the IRC channel for support requests.

      http://lucene.apache.org/solr/resources.html#community