Clinical Documents from an EMR are typically stored in a database. We developed an UIMA CollectionReader that retrieves these documents for annotation (thereby avoiding exporting the documents to the file system). The DBCollection reader works as follows:

  1. read all document unique ids into an internal list
  2. iterate through each id in the list, and retrieve the document for annotation.

Documents in a database typically are given a Unique ID, and are cross-referenced with other information (e.g. patient id, document type, document date, etc.). The YTEX DBConsumer can store the unique id in the database along with other document annotations. This allows you to cross-reference annotation data with other document data.

By default, YTEX supports an integer document identifier (INSTANCE_ID) and/or a string document identifier (INSTANCE_KEY). When using the FileSystemCollection reader, the file name is stored in the INSTANCE_KEY.

The YTEX DBCollection reader is parameterized by 2 queries: a key query and a document query. The key query loads document unique ids from a database, and the document query loads a document for a given unique id. For more information, refer to the example below.

Note that you must perform the additional YTEX installation tasks to use this component.

CPE DBCollectionReader Example

To illustrate this, we can simply retrieve sample documents stored in the fracture_demo table. To configure and execute the CPE, do the following:

1. Start New CPE

  • Start CPE Configuration Tool: Run CTAKES_HOME\bin\runctakesCPE.bat
  • Clear the CPE Configuration: File->Clear

2. Configure the Collection Reader

  • In the Collection Reader section click the Browse Button. For the Descriptor, specify CTAKES_HOME\desc\ytex-uima\desc\collection_reader\DBCollectionReader.xml.
  • Configure Queries: The 'Query Document Keys' and 'Query Get Document' parameters will appear. Specify queries to retrieve keys and documents. The key query returns a integer key, named instance_id. The document query has a instance_id parameter. The instance_id is stored in the ytex document table, allowing you to link annotations to the source document. Here are the queries for the fracture demo:

MS SQL Server

  • Document Key Query (replace schema to match your configuration)
    select note_id instance_id from <schema>.fracture_demo
  • Document Query
    select note_text from <schema>.fracture_demo where note_id = :instance_id


  • Document Key Query
    select note_id instance_id from  fracture_demo
  • Document Query
    select note_text from fracture_demo where note_id = :instance_id


  • Document Key Query
    select note_id instance_id from fracture_demo
  • Document Query
    select note_text from fracture_demo where note_id = :instance_id

3. Configure Analysis Engine

Click on the 'Add' button in the 'Analysis Engine' section, and select CTAKES_HOME\desc\ytex-uima\desc\analysis_engine\AggregatePlaintextUMLSProcessor.xml

  • analysis_batch: Documents can be assigned a 'group' or analysis_batch. specify 'test2' here.
  • Run the CPE: Click the 'Play' button
  • View results
  select * from 
where analysis_batch = 'test2'

Notice that the instance_id column is set, and refers to the fracture_demo.note_id.

Custom Key Mapping Example

This example demonstrates how to use custom document keys. Let's assume that clinical documents have a unique identifier that comprises 2 fields - note_id (integer) and site_id (character) - and that these documents are stored in the fracture_demo table. We would like to link our annotations to the original document, so we need to store both the note_id and site_id in the ytex document table. In this example, we map these columns to the instance_id and site_id columns in the document table. (We could map the site_id to INSTANCE_KEY, but that would be no fun.)

To run this example, do the following:

  • modify the ytex document table and add a site_id column. e.g. for MS SQL Server:


alter table document add site_id varchar(20)
  • Start the collection processing engine

Windows: Run YTEX_HOME/ytexCPE.cmd Unix: from a shell run the following commands

. ${HOME}/ytex.profile
cd $
java $
  • Open the CPE Descriptor

Go to File->Open, and select YTEX_HOME/examples/cpe-fracture/fracture-demo.cpe.xml. The query to get the keys looks like this (notice how note_id was renamed to instance_id):

select note_id instance_id, site_id from fracture_demo

The query to get the document looks like this:

select note_text from fracture_demo where note_id = :instance_id and site_id = :site_id

The CPE config "Store Doc Text" checkbox is unchecked: we will note store the document text in the document.doc_text column, because it is already in the database - we can join the document and fracture_demo tables on the uid/site_id columns to get the corresponding text.

  • Run the CPE: Press the play button
  • Verify that the instance_id and site_id fields have been set
select * from document where analysis_batch = 'cpe-fracture'


DBCollectionReader Configuration Parameters

  • Query Document Keys
    This query retrieves the document keys - the way to uniquely identify each document.
  • Query Get Document
    This query retrieves the document using a key. Key parameters are specified using :. The column names returned from the document key query must match the parameter names exactly (case sensitive). Oracle users: oracle will return all column names as uppercase, unless you rename them using a quoted string (as in the uid column in the example above).
  • DB Driver / DB URL: 
    By default the DBCollectionReader queries the YTEX database (leave these fields empty). You can use these fields configure the DBCollectionReader to query a different database to read documents (annotations will continue to be stored in the YTEX database). Note that most JDBC URLs support a format that allows the specification of a database username and password in the URL.
  • Key to document column mapping
    We match key values to columns in the document table by column name (case-insensitive) and data type. We have tested this with non-decimal numeric types (short, int, bigint ...) and character types (char, varchar, ...). Problems may arise with illegal column names that require escaping.
  • No labels