Overview of Smoking status

The "smoking status" pipeline processes flat files or CDA (Clinical Document Architecture) documents and classifies each patient record into one of five predetermined categories: past smoker (P), current smoker (C), smoker (S), nonsmoker (N), or unknown (U). Past and current smokers are distinguished by temporal expressions in the patient's medical records.
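As a minimal sketch, the five categories and their one-letter codes could be modeled as a Java enum. The enum name, constant names, and the fromCode helper are hypothetical and only illustrate the mapping; the pipeline's own permitted strings are defined in edu.mayo.bmi.smoking.Const.java.

```java
import java.util.Optional;

/** Hypothetical enum for the five categories; not the pipeline's own constants. */
enum SmokingStatus {
    PAST_SMOKER("P"),
    CURRENT_SMOKER("C"),
    SMOKER("S"),
    NON_SMOKER("N"),
    UNKNOWN("U");

    private final String code;

    SmokingStatus(String code) {
        this.code = code;
    }

    String code() {
        return code;
    }

    /** Looks up a category by its one-letter code, ignoring case and surrounding spaces. */
    static Optional<SmokingStatus> fromCode(String code) {
        if (code == null) {
            return Optional.empty();
        }
        for (SmokingStatus status : values()) {
            if (status.code.equalsIgnoreCase(code.trim())) {
                return Optional.of(status);
            }
        }
        return Optional.empty();
    }
}
```

For example, SmokingStatus.fromCode("P") yields PAST_SMOKER, while an unrecognized code yields an empty Optional rather than an exception.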

Analysis engines (annotators)


The file desc/analysis_engine/SimulatedProdSmokingTAE.xml provides a working example of the smoking status pipeline, utilizing the aggregate TAEs. This Aggregate includes Token, Sentence, SentenceAdjuster, ClassifiableEntries (which in turn invokes the ProductionPostSentenceAggregate annotators internally).

Shipped with this annotator:

  • ExternalBaseAggregateTAE
  • SentenceAdjuster
  • ClassifiableEntriesAnnotator

SimulatedProdSmokingTAE_CDA.xml is also provided to process CDA documents. The aggregate flow will contain the annotator version ExternalBaseAggregateTAE_CDA.xml which will process the document as a Clinical Document Architecture (CDA) file.


The aggregate TAE in desc/analysis_engine/ProductionPostSentenceAggregate_step1.xml runs the first classification stage via the KuRuleBasedClassifierAnnotator.

  • TokenizerAnnotator (core project)
  • KuRuleBasedClassifierAnnotator

This annotator is not contained in the aggregate flow, but introduced via the resource settings of the ClassifiableEntriesAnnotator (see the method initialize() in this class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep1, ResMgr, null) instantiates the AE and CasCreationUtils.createCas(taeStep1.getAnalysisEngineMetaData()).getJCas() retrieves the CAS.


The file desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml is the Aggregate TAE used to run the second classification stage via the libSVM training module. Shipped with this annotator:

  • PcsClassifierAnnotator_libsvm
  • ArtificialSentenceAnnotator
  • SentenceAdjuster
  • SmokingStatusDictionaryLookupAnnotator
  • NegationAnnotator

This annotator is not contained in the aggregate flow, but introduced via the resource settings of the ClassifiableEntriesAnnotator (see the method initialize() in this class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep2, ResMgr, null) instantiates the AE, and the process method of the ClassifiableEntriesAnnotator runs it only if the smoking status is known.


The file desc/analysis_engine/ExternalBaseAggregateTAE.xml provides an aggregate flow for the external annotators: SimpleSegmentAnnotator, TokenizerAnnotator, SentenceDetectorAnnotator, and LvgAnnotator. Shipped with this annotator:

  • SimpleSegmentAnnotator
  • TokenizerAnnotator (core project)
  • SentenceDetectorAnnotator (core project)
  • LvgAnnotator (LVG project)

ExternalBaseAggregateTAE_CDA.xml is also provided to process CDA documents. Its aggregate flow contains the specialized class CdaCasInitializer (replacing the SimpleSegmentAnnotator used by the flat-file, non-CDA version), which processes the document as a Clinical Document Architecture (CDA) file. This annotator is contained in the SimulatedProdSmokingTAE_CDA aggregate.


The file desc/analysis_engine/SentenceAdjuster.xml drives the java class edu.mayo.bmi.smoking.ae.SentenceAdjuster, an annotator that applies word patterns, and rules about those patterns, to adjust certain annotations. This annotator was extended to handle sentence boundaries for the smoking status classification.

Example: "Tobacco: none" has two sentences as detected by the original cTAKES sentence boundary detector. This annotator merges them into one sentence to enable correct negation detection.

UseSegments <Boolean/Single-valued/Optional>
(Default Value = 'false') Flag whether to use segments or full doc text.

SegmentsToSkip <String/Multi-valued/Optional>

WordsToIgnore <String/Multi-valued/Optional>
(Default Value = 'null') Set of words that PostModifier should ignore (act as if the word was not there) when looking for a pattern match.

WordsInPattern <String/Multi-valued/Required>
(Default Value = 'no none never quit smoked ;') The list of words ("none", "no", etc) used in the pattern.
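The "Tobacco: none" merge described above can be illustrated with a small, self-contained sketch. The merge rule used here (join a sentence to the previous one when the previous sentence ends with a colon and the next one starts with a pattern word) is an assumption made for illustration; the class and method names are hypothetical, and the actual rules live in edu.mayo.bmi.smoking.ae.SentenceAdjuster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of pattern-based sentence merging; not the SentenceAdjuster code. */
final class SentenceMergeSketch {
    // Mirrors the documented default of WordsInPattern.
    static final Set<String> WORDS_IN_PATTERN =
            new HashSet<>(Arrays.asList("no", "none", "never", "quit", "smoked", ";"));

    /** Merges a sentence into the previous one under the assumed colon + pattern-word rule. */
    static List<String> merge(List<String> sentences) {
        List<String> merged = new ArrayList<>();
        for (String sentence : sentences) {
            String current = sentence.trim();
            if (!merged.isEmpty()) {
                String previous = merged.get(merged.size() - 1);
                // First whitespace-delimited word, lower-cased, without a trailing period or comma.
                String firstWord = current.split("\\s+")[0]
                        .toLowerCase(Locale.ROOT)
                        .replaceAll("[.,]$", "");
                if (previous.endsWith(":") && WORDS_IN_PATTERN.contains(firstWord)) {
                    merged.set(merged.size() - 1, previous + " " + current);
                    continue;
                }
            }
            merged.add(current);
        }
        return merged;
    }
}
```

With the input sentences "Tobacco:" and "none", this sketch returns the single sentence "Tobacco: none", which is the form that negation detection needs.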


The file desc/analysis_engine/ClassifiableEntriesAnnotator.xml drives the java class edu.mayo.bmi.smoking.ae.ClassifiableEntries. This annotator converts Sentences to ClassifiableEntries (required by the SmokingStatus pipeline) and ultimately to RecordSentence.

TruthFile <String/Single-valued/Optional>
(Default Value = 'null') Delimited Truth file. Delimiter is expected to be the TAB char. If not specified, then the classification feature of the RecordSentence object will not be set.

AllowedClassifications <String/Multi-valued/Optional>
UNKNOWN"') See edu.mayo.bmi.smoking.Const.java for permitted string

SectionsToIgnore <String/Multi-valued/Optional>
(Default Value = '"20109" "20138"') Sections to ignore for ClassifiableEntries - Family History (20109). A given patient's smoking status could be confused by smoking status of others. To avoid this confusion there is an option to exclude certain sections such as family history.

ConWordsFile <String/Single-valued/Optional>
(Default Value = '$main_root/resources/ss/data/context/negationContradictionWords.txt')
Contradiction words list. If this word appears in sentence do not

UimaDescriptorStep1
(Default Value = '$main_root/desc/analysis_engine/ProductionPostSentenceAggregate_step1.xml')
Annotator responsible for the first classification step, namely,

UimaDescriptorStep2
(Default Value = '$main_root/desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml')
Annotator responsible for second classification step.

UimaDescriptorStep1 and UimaDescriptorStep2 are introduced as resources by the ClassifiableEntriesAnnotator during the initialization step. This allows the specified aggregates to be instantiated and their analysis processing to be handled on a separate asynchronous thread. Overall performance improves because the output of the ProductionPostSentenceAggregates is prepared for the process method without requiring a synchronized data flow (i.e. an explicit aggregate flow in the component descriptor).


The file desc/analysis_engine/KuRuleBasedClassifierAnnotator.xml drives the java class edu.mayo.bmi.smoking.ae.KuRuleBasedClassifierAnnotator. It is a known vs. unknown classifier that uses smoking-related keywords.

CaseSensitive <String/Single-valued/Required>
(Default Value = 'false') Specifies if a distinction between lower and upper case text will be considered.

classAttribute <String/Single-valued/Required>
(Default Value = 'smoking_status') Value used by the NominalAttributeValue via setAttributeName.

SmokingWordsFile <String/Single-valued/Required>
(Default Value = 'ss/data/KU/keywords.txt') Smoking related keywords to identify "known" class.

UnknownWordsFile <String/Single-valued/Required>
(Default Value = 'ss/data/KU/unknown_words.txt') If this word/phrase appears, treat the sentence as UNKNOWN.
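The interplay of the CaseSensitive, SmokingWordsFile, and UnknownWordsFile parameters can be sketched as follows. This is an illustration under stated assumptions, not the KuRuleBasedClassifierAnnotator code: plain substring matching, the precedence of unknown words over smoking keywords, the class name, and the example word lists are all assumptions.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical known/unknown keyword classifier; matching strategy is assumed. */
final class KnownUnknownSketch {
    private final List<String> smokingWords;
    private final List<String> unknownWords;
    private final boolean caseSensitive;

    KnownUnknownSketch(List<String> smokingWords, List<String> unknownWords, boolean caseSensitive) {
        this.smokingWords = smokingWords;
        this.unknownWords = unknownWords;
        this.caseSensitive = caseSensitive;
    }

    private String normalize(String text) {
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }

    /** Returns "UNKNOWN" or "KNOWN" for one sentence. */
    String classify(String sentence) {
        String text = normalize(sentence);
        // An unknown word or phrase forces UNKNOWN, as described for UnknownWordsFile.
        for (String word : unknownWords) {
            if (text.contains(normalize(word))) {
                return "UNKNOWN";
            }
        }
        // Otherwise any smoking-related keyword marks the sentence as KNOWN.
        for (String word : smokingWords) {
            if (text.contains(normalize(word))) {
                return "KNOWN";
            }
        }
        return "UNKNOWN";
    }
}
```

Only sentences classified as known would need the second classification step, which is why this cheap keyword check runs first.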


The file desc/analysis_engine/PcsClassifierAnnotator.xml describes the smoking status classifier that uses libsvm. This annotator plays the same role as PcsBOWFeatureAnnotator.xml, PcsClassifierAnnotator.xml, and BOWFeatureRemovalAnnotator.xml, which use libsvm.

CaseSensitive <String/Single-valued/Required>
(Default Value = 'false') Specifies if a distinction between lower and upper case text will be considered.

(Default Value = 'file:ss/data/PCS/stopwords_PCS.txt')
Resource file that provides terms used as stop words, e.g. "a" "an" "the".

(Default Value = 'file:ss/data/PCS/keywords_PCS.txt')
Resource file that provides terms used as PCS key words, e.g. "refrain" "discussed" "to_quit" (a bigram is connected by an underscore, i.e. "_").

(Default Value = 'file:ss/data/PCS/pcs_libsvm-2.91.model')
Resource file that provides trained model for smoking status classification.
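To make the keyword convention concrete, the sketch below extracts unigram and underscore-joined bigram keyword hits from a sentence. It is a hedged illustration only: the tokenization, the way stop words interact with bigrams, and all names are assumptions, and the real feature extraction is in the PCS classifier classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical keyword feature extraction with underscore-joined bigrams. */
final class PcsFeatureSketch {
    /** Returns the keywords (unigrams and bigrams) found in the sentence, in order of appearance. */
    static List<String> features(String sentence, Set<String> stopWords, Set<String> keywords) {
        // Assumed tokenization: lower-case, split on anything that is not a letter or digit.
        List<String> tokens = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        List<String> found = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String unigram = tokens.get(i);
            // Stop words are never emitted as unigram features.
            if (!stopWords.contains(unigram) && keywords.contains(unigram)) {
                found.add(unigram);
            }
            // Adjacent tokens are joined by "_" to match bigram keywords such as "to_quit".
            if (i + 1 < tokens.size()) {
                String bigram = unigram + "_" + tokens.get(i + 1);
                if (keywords.contains(bigram)) {
                    found.add(bigram);
                }
            }
        }
        return found;
    }
}
```

In this sketch bigrams are built from the raw token stream, so a keyword like "to_quit" is still found even if "to" is listed as a stop word.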


The file desc/analysis_engine/ArtificialSentenceAnnotator.xml drives the java class edu.mayo.bmi.uima.core.ae.CopyAnnotator. It artificially creates a new SentenceAnnotation object by treating the entire document as a sentence. The offset values from the DocumentAnnotation object are transferred over to the new SentenceAnnotation object.

srcObjClass <String/Single-valued/Required>
(Default Value = 'false') Source JCas object class.
This must be an object that already exists in the JCas.

destObjClass <String/Single-valued/Required>
(Default Value = 'false') Destination JCas object class.
A new JCas object will be created.

dataBindMap <String/Multi-valued/Required>
(Default Value = 'false')
Binds data from source to destination.
Format for each entry is the getter method name of the source to the
setter method name of the destination. e.g. getMyValue|setMyValue
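The getter|setter format of dataBindMap lends itself to a reflection-based copy. The following is a minimal sketch of that idea, not the CopyAnnotator implementation; the Span class and the copy helper are hypothetical stand-ins for the JCas source and destination objects.

```java
import java.lang.reflect.Method;

/** Hypothetical reflection-based copy driven by "getter|setter" entries. */
final class DataBindSketch {
    /** Stand-in for a JCas annotation with begin/end offsets. */
    public static class Span {
        private int begin;
        private int end;

        public int getBegin() { return begin; }
        public void setBegin(int begin) { this.begin = begin; }
        public int getEnd() { return end; }
        public void setEnd(int end) { this.end = end; }
    }

    /** For each "getX|setY" entry, calls getX on the source and passes the result to setY on the destination. */
    static void copy(Object source, Object destination, String... dataBindMap) {
        for (String entry : dataBindMap) {
            String[] parts = entry.split("\\|");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected getter|setter but got: " + entry);
            }
            try {
                Method getter = source.getClass().getMethod(parts[0]);
                Object value = getter.invoke(source);
                // The setter is resolved by name and by the getter's return type.
                Method setter = destination.getClass().getMethod(parts[1], getter.getReturnType());
                setter.invoke(destination, value);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot bind " + entry, e);
            }
        }
    }
}
```

Calling DataBindSketch.copy(document, sentence, "getBegin|setBegin", "getEnd|setEnd") transfers both offsets, which corresponds to treating the whole document as one sentence.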


The file desc/analysis_engine/SmokingStatusDictionaryLookupAnnotator.xml drives the java class edu.mayo.bmi.uima.lookup.ae.DictionaryLookupAnnotator. It performs dictionary lookup and stores the hits as NamedEntityAnnotation objects.

(Default Value = 'file:ss/data/SmokingStatusLookupConfig.xml')
Defines which dictionaries will be used, the implementation specifics, and metaField configuration.

(Default Value = 'file:ss/data/smoker.dictionary')
Resource file that provides terms used as smoking words, e.g. '"smokes" "tobacco"'.

(Default Value = 'file:ss/data/nonsmoker.dictionary')
Resource file that provides terms used as non-smoking words, e.g. '"non-smoker"'.


The file desc/analysis_engine/NegationAnnotator.xml drives the java class edu.mayo.bmi.uima.context.ContextAnnotator. The boundary tokens were moved to an external resource, ss/data/context/boundaryData.txt.

(Default Value = 'file:ss/data/context/boundaryData.txt')
Resource file that provides terms used as sentence boundaries, e.g. '"nevertheless" "how" ";" "."'.

The parameters provided act the same way as those of the core's version of the 'NegationAnnotator', but since the boundary stop words are different for the smoking status pipeline, a separate implementation was necessary. However, the current release of 'NegationAnnotator' does not use this resource.

CAS consumers - RecordResolutionCasConsumer.xml

The CAS consumer provided in /desc/cas_consumer/RecordResolutionCasConsumer.xml drives the java class edu.mayo.bmi.smoking.cc.RecordResolutionCasConsumer, which iterates over all sentences (each CAS equals one sentence) for a record and resolves the final classification value for the record. Output is saved to a delimited file. Optionally, it also provides the overall patient-level classification based on the record-level classifications.

OutputFile <String/Single-valued/Required>
(Default Value = 'c:\temp\record_resolution.txt')
Specifies the location of the detail and summary report.

Delimiter <String/Single-valued/Required>
(Default Value = '|')
Specifies the delimiter for the output file.

ProcessingCDADocument <Boolean/Single-valued/Required>
(Default Value = 'false')
Specifies whether the processed files should be handled as CDA documents.

RunPatientLevelClassification <Boolean/Single-valued/Required>
(Default Value = 'false')
Specifies whether the post processing step of generating a summary patient level classification is done.

FinalClassificationOutputFile <String/Single-valued/Optional>
(Default Value = 'null')
Specifies name and location of the summary report file which holds the final patient level classifications.

The support vector machine (SVM) classification tool provided at /lib/libsvm-2.91.jar is used to train the smoking status model.

How to Create your own smoking status classifier model

  • Create sentence-level smoking status data with the format of: sentence|class_label (class_label: P, C, S).

He quit smoking three years ago.|P
She is smoking currently.|C
The patient has a history of tobacco use.|S

  • Run the script edu.mayo.bmi.smoking.MLutil.GenerateTrainingData.java on the sentence-level smoking status data to generate the libSVM training data.

In this script, the variable "dataFile" in main() must point to the sentence-level smoking status data. Set the other variables also if necessary. Users might create their own keywordFile that contains keywords used in smoking status classification (see GenerateTrainingData.java for details.)

  • Train a new model on the libSVM training data.

The command with our options used in the current model is:
java -classpath path_of_libsvm_jar_file svm_train -s 0 -t 1 -g 1 -r 1 -d 1 training_data_file new_model
Users might use their own customized libSVM options.

  • Save new_model in the resources/ss/data/PCS/ directory.
  • Change the "PathOfModel" resource in PcsClassifierAnnotator_libsvm.xml to point to "new_model".
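Before running GenerateTrainingData.java, it can help to check that every line of the sentence-level data follows the sentence|class_label format from the first step. The parser below is a minimal, hypothetical sketch for that check; the class name and its behavior on malformed lines are assumptions and are not part of the shipped tools.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser for one "sentence|class_label" training line. */
final class TrainingLineSketch {
    // Labels allowed in the sentence-level training data, per the format description.
    static final List<String> LABELS = Arrays.asList("P", "C", "S");

    final String sentence;
    final String label;

    private TrainingLineSketch(String sentence, String label) {
        this.sentence = sentence;
        this.label = label;
    }

    /** Splits on the last '|' so that a sentence containing '|' still parses. */
    static TrainingLineSketch parse(String line) {
        int bar = line.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Missing '|' delimiter: " + line);
        }
        String label = line.substring(bar + 1).trim();
        if (!LABELS.contains(label)) {
            throw new IllegalArgumentException("Unexpected class label '" + label + "' in: " + line);
        }
        return new TrainingLineSketch(line.substring(0, bar).trim(), label);
    }
}
```

For the line "He quit smoking three years ago.|P", parse returns the sentence text with the label P; a line with a label outside P, C, S is rejected with an IllegalArgumentException.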