Overview of Smoking status
The "smoking status" pipeline processes flat files or CDA (Clinical Document Architecture) documents and classifies patient records into five predetermined categories: past smoker (P), current smoker (C), smoker (S), nonsmoker (N), and unknown (U). Past and current smokers are distinguished by temporal expressions in the patient's medical records.
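The five categories and their one-letter codes can be modeled as a small enum. This is a minimal sketch for illustration only: the enum name SmokingCategory and the fromCode helper are hypothetical and are not part of the shipped pipeline, which keeps its permitted string values in edu.mayo.bmi.smoking.Const.

```java
// Hypothetical enum for illustration; not part of the shipped pipeline.
enum SmokingCategory {
    PAST_SMOKER('P'),
    CURRENT_SMOKER('C'),
    SMOKER('S'),
    NON_SMOKER('N'),
    UNKNOWN('U');

    private final char code;

    SmokingCategory(char code) {
        this.code = code;
    }

    char code() {
        return code;
    }

    // Maps a one-letter code (case-insensitive) to its category.
    static SmokingCategory fromCode(char c) {
        char upper = Character.toUpperCase(c);
        for (SmokingCategory category : values()) {
            if (category.code == upper) {
                return category;
            }
        }
        throw new IllegalArgumentException("Unknown smoking status code: " + c);
    }
}
```

For example, SmokingCategory.fromCode('P') yields PAST_SMOKER, and any letter outside P, C, S, N, U is rejected.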
Analysis engines (annotator)
SimulatedProdSmokingTAE.xml
The file desc/analysis_engine/SimulatedProdSmokingTAE.xml provides a working example of the smoking status pipeline, built from aggregate TAEs. This aggregate includes the Token, Sentence, SentenceAdjuster, and ClassifiableEntries annotators; ClassifiableEntries in turn invokes the ProductionPostSentenceAggregate annotators internally.
Shipped with this annotator:
- ExternalBaseAggregateTAE
- SentenceAdjuster
- ClassifiableEntriesAnnotator
SimulatedProdSmokingTAE_CDA.xml is also provided to process CDA documents. Its aggregate flow contains ExternalBaseAggregateTAE_CDA.xml, which processes the document as a Clinical Document Architecture (CDA) file.
ProductionPostSentenceAggregate_step1.xml
The file desc/analysis_engine/ProductionPostSentenceAggregate_step1.xml is the aggregate TAE used to run the first classification step via the KuRuleBasedClassifierAnnotator. It contains:
- TokenizerAnnotator (core project)
- KuRuleBasedClassifierAnnotator
This aggregate is not contained in the aggregate flow; it is introduced via the resource settings of the ClassifiableEntriesAnnotator (see the initialize() method of that class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep1, ResMgr, null) instantiates the AE, and CasCreationUtils.createCas(taeStep1.getAnalysisEngineMetaData()).getJCas() creates the JCas it runs on.
ProductionPostSentenceAggregate_step2_libsvm.xml
The file desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml is the Aggregate TAE used to run the second classification stage via the libSVM training module. Shipped with this annotator:
- PcsClassifierAnnotator_libsvm
- ArtificialSentenceAnnotator
- SentenceAdjuster
- SmokingStatusDictionaryLookupAnnotator
- NegationAnnotator
This aggregate is not contained in the aggregate flow; it is introduced via the resource settings of the ClassifiableEntriesAnnotator (see the initialize() method of that class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep2, ResMgr, null) instantiates the AE, and the process method of ClassifiableEntriesAnnotator runs it only if the smoking status is known.
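The control flow between the two steps can be sketched in plain Java. This is a simplified illustration under assumptions, not the shipped implementation: the class TwoStepFlowSketch and its classify method are hypothetical, the two steps are passed in as functions, and a sentence that step 1 does not recognize is assumed to stay UNKNOWN.

```java
import java.util.Locale;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of the two-step control flow; not the shipped code.
final class TwoStepFlowSketch {

    // step1IsKnown stands in for the known-vs-unknown classifier (step 1).
    // step2 stands in for the libSVM-based classifier (step 2).
    static String classify(String sentence,
                           Predicate<String> step1IsKnown,
                           Function<String, String> step2) {
        if (!step1IsKnown.test(sentence)) {
            return "UNKNOWN"; // step 2 is skipped for unknown sentences
        }
        return step2.apply(sentence);
    }

    public static void main(String[] args) {
        // Toy stand-ins for the two classifiers, for demonstration only.
        Predicate<String> step1 = s -> s.toLowerCase(Locale.ROOT).contains("smok");
        Function<String, String> step2 = s -> "SMOKER";

        System.out.println(classify("She is smoking currently.", step1, step2));
        System.out.println(classify("Blood pressure is normal.", step1, step2));
    }
}
```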
ExternalBaseAggregateTAE.xml
The file desc/analysis_engine/ExternalBaseAggregateTAE.xml provides an aggregate flow for the external annotators SimpleSegmentAnnotator, TokenizerAnnotator, SentenceDetectorAnnotator, and LvgAnnotator. Shipped with this annotator:
- SimpleSegmentAnnotator
- TokenizerAnnotator (core project)
- SentenceDetectorAnnotator (core project)
- LvgAnnotator (LVG project)
ExternalBaseAggregateTAE_CDA.xml is also provided to process CDA documents. Its aggregate flow contains the specialized class CdaCasInitializer (replacing the SimpleSegmentAnnotator used by the flat file/non-CDA version), which processes the document as a Clinical Document Architecture (CDA) file. This annotator is contained in the SimulatedProdSmokingTAE_CDA aggregate.
SentenceAdjuster.xml
The file desc/analysis_engine/SentenceAdjuster.xml drives the Java class edu.mayo.bmi.smoking.ae.SentenceAdjuster, an annotator that uses patterns, and rules about those patterns, to adjust certain annotations. This annotator was extended to handle sentence boundaries for the smoking status classification.
Example: "Tobacco: none" is split into two sentences by the original cTAKES sentence boundary detector. This annotator merges them into one sentence to enable correct negation detection.
Parameters
UseSegments <Boolean/Single-valued/Optional>
(Default Value = 'false') Flag whether to use segments or full doc text.
SegmentsToSkip <String/Multi-valued/Optional>
WordsToIgnore <String/Multi-valued/Optional>
(Default Value = 'null') Set of words that PostModifier should ignore (act as if the word was not there) when looking for a pattern match.
WordsInPattern <String/Multi-valued/Required>
(Default Value = 'no none never quit smoked ;') The list of words ("none", "no", etc) used in the pattern.
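The "Tobacco: none" merge can be sketched in plain Java. The rule used here is an assumption made for illustration (merge a sentence into the previous one when the previous sentence ends with ':' and the current one starts with a word from WordsInPattern); the actual patterns and rules in edu.mayo.bmi.smoking.ae.SentenceAdjuster may differ. The class SentenceMergeSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the real SentenceAdjuster rules may differ.
final class SentenceMergeSketch {

    // Default WordsInPattern values listed in the descriptor.
    static final Set<String> WORDS_IN_PATTERN =
            new HashSet<>(Arrays.asList("no", "none", "never", "quit", "smoked", ";"));

    // Assumed rule: append a sentence to the previous one when the previous
    // sentence ends with ':' and the current one starts with a pattern word.
    static List<String> merge(List<String> sentences) {
        List<String> merged = new ArrayList<>();
        for (String sentence : sentences) {
            String current = sentence.trim();
            if (current.isEmpty()) {
                continue;
            }
            String firstWord = current.split("\\s+")[0].toLowerCase(Locale.ROOT);
            int last = merged.size() - 1;
            boolean previousEndsWithColon = last >= 0 && merged.get(last).endsWith(":");
            if (previousEndsWithColon && WORDS_IN_PATTERN.contains(firstWord)) {
                merged.set(last, merged.get(last) + " " + current);
            } else {
                merged.add(current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("Tobacco:", "none")));
    }
}
```

With the two detected sentences "Tobacco:" and "none" as input, the sketch returns the single sentence "Tobacco: none", which a negation detector can then treat as one unit.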
ClassifiableEntriesAnnotator.xml
The file desc/analysis_engine/ClassifiableEntriesAnnotator.xml drives the java class edu.mayo.bmi.smoking.ae.ClassifiableEntries. Converts Sentences to ClassifiableEntries (required by SmokingStatus pipeline) and ultimately to RecordSentence.
Parameters
TruthFile <String/Single-valued/Optional>
(Default Value = 'null') Delimited Truth file. Delimiter is expected to be the TAB char. If not specified, then the classification feature of the RecordSentence object will not be set.
AllowedClassifications <String/Multi-valued/Optional>
(Default Value = '"SMOKER" "CURRENT_SMOKER" "NON_SMOKER" "PAST_SMOKER" "UNKNOWN"') See edu.mayo.bmi.smoking.Const.java for permitted string values.
SectionsToIgnore <String/Multi-valued/Optional>
(Default Value = '"20109" "20138"') Sections to ignore for ClassifiableEntries, for example Family History (20109). The smoking status of other people mentioned in a record could be mistaken for the patient's own status, so sections such as family history can be excluded.
ConWordsFile <String/Single-valued/Optional>
(Default Value = '$main_root/resources/ss/data/context/negationContradictionWords.txt')
Contradiction words list. If one of these words appears in a sentence, the sentence is not negated.
Resources
UimaDescriptorStep1
(Default Value = '$main_root/desc/analysis_engine/ProductionPostSentenceAggregate_step1.xml')
Annotator responsible for the first classification step, namely KuRuleBasedClassifierAnnotator.
UimaDescriptorStep2
(Default Value = '$main_root/desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml')
Annotator responsible for second classification step.
UimaDescriptorStep1 and UimaDescriptorStep2 are introduced as resources by the ClassifiableEntriesAnnotator during its initialization step. This allows the specified aggregates to be instantiated and their analysis to be handled on a separate, asynchronous thread. Overall performance improves because the output of the ProductionPostSentenceAggregates is prepared for the process method without requiring a synchronized data flow (i.e. an explicit aggregate flow in the component descriptor).
KuRuleBasedClassifierAnnotator.xml
The file desc/analysis_engine/KuRuleBasedClassifierAnnotator.xml drives the Java class edu.mayo.bmi.smoking.ae.KuRuleBasedClassifierAnnotator, a known-vs-unknown classifier based on smoking-related keywords.
Parameters
CaseSensitive <String/Single-valued/Required>
(Default Value = 'false') Specifies if a distinction between lower and upper case text will be considered.
classAttribute <String/Single-valued/Required>
(Default Value = 'smoking_status') Value used by the NominalAttributeValue via setAttributeName.
SmokingWordsFile <String/Single-valued/Required>
(Default Value = 'ss/data/KU/keywords.txt') Smoking related keywords to identify "known" class.
UnknownWordsFile <String/Single-valued/Required>
(Default Value = 'ss/data/KU/unknown_words.txt') If this word/phrase appears, treat the sentence as UNKNOWN.
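A keyword-based known-vs-unknown decision along these lines can be sketched as follows. This is an illustration under assumptions, not the logic of KuRuleBasedClassifierAnnotator itself: the class KnownUnknownSketch is hypothetical, matching is done by simple substring search, unknown words are assumed to take precedence over smoking keywords, and the word lists in main are made-up samples rather than the contents of keywords.txt or unknown_words.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a keyword-based known-vs-unknown classifier.
final class KnownUnknownSketch {
    private final List<String> smokingWords;
    private final List<String> unknownWords;
    private final boolean caseSensitive;

    KnownUnknownSketch(List<String> smokingWords, List<String> unknownWords, boolean caseSensitive) {
        this.smokingWords = new ArrayList<>(smokingWords);
        this.unknownWords = new ArrayList<>(unknownWords);
        this.caseSensitive = caseSensitive;
    }

    // Lower-cases text unless the CaseSensitive flag is set.
    private String normalize(String text) {
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }

    private boolean containsAny(String sentence, List<String> phrases) {
        String text = normalize(sentence);
        for (String phrase : phrases) {
            if (text.contains(normalize(phrase))) {
                return true;
            }
        }
        return false;
    }

    // Assumption: an unknown word overrides any smoking keyword in the sentence.
    String classify(String sentence) {
        if (containsAny(sentence, unknownWords)) {
            return "UNKNOWN";
        }
        return containsAny(sentence, smokingWords) ? "KNOWN" : "UNKNOWN";
    }

    public static void main(String[] args) {
        // Sample word lists invented for this demonstration.
        KnownUnknownSketch classifier = new KnownUnknownSketch(
                Arrays.asList("smok", "tobacco"), Arrays.asList("smoke detector"), false);
        System.out.println(classifier.classify("He quit smoking three years ago."));
        System.out.println(classifier.classify("Smoke detector installed at home."));
    }
}
```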
PcsClassifierAnnotator_libsvm.xml
The file desc/analysis_engine/PcsClassifierAnnotator_libsvm.xml provides the smoking status classifier using libsvm. This annotator plays the same role as PcsBOWFeatureAnnotator.xml, PcsClassifierAnnotator.xml, and BOWFeatureRemovalAnnotator.xml, which use libsvm.
Parameters
CaseSensitive <String/Single-valued/Required>
(Default Value = 'false') Specifies if a distinction between lower and upper case text will be considered.
Resources
StopWordsFile
(Default Value = 'file:ss/data/PCS/stopwords_PCS.txt')
Resource file that provides terms used as stop words, e.g. "a" "an" "the".
PCSKeyWordFile
(Default Value = 'file:ss/data/PCS/keywords_PCS.txt')
Resource file that provides terms used as PCS key words, e.g. "refrain" "discussed" "to_quit" (a bigram is connected by an underscore, i.e. "_").
PathOfModel
(Default Value = 'file:ss/data/PCS/pcs_libsvm-2.91.model')
Resource file that provides trained model for smoking status classification.
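How stop words and underscore-joined bigram keywords might turn a sentence into features can be sketched as below. Everything about the procedure is an assumption made for illustration: the class PcsFeatureSketch is hypothetical, tokens are split on non-alphanumeric characters, stop words are dropped from unigrams only, bigrams are built from adjacent raw tokens, and the result is a binary vector aligned with the keyword list. The shipped annotator may build its features differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of keyword features with underscore-joined bigrams.
final class PcsFeatureSketch {

    // Returns 1 for each keyword (unigram or bigram) found in the sentence, else 0.
    static int[] features(String sentence, Set<String> stopWords, List<String> keywords) {
        Set<String> present = new HashSet<>();
        String previous = null;
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (!stopWords.contains(token)) {
                present.add(token); // unigram, unless it is a stop word
            }
            if (previous != null) {
                present.add(previous + "_" + token); // bigram joined by "_"
            }
            previous = token;
        }
        int[] vector = new int[keywords.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = present.contains(keywords.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the"));
        List<String> keywords = Arrays.asList("refrain", "discussed", "to_quit");
        int[] vector = features("He was advised to quit and refrain from the habit.",
                stopWords, keywords);
        System.out.println(Arrays.toString(vector));
    }
}
```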
ArtificialSentenceAnnotator.xml
The file desc/analysis_engine/ArtificialSentenceAnnotator.xml drives the java class edu.mayo.bmi.uima.core.ae.CopyAnnotator. Artificially creates a new SentenceAnnotation object by treating the entire document as a sentence. The offset values from the DocumentAnnotation object are transferred over to the new SentenceAnnotation object.
Parameters
srcObjClass <String/Single-valued/Required>
(Default Value = 'false') Source JCas object class.
This must be an object that already exists in the JCas.
destObjClass <String/Single-valued/Required>
(Default Value = 'false') Destination JCas object class.
A new JCas object will be created.
dataBindMap <String/Multi-valued/Required>
(Default Value = 'false')
Binds data from source to destination.
Format for each entry is the getter method name of the source, a pipe character, and the setter method name of the destination, e.g. getMyValue|setMyValue
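The getter|setter binding format can be illustrated with Java reflection. This is a minimal sketch, not the code of edu.mayo.bmi.uima.core.ae.CopyAnnotator: the class DataBindSketch and the nested Source and Destination classes are hypothetical stand-ins for the JCas objects named by srcObjClass and destObjClass.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of applying "getter|setter" bindings via reflection.
final class DataBindSketch {

    // Stand-in for the source object, e.g. a document-level annotation.
    public static class Source {
        private final int begin;
        private final int end;
        public Source(int begin, int end) { this.begin = begin; this.end = end; }
        public int getBegin() { return begin; }
        public int getEnd() { return end; }
    }

    // Stand-in for the newly created destination object.
    public static class Destination {
        private int begin;
        private int end;
        public void setBegin(int begin) { this.begin = begin; }
        public void setEnd(int end) { this.end = end; }
        public int getBegin() { return begin; }
        public int getEnd() { return end; }
    }

    // For each entry, calls the getter on src and passes the value to the setter on dest.
    static void bind(Object src, Object dest, String[] dataBindMap) {
        for (String entry : dataBindMap) {
            String[] names = entry.split("\\|");
            if (names.length != 2) {
                throw new IllegalArgumentException("Expected getter|setter: " + entry);
            }
            try {
                Object value = src.getClass().getMethod(names[0]).invoke(src);
                findSetter(dest.getClass(), names[1]).invoke(dest, value);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot bind " + entry, e);
            }
        }
    }

    // Finds a public one-argument method with the given name.
    private static Method findSetter(Class<?> type, String name) throws NoSuchMethodException {
        for (Method method : type.getMethods()) {
            if (method.getName().equals(name) && method.getParameterCount() == 1) {
                return method;
            }
        }
        throw new NoSuchMethodException(name);
    }

    public static void main(String[] args) {
        Destination sentence = new Destination();
        bind(new Source(5, 42), sentence, new String[] {"getBegin|setBegin", "getEnd|setEnd"});
        System.out.println(sentence.getBegin() + " " + sentence.getEnd());
    }
}
```

Binding getBegin|setBegin and getEnd|setEnd in this way mirrors how the offsets of a document-level object could be transferred to a new sentence-level object.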
SmokingStatusDictionaryLookupAnnotator.xml
The file desc/analysis_engine/SmokingStatusDictionaryLookupAnnotator.xml drives the java class edu.mayo.bmi.uima.lookup.ae.DictionaryLookupAnnotator. Performs dictionary lookup and stores the hits as NamedEntityAnnotation objects.
Resources
LookupDescriptor
(Default Value = 'file:ss/data/SmokingStatusLookupConfig.xml')
Defines which dictionaries will be used, the implementation specifics, and metaField configuration.
SmokerDictionary
(Default Value = 'file:ss/data/smoker.dictionary')
Resource file that provides terms used as smoking words, e.g. '"smokes" "tobacco"'.
NonSmokerDictionary
(Default Value = 'file:ss/data/nonsmoker.dictionary')
Resource file that provides terms used as non-smoking words, e.g. '"non-smoker"'.
NegationAnnotator.xml
The file desc/analysis_engine/NegationAnnotator.xml drives the Java class edu.mayo.bmi.uima.context.ContextAnnotator. Boundary tokens have been moved to the external resource ss/data/context/boundaryData.txt.
Resources
BoundaryData
(Default Value = 'file:ss/data/context/boundaryData.txt')
Resource file that provides terms used as sentence boundaries, e.g. '"nevertheless" "how" ";" "."'.
The parameters provided act the same way as those of the core version of the 'NegationAnnotator'. Because the boundary stop words are different for the smoking status pipeline, a separate implementation was necessary. However, the current release of 'NegationAnnotator' does not use this resource.
CAS consumers - RecordResolutionCasConsumer.xml
The CAS consumer provided in /desc/cas_consumer/RecordResolutionCasConsumer.xml drives the Java class edu.mayo.bmi.smoking.cc.RecordResolutionCasConsumer, which iterates over all sentences (each CAS equals one sentence) for a record and resolves the final classification value for the record. Output is saved to a delimited file. Optionally, it also provides the overall patient-level classification based on the record-level classifications.
Parameters
OutputFile <String/Single-valued/Required>
(Default Value = 'c:\temp\record_resolution.txt')
Specifies the location of the detail and summary report.
Delimiter <String/Single-valued/Required>
(Default Value = '|')
Specifies the delimiter for the output file.
ProcessingCDADocument <Boolean/Single-valued/Required>
(Default Value = 'false')
Specifies whether the processed files should be handled as CDA documents.
RunPatientLevelClassification <Boolean/Single-valued/Required>
(Default Value = 'false')
Specifies whether the post processing step of generating a summary patient level classification is done.
FinalClassificationOutputFile <String/Single-valued/Optional>
(Default Value = 'null')
Specifies name and location of the summary report file which holds the final patient level classifications.
Resources
libsvm-2.91.jar
The support vector machine (SVM) classification tool provided at /lib/libsvm-2.91.jar is used to train the smoking status model.
How to Create your own smoking status classifier model
- Create sentence-level smoking status data with the format of: sentence|class_label (class_label: P, C, S).
He quit smoking three years ago.|P
She is smoking currently.|C
The patient has a history of tobacco use.|S
- Run the script edu.mayo.bmi.smoking.MLutil.GenerateTrainingData.java on the sentence-level smoking status data to generate the libSVM training data.
In this script, the variable "dataFile" in main() must point to the sentence-level smoking status data. Set the other variables as needed. Users may also create their own keywordFile containing the keywords used in smoking status classification (see GenerateTrainingData.java for details).
- Train a new model on the libSVM training data.
The command, with the options used for the current model, is:
java -classpath path_of_libsvm_jar_file svm_train -s 0 -t 1 -g 1 -r 1 -d 1 training_data_file new_model
Users might use their own customized libSVM options.
- Save new_model in resources/ss/data/PCS/.
- Change the "PathOfModel" resource in PcsClassifierAnnotator_libsvm.xml to point to "new_model".
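Before running GenerateTrainingData.java, it can help to check that every line of the sentence-level data follows the sentence|class_label format from the first step. The following is a minimal sketch of such a check; the class TrainingLineSketch is hypothetical and is not part of edu.mayo.bmi.smoking.MLutil. Splitting at the last '|' is an assumption that lets a sentence itself contain the delimiter.

```java
// Hypothetical validator for "sentence|class_label" lines (labels P, C, S).
final class TrainingLineSketch {

    // Returns {sentence, label}; throws if the line is malformed.
    static String[] parse(String line) {
        String trimmed = line.trim();
        int bar = trimmed.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Missing '|' delimiter: " + line);
        }
        String sentence = trimmed.substring(0, bar).trim();
        String label = trimmed.substring(bar + 1).trim();
        if (sentence.isEmpty()) {
            throw new IllegalArgumentException("Empty sentence: " + line);
        }
        if (!label.equals("P") && !label.equals("C") && !label.equals("S")) {
            throw new IllegalArgumentException("Label must be P, C, or S: " + line);
        }
        return new String[] {sentence, label};
    }

    public static void main(String[] args) {
        String[] lines = {
            "He quit smoking three years ago.|P",
            "She is smoking currently.|C",
            "The patient has a history of tobacco use.|S"
        };
        for (String line : lines) {
            String[] parsed = parse(line);
            System.out.println(parsed[1] + "\t" + parsed[0]);
        }
    }
}
```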