Introduction

YTEX provides a generalizable framework for the computation of path finding, corpus & intrinsic information content based semantic similarity measures from any domain ontology. This page describes the usage and configuration of the YTEX Semantic Similarity Tools. For a high-level overview, refer to our paper: Semantic similarity in the biomedical domain: an evaluation across knowledge sources.

Semantic similarity measures include path finding measures based purely on path distances, and information-content based measures based on taxonomic relationships and information content (IC) of concepts, a measure of concept frequency. Semantic similarity measures utilize a concept graph where vertices represent concepts and edges represent taxonomical relationships. The similarity between concepts is computed from the length of the path between concepts and their nearest common ‘parent’. Previous studies that took advantage of a large, annotated medical corpus to estimate concept frequencies showed that IC based measures of semantic similarity outperform path finding measures. Unfortunately, large annotated corpora are not typically available for many applications. To overcome this limitation, methods that estimate IC from the structure of the concept graph have been developed and their accuracy shown to rival that of corpus-based measures.
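The difference between a path finding measure and an IC based measure can be illustrated on a toy taxonomy. The sketch below is not the YTEX implementation: the concept names are hypothetical, the path measure is the textbook 1/(number of nodes on the path), the Lin measure is the textbook 2*IC(lcs)/(IC(c1)+IC(c2)), and the intrinsic IC uses the simple descendant-count formula of Seco et al. as a stand-in for whichever intrinsic IC formula YTEX applies.

```python
import math

# Toy taxonomy (hypothetical concepts, not taken from the UMLS): child -> parent
PARENT = {
    "heart disease": "disease",
    "lung disease": "disease",
    "myocarditis": "heart disease",
    "endocarditis": "heart disease",
    "pneumonia": "lung disease",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(c1, c2):
    """Nearest common ancestor (least common subsumer) in this tree."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)

def path_similarity(c1, c2):
    """Path finding measure: 1 / (number of nodes on the path through the LCS)."""
    common = lcs(c1, c2)
    edges = ancestors(c1).index(common) + ancestors(c2).index(common)
    return 1.0 / (edges + 1)

def intrinsic_ic(concept):
    """Seco-style intrinsic IC: 1 - log(descendants + 1) / log(total nodes)."""
    nodes = set(PARENT) | set(PARENT.values())
    descendants = sum(1 for n in nodes if n != concept and concept in ancestors(n))
    return 1.0 - math.log(descendants + 1) / math.log(len(nodes))

def lin_similarity(c1, c2):
    """Lin measure computed from intrinsic IC instead of corpus frequencies."""
    if c1 == c2:
        return 1.0
    denominator = intrinsic_ic(c1) + intrinsic_ic(c2)
    return 2.0 * intrinsic_ic(lcs(c1, c2)) / denominator if denominator else 0.0

if __name__ == "__main__":
    for pair in [("myocarditis", "endocarditis"), ("myocarditis", "pneumonia")]:
        print(pair, path_similarity(*pair), lin_similarity(*pair))
```

In this toy example both measures rank the two heart conditions as more similar than a heart condition and a lung condition, but the Lin measure drops to zero when the only shared ancestor is the root, because the root carries no information.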

Usage

YTEX provides a web application client, web services interface, RESTful interface, and command-line interface to compute similarity measures. The demo similarity web app is available at http://informatics.med.yale.edu/ytex.web; if you plan to use this application extensively, please install YTEX locally. Please refer to Sanchez & Batet for an excellent overview of similarity measures in general, and intrinsic information content (IC) based measures in particular. We scale all measures to the unit interval; see YTEX Semantic Similarity Measures for details.

YTEX allows the declarative definition of concept graphs in which nodes represent concepts and edges represent taxonomical relationships, and can compute the similarity between nodes in these graphs. YTEX comes with the following concept graphs derived from the UMLS (version 2013AA):

sct-rxnorm: concepts from SNOMED-CT and RXNORM. This is the default.
sct-msh-csp-aod: concepts from the SNOMED-CT, MeSH, CRISP, and Alcohol and Drug thesaurus
umls: concepts from all restriction free (level 0) UMLS source vocabularies and SNOMED-CT

You can configure additional concept graphs (see below).

Note that you must perform the additional YTEX installation tasks to use this component. You must install the UMLS if you want to create your own concept graphs.

Similarity Web App

The similarity web app allows you to:

select a concept graph against which measures should be computed
specify concept pair(s)
specify measures

The similarity web application has two pages:

Similarity Single

Compute similarities for a single concept pair. In addition to the similarity values, this page outputs the path between concepts. You can enter the text of the concept, and the application will attempt to find the corresponding concept id (CUI). Alternatively, you can simply enter the concept id.

Similarity Multiple

Compute the similarity between multiple pairs of concepts. Enter each concept pair on a separate line, and separate the two concepts by a comma or whitespace. The output can be exported to a CSV file or Excel spreadsheet.

Similarity Web/RESTful Services

As with the web application, you can specify the concept graph, concept pairs, and measures for which similarities should be computed. Both methods accept a list of measures; these are:

Path-Finding Measures

WUPALMER: Wu & Palmer
LCH: Leacock & Chodorow
PATH: Path
RADA: Rada

Corpus IC Based Measures

LIN: Lin

Intrinsic IC Based Measures

INTRINSIC_LIN: Intrinsic IC based Lin
INTRINSIC_LCH: Intrinsic IC based Leacock & Chodorow
INTRINSIC_PATH: Intrinsic IC based Path, identical to Jiang & Conrath
INTRINSIC_RADA: Intrinsic IC based Rada
JACCARD: Intrinsic IC based Jaccard
SOKAL: Intrinsic IC based Sokal & Sneath

RESTful interface

To get the similarity between a pair of concepts using the concept graph umls, and the LCH and Intrinsic LCH measures: http://informatics.med.yale.edu/ytex.web/services/rest/similarity?conceptGraph=umls&concept1=C0018787&concept2=C0024109&metrics=LCH,INTRINSIC_LCH&lcs=true

The parameters are:

concept1/concept2 the concept ids
metrics comma-separated list of metrics
conceptGraph (optional) concept graph to use; if not specified will use the default
lcs (optional) set to true to get the paths through the Least Common Subsumer.

The service returns XML with a list of similarities corresponding to the list of metrics. See the WSDL for the corresponding web service for the schema.
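A request URL can be assembled from the parameters above with a few lines of client code. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the response is returned as raw XML text because the schema is defined by the WSDL rather than by this page. The UTF-8 decoding is also an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Demo server from this page; replace with your local installation.
BASE_URL = "http://informatics.med.yale.edu/ytex.web/services/rest"

def similarity_url(concept1, concept2, metrics, concept_graph=None,
                   lcs=False, base_url=BASE_URL):
    """Build the RESTful similarity URL for one concept pair."""
    params = []
    if concept_graph:  # optional: the server uses its default graph otherwise
        params.append(("conceptGraph", concept_graph))
    params.append(("concept1", concept1))
    params.append(("concept2", concept2))
    params.append(("metrics", ",".join(metrics)))  # comma-separated list
    if lcs:  # optional: also return paths through the least common subsumer
        params.append(("lcs", "true"))
    # safe="," keeps the metric separator readable instead of encoding it as %2C
    return base_url + "/similarity?" + urlencode(params, safe=",")

def fetch_similarity_xml(url, timeout=30):
    """Fetch the response body as text (assumes the server replies in UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    url = similarity_url("C0018787", "C0024109", ["LCH", "INTRINSIC_LCH"],
                         concept_graph="umls", lcs=True)
    print(url)
```

Calling fetch_similarity_xml(url) then retrieves the XML document, which can be parsed once the element names have been checked against the WSDL.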

To get a list of concept graphs: http://informatics.med.yale.edu/ytex.web/services/rest/getConceptGraphs

To get the 'default' concept graph: http://informatics.med.yale.edu/ytex.web/services/rest/getDefaultConceptGraph

Web Services interface

The Web Services interface is analogous to the RESTful interface, but allows the computation of similarities for multiple concept pairs. See http://informatics.med.yale.edu/ytex.web/services/conceptSimilarityWebService?wsdl

Command-Line Interface

The ConceptSimilarityServiceImpl Java program accepts a list of concept pairs and outputs their similarities in a tab-delimited format. It accepts the following arguments:

Java Command Line Arguments:

-Dytex.conceptGraphName: to override the default concept graph name (defined in ytex.properties, by default sct-rxnorm). e.g. -Dytex.conceptGraphName=umls to use the umls concept graph.
-Xmx<memory>: set the Java heap size. The amount of memory needed depends on the concept graph. For the sct-rxnorm graph, 256 MB is sufficient (-Xmx256m); the larger umls concept graph requires 1200 MB (-Xmx1200m).

ConceptSimilarityServiceImpl Command Line Arguments:

-metrics: required, comma-separated list of metrics (see above for valid values)
-out: optional file to send output to. If not specified, output is sent to standard out.
-lcs: should the least common subsumer and paths be output for each concept pair?
-concepts: a list of concept pairs, or a file with concept pairs. For a file place each concept pair on a separate line, separate concepts by whitespace or commas. For a list of concept pairs, separate each concept by a comma, each pair by a semicolon:

cd CTAKES_HOME
bin\setenv.bat
java -cp %CLASSPATH% -Dlog4j.configuration=file:/%CTAKES_HOME%/config/log4j.xml -Xmx256m org.apache.ctakes.ytex.kernel.metric.ConceptSimilarityServiceImpl -concepts C0018787,C0024109;C0034069,C0242379 -metrics LCH,INTRINSIC_LCH
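When the concept pairs come from another program, the two input forms accepted by -concepts can be generated rather than typed. The helper names below are hypothetical; the sketch only formats text according to the rules stated above and does not call YTEX.

```python
def format_concept_pairs(pairs):
    """Inline form: concepts separated by a comma, pairs by a semicolon."""
    return ";".join(f"{c1},{c2}" for c1, c2 in pairs)

def concept_pair_file_text(pairs):
    """File form: one concept pair per line, concepts separated by a comma."""
    return "".join(f"{c1},{c2}\n" for c1, c2 in pairs)

if __name__ == "__main__":
    pairs = [("C0018787", "C0024109"), ("C0034069", "C0242379")]
    print(format_concept_pairs(pairs))
    # Write the file form to disk and pass its path to -concepts instead.
    with open("pairs.txt", "w", encoding="utf-8") as handle:
        handle.write(concept_pair_file_text(pairs))
```

On Unix-like shells the semicolon is a command separator, so the inline form must be quoted there; the file form avoids the issue.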

Web Interface

To start the Semantic Similarity Web Application, run CTAKES_HOME\bin\ytexweb.bat (Windows) or CTAKES_HOME/bin/ytexweb.sh (Linux) and open http://localhost:8080/semanticSim.jsf.

Configuration

Creating a Concept Graph

To create a concept graph, you create a properties file that contains a query that retrieves all the edges from a taxonomy. The ConceptDaoImpl does the following:

executes this query
builds a concept graph
removes edges that induce cycles
computes the depth and intrinsic information content of each node in the graph
writes the concept graph to the file system

Computing the intrinsic IC is very memory intensive - give this task all the memory that you have for large concept graphs. Computing the intrinsic IC for the entire UMLS takes 1.5 hours with an 8GB java heap.
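The graph-building steps listed above can be sketched in a few lines. This is an illustration only, not the ConceptDaoImpl code: it assumes each edge is a (parent, child) tuple, drops an edge when adding it would close a cycle given the edges already accepted, and defines depth as the shortest distance from a root. How YTEX orients the query columns, which edge of a cycle it removes, and how it defines depth are not stated on this page.

```python
from collections import defaultdict, deque

def _reachable(children, start, target):
    """True if target can be reached from start by following child links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return False

def build_concept_graph(edges):
    """Add (parent, child) edges one by one, skipping any that induce a cycle."""
    children = defaultdict(set)
    removed = []
    for parent, child in edges:
        # The edge closes a cycle if the parent is already below the child.
        if parent == child or _reachable(children, child, parent):
            removed.append((parent, child))
        else:
            children[parent].add(child)
    return children, removed

def compute_depths(children):
    """Breadth-first depth of every node, with roots at depth 0 (assumption)."""
    with_parent = {c for kids in children.values() for c in kids}
    roots = set(children) - with_parent
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

if __name__ == "__main__":
    graph, dropped = build_concept_graph(
        [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")])
    print(dropped, compute_depths(graph))
```

The reachability check runs once per edge, which is fine for a sketch but would be slow on a graph the size of the UMLS.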

As an example, here is what you do to create a concept graph with just the SNOMED-CT vocabulary from the UMLS:

1) Create a properties file that defines the required parameters

This file must be located in CTAKES_HOME/resources/org/apache/ctakes/ytex/conceptGraph/<concept graph name>.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <entry key="ytex.conceptGraphQuery"><![CDATA[
        select distinct cui1, cui2 
        from umls.MRREL 
        where sab in ('SNOMEDCT')
        and rel in ('PAR')
        order by cui1, cui2
        ]]></entry>
</properties>

2) Run the ConceptDaoImpl: (modify the memory option -Xmx1g to as much memory as you can spare)

cd CTAKES_HOME
bin\setenv.bat
java -cp %CLASSPATH% -Dlog4j.configuration=file:/%CTAKES_HOME%/config/log4j.xml -Xmx1g org.apache.ctakes.ytex.kernel.dao.ConceptDaoImpl -name <concept graph name>

You will get warnings about removing cycles. The concept graph will be stored in the CTAKES_HOME/resources/org/apache/ctakes/ytex/conceptGraph directory.

Corpus Information Content

We compute the intrinsic information content (intrinsic IC) when creating the concept graph. The InfoContentEvaluatorImpl class computes the corpus information content (corpus IC) for a given concept graph and corpus. This class takes as input a properties file that contains a query used to retrieve concept frequencies from the database; it then computes the information content of each node in the concept graph; finally it stores this in the feature_eval and feature_rank ytex database tables.
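For orientation, corpus IC in the style of Resnik can be computed from concept frequencies as follows. This is a sketch under stated assumptions, not the InfoContentEvaluatorImpl code: the graph is a hypothetical mapping from parent to children, a concept's count includes the counts of all its descendants (each counted once), and IC is the negative log of the concept's share of the total.

```python
import math

def corpus_ic(children, freq):
    """Return {concept: IC} for concepts with a non-zero propagated count.

    children: parent -> set of child concepts (acyclic graph)
    freq: concept -> raw frequency observed in the corpus
    """
    nodes = set(children) | {c for kids in children.values() for c in kids}

    def subtree(node):
        """The node plus all of its descendants, each listed once."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(children.get(current, ()))
        return seen

    total = sum(freq.get(n, 0) for n in nodes)
    ic = {}
    for node in nodes:
        count = sum(freq.get(n, 0) for n in subtree(node))
        if count > 0:  # IC is undefined for concepts that never occur
            ic[node] = -math.log(count / total)
    return ic

if __name__ == "__main__":
    graph = {"disease": {"heart disease", "lung disease"},
             "heart disease": {"myocarditis"}}
    counts = {"myocarditis": 1, "lung disease": 3}
    print(corpus_ic(graph, counts))
```

Frequent concepts and concepts near the root receive a low IC, while rare, specific concepts receive a high IC.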

Concept frequencies may come from the YTEX annotation tables, but can come from any database table. For example, to compute the corpus IC using all the concepts from all annotated documents in the YTEX database, we would create the following properties file, corpusIC.props.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <!-- the query to retrieve concept counts -->
        <entry key="ytex.freqQuery"><![CDATA[
        select code, count(*)
        from anno_ontology_concept o
        inner join anno_base b on b.anno_base_id = o.anno_base_id
        inner join document d on d.document_id = b.document_id
        group by code
        ]]></entry>
        <!-- corpusName is a required property -->
        <entry key="ytex.corpusName">wsd</entry>
        <!-- the name of the concept graph -->
        <entry key="ytex.conceptGraphName">umls</entry>
</properties>

To compute corpus information content, run the InfoContentEvaluatorImpl:

cd CTAKES_HOME
bin\setenv.bat
java -cp %CLASSPATH% -Dlog4j.configuration=file:/%CTAKES_HOME%/config/log4j.xml -Xmx1g org.apache.ctakes.ytex.kernel.InfoContentEvaluatorImpl -prop=corpusIC.props.xml > ic.out 2>&1

To use the corpus IC with the Lin similarity measure, specify the corpus name in CTAKES_HOME/resources/org/apache/ctakes/ytex/ytex.properties, e.g.:

ytex.conceptGraphName=umls
ytex.corpusName=wsd

cTAKES 3.1.2 - Semantic Similarity