YTEX provides a generalizable framework for computing path-finding, corpus information content (IC) based, and intrinsic IC based semantic similarity measures from any domain ontology. This page describes the usage and configuration of the YTEX Semantic Similarity Tools. For a high-level overview, refer to our paper: Semantic similarity in the biomedical domain: an evaluation across knowledge sources.
Semantic similarity measures include path finding measures based purely on path distances, and information-content based measures based on taxonomic relationships and information content (IC) of concepts, a measure of concept frequency. Semantic similarity measures utilize a concept graph where vertices represent concepts and edges represent taxonomical relationships. The similarity between concepts is computed from the length of the path between concepts and their nearest common ‘parent’. Previous studies that took advantage of a large, annotated medical corpus to estimate concept frequencies showed that IC based measures of semantic similarity outperform path finding measures. Unfortunately, large annotated corpora are not typically available for many applications. To overcome this limitation, methods that estimate IC from the structure of the concept graph have been developed and their accuracy shown to rival that of corpus-based measures.
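To make the path-based idea concrete, here is a minimal sketch on a toy taxonomy. The concept names, the `PARENT` map, and the exact formulas (path similarity as the reciprocal of the number of nodes on the shortest path, and the usual Wu & Palmer ratio with the root at depth 1) are assumptions chosen for illustration; they are not taken from the YTEX source, and YTEX additionally scales its measures to the unit interval.

```python
# Toy taxonomy (hypothetical): each concept maps to its single parent.
PARENT = {
    "disease": "entity",
    "drug": "entity",
    "heart_disease": "disease",
    "lung_disease": "disease",
}

def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lcs(c1, c2):
    """Nearest common 'parent' (least common subsumer) of two concepts."""
    ancestors1 = set(path_to_root(c1))
    # Walking upward from c2, the first shared node is the deepest one.
    for node in path_to_root(c2):
        if node in ancestors1:
            return node
    return None

def path_similarity(c1, c2):
    """1 / (number of nodes on the shortest path through the LCS)."""
    common = lcs(c1, c2)
    edges = path_to_root(c1).index(common) + path_to_root(c2).index(common)
    return 1.0 / (edges + 1)

def wu_palmer(c1, c2):
    """2 * depth(lcs) / (depth(c1) + depth(c2)), with the root at depth 1."""
    def depth(c):
        return len(path_to_root(c))
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

In this toy tree, `heart_disease` and `lung_disease` share the parent `disease`, so they score higher than `heart_disease` and `drug`, whose only common ancestor is the root.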
YTEX provides a web application client, web services interface, RESTful interface, and command-line interface to compute similarity measures. The demo similarity web app is available under http://informatics.med.yale.edu/ytex.web; if you plan to use this application extensively, please install ytex locally. Please refer to Sanchez & Batet for an excellent overview of similarity measures in general, and intrinsic information content (IC) based measures in particular. We scale all measures to the unit interval; see YTEX Semantic Similarity Measures for details.
YTEX allows the declarative definition of concept graphs in which nodes represent concepts and edges represent taxonomical relationships, and can compute the similarity between nodes in these graphs. YTEX comes with the following concept graphs derived from the UMLS (version 2013AA):
- sct-rxnorm: concepts from SNOMED-CT and RXNORM. This is the default.
- sct-msh-csp-aod: concepts from SNOMED-CT, MeSH, CRISP, and the Alcohol and Other Drug Thesaurus (AOD)
- umls: concepts from all restriction free (level 0) UMLS source vocabularies and SNOMED-CT
You can configure additional concept graphs (see below).
Note that you must perform the additional YTEX installation tasks to use this component. You must install the UMLS if you want to create your own concept graphs.
The similarity web app allows you to:
- select a concept graph against which measures should be computed
- specify concept pair(s)
- specify measures
The similarity web application has two pages:
Compute similarities for a single concept pair. In addition to the similarity values, this page outputs the path between concepts. You can enter the text of the concept, and the application will attempt to find the corresponding concept id (CUI). Alternatively, you can simply enter the concept id.
Similarity Multiple: Compute the similarity between multiple pairs of concepts. Enter each concept pair on a different line, and separate concepts by a comma or whitespace. The output can be exported to a CSV file or Excel spreadsheet.
As with the web application, the RESTful and web services interfaces let you specify the concept graph, concept pairs, and measures for which similarities should be computed. Both interfaces accept a list of measures; the valid values are:
- Path-Finding Measures:
  - WUPALMER: Wu & Palmer
  - LCH: Leacock & Chodorow
  - PATH: Path
  - RADA: Rada
- Corpus IC Based Measures:
  - LIN: Lin
- Intrinsic IC Based Measures:
  - INTRINSIC_LIN: Intrinsic IC based Lin
  - INTRINSIC_LCH: Intrinsic IC based Leacock & Chodorow
  - INTRINSIC_PATH: Intrinsic IC based Path, identical to Jiang & Conrath
  - INTRINSIC_RADA: Intrinsic IC based Rada
  - JACCARD: Intrinsic IC based Jaccard
  - SOKAL: Intrinsic IC based Sokal & Sneath
To get the similarity between a pair of concepts using the umls concept graph and the LCH and Intrinsic LCH measures: http://informatics.med.yale.edu/ytex.web/services/rest/similarity?conceptGraph=umls&concept1=C0018787&concept2=C0024109&metrics=LCH,INTRINSIC_LCH&lcs=true
The parameters are:
- concept1/concept2: the concept ids
- metrics: comma-separated list of metrics
- conceptGraph (optional): concept graph to use; if not specified, the default graph is used
- lcs (optional): set to true to get the paths through the Least Common Subsumer
The service returns XML with a list of similarities corresponding to the list of metrics. See the WSDL for the corresponding web service for the schema.
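The request above can be assembled programmatically. The following is a minimal sketch that uses only the Python standard library; the function names `similarity_url` and `fetch_similarity_xml` are hypothetical helpers, not part of YTEX, and the response is returned as raw text because the XML schema is not reproduced on this page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://informatics.med.yale.edu/ytex.web/services/rest"

# Metric names as listed on this page.
VALID_METRICS = {
    "WUPALMER", "LCH", "PATH", "RADA", "LIN",
    "INTRINSIC_LIN", "INTRINSIC_LCH", "INTRINSIC_PATH", "INTRINSIC_RADA",
    "JACCARD", "SOKAL",
}

def similarity_url(concept1, concept2, metrics, concept_graph=None,
                   lcs=False, base_url=BASE_URL):
    """Build the REST similarity URL from the documented parameters."""
    unknown = [m for m in metrics if m not in VALID_METRICS]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    params = {}
    if concept_graph:  # optional; the server falls back to its default graph
        params["conceptGraph"] = concept_graph
    params["concept1"] = concept1
    params["concept2"] = concept2
    params["metrics"] = ",".join(metrics)
    if lcs:  # optional; requests the paths through the least common subsumer
        params["lcs"] = "true"
    # safe="," keeps the comma-separated metric list readable in the URL.
    return f"{base_url}/similarity?{urlencode(params, safe=',')}"

def fetch_similarity_xml(url, timeout=30):
    """Fetch the raw XML response (assumes UTF-8); parsing is left to the caller."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `similarity_url("C0018787", "C0024109", ["LCH", "INTRINSIC_LCH"], concept_graph="umls", lcs=True)` reproduces the URL shown above. If you run YTEX locally, pass your own `base_url`.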
To get a list of concept graphs: http://informatics.med.yale.edu/ytex.web/services/rest/getConceptGraphs
To get the 'default' concept graph: http://informatics.med.yale.edu/ytex.web/services/rest/getDefaultConceptGraph
The Web Services interface is analogous to the RESTful interface, but allows the computation of similarities for multiple concept pairs. See http://informatics.med.yale.edu/ytex.web/services/conceptSimilarityWebService?wsdl
The ConceptSimilarityServiceImpl Java program accepts a list of concept pairs, and outputs their similarities in a tab-delimited format. It accepts the following arguments:
Java Command Line Arguments:
-Dytex.conceptGraphName: overrides the default concept graph name (defined in ytex.properties, by default sct-rxnorm), e.g. -Dytex.conceptGraphName=umls to use the umls concept graph.
-Xmx<memory>: sets the Java heap size. The amount of memory needed depends on the concept graph. For the sct-rxnorm graph, 256 MB is sufficient (-Xmx256m); the larger UMLS concept graph requires 1200 MB.
ConceptSimilarityServiceImpl Command Line Arguments:
-metrics: required; comma-separated list of metrics (see above for valid values).
-out: optional; file to send output to. If not specified, output goes to standard out.
-lcs: whether the least common subsumer and paths should be output for each concept pair.
-concepts: a list of concept pairs, or a file with concept pairs. In a file, place each concept pair on a separate line and separate the concepts by whitespace or commas. In a list of concept pairs, separate the concepts in a pair by a comma and the pairs by semicolons.
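The two input formats for -concepts can be illustrated with a short sketch. The helper names `parse_pair_lines` and `format_concepts_arg` are hypothetical and not part of YTEX; they only show how file-style input (one pair per line) maps to the inline list form (comma within a pair, semicolon between pairs).

```python
import re

def parse_pair_lines(text):
    """Parse file-style input: one pair per line, split on commas or whitespace."""
    pairs = []
    for line in text.splitlines():
        tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
        if not tokens:
            continue  # skip blank lines
        if len(tokens) != 2:
            raise ValueError(f"expected two concepts, got: {line!r}")
        pairs.append((tokens[0], tokens[1]))
    return pairs

def format_concepts_arg(pairs):
    """Format pairs as an inline list: comma inside a pair, semicolon between pairs."""
    return ";".join(f"{c1},{c2}" for c1, c2 in pairs)
```

With this sketch, a file containing the lines `C0018787 C0024109` and `C0024109,C0018787` corresponds to the inline value `C0018787,C0024109;C0024109,C0018787`.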
To start the Semantic Similarity Web Application, run CTAKES_HOME\bin\ytexweb.bat (Windows) or CTAKES_HOME/bin/ytexweb.sh (Linux) and open http://localhost:8080/semanticSim.jsf.
To create a concept graph, you create a properties file that contains a query that retrieves all the edges from a taxonomy. The ConceptDaoImpl then does the following:
- executes this query
- builds a concept graph
- removes edges that induce cycles
- computes the depth and intrinsic information content of each node in the graph
- writes the concept graph to the file system
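The steps above can be sketched on a small in-memory edge list. This is not the ConceptDaoImpl implementation: the cycle handling (skip any edge whose addition would close a cycle), the depth definition (shortest distance from a root), and the intrinsic IC formula (the leaves and subsumers based estimate in the style of Sanchez & Batet) are assumptions made for illustration, and all function names are hypothetical.

```python
import math

def closure(start, links):
    """All nodes reachable from start by following links (start excluded)."""
    seen, stack = set(), list(links.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links.get(node, ()))
    return seen

def build_graph(edges):
    """Build child/parent maps from (parent, child) edges, dropping cycle-inducing edges."""
    children, parents, removed = {}, {}, []
    for parent, child in edges:
        # The edge closes a cycle if the parent is already below the child.
        if parent == child or parent in closure(child, children):
            removed.append((parent, child))
            continue
        children.setdefault(parent, set()).add(child)
        parents.setdefault(child, set()).add(parent)
    return children, parents, removed

def depths(children, parents):
    """Shortest distance from any root (a node without parents), found breadth-first."""
    nodes = set(children) | set(parents)
    depth = {n: 0 for n in nodes if not parents.get(n)}
    frontier = list(depth)
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in children.get(node, ()):
                if child not in depth:
                    depth[child] = depth[node] + 1
                    next_frontier.append(child)
        frontier = next_frontier
    return depth

def intrinsic_ic(children, parents):
    """Assumed formula: -log((|leaves(c)| / |subsumers(c)| + 1) / (max_leaves + 1))."""
    nodes = set(children) | set(parents)
    leaves = {n for n in nodes if not children.get(n)}
    ic = {}
    for node in nodes:
        n_leaves = len(closure(node, children) & leaves)
        n_subsumers = len(closure(node, parents)) + 1  # ancestors plus the node itself
        ic[node] = -math.log((n_leaves / n_subsumers + 1) / (len(leaves) + 1))
    return ic
```

Under this formula the root gets an IC of 0 and leaves get the maximum IC, which matches the intuition that general concepts carry little information. The closure computation per node is also what makes this step expensive on large graphs.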
Computing the intrinsic IC is very memory intensive; for large concept graphs, give this task as much memory as you can. Computing the intrinsic IC for the entire UMLS takes 1.5 hours with an 8 GB Java heap.
As an example, here is what you do to create a concept graph with just the SNOMED-CT vocabulary from the UMLS:
1) Create a properties file that defines the required parameters
This file must be located in CTAKES_HOME/resources/org/apache/ctakes/ytex/conceptGraph/<concept graph name>.xml
2) Run the ConceptDaoImpl, changing the memory option -Xmx1g to as much memory as you can spare.
You will get warnings about removing cycles. The concept graph will be stored in the CTAKES_HOME/resources/org/apache/ctakes/ytex/conceptGraph directory.
We compute the intrinsic information content (intrinsic IC) when creating the concept graph. The InfoContentEvaluatorImpl class computes the corpus information content (corpus IC) for a given concept graph and corpus. This class takes as input a properties file that contains a query used to retrieve concept frequencies from the database; it then computes the information content of each node in the concept graph; finally it stores this in the feature_eval and feature_rank ytex database tables.
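As an illustration of how corpus IC follows from concept frequencies, the sketch below uses a toy taxonomy and made-up counts. It is not the InfoContentEvaluatorImpl implementation: the propagation of counts to ancestors, the formula IC(c) = -log(freq(c) / freq(root)), and the Lin formula 2 * IC(lcs) / (IC(c1) + IC(c2)) are the standard textbook definitions, assumed here for illustration, and every name in the code is hypothetical.

```python
import math

# Toy taxonomy and counts (hypothetical): child -> parent, concept -> own frequency.
PARENT = {
    "disease": "entity",
    "drug": "entity",
    "heart_disease": "disease",
    "lung_disease": "disease",
}
OWN_COUNTS = {"disease": 10, "heart_disease": 20, "lung_disease": 20, "drug": 50}

def corpus_ic(parent, own_counts):
    """IC(c) = -log(freq(c) / freq(root)); freq(c) includes all descendants of c."""
    total = dict.fromkeys(set(parent) | set(parent.values()), 0)
    for concept, count in own_counts.items():
        node = concept
        while node is not None:  # add the count to the concept and every ancestor
            total[node] += count
            node = parent.get(node)
    root_total = max(total.values())  # the root accumulates every count
    return {c: -math.log(n / root_total) for c, n in total.items() if n > 0}

def ancestors_and_self(concept, parent):
    """The concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lin(c1, c2, parent, ic):
    """Lin similarity: 2 * IC(lcs) / (IC(c1) + IC(c2))."""
    above_c1 = set(ancestors_and_self(c1, parent))
    # First shared node on the way up from c2 is the least common subsumer.
    lcs = next(n for n in ancestors_and_self(c2, parent) if n in above_c1)
    denominator = ic[c1] + ic[c2]
    return 2.0 * ic[lcs] / denominator if denominator > 0 else 0.0
```

With these counts, `disease` covers half of all occurrences and each of its children one fifth, so the Lin similarity of `heart_disease` and `lung_disease` is log(2) / log(5), while any pair whose only common ancestor is the root scores 0.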
Concept frequencies may come from the YTEX annotation tables, but can come from any database table. For example, to compute the corpus IC using all the concepts from all annotated documents in the YTEX database, we would create a properties file named corpusIC.props.xml.
To compute corpus information content, run the InfoContentEvaluatorImpl.
To use the corpus IC with the Lin similarity measure, specify the corpus name in CTAKES_HOME/resources/org/apache/ctakes/ytex/ytex.properties.