Interlinking SNIK subontologies with LIMES

Abstract

We used LIMES to find link candidates between similar concepts in the SNIK subontologies. The preliminary matching results in 110 exact label matches and 149 similar label matches. Of the 80 existing manually created links, 62 are rediscovered as exact matches, 1 is rediscovered as a similar match, and 17 are not found again, which leaves 48 new exact matches and 148 new similar matches. The new links need manual verification and a closeness rating, that is, a choice between skos:exactMatch, skos:closeMatch and so on (see SNIK wiki).

Introduction

Problem

The SNIK ontology consists of the central meta ontology and several subontologies. While the meta ontology contains central terms that are used by nearly all subontologies, other terms are used by only some of them, have slightly different meanings in each, or are not central. In these cases, we modelled those terms as separate classes in each of the subontologies.

As the extractions are done by different people, the extractors do not know which classes of other subontologies are related to their own, and thus this information is not contained in the extraction tables.

As a separate effort, our domain experts manually created 80 links between the bb and the ob subontologies. We suspect that there are more than 80 similar classes between the two, but the search space is too large for manual inspection: the number of potential class pairs equals the product of the numbers of instances in both, which are 1123 for the blue book (bb) and 702 for the orange book (ob). As further subontologies are currently being extracted, the number of potential matches grows quadratically not only in the number of instances but also in the number of subontologies. For example, for 4 ontologies with 700 instances each, the number of comparisons is $700^2 \cdot \frac{4 \cdot 3}{2} = 2940000$.
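
The size of the search space can be checked with a few lines of Python. This is only an illustrative sketch; the function name `pair_count` is made up for this example, and the class counts are the ones quoted above.

```python
def pair_count(class_counts):
    """Sum of class-pair counts over all unordered pairs of ontologies."""
    total = 0
    for i in range(len(class_counts)):
        for j in range(i + 1, len(class_counts)):
            total += class_counts[i] * class_counts[j]
    return total

bb_ob_pairs = pair_count([1123, 702])   # blue book x orange book
four_way_pairs = pair_count([700] * 4)  # the four-ontology example from the text

print(bb_ob_pairs)     # 788346
print(four_way_pairs)  # 2940000
```

Even for just the two current subontologies, this is far more candidate pairs than a domain expert could inspect by hand.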

Goal

As creating all links manually is infeasible, we need an existing tool that can easily be configured to automatically find potential links between our subontologies. We plan to use those relations, called interlinks, to compare linked concepts to find differences and similarities between the subontologies and thus the different textbook sources. With the links, we also aim to enable integrating instance data from concrete hospitals in the future, when they use different subontologies.

Implementation

As the interlinking tool, we chose LIMES, the “Link Discovery Framework for Metric Spaces”.

Installation and Execution

Download the newest release limes.jar and run java -jar limes.jar <config_file>. LIMES requires Java 8+ (see footnote 1).

Input Ontologies

We used the SNIK ontologies ob.rdf and bb.rdf in version 0.3.3.

Configuration File

The config file syntax is well documented with an XML DTD, a manual and examples. Using an example as a basis, we arrived at the following preliminary bb-ob mapping configuration.

First, we define the doctype and the prefixes, which we abbreviate here:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LIMES SYSTEM "limes.dtd">
<LIMES>
	<PREFIX>[...]</PREFIX>
	[...]
	<PREFIX>[...]</PREFIX>

Then we define the source and target data sources, in this case as files, which is faster than using an endpoint. The files are generated by converting bb.rdf and ob.rdf to N-Triples. Language tags are removed from the labels, which keeps the tags out of the similarity calculation but also means that labels in different languages are compared with each other:

	<SOURCE>
		<ID>bb</ID>
		<!-- LIMES 1.1.1 requires absolute paths-->
		<ENDPOINT>/mypath/snik-ontology/limes/tmp/bb.nt</ENDPOINT>
		<VAR>?bb</VAR>
		<PAGESIZE>-1</PAGESIZE>
		<RESTRICTION>?bb a owl:Class</RESTRICTION>
		<PROPERTY>rdfs:label AS nolang RENAME label</PROPERTY>
		<TYPE>N-TRIPLE</TYPE>
	</SOURCE>

	<TARGET>
		<ID>ob</ID>
		<ENDPOINT>/mypath/snik-ontology/limes/tmp/ob.nt</ENDPOINT>
		<VAR>?ob</VAR>
		<PAGESIZE>-1</PAGESIZE>
		<RESTRICTION>?ob a owl:Class</RESTRICTION>
		<PROPERTY>rdfs:label AS nolang RENAME label</PROPERTY>
		<TYPE>N-TRIPLE</TYPE>
	</TARGET>
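
To illustrate what removing language tags means for the comparison, here is a minimal Python sketch. It is not LIMES code; the helper `strip_lang_tag` and the sample labels are hypothetical and only mimic the effect described above on N-Triples style literals.

```python
import re

# Matches a trailing language tag such as @en or @de-DE after a quoted literal.
LANG_TAG = re.compile(r'@[A-Za-z]+(?:-[A-Za-z0-9]+)*$')

def strip_lang_tag(literal: str) -> str:
    """Return the text of a literal without its language tag and surrounding quotes."""
    without_tag = LANG_TAG.sub("", literal.strip())
    if len(without_tag) >= 2 and without_tag[0] == '"' and without_tag[-1] == '"':
        return without_tag[1:-1]
    return without_tag

# Hypothetical sample labels, not taken from the SNIK ontologies.
labels = ['"Hospital"@en', '"Krankenhaus"@de', '"Hospital"']
print([strip_lang_tag(label) for label in labels])
# ['Hospital', 'Krankenhaus', 'Hospital']
```

After this step, an English and a German label are plain strings of the same kind, so the metric compares them like any other pair.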

For now, we only use a very simple metric on the labels:

	<METRIC>trigrams(bb.label,ob.label)</METRIC>
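
For readers unfamiliar with trigram similarity, the following Python sketch shows one common formulation (a Dice coefficient over padded character trigrams). It is an assumption for illustration only; the exact formula, padding and case handling used by LIMES may differ.

```python
from collections import Counter

def trigrams(text: str) -> list:
    """Character trigrams of the text, padded so that short strings still yield trigrams."""
    padded = "  " + text + "  "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over trigram multisets: 2 * |shared| / (|A| + |B|)."""
    ta, tb = Counter(trigrams(a)), Counter(trigrams(b))
    shared = sum((ta & tb).values())  # multiset intersection
    total = sum(ta.values()) + sum(tb.values())
    return 2 * shared / total if total else 1.0

# Hypothetical labels, not taken from the SNIK ontologies.
print(trigram_similarity("Hospital", "Hospital"))  # 1.0
print(trigram_similarity("Information System", "Information Systems"))
```

Identical labels score exactly 1, and labels that differ by a few characters score slightly below 1, which is the range the thresholds in the configuration operate on.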
	

We only accept classes with labels that are exactly the same…

	<ACCEPTANCE>
		<THRESHOLD>1</THRESHOLD>
		<FILE>bb-ob-exact.ttl</FILE>
		<RELATION>skos:exactMatch</RELATION>
	</ACCEPTANCE>

… but we collect class pairs with similar labels as well (see footnote 2), which we will have to validate manually later:

	<REVIEW>
		<THRESHOLD>0.8</THRESHOLD>
		<FILE>bb-ob-close.ttl</FILE>
		<RELATION>skos:closeMatch</RELATION>
	</REVIEW>

	<EXECUTION>
		<REWRITER>default</REWRITER>
		<PLANNER>default</PLANNER>
		<ENGINE>default</ENGINE>
	</EXECUTION>

	<OUTPUT>TAB</OUTPUT>
</LIMES>
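
The interplay of the two thresholds can be summarised in a short Python sketch. It reflects our reading of the configuration above (scores at or above the acceptance threshold are accepted, scores between the two thresholds go to review); the function `classify` is hypothetical and not part of LIMES.

```python
ACCEPT_THRESHOLD = 1.0  # value of <ACCEPTANCE><THRESHOLD>
REVIEW_THRESHOLD = 0.8  # value of <REVIEW><THRESHOLD>

def classify(similarity: float) -> str:
    """Map a similarity score to the output it would end up in."""
    if similarity >= ACCEPT_THRESHOLD:
        return "bb-ob-exact.ttl (skos:exactMatch)"
    if similarity >= REVIEW_THRESHOLD:
        return "bb-ob-close.ttl (skos:closeMatch)"
    return "discarded"

for score in (1.0, 0.93, 0.5):
    print(score, "->", classify(score))
```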

Results

The preliminary matching results in 110 exact label matches and 149 similar label matches. Of the 80 existing manually created links, 62 are rediscovered as exact matches, 1 is rediscovered as a similar match, and 17 are not found again, which leaves 48 new exact matches and 148 new similar matches. The new links need manual verification and a closeness rating, that is, a choice between skos:exactMatch, skos:closeMatch and so on (see SNIK wiki).
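
A comparison of this kind boils down to set operations on link pairs. The following Python sketch shows the idea with toy data; the function `compare_links` and all identifiers in it are hypothetical and do not refer to actual SNIK classes or to the script we used.

```python
def compare_links(manual, exact, close):
    """Partition link sets; each link is a (source, target) tuple."""
    return {
        "rediscovered_exact": manual & exact,
        "rediscovered_close": (manual & close) - exact,
        "not_found": manual - exact - close,
        "new_exact": exact - manual,
        "new_close": close - manual - exact,
    }

# Toy data with made-up identifiers.
manual = {("bb:A", "ob:A"), ("bb:B", "ob:B2"), ("bb:C", "ob:X")}
exact = {("bb:A", "ob:A"), ("bb:D", "ob:D")}
close = {("bb:B", "ob:B2"), ("bb:E", "ob:E2")}

counts = {name: len(links) for name, links in compare_links(manual, exact, close).items()}
print(counts)
```

With the numbers reported above, the same bookkeeping gives 110 - 62 = 48 new exact matches, 149 - 1 = 148 new similar matches, and 80 - 62 - 1 = 17 manual links that were not found.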

Future Work

In the near future, we will manually rate all link candidates generated here and add the accepted ones to the SNIK ontologies.

We plan to incorporate more properties than just the label into the link metric.

Also, we will generate interlinks between all future subontologies and the other subontologies.

The task of interlinking has its roots in the database world, where it has a long tradition under the name of record matching, with many relevant applications in integrating databases, for example between a library and a publisher. Failure to achieve record matching can be disastrous. For example, when hundreds of thousands of refugees came to Germany in 2015, different government institutions in different states didn’t match their databases, so that much of the registration work had to be repeated many times, resulting in long wait times, missing statistics and high costs (German non-scientific source). Even worse, regionalism and data protection laws prevent a unified medical database, so that patients who are examined at one place may be examined again at another, leading to wasted time and money, wrong treatments, and heightened exposure to side effects of examination environments, such as X-rays in radiology or multidrug-resistant organisms in hospitals. Of course, hospitals communicate with each other, but sending letters is far less effective than a common database: information that is not asked for will never be sent, whereas a common database would allow checking automatically whether, for example, the patient has an allergy to a certain medication.

In the context of the Semantic Web, this task is called (entity/data) (inter-)linking, instance matching, record linkage, data deduplication, entity reconciliation or coreference resolution (source). The task of interlinking can be seen as a subtask of ontology matching (also called ontology alignment). For further reading, see Ontology Matching, Euzenat et al., 2013 (paywall with preview) and State of the Art on Ontology Alignment, Vargas-Vera et al., 2015 (paywall).

1 Execution: Java 8+ JRE, including JavaFX if using the GUI. Compilation: Java 8+ JDK including JavaFX, Maven. See also the installation manual and the FAQ.

2 For exact matches alone we wouldn’t have needed LIMES at all as they would be possible with a simple SPARQL query.