Converting Tables to RDF/OWL Ontologies (Draft)
Problem
The SNIK ontology consists of the central meta ontology and several subontologies. Each subontology is manually created by people reading textbooks and extracting their facts into spreadsheets, from now on called “tables”. As our final representation format is RDF/OWL, and there are thousands of rows extracted from each textbook, we need an automatic conversion method. This task is commonly abbreviated as CSV2RDF, referring to the text-based comma-separated values (CSV) table format.
For our existing subontologies1, we used a specifically developed tool called Excel2OWL. Its advantages are that it:
- specifies an intuitive table input format that doesn’t require RDF/OWL knowledge from the extractors and
- validates the input and warns about certain inconsistencies with the meta ontology
Its disadvantages are that:
- The input format is hard-coded and only works for a very specific case, thus
- it is time-consuming to implement an improved modelling method and
- it is not easily adaptable to new subtasks, especially as
- it is not maintained anymore, thus there will be no more bugfixes or updates.
Instead, we are going to use a more flexible approach based on the generic SPARQLify CSV rewriter tool, which provides the SML mapping language. This has the advantages that
- The development time is massively reduced (writing a mapping configuration vs developing a whole tool).
- No maintenance is necessary.
- SML is very close to SPARQL, so resulting RDF can be modelled very intuitively and little extra training is necessary.
- SPARQLify is open source and the developer is very responsive to requests for help or new features.
and the disadvantages that
- The mapping language is less expressive than a full-fledged programming language, thus
- the input format needs to be more closely aligned to RDF/OWL, thus
- the extractors need a basic knowledge of RDF/OWL.
- We need to validate the input separately.
- SPARQLify is a research prototype developed part-time by a single (even though very talented and enthusiastic) developer, not industrial-strength software made by a large team. Thus, setting it up is a bit tricky and a few workarounds are necessary.
A third option would be an imperative mapping language, where you “program” your mapping like in Excel2OWL but don’t have to develop or maintain the tool itself.
Due to the major time savings and the increased flexibility, we decided to use SPARQLify CSV for our upcoming subontologies.
Providing a basic level of RDF/OWL knowledge is useful anyway, because it helps the extractors understand whether the process captures what they actually intend to model.
This gap was especially problematic for owl:someValuesFrom existential restrictions, which Excel2OWL always generated for all triples using meta ontology properties, even though the actual intention of the extractor may have been something different, such as a universal owl:allValuesFrom quantifier or a direct connection between the classes.
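To make the difference concrete, here is a small Turtle sketch; he:ClassA and he:ClassB are placeholder classes invented for illustration, and meta:uses is one of the meta ontology properties, with the prefixes used later in this post:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix meta: <http://www.snik.eu/ontology/meta/> .
@prefix he:   <http://www.snik.eu/ontology/he/> .

# 1. direct connection between the two classes
he:ClassA meta:uses he:ClassB .

# 2. existential restriction, as always generated by Excel2OWL:
#    every instance of ClassA uses at least one instance of ClassB
he:ClassA rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:uses ;
      owl:someValuesFrom he:ClassB ] .

# 3. universal restriction: everything an instance of ClassA uses is an instance of ClassB
he:ClassA rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:uses ;
      owl:allValuesFrom he:ClassB ] .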
Related Work
CSV2RDF is a common task in the field, which caused the W3C to produce a recommendation document on 2015-12-17. There are many existing CSV2RDF tools (see the lists of the W3C and timrdf), and almost any of them could be used for the task. I chose SPARQLify CSV for my own convenience: I am already experienced with it, I know that it works reliably, and I have a good connection to the developer, Claus Stadler from AKSW, so I can get help and give feedback about bugs and useful enhancements on short notice. Still, I went through the lists in case there is some killer feature available somewhere else, and to contribute something back to Claus.
“Mapping Large Scale Research Metadata to Linked Data” compares the speed and memory usage of different RDF-producing tools, which is relevant when you have hundreds of millions of triples; we only have several thousand triples in total, so performance is irrelevant for us. They initially considered Tarql, SPARQLify and Tabels.
[csv2rdf4lod]
Karma is a semi-automatic approach with a graphical user interface that seems great for unknown data with a large schema, but it is not necessary in our case, where we designed a simple input format ourselves.
Any23 accepts a wide range of input formats.
CSVImport is a special-purpose tool for RDF Data Cubes, that is, multidimensional numerical data.
Tarql is actually quite similar to SPARQLify CSV; you write the mapping as a SPARQL 1.1 query. Some of my feature requests for SPARQLify are already implemented there, such as “skip bad rows” with WHERE { FILTER (BOUND(?d)) }, “split a field value into multiple values” with the apf:strSplit (?var "<delimiter>") property function, and expandPrefixedName(?var). This led me to abandon SPARQLify for the moment and switch to Tarql.
Being able to basically write SPARQL is very comfortable.
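A rough sketch of how those features would combine in a Tarql query; the column names (?d, ?Pages, ?Id), the example namespace and the apf: prefix are assumptions for illustration and should be checked against the Tarql documentation:

PREFIX apf: <http://jena.apache.org/ARQ/property#>
PREFIX ex:  <http://example.org/>

CONSTRUCT { ?s ex:page ?page . }
WHERE {
  # skip rows where the ?d column is empty
  FILTER (BOUND(?d))
  # split a cell like "117;120" into single values, one binding per value
  ?page apf:strSplit (?Pages ";") .
  # build the subject URI from the ?Id column
  BIND (URI(CONCAT("http://example.org/", ?Id)) AS ?s)
}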
Future Work on SPARQLify
The mapping could become a Groovy script file, but that would require Groovy expertise from the mapping authors and raises the bus factor. Claus Stadler is also thinking about SPARQL-based imperative extensions with several components in the mapping:
- line-by-line processing
- program code using lambda expressions, for example
mapper.define("y").asUri((col)-> prefix("owl")+col)
as the equivalent of the SML expression ?y = uri(owl:,?col), or a builder-style configuration such as
mapper.setPrefix("ex","http://")
      .addMapping("myMapping")
      .addTriple("?s a ex:foobar")
      .define("s").asUri()
- loading the table into an SQL server to use SQL functions like explode("")
- a workflow engine for large files, to resume after crashes
Future Work
- train extractors and check adoption by them
- use with full data when available
SPARQLify (CSV)
SPARQLify, the tool we chose for the task, has already been used successfully multiple times. For example, Ermilov, Auer and Stadler used SPARQLify CSV to transform almost 10000 datasets with a total of 7.3 billion triples for the publicdata.eu data portal. Ermilov, Höffner and Lehmann used SPARQLify to transform parts of the database of ITMO university, Saint Petersburg.
Installation
As I don’t have admin rights on my work computer, I installed it via mvn clean install
, which took more than 12 minutes.
The README says to run mvn assembly:assembly afterwards, but that is broken right now, so we create the command ourselves (replace paths accordingly):
echo 'java -cp /insert/path/to/sparqlify/sparqlify-cli/target/sparqlify-cli-0.8.0-jar-with-dependencies.jar org.aksw.sparqlify.csv.CsvMapperCliMain -h "$@"' > ~/bin/sparqlify
chmod +x ~/bin/sparqlify
The -h parameter says that the first row of the table consists of the column headers, which are used to identify the columns.
Now we can test it with:
sparqlify -c sparqlify-examples/src/main/resources/sparqlify-examples/csv/example1.sml -f sparqlify-examples/src/main/resources/sparqlify-examples/csv/example1.csv
If successful, this generates some logging messages and ends with:
<http://example.org/hello> <http://example.org/name> "hello" .
<http://example.org/hello> <http://example.org/age> "world" .
<http://example.org/hello> <http://example.org/email> "bar" .
<http://example.org/hello> <http://example.org/isPositive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://example.org/hello> <http://example.org/gender> "foo" .
Variable #Unbound
Triples generated: 5
Potential triples omitted: 0
Triples total: 5
Now we want to apply it to our own data and take a look at the first few rows, including the headers, from which we removed spaces so that they can be used in the SML mapping language:
SubjektUri | SubjDe | SubjEn | SubjAltDe | SubjAltEn | Subjekttyp | Relation | Objekt | SeiteRelation | Definition | SeiteDefinition | Kapitel |
---|---|---|---|---|---|---|---|---|---|---|---|
Abgb | Allgemeines Bürgerliches Gesetzbuch (Ö) | | ABGB | | EntityType | rdfs:subClassOf | Gesetz | | Allgemeines Bürgerliches Gesetzbuch (Österreich). | Website;67 | RECHT |
Ablauforganisation | Ablauforganisation | Organizational Structure | | | EntityType | meta:isBasedOn | ImKonzept | 18 | | | ERMOD |
AbleitenTeilstrategien | Ableiten von Teilstrategien | | | | Function | | | 117;120 | | | |
AbstimmenUnternehmensstrategie | Abstimmen mit der Unternehmensstrategie | | | | Function | meta:uses | Wettbewerbsstrategie | 117 | | | |
AbstimmenUnternehmensstrategie | Abstimmen mit der Unternehmensstrategie | business-IT-alignment | | | Function | | | | Da die strategischen IT-Ziele […] | 117 | |
Abstraktionsprinzip | Abstraktionsprinzip | | | | EntityType | rdfs:subClassOf | Architekturprinzip | 52 | Das Abstraktionsprinzip verlangt […] | 52 | ARCHI |
Abstraktionsprinzip | Abstraktionsprinzip | | | | EntityType | rdfs:subClassOf | Architekturprinzip | 52 | Das Abstraktionsprinzip verlangt […] | 52 | ARCHI |
As SPARQLify cannot guess the prefixes of our entities, we had to provide them in the “Relation” column. For SubjektUri, Subjekttyp, Objekt and Kapitel we don’t need prefixes, though, because each of those columns always implies a fixed prefix, which we can add in the mapping.
The SML file, which specifies the mapping for a specific table format, is surprisingly simple and needs only minor adaptations for future subontologies:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX he: <http://www.snik.eu/ontology/he/>
PREFIX meta: <http://www.snik.eu/ontology/meta/>
Create View Template test As
Construct {
?s a owl:Class;
meta:subTopClass ?st;
rdfs:label ?lde, ?len;
skos:altLabel ?lade, ?laen;
?p ?o;
he:page ?pd;
skos:definition ?d;
he:chapter ?ch.
}
With
?s = uri(he:, ?SubjektUri)
?st = uri(meta:, ?Subjekttyp)
?lde = plainLiteral(?SubjDe,"de")
?len = plainLiteral(?SubjEn,"en")
?lade = plainLiteral(?SubjAltDe,"de")
?laen = plainLiteral(?SubjAltEn,"en")
?p = uri(?Relation)
?o = uri(he:,?Objekt)
?pr = typedLiteral(?SeiteRelation,xsd:positiveInteger)
?d = plainLiteral(?Definition,"de")
?pd = typedLiteral(?SeiteDefinition,xsd:positiveInteger)
?ch = uri(he:,?Kapitel)
We execute this mapping and filter out debug messages with:
sparqlify 2>&1 -c heinrich.sml -f htest.csv > heinrich.nt | egrep -v "(TRACE)|(DEBUG)"
This results in 8729 triples from our test table of 1439 rows, which is a preliminary excerpt of the ongoing extraction of the new “Heinrich” (he) ontology.
<http://www.snik.eu/ontology/he/Ablauforganisation> <meta:isBasedOn> <http://www.snik.eu/ontology/he/ImKonzept> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2004/02/skos/core#definition> ""@de .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2004/02/skos/core#altLabel> ""@de .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2000/01/rdf-schema#label> "Ablauforganisation"@de .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.snik.eu/ontology/he/page> ""^^<http://www.w3.org/2001/XMLSchema#positiveInteger> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.snik.eu/ontology/he/chapter> <http://www.snik.eu/ontology/he/ERMOD> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2000/01/rdf-schema#label> "Organizational Structure"@en .
Unfortunately, SPARQLify creates triples with empty objects for empty cells, so for now we filter out lines containing <> or "" in a fix script.
Different null handling choices will be available soon, however.
SPARQLify does not support already prefixed URIs and simply wraps them in angle brackets, like <meta:isBasedOn>, which we correct in the fix script as well.
Finally, we convert the N-Triples output to Turtle for easier viewing.
Because N-Triples doesn’t support prefixed names, we merge the output with a predefined prefix file and treat the result as a Turtle file.
The full fix script:
cp prefix.ttl /tmp/tmp.ttl
egrep -v '("")|(<>)' heinrich.nt | sed -r "s|<([^ /]+):([^ /]+)>|\1:\2|g" >> /tmp/tmp.ttl
rapper -i turtle -o turtle /tmp/tmp.ttl > fixed.ttl
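Here, prefix.ttl contains the prefix declarations needed to parse the shortened names as Turtle; a minimal version, assuming the same prefixes as in the SML mapping, would be:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix he:   <http://www.snik.eu/ontology/he/> .
@prefix meta: <http://www.snik.eu/ontology/meta/> .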
While this results in a syntactically correct Turtle file, the modelling of meta relations is semantically incorrect, as they should be OWL restrictions and not direct connections between classes. For more detail about OWL restrictions, see my last blog post about property validation. I assume that the old tool had a list of meta properties and treated them differently than the others, but SML does not allow this, so we have to create two additional columns, RestrictionProperty and RestrictionObject. On the positive side, this allows us to add a third column named “RestrictionType” to finally make the type of restriction explicit and to allow not only the formerly exclusively used “someValuesFrom” restriction but also “allValuesFrom” and potentially, later, the others (owl:hasValue, owl:cardinality, owl:minCardinality, owl:maxCardinality, more information here).
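For example, the table row stating that AbstimmenUnternehmensstrategie meta:uses Wettbewerbsstrategie should, with a RestrictionType of someValuesFrom, eventually produce a restriction like this (a sketch of the intended output, not something the current mapping generates):

he:AbstimmenUnternehmensstrategie rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:uses ;
      owl:someValuesFrom he:Wettbewerbsstrategie ] .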
TARQL
Installation
TARQL is very easy to install: download the latest release, unzip it to a folder of your choice and execute bin/tarql. It requires an installed Java 8 JRE.
Input Table
We reuse the modified input table from our SPARQLify CSV try without changes.
Mapping File
Unlike SPARQLify, TARQL doesn’t have a formally specified language, but it has an introduction that gives a general idea and some examples, which hopefully will be enough for our purposes.
A regular expression converts the SML bindings into BIND expressions; manual changes are still necessary afterwards:
:%s/\(?[a-z]*\) = \(.*\)$/BIND (\2 AS \1)/
Prefixed names have to be expanded with expandPrefixedName.
Plain literals become STRLANG calls:
:%s/plainLiteral/STRLANG
TARQL provides our required null handling by default. See for example…
he:AbleitenTeilstrategien
rdf:type owl:Class ;
meta:subTopClass meta:Function ;
rdfs:label "Ableiten von Teilstrategien"@de .
… which has a German label but no English one. Still, the bound part of the row is mapped and there are no empty values.
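Putting the pieces together, a first Tarql mapping for our table could look roughly like the following. This is a sketch, not the final mapping: the tarql:expandPrefixedName call and its namespace are assumptions based on the expandPrefixedName feature mentioned above and may need adjustment.

PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX he:    <http://www.snik.eu/ontology/he/>
PREFIX meta:  <http://www.snik.eu/ontology/meta/>
PREFIX tarql: <http://tarql.github.io/tarql#>

CONSTRUCT {
  ?s a owl:Class ;
     meta:subTopClass ?st ;
     rdfs:label ?lde, ?len ;
     skos:altLabel ?lade, ?laen ;
     ?p ?o ;
     skos:definition ?d ;
     he:page ?pd ;
     he:chapter ?ch .
}
WHERE {
  # skip rows without a subject URI
  FILTER (BOUND(?SubjektUri))
  # subjects, types, objects and chapters always use a fixed prefix
  BIND (URI(CONCAT("http://www.snik.eu/ontology/he/", ?SubjektUri)) AS ?s)
  BIND (URI(CONCAT("http://www.snik.eu/ontology/meta/", ?Subjekttyp)) AS ?st)
  BIND (URI(CONCAT("http://www.snik.eu/ontology/he/", ?Objekt)) AS ?o)
  BIND (URI(CONCAT("http://www.snik.eu/ontology/he/", ?Kapitel)) AS ?ch)
  # the Relation column already contains prefixed names
  BIND (tarql:expandPrefixedName(?Relation) AS ?p)
  # labels, alternative labels and definitions become language-tagged literals
  BIND (STRLANG(?SubjDe, "de") AS ?lde)
  BIND (STRLANG(?SubjEn, "en") AS ?len)
  BIND (STRLANG(?SubjAltDe, "de") AS ?lade)
  BIND (STRLANG(?SubjAltEn, "en") AS ?laen)
  BIND (STRLANG(?Definition, "de") AS ?d)
  # page numbers become typed literals
  BIND (STRDT(?SeiteDefinition, xsd:positiveInteger) AS ?pd)
}

Empty cells leave the corresponding variables unbound, so the affected triples are simply omitted. The mapping would be run with something like tarql heinrich.sparql htest.csv > heinrich.ttl.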
1 bb—”blue book”, “Health Information Systems”. ob—”orange book”, “IT-Projektmanagement im Gesundheitswesen” ↩