Converting Tables to RDF/OWL Ontologies (Draft)

Problem

The SNIK ontology consists of the central meta ontology and several subontologies. Each subontology is manually created by people reading textbooks and extracting their facts into spreadsheets, from now on called “tables”. As our final representation format is RDF/OWL and there are thousands of rows extracted from each textbook, we need an automatic conversion method. This task is commonly abbreviated as CSV2RDF, referring to the text-based comma-separated values table format.
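
For example, a table row stating that the “Abgb” (Allgemeines Bürgerliches Gesetzbuch) is a kind of “Gesetz” (law) should end up as a triple roughly like the following, where the prefix stands for the respective subontology’s namespace:

he:Abgb rdfs:subClassOf he:Gesetz .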

For our existing subontologies1, we used a specifically developed tool called Excel2OWL. Its advantages are that it:

Its disadvantages are that:

Instead, we are going to use a more flexible approach based on the generic SPARQLify CSV rewriter tool, which provides the SML mapping language. This has the advantages that

and the disadvantages that

A third option would be an imperative mapping language, where you “program” your mapping like in Excel2OWL but don’t have to develop or maintain the tool itself.

Due to the major time savings and the increased flexibility, we decided to use SPARQLify CSV for our upcoming subontologies. Providing a basic level of RDF/OWL knowledge to the extractors is useful anyway, because it helps them understand whether the process captures what they actually intend to model. This gap was especially problematic for owl:someValuesFrom existential restrictions, which Excel2OWL always generated for all triples using meta ontology properties, even though the actual intention of the extractor may have been something different, such as a universal owl:allValuesFrom quantifier or a direct connection between the classes.
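
To illustrate with made-up class names: a row stating “Hospital meta:uses InformationSystem” could legitimately be intended as any of the following (Turtle sketches, with bb: standing for the subontology’s namespace):

# what Excel2OWL always generated: an existential restriction
bb:Hospital rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:uses ;
      owl:someValuesFrom bb:InformationSystem ] .

# a universal restriction, which the extractor may have intended instead
bb:Hospital rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:uses ;
      owl:allValuesFrom bb:InformationSystem ] .

# or simply a direct connection between the classes
bb:Hospital meta:uses bb:InformationSystem .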

CSV2RDF is a common task in the field, which caused the W3C to publish a recommendation document on 2015-12-17. There are many existing CSV2RDF tools (see the lists of the W3C and timrdf). Almost any of them could be used for the task, so I chose SPARQLify CSV for my own convenience: I am already experienced with it, I know that it works reliably, and I have a good connection to its developer, Claus Stadler from AKSW, so I can get help and give feedback about bugs and useful enhancements on short notice. Still, I went through the lists in case some killer feature is available elsewhere and to contribute something back to Claus.

Mapping Large Scale Research Metadata to Linked Data compares the speed and memory usage of different RDF-producing tools, which is relevant when you have hundreds of millions of triples; we, however, only have several thousand triples in total, so performance is irrelevant for us. They initially considered Tarql, SPARQLify and Tabels.

[csv2rdf4lod]

Karma is a semi-automatic approach with a graphical user interface that seems great for unknown data with a large schema, but it is not necessary in our case, where we designed a simple input format ourselves.

Any23 accepts a wide range of input formats.

CSVImport is a special-purpose tool for RDF Data Cubes, that is, multidimensional numerical data.

Tarql is actually quite similar to SPARQLify CSV: you write the mapping as a SPARQL 1.1 query. Some of my feature requests for SPARQLify are already implemented there, such as “skip bad rows” with WHERE { FILTER (BOUND(?d)) }, “split a field value into multiple values” with the apf:strSplit (?var "<delimiter>") property function, and prefix expansion with expandPrefixedName(?var). This led me to abandon SPARQLify for the moment and switch to Tarql.

Being able to basically write SPARQL is very comfortable.
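
For illustration, a minimal Tarql query sketch of the two features mentioned above (untested; it assumes a made-up two-column table with headers “id” and “pages” and the Jena ARQ namespace for apf:):

PREFIX apf: <http://jena.apache.org/ARQ/property#>
PREFIX ex:  <http://example.org/>

CONSTRUCT { ?s ex:page ?page . }
WHERE {
  # skip rows where the id column is empty
  FILTER (BOUND(?id))
  BIND (URI(CONCAT(STR(ex:), ?id)) AS ?s)
  # split a cell like "117;120" into one value per page
  ?page apf:strSplit (?pages ";") .
}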

Future Work

One idea is that a mapping would be a Groovy script file, but that requires Groovy expertise (bus factor). Claus Stadler is thinking about SPARQL-based imperative extensions, with three components in the mapping:

  1. line-by-line processing of the table
  2. program code using lambda expressions, for example

mapper.define("y").asUri((col) -> prefix("owl") + col)

which corresponds to the declarative binding ?y = uri(owl:, ?col), or chained definitions like

mapper.setPrefix("ex", "http://")
      .addMapping("myMapping")
      .addTriple("?s a ex:foobar")
      .define("s").asUri()

Further ideas:

  1. load the table into an SQL server to use SQL features like explode("")
  2. a workflow engine for large files, to resume after crashes

SPARQLify (CSV)

SPARQLify, the tool we chose for the task, has already been used successfully multiple times. For example, Ermilov, Auer and Stadler used SPARQLify CSV to transform almost 10000 datasets with a total of 7.3 billion triples for the publicdata.eu data portal. Ermilov, Höffner and Lehmann used SPARQLify to transform parts of the database of ITMO university, Saint Petersburg.

Installation

As I don’t have admin rights on my work computer, I installed it via mvn clean install, which took more than 12 minutes. The README says to run mvn assembly:assembly afterwards, but that is broken right now, so we create the command ourselves (replace paths accordingly):

echo 'java -cp /insert/path/to/sparqlify/sparqlify-cli/target/sparqlify-cli-0.8.0-jar-with-dependencies.jar org.aksw.sparqlify.csv.CsvMapperCliMain -h "$@"' > ~/bin/sparqlify
chmod +x ~/bin/sparqlify

The -h parameter tells SPARQLify that the first row of our table consists of the column headers, which are used to identify the columns.

Now we can test it with:

sparqlify -c sparqlify-examples/src/main/resources/sparqlify-examples/csv/example1.sml -f sparqlify-examples/src/main/resources/sparqlify-examples/csv/example1.csv

If successful, this generates some logging messages and ends with:

<http://example.org/hello> <http://example.org/name> "hello" .
<http://example.org/hello> <http://example.org/age> "world" .
<http://example.org/hello> <http://example.org/email> "bar" .
<http://example.org/hello> <http://example.org/isPositive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://example.org/hello> <http://example.org/gender> "foo" .
Variable	#Unbound
Triples generated:	5
Potential triples omitted:	0
Triples total:	5

Now we want to apply it to our own data. Let’s take a look at the first few rows, including the headers, from which we removed the spaces so that we can use them in the SML Mapping Language:

SubjektUri SubjDe SubjEn SubjAltDe SubjAltEn Subjekttyp Relation Objekt SeiteRelation Definition SeiteDefinition Kapitel
Abgb Allgemeines Bürgerliches Gesetzbuch (Ö)   ABGB   EntityType rdfs:subClassOf Gesetz   Allgemeines Bürgerliches Gesetzbuch (Österreich). Website;67 RECHT
Ablauforganisation Ablauforganisation Organizational Structure     EntityType meta:isBasedOn ImKonzept 18     ERMOD
AbleitenTeilstrategien Ableiten von Teilstrategien       Function         117;120  
AbstimmenUnternehmensstrategie Abstimmen mit der Unternehmensstrategie       Function meta:uses Wettbewerbsstrategie 117      
AbstimmenUnternehmensstrategie Abstimmen mit der Unternehmensstrategie business-IT-alignment     Function       Da die strategischen IT-Ziele […] 117  
Abstraktionsprinzip Abstraktionsprinzip       EntityType rdfs:subClassOf Architekturprinzip 52 Das Abstraktionsprinzip verlangt […] 52 ARCHI

As SPARQLify cannot guess the prefixes of our entities, we had to provide them in the “Relation” column. For SubjektUri, Subjekttyp, Objekt and Kapitel we don’t need prefixes, because each of those columns implies a fixed prefix, which we can add in the mapping.

The SML file, which specifies the mapping for a specific table format, is surprisingly simple and needs only minor adaptations for future subontologies:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX he: <http://www.snik.eu/ontology/he/>
PREFIX meta: <http://www.snik.eu/ontology/meta/>

Create View Template test As
  Construct {
    ?s  a owl:Class;
        meta:subTopClass ?st;
        rdfs:label ?lde, ?len;
        skos:altLabel ?lade, ?laen;
        ?p ?o;
        he:page ?pd;
        skos:definition ?d;
        he:chapter ?ch.
  }

With
    ?s = uri(he:, ?SubjektUri)
    ?st = uri(meta:, ?Subjekttyp)
    ?lde = plainLiteral(?SubjDe,"de")
    ?len = plainLiteral(?SubjEn,"en")
    ?lade = plainLiteral(?SubjAltDe,"de")
    ?laen = plainLiteral(?SubjAltEn,"en")
    ?p = uri(?Relation)
    ?o = uri(he:,?Objekt)
    ?pr = typedLiteral(?SeiteRelation,xsd:positiveInteger)
    ?d = plainLiteral(?Definition,"de")
    ?pd = typedLiteral(?SeiteDefinition,xsd:positiveInteger)
    ?ch = uri(he:,?Kapitel)

We execute this mapping and filter out debug messages with:

sparqlify 2>&1 -c heinrich.sml -f htest.csv > heinrich.nt | egrep -v "(TRACE)|(DEBUG)"

This results in 8729 triples from our test table of 1439 rows, which is a preliminary excerpt of the ongoing extraction of the new “Heinrich” (he) ontology.

<http://www.snik.eu/ontology/he/Ablauforganisation> <meta:isBasedOn> <http://www.snik.eu/ontology/he/ImKonzept> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2004/02/skos/core#definition> ""@de .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2004/02/skos/core#altLabel> ""@de .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2000/01/rdf-schema#label> "Ablauforganisation"@de .                                                                                                                    
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.snik.eu/ontology/he/page> ""^^<http://www.w3.org/2001/XMLSchema#positiveInteger> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.snik.eu/ontology/he/chapter> <http://www.snik.eu/ontology/he/ERMOD> .
<http://www.snik.eu/ontology/he/Ablauforganisation> <http://www.w3.org/2000/01/rdf-schema#label> "Organizational Structure"@en .

Unfortunately, SPARQLify creates triples with empty objects for empty cells, so for now we filter out lines containing <> or "" in a fix script. Different null handling options will be available soon, however. SPARQLify also does not expand already prefixed URIs but simply wraps them in angle brackets, like <meta:isBasedOn>, which we correct in the fix script as well. Finally, we convert the N-Triples output to Turtle for easier viewing. Because N-Triples does not support prefixed names, we merge the output with a predefined prefix file and treat the result as a Turtle file. The full fix script:

cp prefix.ttl /tmp/tmp.ttl                                                                                                                                                                                                                    
egrep -v '("")|(<>)' heinrich.nt | sed -r "s|<([^ /]+):([^ /]+)>|\1:\2|g" >> /tmp/tmp.ttl
rapper -i turtle -o turtle /tmp/tmp.ttl > fixed.ttl
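
After running the fix script, the meta:isBasedOn line from above should read roughly as follows, assuming prefix.ttl declares the he: and meta: prefixes:

he:Ablauforganisation meta:isBasedOn he:ImKonzept .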

While this results in a syntactically correct Turtle file, the modelling of meta relations is semantically incorrect, as they should be OWL restrictions and not direct connections between classes. For more detail about OWL restrictions, see my last blog post about property validation. I assume that the old tool had a list of meta properties and treated them differently than the others, but SML does not allow this, so we have to create two additional columns, RestrictionProperty and RestrictionObject. On the positive side, this allows us to add a third column named “RestrictionType” to finally make the restriction type explicit, allowing not only the formerly exclusively used owl:someValuesFrom but also owl:allValuesFrom and potentially, later, the others (owl:hasValue, owl:cardinality, owl:minCardinality, owl:maxCardinality; more information here).
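
As a Turtle sketch of the intended result: a row that connects Ablauforganisation to ImKonzept via meta:isBasedOn and carries the (hypothetical) RestrictionType value someValuesFrom should then produce a restriction instead of a direct triple:

he:Ablauforganisation rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty meta:isBasedOn ;
      owl:someValuesFrom he:ImKonzept ] .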

TARQL

Installation

TARQL is very easy to install: download the latest release, unzip it to a folder of your choice and execute bin/tarql. It requires an installed Java 8 JRE.
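
A first run could then look roughly like this (a sketch with made-up file names; Tarql should print Turtle to standard output by default):

bin/tarql mapping.sparql table.csv > output.ttl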

Input Table

We reuse the modified input table from our SPARQLify CSV attempt without changes.

Mapping File

Unlike SPARQLify, TARQL doesn’t have a formally specified mapping language, but its introduction gives a general idea and some examples, which will hopefully be enough for our purposes.

The following Vim regular expression converts the SML bindings into SPARQL BIND expressions; some manual changes, for example for expandPrefixedName, are still necessary: :%s/\(?[a-z]*\) = \(.*\)$/BIND (\2 AS \1)/

Plain literals are converted with :%s/plainLiteral/STRLANG/.
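
Putting the conversion together, a first Tarql mapping for our table could look roughly like the following sketch. It is untested and abbreviated (alternative labels and page numbers are omitted), and it assumes the tarql: function namespace <http://tarql.github.io/tarql#> for expandPrefixedName:

PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#>
PREFIX he:    <http://www.snik.eu/ontology/he/>
PREFIX meta:  <http://www.snik.eu/ontology/meta/>
PREFIX tarql: <http://tarql.github.io/tarql#>

CONSTRUCT {
  ?s  a owl:Class ;
      meta:subTopClass ?st ;
      rdfs:label ?lde, ?len ;
      skos:definition ?d ;
      he:chapter ?ch ;
      ?p ?o .
}
WHERE {
  # the column headers of the CSV file become the variable names
  BIND (URI(CONCAT(STR(he:), ?SubjektUri)) AS ?s)
  BIND (URI(CONCAT(STR(meta:), ?Subjekttyp)) AS ?st)
  BIND (STRLANG(?SubjDe, "de") AS ?lde)
  BIND (STRLANG(?SubjEn, "en") AS ?len)
  BIND (STRLANG(?Definition, "de") AS ?d)
  BIND (URI(CONCAT(STR(he:), ?Kapitel)) AS ?ch)
  # the Relation column already contains prefixed names like meta:isBasedOn
  BIND (tarql:expandPrefixedName(?Relation) AS ?p)
  BIND (URI(CONCAT(STR(he:), ?Objekt)) AS ?o)
}

Compared to the SML view definition above, this is almost a line-by-line translation; only the uri(…) term constructors become URI/CONCAT or expandPrefixedName calls.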

Tarql provides the null handling we need by default: empty cells lead to unbound variables, and the corresponding triples are simply not generated. See, for example, the output for he:AbleitenTeilstrategien…

he:AbleitenTeilstrategien
        rdf:type          owl:Class ;
        meta:subTopClass  meta:Function ;
        rdfs:label        "Ableiten von Teilstrategien"@de .

… which has a German label but no English one. Still, the bound part of the row is mapped and there are no empty values.

1 bb—”blue book”, “Health Information Systems”. ob—”orange book”, “IT-Projektmanagement im Gesundheitswesen”