TORNADO aims to leverage semantic web technologies to evaluate the integrity of heterogeneous data, with a particular focus on referential integrity. The goal of the project is to develop a web-based platform that evaluates the consistency of unstructured data sets with the help of reference ontologies.
Maintaining data consistency is a major challenge when dealing with information originating from heterogeneous sources. This is especially true for semi-structured or unstructured data with many interdependencies across different domains.
Semantic web technologies are designed with the scale of the Web in mind. Furthermore, their expressive power allows them to capture knowledge about more complex data than traditional database approaches can.
TORNADO’s goal is to leverage semantic web technologies to evaluate the quality of large sets of heterogeneous and multifaceted data. Its key idea is to define a reference ontology against which the data is evaluated.
Ontologies as quality criteria
Ontologies define the meanings of terms and relationships in a formal yet user-accessible manner. The vocabulary of an OWL ontology is reusable and can therefore serve as a basis for solutions built on semantic web technologies. Moreover, the ontology language supports expressions that describe criteria for the data better than other data modelling approaches do.
Within TORNADO, we test an RDF dataset against the axioms stated in an OWL ontology. Data quality violations are encoded semantically, which helps explain the problems in terms relevant to end users.
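As an illustration of testing data against an axiom, the following minimal sketch checks a single range axiom over a handful of triples. It is not TORNADO's implementation: the triples, the `ex:` names and the `range_violations` helper are hypothetical, and the axiom is read as a closed-world constraint (a value that is not typed as expected counts as a violation), whereas a standard OWL reasoner would instead infer the missing type.

```python
# Triples are plain (subject, predicate, object) tuples; every name is hypothetical.
RDF_TYPE = "rdf:type"

data = [
    ("ex:order1", RDF_TYPE, "ex:Order"),
    ("ex:order1", "ex:placedBy", "ex:alice"),
    ("ex:alice", RDF_TYPE, "ex:Customer"),
    ("ex:order2", RDF_TYPE, "ex:Order"),
    ("ex:order2", "ex:placedBy", "ex:bob"),  # ex:bob is never declared a Customer
]

# Range axioms taken from an assumed reference ontology: property -> required class.
range_axioms = {"ex:placedBy": "ex:Customer"}


def range_violations(triples, axioms):
    """Return one record per triple whose object lacks the type its property requires."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    violations = []
    for s, p, o in triples:
        expected = axioms.get(p)
        if expected is not None and (o, expected) not in typed:
            violations.append(
                {"subject": s, "property": p, "value": o, "expected_type": expected}
            )
    return violations


violations = range_violations(data, range_axioms)
for v in violations:
    print(f"{v['subject']} {v['property']} {v['value']}: value is not a {v['expected_type']}")
```

Each violation record keeps the subject, property and expected class, so the problem can be reported in the vocabulary of the ontology rather than as a bare failure.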
SWRL rules let a suitable reasoner establish relationships between resources automatically. In TORNADO, however, they are treated as data quality criteria against which a data set is evaluated.
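One possible reading of a rule as a quality criterion is sketched below: wherever the body of the rule matches, the head is expected to be present in the data already, and a missing head is reported instead of being inferred. The rule, the facts and the `missing_uncle_facts` helper are hypothetical, and this interpretation is an assumption rather than a description of how TORNADO evaluates rules.

```python
# Hypothetical SWRL-style rule:
#   hasParent(?x, ?y) ^ hasBrother(?y, ?z) -> hasUncle(?x, ?z)
# Facts are (property, subject, object) tuples; all names are made up.
facts = {
    ("hasParent", "ex:carol", "ex:dave"),
    ("hasBrother", "ex:dave", "ex:erik"),
    ("hasParent", "ex:frank", "ex:gina"),
    ("hasBrother", "ex:gina", "ex:hugo"),
    ("hasUncle", "ex:frank", "ex:hugo"),  # head already stated for this match
}


def missing_uncle_facts(facts):
    """List the head facts the rule would derive that the data does not state."""
    parents = [(x, y) for (p, x, y) in facts if p == "hasParent"]
    brothers = [(y, z) for (p, y, z) in facts if p == "hasBrother"]
    missing = []
    for x, y in parents:
        for y2, z in brothers:
            # The rule body matches when both atoms share the same ?y.
            if y == y2 and ("hasUncle", x, z) not in facts:
                missing.append(("hasUncle", x, z))
    return sorted(missing)


missing = missing_uncle_facts(facts)
print(missing)
```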
SPARQL is the query language of the semantic web. It is based on graph patterns that are matched against a data set; the resulting matches can then be filtered and ordered. TORNADO attempts to extract quality criteria from SPARQL queries so that the data set can be tested automatically.
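To make the idea of graph patterns concrete, here is a small sketch of a basic graph pattern matcher written in plain Python rather than with a SPARQL engine. The triples, the `ex:` names, the `match_pattern` and `solve` helpers, and the choice of "negative total" as the quality criterion are all hypothetical; how TORNADO actually derives criteria from queries is not shown here.

```python
def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None on a mismatch."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match_pattern(first, triple, binding)
        if extended is not None:
            yield from solve(rest, triples, extended)


triples = [
    ("ex:order1", "ex:placedBy", "ex:alice"),
    ("ex:order1", "ex:total", "120"),
    ("ex:order2", "ex:placedBy", "ex:bob"),
    ("ex:order2", "ex:total", "-5"),
]

# Roughly: WHERE { ?order ex:placedBy ?customer ; ex:total ?total . FILTER(?total < 0) }
patterns = [
    ("?order", "ex:placedBy", "?customer"),
    ("?order", "ex:total", "?total"),
]

violations = [b for b in solve(patterns, triples) if float(b["?total"]) < 0]
for b in violations:
    print(f"{b['?order']} has a negative total: {b['?total']}")
```

The two patterns share the variable `?order`, so a solution exists only when both triples describe the same resource; the filter then keeps the solutions that break the assumed criterion.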