Tutorial 1: Exploring and aggregating RDF data with SCB
Author: Reto Bachmann-Gmür - clerezza.org
Contributor: Hasan - clerezza.org
Update: Florent - apache.org
Date: 2010-06-14
Table of Contents
1. Objective
2. Setting up the Maven project
3. Creating a Graph and loading Data
4. Accessing the Triples
5. Resource context
6. Putting it all together: the example app
7. Taking it further
8. References
1. Objective
In this tutorial you will learn how to use SCB to manage data modeled as a graph based on the RDF [1] standard maintained by W3C.
You'll learn how to get Graph objects from serialized RDF data on the web and how to access such a graph using the core SCB package and the SCB utilities package.
Key advantages of SCB include support for OSGi [2], which allows better modularization of applications, and support for other triple store APIs through technology-specific façades (adapters). These advantages are covered in later tutorials; this tutorial lays the foundation for working with SCB and understanding the basic concepts of its graph data model.
Our example downloads data about BBC television sitcoms from dbpedia into a local graph, displays the context of a given resource, and downloads additional data from the web when the user requests it. Working through this tutorial takes approximately an hour. It is intended for Java developers; some familiarity with the build tool Maven [3] is an advantage.
2. Setting up the Maven project
We use Maven to build our project and to keep track of dependencies in an IDE-independent way. Maven takes care of downloading the required dependencies from their respective repositories.
2.1. Initializing
First, create a Maven project with the groupId org.example.clerezza.scb and the artifactId tutorial1 by executing the following command in a shell:
If all goes well, the output of the command contains the following:
A new directory called tutorial1 is created, containing a source directory src and a file called pom.xml that Maven uses to build the project. A program file called App.java is created in the directory src/main/java/org/example/clerezza/scb/tutorial1/. We will modify this class to build our demo application, but first we add the required dependencies to our pom.xml.
2.2. Adding dependencies
As the required Clerezza artifacts are not yet in the default Maven
repositories, we need to add the respective repository locations to
our pom.xml (alternatively, we could add them globally
to the Maven settings.xml). Add the following as a child element of
project in your pom.xml:
Now we can add the dependencies that Maven will download from the
Clerezza repositories to the dependencies section.
The following are the compile-time dependencies (the default scope
for dependencies is compile). Besides
org.clerezza.rdf.core, which provides the core SCB bundles,
we add org.clerezza.rdf.utils, which contains handy utility
classes, and org.clerezza.rdf.ontologies, which contains
classes with constants for the terms of popular ontologies.
The version numbers given were the latest at the time of writing. To find the latest release or snapshot version, check https://repository.apache.org/content/repositories/releases/ or https://repository.apache.org/content/repositories/snapshots/ respectively.
The above dependencies are sufficient to compile our application. To run it, however, we need some runtime dependencies as well, because SCB mainly provides interfaces to exchangeable implementations:
The two dependencies are implementations of RDF parsers and serializers for various formats. They are based on the Jena Framework [4], but you don't have to care about this.
One more thing: Maven defaults to a rather old Java version. To fix
this, add the following to configure the
maven-compiler-plugin to use Java 6. The
build element is a child of project.
Now run "mvn compile"; Maven should report that the build was successful.
Enough configuration, let's get our hands dirty and write some code.
3. Creating a Graph and loading Data
In RDF, graphs are collections of triples. Strictly speaking, graphs
are immutable: if you add or remove a triple, it's a new graph. For
this reason SCB distinguishes between two types of
TripleCollections: Graph and
MGraph, where 'M' stands for "mutable". The MGraph
and Graph interfaces both extend TripleCollection, which,
apart from extending java.util.Collection<Triple>,
provides a method filter
to query RDF triples according to the filter parameters
specified: subject, predicate, and object.
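The difference between a mutable collection of triples and an immutable graph can be illustrated without any SCB classes. The following is only a sketch using the Java standard library: the Triple record and the snapshot method are hypothetical stand-ins invented for this illustration and are not part of the SCB API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; these are not SCB classes.
final class ImmutableSnapshotSketch {

    // A triple reduced to three strings: subject, predicate, object.
    record Triple(String subject, String predicate, String object) {}

    // Returns a read-only copy; later changes to 'mutable' do not affect it.
    static Set<Triple> snapshot(Set<Triple> mutable) {
        return Collections.unmodifiableSet(new HashSet<>(mutable));
    }

    public static void main(String[] args) {
        Set<Triple> mutable = new HashSet<>();   // plays the role of an MGraph
        mutable.add(new Triple("ex:a", "ex:knows", "ex:b"));

        Set<Triple> frozen = snapshot(mutable);  // plays the role of a Graph
        mutable.add(new Triple("ex:b", "ex:knows", "ex:c"));

        System.out.println("mutable size: " + mutable.size());
        System.out.println("frozen size: " + frozen.size());
        try {
            frozen.add(new Triple("ex:c", "ex:knows", "ex:a"));
        } catch (UnsupportedOperationException e) {
            System.out.println("the snapshot cannot be modified");
        }
    }
}
```

Adding a triple to the mutable set leaves the earlier snapshot untouched, which mirrors the idea that a changed graph is a new graph.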
The factory we need for getting TripleCollections
is TcManager. Depending on the available
storage providers, the returned instances may be backed by an efficient
triple store like Sesame or, if no provider is available, by a simple and
terribly inefficient HashSet-based implementation.
To store our accumulated knowledge around BBC television sitcoms
we create an MGraph with the following code:
The code creates an empty MGraph with the name
<http://tutorial.example.org/>. To verify that all went well
we can output the size of mGraph with the following:
To compile and run the application using Maven, issue the following command in the directory containing the pom.xml:
The actual program output will be surrounded by Maven log messages. If you pass the -q argument, you'll only see the actual output of our program:
Boring emptiness, let's add the triples dbpedia has about <http://dbpedia.org/resource/Category:BBC_television_sitcoms>. First, we use standard classes from the java.net package to dereference this URI.
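A minimal sketch of this step, using only java.net and java.io, might look as follows. The class name DereferenceSketch, the helper prepare, the "application/rdf+xml" Accept header and the timeout values are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Illustrative sketch; class and method names are hypothetical.
final class DereferenceSketch {

    // Prepares (but does not yet send) a request that asks the server for RDF/XML.
    static URLConnection prepare(String uri) {
        try {
            URLConnection connection = URI.create(uri).toURL().openConnection();
            connection.setRequestProperty("Accept", "application/rdf+xml");
            connection.setConnectTimeout(10_000); // assumed timeouts, in milliseconds
            connection.setReadTimeout(10_000);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection =
                prepare("http://dbpedia.org/resource/Category:BBC_television_sitcoms");
        // The request is sent only when the response is accessed.
        try (InputStream in = connection.getInputStream()) {
            System.out.println("Content type: " + connection.getContentType());
            System.out.println("First byte of the response: " + in.read());
        } catch (IOException e) {
            System.out.println("Could not dereference the URI: " + e);
        }
    }
}
```

Note that HttpURLConnection follows redirects by default only within the same protocol. If a server redirects from http to https, the second request has to be issued explicitly.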
The URI <http://dbpedia.org/resource/Category:BBC_television_sitcoms> represents the abstract notion (the category) of BBC television sitcoms, which is not something that can actually be passed over the wire. The server therefore answers with a "303 See Other" response pointing to a document describing the category we originally requested. In our case this is <http://dbpedia.org/data/Category:BBC_television_sitcoms.rdf>; for a normal browser it would be <http://dbpedia.org/page/Category:BBC_television_sitcoms>. URLConnection transparently handles this redirection, so we don't have to care about sending the second request to the server.
Now that we have an InputStream from which RDF/XML can be read, we
use org.apache.clerezza.rdf.core.serializedform.Parser to
convert it to a graph:
Using the addAll method, which MGraph inherits from
Collection<Triple>, we can add the triples of the
retrieved Graph to mGraph:
Outputting the size of the graph now returns something else (the number of triples will vary as dbpedia evolves):
4. Accessing the Triples
It's good to know that by loading data into our MGraph
its size has increased, but actually we would like to get data
out of mGraph. The easiest would be to just use
the Serializer to write the Graph to standard output:
The above code serializes mGraph in the Turtle format
to the standard output. You may want to try "text/rdf+nt" and
"application/rdf+xml" to see the triples serialized in different ways.
The typical way to get specific triples is to use the filter
method which Graph and MGraph inherit from
TripleCollection. The following outputs the
RDF:type of the resource
<http://dbpedia.org/resource/Category:BBC_television_sitcoms>:
Note the use of RDF.type, a constant from the
org.clerezza.rdf.ontologies package and Maven artifact. null
is used as a wildcard, here in the object position.
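To make the wildcard semantics concrete, here is a small self-contained sketch of pattern matching over triples in which null matches anything. It uses only the Java standard library. The Triple record, the filter method shown here and the sample data are hypothetical stand-ins and do not reproduce the SCB implementation; in particular, returning a List is a simplification chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; these are not SCB classes.
final class TripleFilterSketch {

    record Triple(String subject, String predicate, String object) {}

    // A null pattern acts as a wildcard and matches every value.
    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    // Returns all triples matching the given subject, predicate and object patterns.
    static List<Triple> filter(Iterable<Triple> triples,
                               String subject, String predicate, String object) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : triples) {
            if (matches(subject, t.subject())
                    && matches(predicate, t.predicate())
                    && matches(object, t.object())) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
                new Triple("ex:category", "rdf:type", "skos:Concept"),
                new Triple("ex:category", "rdfs:label", "BBC television sitcoms"),
                new Triple("ex:show", "dcterms:subject", "ex:category"));

        // Subject and predicate are fixed, the object position is a wildcard.
        for (Triple t : filter(graph, "ex:category", "rdf:type", null)) {
            System.out.println(t.object());
        }
    }
}
```

Passing null in all three positions would return every triple of the collection.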
5. Resource context
Often we want to get a concise description of a resource, the context of a resource. In terms of RDF, the "context" can be formalized as the set of statements in which the resource is either subject or object. If such a statement contains a blank node, the context of that blank node is included as well [5].
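The definition above can be turned into a short algorithm. The following sketch is written against plain Java collections rather than SCB classes, so the Triple record and the convention that blank node names start with "_:" are assumptions made for this illustration. Whether the SCB utilities compute the context in exactly this way is not claimed here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; these are not SCB classes.
final class ResourceContextSketch {

    record Triple(String subject, String predicate, String object) {}

    // Assumption of this sketch: blank nodes are written with a "_:" prefix.
    static boolean isBlank(String node) {
        return node.startsWith("_:");
    }

    // Collects all triples that have the resource as subject or object and,
    // for every blank node met on the way, the context of that blank node.
    static Set<Triple> context(Iterable<Triple> graph, String resource) {
        Set<Triple> result = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(resource);
        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (!visited.add(node)) {
                continue; // already expanded, avoids endless loops
            }
            for (Triple t : graph) {
                if (!t.subject().equals(node) && !t.object().equals(node)) {
                    continue;
                }
                result.add(t);
                if (isBlank(t.subject())) pending.push(t.subject());
                if (isBlank(t.object())) pending.push(t.object());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
                new Triple("ex:show", "ex:producer", "_:b1"),
                new Triple("_:b1", "ex:name", "Jane"),
                new Triple("ex:show", "ex:genre", "ex:sitcom"),
                new Triple("ex:sitcom", "ex:label", "Sitcom"));
        context(graph, "ex:show").forEach(System.out::println);
    }
}
```

In the sample data the statement about ex:sitcom's label is not part of the context of ex:show, because ex:sitcom is a named resource and not a blank node.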
The context can easily be accessed by using the GraphNode
class in the org.apache.clerezza.rdf.utils package.
The method above returns the context of the resource
named by selectedUri.
6. Putting it all together: the example app
Putting what we learned together and adding a swing front-end:
The pom.xml should be equivalent to what you already have if you followed this tutorial. The Java code creates a Swing frame with a table containing the context of a selected resource. By default, clicking on a named resource that is the subject or object of a statement shows the context of this resource. Clicking on the button "Load Context from Web" dereferences the resource and adds the triples to the local store.
7. Taking it further
A trivially achievable improvement of the example application would be to add persistent storage.
By adding the Sesame persistent storage provider to the runtime classpath
of the application, our MGraph is stored in a Sesame store
[8] (this obsoletes the dependency on
org.apache.clerezza.rdf.sesame.storage).
After adding this dependency, on the second launch of the application
we should get an exception complaining that the graph already exists. The
reason is that TcManager contains separate methods
for accessing an existing MGraph and for creating a new one. The
following would solve the issue:
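The underlying pattern is "get the graph if it exists, otherwise create it". The sketch below shows the pattern with plain Java collections only. The Store class, its get and create methods and the exception types are hypothetical stand-ins chosen for this illustration; they are not the TcManager API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical stand-in for illustration only; this is not the TcManager API.
final class GetOrCreateSketch {

    // A store with separate operations for getting and for creating a graph.
    static final class Store {
        private final Map<String, Set<String>> graphs = new HashMap<>();

        Set<String> create(String name) {
            if (graphs.containsKey(name)) {
                throw new IllegalStateException("graph already exists: " + name);
            }
            Set<String> graph = new HashSet<>();
            graphs.put(name, graph);
            return graph;
        }

        Set<String> get(String name) {
            Set<String> graph = graphs.get(name);
            if (graph == null) {
                throw new NoSuchElementException("no such graph: " + name);
            }
            return graph;
        }
    }

    // Tries to access the existing graph first and creates it only if it is missing.
    static Set<String> getOrCreate(Store store, String name) {
        try {
            return store.get(name);
        } catch (NoSuchElementException e) {
            return store.create(name);
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Set<String> first = getOrCreate(store, "http://tutorial.example.org/");
        Set<String> second = getOrCreate(store, "http://tutorial.example.org/");
        System.out.println("same graph on second access: " + (first == second));
    }
}
```

With a persistent store, the first launch of the application takes the create branch and every later launch takes the get branch.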
If anything is unclear or you'd like to take it even further, ask about it on our mailing list: http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/
8. References
[1] W3C: Resource Description Framework (RDF): Concepts and Abstract Syntax; 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
[2] OSGi, http://www.osgi.org/Main/HomePage
[3] Maven, http://maven.apache.org/
[4] Jena Framework, http://jena.sourceforge.net/
[5] The introduced concept of "context" is close to that of "RDF Molecules" [6] and Minimum Self-contained Graphs [7].
[6] Ding, L.; Finin, T.; Peng, Y.; Pinheiro da Silva, P.; McGuinness, D.: "Tracking RDF Graph Provenance using RDF Molecules"; 2005, Proceedings of the Fourth International Semantic Web Conference, November 2005
[7] Tummarello, G.; Morbidoni, C.; Puliti, P.; Piazza, F.: "Signing individual fragments of an RDF graph"; 2005, World Wide Web Conference 2005 Poster Track
[8] Sesame, http://openrdf.com/
Copyright (c) 2008-2009 trialox.org (trialox AG, Switzerland)