Tutorial 1: Exploring and aggregating RDF data with SCB

Author: Reto Bachmann-Gmür - clerezza.org

Contributor: Hasan - clerezza.org

Update : Florent - apache.org

Date: 2010-06-14

Table of Contents

1. Objective

2. Setting up the Maven project

3. Creating a Graph and loading Data

4. Accessing the Triples

5. Resource context

6. Putting it all together: the example app

7. Taking it further

8. References

1. Objective

In this tutorial you will learn how to use SCB to manage data modeled as a graph based on the RDF [1] standard maintained by W3C.

You'll learn how to get Graph objects from serialized RDF data on the web and how to access such a graph using the core SCB package and the SCB utilities package.

Key advantages of SCB include support for OSGi [2], which allows better modularization of applications, and support for other triple store APIs through technology-specific façades (adapters). These advantages are covered in later tutorials; this one lays the foundation for working with and understanding the basic concepts of the SCB graph data model.

Our example downloads data about BBC television sitcoms from dbpedia into a local graph, displays the context of a given resource and downloads additional data from the web when the user requests it. Going through this tutorial takes approximately an hour. It is intended for Java developers; some familiarity with the build tool Maven [3] is an advantage.

2. Setting up the Maven project

We use maven to build our project and to keep track of dependencies in an IDE independent way. Maven will take care of downloading the required dependencies from their respective repositories.

2.1. Initializing

First, a maven project with the groupId org.example.clerezza.scb and the artifactId tutorial1 will be created by executing the following command in a shell:

$ mvn archetype:generate --batch-mode \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DgroupId=org.example.clerezza.scb \
    -DartifactId=tutorial1 \
    -Dversion=1.0-SNAPSHOT \
    -Dpackage=org.example.clerezza.scb.tutorial1

If all goes well, the output of the command contains the following:

------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------

A new directory called tutorial1 is created, containing a source directory src and a file called pom.xml that Maven uses to build the project. A program file called App.java is placed under the directory src/main/java/org/example/clerezza/scb/tutorial1/. We will modify this class to build our demo application, but first we add the required dependencies to our pom.xml.

2.2. Adding dependencies

As the required Clerezza artifacts are not yet in the maven default repositories we need to add the respective repository locations to our pom.xml (alternatively we could add them globally to the maven settings.xml). Add the following as a child element of project in your pom.xml:

<repositories>
    <repository>
        <id>apache</id>
        <name>apache repository</name>
        <snapshots>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
        </snapshots>
        <url>http://repository.apache.org/content/groups/snapshots-group</url>
        <layout>default</layout>
    </repository>
</repositories>

Now we can add the dependencies that Maven will download from this repository to the dependencies section of the pom.xml.

The following are the compile-time dependencies (the default scope for dependencies is compile). Besides org.apache.clerezza.rdf.core, which provides the core SCB bundles, we add org.apache.clerezza.rdf.utils, which contains handy utility classes, and org.apache.clerezza.rdf.ontologies, which contains classes with constants for the terms of popular ontologies.

<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.core</artifactId>
    <version>0.12-incubating-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.utils</artifactId>
    <version>0.13-incubating-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.ontologies</artifactId>
    <version>0.11-incubating-SNAPSHOT</version>
</dependency>

The version numbers shown were the latest at the time of writing. To find the latest release or snapshot version, check https://repository.apache.org/content/repositories/releases/ or https://repository.apache.org/content/repositories/snapshots/ respectively.

The above dependencies are sufficient to compile our application. To run it, however, we need some runtime dependencies as well, because SCB mainly provides interfaces to exchangeable implementations:

<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.jena.parser</artifactId>
    <version>0.10-incubating-SNAPSHOT</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.jena.serializer</artifactId>
    <version>0.9-incubating-SNAPSHOT</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.5.5</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.simple.storage</artifactId>
    <version>0.7-incubating-SNAPSHOT</version>
    <scope>runtime</scope>
</dependency>

The first two dependencies are implementations of RDF parsers and serializers for various formats. They are based on the Jena Framework [4], but you don't have to care about this. Of the remaining two, slf4j-simple is a simple logging binding for SLF4J and org.apache.clerezza.rdf.simple.storage is, as its name suggests, a simple storage provider.

Almost forgot: Maven defaults to a rather old Java version. To fix this, add the following to configure the maven-compiler-plugin to use Java 6. The build element is a child of project.

<build>
    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>
</build>

Try running "$ mvn compile"; the output should report a successful build.

Enough configuration, let's get our hands dirty and write some code.

3. Creating a Graph and loading Data

In RDF, graphs are collections of triples. Strictly speaking, graphs are immutable: if you add or remove a triple, it's a new graph. For that reason SCB distinguishes between two types of TripleCollections: Graph and MGraph, where 'M' stands for "mutable". The MGraph and Graph interfaces both extend TripleCollection which, apart from extending java.util.Collection<Triple>, provides a method filter to query RDF triples according to the filter parameters specified: subject, predicate, and object.
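
To make the distinction between the two kinds of collection concrete, here is a minimal plain-Java sketch. It does not use the SCB API: SketchTriple and GraphKindsSketch are hypothetical stand-ins, and triples are reduced to three strings. It only illustrates the idea of a collection that changes in place next to an immutable snapshot that rejects changes.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a triple; NOT the SCB Triple interface.
final class SketchTriple {
    final String subject, predicate, object;

    SketchTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SketchTriple)) return false;
        SketchTriple t = (SketchTriple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate) && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }
}

class GraphKindsSketch {
    // Copies a mutable triple set into an immutable snapshot (the "Graph" role).
    static Set<SketchTriple> snapshot(Set<SketchTriple> mutable) {
        return Collections.unmodifiableSet(new HashSet<>(mutable));
    }

    public static void main(String[] args) {
        // A plain HashSet plays the "MGraph" role in this sketch.
        Set<SketchTriple> mutable = new HashSet<>();
        mutable.add(new SketchTriple("ex:a", "ex:knows", "ex:b"));
        Set<SketchTriple> frozen = snapshot(mutable);

        // Later changes affect the mutable set only, not the snapshot.
        mutable.add(new SketchTriple("ex:b", "ex:knows", "ex:c"));
        System.out.println("mutable: " + mutable.size() + ", snapshot: " + frozen.size());

        try {
            frozen.add(new SketchTriple("ex:c", "ex:knows", "ex:a"));
        } catch (UnsupportedOperationException e) {
            System.out.println("snapshot rejects modification");
        }
    }
}
```

Because the snapshot is a copy, adding a triple afterwards yields a different set of triples while the snapshot keeps describing the old one, which mirrors the "it's a new graph" reading above.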

The factory we need for getting TripleCollections is TcManager. Depending on the available storage providers, the returned instances may be backed by an efficient triple store like Sesame or, if no provider is available, by a simple and terribly inefficient HashSet-based implementation.

To store our accumulated knowledge around BBC television sitcoms we create an MGraph with the following code:

import org.apache.clerezza.rdf.core.*;
import org.apache.clerezza.rdf.core.access.TcManager;
...
//get the singleton instance of TcManager
final TcManager tcManager = TcManager.getInstance();
//the arbitrary name we use for our mutable graph
final UriRef mGraphName = new UriRef("http://tutorial.example.org/");
//the m-graph into which we'll put the triples we collect
final MGraph mGraph = tcManager.createMGraph(mGraphName);

We don't repeat the skeleton code generated by the maven archetype but trust the reader can add the statements above at a sensible place in App.java.

The code creates an empty MGraph with the name <http://tutorial.example.org/>. To verify that all went well we can output the size of mGraph with the following:

System.out.println("Size of mGraph: "+mGraph.size());

To compile and run the application using maven issue the following command in the directory where the pom.xml is:

$ mvn compile exec:java -Dexec.mainClass=org.example.clerezza.scb.tutorial1.App

The actual program output will be surrounded by Maven's log messages. If you pass the -q argument, you'll see only the actual output of our program:

Size of mGraph: 0

Boring emptiness, let's add the triples dbpedia has about <http://dbpedia.org/resource/Category:BBC_television_sitcoms>. First we use standard classes from the java.net package to dereference this URI.

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
...
final URL url = new URL("http://dbpedia.org/resource/Category:BBC_television_sitcoms");
final URLConnection con = url.openConnection();
//tell the server that we would like to get RDF/XML
con.addRequestProperty("Accept", "application/rdf+xml");
final InputStream inputStream = con.getInputStream();

The above code sets the Accept header of the HTTP request to "application/rdf+xml", which tells the server that we can handle responses in that format. For comparison, the value of the Accept header in the request of a browser might look like "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8".

The URI <http://dbpedia.org/resource/Category:BBC_television_sitcoms> represents the abstract notion (the category) of BBC television sitcoms, which is not something that can actually be passed over the wire. The server therefore answers with a "303 See Other" response pointing to a document describing the category we originally requested. In our case this is <http://dbpedia.org/data/Category:BBC_television_sitcoms.rdf>; for a normal browser it would be <http://dbpedia.org/page/Category:BBC_television_sitcoms>. URLConnection transparently handles this redirection, so we don't have to care about sending the second request to the server.
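
To illustrate how a server might use the Accept header to decide which document to point to, here is a simplified sketch in plain Java. AcceptHeaderSketch and its methods are hypothetical and are not part of SCB or of dbpedia's implementation; the sketch only looks at q-values and the "*/*" wildcard and ignores the finer precedence rules of HTTP content negotiation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AcceptHeaderSketch {
    // Parses an Accept header into media type -> q-value (q defaults to 1.0).
    static Map<String, Double> parse(String header) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            double q = 1.0;
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            result.put(pieces[0].trim(), q);
        }
        return result;
    }

    // Returns the offered media type with the highest q-value, or null if none is acceptable.
    // On ties, the type listed first in 'offered' wins.
    static String choose(String header, List<String> offered) {
        Map<String, Double> accepted = parse(header);
        String best = null;
        double bestQ = 0.0;
        for (String type : offered) {
            Double q = accepted.get(type);
            if (q == null) {
                q = accepted.get("*/*"); // fall back to the wildcard, if present
            }
            if (q != null && q > bestQ) {
                best = type;
                bestQ = q;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> offered = Arrays.asList("application/rdf+xml", "text/html");
        // Our application asks for RDF/XML only.
        System.out.println(choose("application/rdf+xml", offered));
        // A browser prefers HTML; RDF/XML is only covered by */* with q=0.8.
        System.out.println(choose(
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", offered));
    }
}
```

With the two headers quoted above, the sketch picks "application/rdf+xml" for our application and "text/html" for the browser, which corresponds to the two different redirect targets.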

Now that we have an InputStream from which rdf/xml can be read we use org.apache.clerezza.rdf.core.serializedform.Parser to convert it to a graph:

//get the singleton instance of Parser
final Parser parser = Parser.getInstance();
Graph deserializedGraph = parser.parse(inputStream, "application/rdf+xml");

Using the addAll method which MGraph inherits from Collection<Triple> we can add the triples of the retrieved Graph to mGraph:

mGraph.addAll(deserializedGraph);

Outputting the size of the graph now returns something else (the number of triples will vary as dbpedia evolves):

Size of mGraph: 251

4. Accessing the Triples

It's good to know that loading data into our MGraph has increased its size, but what we actually want is to get data out of mGraph. The easiest way is to use the Serializer to write the graph to standard output:

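import org.apache.clerezza.rdf.core.serializedform.Serializer;
...
//get the singleton instance of Serializer
final Serializer serializer = Serializer.getInstance();
serializer.serialize(System.out, mGraph, "text/turtle");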

The above code serializes mGraph in the Turtle format to the standard output. You may want to try "text/rdf+nt" and "application/rdf+xml" to see the triples serialized in different ways.
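
To give an impression of the simplest of these formats, the following hand-written sketch prints one triple the way N-Triples lays it out: one statement per line, each IRI in angle brackets, terminated by a dot. NTriplesSketch is a hypothetical helper, not the SCB Serializer; it handles IRI terms only (no literals or blank nodes), and the example.org URIs are made up for illustration.

```java
class NTriplesSketch {
    // Formats one triple whose three terms are all IRIs as an N-Triples line.
    static String line(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    public static void main(String[] args) {
        // Made-up subject and object; the predicate is the real IRI of rdf:type.
        System.out.println(line(
                "http://example.org/some-sitcom",
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                "http://example.org/Sitcom"));
    }
}
```

Turtle and RDF/XML express the same triples but allow abbreviations such as prefixes, which is why their output usually looks more compact.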

The typical way to get specific triples is to use the filter method which Graph and MGraph inherit from TripleCollection. The following outputs the rdf:type of the resource <http://dbpedia.org/resource/Category:BBC_television_sitcoms>:

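import java.util.Iterator;
import org.apache.clerezza.rdf.ontologies.RDF;
...
//all triples with the category as subject and rdf:type as predicate
Iterator<Triple> typeTriples = mGraph.filter(
        new UriRef("http://dbpedia.org/resource/Category:BBC_television_sitcoms"),
        RDF.type, null);
while (typeTriples.hasNext()) {
    System.out.println(typeTriples.next());
}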

Note the use of RDF.type, a constant from the org.apache.clerezza.rdf.ontologies package and maven artifact. null is used as a wildcard, here in the object position.
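
The wildcard semantics can be sketched in a few lines of plain Java. This is not the SCB implementation: WildcardFilterSketch is a hypothetical class, triples are reduced to String arrays of the form {subject, predicate, object}, and the result is returned as a list rather than an iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WildcardFilterSketch {
    // Returns all triples matching the pattern; a null position matches anything.
    static List<String[]> filter(List<String[]> triples,
                                 String subject, String predicate, String object) {
        List<String[]> result = new ArrayList<>();
        for (String[] t : triples) {
            if (matches(subject, t[0]) && matches(predicate, t[1]) && matches(object, t[2])) {
                result.add(t);
            }
        }
        return result;
    }

    // null in the pattern acts as a wildcard.
    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] {"ex:sitcoms", "rdf:type", "ex:Category"},
                new String[] {"ex:sitcoms", "ex:label", "BBC television sitcoms"},
                new String[] {"ex:other", "rdf:type", "ex:Category"});

        // Subject and predicate fixed, object is a wildcard.
        for (String[] t : filter(triples, "ex:sitcoms", "rdf:type", null)) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

Passing null in all three positions returns every triple; fixing more positions narrows the result.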

5. Resource context

Often we want a concise description of a resource: its context. In terms of RDF, the context can be formalized as the set of statements in which the resource is either subject or object. If such a statement contains a blank node, the context of that blank node is included as well [5].

The context can easily be accessed by using the GraphNode class in the org.apache.clerezza.rdf.utils package.

public Graph getCurrentContext() {
    return new GraphNode(new UriRef(selectedUri), mGraph).getNodeContext();
}

The method above returns the context of the resource whose URI is held in selectedUri.
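
The definition of context given in this section can be sketched in plain Java as follows. This is an illustration of the definition only, not the code behind GraphNode, which may differ in details. ContextSketch is a hypothetical class, triples are lists of three strings, and blank nodes are assumed to be marked with a "_:" prefix.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ContextSketch {
    // Collects all triples with 'resource' as subject or object and,
    // transitively, the contexts of any blank nodes in those triples.
    static Set<List<String>> context(Set<List<String>> graph, String resource) {
        Set<List<String>> result = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(resource);
        while (!pending.isEmpty()) {
            String node = pending.poll();
            if (!visited.add(node)) {
                continue; // this node has been expanded already
            }
            for (List<String> t : graph) {
                String s = t.get(0);
                String o = t.get(2);
                if (s.equals(node) || o.equals(node)) {
                    result.add(t);
                    if (isBlank(s)) pending.add(s);
                    if (isBlank(o)) pending.add(o);
                }
            }
        }
        return result;
    }

    // Assumption of this sketch: blank nodes are written with a "_:" prefix.
    static boolean isBlank(String term) {
        return term.startsWith("_:");
    }

    public static void main(String[] args) {
        Set<List<String>> graph = new LinkedHashSet<>(Arrays.asList(
                Arrays.asList("ex:show", "ex:cast", "_:b1"),
                Arrays.asList("_:b1", "ex:actor", "ex:alice"),
                Arrays.asList("ex:alice", "ex:bornIn", "ex:london"),
                Arrays.asList("ex:other", "ex:p", "ex:q")));
        for (List<String> t : context(graph, "ex:show")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

For ex:show the sketch returns the first two triples: the statement about the cast and, because it contains the blank node _:b1, the statement describing that blank node. The named resource ex:alice is not expanded further.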

6. Putting it all together: the example app

Putting what we learned together and adding a swing front-end:

The pom.xml should be equivalent to what you already have if you followed this tutorial. The Java code creates a Swing frame with a table containing the context of a selected resource. By default, clicking on a named resource that is the subject or object of a statement shows the context of this resource. Clicking on the button "Load Context from Web" dereferences the resource and adds the triples to the local store.

7. Taking it further

A trivially achievable improvement of the example application would be to add persistent storage.

By adding the Sesame persistent storage provider to the runtime classpath of the application, our MGraph is stored in a Sesame store [8] (this makes the dependency on org.apache.clerezza.rdf.simple.storage obsolete).

<dependency>
    <groupId>org.apache.clerezza</groupId>
    <artifactId>org.apache.clerezza.rdf.sesame.storage</artifactId>
    <version>0.13-incubating-SNAPSHOT</version>
    <!-- <scope>runtime</scope> -->
</dependency>

After adding this dependency, the second launch of the application should fail with an exception complaining that the graph already exists. The reason is that TcManager has separate methods for accessing an existing MGraph and for creating a new one. The following solves the issue:

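import org.apache.clerezza.rdf.core.access.NoSuchEntityException;
...
//mGraph can no longer be declared final, as it is assigned in either branch
MGraph mGraph;
try {
    mGraph = tcManager.getMGraph(mGraphName);
} catch (NoSuchEntityException e) {
    mGraph = tcManager.createMGraph(mGraphName);
}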
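
Why the second launch fails, and why the get-then-create pattern helps, can be reproduced with a small stand-in written in plain Java. GraphRegistrySketch is hypothetical and only mimics the behaviour described above; it uses standard Java exceptions in place of the SCB ones and a set of strings in place of an MGraph.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

class GraphRegistrySketch {
    // Named "graphs"; in this sketch a graph is just a set of strings.
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // Creates a new graph and fails if the name is already taken.
    Set<String> create(String name) {
        if (graphs.containsKey(name)) {
            throw new IllegalStateException("graph already exists: " + name);
        }
        Set<String> graph = new HashSet<>();
        graphs.put(name, graph);
        return graph;
    }

    // Returns an existing graph and fails if there is none with that name.
    Set<String> get(String name) {
        Set<String> graph = graphs.get(name);
        if (graph == null) {
            throw new NoSuchElementException("no such graph: " + name);
        }
        return graph;
    }

    // Tries to get the graph first and creates it only if it does not exist yet.
    Set<String> getOrCreate(String name) {
        try {
            return get(name);
        } catch (NoSuchElementException e) {
            return create(name);
        }
    }

    public static void main(String[] args) {
        GraphRegistrySketch registry = new GraphRegistrySketch();
        // "First launch": the graph does not exist yet and is created.
        registry.getOrCreate("http://tutorial.example.org/").add("a triple");
        // "Second launch": the existing graph, including its content, is returned.
        System.out.println(registry.getOrCreate("http://tutorial.example.org/").size());
    }
}
```

With a persistent store the registry survives between launches, so calling create unconditionally fails the second time, whereas getOrCreate returns the graph that is already there.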

If anything is unclear or you'd like to take it even further, ask about it on our mailing list http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/

8. References

[1] W3C: Resource Description Framework (RDF): Concepts and Abstract Syntax; 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/

[2] OSGi, http://www.osgi.org/Main/HomePage

[3] Maven, http://maven.apache.org/

[4] Jena Framework, http://jena.sourceforge.net/

[5] The introduced concept of "context" is close to the one of "RDF Molecules" [6] and Minimum Self-contained Graphs [7].

[6] Ding, L.; Finin, T.; Peng, Y.; Pinheiro da Silva, P.; McGuinness, D.: "Tracking RDF Graph Provenance using RDF Molecules"; Proceedings of the Fourth International Semantic Web Conference, November 2005

[7] Tummarello, G.; Morbidoni, C.; Puliti, P.; Piazza, F.: "Signing individual fragments of an RDF graph"; World Wide Web Conference 2005, Poster Track

[8] Sesame, http://openrdf.com/

Copyright (c) 2008-2009 trialox.org (trialox AG, Switzerland)