Stanbol — getting started

Radek Stankiewicz
Radoslaw Stankiewicz Desk
Jan 11, 2017


This blog entry covers using Apache Stanbol’s Enhancer and Entityhub with a custom vocabulary.

https://stanbol.apache.org/docs/trunk/components/enhancer/

What I want to achieve: I store documents as plain text and I want to search them cleverly.

Detecting phrases and multi-word synonyms is possible with Solr, but boosting terms based on an alternative synonym or on a broader or narrower definition is harder to implement. For example, I search for weapon and want to get a document about the AK47 that doesn’t mention weapon, even though the two words are not synonyms.

Meet Apache Stanbol, a tool that consumes custom vocabularies and helps me enhance text with definitions.

Apache Stanbol provides a set of reusable components for semantic content management.

https://stanbol.apache.org/docs/trunk/components/

I am interested in two of these components:

The Entityhub is the component, which lets you cache and manage local indexes of repositories such as DBPedia but also custom data (e.g. product descriptions, contact data, specialized topic thesauri).

The Enhancer component together with its Enhancement Engines provides you with the ability to post content to Apache Stanbol and get suggestions for possible entity annotation in return. The enhancements are provided via natural language processing, metadata extraction and linking named entities to public or private entity repositories.

Installation

Prerequisites: Maven, Git, JDK

Clone and build the repository:

(git clone https://github.com/apache/stanbol.git && cd stanbol && mvn install -Dmaven.test.skip=true)

Start Stanbol to test that it runs:

(cd stanbol && java -Xmx1g -jar launchers/full/target/org.apache.stanbol.launchers.full-1.0.1-SNAPSHOT.jar -p 8082)

Configuration

I follow these short instructions:

Prepare the index for knowledge mining by copying the indexing tool (line 1), then initialize the directory for the vocabulary (line 2):

(mkdir km_for_stanbol && cp stanbol/entityhub/indexing/genericrdf/target/org.apache.stanbol.entityhub.indexing.genericrdf-1.0.1-SNAPSHOT.jar km_for_stanbol)
(cd km_for_stanbol && java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*.jar init)

I edit km_for_stanbol/indexing/config/indexing.properties and change the name to ‘km’. I upload RDF files to km_for_stanbol/indexing/resources/rdfdata (if you don’t have any, here are some real-life datasets). Finally I index the RDF data.

(cd km_for_stanbol && java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-1.0.1-SNAPSHOT.jar index)

After the last step I find two new files in the km_for_stanbol/indexing/dist/ directory: the index and a JAR bundle.

There are some important notes from the authors that I need to remember when reindexing:

Already imported RDF files should be removed from the {indexing-working-dir}/indexing/resources/rdfdata to avoid to re-import them on every run of the tool. NOTE: newer versions of the Entityhub indexing tool might automatically move successfully imported RDF files to a different folder.

If the RDF data change you will need to delete the Jena TDB store so that those changes are reflected in the created index. To do this delete the {indexing-working-dir}/indexing/resources/tdb folder
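The two reindexing notes can be scripted. The following is a minimal Python sketch, not part of the Stanbol tooling: the function names and the `imported-rdf` archive folder are my own assumptions, while the `rdfdata` and `tdb` paths are the ones quoted above.

```python
import shutil
from pathlib import Path


def archive_imported_files(working_dir, archive_name="imported-rdf"):
    """Move already imported RDF files out of indexing/resources/rdfdata.

    archive_name is a hypothetical folder (created next to 'indexing')
    chosen for this sketch. Returns the names of the moved files.
    """
    rdfdata = Path(working_dir) / "indexing" / "resources" / "rdfdata"
    archive = Path(working_dir) / archive_name
    moved = []
    if rdfdata.is_dir():
        archive.mkdir(exist_ok=True)
        for rdf_file in sorted(rdfdata.iterdir()):
            if rdf_file.is_file():
                shutil.move(str(rdf_file), str(archive / rdf_file.name))
                moved.append(rdf_file.name)
    return moved


def reset_tdb_store(working_dir):
    """Delete the Jena TDB store so changed RDF data gets re-imported.

    Returns True if a store was deleted, False if none existed.
    """
    tdb = Path(working_dir) / "indexing" / "resources" / "tdb"
    if tdb.exists():
        shutil.rmtree(tdb)
        return True
    return False
```

I would call `archive_imported_files("km_for_stanbol")` after a successful run, and `reset_tdb_store("km_for_stanbol")` only when existing RDF data changed. In the second case the store is empty afterwards, so presumably the complete set of RDF files has to be back in rdfdata before the next run.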

Importing index to Stanbol EntityHub

I copy the index to Stanbol (stanbol/stanbol/datafiles/).

I install org.apache.stanbol.data.site.km-1.0.0.jar into the OSGi environment of my Stanbol instance, e.g. by using the Bundle tab of the Apache Felix web console.

In the last step Stanbol creates a SolrYard (the storage component for the EntityHub, containing a full local index of our vocabulary) and a Reference in the EntityHub that points to the newly created Yard.

My vocabulary is available under the following link.

Querying EntityHub

Under my new site link I can find many usages of the REST API; the one I’m really interested in is the ldpath service. LDPath is a powerful query language for linked data.

For example I can easily query for the narrower and broader IDs of my entity:

@prefix skos : <http://www.w3.org/2004/02/skos/core#>;
label = rdfs:label[@en] :: xsd:string ;
narrower = skos:narrower;
broader = skos:broader;

with a simple HTTP call:

curl -X POST -H "Accept: application/rdf+json" -H "Content-Type: application/x-www-form-urlencoded" \
-d 'ldpath=@prefix skos : <http://www.w3.org/2004/02/skos/core#>%3B label %3D rdfs:label[@en] :: xsd:string%3B narrower %3D skos:narrower%3B broader %3Dskos:broader%3B&context=http://localhost/my_vocabulary/14' \
"http://localhost:8082/entityhub/site/km/ldpath"
Result:

[ {
  "@id" : "http://localhost/my_vocabulary/14",
  "broader" : [ {
    "@id" : "http://localhost/my_vocabulary/5"
  } ],
  "label" : [ {
    "@value" : "Revolver",
    "@language" : "en"
  }, {
    "@value" : "Revolving gun",
    "@language" : "en"
  } ],
  "narrower" : [ {
    "@id" : "http://localhost/my_vocabulary/20"
  }, {
    "@id" : "http://localhost/my_vocabulary/30"
  } ]
} ]
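To use this from code rather than curl, here is a minimal Python sketch using only the standard library. It shows how the two form fields (ldpath and context) can be URL-encoded without hand-escaping, and how a result shaped like the one above can be flattened. The function names are my own, and `summarize` assumes the response has exactly the structure shown in this post.

```python
import json
from urllib.parse import urlencode

# The LDPath program from above, as one string.
LDPATH_PROGRAM = (
    "@prefix skos : <http://www.w3.org/2004/02/skos/core#>; "
    "label = rdfs:label[@en] :: xsd:string; "
    "narrower = skos:narrower; "
    "broader = skos:broader;"
)


def build_ldpath_form(program, context):
    """URL-encode the 'ldpath' and 'context' form fields of the request."""
    return urlencode({"ldpath": program, "context": context})


def summarize(result_json):
    """Flatten the result into {entity id: {label, narrower, broader}}."""
    summary = {}
    for node in json.loads(result_json):
        summary[node["@id"]] = {
            "label": [v["@value"] for v in node.get("label", [])],
            "narrower": [v["@id"] for v in node.get("narrower", [])],
            "broader": [v["@id"] for v in node.get("broader", [])],
        }
    return summary


# The response shown above, copied verbatim.
SAMPLE_RESULT = """
[ {
  "@id" : "http://localhost/my_vocabulary/14",
  "broader" : [ { "@id" : "http://localhost/my_vocabulary/5" } ],
  "label" : [ { "@value" : "Revolver", "@language" : "en" },
              { "@value" : "Revolving gun", "@language" : "en" } ],
  "narrower" : [ { "@id" : "http://localhost/my_vocabulary/20" },
                 { "@id" : "http://localhost/my_vocabulary/30" } ]
} ]
"""

if __name__ == "__main__":
    print(build_ldpath_form(LDPATH_PROGRAM, "http://localhost/my_vocabulary/14"))
    print(summarize(SAMPLE_RESULT))
```

The encoded string from `build_ldpath_form` can be passed as the request body in place of the hand-escaped -d argument.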

Setting up Enhancer

I find this part more difficult because I don’t know Apache Felix very well, so I’ve made some screenshots.

First I create an extraction engine by pressing the plus sign to the right of Entityhub linking:

I set the engine name and the referenced site name:

I create a List Chain by pressing the plus sign on the right side (at this moment there are no chains):

I set up the list of engines, with the newly created kmExtraction as the last step.

I can verify that the chain works by opening its address in a browser or by calling:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: application/rdf+xml" -d 'content=My handgun is hot, how can I fix it' "http://localhost:8082/enhancer/chain/km-list"
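The same call can be made from Python. This is a minimal standard-library sketch that mirrors the curl command above (same URL, headers and content form field). The function names are my own, and `enhance` naturally needs a running Stanbol instance with the km-list chain configured.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Chain URL taken from the curl call above.
CHAIN_URL = "http://localhost:8082/enhancer/chain/km-list"


def build_enhancer_request(text, chain_url=CHAIN_URL, accept="application/rdf+xml"):
    """Build (but do not send) the POST request for the enhancement chain."""
    body = urlencode({"content": text}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": accept,
    }
    return Request(chain_url, data=body, headers=headers, method="POST")


def enhance(text, **kwargs):
    """Send the text to the chain and return the raw response body."""
    with urlopen(build_enhancer_request(text, **kwargs), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    request = build_enhancer_request("My handgun is hot, how can I fix it")
    print(request.get_method(), request.full_url)
    print(request.data)
```

Keeping request construction separate from sending makes it easy to inspect what would be posted before pointing it at a live server.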

Wrapping up

The Enhancer and the ldpath query return structures in JSON-LD or RDF formats, which I can parse with an appropriate reader or manually.

When indexing a dataset I run the Enhancer to collect all the IDs each document mentions and save them in a multi-valued field.

For example, given the sentence “which weapon is most reliable, maybe revolving gun?” I can find that weapon is the weapon Concept and revolving gun is the revolver Concept.
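Collecting the IDs by hand could look like the sketch below. It assumes the Enhancer was asked for a JSON-LD response and that linked entities appear under a key ending in entity-reference (the fise:entity-reference property of Stanbol's enhancement structure). The sample document is made up for illustration, including the ID /1 for weapon, so check it against a real response from your own chain.

```python
import json


def collect_entity_ids(enhancements_json):
    """Return the sorted, de-duplicated entity ids referenced in a response.

    Walks the whole JSON tree and reads every value stored under a key
    ending in "entity-reference", so it works for compacted and expanded
    JSON-LD. The "@context" block is skipped because it only defines terms.
    """
    found = set()

    def visit(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "@context":
                    continue
                if key.endswith("entity-reference"):
                    found.update(_ids(value))
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(json.loads(enhancements_json))
    return sorted(found)


def _ids(value):
    """Normalize a JSON-LD value (string, {"@id": ...} or list) to id strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        return [value["@id"]] if "@id" in value else []
    if isinstance(value, list):
        return [i for item in value for i in _ids(item)]
    return []


# Hypothetical response: ids, URNs and layout are illustrative only.
FISE = "http://fise.iks-project.eu/ontology/"
SAMPLE = json.dumps({
    "@context": {"entity-reference": {"@id": FISE + "entity-reference", "@type": "@id"}},
    "@graph": [
        {"@id": "urn:enhancement-1", "entity-reference": "http://localhost/my_vocabulary/14"},
        {"@id": "urn:enhancement-2", FISE + "entity-reference": [{"@id": "http://localhost/my_vocabulary/1"}]},
        {"@id": "urn:enhancement-3", "entity-reference": "http://localhost/my_vocabulary/14"},
    ],
})

if __name__ == "__main__":
    print(collect_entity_ids(SAMPLE))
```

The returned list is what I would store in the multi-valued field of the search document.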

Having the IDs of those concepts, I can use them as facets. If needed, I can query for documents that cover SKOS narrower topics (like finding a document that doesn’t mention the concept weapon but covers something more specific, like the AK47).
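The narrower-topic lookup boils down to walking skos:narrower links from the searched concept. Below is a minimal sketch under a few assumptions: the narrower map has already been fetched (for example with the ldpath query from earlier), concept 5 lists 14 as narrower (the inverse of the broader link in the sample result), and concept_ids is a hypothetical name for the multi-valued field.

```python
from collections import deque


def expand_narrower(concept_id, narrower_of):
    """Return concept_id followed by every concept reachable via skos:narrower.

    narrower_of maps a concept id to the ids of its direct narrower
    concepts. Breadth-first, and safe against cycles in the vocabulary.
    """
    ordered = [concept_id]
    visited = {concept_id}
    queue = deque([concept_id])
    while queue:
        current = queue.popleft()
        for child in narrower_of.get(current, []):
            if child not in visited:
                visited.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


def concept_filter(field, concept_ids):
    """Build a Lucene-style filter matching any of the given concept ids."""
    quoted = " OR ".join('"%s"' % cid for cid in concept_ids)
    return "%s:(%s)" % (field, quoted)


BASE = "http://localhost/my_vocabulary/"
# Narrower links taken from the sample result, plus the assumed inverse 5 -> 14.
NARROWER = {
    BASE + "5": [BASE + "14"],
    BASE + "14": [BASE + "20", BASE + "30"],
}

if __name__ == "__main__":
    ids = expand_narrower(BASE + "5", NARROWER)
    print(concept_filter("concept_ids", ids))
```

Searching with the expanded list is one way to return documents tagged only with a narrower concept when the user asks for the broader one.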

Sources: https://stanbol.apache.org (images, parts of instructions)


Strategic Cloud Engineer at Google Warsaw - Helping customers solve their biggest data & analytics challenges using the best of Google.