Pragmatic topic map streaming

I started this day quite innocently at 5 am, until the words "sensor data" on linkeddata.deri.ie toggled the Robert Barta switch in my head and, after a short discussion on the #topicmaps IRC channel, a chain reaction started. Here is the relevant core dump:

I've been thinking a lot about knowledge streaming lately. C-SPARQL [PDF] seems very interesting, and I started wondering how to implement some kind of streaming for Topic Maps. The general idea of streaming topic maps and topic map changes is not new. SDShare was presented at TMRA 2008, and NetworkedPlanet had its own update feed long before that; I remember looking closer at this feed in connection with automatic updates of the GREP topic map (the Norwegian curriculum is published as a topic map, in case you haven't heard about this before).

The above-mentioned protocols supply a feed of changes to a topic map and can be used to sync these changes with other topic maps. Pragmatics ahead.

I'm going to concentrate on a subset of this problem, the problem of knowledge aggregation. This means that I'm not directly interested in changes to a topic map, only in newly added topic map constructs. Many popular services like twitter, facebook, delicious and flickr provide APIs which in turn provide feeds of newly added information. I use a twitter client, an RSS feed reader, and other applications. Additionally, I like to store snippets of interesting web pages (I currently use DevonThink for this task). I'm trying to keep up with all those feeds on a daily basis, and this works mostly fine, until, sometimes months later, I try to find that information again. Have you ever tried to find a tweet that you saw a couple of weeks ago?

The solution is obvious: Topic Maps can help me to aggregate that information, and, hopefully, make it easy to find it when I need it most.

Here is the idea: I store all information from the social services I use in a personal topic map (short: TM). To build this topic map, I run a client ("topic map stream reader") that periodically checks all topic map stream feeds. If new items appear, it fetches the relevant topic map and merges it into my TM. For each service that I'm interested in, I create a simple wrapper that provides me with an ATOM feed. Each item of that feed is, you guessed it, a topic map. The trick is that the items are topic maps, not topic map fragments. This allows me to make use of the most powerful and most frightening feature of ISO 13250-2: merging.

Note that I don't say anything about the complexity of the generated topic maps. They can contain anything from a single association or a topic stub with just one occurrence to a more complex topic map with several topics and associations.

tm-syndication.png

Let's take a tweet as an example. I can easily create a service that creates a topic map for a tweet. It could have a person or twitter account topic, maybe some associations for tweet-mentions-account and tweet-mentions-hashtag, some meta data such as the posting date, and a subject locator pointing to the tweet itself. Such a service would be easy to set up on a Google App Engine account. Then I can use the twitter API to create a feed of all tweets of the people I follow, and this feed can be converted to an ATOM feed. Maybe it would even be easy to create a Google App Engine application that generates such a personalized feed for me, but I'm not sure if I would publish this feed (but that's a different story). By iterating over the most popular services, the Topic Maps community could provide small, even dynamically generated, topic maps for all kinds of information pieces. It should, for example, be easy to convert an ATOM feed of a blog into an ATOM feed of topic maps that describe or contain the blog entries.

What is missing now is the ability to read and combine those ATOM feeds. It doesn't sound hard to write a little topic map stream reader that uses my favourite Topic Maps engine TME, reads all feeds F that I'm interested in, fetches the topic map Ti for each feed item i, and merges that topic map into my personal knowledge base topic map. Et voilà: Topic Maps streaming in action!
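The reader loop described above can be sketched in a few lines of JavaScript. This is only a rough illustration under loud assumptions: the "topic map" is a hypothetical plain object holding topics with subject identifiers and names (no associations, occurrences or scopes), the feed items are assumed to carry an already fetched and parsed topic map, and the merge only covers one TMDM rule (topics sharing a subject identifier are the same subject). It is not the API of any real Topic Maps engine, and it ignores the case where one incoming topic bridges two existing topics.

```javascript
// Hypothetical, simplified stand-in for a topic map: only topics with
// subject identifiers and names; no associations, occurrences or scopes.
function createTopicMap() {
    return { topics: [] };
}

// Append only the values that are not already present in the target array.
function addUnique(target, values) {
    values.forEach(function (v) {
        if (target.indexOf(v) === -1) {
            target.push(v);
        }
    });
}

// Merge `source` into `target`. Topics that share at least one subject
// identifier are treated as the same subject.
function mergeIn(target, source) {
    source.topics.forEach(function (incoming) {
        var existing = target.topics.filter(function (t) {
            return t.subjectIdentifiers.some(function (si) {
                return incoming.subjectIdentifiers.indexOf(si) !== -1;
            });
        })[0];
        if (existing) {
            addUnique(existing.subjectIdentifiers, incoming.subjectIdentifiers);
            addUnique(existing.names, incoming.names);
        } else {
            target.topics.push({
                subjectIdentifiers: incoming.subjectIdentifiers.slice(),
                names: incoming.names.slice()
            });
        }
    });
    return target;
}

// Each feed is an array of items; each item is assumed to carry an already
// fetched and parsed topic map in its `topicMap` property.
function readStreams(personalMap, feeds) {
    feeds.forEach(function (feed) {
        feed.forEach(function (item) {
            mergeIn(personalMap, item.topicMap);
        });
    });
    return personalMap;
}
```

A real reader would additionally poll the ATOM feeds, remember which items it has already seen, and delegate the merge to the Topic Maps engine.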

What do you think? Would that be useful? It seems to me that such an aggregated topic map would be useful for integrating the social services that I use, and for storing other personal information. The quality of the information depends a lot on how the different services are mapped to topic maps. However, it should be possible to find what you're looking for with some custom TMQL queries. Also, there is no limit to the services that can be wrapped into such topic map streaming feeds. It can be blog feeds, a feed of topic maps in Maiana, photos from flickr. You get it. A side effect of such streaming wrappers would be that many small topic maps become available on the web. I'm sure that there are many ways to link them together!

That's it for now. I hope that you could get a basic understanding of my idea. I'll try to put up an example of a topic map streaming feed in one of the next posts.

My plan was to finish this blog post long before the Topic Maps 2010 conference in Oslo, but unfortunately, I didn't make it. Inspired by the discussions about semantic mashups on the Topic Maps mailing list a while ago, I started thinking about how I could combine Topic Maps with other web services. This blog post is about what I came up with. In one of my last posts I presented how parts of the Norwegian Red List of Threatened Species can be transformed into a topic map. Now, I'll show you how this topic map can be used to annotate web sites with the subjects stored in this map.

The basic idea of my mashup is to send textual data that is part of a web site to a service on the server hosting the topic map. The server posts the text to the taxonFinder web service at ubio.org, which extracts biological names and returns an XML structure with all recognized names contained in the text. These names are then matched against the species in the Red List topic map. The result that is sent back to the client consists of all red-listed species, including their Red List status, together with their Published Subject Identifier. This is a rough overview of the mashup:

mashup.png

The plan is to annotate a text in a web page with information from the Red List. Assume that this is our original HTML contained in a web page:

<p id="content">Agonum muelleri is a 7-9.5mm long black beetle, with brassy or purplish elytra and bright green reflecting, metallic foreparts. It occurs on open, moderately dry ground, including arable land. Widely distributed. Agonum marginatum is a 8.5-10.5mm long bright metallic green beetle with conspicuous yellow sides to the elytra. Locally common in marshy places, especially bare mud at the side of ponds and lakes.</p> (original text taken from Ground Beetles of Ireland).

With over 800 beetle species on the Red List it is hard to remember the status of all beetles, and instead of looking up every species name in the printed version, I want to do this automatically in the web browser. The text mentions Agonum marginatum, a species which unfortunately is listed as endangered on the Norwegian Red List. I want to annotate species names like this:

<p id="content">Agonum muelleri is a 7-9.5mm long black beetle, with brassy or purplish elytra and bright green reflecting, metallic foreparts. It occurs on open, moderately dry ground, including arable land. Widely distributed. <span class="Redlist2006EN" rel="http://psi.entomologi.org/coleoptera/agonum_marginatum">Agonum marginatum</span> is a 8.5-10.5mm long bright metallic green beetle with conspicuous yellow sides to the elytra. Locally common in marshy places, especially bare mud at the side of ponds and lakes.</p>

Note that the uBio service recognizes Agonum muelleri and Agonum marginatum as biological names, but only Agonum marginatum is on the Red List. You can take a look at the final result.

Implementation details

In this section I'll present a few implementation details of both the client and the server. In its current state, all that is needed to enable the service in a client page is to include a little JavaScript file named coleoptera.js, which gives access to the name extraction service, and some logic to annotate the HTML code. For simplicity, I've included jQuery via the Google AJAX Libraries API. Here is the head section of the web page that we are going to annotate:

    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <title>A Semantic Mashup with Topic Maps (Demo)</title>
    <script src="http://www.google.com/jsapi?key=XXXXXXXXXXXXXX" type="text/javascript"></script>
    <script type="text/javascript" src="coleoptera.js"></script>
    <link rel="stylesheet" type="text/css" media="all" href="coleoptera.css" />
    <script type="text/javascript">
    //<![CDATA[
    google.load("jquery", "1.4.2");
    google.setOnLoadCallback(function () {
        var id = 'content';
        Coleoptera.extract_subjects($('#'+id).get(0),
            function (subjects) {
                var i, sp;
                if (subjects && subjects.status === 'success') {
                    for (i=0; i<subjects.result.length; i+=1) {
                        sp = subjects.result[i];
                        $("#"+id).html($("#"+id).html().replace(new RegExp(sp.name),
                            '<span class="species redlist-'+sp.redlist+'" title="'+sp.psi+'">'+sp.name+'</span>'));
                    }
                }
            });
    });
    //]]>        
    </script>

As you can see, all the client has to do to extract red-listed species names is to call Coleoptera.extract_subjects(node, callback) with a DOM node containing the unannotated text and a callback function which is later called with the extracted subjects as a parameter. The result of the web service is basically a JSON-encoded array of species names together with their PSI and their Red List status. In a real-world scenario it would be interesting to implement this service as a browser plug-in, which would enable you to annotate any web site. But, for simplicity, I've chosen to integrate the code directly into the page. Now that you've seen what the page on the client looks like, let's take a look at coleoptera.js:

The first step is to extract the text from a DOM node. The node itself might contain more HTML elements, but we are only interested in the text:

extract_text = function (node) {
    var i, text, child_txt, skipnodes = ['SCRIPT', 'OBJECT', 'EMBED', 'STYLE']; // more...?
    // Check if the node should be skipped
    if (node.nodeType === 1) {
        for (i=0; i<skipnodes.length; i+=1) {
           if (node.nodeName === skipnodes[i]) {
               return '';
           }
        }
    }
    if (node.nodeType === 3 && node.nodeValue.match(/\S/)) {
        return node.nodeValue;
    } else {
        text = '';
        for(i=0;i<node.childNodes.length;i+=1) {
            child_txt = extract_text(node.childNodes[i]);
            child_txt = trim(child_txt);
            if (child_txt !== '') {
                text += ' '+child_txt;
            }
        }
        return text;
    }
};

The trim() function is trivial and not shown here. extract_text() starts with a DOM node, walks through all element nodes that are not script, object, embed or style, and collects the contained text nodes. Now that the script has extracted the text, it needs to be sent to the server, and this is where the real work happens. The function that sends the text to the server is quite simple:

    extract_subjects = function (node, callback) {
        var text = extract_text(node);
        $.ajax({
            url: 'http://www.entomologi.org/coleoptera/mashup/text2subj.php',
            type: 'POST',
            data: {q: text},
            dataType: 'json',
            timeout: 10000,
            success: function(json){
                json.status = 'success';
                callback(json);
            }
        });
    };

Now to the server-side part of the application. As mentioned earlier, the mashup uses the uBio taxonFinder service. A call to the service is a simple HTTP GET request to this URL, with FOOBAR being the URL-encoded text to be analyzed:

http://www.ubio.org/webservices/service.php?function=taxonFinder&includeLinks=0&freeText=FOOBAR
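To illustrate how such a request URL is put together, here is a minimal sketch in JavaScript. The base URL and the parameter names are copied from the request above; the function name taxonFinderUrl is made up for this example, and the actual server-side script of the mashup is written in PHP, not JavaScript.

```javascript
// Base URL and query parameters as shown in the request above.
var TAXONFINDER_URL = 'http://www.ubio.org/webservices/service.php';

// Build the GET URL for a piece of free text (hypothetical helper).
// encodeURIComponent takes care of spaces and other reserved characters.
function taxonFinderUrl(text) {
    return TAXONFINDER_URL +
        '?function=taxonFinder&includeLinks=0&freeText=' +
        encodeURIComponent(text);
}
```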

Putting the text into the URL has the drawback that the length of the URL which the uBio web server can handle is limited. POST might have been a better choice for uBio, but there is also the possibility of sending a URL of the document to extract names from. The reason why I didn't supply a URL was that I wanted to be able to analyse only parts of a web page. In a real-world scenario the script would have to split the text into chunks and analyze each of them. uBio keeps a database of over 11,000,000 scientific names from various sources, so chances are high that names will be recognized. It should be easy to dump all names from the Red List into a text file and check how many of the names are missed by the taxonFinder service.
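The chunking mentioned above is not part of the prototype. A possible sketch, assuming that breaking the text after sentence-ending punctuation is good enough to avoid cutting a name in half, could look like this. The function name and the maxLength parameter are made up for this example, and the actual length limit of the uBio service is not known here.

```javascript
// Split `text` into chunks of at most `maxLength` characters, breaking only
// after '.', '!' or '?' followed by whitespace. A single sentence that is
// longer than `maxLength` is kept whole rather than cut in the middle.
function splitIntoChunks(text, maxLength) {
    // Each match is one sentence including its trailing whitespace; the
    // second alternative catches a final piece without closing punctuation.
    var sentences = text.match(/[\s\S]*?[.!?](?:\s+|$)|[\s\S]+$/g) || [];
    var chunks = [];
    var current = '';
    sentences.forEach(function (sentence) {
        if (current !== '' && current.length + sentence.length > maxLength) {
            chunks.push(current);
            current = '';
        }
        current += sentence;
    });
    if (current !== '') {
        chunks.push(current);
    }
    return chunks;
}
```

Each chunk would then be sent to taxonFinder separately and the recognized names collected into one list. Abbreviated names such as "A. marginatum" can still be split by this approach, so it is only a starting point.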

taxonFinder returns an XML string with all recognized names. The script parses the XML with PHP's SimpleXML API. The names are then transformed into PSIs, which is possible because I used a simple pattern to convert scientific names into PSIs. Basically, the name is converted to lower case and all characters that are not in the range a-z are converted to _, with runs collapsed so that no double underscores occur. The resulting string is prefixed with http://psi.entomologi.org/species/coleoptera/. The PSIs are then looked up in my Red List topic map. If the topic exists, the occurrence of type Red List status is fetched, and the resulting list is converted into JSON and sent back to the client. Note that the PSI trick works fine in this case because the scientific names are unique within the order Coleoptera, which is at least true for the Scandinavian fauna.
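The name-to-PSI pattern described above can be sketched as follows. The prefix is the one given in the text; the function name nameToPsi is made up for this example, and the real conversion runs in the PHP script on the server. How leading or trailing non-letter characters should be handled is not specified, so this sketch leaves them as underscores.

```javascript
var PSI_PREFIX = 'http://psi.entomologi.org/species/coleoptera/';

// Convert a scientific name into a PSI (hypothetical helper).
function nameToPsi(scientificName) {
    var slug = scientificName
        .toLowerCase()
        // every run of characters outside a-z becomes a single underscore
        .replace(/[^a-z]+/g, '_');
    return PSI_PREFIX + slug;
}
```

For example, nameToPsi('Agonum marginatum') yields http://psi.entomologi.org/species/coleoptera/agonum_marginatum.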

Limitations

The presented prototype has some limitations. Some can be solved easily, some not. First, the uBio web service has a relatively short limit for the text to analyse. In a real-world scenario we would have to split the text into smaller chunks which are processed one by one. A second shortcoming is that uBio has not registered all known Scandinavian species, so some names will not be recognized at all. Another problem is synonyms. Since the taxonFinder service will not resolve synonyms for all species, I would have to add PSIs that encode common synonyms to my Red List topic map. Alternatively, I could try to find topics by their names instead of their PSIs, but this again would require me to add all, or as many as possible, synonyms to my topic map. However, synonym databases exist on the web, so it is possible to write a script that fetches known synonyms and adds them to my map.

Outlook

As you have seen, this prototype has many limitations. I wish I could have used a standardized protocol to extract subjects from a text. Maybe a variant of Lars Marius' get-topic-page that takes a text as a parameter and returns a topic map fragment would be a solution. Another shortcoming is that the page itself has to include coleoptera.js. It would be interesting to create a browser plug-in that allows me to mark a text within a web page and extract the contained subjects. taxonFinder is also able to extract species names from other document formats, e.g. PDFs, so a browser plug-in could handle those document types as well.

I hope that I could show you how easy it is to combine Topic Maps with other web services and apply the result to a web page. Try it if you haven't done so already. I'm interested in all kinds of feedback as well, so feel free to leave a comment!

tmjs 0.2.2 released

Today, I've released tmjs 0.2.2. tmjs is a Topic Maps engine written in pure JavaScript. It aims to be compliant with the Topic Maps API TMAPI 2.0 and implements the Topic Maps - Data Model TMDM. The changes for this release include:

  • Implemented ScopedIndex and TypeInstanceIndex
  • Removed 'MemImpl' from all class names but TopicMapSystemMemImpl (as suggested by Robert Cerny)
  • Implemented check for topic in use in Topic.remove()
  • More unit tests (now 162 tests)
  • Minor fixes for both code and documentation

The next step will be the implementation of the still missing Topic.mergeIn() and TopicMap.mergeIn().

The full distribution, including the scripts needed to build the source code, is available at http://github.com/jansc/tmjs. The uncompressed JavaScript source including the comments is also available for download at github.com. Finally, I've uploaded the current API documentation, but keep in mind that the docs are still very incomplete.

Taming the Norwegian Red List with Topic Maps

This is the first part in a series of blog posts about my Coleoptera topic map project. In this post I show you how easy it is to convert the Norwegian Red List of Threatened Species into a topic map. The Norwegian Red List is essentially a forecast of the risk of species becoming extinct in Norway. As you might know, 2010 is the International Year of Biodiversity, so I thought it would be a nice pet project to make the data from the Red List available to Topic Maps based applications. Beetles, by the way, are the group of insects with the largest number of species: more than 350,000 worldwide (according to Wikipedia). In Norway, a little more than 3500 species are known, and 801 beetle species are on the Norwegian Red List 2006, which, unfortunately, makes them the largest group of species described in the Red List.

A Red List classifies threatened species into a small set of categories:

Category  Name                   Description
EX        Extinct                No individuals remaining
RE        Regionally extinct     Very little doubt that it is extinct in the region concerned (here: Norway)
CR        Critically Endangered  Extremely high risk of extinction in the wild.
EN        Endangered             High risk of extinction in the wild.
VU        Vulnerable             High risk of endangerment in the wild.
NT        Near Threatened        Likely to become endangered in the near future.
LC        Least Concern          Lowest risk. Does not qualify for a more at risk category. Widespread and abundant taxa are included in this category.
DD        Data Deficient         Not enough data to make an assessment of its risk of extinction.
NE        Not Evaluated          Has not yet been evaluated against the criteria.
(source: artsdatabanken.no)
240px-Status_iucn3.1.png
(Image licensed under the Creative Commons Attribution 2.5 Generic license. Graphic credit: Peter Halasz.)

The idea is to model species and Red List categories as topics, and then to associate the species with their corresponding Red List category. According to the plan, a new Red List with updated information is going to be published every fourth year. To allow the information from several Red Lists (or even Red Lists from other countries) to exist in parallel in our topic map, we create a topic representing the Red List itself. The association between the species and the Red List category is then scoped with the topic representing the Red List. In Compact Topic Maps Notation (CTM), the Red List and the criteria can be modelled like this:

%encoding "utf-8"
%version 1.0
%prefix tmcl <http://psi.topicmaps.org/tmcl/>
%prefix lang <http://psi.oasis-open.org/geolang/iso639/#>
%prefix redlist <http://psi.entomologi.org/redlist/>
%prefix redlist-category <http://psi.entomologi.org/redlist/category#>
%prefix ent <http://psi.entomologi.org/>

shortname isa tmcl:name-type;
    = http://psi.entomologi.org/topic-name/shortname .

ent:artsdatabank_id isa tmcl:occurrence-type;
    - "Artsdatabank ID" .

redlist:criterion isa tmcl:topic-type;
    - "Criterion";
    - "Kriterium" @ lang:nno .

redlist:category isa tmcl:topic-type;
    - "Red List category";
    - "Rødlistekategori" @ lang:nno .

redlist-category:EX isa tmcl:topic-type;
    ako redlist:category;
    - shortname: "EX";
    - "Extinct";
    - "Utdødd" @ lang:nno .
    
# [...] corresponding topics representing the other Red List categories

# Topic type for Red Lists
ent:redlist isa tmcl:topic-type;
 - "Red List" .

redlist:redlist2006 isa ent:redlist;
    - "2006 Norwegian Red List";
    - "Norsk Rødliste 2006" @ lang:nno .

redlist:is-redlisted-as isa tmcl:association-type;
    - "Is redlisted as";
    - "Er rødlisted som" @ lang:nno .

redlist:is-possibly-redlisted-as isa tmcl:association-type;
    - "Is redlisted as (uncertain)";
    - "Er rødlisted som (usikkert)" @ lang:nno .

Artsdatabanken.no provides a search interface to the 2006 Red List. On the "Alle arter" (all species) tab, it is possible to query the database for all species listed under specified categories. The result can be exported as a comma separated value file. The file looks like this:

"ArtsID","Artsgruppe","Vitenskapelig artsnavn","Norsk artsnavn","Underart Kode","Kategori","Kriterier","URL"
1582,Biller,Pseudomicrodota paganetti,,A,DD,,http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=1582
19,Biller,Haliplus fulvicollis,,A,DD,,http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=19
1929,Biller,Denticollis borealis,,A,VU,B2ab(iii),http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=1929
3210,Biller,Cionus alauda,,A,NT,,http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=3210
...

OK, what have we got here? All species listed have an ID ("ArtsID"). They belong to an order ("Artsgruppe", in this case "Biller", which means beetles). There is a scientific name ("Vitenskapelig artsnavn") and sometimes a Norwegian name ("Norsk artsnavn"). "Underart Kode" is a subspecies code, which we can ignore in this setting. Then we have the Red List category ("Kategori"), which is one of the categories mentioned above. "Kriterier" is the criterion that was used to classify the species as redlisted. This value is a code that we won't decipher any further. It might look like this example: "B1ab(iii)+2ab(iii)". These criteria describe the parameters which, on the basis of population models, are known to be important for the risk of extinction.

The file from artsdatabanken.no seems to be encoded in UTF-16LE, so we need some iconv-magic before the file can be converted into a UTF-8 encoded topic map:

iconv -f UTF-16LE -t UTF-8 artsdatabanken.csv > redlist.csv

Unfortunately, something else is wrong with the CSV export. The problem lies in the criterion field. If you search for "i,i" with your text editor, you will see that "," within field values is not quoted, even though "," is also used as the field separator. I wrote a mail about this issue to artsdatabanken.no several months ago, but never got an answer. Until this is fixed, manual cleanup is needed, so

2674,Biller,Corticeus suturalis,,A,EN,B2ab(ii,iii)c(ii),http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=2674

should become

2674,Biller,Corticeus suturalis,,A,EN,"B2ab(ii,iii)c(ii)",http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=2674

(with " added).
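The cleanup could also be automated. The following JavaScript sketch is not part of the toolchain used in this post; it assumes that the export always has the eight columns shown in the header line, that only the seventh column ("Kriterier") can contain unquoted commas, and that the function name repairLine is free to choose.

```javascript
// Repair one data line of the export (hypothetical helper).
// Everything between the first six fields and the last field (the URL) is
// taken to be the criterion and wrapped in double quotes.
function repairLine(line) {
    var parts = line.split(',');
    if (parts.length <= 8) {
        return line; // eight fields or fewer: nothing to repair
    }
    var head = parts.slice(0, 6);
    var url = parts[parts.length - 1];
    var criterion = parts.slice(6, parts.length - 1).join(',');
    return head.concat('"' + criterion + '"', url).join(',');
}
```

Applied to every line of redlist.csv, this leaves well-formed lines untouched and quotes the criterion in lines like the Corticeus suturalis example.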

To convert the file to CTM, download redlist2ctm from my github.com account. It is a little Perl script that I wrote some time ago. Run the script with:

./redlist2ctm redlist.csv > redlist.ctm

And voilà, there we have our topic map containing a part of the Norwegian Red List. Every species gets its own Published Subject Identifier (PSI), based on its scientific name, e.g. http://psi.entomologi.org/species/coleoptera/agathidium_badium. The Red List category is associated with the species through an is-redlisted-as association. If a criterion for the categorization is available, the criterion is represented as a separate topic that plays a third role in the mentioned association. The URL is added as a subject locator to the species, since the referred page can be seen as a representation of the species containing all available information about the species registered at artsdatabanken.no. The Norwegian names, if available, are added as a name in the scope lang:nno. That's about it. To make the topic map really useful, you'll probably need a taxonomy (a tree consisting of orders, families, genera, etc.). Also, the scientific names could be modelled in a better way. The Red List export does not include the author name. But we will be able to fix this later, as long as all species have their unique PSI. Here are two species represented in CTM:

haliplus_fulvicollis isa ent:species;
    http://psi.entomologi.org/species/coleoptera/haliplus_fulvicollis;
    = http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=19;
    ent:artsdatabank_id : 19;
    - full_species_name: "Haliplus fulvicollis" .
redlist:is-redlisted-as(ent:species : haliplus_fulvicollis,
    redlist:category : redlist-category:DD) @ redlist:redlist2006

cassida_nebulosa isa ent:species;
    http://psi.entomologi.org/species/coleoptera/cassida_nebulosa;
    = http://www2.artsdatabanken.no/rodlistesok/Artsinformasjon.aspx?artsID=2871;
    ent:artsdatabank_id : 2871;
    - "Prikket skjoldbille" @ lang:nno;
    - full_species_name: "Cassida nebulosa" .
redlist:is-redlisted-as(ent:species : cassida_nebulosa,
    redlist:category : redlist-category:EN,
    redlist:criterion : cassida_nebulosa_crit) @ redlist:redlist2006
cassida_nebulosa_crit isa redlist:criterion;
     = "B1ab(i,ii,iii)+2ab(i,ii,iii)" .

Related work

If you are familiar with the Ontopia Topic Maps engine, there is a module to convert CSV files into a topic map: db2tm. Using this module you can map a CSV file to a topic map that can be exported in any Topic Maps format supported by Ontopia.

Outlook and lessons learned

The next blog post is where the fun really starts: I'll show you a simple semantic mashup (great buzzword in these times) that uses this topic map and an external web service to annotate web pages with Red List information and PSIs. Until then, this is what we've learned so far:

  • Converting data from other applications into a topic map can be really easy.
  • No Topic Maps engine needed to create a topic map!
  • There is more than one way to do it!
  • Using Perl as scripting language keeps the Perl language from dying out, and this is a good thing(tm).

For 2010 a new Norwegian Red List is planned. It remains to be seen if the data for 2006 will still be available in CSV format. So hurry, and create your own Red List topic map now! With Topic Maps your data will probably last a little longer :-)

Edit: fixed some typos
