Sunday, March 18, 2007

OKF - Open Knowledge 1.0 First Musings

So, yesterday was the first OKF meeting at Limehouse Town Hall in East London. I'm not going to try to report back on each presentation blow by blow; others will do that far more accurately than I ever could. I did make notes about my general feel for the day, and I've got some specific thoughts on the level of cohesion between the diversity of projects presented. I guess my biggest concern is that there's lots of good intention, and lots of willingness to put effort in, but from what I could tell, nobody had really started to grasp the nettle of interoperability amongst hugely heterogeneous datasets. There's much more to OKF than that, but because it's what I've spent most of my working life trying to deal with, it's inevitable that I see that problem everywhere I look, and that it's the problem area where I'm most likely to be able to have a positive impact. But my thoughts on the general feel will have to wait until I can decode my chicken-scratch handwriting...

What I wanted to get down, rather than going over the conference again, was how yesterday has changed my thinking. My interest in OKF started because of strong links with work on projects such as Seamless UK (community information for the people of Essex), the Peoples Network Discovery Service (a clearinghouse for cultural heritage resources funded under various digitisation programmes), IT For Me (public and local information in South Yorkshire), and plenty more. These projects are all basically aggregators: they take diverse sets of data from providers who don't always have a public interface, munge the data into a canonical format, and then push it out again via OAI, SRW/SRU, and a web interface. Along the way, full-text and spatial indexes are added to make the works searchable in lots of interesting ways. There are many similarities or links with almost all the OKF projects. Where there isn't a similarity in terms of sharing collections, there's a potential data-provider link; for example, the planning alerts service would make a great feed for IT For Me and Seamless.

So, how did yesterday affect my thinking? Well, for Peoples Network we've been working on a new repository format. The current PN and IT For Me systems use a relational database as the repository. We have several "filters" on the front end of the system that allow it to ingest many different metadata schemas. We then cram everything into a single relational structure, doing a mapping job as we go. This kills two (maybe even three) birds with one stone: storage and access are dealt with in one blow; once in the repository, we are set to search; and once in the repository, the records can be output in any form we can transform the canonical schema into.
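To make the filter-and-map step concrete, here's a rough sketch of the shape of the thing in Java. None of this is the actual Peoples Network code; all the names (`MetadataFilter`, `CanonicalRecord`, `NewsFeedFilter`) are hypothetical, just illustrating one filter per source schema mapping onto a single DC-like record:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real Peoples Network code: each source
// schema gets its own filter, and every filter maps its fields onto the
// same flat, Dublin-Core-like canonical record.
public class FilterSketch {

    // The canonical record: a bag of DC-ish fields.
    static class CanonicalRecord {
        final Map<String, String> fields = new HashMap<>();
    }

    // One filter per incoming metadata schema.
    interface MetadataFilter {
        CanonicalRecord ingest(Map<String, String> sourceRecord);
    }

    // Example: a source whose schema calls the title "headline".
    static class NewsFeedFilter implements MetadataFilter {
        public CanonicalRecord ingest(Map<String, String> src) {
            CanonicalRecord rec = new CanonicalRecord();
            rec.fields.put("dc:title", src.get("headline"));
            rec.fields.put("dc:description", src.get("summary"));
            return rec;
        }
    }

    public static void main(String[] args) {
        Map<String, String> src = new HashMap<>();
        src.put("headline", "Planning application at Limehouse");
        src.put("summary", "Change of use, town hall annex");
        CanonicalRecord rec = new NewsFeedFilter().ingest(src);
        System.out.println(rec.fields.get("dc:title"));
    }
}
```

The worry expressed below about semantic density shows up right here: whatever a source field meant in its own schema, it only survives if the canonical record has somewhere to put it.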

I've been working on some filters to get the public domain works database into MySQL so we can perhaps include the content in Peoples Network. But I'm worried about the diversity of data the OKF might generate, and how much semantic density we might lose by cramming everything into one DC-like schema.

What about RDF? This question has plagued me for a long time. Judging by some of the comments at the government information presentation, it worries some of the conference attendees too ;). I've been a long-time tinkerer with Jena, since its *very* early days. I never felt, however, that it was ready to have seven million objects poured into it and still perform at anything like the level our database could. It also lacks the full-text and spatial operators we get for free in MySQL. What it does have on its side is hugely expressive power.
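A toy in plain Java (deliberately not Jena's API) makes the full-text worry concrete: without an index, a text match over a triple store means scanning every statement, which is exactly the free lunch a MySQL full-text index takes off your plate:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only (not Jena): why a bare triple store struggles
// with free-text search compared to an indexed database.
public class TripleScanSketch {

    static class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Unindexed: an O(n) scan over every triple, per query.
    static List<Triple> scanFor(List<Triple> store, String word) {
        List<Triple> hits = new ArrayList<>();
        for (Triple t : store) {
            if (t.object.contains(word)) hits.add(t);
        }
        return hits;
    }

    // Indexed: a crude inverted index (word -> triples), built once,
    // then each query is a single map lookup.
    static Map<String, List<Triple>> buildIndex(List<Triple> store) {
        Map<String, List<Triple>> index = new HashMap<>();
        for (Triple t : store) {
            for (String word : t.object.split("\\s+")) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(t);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Triple> store = Arrays.asList(
            new Triple("doc1", "dc:title", "planning alerts for Essex"),
            new Triple("doc2", "dc:title", "cultural heritage resources"));
        System.out.println(scanFor(store, "planning").size());
        System.out.println(buildIndex(store).get("planning").size());
    }
}
```

At two triples the difference is invisible; at seven million, the scan per query is exactly the performance wall described above.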

So, after OKF 1.0, what's changed? I think maybe it's really time to separate the repository and aggregation functions from the searchable-index ones. It might be that we can make Jena perform by adding full-text and spatial indexing capabilities to the database backend, but I don't think that can be a major goal. If whatever I produce for the next Peoples Network data model, or a similar project, is to be useful to the OKF community, it needs to separate out the concerns of storage and use. To do this, some architecture is needed. We need to be able to store and distribute the "knowledge" richness without loss of semantics; selecting resources for inclusion in other indexing services is something else entirely. I'm going to refactor some of the composer import code I wrote to create a RepositorySubmission interface. This thing needs to let me make some choices later on about the MySQL / Jena backends, and we need to do a spike with several million records to see if we need to trade off semantic density for performance (and if we do, we need to split the repository into two separate functions so this ceases to be an issue). For now, my interest is in generating a Jena backend to the repository submission interface.
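A first cut of that interface might look something like this. To be clear, this is my sketch of where the refactoring could land, not existing code; the method names and the in-memory stand-in backend are all assumptions. The point is simply that callers submit records at full fidelity, and the MySQL-versus-Jena decision lives behind the interface:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RepositorySubmission interface: storage
// backends (MySQL, Jena, ...) plug in behind it, so the submission
// code never needs to know which one is in use.
public class RepositorySketch {

    interface RepositorySubmission {
        // Accept a record in its native schema, without flattening it
        // into a canonical form at submission time.
        void submit(String recordId, String schema, String payload);
        int size();
    }

    // Stand-in backend for a spike; a real implementation would write
    // to MySQL tables or a Jena model instead of a list.
    static class InMemorySubmission implements RepositorySubmission {
        private final List<String> records = new ArrayList<>();
        public void submit(String recordId, String schema, String payload) {
            records.add(recordId + "|" + schema + "|" + payload);
        }
        public int size() {
            return records.size();
        }
    }

    public static void main(String[] args) {
        RepositorySubmission repo = new InMemorySubmission();
        repo.submit("pd-0001", "dc",
                "<dc:title>Some public domain work</dc:title>");
        System.out.println(repo.size());
    }
}
```

With this split, the several-million-record spike becomes a matter of swapping `InMemorySubmission` for a MySQL-backed and a Jena-backed implementation and measuring each, without touching the submission code.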
