Thursday, March 22, 2007

oai-ore and open knowledge and other thoughts

OK, I've been reviewing the OAI-ORE work (http://www.openarchives.org/ore/), mostly with one eye on what's going on in OKF, but also relating to other work k-int is involved in. Bloody hell people.. if there isn't an opportunity for some "Disruptive" technology here I don't know where there is.... Storing and replicating actual digital objects as well as their metadata seems like an ideal... the trouble is, many of today's "Digital Objects" are not discrete. They are composed of evolving streams of data such as map data, scientific data, or in the case of PN/IT For Me just a stream of records such as local council alerts. It's almost like we need a recursive OAI structure where a metadata record sits atop each recursive entry, with the ability to freshen the objects below it....
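To make the recursive idea a bit more concrete, here's a rough sketch in Python. Nothing here comes from the ORE work itself: the Node class, the fetch hook and the sample titles are all made up for illustration. Each node carries a metadata record, and a node that fronts a stream knows how to re-fetch whatever sits below it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A metadata record sitting on top of zero or more child nodes."""
    metadata: dict
    children: List["Node"] = field(default_factory=list)
    # Hypothetical hook: returns the current children of a stream-like object.
    fetch: Optional[Callable[[], List["Node"]]] = None

    def freshen(self) -> int:
        """Re-fetch children where a hook exists, then recurse downwards.

        Returns the number of nodes whose children were re-fetched.
        """
        refreshed = 0
        if self.fetch is not None:
            self.children = self.fetch()
            refreshed += 1
        for child in self.children:
            refreshed += child.freshen()
        return refreshed

    def count(self) -> int:
        """Count this node plus everything below it."""
        return 1 + sum(child.count() for child in self.children)


# Placeholder data: a stream of alerts under a single aggregate record.
alerts = Node(
    {"title": "Local council alerts"},
    fetch=lambda: [Node({"title": "Alert 1"}), Node({"title": "Alert 2"})],
)
root = Node({"title": "Aggregate"}, children=[alerts])
```

Calling root.freshen() walks the tree and refills the alerts node from its hook, so the metadata record at the top stays put while the stream underneath it changes.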

Sunday, March 18, 2007

Repository Use Cases, first thoughts

I seem to be suffering a case of bloggorrhoea...

Here are my initial thoughts about a set of use cases for an Open Source Repository that might suit the needs of a subsequent version of IT For Me, Peoples Network, Seamless, or any of the Open Knowledge Foundation projects. It's just one of a thousand flowers that might blossom in the OKF community, and I've got specific questions of my own to answer outside the domain of OKF, but if there's the potential for reuse, why not exploit it?

These initial use cases are going to be used to focus attention for Spike 1, in which I'm going to try and figure out if Jena is capable of performing as a back end for a several-million-item database. For the spike we'll use Jena on top of a MySQL database. I've got a straight 8 million item database in MySQL, so I can do some meaningful comparisons. The use cases should stand as useful outside the spike, but the spike is what they are for. Here they are:

Terms:
PDWDA - Public Domain Work Detection Agent - A software module that uses a number of rules to identify public domain works and notify the repository of that data.

UC#1 - New Known Schema Resource (Sound Recording), with New Related Data (New Composer/Artist), via internal API (No web services)
In this UC a "PD Works Detection Agent" submits a new work by a previously unidentified artist/composer. We will be using a SoundRecording DTO object, and the interface will need to convert the SoundRecordingDTO to RDF and then insert that RDF, creating new SoundRecording and Composer/Artist data.

UC#2 - New Known Schema Resource (Sound Recording), with Known Related Data (Existing Composer/Artist)
As UC#1 but with reference to an existing Artist/Composer. We will deal with deduplication of errant composer/artist data in a later version. For now, identify the person via composer -> person -> normalised name, date of birth and date of death.
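Here's a minimal sketch of how UC#1 and UC#2 might hang together, in Python rather than against Jena so it runs on its own. The field names, the example.org namespace, the ex: predicates and the person_key helper are all my assumptions for illustration, and plain tuples stand in for real RDF statements.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
BASE = "http://example.org/pdw/"  # hypothetical namespace, not a real service


@dataclass
class PersonDTO:
    name: str
    dob: Optional[str] = None  # date of birth
    dod: Optional[str] = None  # date of death


@dataclass
class SoundRecordingDTO:
    title: str
    composer: PersonDTO


def person_key(person: PersonDTO) -> str:
    """Identity key built from normalised name, dob and dod (UC#2)."""
    name = " ".join(person.name.lower().split())
    return "|".join([name, person.dob or "", person.dod or ""])


def to_triples(rec_id: str, rec: SoundRecordingDTO,
               known_people: Dict[str, str]) -> List[Triple]:
    """Turn a DTO into triples, reusing a person URI when the key matches."""
    triples: List[Triple] = []
    key = person_key(rec.composer)
    person_uri = known_people.get(key)
    if person_uri is None:
        # UC#1: previously unidentified composer/artist, so mint a new one.
        person_uri = f"{BASE}person/{len(known_people) + 1}"
        known_people[key] = person_uri
        triples.append((person_uri, "rdf:type", "ex:Person"))
        triples.append((person_uri, "ex:name", rec.composer.name))
    rec_uri = f"{BASE}recording/{rec_id}"
    triples.append((rec_uri, "rdf:type", "ex:SoundRecording"))
    triples.append((rec_uri, "ex:title", rec.title))
    triples.append((rec_uri, "ex:composer", person_uri))
    return triples


# Placeholder data only: the same person submitted twice with messy spacing.
known: Dict[str, str] = {}
first = to_triples("1", SoundRecordingDTO(
    "Example Rag", PersonDTO("A. Composer", "1870-01-01", "1920-01-01")), known)
second = to_triples("2", SoundRecordingDTO(
    "Another Rag", PersonDTO("  a.  composer ", "1870-01-01", "1920-01-01")), known)
```

The first submission creates both the recording and the person; the second only creates the recording and points at the person already held. Anything the key can't catch (misspellings, pseudonyms) is exactly the deduplication work being put off to a later version.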

UC#3 - Arbitrary XML Schema Submission
We'd like to be able to ingest arbitrary XML without the need for code. This is going to require some kind of codified "Profile"... more to be filled out. Ideally, the system would hold the input document in a queue if the schema was unidentified, until the administrator could create the profile. A central repository of profiles would be cool, so people could reference or download storage, indexing and dissemination rules. (I think we want more than just Lucene style indexing tho... more structure is needed. Some Jena/Lucene crossover might be very interesting, and of broader scope than any of these projects individually.)
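The hold-it-in-a-queue behaviour could look something like the sketch below. It's only a guess at the shape: the Ingester class, the idea of keying profiles on the root element name, and a profile being a plain map of field name to element path are all assumptions of mine, not a design.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Deque, Dict, List


class Ingester:
    """Stores documents with a known profile; parks the rest for an admin."""

    def __init__(self) -> None:
        # Root element name -> {canonical field name: element path}
        self.profiles: Dict[str, Dict[str, str]] = {}
        self.pending: Deque[str] = deque()
        self.stored: List[Dict[str, str]] = []

    def submit(self, xml_text: str) -> bool:
        """Return True if stored, False if queued for want of a profile."""
        root = ET.fromstring(xml_text)
        profile = self.profiles.get(root.tag)
        if profile is None:
            self.pending.append(xml_text)
            return False
        self.stored.append({name: root.findtext(path, default="")
                            for name, path in profile.items()})
        return True

    def add_profile(self, schema: str, profile: Dict[str, str]) -> None:
        """Register a profile, then retry everything that was waiting."""
        self.profiles[schema] = profile
        waiting, self.pending = self.pending, deque()
        for xml_text in waiting:
            self.submit(xml_text)


ingester = Ingester()
# No profile for <alert> yet, so this document gets parked.
queued = not ingester.submit("<alert><headline>Road closed</headline></alert>")
```

Once the administrator calls ingester.add_profile("alert", {"title": "headline"}), the parked document is picked up and stored without anyone writing code for that schema. A real profile would need to say a lot more (indexing and dissemination rules), which is the part still to be filled out.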

UC#4 - Arbitrary RDF Submission
Like UC#3 but for RDF. Easy for the RDF engine, hard for the database engine (but retrieval performance goes the other way.. that's what the spike is for).

Public Domain Works Database - Thoughts

Just blogging this for info really. Apparently there are issues with access to the BL's music database for checking a work's copyright status. I don't know if we can legitimately use the Library Of Congress instead, but I believe their public SRU/W server includes their music collections. Hence this query can be used to identify works of Bob Dylan. Interestingly, the AAAF (The LC Name Authority File) seems to contain much richer alias and pseudonym data for artists than the composer list, and my sample matched 100% for the data we were trying to find. It may be that the LC server is a good additional resource for checking resource status. Of course, having identified the works, getting hold of them could be a different problem.
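For anyone who hasn't met SRU, a searchRetrieve request is just a URL with a CQL query in it. The sketch below builds one using the standard SRU 1.1 parameter names. The base URL is a placeholder (not the LC server's real address), and the dc.creator index and the form of the name are my assumptions about what such a query would look like, not the actual query mentioned above.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def sru_search_url(base_url: str, cql: str, max_records: int = 10) -> str:
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; swap in the real SRU base URL of the target server.
url = sru_search_url("http://sru.example.org/catalogue",
                     'dc.creator = "Dylan, Bob"')

# Round-trip the query string to check the encoding did what we expect.
decoded = parse_qs(urlparse(url).query)
```

The response comes back as XML records, so the same request works from a browser, a script or a toolkit like JZKit.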

OKF - Open Knowledge 1.0 First Musings

So, yesterday was the first OKF meeting at Limehouse Town Hall in East London. I'm not going to try and report back each presentation blow by blow; others will do that far more accurately than I ever could. I did make notes about my "general feel" for the day, and I've got some specific thoughts on the level of cohesion between the diversity of all the projects presented. I guess my biggest concern is that there's lots of good intention, and lots of willingness to put effort in, but from what I could tell, nobody had really started to grasp the nettles of interoperability amongst hugely heterogeneous datasets. There's lots, lots, lots more to OKF than that, but because it's what I've spent most of my working life trying to deal with, it's inevitable that I see that problem everywhere I look, and that it's the problem area I'm most likely to be able to have a positive impact on. But.. my thoughts on the general feel will have to wait until I can decode my chicken scratch handwriting.....

What I wanted to get down, rather than going over the conf again, was how yesterday has changed my thinking.... My interest in OKF started because of huge links with work in projects such as Seamless UK (Community Information for the People of Essex), The Peoples Network Discovery Service (A clearinghouse for cultural heritage resources funded under various digitisation programs), IT For Me (Public/Local information in South Yorkshire), and a load more. These projects are all basically aggregators taking diverse sets of data from providers who don't always have a public interface, munging the data into a canonical format, and then pushing it out again via OAI, SRW/SRU and a web interface. Along the way, full text and spatial indexes are added to make the works searchable in lots of interesting ways. There are many similarities or links with almost all the projects of OKF. Where there's not a similarity in terms of sharing collections, there's a potential data provider link.. for example.. the planning alerts service would make a great feed for IT For Me and Seamless.

So.. how did yesterday affect my thinking.... Well, for Peoples Network we've been working on a new repository format. The current PN and IT For Me systems use a relational database as the repository. We have several "Filters" on the front end of the system that allow it to ingest many, many different metadata schemas. We then cram this onto a single relational structure, doing a mapping job as we go. This kills two (maybe even three) birds with one stone: the storage and access are dealt with in one blow; once in the repository, we are set to search; and once in the repository, the records can be output in any form we can transform the canonical schema into.

I've been working on some filters to get the public domain works database into MySQL so we can perhaps include the content in Peoples Network. But I'm worried about the diversity of data the OKF might generate and how much semantic density we might lose by cramming everything into one DC-like Schema.
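A toy example of what I mean by losing semantic density: a filter maps whatever the source schema offers onto the handful of fields the canonical schema has, and anything without a home simply falls on the floor. The make_filter helper, the field names and the sample record are invented for illustration; the real filters are rather more involved.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def make_filter(mapping: Dict[str, str]) -> Callable[[Record], Tuple[Record, List[str]]]:
    """Build a filter that maps source fields onto canonical (DC-like) fields."""
    def apply(record: Record) -> Tuple[Record, List[str]]:
        canonical = {target: record[source]
                     for source, target in mapping.items() if source in record}
        # Source fields with no canonical equivalent are lost in the mapping.
        dropped = sorted(name for name in record if name not in mapping)
        return canonical, dropped
    return apply


# Hypothetical source schema for a sound recording.
sound_filter = make_filter({
    "work_title": "title",
    "composer": "creator",
    "recorded": "date",
})

canonical, dropped = sound_filter({
    "work_title": "Example Rag",
    "composer": "A. Composer",
    "recorded": "1900",
    "arranger": "B. Arranger",
    "label": "Example Records",
})
```

Here the arranger and the label never make it into the repository. Multiply that by every schema the OKF projects might produce and you can see the worry.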

What about RDF? This question has plagued me for a long time. Judging by some of the comments at the govt information presentation, it worries some of the conf attendees too ;). I've been a long time tinkerer with Jena, since its *very* early days. I never felt, however, that it was ready to have seven million objects poured into it and be able to perform at anything like the level our database could. It also lacks the full text and spatial operators we get for free in MySQL. What it does have on its side is huge expressive power.

So, after OKF 1.0 what's changed? I think maybe it's really time to separate the repository and aggregation functions from the searchable index ones. It might be that we can make Jena perform by adding full text and spatial indexing capabilities to the database backend. However, I don't think this can be a major goal. If whatever I produce for the next Peoples Network data model or similar project is to be useful to the OKF community, it needs to separate out the concerns of storage and use. To do this, some architecture is needed. We need to start being able to store and distribute the "Knowledge" richness without loss of semantics. Selecting resources for inclusion in other indexing services is something else. I'm going to refactor some of the composer import code I wrote to create a RepositorySubmission interface. This thing needs to let me make some choices later on about the MySQL / Jena backends, and we need to do a spike with several million records to see if we need to trade off semantic density for performance (and if we do, we need to split the repository into two separate functions so this ceases to be an issue). For now, my interest is in generating a Jena backend to the repository submission interface.
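Roughly what I have in mind for the interface, sketched in Python with two in-memory stand-ins so it runs by itself. Only the RepositorySubmission name comes from the plan above; the method names, the TripleBackend and TableBackend classes and the fixed column list are assumptions, and neither class talks to Jena or MySQL.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Set, Tuple


class RepositorySubmission(ABC):
    """What import code is allowed to know about the repository."""

    @abstractmethod
    def submit(self, resource_id: str, properties: Dict[str, str]) -> None: ...

    @abstractmethod
    def count(self) -> int: ...


class TripleBackend(RepositorySubmission):
    """Stand-in for a Jena-style store: keeps every property as a triple."""

    def __init__(self) -> None:
        self.triples: Set[Tuple[str, str, str]] = set()

    def submit(self, resource_id: str, properties: Dict[str, str]) -> None:
        for name, value in properties.items():
            self.triples.add((resource_id, name, value))

    def count(self) -> int:
        return len({subject for subject, _, _ in self.triples})


class TableBackend(RepositorySubmission):
    """Stand-in for a relational store: keeps only its fixed columns."""

    COLUMNS = ("name", "dob", "dod")

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def submit(self, resource_id: str, properties: Dict[str, str]) -> None:
        self.rows[resource_id] = {c: properties.get(c, "") for c in self.COLUMNS}

    def count(self) -> int:
        return len(self.rows)


def import_composers(repo: RepositorySubmission,
                     composers: Iterable[Tuple[str, Dict[str, str]]]) -> None:
    """Import code depends only on the interface, never on the backend."""
    for resource_id, properties in composers:
        repo.submit(resource_id, properties)


# Placeholder composers; "alias" has no column in the table stand-in.
sample = [
    ("composer/1", {"name": "A. Composer", "dob": "1870", "alias": "A. Nother"}),
    ("composer/2", {"name": "B. Composer", "dob": "1880"}),
]
triple_repo, table_repo = TripleBackend(), TableBackend()
import_composers(triple_repo, sample)
import_composers(table_repo, sample)
```

Both backends end up holding two composers, but only the triple-style one still has the alias. That difference, set against how each performs at several million records, is what the spike needs to measure.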

Monday, March 05, 2007

Exposing UK Cultural Heritage Digital Resources in Second Life

The Peoples Network Discover Service is an MLA funded service that aggregates content sources that deal with cultural heritage and other MLA funded activities, and provides a digital preservation service for the metadata generated by such services.

This blog category gathers together my link research into how it might be possible to create an empty shell building in Second Life. The building would essentially be an empty museum-type structure with a console in the lobby. The idea would be to have an SL user enter an “Instance” (given that I don’t even know if Second Life could support such “Instances”, we are at first base here, although it might be an interesting topic for the SL open source program) and use the console to provide topics (well, search terms really, but it would be nice to do something more advanced). Once happy with the kinds of results, the user would select a go control and fill the museum with digital artifacts which they could then explore and interact with.

Later stages…. there’s lots of work to even arrive at a proof of concept for this, but once sorted, here are other ideas I want to explore

  • Abstract out the features and create a generic JZKit -> SecondLife bridge so that anyone using JZKit to expose their repository can expose that data in SL. Ideally, I’d like to be able to create vanilla search-houses and search-walls and search-pictures (you get the idea) that users can use in their own structures or place on their own land.

  • Provide possible integration with e-commerce (dependent upon license data in metadata). This would allow two distinct options: allow users to request prints of digital artifacts etc in the real world, and, possibly more interesting to SL, allow users to download images for use as textures and other SL resources.

  • Organisation of the “virtual museum” could be interesting: how we categorise and structure the exhibit so it’s not just a “Relevance Ranked” list of stuff arranged throughout the shell of a building is going to be challenging.

  • Providing some way for people to “Store” or “Share” pre-populated museum collections, and link them into their own structures. Perhaps this could be a bit like a saved search or an RSS feed for exhibits people can use in their own structures. In fact, a possible idea for SL commerce might be to provide a “Changeable Picture” that users can buy to hang on the walls of their SL buildings. These pictures would be backed by a search of the MLA discover service and the content would change at a user specified interval. For example, “Nottingham Lace Market”, changing every 5 mins, could be hung in the lobby of a person or organisation connected to that area.

  • This also provides a new opportunity to MLA.. the inclusion of interactive SL resources in the discover application.

A Bit About Me

Ok, here's the obligatory "About Me" post so that people who follow a link here from somewhere I've Registered can actually discover something: I'm an open source software developer (Actually, I part own a software company and officially my job title is director, but I still call myself a developer, and I'll be cold in the ground long before I ever introduce myself as MD).

My work interests center around information retrieval, information repositories, semantic and systems interoperability, systems integration and semantic web / text processing. I develop and maintain an open source toolkit, JZKit, for developers looking to access or expose information repositories using Z39.50, SRW/SRU or OpenSearch (amongst others). On top of that there's a portal application called iNode, but I'll put up a projects category and make some entries for the individual things I'm involved with. I also maintain a really low level ASN.1 to Java precompiler and BER runtime called a2j which is used all over the place, apparently even in some voip software on mobile phones... Makes me wonder what I was doing when I lgpl'd that one ;). Sector wise, I work mostly in libraries, learning and public information, and recently I'm doing an alarming amount with vocabulary management, which seems to excite some people quite wildly, but not me.

"Academically" (most people I know would never use that term around me I guess) my interests are mostly aligned with work: IR algorithms and text mining. I've always loved organisational cybernetics tho (especially the work of Stafford Beer), and have a healthy (semi-active) interest in general systems theory. I secretly hold a bit of an ambition to go back to research at least in some small way before I'm done, probably just part time. Work takes up most of my time at the moment tho, although I'm currently trying to get to grips with some bioinformatics and I'm reading Genetics for Dummies.

What little free time I have is split between my wonderful wife, who doesn't get nearly enough attention, my kids (2 boys aged 7 and 9), fun coding, and martial arts (I've sorta seriously practised HapKiDo and TaeKwonDo in the past, along with a superficial smattering of Aikido, Taijutsu and Iaido, and I'm currently full on having fun with Sensei Chris Robins at Norton Dojo Traditional Goju Ryu Karate School). I'm trying to keep a dojo blog up to date.

Sometimes I read. Fiction: mostly sci-fi and fantasy, *very* rarely horror. The last fiction I read was Excession by Iain M. Banks, and before that The Algebraist. Non-fiction: just about anything. Currently I'm having a go at re-reading Sartre's classic existentialist text Being and Nothingness; my teeny IQ struggles a bit to keep all the words in my head, but it's quite rewarding on the hole (See what I did there, eh?). The last fun non-fiction I read was The Progressive Patriot by Billy Bragg, and before that I read The God Delusion by Richard Dawkins, which whilst I mostly "Get" I can't say I enjoyed, mostly because whilst the man is clearly bright and passionate, he has the charm and grace of a pubic louse. A friend also gave me a copy of Gödel, Escher, Bach: an Eternal Golden Braid: A metaphorical fugue on minds and machines in the spirit of Lewis Carroll, which I sometimes take to the winter garden at lunch time in an attempt to look intellectual. I think the furrowed brow always gives it away.. or maybe it adds to the illusion. Next up I'm hoping to read On Formally Undecidable Propositions of "Principia Mathematica" and Related Systems, since it sorta ties together some of my interests in cybernetics (and metasystems as a means to resolve undecidable propositions) and the maths of GEB.

Sometimes I listen to music and I get lots of stick in the office about my Marillion t-shirts and 47 minute guitar solos. Generally speaking, I'm shy rather than antisocial, so it's OK to speak to me if you see me out and about. I'm a Crab, but don't believe in any of that crap beyond the ability of believers to use it as a prompt for introspection (how I feel about most systems of belief really). Although I really like The Desiderata of Hope by Max Ehrmann quote below, I did recently see a friend's religion status listed as "I believe in people, not fairy tales", which I also like. Although it's a bit strong for me personally, it does kinda sum up how I feel about certain things :)

Well, if you want to know any more you best talk to me :)

Peace,
e.