Sunday, March 18, 2007

Repository Use Cases, first thoughts

I seem to be suffering from a case of bloggorrhea...

Here are my initial thoughts about a set of use cases for an Open Source Repository that might suit the needs of a subsequent version of It For Me, Peoples Network, Seamless, or any of the Open Knowledge Foundation projects. It's just one of a thousand flowers that might blossom in the OKF community, and I've got specific questions of my own to answer outside the domain of OKF, but if there's the potential for reuse, why not exploit it?

These initial use cases are going to be used to focus attention for Spike 1, in which I'm going to try to figure out whether Jena is capable of performing as a back end for a database of several million items. For the spike we'll use Jena and the MySQL database. I've got a straight 8 million item database for MySQL, so I can do some meaningful comparisons. The use cases should remain useful outside the spike, but here is how they stand for now:

Terms:
PDWDA - Public Domain Work Detection Agent - A software module that uses a number of rules to identify public domain works and notify the repository of that data.

UC#1 - New Known Schema Resource (Sound Recording), with New Related Data (New Composer/Artist), via internal API (No web services)
In this UC a "PD Works Detection Agent" (PDWDA) submits a new work by a previously unidentified artist/composer. We will be using a SoundRecording DTO object, and the interface will need to convert SoundRecordingDTO -> RDF, then insert the RDF, creating new SoundRecording and Composer/Artist data.
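A minimal sketch of the DTO -> RDF step, just to make the shape of UC#1 concrete. Everything here is an assumption for illustration: the DTO fields, the `http://example.org/repo/` namespace, the property names and the `dto_to_ntriples` helper are all hypothetical, and it emits plain N-Triples text rather than going through the Jena API.

```python
from dataclasses import dataclass
from typing import List, Optional

BASE = "http://example.org/repo/"  # hypothetical namespace, not a real schema
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class SoundRecordingDTO:
    """Hypothetical DTO; the real field list is not defined in this post."""
    title: str
    composer_name: str
    year: Optional[int] = None


def _lit(value) -> str:
    """Serialise a value as an N-Triples plain literal with basic escaping."""
    text = (str(value).replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{text}"'


def dto_to_ntriples(dto: SoundRecordingDTO, recording_id: str,
                    composer_id: str) -> List[str]:
    """Turn one DTO into N-Triples lines for a new recording and a new person."""
    rec = f"<{BASE}recording/{recording_id}>"
    comp = f"<{BASE}person/{composer_id}>"
    triples = [
        (rec, f"<{RDF_TYPE}>", f"<{BASE}schema#SoundRecording>"),
        (rec, f"<{BASE}schema#title>", _lit(dto.title)),
        (rec, f"<{BASE}schema#composer>", comp),
        # UC#1: the composer is previously unknown, so describe it too.
        (comp, f"<{RDF_TYPE}>", f"<{BASE}schema#Person>"),
        (comp, f"<{BASE}schema#name>", _lit(dto.composer_name)),
    ]
    if dto.year is not None:
        triples.append((rec, f"<{BASE}schema#year>", _lit(dto.year)))
    return [f"{s} {p} {o} ." for s, p, o in triples]
```

The resulting lines could then be handed to whatever RDF store is under test; how the insert itself would look against Jena is deliberately left out.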

UC#2 - New Known Schema Resource (Sound Recording), with Known Related Data (Existing Composer/Artist)
As UC#1, but with reference to an existing artist/composer. We will deal with deduplication of errant composer/artist data in a later version. For now, identify via composer -> person -> normalised name, dob, dod (date of birth, date of death).
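One way the "normalised name, dob, dod" lookup might work, as a rough sketch. The normalisation rules (strip accents, lower-case, drop punctuation, collapse whitespace) are my assumption, and `PersonIndex`, `person_key` and the `person/N` identifiers are hypothetical names, not part of any existing code.

```python
import unicodedata
from typing import Dict, Optional, Tuple

PersonKey = Tuple[str, Optional[str], Optional[str]]


def normalise_name(name: str) -> str:
    """Strip accents, lower-case, drop punctuation and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in stripped.lower())
    return " ".join(cleaned.split())


def person_key(name: str, dob: Optional[str] = None,
               dod: Optional[str] = None) -> PersonKey:
    """Identity key for a person: (normalised name, dob, dod)."""
    return (normalise_name(name), dob, dod)


class PersonIndex:
    """In-memory stand-in for the repository's person lookup (hypothetical)."""

    def __init__(self) -> None:
        self._by_key: Dict[PersonKey, str] = {}

    def find_or_create(self, name: str, dob: Optional[str] = None,
                       dod: Optional[str] = None) -> Tuple[str, bool]:
        """Return (person_id, created). UC#2 is the created == False path."""
        key = person_key(name, dob, dod)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing, False
        new_id = f"person/{len(self._by_key) + 1}"
        self._by_key[key] = new_id
        return new_id, True
```

Note that this treats "Dvorak, Antonin" and "Antonin Dvorak" as different people, which is exactly the kind of errant data the later deduplication pass would have to handle.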

UC#3 - Arbitrary XML Schema Submission
We'd like to be able to ingest arbitrary XML without the need for code. This is going to require some kind of codified "Profile"... more to be filled out. Ideally, if the schema is unidentified, the system would hold the input document in a queue until the administrator can create the profile. A central repository of profiles would be cool, so people could reference or download storage, indexing and dissemination rules. (I think we want more than just Lucene-style indexing though... more structure is needed. Some Jena/Lucene crossover might be very interesting, and of broader scope than any of these projects individually.)
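A rough sketch of the "hold it in a queue until a profile exists" idea. The profile format (a dict mapping output field names to ElementTree paths), the use of the root element tag as the schema identifier, and the `IngestService` class are all assumptions made for illustration; a real profile would also need to carry the indexing and dissemination rules.

```python
import xml.etree.ElementTree as ET
from collections import deque


class IngestService:
    """Hypothetical ingest front end: profile lookup plus a holding queue."""

    def __init__(self):
        self.profiles = {}      # schema id (root tag) -> profile dict
        self.pending = deque()  # (schema id, raw XML) awaiting a profile
        self.stored = []        # extracted records, standing in for storage

    def submit(self, xml_text):
        """Store the document if its schema is known, otherwise queue it."""
        # For namespaced documents the tag comes back as '{namespace}local'.
        schema_id = ET.fromstring(xml_text).tag
        profile = self.profiles.get(schema_id)
        if profile is None:
            self.pending.append((schema_id, xml_text))
            return "queued"
        self._apply(profile, xml_text)
        return "stored"

    def register_profile(self, schema_id, profile):
        """Administrator adds a profile; matching queued documents are drained."""
        self.profiles[schema_id] = profile
        still_waiting = deque()
        while self.pending:
            sid, doc = self.pending.popleft()
            if sid == schema_id:
                self._apply(profile, doc)
            else:
                still_waiting.append((sid, doc))
        self.pending = still_waiting

    def _apply(self, profile, xml_text):
        """Extract the fields the profile names; no per-schema code needed."""
        root = ET.fromstring(xml_text)
        record = {field: root.findtext(path)
                  for field, path in profile["fields"].items()}
        self.stored.append(record)
```

For example, submitting `<recording><title>Blue Moon</title></recording>` before any profile exists would queue it, and registering a profile for `recording` afterwards would drain the queue into storage.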

UC#4 - Arbitrary RDF Submission
Like UC#3, but for RDF. Easy for the RDF engine, hard for the database engine (but retrieval performance goes the other way; that's what the spike is for).
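To illustrate one reading of that trade-off, here is a small sketch of a generic subject/predicate/object table in a relational database. It uses SQLite from the Python standard library purely so the example is self-contained; the table layout, the short predicate names and the helper functions are assumptions, and this is not the schema Jena actually uses on MySQL.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one generic triple table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triple (s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX triple_po ON triple (p, o)")
    return conn


def add_triples(conn, triples):
    """Any predicate can be stored without a schema change."""
    conn.executemany("INSERT INTO triple (s, p, o) VALUES (?, ?, ?)", triples)


def recordings_by_composer_name(conn, name):
    """Each extra hop in the graph costs another self-join on the same table."""
    rows = conn.execute(
        """
        SELECT rec.s
        FROM triple AS rec
        JOIN triple AS person ON person.s = rec.o
        WHERE rec.p = 'composer' AND person.p = 'name' AND person.o = ?
        ORDER BY rec.s
        """,
        (name,),
    )
    return [row[0] for row in rows]
```

Accepting arbitrary RDF this way is trivial, but a purpose-built table with a composer column would answer the same question without the self-join, which is the kind of comparison the spike is meant to measure.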
