Relational databases 
ANDS Partners
Joined: Wed Aug 19, 2009 11:42 am
Posts: 4
Location: University of Sydney Library
Following is an enquiry I sent to Margaret after a discussion about a particular research data collection. Margaret suggested that others may be interested in this area. ANDS has offered assistance in developing and documenting a methodology which applies to the collection, but which could equally be used by other institutions. (For background on the dataset discussed below and Andy's work, see http://www.theage.com.au/articles/2004/ ... 12905.html )

I'm interested in responses to the questions below (as well as others I've not considered), and in any thoughts people may have on approaching this type of content.

~~~~~~~~~~~
Andy Short’s beaches research data is held within a relational database. The database underpins an information service currently accessible to authorised users and administered by the Surf Lifesaving Association (SLAA). Going back some years, the University of Sydney Library has expressed interest in playing an archiving role, and Andy and the SLAA are happy for the Library to undertake this task. The Library, however, has no experience in archiving either relational databases or content derived from relational databases, and requires guidance.

How might ANDS assist? Is this a common issue facing other libraries, or is it exotic? Would it be useful to have an ANDS guide on the subject? Might Andy’s dataset provide an example to work through in partnership with ANDS?

Questions include…

1. Should the data be submitted to the Library’s repository? Is this the most suitable home?

2. If the data were to be submitted to the repository, what would the Library need to know in order to assert that it is being archived in a form most likely to enable access and use at a future point?

3. Should the Library seek to preserve the functionality of the database including any stored procedures and triggers?
Ongoing management within its native RDBMS?
Services in development such as PLANETS-SIARD?

4. Should the Library seek to archive only the data?
Translation to an intermediary format? Is there a standardised archiving format for database content?
Export each table as a text file?
Documentation of database schema, entity relationship diagram?

5. What additional metadata is needed?
Contextual (background, purpose), administrative (rights, provenance), technical (data constraints, primary & foreign keys)?
Who would create or provide metadata and associated documentation?

6. What are the costs compared to those of managing a typical item of repository content?

7. What are the issues in deciding whether this would be a one-off, become part of a routine service, or be considered for special cases only?

At this stage I’m in the process of receiving files from the SLAA, and I’ve looked a little at relational database submission guidelines provided by agencies such as the UK Archaeology Data Service and the old AHDS.

ADS guidelines for depositors
http://ads.ahds.ac.uk/project/userinfo/ ... e=database

AHDS relational database preservation handbook
http://www.google.com.au/url?sa=t&sourc ... EKH8Tb-DqA

_________________
Rowan Brownlee
Digital Project Analyst, University of Sydney Library
Fisher Library F03,
University of Sydney, NSW 2006, Australia
Phone: +61 02 9036 6450
e-mail: r.brownlee@library.usyd.edu.au
web: http://escholarship.library.usyd.edu.au/dpa/


Wed Aug 19, 2009 6:48 pm
ANDS Partners
Joined: Thu Aug 20, 2009 3:22 pm
Posts: 1
This is a fascinating project, and of particular interest to me because CSIRO Marine and Atmospheric Research, which is my "home" in CSIRO, does considerable research into coastal management: http://www.cmar.csiro.au/research/icom.html
I think it would have very wide application, and would be a useful test case for ANDS.
I don't have any answers to your questions, however, Rowan.


Thu Aug 20, 2009 4:58 pm
ANDS Staff
Joined: Fri Aug 21, 2009 4:35 pm
Posts: 2
Personally, I think it depends on the size of the database. If it is not "huge", why not make a backup copy of the database at the USYD Library (it would become a mirror database, and you could archive it on your side if you want). If you set up a script to back up the database regularly, you won't need to worry about losing data or functionality.
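
Something along these lines, run from cron or another scheduler, would do it. This is only a minimal sketch: it assumes, purely for illustration, a MySQL back end (the actual RDBMS isn't stated in this thread), and the directory and database names are made up.

Code:
---------------------------------
#!/usr/bin/env python3
"""Minimal sketch of a scheduled dump script (illustrative only)."""
import datetime
import pathlib
import subprocess

ARCHIVE_DIR = pathlib.Path("/archive/beaches-db")  # hypothetical target directory
DB_NAME = "beaches"                                # hypothetical database name

def dump_database() -> pathlib.Path:
    """Write a timestamped SQL dump so each run keeps a distinct copy."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    outfile = ARCHIVE_DIR / f"{DB_NAME}-{stamp}.sql"
    # mysqldump writes table definitions and data; --routines and --triggers
    # also capture stored procedures and triggers, i.e. the functionality.
    # Credentials are assumed to come from an option file (e.g. ~/.my.cnf).
    with outfile.open("w") as fh:
        subprocess.run(
            ["mysqldump", "--routines", "--triggers", DB_NAME],
            stdout=fh,
            check=True,
        )
    return outfile

if __name__ == "__main__":
    print("Wrote", dump_database())
---------------------------------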

You may also be interested in this paper:
Relational Database Active Tablespace Archives Using HSM Technology
http://www4.clearlake.ibm.com/hpss/about/BoomerRDBMSHSM.pdf


Mon Aug 24, 2009 3:25 pm
ANDS Staff
Joined: Wed Aug 12, 2009 1:19 pm
Posts: 18
ANDS is interested in this collection for several reasons. It would be a worthwhile addition to the ANDS Collections Registry. It also provides an interesting case study in terms of the methodology around its organisation and description, and the means used to ensure its longer-term access and use. So too, I suspect, are lots of other people in institutions with similar collections.

ANDS has taken on a new member of staff with experience in this area. We have allocated some of her time to help with this. This will involve developing and documenting a methodology which applies to the collection, but which will be made available to be used by other institutions.


Tue Aug 25, 2009 2:22 pm
Hi everyone,

I am the new ANDS staff member Margaret mentioned and I am very interested in this topic. I was very pleased to read Rowan's post on my first day.

I won't presume to answer any of Rowan's questions yet, but I know that it is a complex area where the standards are still evolving and the availability of tools is limited. There are also many levels at which the problem can be addressed. The blog entry I've appended below from Ben O'Steen covers many of the issues. I also found a lot of useful information from the DISC-UK DataShare Project - http://www.disc-uk.org/datashare.html

My initial impression is that approaches fall into four categories:
  • Describe the database as an item in the repository with as much descriptive information as possible, and package up the database in both its proprietary format and a standard export format, e.g. an SQL dump, a series of CSV files, etc. One can add to the package any associated documentation such as data dictionaries, an ER model, etc. (see the sketch after this list). For full reuse, this assumes that the consumer has either the same database software as the creator, or software which fully understands the proprietary format of the original database software.
  • The approach used by the social science community: DDI/SDMX for social science data, for which there are tools, although this doesn't adapt to a general model (there is a good intro to DDI and SDMX at http://www.opendatafoundation.org/paper ... d_SDMX.pdf). This improves reuse possibilities and data sharing.
  • The attempt to capture all of the functionality of the original relational database, and to add provenance and descriptive information at all levels of the database (dataset, tables, fields) to improve the context for the re-user. A new standard, SIARD (Software Independent Archiving of Relational Databases), adopted by EU-PLANETS, which Rowan mentioned, aims to capture all of the functionality of the original relational database (only insofar as it accords with the SQL:1999 standard) but also allows the addition of administrative and descriptive metadata that was not documented in the database by the designer/creator. The theory behind this standard is that any database product (commercial or open-source) could be enhanced to accept this as an import format and would recreate the database functionality including relations, constraints and triggers. There are no publicly available tools for this method yet (e.g. converters), but the Swiss Federal Archives have developed some and will be releasing them as free/open software once they get clearance from their bureaucracy. Briefing paper on SIARD at http://issuu.com/dpe-staff/docs/databas ... 2?mode=a_p
  • A more modern variation of the previous approach attempts the same outcome by converting the contents of the database to a standardised RDF triple store format (with an OWL ontology and SPARQL endpoints), which would not only enable re-use of the data but would also make it more immediately accessible via the linked data initiative.
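
For the first of these approaches, a rough sketch of the kind of export-and-package step I have in mind is below (it uses SQLite and the Python standard library purely for illustration; the file names are invented):

Code:
---------------------------------
"""Export every table to CSV and bundle the CSVs with the native
database file; documentation (data dictionary, ER diagram) would be
added to the same package. Illustrative file names only."""
import csv
import io
import sqlite3
import zipfile

DB_FILE = "beaches.sqlite"        # hypothetical copy of the database
PACKAGE = "beaches-archive.zip"   # hypothetical deposit package

conn = sqlite3.connect(DB_FILE)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]

with zipfile.ZipFile(PACKAGE, "w", zipfile.ZIP_DEFLATED) as pkg:
    for table in tables:
        cur = conn.execute(f'SELECT * FROM "{table}"')
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
        pkg.writestr(f"csv/{table}.csv", buf.getvalue())
    # Keep the native/proprietary form alongside the standard export.
    pkg.write(DB_FILE, f"native/{DB_FILE}")
---------------------------------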

If anyone can point to other methods, real examples, or any tools that they know of or have used themselves, that would be very helpful.

Monica

Quote:
---------------------------------
From http://oxfordrepo.blogspot.com - Ben O'Steen
Handling Tabular data
"Storage"

I put the s-word in quotes because the storing of the item is actually a very straightforward process - we have been dealing with storing tabular data for computation for a very long time now. Unfortunately, this also means that there are very many ways to capture, edit and present tables of information.

One realisation to make with regards to preserving access to data coming from research is that there is a huge backlog of data in formats that we shall kindly call 'legacy'. Not only is there this issue, but data is being made with tools and systems that effectively 'trap' or lock in a lot of this information - a case in point being any research recorded using Microsoft Access. While the tables of data can often be extracted with some effort, it is normally difficult or impossible to extract the implicit information: how tables interlink, how the Access Form adds information to the dataset, etc.

It is this implicit knowledge that is the elephant in the room. Very many serialisations, such as SQL 'dumps', csv, xls and so on, rely on implicit knowledge that is either related to the particulars of the application used to open it, or is actually highly domain specific.

So, it is trivial and easy to specify a model for storing data, but without also encoding the implied information and without making allowances for the myriad of sources, the model is useless; it would be akin to defining the colour of storage boxes holding bric-a-brac. The datasets need to be characterised, and the implied information recorded in as good a way as possible.

Characterisation

The first step is to characterise the dataset that has been marked for archival and reuse. (Strictly, the best first step is to consult with the researcher or research team and help and guide them so that as much of the unsaid knowledge as possible is known by all parties.)

Some serialisations do a good job of this themselves: *SQL-based serialisations include basic data type information inside the table declarations. As a pragmatic measure, it seems sensible to accept SQL-style table descriptions as a reasonable beginning. Later, we'll consider the implicit information that also needs to be recorded alongside such a declaration.

Some others, such as CSV, leave it up to the parsing agent to guess at the type of information included. In these cases, it is important to find out or even deduce the type of data held in each column. Again, this data can be serialised in a SQL table declaration held alongside the original unmodified dataset.
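
A minimal sketch of this deduction step, assuming a CSV file with a header row (the function and file names are illustrative):

Code:
---------------------------------
"""Guess a base type for each CSV column and record the result as a
SQL-style table declaration, to be held alongside the original,
unmodified file."""
import csv

def guess_type(values):
    """Return the narrowest base type that fits every non-empty value."""
    def fits(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "INTEGER"
    if fits(float):
        return "FLOAT"
    return "TEXT"

def describe_csv(path, table_name):
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    header, body = rows[0], rows[1:]
    columns = [
        f'  "{name}" {guess_type([row[i] for row in body])}'
        for i, name in enumerate(header)
    ]
    return f'CREATE TABLE "{table_name}" (\n' + ",\n".join(columns) + "\n);"

# e.g. print(describe_csv("beaches.csv", "beaches"))  # hypothetical file
---------------------------------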

(It is assumed that a basic data review will be carried out: does the csv have a consistent number of columns per row, is the version and operating system known for the MySQL instance that held the data, is there a PI or responsible party for the data, etc.)

Implicit information

Good teachers are right to point out this simple truth: "don't forget to write down the obvious!" It may seem obvious that all your data is latin-1 encoded, or that you are using a FAT32 filesystem, or even that you are running in a 32-bit environment, but the painful truth is that we can't guarantee that these aspects won't affect how the data is held, accessed or stored. There may be systematic issues that we are not aware of, such as the problems with early versions of ZFS causing [at the time, detected] data corruption, or MySQL truncating fields when serialised in a way that is not anticipated or discovered until later.

In characterising the legacy sets of data, it is important to realise that there will be loss, especially with the formats and applications that blend presentation with storage. For example, it will require a major effort to attempt to recover the forms and logic bound into the various versions of MS Access. I am even aware of a major dataset, a highly researched dictionary of Old English words and phrases, whose final output is a Macromedia Authorware application and whose source files are held by an unknown party (if they still exist at all) - the joy of hiring contractors. In fact, this warrants a slight digression:

The gap in IT support for research

If an academic researcher wishes to gain an external email account at their institution, there is an established protocol for this. Email is so commonplace that it sounds an easy thing to provide, but you need expertise, server hardware, multiuser configuration and adoption of certain access standards (IMAP, POP3, etc), and generally there are very few types of email (text, or text with MIME attachments - NB the IM in MIME stands for Internet Mail).

If a researcher has a need to store tables of data, where do they turn? They should turn to the same department, who will handle the heavy lifting of guiding standards, recording the implicit information and providing standard access APIs to the data. What the IT departments seem to be doing currently is - to carry on the metaphor - handing the researcher the email server software and telling them to get on with it, to configure it as they want. No wonder the resulting legacy systems are as free-form as they are.

Practical measures - Curation

Back to specifics now: consider that a set of data has been found to be important, research has been based on it, and it has been recognised that this dataset needs to be looked after. [This will illustrate the technical measures. Licensing, dialogue with the data owners, and other non-technical analysis and administration is left out, but assumed.]

The first task is to store the incoming data, byte-for-byte, as much as is possible - storing the ISO image of the media the data arrived on, storing the SQL dump of a database, etc.

Analyse the tables of data - record the base type of each column (text, binary, float, decimal, etc), aping the syntax of a SQL table declaration, as well as trying to identify the key columns.

Record the inter-table joins between primary and secondary keys, possibly by using a "table.column SAMEAS table.column;" declaration after the table declarations.

Likewise, attempt to add information concerning each column, information such as units or any other identifying material.

Store this table description alongside the recorded tabular data source.
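
For example, a table description of this kind might be generated and saved next to the untouched source data like so (the table and column names here are entirely made up):

Code:
---------------------------------
"""Write a SQL-style table description with "SAMEAS" join declarations
and per-column notes; the schema shown is invented for illustration."""
TABLE_DESCRIPTION = """\
CREATE TABLE beach (
  beach_id   INTEGER,   -- key column
  name       TEXT,
  length_km  FLOAT      -- units: kilometres
);

CREATE TABLE hazard_rating (
  beach_id   INTEGER,   -- joins to beach.beach_id
  rating     INTEGER    -- 1 (least hazardous) to 10 (most hazardous)
);

-- inter-table join, in the "table.column SAMEAS table.column;" form
hazard_rating.beach_id SAMEAS beach.beach_id;
"""

with open("beaches-table-description.sql", "w") as fh:  # hypothetical name
    fh.write(TABLE_DESCRIPTION)
---------------------------------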

Form a representation of this data in a well-known, current format such as a MySQL dump. For spreadsheets that are 'frozen', cells that are the results of embedded formulae should be calculated and added as fixed values. It is important to record the environment, library and platform that these calculations are made with.

Table description as RDF (strictly, referencing cols/rows via the URI)

One syntax I am playing around with is the notion that by appending sensible suffixes to the base URI for a dataset, we can uniquely specify a row, a column, a region or even a single cell. Simply put:

http://datasethost/datasets/{data-id}#table/{table-name}/column/{column-id} to reference a whole column
http://datasethost/datasets/{data-id}#table/{table-name}/row/{row-id} to reference a whole row, etc

[The use of the # in the position it is in will no doubt cause debate. Suffice it to say, this is a pragmatic measure, as I suspect that an intermediary layer will have to take care of dereferencing a GET on these forms in any case.]

The purpose for this is so that the tabular description can be made using common and established namespaces to describe and characterise the tables of data. Following on from a previous post on extending the BagIt protocol with an RDF manifest, this information can be included in said manifest, alongside the more expected metadata without disrupting or altering how this is handled.
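
As a sketch of what that might look like in practice (using rdflib with made-up dataset, table and column names; the properties chosen are just examples, not a proposed vocabulary):

Code:
---------------------------------
"""Describe a dataset and one of its columns in RDF, using the
column-addressing URI convention above. Names and labels are invented."""
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, RDFS

base = "http://datasethost/datasets/42"            # hypothetical {data-id}
col = URIRef(base + "#table/beach/column/length_km")

g = Graph()
g.add((URIRef(base), DC.title, Literal("Beaches research database")))
g.add((col, RDFS.label, Literal("length_km")))
g.add((col, DC.description, Literal("Beach length, in kilometres")))

# Written out in Turtle, this could sit in the RDF manifest of the package.
g.serialize(destination="rdfmanifest.ttl", format="turtle")
---------------------------------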

A possible content type for tabular data

By considering the base Fedora repository object model, or the BagIt model, we can apply the above to form a content model for a dataset:

As a Fedora Object:

* Original data in whatever forms or formats it arrives in (dsid prefix convention: DATA*)
* Binary/textual serialisation in a well-understood format (dsid prefix convention: DERIV*)
* 'Manifest' of the contents (dsid convention: RELS-INT)
* Connections between this dataset and other objects, like articles, etc as well as the RDF description of this item (RELS-EXT)
* Basic description of dataset for interoperability (Simple dublin core - DC)


As a BagIt+RDF:

Zip archive -

* /MANIFEST (list of files and checksums)
* /RDFMANIFEST (RELS-INT and RELS-EXT from above)
* /data/* (original dataset files/disk images/etc)
* /derived/* (normalised/re-rendered datasets in a well known format)
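
A rough sketch of packaging a dataset along these lines (plain zipfile + hashlib; the file names are invented and the RDF manifest is assumed to have been produced already):

Code:
---------------------------------
"""Build a BagIt-style package: originals under data/, normalised copies
under derived/, an RDF manifest, and a MANIFEST of checksums."""
import hashlib
import zipfile

ORIGINALS = ["beaches.mdb"]       # original dataset, kept byte-for-byte
DERIVED = ["beaches-dump.sql"]    # re-rendering in a well-understood format

def checksum(path):
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

manifest_lines = []
with zipfile.ZipFile("beaches-bag.zip", "w") as bag:
    for path in ORIGINALS:
        bag.write(path, f"data/{path}")
        manifest_lines.append(f"{checksum(path)}  data/{path}")
    for path in DERIVED:
        bag.write(path, f"derived/{path}")
        manifest_lines.append(f"{checksum(path)}  derived/{path}")
    bag.write("rdfmanifest.ttl", "RDFMANIFEST")   # item description and relations
    bag.writestr("MANIFEST", "\n".join(manifest_lines) + "\n")
---------------------------------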

Presentation - the important part

What is described above is the archival of the data. This form is suited to discovery, but not to reuse. So, what are the possibilities?

BigTable (Google) or HBase (Hadoop) provides a platform where tabular data can be put in a scalable manner. In fact, I would go on to suggest that HBase should be a basic service offered by the IT department of any institution. By providing this database as a service, it should be easier to normalise, and to educate the academic users in a manner that is useful to them, not just to the archivist. Google spreadsheet is an extremely good example of how such a large, scalable database might be presented to the end-user.
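
For what it's worth, the put/scan model such a service exposes is very simple; here is a tiny sketch using the happybase client (assuming an HBase Thrift service is running and a 'beach' table with a 'cf' column family already exists - all names are invented):

Code:
---------------------------------
"""Load a couple of rows into an HBase table and read them back.
Requires a running HBase Thrift server; names are illustrative."""
import happybase

connection = happybase.Connection("localhost")   # hypothetical host
table = connection.table("beach")                # pre-created table

# Each row is a dict of column-family:qualifier -> value (bytes).
table.put(b"beach-0001", {b"cf:name": b"Bondi Beach", b"cf:length_km": b"1.0"})
table.put(b"beach-0002", {b"cf:name": b"Seven Mile Beach", b"cf:length_km": b"11.0"})

for key, data in table.scan():
    print(key, data)
---------------------------------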

For archival sets with a good (RDF) description of the table, it should be possible to instantiate working versions of the tabular data on a scalable database platform like HBase on demand. Having a policy to put unused datasets to 'sleep' can provide a useful compromise, avoiding having all the tables live but still providing a useful service.

It should also be noted that the adoption of popular methods of data access should be part of the responsibility of the data providers - this will change as time goes on, and protocols and methods for access alter with fashion. Currently, Atom/RSS feeds of any part of a table of data (the Google Spreadsheets model) fit very well with the landscape of applications that can reuse this information.

Summary

* Try to record as much information as can be found or derived - from host operating system to column types.
* Keep the original dataset byte-for-byte as you received it.
* Try to maintain a version of the data in a well-understood format
* Describe the tables of information in a reusable way, preferably by adopting a machine-readable mechanism
* Be prepared to create services that the users want and need, not services that you think they should have.



Posted by Ben O'Steen at 18:28


Wed Sep 02, 2009 5:29 pm