Lifespan of datasets and use of persistent identifiers 
(This is a re-location of what first appeared in the ANDS-general group.
Lyle Winton suggested this re-location)

It seems to be the custom that, once published, a scientific paper has
an ongoing existence, even if getting a physical copy gets harder as
the years go by. You could say that it becomes a part of our
intellectual heritage. We expect it to still be known to exist a
century from now.

What about a dataset? If we choose to publish a dataset (have a look
at, the custom is to
assign it a persistent identifier and encourage re-users of the
dataset to cite it, much as one would a scientific paper.

This suggests that when we publish a dataset, we intend that dataset
to become a part of our intellectual heritage too. Papers are reviewed
by peers before publication, and quality control extends beyond the
content to the form and layout of the paper itself.

Question 1) Is this indeed what we intend when we publish a dataset?

Our department is intending to publish datasets. What quality control
should we impose on a dataset before deciding whether to publish it?

I see two broad ways to approach this problem. We could work by
analogy with papers, require peer review, and impose some standard on
the "look and feel" of our datasets. The problem with this is that we
are not editors of a scientific journal; we are a government
department that happens to have a large number of scientists.

Or we could let the scientific market decide.

We could do this via a half way house you might call "provisional

That is, suppose we are not sure whether some dataset is of
lasting value to science, and don't have the internal resources to do
good quality control on it. However, we are willing to make the effort
of cataloguing it just in case. We could publish it, minus a
persistent identifier -- that is, just make public a catalogue entry
which describes it as best we can, and wait a while to see what
response we get.

One can imagine several responses from the market.

1) little or no interest. We could interpret that as a sign that it
has no lasting value, and quietly withdraw the catalogue entry.

2) significant interest but critical feedback from re-users, e.g. "we
would like to re-use it but the metadata is inadequate / the format it
is in is unusable / the data has some suspect features that make
re-using it unwise". We could then decide whether we can address the
criticisms. If, for whatever reason, we cannot, we withdraw the
catalogue entry from public view. Otherwise we address the criticism
and maintain the catalogue entry. In time the dataset may attract
positive feedback and go into category 3).

3) significant interest and positive feedback from re-users. We could
interpret that as a sign of lasting value, assign the dataset a
persistent identifier, and commit to publishing it for the long term.
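
To make the three cases concrete, here is a minimal Python sketch of the
decision rule as I read it. Everything named in it (triage, Action,
min_interest, the threshold of 5 expressions of interest) is my own
assumption for illustration only; how we would actually measure
"interest" or classify feedback is an open question.

```python
from enum import Enum


class Action(Enum):
    WITHDRAW = "withdraw the catalogue entry"
    REVISE = "address the criticism and keep the catalogue entry"
    MINT_IDENTIFIER = "assign a persistent identifier, publish long term"


def triage(interest: int, feedback: str, can_fix: bool = False,
           min_interest: int = 5) -> Action:
    """Map the market's response to a provisional entry onto an action.

    interest     -- hypothetical count of re-use enquiries received
    feedback     -- "critical" or "positive" (only read if interest is high)
    can_fix      -- whether we are able to address critical feedback
    min_interest -- assumed cut-off for "significant interest"
    """
    # Case 1: little or no interest.
    if interest < min_interest:
        return Action.WITHDRAW
    # Case 2: significant interest, critical feedback.
    if feedback == "critical":
        return Action.REVISE if can_fix else Action.WITHDRAW
    # Case 3: significant interest, positive feedback.
    if feedback == "positive":
        return Action.MINT_IDENTIFIER
    raise ValueError(f"unrecognised feedback: {feedback!r}")


if __name__ == "__main__":
    print(triage(12, "critical", can_fix=True).value)
```

A dataset that comes back as REVISE would simply be run through the same
check again after the next waiting period.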

