[OAI-eprints] Re: Interoperability - subject classification/terminology

Thu, 27 Mar 2003 09:36:42 +0000 (GMT)

On Thu, 27 Mar 2003, Hussein Suleman wrote:

> ...why not use sets for the separate 
> disciplines, aimed at particular service providers?...
> some disciplines are not well-defined (namely, computer science) 
> so such archives may want to play ball with multiple service providers 
> and hence may need different sets.

The question of taxonomic classification sets and version control for
Open Archives is a technical one, so I will not presume to comment on it
except from the point of view of the potential *users* of one particular
kind of archive content: unrefereed preprints and refereed postprints of
research papers from one, many, or all disciplines. In the Google age of
boolean, inverted-index full-text search, such content does not require
a detailed a-priori taxonomy, as book metadata or the metadata for other
kinds of material might. A fairly general sorting by discipline should
suffice.
http://www.eprints.org/self-faq/#26.Classification
http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2385.html
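
To make concrete what boolean search over an inverted full-text index
involves, here is a minimal sketch in Python. It is illustrative only:
the sample papers, the crude tokenizer, and all function names are
assumptions, not part of any existing archive or search-engine software.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and return its distinct alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every one of the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Documents containing at least one of the given terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_not(index, all_ids, term):
    """Documents that do not contain the given term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical full texts, standing in for harvested papers.
PAPERS = {
    "p1": "Symbol grounding in neural networks",
    "p2": "Neural networks for citation analysis",
    "p3": "Citation impact of open access",
}
INDEX = build_index(PAPERS)
```

With this index, search_and(INDEX, "neural", "networks") finds p1 and p2
without either paper ever having been assigned a subject category.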

> ...the service provider can provide an 
> interface for potential data providers to self-register.

I hope that once the number and contents of Open-Access Eprint Archives
for research preprints and postprints have scaled up toward something
closer to universality, the simple metadata descriptors "pre-refereeing
preprint" and "refereed journal article" plus perhaps "discipline name"
will be enough to guide relevant service-providers in automatically
harvesting their relevant metadata. Multiple self-registration seems a
tedious and unnecessary constraint. (Possibly a master-registry of valid
institutions and disciplinary archives will also help, but may not be
necessary unless commercial spamming invades this sector too.)
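
As a rough sketch of how a service provider might select its relevant
records using only those descriptors, consider the Python below. The
ListRecords verb and the oai_dc metadata prefix come from OAI-PMH; the
Record class, its field names, the example.org identifiers, and the
assumption that each record carries exactly one type and one discipline
are hypothetical simplifications.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# The two descriptors named above, used here as plain strings.
PREPRINT = "pre-refereeing preprint"
POSTPRINT = "refereed journal article"


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one harvested metadata record."""
    identifier: str
    doc_type: str
    discipline: str


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def select_relevant(records, disciplines, doc_types=(PREPRINT, POSTPRINT)):
    """Keep preprints/postprints whose discipline the service covers."""
    wanted = {d.lower() for d in disciplines}
    return [
        r for r in records
        if r.doc_type in doc_types and r.discipline.lower() in wanted
    ]


# Hypothetical harvest result from a single institutional archive.
HARVESTED = [
    Record("oai:archive.example.org:1", PREPRINT, "Physics"),
    Record("oai:archive.example.org:2", POSTPRINT, "Computer Science"),
    Record("oai:archive.example.org:3", "lecture notes", "Physics"),
]
```

A physics service would call select_relevant(HARVESTED, ["physics"]) and
keep only the first record; the archive itself registers with no one.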

> what remains a difficult problem, however, is how to recreate the 
> metadata used by the service provider as its native format. so, for a 
> typical example, if arXiv classifies items using a specific set 
> structure, this is certainly not going to be the default for an 
> institutional archive. does the service provider automatically or 
> manually reclassify? or does it not allow browsing by categories? 

Worrying about "recreating the categories" in this boolean full-text age
is, I believe, a waste of time (for research preprints/postprints). If
your engine cannot handle boolean full-text search on its own, just
harness Google's harvested full text to it. (Manual reclassification!
Heaven forfend! Don't bother classifying this material in the first
place, beyond the simplest of first cuts, such as discipline. Any
further classification should be algorithmic and text-data-driven, not
manual.)
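
One very simple form that algorithmic, text-driven classification could
take is sketched below: a paper is assigned to whichever category's
example texts it most resembles, by cosine similarity of word counts.
This is an assumed illustration, not a method any archive or service
provider is known to use; the category names and example texts are
invented, and a real system would use far more text and a better model.

```python
import math
import re
from collections import Counter


def bag(text):
    """Word-count vector for a text (lower-cased alphabetic terms)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(text, examples):
    """Return the category whose pooled example texts best match `text`.

    `examples` maps a category name to a list of sample texts.
    """
    target = bag(text)
    centroids = {
        label: sum((bag(t) for t in texts), Counter())
        for label, texts in examples.items()
    }
    return max(centroids, key=lambda label: cosine(target, centroids[label]))


# Hypothetical example texts for two categories.
EXAMPLES = {
    "physics": [
        "quantum field theory of particles",
        "particle collisions at high energy",
    ],
    "cognitive science": [
        "memory and perception in human cognition",
        "language learning and categorical perception",
    ],
}
```

Here classify("categorical perception of speech", EXAMPLES) returns
"cognitive science" from the text alone, with no manual tagging step.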

> in either event, the quality of the metadata from the perspective of the 
> service provider may be an impetus for potential users to want to 
> replicate their effort rather than rely on the automated submission from 
> their own institutions ... this needs more thought ...

Again, I speak only for research preprints/postprints, but please let's
not lend any further credibility to the notion that self-archiving
authors/institutions will also have to self-advertise by multiple
self-archiving of the same paper. Surely that is one headache that
OAI-interoperability should eradicate from the planet! Self-archiving
itself is self-advertising (and effort) enough. Please let us not
now -- while the momentum is still too small -- saddle would-be
self-archivers with needless extra worries and tasks!
http://www.ecs.soton.ac.uk/~harnad/Temp/tim-arch.htm

Stevan Harnad