PID revelry: PIDapalooza 2020 in Lisbon, 28-30 January

Hakala J (2020). PID-rieha – PIDapalooza 2020 Lissabonissa 28.-30.1. Tietolinja, 2020(1). Persistent address: http://urn.fi/URN:NBN:fi-fe2020050324725

Figure 1. PIDapalooza 2020 was held at the Centro Cultural de Belém in Lisbon. Photo: Juha Hakala, 2020.

PIDapalooza is an annual conference on persistent identifiers. The organisations behind it are the University of California's California Digital Library[1], Crossref[2], DataCite[3] and ORCID[4]. Accordingly, the identifier systems most prominently featured have been the Archival Resource Key (ARK), which is little known in Finland, the Digital Object Identifier (DOI), and the researcher identifier ORCID.

Meetings have been held since 2016. More information on past conferences, together with all their presentations, is available at https://www.pidapalooza.org/past-events.

I attended PIDapalooza in Lisbon in January 2020 with my colleague Riitta Koikkalainen. The event was not as wild as the conference name might suggest (a palooza is a boisterous party), but the meeting gave plenty of food for thought, and also cause for concern.

The leading role at the conference was played by identifier systems that the scientific community has adopted, or that are under development and significant for it, such as:

  • ORCID
  • ROR (Research Organization Registry)[5], an identifier for research organisations
  • RAiD (Research Activity Identifier), an identifier for research projects
  • DOI, an identifier for research publications (especially articles) and datasets, distributed by Crossref and DataCite

These identifiers have filled gaps left by traditional identifier systems such as ISBN and ISSN. DOI, for example, offered scientific publishers a workable solution for identifying articles. RAiD, which is still under development, is hoped to solve the problems of identifying research projects. With the exception of RAiD, the systems mentioned above are governed by large scientific publishers and the communities they control, such as the International DOI Foundation (IDF)[6].

The board of ISNI, the identifier system used for identifying agents, includes representatives of copyright organisations and research libraries, national libraries in particular. From the scientific publishers' point of view, both the governance of ISNI and the centralised assignment of the identifier are problematic, which is why they have founded the ORCID and ROR identifiers that compete with ISNI. Maintaining parallel identifier systems consumes resources and forces, for example, the maintainers of national name authority services to collect in their systems, as far as possible, all identifiers of the same agent.

It remains to be seen whether the competition between identifier systems will spread from the identification of agents to publications as well. The blurring of the systems' traditional mandates is illustrated by the fact that Crossref recommends assigning a DOI to the home pages of scholarly serials. There is nothing wrong with this as such, but a layperson may find it hard to tell whether the DOI in this case identifies the home page or the serial. If it is interpreted as the identifier of the serial, DOI and ISSN collide.

Figure 2. A cold but tasty conference lunch. Photo: Juha Hakala, 2020.

Of the identifier systems mentioned above, DOI and ORCID have, with the support of scientific publishers, become established as significant parts of the scholarly publishing system. For ROR the future is hard to predict: the identifier was built on GRID (Global Research Identifier Database)[7], and judging by the presentation given at PIDapalooza it is still strongly GRID-driven a year after its launch. The future of the identifier will be decided by the commitment of research organisations, because they are themselves responsible for recording the authority data the identifier requires, sparse as that data is. At the same time, these organisations are hoped to take part in supplementing their data in the ISNI database.

In worse shape than ROR is RAiD, which is intended to become a Handle-based, decentrally maintained identifier system for research projects. An essential part of it would be a centralised database containing basic information on the identified projects. RAiD is being developed by subcommittee 9 of ISO TC 46, which is responsible not only for the traditional book trade identifiers but also for DOI and ISNI, among others. The Committee Draft of the RAiD standard was unfortunately so weak that it was withdrawn from ballot in February 2020, which is highly exceptional in the ISO standardisation process. Hopefully the problems in the text can be fixed and the development of this much-needed identifier system can move forward.

The rest of this article describes the most interesting of the presentations I managed to hear during the conference. The texts were written in English and are based on my notes. I particularly remember Jonathan Clark's presentation on the history of DOI, which would have deserved more time and a more prominent slot. The problems surrounding the launch of the system, such as the threat of bankruptcy, have never been told before. The keynote by Maria Fernanda Rollo, which opened the conference, was also an interesting account of how the conduct of science can be steered through legislation.

Figure 3. The balcony of the conference centre offered a good view of the UNESCO-protected Jerónimos Monastery at the edge of the park. Photo: Juha Hakala, 2020.

Towards the circular science: PIDs for a new generation of knowledge creation and management paradigm in Portugal: from vision to reality / Maria Fernanda Rollo (Opening keynote)

The Portuguese government and the Ministry of Science, Technology, and Higher Education have defined as a priority the commitment of scientific research to the principles and practices of Open Science. They are engaged in the elaboration and implementation of a National Open Science Policy based on the statement “Knowledge belongs to all and is for all”.

Expectations towards scientific research are increasing, but at the same time scientific findings are challenged if they are not “politically correct”.

Open science: scientific information should be accessible for all. New models of scientific publishing can support this, both in Portugal and elsewhere.

PIDs, when implemented, should not be an end in themselves, or a tool for bureaucracy.

More science, less bureaucracy:

  • Usage of national e-citizen ID card
  • Digitization of services
  • Interoperability between services and systems

According to Maria Fernanda Rollo, technical and semantic interoperability are not at an acceptable level yet. In this, Finland and Portugal are in my opinion not too different.

Solution: IDs for students and scientists are in production, and an ID for Portuguese organisations is under development.

The student ID is connected to the European Student Card[8]. It is unique and persistent, can be expressed as an HTTP URI, and is given to everyone who enters a university or a polytechnic, including foreign exchange students.

The format of the student ID is: https://estudante-id.pt/nnnn-nnnn-nnnn
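
As an illustration only, a student ID expressed as an HTTP URI could be checked against this format with a short Python sketch. It assumes that each "n" stands for one decimal digit and that the URI uses the https:// scheme; the actual Portuguese scheme may define further rules (for example a check digit) that the talk did not cover. The function name `parse_student_id` is hypothetical.

```python
import re
from typing import Optional

# Assumption: "nnnn-nnnn-nnnn" means three hyphen-separated groups of four
# decimal digits. This pattern is inferred from the placeholder in the text,
# not taken from an official specification.
STUDENT_ID_URI = re.compile(r"https://estudante-id\.pt/(\d{4}-\d{4}-\d{4})")


def parse_student_id(uri: str) -> Optional[str]:
    """Return the bare identifier if the URI matches the assumed format."""
    match = STUDENT_ID_URI.fullmatch(uri.strip())
    return match.group(1) if match else None
```

Calling `parse_student_id("https://estudante-id.pt/1234-5678-9012")` returns the bare identifier, while a URI with a missing group or a different host yields `None`.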

Researchers, research administrators, and advanced students can get a science ID. It provides in principle access to all scientific services in Portugal. This national ID system is connected to ORCID (but it is not ORCID).

Ciencia vitae[9], which is still under construction, is a single access point for all scientific information. Access is limited to science ID holders. The coverage of the service has grown fast: as of 2020-04-16 it already contained information about 41,093 researchers (who are encouraged to provide their CVs), 646,854 articles, 132,284 projects and 23,672 institutions.

Achieving technical and semantic interoperability has been a major challenge. Research organisations were invited to cooperate, but it is difficult to get organisations that do not receive state funding involved.

The legislative basis for Portuguese open science is solid. All relevant laws have been collected at https://www.ciencia-aberta.pt/legislation. Open science is defined broadly: it is not just open access to publications and data, but also openness of the scientific process and transfer of scientific knowledge to society (see https://www.ciencia-aberta.pt/home).

The EOSC PID policy / Sarah Jones, Brian Matthews, Anders Sparre Conrad

The European Open Science Cloud (EOSC) is developing a PID policy. A draft document is available at https://doi.org/10.5281/zenodo.357420. It was released in December 2019; the next version will be published in March 2020[10], and the final document in October/November.

The policy is written for senior decision makers within potential EOSC service and infrastructure providers, and will be of interest to all EOSC stakeholders. It defines a set of expectations about what persistent identifiers will be used in support of a functioning environment of FAIR research.

The aim is to establish a sustainable, trusted PID infrastructure. To this end, it is important to accommodate a range of PID practices and suppliers, independent of technology. The policy encourages new and innovative uses for PIDs. Related: Revision of the National Library of Finland’s URN resolver will support this.

Regarding PID services and service providers, it is important to specify roles and responsibilities, minimum levels of service provision, and requirements for maturity.

EOSC is a European initiative, but the authors of the policy recognize the need to be interoperable on a global scale, not just within Europe. There have already been comments from outside Europe, so it should be possible to avoid a European silo. The team is looking forward to receiving constructive feedback.

Data curation and trust in a PID – how do we maintain the right balance? / Brian Kirkegaard Lunn

Errors in PID metadata endanger the trust and validity that are vital to the success of a persistent identifier. Such errors may not surface until data is processed, aggregated, or displayed in new ways. Brian Kirkegaard Lunn discussed the problem areas and focused on errors in publication metadata.

The Digital Science service[11] contains 107 million publications, with 1.1 billion links in them. Most of this data comes from Crossref and PubMed, so 97 million publications in the service have a DOI. Full text and abstracts are used whenever possible. This includes both OA publications and non-OA publications that the publishers allow to be indexed (but not made available).

Deduplication is necessary (and done) since sources overlap. Even small and innocent-looking errors, such as typos or wrong publication dates, may have strong negative impacts on deduplication and on end users.
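
To make the point concrete, here is a minimal Python sketch of DOI-based deduplication. It is not the method used by the service described in the talk; it only assumes that records from overlapping sources are grouped by a normalised DOI (DOI names are case-insensitive). The record fields, the `10.1234` prefix and the function names are hypothetical.

```python
from collections import defaultdict

# Common ways a DOI is written in metadata; stripped before comparison.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lower-case the DOI and strip a resolver or 'doi:' prefix if present."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def group_duplicates(records):
    """Group records that share the same normalised DOI."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_doi(record["doi"])].append(record)
    return dict(groups)


# Hypothetical records describing the same article in three sources.
records = [
    {"source": "A", "doi": "https://doi.org/10.1234/ABC.1", "year": 2019},
    {"source": "B", "doi": "doi:10.1234/abc.1", "year": 2018},  # wrong year
    {"source": "C", "doi": "10.1234/abc.l", "year": 2019},  # typo: "l" for "1"
]
groups = group_duplicates(records)
```

In this sketch the records from sources A and B are merged despite the differing notation, but they now disagree on the publication year, and the single-character typo in source C leaves a third copy of the article that no key-based comparison will catch.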

Correcting data issues one by one consumes a lot of resources. Error corrections should be scalable, which requires automation. Unfortunately, metadata quality issues are difficult to locate; in a large bibliographic database it is like looking for a needle in a haystack.

When publishers are contacted about problems, some are very responsive, some not at all. Based on these experiences, Brian Kirkegaard Lunn drew a few conclusions:

  • Correcting all errors is not feasible
  • New errors are surfacing all the time
  • There is not a single case that is not valuable to correct
  • Perfect metadata does not exist – everything is error prone
  • Intermediate corrections should be avoided whenever possible

Life is indeed too short for bad metadata!

The science ecosystem and open science: a multi-legged stool / Beth Plale

Open science applied to the scientific research enterprise is a principle of openness that will advance the frontiers of knowledge and help ensure a nation’s future prosperity. Realizing open science in academic research has challenges both social/organisational and technical.

Effective practices for data – guidelines

  • Assign PIDs for data
  • Cite datasets in publications based on them
  • Include a statement of data availability

Must every digital product of my research be made available? – No, because more data is created than can be preserved.

Research data is not a homogeneous product. But if the data is the basis of published works, it belongs to a trusted repository. And if the data is an asset of known value, it belongs there too, even if there are no published works relying on it.

If important, research data from small projects should be placed in repositories as well, because otherwise the data will not be preserved beyond the project itself. Research data from large projects will usually be preserved by the research community until it is known whether it has lasting value. Some of the community-specific data archives will be persistent, but not in the same way as trusted repositories (like CSC’s data PAS service in Finland).

Data-intensive discovery is dependent on large data archives, which tend to be community specific. Convergent research is interdisciplinary, and will require data from multiple archives. Such cooperation will be of major importance for solving grand challenges.

The National Science Foundation will strongly promote DOI for datasets and publications, and other PID solutions (Handle) for software. These systems support PID kernel metadata, with a small amount of information maintained at a resolver. Simple metadata means fast and simple decisions. If there are hundreds of millions of PIDs, speed is essential.

Lightning talks

ORCID in the UK

The Joint Information Systems Committee (JISC)[12] has an ORCID consortium. It started in January 2013. Growth of ORCID usage has been stable since then. There are now 98 member organisations (universities, research institutes, funders).

In Germany there are still only 56 ORCID user organisations, with about 160,000 assigned ORCIDs.

Maintaining ORCID metadata is a problem. When an organisation derives a list of IDs linked to it, usual findings include the following:

  • Some of the records returned are for people who are no longer associated with your university; the affiliation is a past one
  • Some of the records returned are out of date
  • Some of the records will not be returned because there is no linked email with your institutional domain, or the email is private
  • You may not have found all the records because you have not used a complete list of affiliation IDs – some institutions have more than one OrgID in ORCID (ROR, ISNI, GRID and Ringgold).

See JISC report iDentifying your researchers: challenges and opportunities[13] for more information.

PURL issues / Paul Walk

PURL (Persistent URL) was introduced by OCLC in 1995. Twenty years later it had turned into an endangered species. The service was eroding; for instance, the management interface was no longer working, and OCLC did not respond to users’ queries about the future of the system.

In 2016, the Internet Archive assumed management of the service. The PURL resolver software was updated, and the acute crisis was over. But there are still issues with PURL. Paul Walk from the Dublin Core Metadata Initiative (DCMI) listed the DCMI’s concerns (see also http://bitly.com/purl-crisis):

  • The DCMI PURL namespace was corrupted by a malicious agent who started to create bogus sub-domains in it
  • Ongoing inability to edit PURLs that are tied to legacy PURL system user accounts belonging to, e.g., people who have retired (this problem has been reported by several PURL user organisations)
  • No response from the support email address (also reported by many other PURL user organisations)
  • The features implemented in the resurrected resolver do not match the original behaviour, and there was no spec and there were no guarantees or contracts for what features could be relied upon either
  • PURL cannot compete with other PID systems (lack of investment in maintenance and further development, technical shortcomings)

If the PURL system cannot be trusted:

  1. How do we plan for a gradual decline in service (which PURL users are still experiencing)?
  2. How do we plan for a catastrophic failure, where the system stops resolving PURLs (this has not yet happened)?

While the issues listed above are specific to the PURL system, the question of what to do if a PID system cannot be trusted should concern all organisations that are using or planning to use a PID system. It is important to choose a widely used system that is both technically and organisationally strong.

Following the transfer of PURL management responsibility from OCLC to the Internet Archive, the risk of immediate catastrophic failure has diminished, but all the DCMI’s concerns listed above are still acute. The Internet Archive does not seem to be investing more than the minimum in the system, and there does not seem to be any ongoing development work.

Identifiers for heritage collections

A virtual integrated national arts collection[14] is being created by the UK Research and Innovation (UKRI)[15].

As a part of the project, the British Library is investigating the current use of PIDs for art collections in the UK. Based on the results it may be possible to make recommendations on how to proceed. It will be interesting to see whether the ISO International Standard Collection Identifier[16] will have a role in the project.

Unified identifier management and resolution services / Tommi Suominen, Jessica Parland-von Essen (CSC)

CSC runs an integrated PID management service. It covers all PID services CSC supports, including DOI, URN, RAiD, GRID/ROR, and ORCID.

Both ORCID and ISNI are used for researchers. Coverage is improving.

63% of publications in CSC’s Virta Higher Education Achievement Register[17] have DOIs.

Different services have different PID practices. URN is used for research infrastructures, plus documents with ISBN and/or ISSN. The CLARIN language bank hosted by CSC uses URNs and Handles. In the Fairdata.fi service, metadata and external datasets are identified with URNs, while datasets within the service get DOIs.

A dataset can have multiple different kinds of PIDs. Initially it may get a URN, and later a DOI. When archived elsewhere, the dataset may get even more PIDs. All PIDs assigned to a dataset should ideally be interlinked in the metadata.
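
As a rough illustration of what interlinking could mean in practice, the Python sketch below keeps every PID of a dataset in one metadata record, so that any one of them leads to all the others. This is not CSC's data model; the class, its fields and the example identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical metadata record holding all PIDs of one dataset."""

    title: str
    pids: List[Tuple[str, str]] = field(default_factory=list)  # (scheme, value)

    def add_pid(self, scheme: str, value: str) -> None:
        """Register a PID once; repeated registrations are ignored."""
        entry = (scheme, value)
        if entry not in self.pids:
            self.pids.append(entry)

    def alternates(self, value: str) -> List[Tuple[str, str]]:
        """Return the dataset's other PIDs, given one PID that identifies it."""
        if all(known != value for _, known in self.pids):
            return []
        return [(scheme, known) for scheme, known in self.pids if known != value]


# Made-up identifiers: a URN minted first, a DOI added later.
record = DatasetRecord("Example dataset")
record.add_pid("URN", "urn:nbn:fi:example-0001")
record.add_pid("DOI", "10.1234/example.0001")
```

Here `record.alternates("urn:nbn:fi:example-0001")` returns the DOI entry, which is the kind of lookup that interlinked metadata makes possible when a user arrives with only one of the identifiers.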

PID creation in CSC’s information systems is mainly automatic. Initially a URN is minted, later other PIDs. Quality checks are done frequently to make sure PIDs resolve correctly.

120 resolutions a second: the DOI story / Jonathan Clark

There are currently 218 million registered DOIs and, as the title says, the average rate of resolutions is 120 per second. The system is well established. But this has not always been the case.
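
To put these two figures in proportion, the short Python calculation below converts the quoted average rate into daily and yearly totals and relates it to the number of registered DOIs. It assumes that the rate stays constant over a 365-day year, which is a simplification since usage is growing.

```python
SECONDS_PER_DAY = 24 * 60 * 60

registered_dois = 218_000_000  # figure quoted in the talk
resolutions_per_second = 120  # average rate quoted in the talk

per_day = resolutions_per_second * SECONDS_PER_DAY
per_year = per_day * 365  # assumption: constant rate all year
per_doi_per_year = per_year / registered_dois

print(f"{per_day:,} resolutions per day")
print(f"{per_year:,} resolutions per year")
print(f"about {per_doi_per_year:.1f} resolutions per registered DOI per year")
```

At 120 resolutions per second this comes to roughly 10.4 million resolutions per day, or on the order of 17 resolutions per registered DOI per year.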

According to Clark, the scary years for DOI were 2004 and 2005. Crossref was doing well, but there were no other major DOI implementations (the next big step forward was DataCite in 2009). This caused financial problems, which forced a complete reorganisation.

For instance, the billing model was changed. Initially there was an annual fee for every assigned DOI. I remember that this was a show stopper for many potential DOI users, including the national libraries. The business model then went through several iterations. In the current model, Registration Authorities (RAs) pay a fixed fee and can mint as many DOIs as they want. They have partner organisations (such as TSV and CSC in Finland), to which assigning DOIs can be subcontracted for a fee.

As regards the costs, IDF only wants to make ends meet. It is a not-for-profit organisation, and will remain that way.

During the scary years there was also a rogue party that tried to patent the DOI. Because of this, a legal structure has been set up in such a way that if the International DOI Foundation (IDF) goes bankrupt, the DONA Foundation[18] (the organisation behind the Handle system) will step in. Therefore IDF cannot be bought by a party that would then make money out of DOIs.

DOI is a trademark, and nobody can start using the name for another identifier. In principle, however, the domain name doi.org (or handle.net) could be taken over, because domain names cannot be owned, just rented. Handle and DOI prefixes are different in this respect; they are not re-assigned to other organisations.

In the future the target is the full implementation of the digital object architecture[19] of Bob Kahn (inventor of the Handle system and TCP/IP, and the person behind the DONA organisation), which also includes object-related metadata.

DOIs do die, but not necessarily with their publishers, because the responsibility to maintain DOI actionability can be passed to other publishers who adopt the serial. Ultimately, when there are no other resolution services left, IDF will establish a tombstone for the DOI/resource.

Sidenote: In the Netherlands, Koninklijke Bibliotheek will enable access to Elsevier and Kluwer periodicals, when these publishers no longer exist. This practice is related to peculiar Dutch deposit practices (there is no legal deposit act). There are no similar arrangements elsewhere yet, so national libraries as a rule do not provide last stop resolution services for commercial scientific articles with DOIs. However, this may change in the future, because when a publisher goes bankrupt and no other publisher takes over its periodicals, legal deposit collections maintained in national libraries may be the last resort.

The technical infrastructure for DOI resolvers is the Amazon Elastic Beanstalk cloud service. Scalability was an important factor in choosing the service. The current load of about 200 resolutions per second[20] can be processed easily, but even this configuration would not be capable of minting millions of DOIs within minutes. The Amazon cloud DOI infrastructure could, however, be extended even to this level if necessary.

Handle resolvers have a different (locally established) technical infrastructure, which means that DOI-level quality of service cannot be guaranteed across the entire Handle system, even though the software used is the same. Some Handle resolvers may be very fast, while others may be slow or even non-existent. Handles are not and will not be controlled in the same way as DOIs. On the technical side, denial-of-service attacks targeting https://doi.org are a concern, but in this the IDF is in the same boat as any other organisation providing PID resolution services. A specific concern is people who pretend to have DOIs when they do not; this concern is shared by all identifier systems, not just PIDs.

References

[1] https://cdlib.org/

[2] https://www.crossref.org/

[3] https://datacite.org/

[4] https://orcid.org/

[5] https://ror.org/

[6] https://www.doi.org/idf-member-list.html

[7] https://www.grid.ac/

[8] https://europeanstudentcard.eu/

[9] https://cienciavitae.pt/?lang=en

[10] It had not been published yet by 2020-04-23

[11] https://www.digital-science.com/products/

[12] https://www.jisc.ac.uk/

[13] http://bit.ly/orcid-res-id

[14] https://www.ukri.org/news/first-steps-towards-an-integrated-virtual-national-arts-collection/

[15] https://www.ukri.org/

[16] https://www.iso.org/standard/44293.html

[17] https://confluence.csc.fi/display/VIRTA

[18] https://www.dona.net/

[19] https://www.dona.net/sites/default/files/2018-11/DOIPv2Spec_1.pdf

[20] DOI usage is growing so fast that the title of Clark’s presentation was already out of date.

Author's contact details

Juha Hakala, Senior Specialist
National Library of Finland, Library Network Services
P.O. Box 15 (Yliopistonkatu 1), 00014 University of Helsinki
juha.hakala [at] helsinki.fi
