DSpace translators no longer valid?

bollini · May 15, 2009

Hi all,
I recently sent an email to the dev list
http://groups.google.com/group/zotero-dev/browse_thread/thread/b6eda90295e5be3f
about issues with some DSpace sites. More precisely, I have noticed that when the DSpace translator is used, it doesn't work properly and no data is imported into Zotero.
I am continuing the discussion in this thread because it seems the most appropriate place for it.
You can reproduce the issue with this URL (a standard DSpace installation):
http://dspace-testhaton.cilea.it/jspui/handle/123456789/60

Anyway, I have learned from Richard Karnesky's mail (thanks for your answer) that the DSpace translator is used when the keyword "dspace" is present in the website URL. After reading a few lines of the code, I think it also looks for a specific HTML element in the page that is present in only one of the two UIs that DSpace now provides...
The DSpace platform has been updated since the translator was written, so some HTML details have changed, and these changes break the translator. Finally, in the latest version of DSpace we have RDF info embedded in one of the UIs (JSPUI) and COinS (Z39.88) data in the other (XMLUI). Where the DSpace translator is not used (because the URL doesn't contain the "dspace" keyword), the RDF import works well.
See for example:
http://researchspace.auckland.ac.nz/handle/2292/3065
http://www.archenvimat.pz.cnr.it/handle/10122/365

So should the DSpace translator be removed? If we want to maintain it for "old" DSpace sites, could the priority of the site translator be reduced so that the RDF translator always wins?
Thanks,
Andrea

noksagt · May 15, 2009

I have noted that when the DSpace translator is used it doesn't work properly and no data are imported in Zotero.

On some sites (due mostly, as you said, to the DSpace update).

Finally, in the last version of DSpace we have RDF info embedded in one of the UIs (JSPUI) and COinS (Z39.88) data in the other (XMLUI).

Why, in modern versions, do you not have both RDF & COinS, regardless of UI?

Note that both the eRDF & COinS translators in Zotero are still somewhat limited. The COinS translator is about as rich as it could be, without extending the OpenURL spec further. The RDF vocabulary is getting richer. However, it is a shame that DSpace offers up a lot of data that Zotero can't capture through either of these methods. In particular, the abstract & attached files are not captured (and cannot ever be captured via COinS 1.0). Also, there is no batch import (and currently can't be via eRDF).

So should the DSpace translator be removed?

I would say "no": it should be improved, if possible, to work with modern versions of DSpace, and/or DSpace should use a method to get very rich data into Zotero.

Re this latter point: look into unAPI+RDF/MODS XML for now & please follow the development of the newer bibo ontology for RDF.

As far as improving the DSpace-specific translator: are there robust ways to get rich information from DSpace, regardless of the version? Or, at least methods to get the version? If the latter, we could default to RDF (or COinS) & supplement the info w/ abstract & attachments and also support batch import.

mdiggory · May 15, 2009

I wrote the COinS implementation and know about the HTML meta DC inclusion as well. We are trying to provide more than one option in DSpace at the moment. My goal is to eventually include as many different approaches as possible in DSpace instances.

I think we would rather not see an impractical constraint like DSpace instances having to have "/dspace/" in their URLs. That was never an accurate expectation (nor "/handle/" or any other URL structure, for that matter).

Ideally we could identify later versions of a DSpace instance by adding html/head/meta identifying it as such. Would this allow a DSpace translator for Zotero to not be reliant on the URL structure?

If at all possible, there should eventually not be a "DSpace" translator, and we should be relying on metadata fields in the HTML, RDFa and/or attached RDF, and finally the OpenURL COinS. The existing DSpace translator is probably more appropriate for existing pre-1.5.2 DSpace sites and should be kept for legacy purposes until we can clean up the Zotero behavior and end-of-life versions earlier than 1.5.0.

Mark Diggory

noksagt · May 15, 2009

I think we would rather not see an impractical constraint like DSpace instances having to have "/dspace/" in their URLs. That was never an accurate expectation (nor "/handle/" or any other URL structure, for that matter).

Note that individual repositories can and have had other URI schemes added to translators (including for dspace). Perhaps a regex for 'handle' with numbers should be added (despite it not being needed or sufficient to show it is a DSpace site).

Most translators use URI-specific code so that Zotero does not have to interrogate every page with every translator. So, putting information in head/meta is useful for version info & confirmation for a site-specific translator, but it is probably not the best solution here.
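A minimal sketch of the kind of URL check described above, written in Python for illustration only (Zotero translators themselves are not written this way). The pattern and the function name `looks_like_handle_url` are assumptions, not part of any existing translator. As noted, a match is neither necessary nor sufficient to show that a site runs DSpace.

```python
import re

# Hypothetical pattern: a numeric handle path such as /handle/2292/3065.
# Matching it only suggests a handle-style URL; it does not prove DSpace.
HANDLE_PATTERN = re.compile(r"/handle/\d+(?:\.\d+)*/\d+")


def looks_like_handle_url(url: str) -> bool:
    """Return True if the URL contains a numeric handle path."""
    return HANDLE_PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://researchspace.auckland.ac.nz/handle/2292/3065",
        "http://dspace-testhaton.cilea.it/jspui/handle/123456789/60",
        "http://drupal.org/node/641580",
    ):
        print(url, looks_like_handle_url(url))
```

A check like this is cheap because it only looks at the URL, so the page itself does not have to be inspected by every translator.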

If at all possible, there should eventually not be a "DSpace" translator, and we should be relying on metadata fields in the HTML, RDFa and/or attached RDF, and finally the OpenURL COinS.

Yes. And I would still add unAPI+MODS to that list: it is not that hard to code & it works now. Batch import, abstracts, file attachments are all handled.

Simes · October 22, 2009

Reviving this thread because it seemed pertinent.

We've been working on an implementation of unAPI for our repository, but Zotero doesn't offer unAPI on our site because the DSpace translator gets there first. That translator does not work for our repository, because the repository a) runs DSpace 1.5 and b) does not follow the stock DSpace site design. So it looks like the current behaviour is for Zotero to apply the site translators first and only fall back to unAPI if none of them match.

For anyone adding unAPI to an existing DSpace repository, this isn't going to work. If unAPI is better than scraping the HTML (which it unquestionably would be) why isn't this the other way around?
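For readers unfamiliar with how a page advertises unAPI, here is a rough sketch of the discovery step in Python. It assumes the usual unAPI markup: a `<link rel="unapi-server">` element in the head and `<abbr class="unapi-id">` elements carrying record identifiers. The class name `UnapiDiscovery`, the sample page, and the repository URL are made up for illustration; this is not Zotero's code.

```python
from html.parser import HTMLParser


class UnapiDiscovery(HTMLParser):
    """Collect the unAPI server URL and the record identifiers from a page."""

    def __init__(self):
        super().__init__()
        self.server = None
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if tag == "link" and "unapi-server" in rel_values:
            self.server = attributes.get("href")
        class_values = (attributes.get("class") or "").split()
        if tag == "abbr" and "unapi-id" in class_values:
            self.identifiers.append(attributes.get("title"))


# Hypothetical page; the server URL and identifier are placeholders.
SAMPLE_PAGE = """
<html><head>
<link rel="unapi-server" type="application/xml" title="unAPI"
      href="http://repository.example.org/unapi"/>
</head><body>
<abbr class="unapi-id" title="hdl:123456789/60"></abbr>
</body></html>
"""

parser = UnapiDiscovery()
parser.feed(SAMPLE_PAGE)
print(parser.server, parser.identifiers)
```

A client would then ask the server for a given identifier in a given format (for example MODS), which is where abstracts and attachments can come through.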

dstillman · October 22, 2009

Site-specific translators always take precedence over embedded metadata translators (and they're not always doing screen scraping). This case is only problematic because it's a generic site translator with differing implementations.

If we're comfortable saying that unAPI and COinS should always take precedence over DSpace, we can adjust the translator priority accordingly.
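To make the effect of such a priority change concrete, here is a toy model in Python. It is only a sketch under stated assumptions: translators are tried in ascending priority order, the first one whose detection matches wins, and all names, numbers, and detection rules below are invented for illustration. It is not how Zotero is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Translator:
    name: str
    priority: int  # assumption: a lower number is tried first
    detect: Callable[[str, str], bool]  # (url, html) -> does it apply?


def pick_translator(translators: List[Translator], url: str, html: str) -> Optional[Translator]:
    """Return the first matching translator in ascending priority order."""
    for translator in sorted(translators, key=lambda t: t.priority):
        if translator.detect(url, html):
            return translator
    return None


def dspace_detect(url: str, html: str) -> bool:
    # Hypothetical rule: the site translator keys on "dspace" in the URL.
    return "dspace" in url


def coins_detect(url: str, html: str) -> bool:
    # COinS data is carried in elements with class "Z3988".
    return 'class="Z3988"' in html


url = "https://doclib.uhasselt.be/dspace/handle/1942/10024"
html = '<span class="Z3988" title="ctx_ver=Z39.88-2004"></span>'

# Site translator ahead of embedded metadata (the situation described above).
site_first = [Translator("DSpace", 100, dspace_detect), Translator("COinS", 300, coins_detect)]
# Same translators, with the site translator demoted below embedded metadata.
metadata_first = [Translator("DSpace", 400, dspace_detect), Translator("COinS", 300, coins_detect)]

print(pick_translator(site_first, url, html).name)
print(pick_translator(metadata_first, url, html).name)
```

In this model, demoting the generic site translator leaves it available as a fallback for pages that carry no embedded metadata.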

Simes · October 23, 2009

I would certainly be comfortable with that, but I can only speak for one repository and not the DSpace community as a whole.

tdm27 · November 27, 2009

Although I speak for the same repository, it does seem logical that those implementations should take precedence. This would maintain the current situation for those repositories that don't need to override defaults, yet allow others to make changes if needed.

Maybe someday the default DSpace translator can be improved so it can cope with various versions, but currently it seems suboptimal that so many repositories are being locked out from offering Zotero.

As we are ramping up to include more and more theses in our repository, this is becoming a problem for us and our (Zotero) users.

Can we go ahead with giving unAPI and COinS precedence?

Christophe Dupriez · December 9, 2009

I am a newcomer to this list: sorry for my incomplete understanding!
On this issue, I would suggest:

1) A tag added to the HTML content to repel the old DSpace->Zotero translator from sites that do not want it to be used. It could even be a tag that tells Zotero which strategy to use for the site. For instance:
<meta name="zotero.translator" value="meta"/>

2) If the DSpace translator is repelled, the <meta name="DC{TERM}.element.qualifier" content="xxxx"/> tags seem just fine to me: the DSpace community would certainly be open to improving meta generation in DSpace item displays. Andrea Bollini gives a nice example of this (source of http://researchspace.auckland.ac.nz/handle/2292/3065).

Where is the <meta name=... /> scraping by Zotero documented?

Thanks!

Christophe

mdiggory · December 9, 2009

Agree with Christophe.

1.) there is no guarantee that "/handle" will be available over the long term or be appropriate for DSpace instances.

2.) We've worked very hard to provide appropriate metadata in the HTML head meta fields and, likewise, using COinS.

Further improvements will come in the future. There is no reason for Zotero to differentiate DSpace sites from other more generic resources, if it were possible to have priority on META tags and COinS first, and then various DSpace-centric features.

Also note that in 1.6.0 we will be introducing a specific DSpace version META tag, which should alleviate things a bit.

My biggest question... who is supposed to be doing this work?

Mark

--
Mark R. Diggory
Head of U.S. Operations
http://www.atmire.com - Institutional Repository Solutions

sshreeves · December 9, 2009

As a repository manager for a DSpace site, I'd echo the comments above to repeal the DSpace translator. I think that the DSpace community would be behind this move.

Sarah Shreeves
IDEALS - http://www.ideals.illinois.edu/

Christophe Dupriez · December 10, 2009

Hi again!

There was a discussion today about the support of Google Scholar indexing by DSpace:
http://jira.dspace.org/jira/browse/DS-396

It is mainly based on the following meta tags:
<meta name="citation_journal_title" content="Journal Name">
<meta name="citation_authors" content="Last Name1, First Name1; Last Name2, First Name2">
<meta name="citation_title" content="Article Title">
<meta name="citation_date" content="01/01/2007">
<meta name="citation_volume" content="10">
<meta name="citation_issue" content="1">
<meta name="citation_firstpage" content="1">
<meta name="citation_lastpage" content="15">
<meta name="citation_doi" content="10.1074/jbc.M309524200">
<meta name="citation_pdf_url" content="http://www.publishername.org/10/1/1.pdf">
<meta name="citation_abstract_html_url" content="http://www.publishername.org/cgi/content/abstract/10/1/1">
<meta name="citation_fulltext_html_url" content="http://www.publishername.org/cgi/content/full/10/1/1">
<meta name="dc.Contributor" content="Last Name1, First Name1">
<meta name="dc.Contributor" content="Last Name2, First Name2">
<meta name="dc.Title" content="Article Title">
<meta name="dc.Date" content="01/01/2007">
<meta name="citation_publisher" content="Publisher Name">

Maybe the presence of the pattern <meta name="citation_xxxx" ...> should be considered by Zotero as "bibliographically friendly" and scraped with priority, whatever the underlying software is?
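As a rough illustration of what scraping these tags could look like, here is a small Python sketch using only the standard library. The class and function names are made up for this example, the author separator (";") is taken from the sample tags above, and nothing here reflects an actual Zotero translator.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_..."> tags, keyed by the part after the prefix."""

    PREFIX = "citation_"

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.startswith(self.PREFIX) and content is not None:
            self.fields[name[len(self.PREFIX):]].append(content)


def extract_citation_fields(html: str) -> dict:
    """Return a plain dict of citation_* fields found in the page."""
    parser = CitationMetaParser()
    parser.feed(html)
    return dict(parser.fields)


SAMPLE_HEAD = """
<meta name="citation_title" content="Article Title">
<meta name="citation_authors" content="Last Name1, First Name1; Last Name2, First Name2">
<meta name="citation_date" content="01/01/2007">
<meta name="dc.Title" content="Article Title">
"""

fields = extract_citation_fields(SAMPLE_HEAD)
# Assumption: multiple authors in one tag are separated by semicolons.
authors = [a.strip() for a in fields["authors"][0].split(";")]
print(fields["title"], authors)
```

A page would count as "bibliographically friendly" in this sense if the returned dict is non-empty, independent of which software produced the page.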

Have a nice evening!

Christophe

dstillman · December 10, 2009

Are the citation_xxxx meta tags actually documented anywhere other than in e-mails from Google employees? Are they part of any published standard, or are they just arbitrary tags that the Google Scholar folks made up? (This isn't to say we couldn't support them, but a modicum of documentation would be helpful.)

Christophe Dupriez · December 10, 2009

The E-mail from Google to Drupal community:
http://drupal.org/node/641580

A report by someone who experimented with it:
http://www.monperrus.net/martin/accurate+bibliographic+metadata+and+google+scholar

Many things are not publicly documented with Google (Search engine orthographic approximations? sorting rule for search results?). They are mainstream: alternatives must be encouraged but the main stream must be perfectly supported (IMHO).

fbennett · December 10, 2009

They are mainstream: alternatives must be encouraged but the main stream must be perfectly supported (IMHO).

There's a difference between mainstream companies and mainstream standards. Google is a creative firm, they're good at cooking up new ideas; but I think Dan is looking for a bit of assurance that this scheme is something to which the company itself has made a serious commitment (i.e. that it's reasonably well defined, and unlikely to be swapped for some other approach in the near future -- that it's not just an idea that someone who happens to work there has come up with).

Christophe Dupriez · December 11, 2009

Google has a commercial approach to this: they disclose things (rarely) and only when it is in their direct interest. The Zotero community may (or may not) take a pragmatic approach and adapt. I agree that the two links I found above are the only serious ones on the subject. But if it works, it fulfils users' needs. Scraping has always been a kind of fight against the traffic: you are not in control of what others do. Ever tried to automatically get PDFs from a citation?!
This being said, others have a similar approach to make Dublin Core more precise:
http://iodeweb1.vliz.be/odin/handle/1834/882?mode=full
http://www.ceemar.org/dspace/handle/11099/897?mode=full
https://doclib.uhasselt.be/dspace/handle/1942/10024?mode=full
(they are using "bibliographicCitation" instead of "citation" but the approach is similar: better map MARC to an extended DC).

tdm27 · March 17, 2010

Can anything be done about this? I notice that it still doesn't work, not with 1.5 and not with the new version of DSpace, 1.6. It would be rather nice if people could automatically get metadata from papers stored/offered in DSpace.

Hilton Gibson · May 7, 2014

Hi

Something that may help.

http://wiki.lib.sun.ac.za/index.php/SUNScholar/XMLUI_Theme/Tutorial#DRI2XHTML_Transformers

aurimas · May 7, 2014

Sorry, but what exactly are you trying to point to? The DSpace translator has not existed for about 2 years now.

Hilton Gibson · May 7, 2014

Ok. Thx.