Wrong item types detected by Zotero Connector (mostly, Journal Articles instead of theses)
When I add theses from university repositories through the Zotero Connector, I often find that they are saved as Journal Articles. From now on, I will list the examples I find here:
https://unbscholar.lib.unb.ca/handle/1882/38229
https://minerva.usc.gal/entities/publication/a86c0626-d258-4ffe-ba5a-aea9b3280ebc
https://repository.library.northeastern.edu/files/neu:ms36wk00b
https://mountainscholar.org/items/0797c24c-aca8-4eb1-965f-7c4463cafd59
https://scholars.wlu.ca/etd/2733/
https://vtechworks.lib.vt.edu/items/5cd36846-e693-49ea-88fe-3ebb0f917232
https://iris.unito.it/handle/2318/2065333?mode=simple
https://digitalcommons.usu.edu/etd2023/486
citation_dissertation_institution. We're working on major improvements to the "generic" translators that we use for sites without site-specific support, though, and we should be able to address this as part of that effort.
dc.type master thesis and I believe this is enough machine-readable metadata marking the item as a thesis.
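To make the idea concrete, here is a minimal sketch of how a type guess could be derived from such hints. It assumes the page exposes them in `<meta>` tags rather than only in a visible table, and it is not how Zotero's translators are actually written (those are JavaScript). The function name `guess_item_type` and the `journalArticle` fallback are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            self.pairs.append((name, content))


def guess_item_type(html):
    """Hypothetical rule: return "thesis" if any thesis hint is present."""
    parser = MetaCollector()
    parser.feed(html)
    for name, content in parser.pairs:
        # A dissertation institution tag only makes sense for a thesis.
        if name == "citation_dissertation_institution":
            return "thesis"
        # A Dublin Core type such as "master thesis" is also a strong hint.
        if name == "dc.type" and "thesis" in content.lower():
            return "thesis"
    # Assumed fallback, matching the behaviour reported in this thread.
    return "journalArticle"


print(guess_item_type('<meta name="DC.type" content="master thesis">'))
```

If the type appears only in an HTML table and not in a `<meta>` tag, a check like this finds nothing and falls through to the default.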
There is probably a way to write a translator, maybe calling an existing one, but I can't tell how much work that would involve, as I'm not sure what is available at the moment. There's a significant backlog of new translators waiting for review on GitHub as well...
Sorry if my comment was too general: I am aware that updates to at least some existing translators are processed efficiently enough, my perception of the new translator case is perhaps biased by my own experience. One of my PRs has been waiting for any kind of action for over a year ;-)
Some questions:
- Are these websites badly designed?
- Is the DataCite schema implemented incorrectly on those webpages?
- Isn't there a standard (ISO or otherwise) for using DataCite correctly on webpages?
- On @AbeJellinek's comment about DataCite XML: why can Zotero only translate JSON, but not XML? Wouldn't this be a feature worth improving in Zotero, if DataCite XML is as valid as DataCite JSON?
- If there is a correct way to use DataCite on webpages and some sites (like the examples here) are not following it, is there some way to put pressure on them to correct it? I mean, some declaration, or a foundation that oversees these implementations?
The "dc" on that page stands for Dublin Core, not DataCite. I think you may (understandably) be getting the two confused. Zotero supports importing Dublin Core metadata, but it needs to be in a machine-readable format, not just a table on the page. The actually machine-readable metadata made available by UNB Scholar is DataCite XML, which Zotero unfortunately doesn't yet support.
In any case, we might be able to start building a translator for relatively standard DSpace sites that handles things like the UNB Scholar Dublin Core metadata. I'll keep this thread updated.
https://jyx.jyu.fi/jyx/Record/jyx_123456789_100515
Here is another example. I don't know which item type best fits a score, but it should not be a webpage: https://bmlsh.ulpgc.es/item/213082w
Should this issue be attributed to Omeka or to Zotero?
Thanks again!
Even if these webpages don't have machine-readable metadata, some of them have buttons to export, for example to REFWORKS and MENDELEY (e.g. https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805).
I thought translators looked for such buttons, but that does not seem to be the case. Shouldn't they? Wouldn't there be a way to make translators scrape webpages for such export buttons?
https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805 saves pretty well for me. I get a journal article item with the correct title, date, and URL, and a full text PDF attachment. The main issues are:
1. It saves as a journal article. That's on them to correct by fixing their Highwire metadata.
2. The author name is split incorrectly. Some sites put all names in the same entry, some sites split them; we use heuristics to guess which format they're using, but those tend to fall apart on Iberian names. The translator currently can't tell that "San José Capilla, María Esther" is one name but "Alice Jones, John Smith" is two.
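To illustrate why the second issue is hard, here is a deliberately naive sketch that treats every comma as a separator between people. It is not the heuristic Zotero actually uses; `split_authors_naive` is a hypothetical name for demonstration only.

```python
def split_authors_naive(field):
    """Naive guess: every comma separates two different people."""
    return [part.strip() for part in field.split(",") if part.strip()]


# Two people listed in one field: the naive split is right.
print(split_authors_naive("Alice Jones, John Smith"))
# ['Alice Jones', 'John Smith']

# One person in "Last, First" order: the naive split wrongly yields two.
print(split_authors_naive("San José Capilla, María Esther"))
# ['San José Capilla', 'María Esther']
```

Both inputs have the same shape (several words, one comma, several words), so nothing in the string itself tells the two cases apart.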
I get similar/worse metadata when I import the RefWorks file they provide — it still saves as a journal article, and although the author name is split correctly, it incorrectly lists the university as an editor.
As I previously said, we're working on improved support for DSpace, which will address these issues with thesis repositories. There's no need to link any more thesis repository pages; they pretty much all use DSpace, and the problems are the same across all items. I'll keep this thread updated.
Let me just note another example where it happens: http://rave.ohiolink.edu/etdc/view?acc_num=osu1744247240174044
This is precisely an example where the thesis item type is detected correctly. Could it be assumed that a thesis has a single creator, even one with two surnames? Then again, a page could list the thesis advisor after the thesis author, so I am aware that there does not seem to be a robust solution.
If citation_author has one comma and there's only one listed institution and ORCID, we might be able to assume that it's a single author. But we're really getting into the weeds here, and for every couple articles that that heuristic fixes, it could break another.
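As a rough sketch of that rule of thumb (one comma, one institution, one ORCID), and of how it can misfire, something like the following could be tried. All function and parameter names are hypothetical, the ORCID is a placeholder, and this is not Zotero's actual code.

```python
def looks_like_single_author(author_field, institutions, orcids):
    """Guess that the field holds a single "Last, First" name."""
    return (
        author_field.count(",") == 1
        and len(institutions) == 1
        and len(orcids) == 1
    )


def parse_authors(author_field, institutions, orcids):
    if looks_like_single_author(author_field, institutions, orcids):
        last, first = (part.strip() for part in author_field.split(","))
        return [{"lastName": last, "firstName": first}]
    # Otherwise treat commas as separators between different people.
    return [{"name": p.strip()} for p in author_field.split(",") if p.strip()]


placeholder_orcid = ["0000-0000-0000-0000"]  # placeholder, not a real ORCID

# Fixed case: one person with two surnames is kept together.
print(parse_authors("San José Capilla, María Esther",
                    ["Example University"], placeholder_orcid))

# Broken case: two co-authors from the same institution, only one of whom
# has an ORCID, are now merged into a single, wrong "Last, First" name.
print(parse_authors("Alice Jones, John Smith",
                    ["Example University"], placeholder_orcid))
```

The second call shows the trade-off: the same signal that rescues the Iberian name also merges two genuine co-authors.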