Wrong item types detected by Zotero Connector (mostly, Journal Articles instead of theses)

iagogv · April 7, 2025

When I add theses from university repositories through the Zotero Connector, I often find they are saved as Journal Articles. From now on, I will post the examples I find here:
https://unbscholar.lib.unb.ca/handle/1882/38229
https://minerva.usc.gal/entities/publication/a86c0626-d258-4ffe-ba5a-aea9b3280ebc
https://repository.library.northeastern.edu/files/neu:ms36wk00b
https://mountainscholar.org/items/0797c24c-aca8-4eb1-965f-7c4463cafd59
https://scholars.wlu.ca/etd/2733/
https://vtechworks.lib.vt.edu/items/5cd36846-e693-49ea-88fe-3ebb0f917232
https://iris.unito.it/handle/2318/2065333?mode=simple
https://digitalcommons.usu.edu/etd2023/486

AbeJellinek · April 7, 2025

There unfortunately isn't anything in the page's machine-readable metadata marking it as a thesis — they should be setting citation_dissertation_institution. We're working on major improvements to the "generic" translators that we use for sites without site-specific support, though, and we should be able to address this as part of that effort.
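
To illustrate why that one tag matters, here is a minimal Python sketch of how a generic importer could choose an item type from Highwire-style meta tags. The function name `guess_item_type` and its simplified rules are assumptions for illustration only, not Zotero's actual translator code.

```python
def guess_item_type(meta: dict[str, str]) -> str:
    """Guess an item type from Highwire-style meta tag names.

    Hypothetical, simplified rules (not Zotero's real logic):
    - a thesis is recognised only by citation_dissertation_institution
    - any other citation_* metadata falls back to journal article
    - a page without citation_* tags is treated as a plain webpage
    """
    if "citation_dissertation_institution" in meta:
        return "thesis"
    if any(name.startswith("citation_") for name in meta):
        return "journalArticle"
    return "webpage"


# Illustrative metadata; the titles and institution are made up.
repository_page = {"citation_title": "An Example Thesis", "citation_author": "Doe, Jane"}
fixed_page = {**repository_page, "citation_dissertation_institution": "Example University"}

print(guess_item_type(repository_page))  # no thesis marker present
print(guess_item_type(fixed_page))       # thesis marker present
```

Under these assumed rules, a repository page only stops looking like a journal article once the site itself adds the thesis-specific tag.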

iagogv · April 8, 2025

@AbeJellinek Thanks for answering. I see in the Full Item page link (https://unbscholar.lib.unb.ca/items/568c48b7-9dbd-4bea-90ae-8154f3220524/full):
dc.type master thesis

and I believe this is enough machine-readable metadata marking the item as a thesis.

AbeJellinek · April 8, 2025

I wish we could use that metadata, but it isn't present on the main page, only on /full, and even there, it's just in a human-readable table with no machine-readable semantic markup. I'm not really sure what the point of having it is.

aborel · April 8, 2025

The interface seems to be the frontend of a DSpace 7 server; I can see some typical API URLs in my browser's network log.

There is probably a way to write a translator, maybe calling an existing one, but I can't tell how much work that would involve as I'm not sure what is available at the moment. There's a significant backlog of new translators waiting for review on GitHub as well...

AbeJellinek · April 8, 2025

Yeah, it's DSpace, but DSpace is way too diverse to write a single translator for all sites powered by it. The only structured metadata I'm seeing on that site in particular is DataCite XML, which we unfortunately don't have a translator for. (JSON, yes, but XML, no.)

> There's a significant backlog of new translators waiting for review on GitHub as well...

True, although many of those have pending comments that were never addressed by the authors (or are just no longer necessary and should be closed). As you can see from the commit history, we regularly merge new translator PRs!

aborel · April 8, 2025

I agree, DSpace instances can be quite different from each other; my unverified hypothesis was that there might be a minimal core on which one might rely. But embedded metadata of passable quality would be easier to deal with, of course, and is not an unreasonable requirement for repository admins.

Sorry if my comment was too general: I am aware that updates to at least some existing translators are processed efficiently enough, my perception of the new translator case is perhaps biased by my own experience. One of my PRs has been waiting for any kind of action for over a year ;-)

iagogv · May 1, 2025

Another example, but in this case the Zotero Connector does not detect a journal article, just a webpage: https://www.theseus.fi/handle/10024/875160

adamsmith · May 1, 2025

Same issue; the metadata is actually pretty decent here, but there's nothing thesis-specific, and you can actually see the broken 'type' field (DC.type in the metadata) after import, where they try to put three different languages in a single string. There's just no way to reasonably parse stuff like this.

iagogv · May 1, 2025

Thanks. Anyway, I will continue posting these pages, even if this is not a Zotero issue.

Some questions:
- Are these websites badly designed?
- Is the DataCite schema implemented incorrectly on those pages?
- Isn't there some standard (ISO or otherwise) for using DataCite correctly on webpages?
- Regarding @AbeJellinek's comment on DataCite XML: why can Zotero only translate JSON, but not XML? Wouldn't this be a feature worth improving in Zotero if DataCite XML is as valid as DataCite JSON?
- If there is a correct way to use DataCite on webpages and some sites (like the examples here) are not following it, is there some way to put pressure on them to correct it? I mean, some declaration, or a foundation overseeing these implementations?

AbeJellinek · May 1, 2025

> Why can Zotero only translate JSON, but not XML? Wouldn't this be a feature worth improving in Zotero if DataCite XML is as valid as DataCite JSON?

Because we haven't needed DataCite XML for anything in the past. I took a look at what would be involved in implementing it, and it seems pretty straightforward: just a 1:1-ish mapping to JSON.
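
As a rough illustration of what such a mapping could look like, here is a minimal Python sketch that reads a few DataCite XML elements into a JSON-like dictionary. The sample record is made up (it is not UNB Scholar's actual output), the function name `datacite_xml_to_dict` is hypothetical, and only a handful of fields are covered; it assumes the DataCite kernel-4 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the DataCite Metadata Schema, kernel 4 (assumed for this sketch).
NS = {"d": "http://datacite.org/schema/kernel-4"}

# Made-up sample record, for illustration only.
SAMPLE_XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <creators>
    <creator><creatorName>Doe, Jane</creatorName></creator>
  </creators>
  <titles><title>An Example Thesis</title></titles>
  <publicationYear>2024</publicationYear>
  <resourceType resourceTypeGeneral="Text">Thesis</resourceType>
</resource>"""


def datacite_xml_to_dict(xml_text: str) -> dict:
    """Map a small subset of DataCite XML onto JSON-style keys."""
    root = ET.fromstring(xml_text)
    resource_type = root.find("d:resourceType", NS)
    types = {}
    if resource_type is not None:
        types = {
            "resourceTypeGeneral": resource_type.get("resourceTypeGeneral"),
            "resourceType": resource_type.text,
        }
    return {
        "creators": [
            {"name": el.text}
            for el in root.findall("d:creators/d:creator/d:creatorName", NS)
        ],
        "titles": [{"title": el.text} for el in root.findall("d:titles/d:title", NS)],
        "publicationYear": root.findtext("d:publicationYear", default=None, namespaces=NS),
        "types": types,
    }


record = datacite_xml_to_dict(SAMPLE_XML)
print(record["types"])
```

In this sketch the free-text `resourceType` value is what would carry the thesis information; how reliably real repositories fill it in is a separate question.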

The "dc" on that page stands for Dublin Core, not DataCite. I think you may (understandably) be getting the two confused. Zotero supports importing Dublin Core metadata, but it needs to be in a machine-readable format, not just a table on the page. The actually machine-readable metadata made available by UNB Scholar is DataCite XML, which Zotero unfortunately doesn't yet support.

In any case, we might be able to start building a translator for relatively standard DSpace sites that handles things like the UNB Scholar Dublin Core metadata. I'll keep this thread updated.

iagogv · May 1, 2025

> The "dc" on that page stands for Dublin Core, not DataCite. I think you may (understandably) be getting the two confused.

Indeed. Sorry. My fault.

> In any case, we might be able to start building a translator for relatively standard DSpace sites that handles things like the UNB Scholar Dublin Core metadata. I'll keep this thread updated.

Thanks!

iagogv · May 1, 2025

Then my last question above becomes whether there is some way to ask websites that publish Dublin Core information to provide it in a machine-readable format. Can we say that Dublin Core metadata that is not machine-readable is not useful at all?

adamsmith · May 2, 2025

Yeah, any metadata (Dublin Core and other formats like Highwire -- the name for the citation_title etc. tags) should be in meta tags in the site header.
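
To make the difference concrete, here is a minimal Python sketch that collects `<meta name="..." content="...">` tags from a page, which is roughly the kind of markup a generic importer can work with. The two HTML snippets are made-up examples and the helper names are hypothetical; this is not how Zotero itself is implemented.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content:
            self.tags.append((name, content))


def collect_meta(html: str) -> list[tuple[str, str]]:
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.tags


# Made-up page with machine-readable metadata in the header.
WITH_META = (
    '<html><head><meta name="DC.type" content="Thesis">'
    '<meta name="citation_title" content="An Example Thesis"></head>'
    "<body></body></html>"
)

# Made-up page where the same information only appears in a visible table.
TABLE_ONLY = (
    "<html><head><title>Full item page</title></head><body>"
    "<table><tr><td>dc.type</td><td>master thesis</td></tr></table>"
    "</body></html>"
)

print(collect_meta(WITH_META))
print(collect_meta(TABLE_ONLY))
```

The second page yields nothing, even though a human reader can see the type in the table.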

iagogv · May 5, 2025

Another example where only a webpage is detected:
https://jyx.jyu.fi/jyx/Record/jyx_123456789_100515

AbeJellinek · May 5, 2025

@iagogv: That page has COinS metadata that gets prioritized over Embedded Metadata (for mostly historical reasons). Right-click the Zotero Connector toolbar button -> Save to Zotero -> Embedded Metadata.

iagogv · May 8, 2025

I have just discovered another project of the Corporation for Digital Scholarship, Omeka. According to what they say on their website:

> Omeka Classic is a web publishing platform for sharing digital collections and creating media-rich online exhibits.
>
> Create complex narratives and share rich collections, adhering to Dublin Core standards with Omeka Classic on your server, designed for scholars, museums, libraries, archives, and enthusiasts.

Therefore, I would expect their webpages to work well with the Zotero Connector. But, for example, if I try to capture https://omeka.svsu.edu/items/show/8452, I get a webpage instead of an artwork (even if I right-click the Zotero Connector button and choose Save to Zotero (COinS)).

Or another example. I don't know which item type would fit a score, but surely not a webpage: https://bmlsh.ulpgc.es/item/213082w

Should this issue be attributed to Omeka or to Zotero?

Thanks again!

iagogv · May 9, 2025

@AbeJellinek

Even if these webpages don't have machine-readable metadata, some of them have buttons to export, for example to REFWORKS and MENDELEY (e.g. https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805).

I thought translators looked for such buttons, but that does not seem to be how they behave. Shouldn't they? Wouldn't there be a way to make translators scrape webpages for such export buttons?

AbeJellinek · May 9, 2025

Omeka doesn't currently expose its RDF in the format that Zotero looks for. A lot of Omeka metadata wouldn't really map cleanly onto Zotero item types; some pages contain multiple items, which our generic translators can't yet support. It's possible that we'll be able to support Omeka items better once we have more robust generic translation support, which is coming, hopefully soon.

https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805 saves pretty well for me. I get a journal article item with the correct title, date, and URL, and a full text PDF attachment. The main issues are:

1. It saves as a journal article. That's on them to correct by fixing their Highwire metadata.

2. The author name is split incorrectly. Some sites put all names in the same entry, some sites split them; we use heuristics to guess which format they're using, but those tend to fall apart on Iberian names. The translator currently can't tell that "San José Capilla, María Esther" is one name but "Alice Jones, John Smith" is two.
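
The ambiguity can be shown with a small Python sketch. Both helper functions are hypothetical and deliberately naive; they are not the heuristics Zotero uses, only two fixed strategies that each get one of the two strings wrong.

```python
def split_on_commas(value: str) -> list[str]:
    """Strategy A: treat every comma as a separator between people."""
    return [part.strip() for part in value.split(",") if part.strip()]


def read_as_inverted_name(value: str) -> dict[str, str]:
    """Strategy B: treat the value as a single 'Last, First' name."""
    last, _, first = value.partition(",")
    return {"lastName": last.strip(), "firstName": first.strip()}


one_person = "San José Capilla, María Esther"
two_people = "Alice Jones, John Smith"

# Strategy A is right for two_people but splits one_person into two "authors".
print(split_on_commas(two_people))
print(split_on_commas(one_person))

# Strategy B is right for one_person but merges two_people into one "author".
print(read_as_inverted_name(one_person))
print(read_as_inverted_name(two_people))
```

Both strings have the same shape (text, one comma, text), so nothing in the string alone says which strategy applies.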

I get similar/worse metadata when I import the RefWorks file they provide — it still saves as a journal article, and although the author name is split correctly, it incorrectly lists the university as an editor.

As I previously said, we're working on improved support for DSpace, which will address these issues with thesis repositories. There's no need to link any more thesis repository pages; they pretty much all use DSpace, and the problems are the same across all items. I'll keep this thread updated.

AbeJellinek · May 9, 2025

@iagogv: That author name, and some others like it, should now save correctly.

iagogv · May 12, 2025

Thanks!

iagogv · May 12, 2025

@AbeJellinek BTW, regarding the name, your point on "Alice Jones, John Smith" against "San José Capilla, María Esther" is undeniable.
Let me just note another example where it happens: http://rave.ohiolink.edu/etdc/view?acc_num=osu1744247240174044

It is precisely an example where the thesis item type is correctly detected. Could it be assumed that the thesis type implies a single creator, even with two surnames? Then again, the metadata could list the thesis advisor after the thesis author, so I am aware that there does not seem to be a robust solution.

AbeJellinek · May 12, 2025

If citation_author has one comma and there's only one listed institution and ORCID, we might be able to assume that it's a single author. But we're really getting into the weeds here, and for every couple articles that that heuristic fixes, it could break another.
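
A minimal Python sketch of that idea, purely to show how fragile it is. The function name `looks_like_single_author` is hypothetical, the institution and ORCID values are placeholders, and this is not something Zotero implements.

```python
def looks_like_single_author(
    author_value: str, institutions: list[str], orcids: list[str]
) -> bool:
    """Guess that a citation_author value is one 'Last, First' name.

    Assumed rule: exactly one comma in the value, and exactly one
    institution and one ORCID listed on the page.
    """
    return author_value.count(",") == 1 and len(institutions) == 1 and len(orcids) == 1


# Placeholder values, for illustration only.
institutions = ["Example University"]
orcids = ["0000-0000-0000-0000"]

# The case the rule is meant to fix: one person with two surnames.
print(looks_like_single_author("San José Capilla, María Esther", institutions, orcids))

# A case it would break: two co-authors from the same institution,
# only one of whom has an ORCID listed.
print(looks_like_single_author("Alice Jones, John Smith", institutions, orcids))
```

Both calls return the same answer, which is the trade-off described above: the rule repairs the first record and damages the second.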