• There unfortunately isn't anything in the page's machine-readable metadata marking it as a thesis — they should be setting citation_dissertation_institution. We're working on major improvements to the "generic" translators that we use for sites without site-specific support, though, and we should be able to address this as part of that effort.
  • @AbeJellinek Thanks for answering. I see in the Full Item page link (https://unbscholar.lib.unb.ca/items/568c48b7-9dbd-4bea-90ae-8154f3220524/full):
    dc.type master thesis

    and I believe this is enough machine-readable metadata marking the item as a thesis.
  • I wish we could use that metadata, but it isn't present on the main page, only on /full, and even there, it's just in a human-readable table with no machine-readable semantic markup. I'm not really sure what the point of having it is.
  • edited April 8, 2025
    The interface seems to be the front end of a DSpace 7 server; I can see some typical API URLs in my browser's network log.

    There is probably a way to write a translator, maybe calling an existing one, but I can't tell how much work that would involve as I'm not sure what is available at the moment. There's a significant backlog of new translators waiting for review on GitHub as well...
  • Yeah, it's DSpace, but DSpace is way too diverse to write a single translator for all sites powered by it. The only structured metadata I'm seeing on that site in particular is DataCite XML, which we unfortunately don't have a translator for. (JSON, yes, but XML, no.)
    There's a significant backlog of new translators waiting for review on GitHub as well...
    True, although many of those have pending comments that were never addressed by the authors (or are just no longer necessary and should be closed). As you can see from the commit history, we regularly merge new translator PRs!
  • edited April 8, 2025
    I agree, DSpace instances can be quite different from each other; my unverified hypothesis was that there might be a minimal core on which one might rely. But embedded metadata of passable quality would be easier to deal with, of course, and it's not an unreasonable requirement for repository admins.

    Sorry if my comment was too general: I am aware that updates to at least some existing translators are processed efficiently enough, my perception of the new translator case is perhaps biased by my own experience. One of my PRs has been waiting for any kind of action for over a year ;-)
  • Another example, but in this case the Zotero Connector detects just a webpage rather than a journal article: https://www.theseus.fi/handle/10024/875160
  • Same issue; the metadata is actually pretty decent here, but there's nothing thesis-specific, and you can actually see the broken 'type' field (DC.type in the metadata) after import, where they try to put three different languages in a single string. There's just no way to reasonably parse stuff like this.
  • Thanks. Anyway, I will continue posting these websites, even if this is not a Zotero issue.

    Some questions:
    - Are these websites badly designed?
    - Is the DataCite schema implemented incorrectly on those pages?
    - Isn't there some standard (ISO or otherwise) for using DataCite correctly on webpages?
    - On @AbeJellinek's comment about DataCite XML: why can Zotero translate only JSON, but not XML? Wouldn't this be a feature worth adding to Zotero, if DataCite XML is as valid as DataCite JSON?
    - If there is a correct way to use DataCite on webpages and some sites (like the examples here) are not following it, is there some way to put pressure on them to fix it? I mean, some declaration, or a foundation monitoring these implementations?
  • edited 12 days ago
    Why can Zotero translate only JSON, but not XML? Wouldn't this be a feature worth adding to Zotero, if DataCite XML is as valid as DataCite JSON?
    Because we haven't needed DataCite XML for anything in the past. I took a look at what would be involved in implementing it; it seems pretty straightforward, just a 1:1-ish mapping to JSON.

    The "dc" on that page stands for Dublin Core, not DataCite. I think you may (understandably) be getting the two confused. Zotero supports importing Dublin Core metadata, but it needs to be in a machine-readable format, not just a table on the page. The actually machine-readable metadata made available by UNB Scholar is DataCite XML, which Zotero unfortunately doesn't yet support.

    In any case, we might be able to start building a translator for relatively standard DSpace sites that handles things like the UNB Scholar Dublin Core metadata. I'll keep this thread updated.
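To illustrate what a "1:1-ish mapping" from XML to JSON can look like, here is a minimal Python sketch. It is a generic element-to-dict conversion, not Zotero's code and not the official DataCite JSON layout. The sample record is invented and only loosely modelled on DataCite's element names, so it should be checked against the actual schema; `local_name` and `element_to_dict` are hypothetical helper names.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Convert an element into plain dicts, lists and strings, recursively."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text  # leaf element: just its text
    node = {}
    if element.attrib:
        node["@attributes"] = dict(element.attrib)
    if text:
        node["#text"] = text
    for child in children:
        # Repeated child elements are collected into a list under one key.
        node.setdefault(local_name(child.tag), []).append(element_to_dict(child))
    return node


# Hypothetical, simplified record invented for illustration.
SAMPLE = """
<resource xmlns="http://datacite.org/schema/kernel-4">
  <titles><title>An Example Thesis</title></titles>
  <creators><creator><creatorName>Jones, Alice</creatorName></creator></creators>
  <publicationYear>2024</publicationYear>
  <resourceType resourceTypeGeneral="Dissertation">master thesis</resourceType>
</resource>
"""

record = element_to_dict(ET.fromstring(SAMPLE.strip()))
print(json.dumps(record, indent=2))
```

A real translator would still need a second step that maps fields such as the resource type onto Zotero item types; that step is not shown here.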
  • The "dc" on that page stands for Dublin Core, not DataCite. I think you may (understandably) be getting the two confused.
    Indeed. Sorry. My fault.
    In any case, we might be able to start building a translator for relatively standard DSpace sites that handles things like the UNB Scholar Dublin Core metadata. I'll keep this thread updated.
    Thanks!
  • edited 12 days ago
    Then my earlier last question becomes: is there some way to ask sites that have Dublin Core information to provide it in a machine-readable format? Could we say that Dublin Core that isn't machine-readable is of no use at all?
  • Yeah, any metadata (Dublin Core and other formats like Highwire -- the name for the citation_title etc. tags) should be in meta tags in the site header.
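As a rough illustration of what "machine-readable" means here, the sketch below reads `<meta>` tags from a page header using only the Python standard library and checks for thesis-specific metadata. The sample header is invented, `MetaTagCollector` and `looks_like_thesis` are hypothetical names, and the substring test on `DC.type` is an assumption that would miss non-English or multi-language values; this is not how Zotero's translators are implemented.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = {}  # lower-cased meta name -> list of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.tags.setdefault(name.lower(), []).append(content)


def looks_like_thesis(html):
    """Very rough check: does the page carry thesis-specific meta tags?"""
    collector = MetaTagCollector()
    collector.feed(html)
    if "citation_dissertation_institution" in collector.tags:
        return True
    # Assumption: an English "thesis" substring in DC.type is good enough.
    types = collector.tags.get("dc.type", [])
    return any("thesis" in value.lower() for value in types)


# Hypothetical page header, invented for illustration.
SAMPLE = """
<html><head>
<meta name="citation_title" content="An Example Thesis">
<meta name="citation_author" content="Jones, Alice">
<meta name="citation_dissertation_institution" content="Example University">
<meta name="DC.type" content="master thesis">
</head><body></body></html>
"""

print(looks_like_thesis(SAMPLE))
```

The same information shown only in a visible HTML table would not be picked up by this kind of check, which is the problem described above for the /full page.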
  • edited 8 days ago
    Another example where only a webpage is detected:
    https://jyx.jyu.fi/jyx/Record/jyx_123456789_100515
  • @iagogv: That page has COinS metadata that gets prioritized over Embedded Metadata (for mostly historical reasons). Right-click the Zotero Connector toolbar button -> Save to Zotero -> Embedded Metadata.
  • edited 4 days ago
    I have just discovered another project of the Corporation for Digital Scholarship, Omeka. According to their website:
    Omeka Classic is a web publishing platform for sharing digital collections and creating media-rich online exhibits.
    Create complex narratives and share rich collections, adhering to Dublin Core standards with Omeka Classic on your server, designed for scholars, museums, libraries, archives, and enthusiasts.
    Therefore, I would expect their webpages to work well with the Zotero Connector. But, for example, if I try to capture https://omeka.svsu.edu/items/show/8452, I get a webpage instead of an artwork (even if I right-click the Zotero Connector button and choose Save to Zotero (COinS)).

    Or another example: I don't know which item type best fits a score, but surely not a webpage: https://bmlsh.ulpgc.es/item/213082w

    Should this issue be attributed to Omeka or to Zotero?

    Thanks again!
  • @AbeJellinek

    Even if these webpages don't have machine-readable metadata, some of them have buttons to export, for example to REFWORKS and MENDELEY (e.g. https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805).

    I thought translators looked for such buttons, but it seems that is not how they behave. Shouldn't they? Wouldn't there be a way to have translators scrape webpages for such export buttons?
  • edited 4 days ago
    Omeka doesn't currently expose its RDF in the format that Zotero looks for. A lot of Omeka metadata wouldn't really map cleanly onto Zotero item types; some pages contain multiple items, which our generic translators can't yet support. It's possible that we'll be able to support Omeka items better once we have more robust generic translation support, which is coming, hopefully soon.

    https://minerva.usc.gal/entities/publication/9a4fd001-4717-428f-96a5-44812f8f3805 saves pretty well for me. I get a journal article item with the correct title, date, and URL, and a full text PDF attachment. The main issues are:

    1. It saves as a journal article. That's on them to correct by fixing their Highwire metadata.

    2. The author name is split incorrectly. Some sites put all names in the same entry, some sites split them; we use heuristics to guess which format they're using, but those tend to fall apart on Iberian names. The translator currently can't tell that "San José Capilla, María Esther" is one name but "Alice Jones, John Smith" is two.

    I get similar/worse metadata when I import the RefWorks file they provide — it still saves as a journal article, and although the author name is split correctly, it incorrectly lists the university as an editor.

    As I previously said, we're working on improved support for DSpace, which will address these issues with thesis repositories. There's no need to link any more thesis repository pages; they pretty much all use DSpace, and the problems are the same across all items. I'll keep this thread updated.
  • @iagogv: That author name, and some others like it, should now save correctly.
  • @AbeJellinek BTW, regarding the name, your point on "Alice Jones, John Smith" against "San José Capilla, María Esther" is undeniable.
    Let me just note another example where it happens: http://rave.ohiolink.edu/etdc/view?acc_num=osu1744247240174044

    It is precisely an example where the thesis item type is correctly detected. Could it be assumed that a thesis has a single creator, even one with two surnames? Then again, a page could list the thesis advisor after the thesis author, so I am aware that there doesn't seem to be a robust solution.
  • If citation_author has one comma and there's only one listed institution and ORCID, we might be able to assume that it's a single author. But we're really getting into the weeds here, and for every couple articles that that heuristic fixes, it could break another.
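The heuristic floated above can be sketched in a few lines of Python. This is only an illustration of the idea, not Zotero's translator code: `probably_single_author` is a hypothetical helper, its inputs are assumed to be lists of strings already pulled from the page's meta tags, and "only one listed institution and ORCID" is read here as exactly one of each.

```python
def probably_single_author(citation_authors, institutions, orcids):
    """Guess whether one citation_author value holds a single "Last, First" name.

    Hypothetical helper: each argument is a list of strings taken from
    the page's metadata.
    """
    if len(citation_authors) != 1:
        return False  # several tags: the site already split the names
    one_comma = citation_authors[0].count(",") == 1
    return one_comma and len(institutions) == 1 and len(orcids) == 1


# Placeholder institution and ORCID values, invented for illustration.
print(probably_single_author(
    ["San José Capilla, María Esther"], ["Example University"], ["orcid-1"]))
print(probably_single_author(
    ["Alice Jones, John Smith"], ["Example University"], ["orcid-1", "orcid-2"]))
print(probably_single_author(
    ["Alice Jones, John Smith"], ["Example University"], ["orcid-1"]))
```

The third call is the kind of case the heuristic gets wrong: two co-authors from one institution with only one ORCID listed are treated as a single person.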