Automatic metadata retrieval after browser 'Save to Zotero (PDF)' not fetching source URL

floriand · May 7, 2020

Dear Zotero team, I have a question, but let me first take the opportunity to thank you for all your great work. Zotero makes my life a lot smoother and easier!

The question/idea, chronologically:

1. After saving a PDF directly from the browser using 'Save to Zotero (PDF)'...

...2.a -> The metadata retrieval does not find anything, so the PDF remains 'naked' (without a parent item) in the folder.
* In this case, when doing 'Create Parent Item', the original URL is transferred to the 'URL' field of the new Zotero item [ -> :) ]

or

...2.b -> The metadata retrieval does find content, so a new Zotero item gets created.
* In this case, the 'URL' field of the new Zotero item remains blank (although the original PDF attachment is still associated with this URL).

My question: in case 2.b (automatic metadata retrieval), is there an option to associate the original URL from which the PDF was fetched with the new item? Or, if that is not possible for now, could it be implemented? It seems like a logical thing to do, but maybe I don't see the full picture, or there are other reasons not to have this functionality.

Best regards,

Florian

dstillman · May 7, 2020

Yes, this is by design.

First, the attachment has the original URL in both cases, so you can always get it from there (though it'd be better if you could copy it to the clipboard without clicking it and loading the URL).

The URL field for a regular item contains the URL that would normally be cited. In the case of a scholarly article, that's the publisher's abstract page, not the URL of the PDF itself.

2a transfers the URL to the parent item because, when metadata retrieval doesn't work, there's a greater chance that it's a non-academic document where the PDF itself is the canonical version of the item. Say, a report that an organization puts out, where there's no webpage representing the report, just the PDF itself that's linked from somewhere.

For 2b, in most cases Zotero is actually getting high-quality official metadata from a resource like Crossref, which usually includes the publisher URL if there is one. It's not appropriate to include the PDF URL in that URL field in that case.

There are some cases where the PDF can't be recognized but Zotero can still extract certain basic fields like the title (in which case it will say "Zotero" for Library Catalog), and those sort of blur the line, but since it uses the same PDF recognition mechanism, it doesn't transfer the URL.

In any case, if anything it's 2a that would change to not transfer the URL, not the other way around, since the PDF URL isn't generally what's expected to be cited for most things people are putting into Zotero.
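
The rule described above can be sketched roughly as follows. This is only an illustration of the behavior discussed in this thread, not Zotero's actual code: the `Attachment` and `ParentItem` classes, the `create_parent` function, the dictionary keys, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attachment:
    """A saved PDF; it keeps the URL it was downloaded from in both cases."""
    url: str


@dataclass
class ParentItem:
    """A regular item; `url` is the field that ends up in citations."""
    title: str
    url: Optional[str] = None
    library_catalog: Optional[str] = None


def create_parent(pdf: Attachment, retrieved: Optional[dict]) -> ParentItem:
    """Hypothetical sketch of cases 2a and 2b from this thread."""
    if retrieved is None:
        # Case 2a: nothing was recognized and the user runs 'Create Parent Item'.
        # The PDF may be the canonical version of the document, so its URL
        # is copied to the parent item.
        return ParentItem(title="", url=pdf.url)
    # Case 2b: only a URL supplied by the metadata source is used.
    # The PDF URL is never copied, even if the source provides no URL.
    return ParentItem(
        title=retrieved.get("title", ""),
        url=retrieved.get("url"),
        library_catalog=retrieved.get("catalog"),
    )


pdf = Attachment(url="https://example.org/files/report.pdf")
unrecognized = create_parent(pdf, None)
recognized = create_parent(
    pdf,
    {"title": "A Paper", "url": "https://publisher.example/abstract/1", "catalog": "Crossref"},
)
title_only = create_parent(pdf, {"title": "Only a title", "catalog": "Zotero"})
```

Here `unrecognized.url` is the PDF URL, `recognized.url` is the publisher page, and `title_only.url` stays empty, which is the blurry case mentioned above. The attachment's own `url` is unchanged throughout.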

floriand · November 30, 2020

Hey Daniel,

Sorry for the delay in coming back to this issue. Thanks a lot for your explanation. I can follow the logic to a great extent, but for '2b' above (automatic metadata retrieval), my reasoning was 'better something than nothing'. If automatic retrieval finds a URL (either through a repository or PDF extraction), I assume it is best to attach this URL. But if no URL is found, why not attach the URL of the original PDF?

I use Zotero intensively to have a structured way of referring to sources and of organising and sharing references (not limited to academic publications with a Crossref/DOI reference), and it would ease things to retain this source URL if no better one is found.

dstillman · December 2, 2020

We just can't really vouch for the PDF URL, so it's not appropriate to put it in a field that's used in automatically generated citations. It might be from some unofficial source (e.g., clicking a random PDF link from Google Scholar), it might include an institution-specific proxy, it might be a long randomly generated CDN URL, it might be a URL that expires after a few minutes, it might be from a source of dubious legality… People are obviously ultimately responsible for their own data, but as a rule, we try not to add anything to the item metadata that can't fairly safely and reliably be added to citations. That's particularly true for something like a URL where many people wouldn't necessarily understand what it's representing or revealing.
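
To make a few of these concerns concrete, here is a rough sketch of how some of them could be spotted in a URL. It is purely illustrative and is not something Zotero does: the `citation_concerns` function, the hint lists, and the length threshold are all assumptions made up for this example, and the sample URLs are invented.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit

# Hypothetical heuristics, chosen only for illustration.
PROXY_HINTS = ("ezproxy",)                       # common institutional proxy marker
EXPIRY_PARAMS = {"expires", "x-amz-expires", "signature", "token"}
MAX_LENGTH = 200                                 # arbitrary "too long to cite" cutoff


def citation_concerns(url: str) -> List[str]:
    """Return reasons why a PDF URL might be unsuitable for a citation."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    concerns = []
    if any(hint in host for hint in PROXY_HINTS):
        concerns.append("institutional proxy")
    # Compare query parameter names case-insensitively.
    params = {name.lower() for name in parse_qs(parts.query, keep_blank_values=True)}
    if params & EXPIRY_PARAMS:
        concerns.append("may expire")
    if len(url) > MAX_LENGTH:
        concerns.append("very long URL")
    return concerns


proxied = citation_concerns("https://www-publisher-com.ezproxy.example.edu/doi/pdf/10.1000/xyz")
signed = citation_concerns("https://cdn.example.net/a.pdf?Expires=1600000000&Signature=abc")
plain = citation_concerns("https://example.org/report.pdf")
```

Checks like these can only catch a few surface patterns. They cannot tell whether a source is official or legal, which is part of why the PDF URL can't simply be vouched for automatically.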