Source of BibTeX data in a PDF file.

louarnold · April 24, 2021

I linked a PDF file into Zotero. It subsequently displayed BibTeX data for the article. How do I know if that data came from within the PDF or from web sources? (I don't want the web sources to creep in.) I realize the data is rarely in a PDF, but I need a way of telling.

Regards,
Lou.

dstillman · April 24, 2021

Zotero isn't a BibTeX manager and doesn't display "BibTeX data".

It attempts to identify the PDF and retrieve high-quality, canonical metadata from the publisher. I don't know what you're trying to imply by "web sources", but if you're using Zotero you presumably want high-quality data for your items.

https://www.zotero.org/support/retrieve_pdf_metadata

louarnold · April 24, 2021

OK. Let me provide some more detail so we can skip the terminology barrier.

I link a PDF to a library. Zotero first displays, in the rightmost pane, a 'Note' frame containing the title, filename, #pages, modified date, Indexed indicator, and Related[click here] and Tags[click here]. I will call this metadata since it is quite limited.

Then seconds later much more information is displayed in the rightmost pane: Item type, Title, Authors, Abstract, Publication, and later in the list "DOI.org", which I assume is the source of all the above data. This is its own 'bib data' as I choose to call it now.

That bib data may exist in the PDF file and Zotero may have spent a few seconds retrieving it from the PDF file and formatting it for display, or Zotero may have retrieved it from the web somewhere, perhaps from DOI.org. (Documentation states that Zotero does go out to the web and somehow finds this data.)

I need to know unambiguously where this data comes/came from. How would I know which of these scenarios happened?

Thanks.
Lou.

bwiernik · April 24, 2021

It doesn’t come from the PDF. See https://www.zotero.org/support/retrieve_pdf_metadata

adamsmith · April 24, 2021

And specifically, the library catalog field tells you which web service Zotero used for it, so in this case doi.org (I think it would actually say which DOI registration service in parentheses after that).

louarnold · April 25, 2021

I'll take you at your word. But JabRef and Docear can write XMP data into PDF files; I have done it - I think. That's what I am trying to prove.

But that begs the question: Isn't the XMP data the same as what is often called metadata; not the same values, but the same kinds of information?

adamsmith · April 25, 2021

Technically, yes, XMP tags _can_ hold metadata about items in a PDF, and yes, JabRef (and Docear, which uses/used JabRef at its core) do write XMP metatags.

Practically, XMP tags in scholarly PDFs are so frequently useless and even misleading that it's not worth it trying to parse them in the (imo correct) judgment of Zotero's developers.
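
For anyone who wants to see what XMP data is actually embedded in a given file, independent of Zotero, here is a minimal sketch in Python. It is not how Zotero or JabRef work; it only scans the raw bytes for the standard `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers that wrap an XMP packet. It assumes the packet is stored uncompressed and unencrypted, which is common but not guaranteed, so an empty result does not prove the PDF has no XMP. The function name `find_xmp_packets` and the sample bytes are made up for illustration.

```python
import re

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
# Assumption: the packet sits in the file uncompressed, so a plain byte scan can see it.
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(pdf_bytes: bytes) -> list[str]:
    """Return the body of every uncompressed XMP packet found in the raw bytes."""
    return [
        match.group(1).decode("utf-8", errors="replace").strip()
        for match in XPACKET.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    # Hypothetical stand-in for a real file; for a real PDF, read it with
    # open(path, "rb").read() and pass the bytes in instead.
    sample = (
        b"%PDF-1.4\n"
        b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><dc:title>Demo</dc:title></x:xmpmeta>\n'
        b'<?xpacket end="w"?>\n'
        b"%%EOF"
    )
    for packet in find_xmp_packets(sample):
        print(packet)
```

Whatever this prints is what is stored inside the file itself; anything Zotero shows beyond that came from elsewhere.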

louarnold · April 25, 2021

I assume you are correct, but that's why I want to determine what metadata is in the file itself. I can't know that if Zotero doesn't display what's internal, if it always goes to the web to get the metadata for that PDF. I have been trying to point people to this question for several posts, but the discussion always gets sidetracked.

All this goes to long-term research: re-using PDFs in later papers, and being able to regenerate, in the future, old published papers in their exact submission content, and yet use the old PDFs, without the old metadata, for a new paper. See a web page about using Docear: "Sustainable Research...Part II" by Saul Albert. Later in the document, there is a paragraph numbered "3" which discusses a way to structure information for multi-project use.

https://saulalbert.net/blog/sustainable-research-literature-management-with-docearii/

adamsmith · April 25, 2021

I'm not sure what you're asking then. If you don't believe that what we're saying is correct, the only way you'll get more "proof" is by looking at the source code, which you're obviously welcome to do.

If you're asking how to ensure the long-term usability of entries in Zotero, that's a question about the most meaningful unit. For Zotero, that's not a PDF, but the metadata entry for an item: items can have attachments in different file formats, or no attachments at all. The way Zotero ensures long-term usability is by making sure the metadata entries are easily exportable (along with attached files) in a large number of standard file formats.

louarnold · April 25, 2021

[Re: I'm not sure what you're asking then. If you don't believe that what we're saying is correct, the only way you'll get more "proof" is by looking at the source code, which you're obviously welcome to do.]
What is there to believe? You haven't answered the question. Someone says the PDF has no metadata, yet you say it has, but it's poor. What I want to know is: How can I tell what Zotero is displaying, the internal data or the current web-sourced data?

adamsmith · April 25, 2021

As bwiernik says above:

It doesn’t come from the PDF.

Zotero doesn't use XMP tags in any way whatsoever. All metadata Zotero displays comes from the internet.

louarnold · April 25, 2021

Good answer. Since I want to know what's internal, I'll look for other software that can help.
Thanks for your help.
Don't let that little COVID thing getchya. :).