Retrieve PDF Metadata: a suggestion

jhmorneau · November 16, 2022

Hi,

In the old days, when one used the "Retrieve PDF Metadata" function on a PDF that did not contain a DOI, Zotero would tell you it could not retrieve anything and would leave the PDF in the central column. So this function would basically only work with PDFs of (recent) articles.

At some point in the past, this function was modified.
Now, if Zotero does not find an embedded identifier in the PDF file, it will nonetheless create a hollow and flawed Zotero item with just the title and the number of pages.

This is all the trickier if you have left this function in automatic mode in the "Preferences" panel (which is the default setting).

Indeed, as soon as you drop a new PDF file in Zotero, an item is created and you might think that the job has been (well) done and forget about it. And if you don't carefully check the item that was produced, you may end up with a great number of flawed items in your library.

Students who are not seasoned Zotero users will easily fall into that trap and realize that something's wrong only much later on, after having Zotero produce a first bibliography.

This problem happens especially when trying to retrieve metadata from theses available online in open archive repositories. When I realized that, I decided to explicitly advise students to switch off automatic retrieval of PDF metadata.

So I think this modification is totally counter-productive.
I advise you to go back to the initial setting: instead of creating a flawed item for the PDF, Zotero would be much wiser to tell the user that it could not retrieve the metadata.

Thanks for your help,
Best regards,
JH Morneau

adamsmith · November 16, 2022

People really hated that this failed as much as it did and left them to manually create the parent item. The change here was quite popular and I don't see Zotero going back.

The better thing to teach students would be to rely on retrieve metadata as little as possible and to import things like theses via the repository (landing) page for the thesis rather than importing the PDF.

aborel · November 16, 2022

The problem shouldn't happen if one follows the recommended method to enter references, i.e. use the browser to download the metadata and the PDF together https://www.zotero.org/support/adding_items_to_zotero . This should be the advice given to the students, shouldn't it?
Entering a PDF directly will almost always yield a poorer record (even when a DOI is recognized, the abstract will typically be missing).

jhmorneau · November 22, 2022

I get your point, but I felt it was important to share what I think is a counter-productive effect of this change in parameters.

Indeed, if students read and followed the recommended method, everything would be fine. But many of them don't and will go for the easiest and least time-consuming method for creating references.

And the fact that metadata retrieval from PDF is so quick makes it harder for the user to spot the problems.

Maybe a solution could involve some kind of feedback from Zotero when a parent item is created without being based on an identifier?
Any type of feedback would help raise the user's awareness and prompt them to check the quality of the reference Zotero produced...

dstillman · November 22, 2022

The premise here is wrong, though, on two different fronts:

1) If Zotero can't detect anything at all, it will still leave a standalone PDF. The difference is that — almost five years ago at this point — we added a completely new PDF recognition system that, if it can't find an identifier, still tries to pull out at least some basic metadata from the item, including title, authors, and page number, to minimize the amount you have to type manually. Zotero is designed to save you time, so of course it's going to do that if it can.

2) Teaching students to turn off metadata retrieval is awful, misguided advice, and I'd implore you to stop doing that. Even if we always recommend saving from an article page when possible, PDF metadata retrieval is an extremely useful feature for PDFs that do have identifiers, which is the vast majority of current academic PDFs.

The most basic thing to teach students who will be creating citations is that they need to check the metadata for each item they import, no matter what tool they use and no matter how they import the data. This is a universal rule of reference managers and is in no way specific to Zotero.

jhmorneau · November 22, 2022

I strongly disagree with your analysis.

From a thesis PDF, the setting creates an item with a "Journal article" type, and only a title and page number.

Therefore, my opinion is that this setting is currently totally unsatisfactory.

And to teach students to turn off this function is, to my mind, a no-brainer.

I advise the students to only use it manually on article PDFs and carefully check the result. I also advise them to systematically create thesis items from our national higher education catalogue.

Problem solved.

You may not agree with me, but let's agree to disagree and refrain from making derogatory remarks.

adamsmith · November 22, 2022

we added a completely new PDF recognition system that, if it can't find an identifier, still tries to pull out at least some basic metadata from the item, including title, authors, and page number, to minimize the amount you have to type manually.

I do wonder if some sort of flag for those items might make sense. I think that's what jhmorneau suggests above, too. Mendeley, e.g., did that -- for their significantly worse metadata, admittedly.

I understand the point that one should review all metadata (and it's definitely the thing to teach students), but honestly, CrossRef metadata and Zotero web translator metadata are almost always good enough for references in, e.g., an initial submission of an article, so I suspect many people don't routinely do this. Flagging items where it definitely needs to happen would be quite helpful (I know I can do this with the library catalog field and a saved search).
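
For anyone who wants to try the saved-search idea outside Zotero, here is a minimal sketch of the same filter in Python. It assumes that items created by the fallback PDF recognition carry the value "Zotero" in their Library Catalog field, and that you have your items as plain dictionaries with a `libraryCatalog` key; the function name and the sample records are hypothetical and only there for illustration.

```python
from typing import Dict, Iterable, List

# Assumption: items recognized without an identifier carry this value
# in their Library Catalog field.
FALLBACK_CATALOG = "Zotero"


def needs_review(items: Iterable[Dict]) -> List[Dict]:
    """Return the items whose Library Catalog matches the fallback value."""
    return [
        item for item in items
        if (item.get("libraryCatalog") or "").strip() == FALLBACK_CATALOG
    ]


# Hypothetical sample records, not real library data.
sample = [
    {"title": "A thesis saved from PDF", "libraryCatalog": "Zotero"},
    {"title": "An article with a DOI", "libraryCatalog": "Crossref"},
    {"title": "A manually created report"},
]

for item in needs_review(sample):
    print(item["title"])
```

Inside Zotero itself, the equivalent would be a saved search on the Library Catalog field, which needs no code at all.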

dstillman · November 22, 2022

From a thesis PDF, the setting creates an item with a "Journal article" type, and only a title and page number.

It depends on the PDF. Depending on the formatting, it will save authors, abstracts, and other fields as well. But even if it only saves a title, so what? That's a title that doesn't need to be typed manually. When a PDF can't be recognized, there's a very good chance there's no good source for metadata and the remaining details will need to be manually filled in (e.g., any random report or generic PDF document that people save directly to Zotero).

If you don't like the auto-created item and think there's a better source, you can right-click and choose Undo Retrieve Metadata (until Zotero restart). A future version of Zotero will also make it possible to update the metadata for an existing item by adding a DOI or other identifier, for cases where that's relevant.

When there actually is high-quality metadata available somewhere, teaching them to save from there is exactly what we recommend — but then why also have them waste time by manually running metadata retrieval on other PDFs? This is a feature that retrieves high-quality metadata for the vast majority of academic PDFs and that is frequently described as "magic" and one of Zotero's best features. Teaching them to turn that off in favor of a manual process just seems like a real disservice. (Honestly, this is only a configurable setting at all for privacy reasons, since it involves a request to Zotero servers.)

I do wonder if some sort of flag for those items might make sense.

@adamsmith: We can certainly consider that, but I don't think it's particularly specific to the basic items discussed here — more to the general point that people should always review metadata. As far as I know, Mendeley did it for all new items, and I assume we would do the same. (This is also perhaps related to the idea of being able to auto-tag new items.)

adamsmith · November 23, 2022

I don't think it's particularly specific to the basic items discussed here — more to the general point that people should always review metadata. As far as I know, Mendeley did it for all new items, and I assume we would do the same.

I'm not exactly a Mendeley expert ;)) but my recollection was that they did this for items where they (almost always rightly) suspected that the metadata was particularly bad -- I think the "Zotero" catalog items really are in a very different category in terms of metadata quality (you *always* have to edit them, which very much isn't the case for CrossRef data), so I would indeed have suggested handling them differently, yes.

jhmorneau · November 23, 2022

I think a flag would indeed do the trick!
It would signal the need for the item to be reviewed by the user.

As for the context: many of our students use web search engines and find PDF versions of theses on open archive repositories. They save the files on their computer and that's it. They don't pay attention to the source of the file or the related metadata.

When we teach them about Zotero, they want Zotero to create the items automatically out of the PDFs they have already downloaded, because it saves them precious time. Unfortunately, as it is, the results are almost always unsatisfactory, as I explained before.

If you want to give it a try, I suggest you download a random PDF thesis from https://theses.fr/ and/or https://dumas.ccsd.cnrs.fr/ and see the result of PDF metadata retrieval.

For example:
https://theses.fr/2021CYUN1084/document
https://dumas.ccsd.cnrs.fr/dumas-01644833/document

As we do have a national online higher education union catalogue (http://www.sudoc.abes.fr/) where the theses are properly described, the best way to proceed is to create the item from that source.