Available for beta testing: new PDF recognizer

dstillman · February 23, 2018

The latest Zotero beta features a redesigned PDF recognizer that no longer relies on Google Scholar and should allow the recognition of an essentially unlimited number of PDFs without throttling.

PDFs are now recognized using a Zotero-designed web service that operates on the first few pages of text, combining extraction algorithms with known metadata from CrossRef. As before, this is paired with CrossRef and ISBN lookups in the client. The Zotero lookup service doesn't require a Zotero account, and we don't log any data about the content or results of searches. No data is sent to Google Scholar anymore.
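
To make the client-side identifier lookup a bit more concrete, here is a minimal Python sketch of pulling DOI and ISBN-13 candidates out of the text of a PDF's first pages. It is an illustration only, not Zotero's actual code: the function names and regular expressions are assumptions, and the DOI pattern is a loose approximation that will miss some older DOIs.

```python
import re

# Loose DOI pattern (an assumption for this sketch, not Zotero's own regex).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
# ISBN-13 candidates: 978/979 prefix followed by digits, hyphens, or spaces.
ISBN_CANDIDATE_RE = re.compile(r"\b97[89][\d\- ]{10,14}\d\b")


def find_dois(text):
    """Return DOI-like strings in order of first appearance, without duplicates."""
    found = []
    for match in DOI_RE.finditer(text):
        # Sentence punctuation often sticks to the end of a DOI in running text.
        doi = match.group(0).rstrip(".,;:)]}'\"")
        if doi.lower() not in (d.lower() for d in found):
            found.append(doi)
    return found


def is_valid_isbn13(digits):
    """Check the ISBN-13 checksum (weights alternate 1 and 3)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def find_isbn13s(text):
    """Return checksum-valid ISBN-13s as bare digit strings."""
    found = []
    for match in ISBN_CANDIDATE_RE.finditer(text):
        digits = re.sub(r"[\- ]", "", match.group(0))
        if is_valid_isbn13(digits) and digits not in found:
            found.append(digits)
    return found


sample = "See http://dx.doi.org/10.1037/apl0000081. ISBN 978-0-306-40615-7"
print(find_dois(sample))
print(find_isbn13s(sample))
```

Any identifier found this way would still need to be looked up (for example against CrossRef) before it could be trusted as a match.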

Recognition still has to be triggered manually for the moment, but an upcoming version will perform automatic recognition of PDFs added to Zotero, now that rate limits are no longer a concern. This also opens the door to a wider variety of PDF-based workflows in the future.

We're still fine-tuning some of the recognition logic, so you may occasionally see worse or incomplete results, but in many cases you'll get better results than before (particularly for older articles). Most importantly, you’ll no longer be cut off by Google Scholar after a small number of searches (which has become more of a problem with the standalone-only Zotero 5.0, which doesn’t share the browser’s cookie store and is therefore blocked more quickly).

If you do try the beta, let us know if a file isn’t recognized in a way that you would expect.

LiborA · February 23, 2018

Great info, thanks. Is this new recognizer able to read the metadata from the PDF?

dstillman · February 23, 2018

No, it doesn't use metadata from the PDF. It would be easy to do, but I suspect it would only make sense as a fallback if all the other methods failed.

adamsmith · February 24, 2018

Started testing & liking what I'm seeing so far -- Congrats on getting that out! There's no particular reason you'd have to do a CrossRef query, right? If we did a joint DOI translator for CrossDataCite along the lines of https://github.com/zotero/translators/pull/1135
it could query both?
I'm asking because we put our Newsletter PDFs on Zenodo, which assigns DataCite DOIs. Zotero does find the DOI (yay), but the query fails, e.g., for this PDF:
https://zenodo.org/record/889811/files/Büthe-Jacobs_2015_Letter_QMMR_13_2.pdf

adamsmith · February 24, 2018

First false positive I found:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3800026/pdf/nihms516171.pdf

(the PDF of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3800026/ ) imports as
"NIH-PA Author Manuscript"
(oddly with the correct abstract)

dstillman · February 26, 2018

@adamsmith: Title extraction is fixed for that one — thanks.

(That's a case where there are no identifiers and all the metadata is being extracted from the PDF itself, and we were grabbing the wrong part. The article is actually in CrossRef too, but without the prefix, which might be why we're not matching it to its DOI. We do, however, get all three authors, compared to CrossRef's one, though CrossRef has a few other fields. Anyway, more improvements to come.)

johnmilojamison · February 28, 2018

This is great news since I'm still trying to get my zotero library caught up with my offline library of PDFs. Could we put version numbers on this discussion since I'm not sure exactly where in development this is and whether my stable (release) version is getting this feature yet. Your link to the beta has a download for (win) v.5.0.35-beta.28+1d367f016; but my stable version is already v.5.0.35.1. So do I already have this feature running? If so, why isn't the recognizer reflected in the version history (https://www.zotero.org/support/changelog)? Sorry, just feeling a bit confused without knowing what versions we're talking about.

johnmilojamison · February 28, 2018

Also (dumb question, perhaps) will this feature work with pdfs without editable text, e.g. scanned pages? I have a lot of old pdfs that are basically electronic xerox copies and it would be great if zotero could automatically run them through an OCR when trying to figure out what they are. That sounds like a very high bar to set, but it would be incredibly helpful for pdfs and old journals.

dstillman · February 28, 2018

| Your link to the beta has a download for (win) v.5.0.35-beta.28+1d367f016; but my stable version is already v.5.0.35.1.

Sorry, we just haven't bumped the version on the beta — it should be 5.0.36-beta. This feature will be included in 5.0.36. (We'll also post here once the feature is available in a release version.)

| Also (dumb question, perhaps) will this feature work with pdfs without editable text, e.g. scanned pages?

No, no different from the current version in that regard, I'm afraid.

bjohas · March 1, 2018

That sounds great!

I wonder what to do with PDFs that aren't recognised. My understanding is that recognition is "first few pages of text using extraction algorithms and known metadata from CrossRef, paired with CrossRef and ISBN lookups in the client as before" (as above) - I assume the PDF file metadata (if present) is also used?

How about letting users submit metadata for PDFs that don't have CrossRef/DOI/ISBN or metadata? I.e. go by file hash, and store the metadata in a separate database? Then at least once I've added metadata manually, this is available to other users (with a warning that it's user generated).

(Or maybe such a store already exists...)

Bjoern

dstillman · March 1, 2018

| I assume the PDF file metadata (if present) is also used

It's not. (Mentioned above as well.) Generally it's of quite poor quality, which is why we've never done it.

While developing this we explored various other mechanisms, including using file hashes, and some of those might return in later versions. (File hashes are often useless due to watermarking, though.)
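
A small Python sketch of why watermarking defeats hash-based lookup: any change to the file's bytes, even a single per-download stamp, yields a completely different digest. The byte strings below are made-up stand-ins for real PDF contents, and `file_fingerprint` is a hypothetical helper, not something from Zotero.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


# Made-up stand-ins for two downloads of the "same" article.
original = b"%PDF-1.4\narticle body\n%%EOF\n"
watermarked = original.replace(
    b"article body", b"article body\n% Downloaded by user 42"
)

print(file_fingerprint(original))
print(file_fingerprint(watermarked))
print(file_fingerprint(original) == file_fingerprint(watermarked))
```

A hash only helps when every copy in circulation is byte-identical, which is more plausible for a single report posted at one canonical URL than for publisher-stamped journal PDFs.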

dstillman · March 1, 2018

In the latest beta (5.0.36-beta.1), we've added automatic metadata retrieval. Now, if you drag a PDF into Zotero as a standalone item, use "Store Copy of File"/"Link to File", or use "Save to Zotero" on a PDF from the connector, Zotero will automatically run the recognizer and create a parent item if possible. If it finds a match, it will also rename the file using the parent metadata. (We'll be adding additional customization of renaming in a future version.)
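
For readers wondering what renaming from parent metadata can look like, here is a hedged Python sketch. The "Creator - Year - Title" pattern, the 50-character title limit, and the `build_filename` helper are all assumptions made for illustration; they are not a description of Zotero's actual renaming rules.

```python
import re


def build_filename(creators, year, title, max_title_len=50, ext=".pdf"):
    """Build a 'Creator - Year - Title' filename from parent-item metadata.

    The pattern and the title length limit are assumptions of this sketch.
    """
    if not creators:
        creator_part = ""
    elif len(creators) == 1:
        creator_part = creators[0]
    elif len(creators) == 2:
        creator_part = f"{creators[0]} and {creators[1]}"
    else:
        creator_part = f"{creators[0]} et al."

    # Truncate long titles and drop a trailing period before the extension.
    title_part = title[:max_title_len].strip().rstrip(".")

    # Skip empty parts so missing metadata doesn't leave stray separators.
    parts = [p for p in (creator_part, year, title_part) if p]
    name = " - ".join(parts)

    # Remove characters that are not allowed in filenames on common systems.
    name = re.sub(r'[\\/:*?"<>|]', "", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


print(build_filename(["Colquitt", "Zapata-Phelan"], "2007", "Trends"))
```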

Both automatic recognition and automatic renaming can be disabled from the preferences.

bjohas · March 1, 2018

| (File hashes are often useless due to watermarking, though.)

Yes, agreed, e.g. for journal papers. However, what I've got in mind is grey literature in the international development space, where there's only one version, with a canonical URL (or maybe a couple of URLs, one with the org, and one with the donor). However, because there's no DOI etc., everybody ends up typing the metadata in (unless you find it on Google Scholar, but you've already found it on the web, so why then go to Google Scholar...).

[[Perhaps off topic: Don't know whether this would be too intrusive, but perhaps users could even agree to share their metadata anonymously - or rather, for the metadata to be processed (as it sits on the Zotero server anyway). So behind the scenes Zotero could then check metadata consistency (across several people adding), and where it has several independently added records, the metadata can then be offered to other users.]]

Bjoern

bjohas · March 1, 2018

Another question: Given that you're inviting people to use the beta, I assume it's (reasonably) 'safe for work'? We're working on a literature review at the moment, so it may not be a good idea to try the beta at this stage (on our main library)? At the same time, the PDF recognition sounds great...

dstillman · March 1, 2018

| However, what I've got in mind is grey literature in the international development space, where there's only one version, with a canonical URL

Sure. We've seen a lot of these, so they're on our radar.

| Don't know whether this would be too intrusive, but perhaps users could even agree to share their metadata anonymously - or rather, for the metadata to be processed (as it sits on the Zotero server anyway).

Yes, we've experimented with using synced data (limited to fields that appear in the document itself, so there's no real privacy issue). It's something we're still evaluating, but it's a lot more complicated than the current implementation.

| Given that you're inviting people to use the beta, I assume it's (reasonably) 'safe for work'?

The current beta should be pretty safe to use — other than the recognizer changes, it mostly just includes an overhaul to word processor integration that's been in beta for over a month and hasn't had any recent reports of problems.

bjohas · March 1, 2018

Great - thank you for the heads up!

johnmilojamison · March 2, 2018

Tried dumping 35 PDFs into a folder and it worked well, both for books and journal articles. I also tried adding links to PDFs and had excellent results there too.
An issue I noticed is that only about 1/10 had abstracts. There doesn't seem to be any logic to which abstracts appear, since I tried adding links to 18 articles from the same issue of the same journal, and 2 had abstracts while the others didn't. Abstracts are really important for how I search in Zotero, so I'm hoping this is being looked at :)

dstillman · March 2, 2018

@johnmilojamison: Could you provide a couple links to PDFs that didn't end up with abstracts?

Abstracts are actually an area that should be much improved with the new recognizer. Neither CrossRef nor Google Scholar offers abstracts, so you didn't get them with the previous recognizer, and with the new recognizer we're trying to extract abstracts and include them. But the extraction logic can likely be improved further.

johnmilojamison · March 5, 2018

Below are a few sets, all from A management journals. The first is from JAP which has some that pulled bibliographies and some that didn't. The second are titles from AMA, none of which found abstracts. Third are from JPSP, also none of which found abstracts. I wondered if it might be since I had most of these in Zotero, but even ones for which I had abstracts before didn't pull the abstract when using the new pdf recognizer (e.g. Webb et al., 2017).

2017 Ployhart, Robert E.; Schmitt, Neal; Tippins, Nancy T. Solving the Supreme Problem: 100 years of selection and recruitment at the Journal of Applied Psychology.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000081
2017 Hofmann, David A.; Burke, Michael J.; Zohar, Dov 100 years of occupational safety research: From basic protections and work analysis to a multilevel view of workplace safety and risk.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000114
2017 Kanfer, Ruth; Frese, Michael; Johnson, Russell E. Motivation related to work: A century of progress.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000133
2017 Cortina, Jose M.; Aguinis, Herman; DeShon, Richard P. Twilight of dawn or of evening? A century of research methods in the Journal of Applied Psychology.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000163

Roberson, Q., Holmes, O., & Perry, J. L. 2017. Transforming Research on Diversity and Firm Performance: A Dynamic Capabilities Perspective. Academy of Management Annals, 11(1): 189–216.
Rothman, N. B., Pratt, M. G., Rees, L., & Vogus, T. J. 2017. Understanding the Dual Nature of Ambivalence: Why and When Ambivalence Leads to Good and Bad Outcomes. Academy of Management Annals, 11(1): 33–72.
Jason A. Colquitt, & Cindy P. Zapata-Phelan. 2007. Trends in Theory Building and Theory Testing: A Five-Decade Study of the “Academy of Management Journal.” The Academy of Management Journal, 50(6): 1281–1303.

McCarthy, M. H., Wood, J. V., & Holmes, J. G. 2017. Dispositional pathways to trust: Self-esteem and agreeableness interact to predict trust and negative emotional disclosure. Journal of Personality and Social Psychology, 113(1): 95–116.
Miller, J. G., Akiyama, H., & Kapadia, S. 2017. Cultural variation in communal versus exchange norms: Implications for social support. Journal of Personality and Social Psychology, 113(1): 81–94.
Mooijman, M., van Dijk, W. W., van Dijk, E., & Ellemers, N. 2017. On sanction-goal justifications: How and why deterrence justifications undermine rule compliance. Journal of Personality and Social Psychology, 112(4): 577–588.
Webb, C. E., Coleman, P. T., Rossignac-Milon, M., Tomasulo, S. J., & Higgins, E. T. 2017. Moving on or digging deeper: Regulatory mode and interpersonal conflict resolution. Journal of Personality and Social Psychology, 112(4): 621–641.

LiborA · March 6, 2018

I imported this report just now: https://pubs.usgs.gov/pp/0272e/report.pdf. But the recognizer found an article at http://science.sciencemag.org/content/119/3088/328.1 and imported the metadata from it instead.

mark · March 7, 2018

Just want to note this sounds very promising and is likely to solve by far the most common confusion for new users of Zotero, who tend to drag PDFs into the library and expect them to magically be citable.

dstillman · March 7, 2018

@johnmilojamison and @LiborA, most of those examples should now give better results.

dstillman · March 8, 2018

In the latest beta there are two new context-menu options for newly recognized items: "Undo Retrieve Metadata" and "Report Inaccurate Metadata". These will appear for 24 hours after retrieval or until you make changes to the item or restart Zotero.
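
The visibility rule above can be read as three conditions that must all hold. The Python sketch below is only a model of that rule as described in this post; the `RecognizedItem` class, its fields, and the session counter are hypothetical and do not reflect how Zotero stores this state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

UNDO_WINDOW = timedelta(hours=24)


@dataclass
class RecognizedItem:
    recognized_at: datetime                    # when metadata was retrieved
    modified_since_recognition: bool = False   # any edit to the item
    session_id: int = 0                        # app session that recognized it


def can_undo(item: RecognizedItem, now: datetime, current_session_id: int) -> bool:
    """True while the undo/report menu options would still be offered."""
    within_window = now - item.recognized_at < UNDO_WINDOW
    same_session = item.session_id == current_session_id  # no restart since
    return within_window and same_session and not item.modified_since_recognition


now = datetime(2018, 3, 8, 12, 0)
fresh = RecognizedItem(recognized_at=now - timedelta(hours=23), session_id=1)
print(can_undo(fresh, now, current_session_id=1))
```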

You can use "Undo Retrieve Metadata" if you're not happy with the results and want to add the parent item another way (e.g., via the connector), or if, say, you attempt to drag in a PDF on top of an existing item and miss, causing another parent item to be created.

"Report Inaccurate Metadata" lets you easily report problems with the returned metadata in a way that we can investigate further. It resends the same data previously sent to the recognizer server along with the final metadata. We'll be automatically notified of reports, so no need to post here unless you want feedback from us. (Note that not all problems will be things we can fix, but we'll keep an eye out for things we can improve.)

Joscha · March 8, 2018

This is great! Pretty unrelated but it does remind me that this would still be great (Integration of zotfile's pdf annotation extraction into Zotero):

https://github.com/zotero/zotero/issues/1018

dstillman · March 10, 2018

The new recognizer is included in Zotero 5.0.36, available now.

https://www.zotero.org/blog/zotero-5-0-36/

(@Joscha: We're hoping to work on both of your big pull requests soon.)

bjohas · March 11, 2018

What would be a sure-fire way of including a DOI, so that the PDF recogniser recognises it? I tried to get this recognised, https://zenodo.org/record/1195743, but it didn't work.

adamsmith · March 11, 2018

I didn't check, but I'd be very surprised if the recognizer didn't find those DOIs, but it currently only queries CrossRef, not DataCite (which is what Zenodo uses). See my comment above: https://forums.zotero.org/discussion/comment/302130/#Comment_302130

dstillman · March 12, 2018

I've pushed an update in the latest beta that switches the recognizer to use the same set of DOI translators used in Add Item by Identifier. I think the original recognizer code just predated the addition of additional DOI translators, so it had CrossRef hard-coded, and the new recognizer simply used the same code. The Zenodo PDF above now works for me.
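
The difference between a hard-coded CrossRef lookup and a set of DOI translators can be sketched in Python as a chain of resolvers tried in order. Everything here is a stand-in: `fake_crossref` and `fake_datacite` make no network calls and simply pretend to know certain DOI prefixes, and `resolve_doi` is a hypothetical helper rather than Zotero's translator framework.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[dict]]


def fake_crossref(doi: str) -> Optional[dict]:
    """Stand-in for a CrossRef lookup; pretends to know one prefix only."""
    if doi.startswith("10.1037/"):
        return {"DOI": doi, "itemType": "journalArticle"}
    return None


def fake_datacite(doi: str) -> Optional[dict]:
    """Stand-in for a DataCite lookup; pretends to know Zenodo-style DOIs."""
    if doi.startswith("10.5281/"):
        return {"DOI": doi, "itemType": "report"}
    return None


def resolve_doi(doi: str, resolvers: list) -> Optional[dict]:
    """Try each (name, resolver) pair in order and return the first hit."""
    for name, resolver in resolvers:
        result = resolver(doi)
        if result is not None:
            return {**result, "source": name}
    return None


crossref_only = [("crossref", fake_crossref)]
all_resolvers = [("crossref", fake_crossref), ("datacite", fake_datacite)]

zenodo_style_doi = "10.5281/zenodo.1195743"
print(resolve_doi(zenodo_style_doi, crossref_only))
print(resolve_doi(zenodo_style_doi, all_resolvers))
```

With only the CrossRef stand-in configured, the Zenodo-style DOI is found in the text but resolves to nothing, which mirrors the failure reported earlier in this thread.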

bjohas · March 12, 2018

Great, that's excellent!

fmuro · March 12, 2018

I can't say THANKS enough times for having improved PDF recognition that much. I can finally index my whole old PDF collection. THANKS!!!

bjohas · March 12, 2018

@fmuro makes a very important point - expressing appreciation for the great work that is being done! THANKS from me also!