Available for beta testing: new PDF recognizer
The latest Zotero beta features a redesigned PDF recognizer that no longer relies on Google Scholar and should allow the recognition of an essentially unlimited number of PDFs without throttling.
PDFs are now recognized using a Zotero-designed web service that operates on the first few pages of text using extraction algorithms and known metadata from CrossRef, paired with CrossRef and ISBN lookups in the client as before. The Zotero lookup service doesn't require a Zotero account, and we don't log any data about the content or results of searches. No data is sent to Google Scholar anymore.
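For readers curious what an identifier lookup on the first pages might involve, here is a minimal sketch that pulls candidate DOIs out of extracted page text. The pattern and the `find_dois` helper are assumptions made for illustration, not Zotero's actual recognition code.

```python
import re
from typing import List

# Assumed pattern: "10.", a 4-9 digit registrant prefix, a slash, then a suffix
# that runs until whitespace or a quote/angle bracket. Illustrative only.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def find_dois(page_text: str) -> List[str]:
    """Return candidate DOIs in order of first appearance, without duplicates."""
    found: List[str] = []
    for match in DOI_PATTERN.finditer(page_text):
        # Sentence punctuation often trails a DOI in running text; strip it.
        doi = match.group(0).rstrip('.,;:)]')
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    sample = "doi:10.1037/apl0000081. See also https://doi.org/10.1037/apl0000114)"
    print(find_dois(sample))
```

A candidate found this way would still need to be looked up (for example against CrossRef) before its metadata could be trusted.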
Recognition still has to be triggered manually for the moment, but an upcoming version will perform automatic recognition of PDFs added to Zotero, now that rate limits are no longer a concern. This also opens the door to a wider variety of PDF-based workflows in the future.
We're still fine-tuning some of the recognition logic, so you may see some worse or incomplete results in some cases, but in many cases you'll get better results than before (particularly for older articles). Most importantly, you’ll no longer be cut off by Google Scholar after a small number of searches (which has become more of a problem with the standalone-only Zotero 5.0, which doesn’t share the browser’s cookie store and is therefore blocked more quickly).
If you do try the beta, let us know if a file isn’t recognized in a way that you would expect.
it could query both?
I'm asking because we put our Newsletter PDFs on Zenodo, which assigns DataCite DOIs. Zotero does find the DOI (yay), but the query fails, e.g., for this PDF:
https://zenodo.org/record/889811/files/Büthe-Jacobs_2015_Letter_QMMR_13_2.pdf
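One registry-agnostic way to resolve a DOI is content negotiation through doi.org, which CrossRef and DataCite both support. The sketch below only builds such a request; the `build_csl_request` helper and the placeholder DOI are hypothetical, and this is not a description of how Zotero performs its lookups.

```python
import urllib.request
from urllib.parse import quote

# Media type for CSL JSON, served via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_csl_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a doi.org request asking for CSL JSON metadata."""
    # Keep "/" unescaped so the prefix/suffix separator survives.
    url = "https://doi.org/" + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


if __name__ == "__main__":
    # Placeholder DOI for illustration; substitute the DOI found in the PDF.
    request = build_csl_request("10.1234/example-doi")
    print(request.full_url, request.get_header("Accept"))
    # Sending it would be: urllib.request.urlopen(request).read()
```

Because doi.org redirects to whichever agency registered the DOI, the caller does not need to know in advance whether it came from CrossRef or DataCite.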
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3800026/pdf/nihms516171.pdf
(the PDF of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3800026/ ) imports as
"NIH-PA Author Manuscript"
(oddly with the correct abstract)
(That's a case where there are no identifiers and all the metadata is being extracted from the PDF itself, and we were grabbing the wrong part. The article is actually in CrossRef too, but without the prefix, which might be why we're not matching it to its DOI. We do, however, get all three authors, compared to CrossRef's one, though CrossRef has a few other fields. Anyway, more improvements to come.)
I wonder what to do with PDFs that aren't recognised. My understanding is that recognition is "first few pages of text using extraction algorithms and known metadata from CrossRef, paired with CrossRef and ISBN lookups in the client as before" (as above) - I assume the PDF file metadata (if present) is also used?
How about letting users submit metadata for PDFs that don't have CrossRef/DOI/ISBN or metadata? I.e. go by file hash, and store the metadata in a separate database? Then at least once I've added metadata manually, this is available to other users (with a warning that it's user generated).
(Or maybe such a store already exists...)
Bjoern
While developing this we explored various other mechanisms, including using file hashes, and some of those might return in later versions. (File hashes are often useless due to watermarking, though.)
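To illustrate the watermarking problem, here is a minimal sketch using SHA-256 over the raw file bytes. The byte strings stand in for real PDFs and the `content_fingerprint` helper is hypothetical; this is not the mechanism that was actually explored.

```python
import hashlib


def content_fingerprint(pdf_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file contents."""
    return hashlib.sha256(pdf_bytes).hexdigest()


if __name__ == "__main__":
    # Stand-ins for two downloads of the same article.
    original = b"%PDF-1.4 example article body"
    watermarked = original + b" Downloaded by user 42"

    # Any added bytes change the digest, so the two copies never match.
    print(content_fingerprint(original) == content_fingerprint(watermarked))
```

A hash lookup would therefore only help for files distributed byte-for-byte identically, such as the single-version grey literature described above.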
Both automatic recognition and automatic renaming can be disabled from the preferences.
Yes, agreed, e.g. for journal papers. However, what I've got in mind is grey literature in the international development space, where there's only one version, with a canonical URL (or maybe a couple of URLs, one with the org, and one with the donor). However, because there's no DOI etc., everybody ends up typing the metadata in (unless you find it on Google Scholar, but you've already found it on the web, so why then go to Google Scholar...).
[[Perhaps off topic: Don't know whether this would be too intrusive, but perhaps users could even agree to share their metadata anonymously - or rather, for the metadata to be processed (as it sits on the Zotero server anyway). So behind the scenes Zotero could then check metadata consistency (across several people adding), and where it has several independently added records, the metadata can then be offered to other users.]]
Bjoern
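The consistency check suggested above could be sketched roughly as follows: accept a title only once enough independently entered copies agree after light normalization. The `consensus_title` helper and the threshold of three are assumptions for illustration; no such Zotero service is implied.

```python
from collections import Counter
from typing import List, Optional


def consensus_title(submissions: List[str], min_agreeing: int = 3) -> Optional[str]:
    """Return the normalized title that at least `min_agreeing` users entered, else None."""
    # Normalize case and whitespace so trivial typing differences do not count.
    counts = Counter(" ".join(title.lower().split()) for title in submissions)
    if not counts:
        return None
    title, count = counts.most_common(1)[0]
    return title if count >= min_agreeing else None


if __name__ == "__main__":
    entries = ["Annual  Report", "annual report", "Annual report ", "Anual report"]
    print(consensus_title(entries))
```

A real service would need to compare more fields than the title and decide how to handle near-misses such as the misspelled entry.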
An issue I notice is that only about 1/10 had the abstracts. There doesn't seem to be any logic to which abstracts appear, since I tried adding links to 18 articles from the same issue of the same journal, and 2 had abstracts while the others didn't. Abstracts are really important for how I search in Zotero, so I'm hoping this is being looked at :)
Abstracts are actually an area that should be much improved with the new recognizer. Neither CrossRef nor Google Scholar offers abstracts, so you didn't get them with the previous recognizer, and with the new recognizer we're trying to extract abstracts and include them. But the extraction logic can likely be improved further.
2017 Ployhart, Robert E.; Schmitt, Neal; Tippins, Nancy T. Solving the Supreme Problem: 100 years of selection and recruitment at the Journal of Applied Psychology.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000081
2017 Hofmann, David A.; Burke, Michael J.; Zohar, Dov 100 years of occupational safety research: From basic protections and work analysis to a multilevel view of workplace safety and risk.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000114
2017 Kanfer, Ruth; Frese, Michael; Johnson, Russell E. Motivation related to work: A century of progress.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000133
2017 Cortina, Jose M.; Aguinis, Herman; DeShon, Richard P. Twilight of dawn or of evening? A century of research methods in the Journal of Applied Psychology.
Journal of Applied Psychology
1939-1854, 0021-9010
10.1037/apl0000163
Roberson, Q., Holmes, O., & Perry, J. L. 2017. Transforming Research on Diversity and Firm Performance: A Dynamic Capabilities Perspective. Academy of Management Annals, 11(1): 189–216.
Rothman, N. B., Pratt, M. G., Rees, L., & Vogus, T. J. 2017. Understanding the Dual Nature of Ambivalence: Why and When Ambivalence Leads to Good and Bad Outcomes. Academy of Management Annals, 11(1): 33–72.
Jason A. Colquitt, & Cindy P. Zapata-Phelan. 2007. Trends in Theory Building and Theory Testing: A Five-Decade Study of the “Academy of Management Journal.” The Academy of Management Journal, 50(6): 1281–1303.
McCarthy, M. H., Wood, J. V., & Holmes, J. G. 2017. Dispositional pathways to trust: Self-esteem and agreeableness interact to predict trust and negative emotional disclosure. Journal of Personality and Social Psychology, 113(1): 95–116.
Miller, J. G., Akiyama, H., & Kapadia, S. 2017. Cultural variation in communal versus exchange norms: Implications for social support. Journal of Personality and Social Psychology, 113(1): 81–94.
Mooijman, M., van Dijk, W. W., van Dijk, E., & Ellemers, N. 2017. On sanction-goal justifications: How and why deterrence justifications undermine rule compliance. Journal of Personality and Social Psychology, 112(4): 577–588.
Webb, C. E., Coleman, P. T., Rossignac-Milon, M., Tomasulo, S. J., & Higgins, E. T. 2017. Moving on or digging deeper: Regulatory mode and interpersonal conflict resolution. Journal of Personality and Social Psychology, 112(4): 621–641.
You can use "Undo Retrieve Metadata" if you're not happy with the results and want to add the parent item another way (e.g., via the connector), or if, say, you attempt to drag in a PDF on top of an existing item and miss, causing another parent item to be created.
"Report Inaccurate Metadata" lets you easily report problems with the returned metadata in a way that we can investigate further. It resends the same data previously sent to the recognizer server along with the final metadata. We'll be automatically notified of reports, so no need to post here unless you want feedback from us. (Note that not all problems will be things we can fix, but we'll keep an eye out for things we can improve.)
https://github.com/zotero/zotero/issues/1018
https://www.zotero.org/blog/zotero-5-0-36/
(@Joscha: We're hoping to work on both of your big pull requests soon.)