extract references from PDF and create new library items from them

ramrattan · February 4, 2011

Just like in Mendeley, I would like to be able to do this in Zotero.

Input: PDF file containing a "References" section, listing a number of cited papers and giving Author 1, Author 2, Journal, Volume, Year for each cited paper.
Operation: Zotero extracts the list of cited papers from the PDF, looks up the metadata of each cited paper using Google Scholar or PubMed and adds these as new library items in the user's "My Library" folder.
Output: New subcollection of library items containing all papers cited in the PDF that was used for input.
SuperDuperPlus: Zotero and its entire development team would be worth their weight in pure GOLD if Zotero could also automatically retrieve the PDFs for all (or most of) those library items and attach them to each item (for instance when one has institutional access to many journals through a University Library).

Argument: This is a highly frequent situation in which one starts with a review paper to get familiar with a new research topic and then proceeds with reading papers that were cited in the review paper. When done manually, this is a lot of work, which is surely worth automating.
Instead of Zotero just being a great archiving and management tool, it would become a sort of search tool for the body of literature, in which one can follow a trail of "related papers", because each paper cites the papers related to it.

adamsmith · February 4, 2011

Can you link to where this feature is described for Mendeley? This is pretty hard to do and it'd be the first I've heard of it. Mendeley does have "retrieve metadata from PDF", just like Zotero does.
There is a KB article on importing formatted bibliographies, which is essentially what you're asking for:
http://www.zotero.org/support/kb/importing_formatted_bibliographies

ramrattan · February 4, 2011

Thanks for the quick reply. This is the link:
http://www.mendeley.com/bibliography-maker-database-generator/

adamsmith · February 4, 2011

Nope, that's not what you think it is. That's merely Mendeley's version of this:
http://www.zotero.org/support/retrieve_pdf_metadata

ramrattan · February 4, 2011

Adam, you are totally right. I downloaded and installed Mendeley and I can't seem to find it anywhere. The way they put it on their site makes it sound like it does, but it doesn't.

It is pretty smooth at looking up metadata by just dropping a PDF in there, but Zotero can do the same, even though it requires one more click.

I still love Zotero though. Keep up the great work.

adamsmith · February 4, 2011

Zotero doesn't do that automatically on purpose. There are a number of reasons why you wouldn't want this to happen right away: you may not have an internet connection at that time, for example, or you might add a lot of PDFs at once, in which case Google Scholar might lock you out after a while because you look like a bot. So the extra click is deliberate.

ramrattan · February 4, 2011

I figured as much. That one extra click is not a problem at all. Usually I do indeed add several PDFs at the same time. It's fine that Zotero only does things when I want it to.
Often I know Google Scholar will not have any data on a PDF (for instance because I made it myself), but I do want it in my library for future reference.

rfhl · October 24, 2011

The feature did actually exist in Mendeley in an earlier version (it was buggy, but still impressive!), but not anymore. Plans are to have a revised version for one of the next updates, I think.

http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/834313-version-0-9-7-does-not-extract-references-from-the

"Hello - This feature was removed in 0.9.7 because it was consuming a fair amount of resources (client and server side) without providing enough value. We plan to re-introduce it in an improved form in future."

Switching to Zotero for now :)

ben.hjorth · February 20, 2017

Any update on this feature?

adamsmith · February 20, 2017

No. I don't think anyone at Zotero is working on doing this straight from the PDF.
There's https://anystyle.io/ which works pretty well to get bibliographic data from pasted bibliographies.

ben.hjorth · March 15, 2017

Thanks, though I'm finding anystyle extremely difficult to work with. It's actually about getting saved library records from my institution into Zotero - it only offers export to Easybib, Endnote, Refworks or Delicious, or email / pdf, and no clean RIS format. Might just have to set up an account with one of those and then export from there (though Easybib still appears to have no RIS export option).

adamsmith · March 15, 2017

The export intended for EndNote should import nicely into Zotero. Have you tried it?

adamsmith · March 15, 2017

Also, what's the library catalog if you don't mind posting a URL?

ben.hjorth · March 15, 2017

It's http://search.lib.monash.edu/, but specifically a function called 'e-shelf' (FTR: the Zotero plugin works perfectly well for the catalogue itself, but I have over 200 records already saved from years of study that I'd like to be able to import directly). I believe you should be able to use some of the e-shelf functionality without a login. Otherwise I'd be happy to send you my login privately to try to replicate.

I've tried exporting for Endnote desktop, which does create an RIS file, but neither the file nor a copy from TextEdit / Import from Clipboard imports to Zotero (Mac Standalone) - in both cases I get "the selected file is not in a supported format".

Example from TextEdit:

ID - catau51249007930001751
AU - Krasner, David, 1952-
A2 - Saltz, David Z., 1962-
A2 - University of Michigan. Press
Y1 - 2006
KW - Performing arts -- Philosophy
KW - Performing arts -- Social aspects
PB - Ann Arbor : University of Michigan Press
CY - Ann Arbor
TY - BOOK
T1 - Staging philosophy intersections of theater, performance, and philosophy
ER -

Thanks for your assistance.

adamsmith · March 15, 2017

Yeah, I don't see a good way to make this happen. The RIS is your best bet. The only problem here is the placement of the TY - BOOK tag: that needs to be all the way at the top. So

TY - BOOK
ID - catau51249007930001751
AU - Krasner, David, 1952-
A2 - Saltz, David Z., 1962-
A2 - University of Michigan. Press
Y1 - 2006
KW - Performing arts -- Philosophy
KW - Performing arts -- Social aspects
PB - Ann Arbor : University of Michigan Press
CY - Ann Arbor
T1 - Staging philosophy intersections of theater, performance, and philosophy
ER -

will import reasonably well (not as well as the items import directly from the catalog, where we do all kinds of clean-up, so you may want to consider if that's not the faster route after all).

You can easily do this manually in TextEdit, or see whether there's a good search-and-replace you can run. Unfortunately the latter is not going to be easy unless all of the records are books (in which case you'd first remove all TY - lines and then replace ID - by TY - BOOK followed by a newline).
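For anyone comfortable running a short script, the reordering described above could also be automated, and that would work for mixed record types too. The following is only a minimal Python sketch, not anything Zotero ships: the function name fix_tag_order and the TAG_LINE pattern are made up for illustration, and it assumes every record ends with an ER line. The pattern accepts one or more spaces before the hyphen, since exports differ in spacing.

```python
import re

# Hypothetical pattern for a RIS tag line such as "TY - BOOK" or "ER -".
# One or more whitespace characters are allowed before the hyphen.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$")


def fix_tag_order(ris_text):
    """Move the TY line to the front of every record (a record ends at ER)."""
    fixed = []   # lines of the corrected file
    record = []  # (tag, line) pairs of the record being collected
    for line in ris_text.splitlines():
        match = TAG_LINE.match(line)
        tag = match.group(1) if match else None
        if tag is None and not record:
            # Blank or stray line between records: keep it where it is.
            fixed.append(line)
            continue
        record.append((tag, line))
        if tag == "ER":
            # Emit the TY line first, then everything else in original order.
            type_lines = [text for t, text in record if t == "TY"]
            other_lines = [text for t, text in record if t != "TY"]
            fixed.extend(type_lines + other_lines)
            record = []
    # Lines after the last ER (an unterminated record) are left unchanged.
    fixed.extend(text for _, text in record)
    return "\n".join(fixed)


if __name__ == "__main__":
    sample = "\n".join([
        "ID - catau51249007930001751",
        "AU - Krasner, David, 1952-",
        "TY - BOOK",
        "T1 - Staging philosophy",
        "ER -",
    ])
    print(fix_tag_order(sample))
```

To use it on a real export, one would read the file into a string, pass it through fix_tag_order and write the result to a new file, keeping the original export as a backup.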

ben.hjorth · March 15, 2017

Wow, it really is just about the placement of the TY tag... do you think other software outputs like that on purpose to jam Zotero input? There's no easy way to make it possible for Zotero to read alternative orderings?

Thanks for your suggestions though - what I can do is export all the books in one batch and all the articles in another, then do a search+replace (though not sure how to opt to replace the whole ID *line* with "TY - BOOK", in TextEdit / Word / etc - any hints? Otherwise I can just do it manually). Though, just FYI, in the process of investigating this I discovered a strange thing - if I replace the ID line with TY - JOUR, then it recognises the DOI as a separate article, i.e. it imports the article I want (sans DOI) and then a separate article with *just* Item Type: Journal Article and the DOI.

eg:

DO - 10.2307/464730
ID - TN_jstor_archive10.2307/464730
AU - Bahti, Timothy
Y1 - 1981
JF - Diacritics
VL - 11
IS - 2
SP - 68
EP - 82
SN - 03007162
TY - JOUR
T1 - The Indifferent Reader: The Performance of Hegel's Introduction to the Phenomenology
ER -

With replacement:

DO - 10.2307/464730
TY - JOUR
AU - Bahti, Timothy
Y1 - 1981
JF - Diacritics
VL - 11
IS - 2
SP - 68
EP - 82
SN - 03007162
T1 - The Indifferent Reader: The Performance of Hegel's Introduction to the Phenomenology
ER -

ben.hjorth · March 15, 2017

Oh one more thing (and I must admit to knowing basically nothing about coding), in case it makes a difference: when I export the results from the library e-shelf I have the following encoding options:

- ISO-8859-1
- UTF-8 [the default which I used for the above]
- US-ASCII
- windows 1251

bwiernik · March 15, 2017

A properly formed RIS file (which is what Zotero expects and needs to import) begins each record with TY - and ends each record with ER - . In the example you just posted, Zotero will not recognize that the DO - line is part of the same record because it is not between TY and ER tags.
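The framing rule just described can be checked mechanically before importing. The following is only a rough Python sketch under stated assumptions: the function name find_ris_problems and the TAG_LINE pattern are hypothetical, and the check covers nothing beyond the "every tag sits between TY and ER" rule.

```python
import re

# Hypothetical pattern for a RIS tag line such as "TY - JOUR" or "ER -".
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$")


def find_ris_problems(ris_text):
    """Return (line_number, message) pairs for tags outside a TY ... ER record."""
    problems = []
    in_record = False
    for number, line in enumerate(ris_text.splitlines(), start=1):
        match = TAG_LINE.match(line)
        if not match:
            continue  # blank and continuation lines are not checked
        tag = match.group(1)
        if tag == "TY":
            if in_record:
                problems.append((number, "TY before the previous record was closed with ER"))
            in_record = True
        elif not in_record:
            problems.append((number, f"{tag} appears outside a TY ... ER record"))
        elif tag == "ER":
            in_record = False
    return problems


if __name__ == "__main__":
    broken = "DO - 10.2307/464730\nTY - JOUR\nAU - Bahti, Timothy\nER -"
    for number, message in find_ris_problems(broken):
        print(f"line {number}: {message}")
```

Run against the "with replacement" example, this sketch would flag the DO line on line 1, which is the line Zotero imports as a separate item.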

ben.hjorth · March 20, 2017

Thanks bwiernik. I suppose a lot rides on what constitutes "properly formed" - if this really is some kind of industry standard then this library database is wilfully going against that by starting with DO - and ID -; I assume that's because this is the format Endnote wants it in. I've gotta say it does seem strange that Zotero wouldn't be able to import RIS files that Endnote can, but again I'm a coding troglodyte, so.

dstillman · March 20, 2017

Yes, this is totally broken. If you don't believe us, see the RIS documentation on EndNote's own site:

TAG ORDER: Except for the first tag of each reference, which must be "TY -" and the last tag of each reference, which must be "ER -," the tags within each reference can be in any order.

(Thomson Reuters, who owns EndNote, also owns the company that created the RIS format many years ago, so this is about as authoritative as it gets.)

There's just no reason we would risk breaking valid imports by supporting this. Contact the library and tell them to fix their output.

ben.hjorth · March 20, 2017

Yep it looks like the library's e-shelf functionality is seriously outdated. I finally got around it by setting up an Endnote account, exporting from library to Endnote for web, and then from there exporting as a RIS file (with tags in the right order!) which worked perfectly into Zotero.

Thanks all for the help!

adamsmith · March 20, 2017

This seems to be a problem in the upstream software (Primo by ExLibris) the library uses. The British Library has the same issue, for example. @zuphilip -- is that the case for your catalog, too?

zuphilip · March 20, 2017

We had this problem 2.5 years ago and corrected it with our own RIS-Export-Plugin. I think it is now also solved in the upstream software, but I don't know exactly.

Rintze · March 22, 2017

(@dstillman, since October 2016 EndNote is no longer owned by Thomson Reuters, but spun off as part of "Clarivate Analytics": http://clarivate.com/news/ip-and-science-launched-as-independent-company/)

dstillman · March 22, 2017

(@Rintze: Huh, didn't know that. In any case, the RefMan IP went with them, so still the same company.)

Anilkumar · April 6, 2017

I am new to Zotero. How can I extract references from a PDF?

adamsmith · April 6, 2017

@Anilkumar what exactly are you trying to do?

Anilkumar · April 6, 2017

I want to extract the references from the full text of a PDF

Anilkumar · April 6, 2017

to build a library

adamsmith · April 6, 2017

See my comment above: https://forums.zotero.org/discussion/comment/270485/#Comment_270485

philipgooch · July 3, 2018

This is now possible, with pretty much any PDF. I wrote a bookmarklet that sends the PDF URL to my API, which extracts the references, formats them as BibTeX and imports the file into Zotero (if you are using Firefox; otherwise it downloads the file).

More info here: https://www.scholarcy.com/bookmarklets

Demonstration video here: https://youtu.be/b8zPk364SZM

Please give it a try and let me know what you think.