Is it possible to extract references from article pdfs/webpage?

Is it possible to extract the reference list of an article from its PDF or webpage into a Zotero collection or library somehow? Thanks
  • This is fantastic. It should be added to the documentation!
  • @migugg, have you given it a try? I wasn’t too sure because the documentation is in Chinese, and I couldn’t make much of it. It’d be good if you could share your experience. Thanks
  • I downloaded the plugin, added it to Zotero, restarted Zotero, and opened a PDF in the Zotero PDF viewer, but nothing happens. I tried a variety of PDFs. Nothing. What am I doing wrong?
  • @csf2022scut perhaps you can elaborate if you’ve used it. Thanks
  • When opening a new PDF document, click the Reference tab (Chinese: 参考文献) in the right column, then click the Refresh button (Chinese: 刷新) in that column to display the article's references. You can then click the '+' button on the right side of a reference to add it to your Zotero collection. If you still can't get it to work, I suggest asking the plug-in's developer on GitHub.
  • edited February 24, 2023
    Just spent an hour or so playing with this new Zotero Reference addon. It's had 33 releases at github so far (currently v0.3.6) so it does have some maturity. And it's pretty comprehensive and impressive. Here are some notes I took. Please add any new findings. This has the air of a game-changer.

    The readme is here:
    I used Chrome's Translate function to translate Chinese to English.

    Installed the addon's XPI in the usual way.

    See Edit\Preferences\Reference:

    'References are refreshed automatically but the following item types are not'
    If ON, will automatically try to extract a PDF's reference list as soon as the PDF is opened in Zotero (no need to press Refresh button for initial extraction).
    It also sets what type of documents to ignore, ie those less likely to have a reference list at the end. They can still have manual extraction, by the Refresh button.

    Sets order of mechanisms by which addon tries to extract a paper's reference list.
    PDF means it first tries (automatically, or on first Refresh button press) to extract the actual reference list text from the PDF paper and parse it automatically for references.
    API means that it first tries to find the reference list via an online service (Crossref, CNKI) that includes reference metadata.
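
    To make the idea of a source order concrete, here is a rough sketch of a fallback chain in Python. It is not the addon's code: the function name and the stand-in extractors are hypothetical, and the addon itself seems to move to the next source on the next Refresh click rather than automatically.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in type: maps an item identifier to a list of reference strings.
Extractor = Callable[[str], List[str]]


def extract_references(
    item_key: str,
    order: List[str],
    extractors: Dict[str, Extractor],
) -> Tuple[Optional[str], List[str]]:
    """Try each source in the preferred order and return the first non-empty result."""
    for source in order:
        refs = extractors[source](item_key)
        if refs:
            return source, refs
    # No source produced anything.
    return None, []


# Illustrative extractors only: the PDF parse finds nothing, the API returns one reference.
demo_extractors = {
    "PDF": lambda key: [],
    "API": lambda key: ["Example reference"],
}
result = extract_references("ITEM1", ["PDF", "API"], demo_extractors)
```

    With the order set to PDF first, the empty PDF result falls through to the API source.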

    If ON, any references in the extracted list that are already in your Library (or collection ?) will be added to the Related tab. I turned this OFF for now.

    Don't know exactly what this does, ie what/where it saves ?

    Addon installs a Reference tab in right pane of PDF viewer ...

    It automatically extracts/lists references from the PDF in that tab if set to do so, otherwise tab opens with just the Refresh button showing.

    If automatic, or the first time Refresh is clicked if not (with the default 'PDF' setting in Preferences), then it tries to find a reference list at the end of the PDF and parse it automatically. Successfully parsed refs are listed. This worked well for me, extracting a full list of all references. But I haven't tried many papers yet, to know if that success is common, eg with more obscure or complicated reference list formats.

    Second Refresh click (eg if first one failed) checks at Crossref and CNKI for that paper's reference list metadata. I did not need to try this.
    I didn't know anything about CNKI, so ...
    My previous experience with Crossref is that reference data is now stored there increasingly often for papers, but it is sometimes incomplete for a given paper (eg it may not know about any listed references that don't have a DOI).

    If selected in Preferences, any references that already exist in your Library will be listed under the Related tab (with the usual hotlink to jump to that reference ... and the reverse link at the other end to jump back).

    In the tab list of references, you can click on each listed reference to get more information on each. A floating window is opened, with three dots across the top to indicate which data source for that paper you are looking at. Those 3 window views contain the contents of the following information/searches for the referenced paper:
    - basic data from PDF from which the reference was extracted
    - ReadPaper (title search)
    - Crossref (title, DOI search)
    - Semantic Scholar (DOI search)
    - arXiv (arXivID search)

    The first dot view shows the basic reference metadata as extracted from the paper. As far as I can tell, if the paper already exists in your Zotero library, there will be a red Zotero button at the top. If you click on that button it will take you to that item in your Library.

    Clicking on the second dot view gets you the information from the ReadPaper service, based on the reference's title - that info usually includes the bulk of the abstract. ReadPaper reportedly contains data on 200 million papers. You also get some extra coloured buttons across the top here. The blue number - the citation count - was obvious. The red Zotero button also appeared again. The rest of the buttons look like they copy the reference information to the clipboard in different formats ?

    Clicking on the third dot view appeared to get the basic Crossref data based on the paper's title, which may also include the abstract. Another button showed the 'is referenced by' count. A DOI button will jump to the DOI's URL - perhaps the most useful for downloading a referenced paper that you do not already hold but now want. And the red Zotero button will be there again (telling you that you don't need to chase the paper by DOI).

    The reference list in the right tab also has a '+' next to each reference. Clicking on that will supposedly add that reference to the current collection (not sure what happens if it's already there). More details are in the translated readme. But maybe chasing a paper via the above-mentioned DOI URL hotlink and then letting Zotero handle the metadata/PDF download via the Web Connector is a better approach ?

    Colour me impressed.

  • Colour me impressed too! It is absolutely amazing.
    Some observations:
    -I am not 100% sure, but I think it initially did not work when I installed it while having many tabs open. I had to close all tabs, reinstall it and now it works.
    -I was confused initially because I was assuming that the addon would work by hovering over or selecting text in a PDF in the Zotero reader. But as Tim820 above points out, you need to open a PDF in the PDF viewer within Zotero, then go to the "References" tab that now appears next to the usual Info, Tags, and Related tabs. Unless you have selected the relevant preference, you then need to first click on "Refresh" and the list will magically populate.
    -Generally, clicking on the + button to the right of references works wonders. But this mostly works for articles, it does not work for books. What happens for books is that it usually finds reviews of these books. I guess this happens because it searches article databases only, rather than book databases. This is slightly irritating, because when you hover over the item, it will show you the correct item, the book. In order to see which item will be copied to your library, you need to move to the second or third dot, which will show you the item it will copy. Ideally, the extension would be able to search book databases as well and identify the correct item type, rather than defaulting to articles, and thus book reviews. For people in the natural sciences probably not a big issue, for people in the humanities it is more so.
    -Adding an item to the library indeed adds it to the (sub-)collection open at the moment and not just the main library.
    -At least for some items, it does not correctly identify if they are already in the library. I need more testing, but in at least one instance the journal article had complete metadata, but no DOI. It may be that the whole system is purely DOI based? If not, the rules for recognizing whether an item is already there probably need to be less strict in order for the extension to work properly.

    But these are all minor issues!
  • CNKI is very important for Chinese college students, so...
  • edited February 27, 2023
    Since the official github documentation for this important new Zotero Reference addon is in Chinese, and even then some features are not fully described, I'll add some more notes here ...

    Reference list content
    There does not seem to be a way for the reference list to be saved permanently in the addon's Reference tab. Instead the list should get refreshed each time you want to use it. That may be because the mechanisms by which the list is generated - PDF text parsing or online reference data APIs - are not yet foolproof. So you may get improved data in the next session in a newly refreshed list (although so far it has often been good first time round for me). Also, the detection of which references are already in your Zotero library is not yet perfect. So it does make sense to have the list refreshed with the best currently-available information every time you want to look at/use it.

    Where the data for that Refresh comes from depends on the following preference settings under Edit\Preferences\Reference:
    'Save references from PDF'
    'Save references from API'
    If set to ON, then Refresh results will be saved locally (transparently). Then, the next time a Refresh is done, that local copy is automatically used to populate the Reference list shown (the popup dialog shows [Local] to signify that).

    It is probably best to leave them OFF, so that you always get a new, best-current search method when you hit Refresh. Also, I have not found any way in the addon to clear any existing local reference lists - they persist across sessions. The locally-stored lists are in a file in your local Zotero data folder: zoteroreference.json. Deleting that file in your OS does delete all the local reference list copies, so that the next Refresh will again be generated anew from the PDF/API. Otherwise once you have a saved list, you are kind of stuck with it, as it will always be used in the Refresh.
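
    If you'd rather script that clean-up than delete the file by hand, something like the following should do it. This is only a sketch: the function name is made up, the data folder location is whatever your own Zotero setup uses (the usage line below assumes a "Zotero" folder in your home directory), and it is presumably safest to run with Zotero closed.

```python
import shutil
from pathlib import Path


def clear_saved_reference_lists(data_dir: Path, filename: str = "zoteroreference.json") -> bool:
    """Back up, then remove, the addon's saved reference lists.

    Returns True if a file was found and removed, False if there was nothing to remove.
    """
    target = data_dir / filename
    if not target.is_file():
        return False
    # Keep a copy next to the original in case the saved lists are wanted back.
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)
    target.unlink()
    return True
```

    Usage would be along the lines of clear_saved_reference_lists(Path.home() / "Zotero"), adjusted to wherever your data folder actually is. The next Refresh should then be generated anew from the PDF/API.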

    BTW the preference 'References are refreshed automatically' - that sounds like it might be related to this issue - actually only determines if a Reference list appears automatically when you first open the PDF in each session - it does not force a refresh from PDF/API. It just auto-loads whatever is the current method (saved list, or newly-done PDF/API method). It's probably best to leave that OFF, as clicking the Refresh button is easy, and the auto-refresh would happen every time you open a PDF, even if you don't intend to look at the Reference list.

    The fact that there is no *persistent* DISPLAYED reference list across sessions means that there is no obvious reminder that you have already analysed the PDF's reference list for new papers which you might want to add to your library. So it's worth adding a Zotero tag to signify that.

    The quality of a reference list generated from PDF extraction and online API can differ quite a bit. Generally not in the *number* of references retrieved - usually ALL of the paper's references are retrieved for both methods (which is much better than my previous experience with online services ... which have tended to have incomplete lists). But the direct PDF text extraction might mangle author names that include diacritics, or split one reference into two. And the online API may lack some details within individual references altogether. But so far I have rarely had a situation where I can't get useable information for each reference one way or another.

    But the older the PDF paper, the more likely it is that the online API-based Reference list data will be poor. For new papers, it is often very good.

    I have seen a few instances where a Refresh via online API appears to hang, with the dialog showing [Done] API "Request references ..." . No reference list appears.

    The need for DOI with online API-based reference list
    It seems that a reference list from online API can only be created for a paper with a stored DOI (in its existing Zotero metadata). Without a DOI the Refresh will show a failure message in the dialog.
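
    For anyone curious what a DOI-keyed lookup looks like, here is a small sketch against the public Crossref REST API, where a work's record is requested by DOI and the deposited reference list sits under "message" / "reference". I don't know that the addon calls exactly this endpoint; the function names and the sample DOI are placeholders of mine.

```python
from urllib.parse import quote


def crossref_works_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single work, which is keyed by DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def summarize_references(record: dict) -> dict:
    """Count the deposited references in a parsed Crossref work record,
    split by whether each entry carries a DOI."""
    refs = record.get("message", {}).get("reference", [])
    with_doi = sum(1 for ref in refs if ref.get("DOI"))
    return {"total": len(refs), "with_doi": with_doi, "without_doi": len(refs) - with_doi}


# Hand-made sample in the shape of a Crossref response (not real data).
sample_record = {
    "message": {
        "reference": [
            {"key": "ref1", "DOI": "10.1000/example.1"},
            {"key": "ref2", "unstructured": "A conference paper without a DOI"},
        ]
    }
}
summary = summarize_references(sample_record)
url = crossref_works_url("10.1000/example.1")
```

    Without a DOI there is nothing to put in that URL, which would fit with the failure message. The with/without-DOI count is a quick way to see how complete the deposited list is for a given paper.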

    Matching listed references to existing ones in your library
    If you hover over or click on a reference in the reference list, it is then shown highlighted in blue. The associated floating window will open, and IF the addon can determine that the reference is already in your library (not always reliable*), a red Zotero button will appear in that floating window (a hotlink to the matching library item). That library search uses the reference's DOI retrieved online. That is, the reference's DOI doesn't have to be listed in the PDF paper's reference list; but the DOI does have to be in your library item for the reference, in order for a match to be made. I have seen an instance when the addon failed to match a reference to my library due to a case mismatch between the DOI stored with an item in the library and the DOI that the addon's reference search had retrieved from an online API. DOIs are supposed to be case-insensitive.

    *you can easily check in your library without leaving the PDF viewer by using your library list in the Related tab.
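
    On the case mismatch: since DOIs are case-insensitive, comparing them after normalizing would avoid that kind of miss. A minimal sketch of what I mean (my own helper names, not the addon's matching code; stripping a "doi:" or resolver prefix is an extra assumption on my part):

```python
def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common prefixes so equivalent DOIs compare equal."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def same_doi(a: str, b: str) -> bool:
    """Case-insensitive DOI comparison."""
    return normalize_doi(a) == normalize_doi(b)
```

    With this, a library item storing 10.1000/ABC.Def would match a retrieved https://doi.org/10.1000/abc.def (both placeholder DOIs).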

    Click on + sign
    Clicking on the + sign next to a reference in the list adds that reference item to the same Zotero collection as the PDF paper was loaded from. It first tries to find a DOI online; if so the added metadata should be good. You could then click on the DOI field of the created library item to go to the paper's journal web page and grab the PDF. The added metadata is less good if there's no DOI. And even with DOI here, the alternative Ctrl-click method described below is probably a better option to get good metadata *and* a PDF (in one step) of a listed reference you want to add to your library. Maybe use the + sign method if you just want an item with metadata quickly added to your library, but not the PDF.

    Ctrl-click to search for a cited reference online
    If you decide that you want to search for a reference's PDF online for addition to your library, you can Ctrl-click on the blue reference, which will then open the reference's DOI URL in your browser (assuming a DOI has been found online for it; if so, there will also be a DOI hotlink button in one of the floating window views). As usual, a DOI URL should take you automatically to the journal page for that reference. There you can use the Zotero Connector to grab the metadata and add the PDF to your library.

    Floating window content
    The number of floating window views ('dot' views) seems to vary between references. For example some references may have an additional dot-view window for data from Semantic Scholar and some may not. So for example the second dot-view may not always be from ReadPaper etc.

    A curious variation in the content of some floating window views seems to arise sometimes when a referenced paper does not have a DOI - for example with a conference paper. In that case some views may instead show the titles of individual papers that *cite* the highlighted reference, rather than the reference itself. Whether this is a bug or intended behaviour is unclear. It could be that a title search at ReadPaper or Crossref did not return the actual reference (because it was not there ?) but instead mistakenly returned a paper that cited that reference. Or it may be that ReadPaper deliberately suggests related papers if it can't find the actual reference.
  • Has someone proposed a PR with translated/expanded documentation? It'd be good to have that available at the source and not just in a somewhat hidden thread here.
  • A work-around that's really easy is to use to export a bibliography into a .bib file, which can then be imported into Zotero. This works 99% of the time.
  • I can get in touch with them re: zotero-reference internationalization. From what I can see so far, it's not just the README that needs translation, but the UI elements too.
  • @rossdavilla: this may be a workaround, but it only really makes sense when you want to import the whole bibliography of an article. Very often, while reading, I want to import only one or two references, and this plugin makes this super easy.
  • edited June 15, 2023
    @ZoeCMA the UI is OK, at least as far as English is concerned. ie it all appears to me in English within Zotero. And reading the github pages works well for me via Chrome's right click Translate of full pages. But some of the UI screenshots there appear only in Chinese.

    But the addon is still evolving, so you learn a lot by reading the Issues pages, as some capabilities that weren't mentioned anywhere else were explained there. For example I learnt recently that you can double click on the number of references at the top of the References tab to copy the entire reference list to the clipboard.

    The language issue is probably enough to scare some people off trying it. So if the documentation could be translated that would be great.
  • Thanks, I'll create an "issue" in their repo for this.
  • Thank you. Having documentation including snapshots and interface in English would have a much wider acceptance. I’ve held back because of my language barrier. Thank you so much for following this up. <3
  • Could someone tell me how to install this plugin in Zotero? I don't seem to be able to find an .xpi file within the folder as downloaded from github. Thanks in advance.
  • @urodriguez the XPI is linked under the latest release (currently 0.5.8) ...
  • Thank you @tim820, I have found it now. Much appreciated!
  • I'm exploring the extension and I have a question that perhaps @tim820 would know the answer to. Does it transfer data outside of Zotero, to what extent, and is there a way to know which data? I'd imagine it would have to send some data to get the references, i.e., using the text of the PDF attachment. However, does it do that for the entire library or just the item that you use the extension on? Thank you so much for these great comments and feedback so far.
  • edited July 12, 2023
    I don't know the exact answers to your questions, although I presume informed examination of the source code at github would provide the answer. I have not seen any mention of any data being sent outside Zotero other than by calls to various APIs asking for data on references.

    The addon has two options for extracting references from a single PDF (it does not work on the whole library): text parsing or an API call (to Crossref). My uninformed inspection of the code suggests that text parsing is being done locally in that code.

    Once the list of references has populated under the References tab, hovering over a reference in that list causes the addon to query several other databases to locate the reference, should you wish to download it (which then happens by opening a URL in your browser to facilitate download via the Zotero Connector in the usual way): DOI lookup, Semantic Scholar, and several others. I recall that there was also a Chinese reference database (CNKI ?) that required Chinese users to enter their login credentials in order to use it as lookup, but I can't see evidence of that anymore.

    So as far as I know the reference data is only stored locally. On that point it's not like the Cita addon which is developing some similar cited reference extraction features. That addon encourages the user to upload the extracted references to Wikidata, so that database perhaps eventually becomes a rich repository for cited reference lists. That approach perhaps assumes that other databases like Crossref will never become a comprehensive reliable source of cited references data. That perspective appeared reasonable several years ago but now seems pessimistic, as many papers (apparently millions in fact) now have cited reference data in Crossref (now routinely deposited by once-recalcitrant publishers along with a paper's more basic metadata).
  • Thank you @tim820 for taking the time and effort to answer in detail. Much appreciated.
  • Hi everyone, I'm the developer of zotero-reference. I will be updating the English version of the readme soon.

    Because the plugin was designed for Chinese students at first, I didn't write an English version of the readme, but the plugin itself supports English.

    I've taken a general look at the questions in this discussion and I'll try to answer some of them. The biggest feature of this plug-in is that it parses references from the PDF itself, and it is very accurate. In addition, it supports fetching references from APIs such as Crossref and CNKI. CNKI is a literature search site that Chinese students use.

    Because parsing the PDF takes time, the plug-in supports saving the parse results locally on your computer, as a JSON file. On the next refresh, the list can then be read directly from that local file rather than re-parsed from the PDF, which saves time.
  • I notice that there is now an English version of the instructions ...
  • When I attempted to install the newest version of this (Zotero 7, updated), it would not allow the install. Is anyone else running into this issue?
  • edited 2 days ago
    @kamran_ the latest significant interface changes in the Zotero v7 beta look to have broken a few things. They should be fixed in time.