Zotfile "Extract annotations" and getting the right page number

taiwantaichi · November 12, 2017

First of all, a big thanks to the community for all the help.

I have read through the forums to find an answer to this, and the consensus from 2015 is that Zotfile is unable to apply the actual page number of a PDF to its extracted annotations.
That is to say, it reads page 1 as the cover, which of course is not actually page 1 of the article or PDF. While I have used Adobe Acrobat Pro DC on my PC to renumber the pages so that it correctly displays the preface material as vi-xiii and then starts the document at page 1 (or whatever), Zotfile's extracted annotations give a page reference based on Adobe's incorrect default numbering. This means I need to go and manually rename each page link, which is incredibly tedious when dealing with tens of thousands of pages. On a similar note, iAnnotate on my iPad has no problem recognizing that I have re-paginated my document and gives me the page number I need.
So, is there any way to get Zotfile to apply the correct page number? Actually, I was also wondering if there is any PDF reader for PC that will extract highlights with the correct page numbering system. Typically my workflow has involved using the iPad with iAnnotate, but I would like to be able to work on a PC tablet/laptop/desktop doing the same thing.
I apologize if this question has been answered previously, but from what I can see it remains an open question.
Thanks,
Justin

bwiernik · November 12, 2017

In the Advanced Settings tab of Zotfile preferences, there is the option to "Use actual article/book chapter page..." If this is checked, it will add the default page numbers (1, 2, 3, etc.) to the first page in the item's Pages field in Zotero to estimate the correct page number. This will only work if there are no extra publisher pages added at the beginning of the article and the page numbering scheme does not change within the PDF (as it does in your example, with vi-xiii then switching to page 1, 2,...).
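
To make the arithmetic concrete, here is a minimal Python sketch of that estimate. It is not ZotFile's actual code: the function name `estimate_page`, the way the Pages field is parsed, and the "minus one" offset (so that PDF page 1 maps to the first page in the Pages field) are all assumptions for illustration.

```python
import re


def estimate_page(pages_field: str, pdf_page: int) -> int:
    """Estimate the printed page for a 1-based PDF page position.

    Assumes the first page of the PDF is the first page of the article.
    """
    match = re.match(r"\s*(\d+)", pages_field)
    if match is None:
        raise ValueError(f"no starting page found in {pages_field!r}")
    first_page = int(match.group(1))
    return first_page + pdf_page - 1


# Article runs pp. 215-240 and the PDF starts on the article's first page.
print(estimate_page("215-240", 3))  # 217

# Same article with one publisher cover page in front: the same text is
# now on PDF page 4, so the estimate is off by one (the real page is 217).
print(estimate_page("215-240", 4))  # 218
```

The second call shows why a cover page or a block of roman-numeral pages throws the estimate off: the calculation only knows the position in the file, not what is printed on the page.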

Zotfile only works with the PDF file itself, so it doesn't really matter whether you edit them using Adobe, iAnnotate, or another program. It will extract annotations in the same way.

The tool Zotfile uses to access PDFs (pdf.js) only recently added support for page labels. The Zotfile developer currently has limited time to implement new features, so updating the Zotfile code to use page labels would likely have to come from user contributed code.
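
For anyone unfamiliar with page labels: a PDF can store a short list of label ranges, each with a starting page position, a numbering style, and a starting number, and a viewer works out the displayed page number from that list. Below is a rough Python sketch of that lookup, not pdf.js or ZotFile code. The `LabelRange` structure and the function names are hypothetical, and only lowercase roman and decimal styles are handled.

```python
from dataclasses import dataclass


@dataclass
class LabelRange:
    start_index: int       # 0-based position of the first PDF page in this range
    style: str             # "r" = lowercase roman, "D" = decimal
    first_number: int = 1  # number given to the first page of the range
    prefix: str = ""


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(ranges: list, page_index: int) -> str:
    """Return the label for a 0-based page position."""
    current = None
    for r in sorted(ranges, key=lambda r: r.start_index):
        if r.start_index <= page_index:
            current = r
    if current is None:
        # No range covers this page: fall back to the plain position.
        return str(page_index + 1)
    number = current.first_number + (page_index - current.start_index)
    text = to_roman(number) if current.style == "r" else str(number)
    return current.prefix + text


# The situation from the first post: preface labelled vi-xiii, then page 1.
ranges = [LabelRange(0, "r", 6), LabelRange(8, "D", 1)]
print(page_label(ranges, 0))   # vi
print(page_label(ranges, 7))   # xiii
print(page_label(ranges, 8))   # 1
```

An extraction tool that reads these ranges could report "xiii" or "1" directly instead of the position in the file.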

In the meantime, my recommendation would be to type the page number at the beginning of your annotations. Then this will always display. For highlights, if you add the page number as an annotation to the highlight, it will display just below the extracted highlighted text in the Zotero note.

taiwantaichi · November 12, 2017

Thanks Bwiernik,

So for most books/academic texts that have long prefaces with roman numerals, or for articles in journals with volumes that begin with high page numbers, the only way to ensure my extracted highlights have the right page number is to highlight the page number in the text as well? I figured that might be the case. The advantage of iAnnotate is that it recognizes the renumbered pages. Still, I guess it is a small price to pay for an otherwise nifty feature. I have also noted that Zotfile seems to get the diacritical marks over romanized Sanskrit words right when I extract them, which iAnnotate struggles with.
Thanks again,
Justin

bwiernik · November 12, 2017

I’m not saying to highlight the page number. Instead, after highlighting the text, double click on the highlight so its text box opens and type the page number there.

wayupnorth · February 4, 2018

Thanks to this discussion, I have come up with a solution that works for me. I move any introductory pages to the end of the file using Acrobat Pro. That way the first page in an article matches the page numbers in the metadata in Zotero, and my highlights and comments have the correct page number in Zotfile's extracted annotations.

ThoWmas · July 14, 2018

I have the same question. When I extract highlighted PDF text, I get this:

"L'influence de la couleur sur la perception des traits de personnalité de la marque" (Pantin-Sohier and Joël 2004:1)

I want this format:

"L'influence de la couleur sur la perception des traits de personnalité de la marque" (Pantin-Sohier and Joël 2004)

How do I do that? I tried editing about:config, but with no result:

%(content) (note on p.%(page))

to

%(content) ()

bjohas · December 29, 2018

@ThoWmas - if you're still interested, let me know.

wayupnorth · January 1, 2019

It took me a while to figure this out. What I've done for journal articles or book chapters is move any introductory pages (un-numbered or i, ii, iii - xiv) to the end using Acrobat Pro, so the first page in the file is the first page of the article. Then, if needed, I renumbered the article page labels to match the actual page numbers. This way, my ZotFile notes had accurate page numbers.
This is not a satisfactory solution for an entire book, but with Acrobat Pro, you can extract the chapter you are annotating and save it as a separate file. Next, make a duplicate of the book in Zotero, change item type to Book Section, and adjust metadata in Zotero as required. Finally, import (drag & drop) the chapter PDF file into Zotero and adjust page numbering as described above. If you want to keep the title page & cover image, place those pages at the end of the chapter.
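
For clarity, the reordering amounts to the following. This is only a Python sketch of the page order, not something that edits a PDF, and the function name `move_front_matter_to_end` is made up for illustration.

```python
def move_front_matter_to_end(page_count: int, front_matter_pages: int) -> list[int]:
    """Return the new page order as 1-based original page positions."""
    if not 0 <= front_matter_pages <= page_count:
        raise ValueError("front_matter_pages must be between 0 and page_count")
    body = list(range(front_matter_pages + 1, page_count + 1))
    front = list(range(1, front_matter_pages + 1))
    return body + front


# A 6-page file whose first 2 pages are a cover and a title page.
order = move_front_matter_to_end(page_count=6, front_matter_pages=2)
print(order)  # [3, 4, 5, 6, 1, 2]
```

After the move, the article's first page sits at position 1 in the file, which is the assumption ZotFile's "Use actual article/book chapter page" estimate relies on.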

bjohas · January 1, 2019

@wayupnorth Splitting books by chapters can be a right pain.

If you're happy with the command line: I've used cpdf in the past, which has an open community release; see the manual here: https://www.coherentpdf.com/cpdfmanual.pdf, esp. section 2.3. IF YOU'RE LUCKY and the PDF has bookmarks corresponding to chapters, you can do something like what is described below.

First look at the bookmarks. This command

cpdf -list-bookmarks Book.pdf

gives you the list of bookmarks (with the level of each bookmark and the actual page number, not the page label). Of course this only works if the PDF actually has (sensible!) bookmarks. From the list, pick the level of bookmarks you want.

Now do this:

cpdf -split-bookmarks LEVEL Book.pdf -o @F-%%%-@B-pp.@S-@E.pdf

where LEVEL is the level you picked earlier (likely 0 or 1); cpdf now generates PDF chunks per bookmark. The @F inserts the PDF name, the @B is the bookmark name (likely the chapter name), and pp.@S-@E is the page range; note that the @S is the actual page number, not the page label.
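
To show where the pp.@S-@E ranges come from, here is a rough Python sketch that turns a list of bookmarks at one level into page ranges. It only illustrates the idea of splitting at bookmark boundaries and is not cpdf's code. The function name and the sample bookmarks are made up, and what cpdf does with pages before the first bookmark, or with two bookmarks on the same page, is not covered.

```python
def chapter_ranges(bookmarks, page_count):
    """Turn (title, start_page) bookmarks into (title, start, end) ranges.

    Page numbers are 1-based positions in the file, not page labels.
    Each chunk ends on the page before the next bookmark starts.
    """
    ordered = sorted(bookmarks, key=lambda b: b[1])
    ranges = []
    for i, (title, start) in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1][1] - 1
        else:
            end = page_count
        ranges.append((title, start, end))
    return ranges


# Hypothetical bookmarks for a 90-page book.
bookmarks = [("Introduction", 15), ("Chapter 1", 31), ("Chapter 2", 58)]
for title, start, end in chapter_ranges(bookmarks, page_count=90):
    print(f"{title}: pp.{start}-{end}")
# Introduction: pp.15-30
# Chapter 1: pp.31-57
# Chapter 2: pp.58-90
```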

How do you now get back to page labels?

Option 1: IF YOU'RE LUCKY... book chapters often have their own DOI, so if yours does have DOIs, then when you add the resulting PDF chunks to Zotero, the page ranges will be automatically added to the Zotero item via the DOI. (Assuming the metadata contains them.)

Option 2 (brute force): Alternatively, you can use information from "cpdf -page-info Book.pdf" to translate the page numbers (in the PDF chunks) to page labels.
cpdf does seem to mess up the page labels when splitting, so you have to go back to your initial PDF. (I.e. it seems you cannot run this on the chunks...)

Do this:

cpdf -page-info Book.pdf

which gives you the correspondence between page labels and actual pages. (ONLY IF page labels have been defined in the PDF.) Hopefully you'll see

Page 1:
Label: C1
...
Page 2:
Label: I
...
Page 15:
Label: 1
...

which is the correspondence between pages and page labels. You can then use this to (manually or automatically) rename your PDF chunks. (The trick would then be to get ZotFile to recognise the pp.A-B string in the filename, and add this as page numbers - however, that would need some work on ZotFile.)
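
If you want to do the renaming automatically, something along these lines might work. It is a Python sketch that assumes the output looks like the "Page N:" / "Label: X" lines shown above and that the chunk filenames contain a pp.S-E range. The function names are made up, and in the sample text the entry for page 30 is invented to complete the example.

```python
import re


def parse_page_labels(page_info_text: str) -> dict[int, str]:
    """Build {page position: label} from 'Page N:' / 'Label: X' lines."""
    labels = {}
    current_page = None
    for line in page_info_text.splitlines():
        line = line.strip()
        page_match = re.fullmatch(r"Page (\d+):", line)
        if page_match:
            current_page = int(page_match.group(1))
            continue
        label_match = re.fullmatch(r"Label: (.+)", line)
        if label_match and current_page is not None:
            labels[current_page] = label_match.group(1)
        # Any other line is ignored.
    return labels


def relabel_range(filename: str, labels: dict[int, str]) -> str:
    """Swap the page positions in 'pp.S-E' for their page labels."""
    def swap(match):
        start, end = int(match.group(1)), int(match.group(2))
        # Pages without a known label keep their position number.
        return f"pp.{labels.get(start, start)}-{labels.get(end, end)}"
    return re.sub(r"pp\.(\d+)-(\d+)", swap, filename)


# Abbreviated sample; the 'Page 30' entry is invented for illustration.
sample = """Page 1:
Label: C1
...
Page 2:
Label: I
...
Page 15:
Label: 1
...
Page 30:
Label: 16
"""

labels = parse_page_labels(sample)
print(relabel_range("Book-002-Introduction-pp.15-30.pdf", labels))
# Book-002-Introduction-pp.1-16.pdf
```

You would feed it the saved output of the page-info command run on the original Book.pdf, not on the chunks.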

All of this is useless to you if you don't use the command line. Hope this helps anyway!

(Apologies for potential notifications to users @F-%%%-@B-pp.@S-@E ...)

joaofrgomes · February 6, 2021

Any updates on this issue? My workflow revolves a lot around making annotations in Acrobat, and I absolutely must have pages labeled as they should, otherwise I risk inserting the wrong page numbers in citations when searching for said annotations.

I suppose I could do the whole “move all roman-numeral-numbered pages to the end of the book” trick, but that's even more inelegant than just using Acrobat's own integrated TOC link panel.

This is a bug, after all, and while I am also comfortable enough with using the command line for a lot of operations, I fail to see how all those steps are more practical than just duplicating the original PDF and deleting the unwanted pages (which I already prefer doing, as my Uni-provided Google Drive account is unlimited anyway).

Also, I can only think of two books right now where that might come handy, as they contain a lot of texts I'm likely to quote and it may be more practical to have those chapters open at the same time as separate files.

Still, for larger books with a single author, having this work properly should be a top priority and make Zotero and ZotFile that much more usable (you see, seeing how I'll never be completely sure whether a file will play well with it or not, I'll just ignore that functionality altogether until it's fixed). Most academic books and especially dissertations and theses have said numbering schemes, it's a glaring oversight considering the target audience of Zotero…

qqbb · February 8, 2021

Currently, this is a limitation of ZotFile, see here. Its pdf.js code needs to be updated. Luckily, the Zotero developers are actually working on this, see here.

ZotFile's setting "Use actual article/book chapter page for highlighted text snippets" in "Tools" -> "ZotFile Preferences" -> "Advanced Settings" says: "The article page number is determined by adding the page in the pdf document and the starting page from the 'pages' field in the zotero metadata. This approach fails if the first page in the pdf document is not the first page in the article/book chapter".

"I absolutely must have pages labeled as they should, otherwise I risk inserting the wrong page numbers in citations when searching for said annotations."

As a workaround, I would suggest following bwiernik's recommendation above:

"[…] type the page number at the beginning of your annotations. Then this will always display. […] after highlighting the text, double click on the highlight so its text box opens and type the page number there."

Switch off the "Use actual article/book chapter page" preference if this is confusing. (You could also try to customize the link's text, see here.)

"Most academic books and especially dissertations and theses have said numbering schemes"

In my experience, the use of page labels varies wildly. Being able to use them would obviously be nice, but not all files provide them. Some manual editing will always be necessary. In the future, you might need to add page labels to your PDF file.

For Acrobat on macOS, you could also experience another related issue. If your PDF uses page labels, the links in the extracted annotations might open the wrong page. See here for a workaround.

manouchk · September 8, 2021

With jpdftweak, you can correct page numbering by selecting the ranges for which the numbering is roman. I think it may solve the problem encountered here. Here is a link to the software: http://jpdftweak.sourceforge.net/