OCRing a PDF after annotating in Zotero

edited April 4, 2023
I'm working with a lot of large PDFs made up of scanned handwritten documents. I am going to import these into Transkribus, run the OCR, then re-export the PDFs and import them into Zotero. However, some of these PDFs I have already added into my Zotero and annotated prior to OCRing. Would I be able to take the new, OCRed PDFs (which are identical to the old files except for having the text layer underneath) and simply replace the old PDF file (I have stored them in Zotero as links) and have the annotations be preserved? I guess the actual question is, are the annotations in the Zotero database keyed to any particular version of a PDF file?

(If I'm unable to do this, I will have to resort to the other alternative, which I've already done successfully with other documents but is very time-consuming: export the PDF from Zotero with PDF comments, extract the comments via Foxit PhantomPDF or similar, reimport the extracted comments (which are saved as a text file) into the new PDF also with PhantomPDF, and then reimport *that* new PDF back into Zotero and extract the comments as annotations.)
  • Well, bummer that nobody answered after almost 3 years, but I will bring up this post!

    I have a similar problem, and it usually goes like this:

    1. In big scanned PDFs I sometimes put in hours of work and tons of annotations/highlights.
    2. I reach a part of the PDF where the text is unreadable or unusable, so I cannot annotate/highlight it or search it with ctrl+F/cmd+F. Words are not displayed correctly, etc.
    3. I run OCR on the PDF.
    4. Then I have two PDFs: the one where I put in hours of work, and the new OCRed one, which has no annotations/highlights at all on the parts I had already worked through, since it is a new file.

    5. Now the questions pop up:
    5.1 Is there any way I can merge them together?
    5.2 Can I get annotations/highlights from the old one into the new one? (e.g. by copying the exact pages of the old PDF into the new one)
    5.3 Or is there no way? Because that would be really annoying: I would have two texts with different annotations/highlights.

    This happens more often than one would think unfortunately.


    PS to point 3: Sometimes it is not only a problem with scanned texts. I have a digital text right now in which every letter of a bold word comes out three times when I highlight it. So e.g. when I highlight the bold word *Zotero*, it is displayed in the annotation/highlight text as "ZZZooottteeerrrooo". Really annoying. This affects not only the highlights, it also makes ctrl+F/cmd+F completely unusable.

    I added an example:

    https://s3.amazonaws.com/zotero.org/images/forums/u16660026/5lan2omfjywzqo1x6ccq.png
  • Would I be able to take the new, OCRed PDFs (which are identical to the old files except for having the text layer underneath) and simply replace the old PDF file (I have stored them in Zotero as links) and have the annotations be preserved?
    Yes.

    Zotero stores annotations in its database, not in the file. As long as all elements in the file stay in the same place, you can just swap in a new file.
  • Thank you for the answer!
    How is it supposed to be done though?

    I was fiddling around for a long time and only managed to accomplish it thanks to the following Reddit comment:

    "Once I have done it by accident. Right click and show the pdf in Finder, than copy the file name and replace it with the new pdf (better with the same number of pages) and rename it as the old file name, done!!! When you open it in Zotero there should be your highlights and annotations." (Credit: https://www.reddit.com/r/zotero/comments/1kd3wzo/comment/mq8hp0v/)

    Luckily this worked for me as well! But I am sure there is another way to do it "correctly", the way it is supposed to be done in Zotero.
  • I mean, that's just describing the same thing. Zotero opens the file with the filename it knows. Swap in an identical file with an OCR layer and Zotero will use it.
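For anyone who wants to script the swap for many linked files, here is a minimal sketch of the same rename-and-replace step in Python. It is not a Zotero feature or API: the function name `swap_in_ocr_pdf` and the `.bak` backup suffix are made up for illustration. It assumes the PDF is a linked file whose path you already know, and it would be prudent (an assumption, not something confirmed in this thread) to close the PDF in Zotero before running it.

```python
import shutil
from pathlib import Path


def swap_in_ocr_pdf(linked_pdf: Path, ocr_pdf: Path) -> Path:
    """Overwrite linked_pdf with ocr_pdf, keeping the original file name.

    Hypothetical helper: a copy of the old file is kept next to it
    with a ".bak" suffix so the swap can be undone by hand.
    Returns the path of that backup.
    """
    if not linked_pdf.is_file():
        raise FileNotFoundError(linked_pdf)
    if not ocr_pdf.is_file():
        raise FileNotFoundError(ocr_pdf)

    backup = linked_pdf.with_name(linked_pdf.name + ".bak")
    if backup.exists():
        # Refuse to overwrite an earlier backup of the un-OCRed file.
        raise FileExistsError(backup)

    shutil.copy2(linked_pdf, backup)   # keep the old, un-OCRed file
    shutil.copy2(ocr_pdf, linked_pdf)  # same name and location, new content
    return backup
```

Usage would look like `swap_in_ocr_pdf(Path("scans/letters.pdf"), Path("ocr/letters_ocr.pdf"))`, with both paths being placeholders. As noted above, the annotations only line up if the OCRed file has the same pages in the same places as the old one.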
  • Thanks. Okay, I was just confused because there are also the Export and Import functions for a file, but those did not work for me.

    I thought there was another, built-in, conventional way. But it works, that's the important thing.