Zotero having trouble extracting notes from OCR'd text

gndgnd · May 9, 2019

Hi,

I use Zotero to store the notes and highlights I make in the PDF files I am reading. Most of these are already OCR'd, but recently I OCR'd a few PDF papers myself (with Tesseract) that were just images, so that I could store the notes in Zotero.

However, the notes that Zotero extracts are very jumbled. Here is an example:

"contrary to the predictionsof simple physics,yhavebeenwarmest thebeginningenhe Sun was smalland hascooledeversince"

Whereas if I open the same PDF in Foxit and copy the marked passage, Foxit returns:

"contrary to the predictions of simple physics, may have been warmestat the beginning whenthe Sun was small and has cooled ever since’.

Obviously Foxit is not ideal either, but on average it makes very few errors, while Zotero makes quite a lot. This defeats the purpose of extracting annotations, as I have to do it again by hand. With a short paper that would be OK, but I also OCR'd a book and have several hundred notes that I just don't want to redo by hand.
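
A rough way to put a number on the difference between the two extractions is to compare each one, word by word, against a clean reading of the passage. The sketch below uses Python's standard `difflib` for that. The `reference` string is an assumption (a hand-corrected guess at the intended text, not something either tool produced), and `word_similarity` is a hypothetical helper name used only for illustration.

```python
from difflib import SequenceMatcher

# The two extractions quoted above, copied verbatim.
zotero_text = ("contrary to the predictionsof simple physics,yhavebeenwarmest "
               "thebeginningenhe Sun was smalland hascooledeversince")
foxit_text = ("contrary to the predictions of simple physics, may have been "
              "warmestat the beginning whenthe Sun was small and has cooled ever since")

# Assumption: a hand-corrected reading of the passage, used as ground truth.
reference = ("contrary to the predictions of simple physics, may have been "
             "warmest at the beginning when the Sun was small and has cooled ever since")


def word_similarity(extracted: str, expected: str) -> float:
    """Return a 0..1 similarity score computed over whitespace-separated words."""
    matcher = SequenceMatcher(None, extracted.split(), expected.split(), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    print(f"Zotero extraction: {word_similarity(zotero_text, reference):.2f}")
    print(f"Foxit copy:        {word_similarity(foxit_text, reference):.2f}")
```

Comparing words rather than characters is deliberate here: the errors are almost all missing spaces, so a character-level score would make both extractions look nearly perfect, while a word-level score shows how many words would actually have to be fixed by hand.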

I suspect the fix might be easy, since Foxit can read the PDF much better than Zotero. Is there a way to look into this?

Thanks!

dstillman · May 9, 2019

That's not Zotero — that's ZotFile. Specifically, it's a problem in the underlying PDF library ZotFile uses. (You'll often see the same thing trying to copy text from a PDF using Firefox's built-in PDF reader, which is the same library.)

We're hoping to integrate ZotFile's extraction capabilities in a future version, and we'll try to fix this at the same time.