Zotero OCR plugin: OCRed file not indexed

stef33 · August 7, 2024

Zotero beta 114, 64-bit, Zotero OCR plugin 0.7.3.
A 360-page image PDF.
OCR done.
The output PDF and the output image files, along with the txt and hocr files, are present next to the source image PDF on disk but are not visible in Zotero.
How can I index these files? Why don't they appear next to the input file in Zotero?

aborel · August 7, 2024

[After using the plugin] Zotero only knows about the PDF and some HTML; the images, hocr and others are not intended to be indexed (someone might want to check them for testing purposes, or use them outside of Zotero).
The expected result within Zotero is a PDF with a text layer, that can be fully indexed. Did this work for you?

stef33 · August 7, 2024

No. After OCR, Zotero still displays only the source image PDF.
Outside of Zotero, in Total Commander, I see in the same folder as the source image file: .zotero-reader-state, the OCRed PDF, hocr, txt, image-list.txt, page1.png to ... page 360.png, and the source image PDF file.
In this situation I don't see any way to index these files except moving everything other than the source PDF out of the folder and dragging the output PDF back into Zotero to index it. Is this enough to complete the action?

https://s3.amazonaws.com/zotero.org/images/forums/u587761/kkox4o0nbtrp4yll89lz.png

aborel · August 7, 2024

Thanks for the screenshot! Something isn't normal: the OCRed PDF should be attached to the Zotero record.

Is the original image PDF a regular attachment, or a linked file?
Is the record saved in your personal library or in a group library?

You can of course manually add the output PDF, it should solve the immediate problem. I'd be very happy if we could get to the root cause - in case others encounter it as well.

stef33 · August 7, 2024

The original PDF is a regular attachment.
Personal library, single user, multiple profiles.
Source PDF 141 MB, output OCRed PDF 451 MB.

aborel · August 7, 2024

It really looks fine, except for this missing attachment problem. It's a larger file than the ones I've tried before, but I don't know if that could have any impact.

Is the output PDF OK? Are all pages present and OCRed?
Did the plugin create a note, as selected in the preferences?

It would be great if you can:
- run the plugin on a fresh copy of your Zotero record (i.e. with only the source PDF), but this time selecting yes to saving html/hocr and no to the PNGs;
- check whether the html files are attached to the record (they should), and whether the PNGs are left behind in the folder (they should not).

Finally, if necessary, would you agree to share the source PDF with me?

stef33 · August 7, 2024

Yes, the output PDF is all OCRed.
No note is created by the plugin.
I will check with a smaller PDF as you described and see the results.
Thanks.

aborel · August 7, 2024

It seems that something bad is happening just after the OCR is completed - the note creation is the next step in the code.

How large is the txt file that was produced by the OCR?

stef33 · August 7, 2024

944 KB.

aborel · August 7, 2024

OK, that's not huge but maybe longer than Zotero likes.
Another test that would be useful: disable "Save output to a note" for your 360-page PDF. If things then work correctly, we'll have a winner!

stef33 · August 7, 2024

I used the same Zotero 7 portable and downloaded an 8-page test PDF from the web.
I put it in a test folder. After OCR, nothing except the source file is shown in the test folder. This time, outside of the folders in the Zotero library, I can see a note with the extracted text, the output OCR PDF and 5 html files for 5 OCRed pages.
I checked the folder outside of Zotero in Total Commander: 8 image files were created and not added to Zotero. If these are not attached and only the output html files are, then there should be 8 html files, not 5.
Another observation: the OCR process window closes after OCR.
Imagine I OCR a 3000-page book PDF and it creates (if it works as with small PDFs) 3000 html files outside the folder where I put the source PDF, uncategorized in the Zotero library.

aborel · August 7, 2024

Please run the tests I have proposed with the 360-page PDF, that's where the real problem is. We can discuss the other aspects later, but for now it only adds complexity to the issue :-)

stef33 · August 7, 2024

I did the "selecting yes to saving html/hocr and no to the PNGs" test.
Restarted Zotero.
Ran the OCR.
Same results. PNGs created. Exactly the same symptoms. It is a public PDF, so you can check and test it: https://archive.org/details/Psb22
https://s3.amazonaws.com/zotero.org/images/forums/u587761/ioc5x5zzcsre4ih30dw3.png

stef33 · August 7, 2024

I disabled "Save output to a note", but the results are the same.

aborel · August 8, 2024

Thanks for the link to the document, that was very useful.
On my machine, everything has worked as expected - the note is created, but cannot be synchronized with the Zotero server because it is too long. So it is probably not the best idea to create them for long documents in general, but your tests indicate that this is not the original cause I suspected.

Now about your 8-page document:
1) do I understand correctly that the images, html and other files were created by the plugin in a different folder than the one containing the source PDF? That would be new, I haven't read that in your first messages - did I miss something?

2) number of html files: it is controlled by the plugin settings, 5 is the correct number here as per your last screenshot. You will only get a html file for each page if you set the preference to more than the number of pages in your document.

3) closing the OCR process window (I assume you mean tesseract): I think it is normal on MS Windows. There will be no window at all on macOS, Linux, etc.

4) You don't need to generate any html file if you don't want to, or just a few if you prefer, so a 3000-page document doesn't automatically mean 3000 html files. Now if they are generated in an unexpected folder, that's still a problem - see question 1.
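
To illustrate point 2: assuming the plugin writes one html file per page up to the limit set in its preferences, the expected count is simply the smaller of the two numbers. The function and parameter names below are hypothetical and are not taken from the plugin's code.

```python
def expected_html_files(page_count: int, max_html_pages: int) -> int:
    """Expected number of per-page html files.

    Assumption: one html file per page, capped at the preference value.
    Both names are illustrative, not the plugin's own identifiers.
    """
    if page_count < 0 or max_html_pages < 0:
        raise ValueError("page_count and max_html_pages must be non-negative")
    return min(page_count, max_html_pages)


# The 8-page test document with the limit set to 5:
print(expected_html_files(8, 5))     # 5
# A 3000-page book with the same limit:
print(expected_html_files(3000, 5))  # 5
```

Under this assumption, a html file for every page only appears when the limit is at least as large as the page count.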

stef33 · August 8, 2024

Hi. Thanks for the detailed answer.
I use Zotero only offline, without connecting to the Zotero server.
8409 items, tons of PDF attachments, 361 GB total, sqlite 232 MB.

1. The 8-page document: the number of html pages to create is set to 5, which is probably why it created 5 and not 8. I did not test it yesterday. Yes, the plugin always creates files in the main list of files, uncategorized, and not where the source file is.

2. Yes, probably as you say. I did not test it.

3. Yes, I considered it normal. I just mentioned that it closes and finishes the OCR process even though the problem remains the same.

4. Any file generated after OCRing small PDFs is put in the main uncategorized list, not in the source PDF's folder.
With the big PDF, even though I set it so, html files are not generated, and PNG files are created even if I choose not to create them.

aborel · August 8, 2024

1. That's an important piece of information, thank you! If the files are not created in the same folder as the source PDF, I guess the plugin will not report them properly to Zotero. This could be the actual cause of the problem. Can you tell me the full paths to the source PDF and to the output files for one of your tests?

4. Intermediate PNG files will always be created, the OCR is performed on them. But if the "Save the intermediate PNGs..." setting is not selected, they will be deleted at the end of the process. In your case, the plugin fails before that step so this clean-up doesn't happen.

I have been testing only with a regular Zotero installation, not a portable version - I will try that next.
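
For anyone left with stray page images after a failed run, here is a minimal Python sketch that lists them in an item's storage folder and optionally deletes them. It assumes the images are named like page1.png or page 360.png, as reported earlier in this thread; the helper names are hypothetical, and it is safest to call it with dry_run=True first.

```python
import re
from pathlib import Path

# Assumed naming pattern for intermediate images: "page1.png", "page 360.png", ...
PAGE_PNG = re.compile(r"^page[ _-]?\d+\.png$", re.IGNORECASE)


def leftover_page_images(folder: Path) -> list[Path]:
    """Return the files in `folder` that look like intermediate page images."""
    return sorted(
        p for p in folder.iterdir() if p.is_file() and PAGE_PNG.match(p.name)
    )


def remove_leftovers(folder: Path, dry_run: bool = True) -> list[Path]:
    """List leftover page images; delete them only when dry_run is False."""
    targets = leftover_page_images(folder)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Calling `remove_leftovers(Path("/path/to/storage/ITEMKEY"))` only reports what would be removed; the source PDF, the OCRed PDF and the txt/hocr files do not match the pattern and are left alone.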

stef33 · August 8, 2024

1. "Physically", the files appear in Total Commander in the same folder (e.g. R34ZQQ5P, etc.) that Zotero generates for a single item. This is true in both cases: small source PDFs, whose generated files are visible in Zotero in the main uncategorized list, and large source PDFs, whose OCRed files do not appear in Zotero at all.
4. With "Save the intermediate PNGs..." unselected, the PNG files are not deleted. Yes, you are right.

aborel · August 8, 2024

1. Oh, I had misunderstood your point, thanks for the explanation. I will look into this.

aborel · August 8, 2024

I haven't asked before, and maybe it's a silly question, but does the source PDF have a parent item (i.e. right-hand pane with title, author, document type, etc.), or is it stored as an isolated file in your Zotero library?

stef33 · August 8, 2024

An isolated file without a parent item (I think because it is an image PDF), without any indexing, in a folder named "test".

aborel · August 9, 2024

Thanks for the quick response! That's the main problem, then: the plugin needs a parent item to work properly (like many standard Zotero functionalities). Right-click and select "Create Parent Item" as per
https://www.zotero.org/support/adding_items_to_zotero#pdfs

There is indeed no metadata that Zotero can recognize in your image PDF, that's why the parent was not created automatically. This means that you'll need to add at least some minimal information manually. Another possibility would be to find the book in a library catalog, import the metadata from there and attach the PDF to that record.

I will make a note about this for a future version of the plugin. Maybe it would be a good idea to check for such a case, and create some kind of parent if there is none.

[Edited to add] I'm not saying that creating a parent item is the final solution, but it will at least keep all output elements together. It will then be easier to verify if something doesn't work as it should.
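
To find other attachments in the same situation, one option is to query a copy of the Zotero database for PDF attachments that have no parent. The sketch below assumes the `itemAttachments` table has `itemID`, `parentItemID`, `contentType` and `path` columns, which should be checked against your own `zotero.sqlite`; the function name is hypothetical, and items in the trash are not filtered out. Work on a copy of the file with Zotero closed, never on the live database.

```python
import sqlite3

# Assumed schema: itemAttachments(itemID, parentItemID, contentType, path, ...)
STANDALONE_PDF_QUERY = """
SELECT itemID, path
FROM itemAttachments
WHERE parentItemID IS NULL
  AND contentType = 'application/pdf'
ORDER BY itemID
"""


def find_standalone_pdfs(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Return (itemID, path) for PDF attachments that have no parent item."""
    return conn.execute(STANDALONE_PDF_QUERY).fetchall()


# Example usage against a COPY of the database (path is a placeholder):
# conn = sqlite3.connect("file:/path/to/copy-of-zotero.sqlite?mode=ro", uri=True)
# for item_id, path in find_standalone_pdfs(conn):
#     print(item_id, path)
```

Each row returned is a candidate for "Create Parent Item" before running the OCR plugin.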

stef33 · August 9, 2024

You solved the problem. I created the parent item for the image PDF and launched the OCR plugin. This time it added the OCRed PDF and the 5 html files set by default. So it works now. Thank you for your time and patience.

aborel · August 9, 2024

No problem - thanks for helping me to understand the issue!

stef33 · August 14, 2024

Hi. Interesting: the output OCRed PDFs are very large. For example, a 37 MB input PDF gives a 1.2 GB output OCRed PDF. I also have a 1.2 GB PDF that I have not OCRed yet, which at this rate could produce an output PDF of maybe 40 GB.

The HTML generation is not enabled, yet the plugin still produces 5 html files each time. I select and delete them, and they are sent to the trash. I delete them from there too, but the HTML files still remain in the folder where the input and output PDFs are (I checked with Total Commander). This is related to Zotero, not the plugin. How can I correct it?

https://s3.amazonaws.com/zotero.org/images/forums/u587761/bzzvb8e3180zit3n7e7k.png

If Zotero does not delete files physically, this becomes a big problem over time, quickly making the data folder extremely large.

aborel · August 14, 2024

The size of the output PDF is a known issue https://github.com/UB-Mannheim/zotero-ocr/issues/42 , we'll probably be able to improve that eventually.

HTML files created while the option is not selected: I'll check.

Deleting attachments vs. deleting files: there is actually something we can do in the plugin to improve that (the files would be deleted when you empty the Zotero trash). It is part of some new code that is under review at the moment - it is supposed to address a different problem but I see it would also have a positive impact here :-)

stef33 · August 14, 2024

Thanks. I have a 1578-page, 871 MB scanned PDF. I will OCR it to see the output size.
...
It took more than 5 hours to generate the output, which is 3.73 GB in size.
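
The sizes reported in this thread give quite different growth factors, so extrapolating from a single document is unreliable. A short calculation, assuming 1 GB = 1000 MB and using only the figures quoted above (the function names are illustrative):

```python
def growth_factor(input_mb: float, output_mb: float) -> float:
    """Ratio of OCRed output size to source size."""
    return output_mb / input_mb


def projected_output_gb(input_mb: float, factor: float) -> float:
    """Projected output size in GB for a given source size and growth factor."""
    return input_mb * factor / 1000


small_doc = growth_factor(37, 1200)   # 37 MB source, 1.2 GB output
large_doc = growth_factor(871, 3730)  # 871 MB source, 3.73 GB output

print(round(small_doc, 1))  # 32.4
print(round(large_doc, 1))  # 4.3

# Projection for the 1.2 GB source PDF under each factor:
print(round(projected_output_gb(1200, small_doc), 1))  # 38.9
print(round(projected_output_gb(1200, large_doc), 1))  # 5.1
```

Depending on which document is used as the reference, the estimate for the 1.2 GB file ranges from about 5 GB to about 39 GB.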

aborel · August 28, 2024

Version 0.8.0 of the plugin, released today, creates a parent item if the PDF is a standalone item.

It also attaches the images and other files as regular attachments instead of linked files, which (1) makes it compatible with group libraries and (2) ensures that they are deleted on the disk if the user deletes them in Zotero.

stef33 · November 19, 2024

Great. I will check this, thanks.