Zotero OCR plugin: OCR output files not indexed.
Zotero beta 114 (64-bit), Zotero OCR plugin 0.7.3.
360-page image PDF.
OCR done.
The output PDF and the output image files, along with the output txt and hocr files, are present next to the source image PDF but are not visible in Zotero.
How do I index these files? Why do they not appear next to the input file, and why are they not visible in Zotero?
The expected result within Zotero is a PDF with a text layer that can be fully indexed. Did this work for you?
Outside of Zotero, in Total Commander, I see the following in the same folder as the source image file: .zotero-reader-state, the OCRed PDF, hocr, txt, image-list.txt, page1.png to ... page 360.png, and the source image PDF file.
In this situation I don't see any way to index these files except copying all files out of the folder, apart from the source PDF, and dragging the output PDF back in to index it. Is this enough to complete the action?
https://s3.amazonaws.com/zotero.org/images/forums/u587761/kkox4o0nbtrp4yll89lz.png
Is the original image PDF a regular attachment, or a linked file?
Is the record saved in your personal library or in a group library?
You can of course manually add the output PDF, it should solve the immediate problem. I'd be very happy if we could get to the root cause - in case others encounter it as well.
Personal library, single user, multiple profiles.
Source PDF 141 MB, output OCRed PDF 451 MB.
Is the output PDF OK? Are all pages present and OCRed?
Did the plugin create a note, as selected in the preferences?
It would be great if you can:
- run the plugin on a fresh copy of your Zotero record (i.e. with only the source PDF), but this time selecting yes to saving html/hocr and no to the PNGs;
- check whether the html files are attached to the record (they should), and whether the PNGs are left behind in the folder (they should not).
Finally, if necessary, would you agree to share the source PDF with me?
No note is created by the plugin.
I will check with a smaller PDF, as you described, to view the results.
Thanks.
How large is the txt file that was produced by the OCR?
Another test that would be useful: disable "Save output to a note" for your 360-page PDF. If things then work correctly, we'll have a winner!
I've put the PDF in a test folder. After OCR, I see no file other than the source file in the test folder. This time, outside of any folder in the Zotero library, I can see a note with the extracted text, the output OCRed PDF, and 5 HTML files with 5 OCRed pages.
I checked the folder outside of Zotero in Total Commander: I observed that 8 image files were created and not added to Zotero. If the images are not attached, only the output HTML files, then there should be 8 HTML files, not 5.
Other observation: the OCR process window closes after OCR.
Imagine if I OCR a 3000-page book PDF and it creates (if it works as with small PDFs) 3000 HTML files outside the folder where I put the source PDF, uncategorized in the Zotero library.
Restarted Zotero.
Ran the OCR.
Same results. PNGs created. Exactly the same symptoms. It is a public PDF, you can check and test it: https://archive.org/details/Psb22
https://s3.amazonaws.com/zotero.org/images/forums/u587761/ioc5x5zzcsre4ih30dw3.png
On my machine, everything has worked as expected: the note is created, but it cannot be synchronized with the Zotero server because it is too long. So it is probably not the best idea to create notes for long documents in general, but your tests indicate that this is not the cause I originally suspected.
Now about your 8-page document:
1) do I understand correctly that the images, html and other files were created by the plugin in a different folder than the one containing the source PDF? That would be new, I haven't read that in your first messages - did I miss something?
2) number of HTML files: it is controlled by the plugin settings, 5 is the correct number here as per your last screenshot. You will only get an HTML file for each page if you set the preference to more than the number of pages in your document.
3) closing the OCR process window (I assume you mean tesseract): I think it is normal on MS Windows. There will be no window at all on macOS, Linux, etc.
4) You don't need to generate any HTML file if you don't want to, or just a few if you prefer, so a 3000-page document doesn't automatically mean 3000 HTML files. Now if they are generated in an unexpected folder, that's still a problem - see question 1.
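To make points 2 and 4 concrete, here is a minimal sketch of the rule as described in this thread. It is not the plugin's actual code: the function name and parameter names are hypothetical, and it only assumes that the number of HTML files is capped by the preference.

```python
def html_pages_to_save(page_count: int, max_html_pages: int) -> int:
    """Expected number of per-page HTML files for one document.

    Hypothetical helper: assumes the plugin saves one HTML file per page,
    up to the limit set in the preferences (as described in this thread).
    """
    if page_count < 0 or max_html_pages < 0:
        raise ValueError("page_count and max_html_pages must be non-negative")
    return min(page_count, max_html_pages)


print(html_pages_to_save(8, 5))     # 5: the 8-page test document
print(html_pages_to_save(3000, 5))  # 5: a long book does not mean 3000 files
print(html_pages_to_save(3, 5))     # 3: fewer pages than the limit
```

Under that assumption, the 8-page test document yielding 5 HTML files is the expected result.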
I use Zotero only offline without connecting to Zotero server.
8409 items, tons of PDF attachments, 361 GB total, SQLite database 232 MB.
1. The 8-page document: the number of HTML pages to create is set to 5, so this was probably the reason it created 5 and not 8. I did not test it yesterday. Yes, the plugin always creates files in the main list of files, uncategorized, and not where the source file is.
2. Yes, probably as you say. I did not test it.
3. Yes, I considered it normal. I just mentioned that it closes and finishes the OCR process even though the problems remain the same.
4. Any file generated after OCR-ing small PDFs is put in the main uncategorized list, not in the source PDF's folder.
With the big PDF, even when I enable it, HTML files are not generated, and PNG files are created even if I choose not to create them.
4. Intermediate PNG files will always be created, the OCR is performed on them. But if the "Save the intermediate PNGs..." setting is not selected, they will be deleted at the end of the process. In your case, the plugin fails before that step so this clean-up doesn't happen.
I have been testing only with a regular Zotero installation, not a portable version - I will try that next.
4. "Save the intermediate PNGs..." is unselected, and the PNG files are not deleted. Yes, you are right.
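When the clean-up step is skipped like this, the leftover intermediate PNGs can be located with a small script. This is only a sketch: the file name pattern (page1.png, page2.png, ...) is an assumption based on the folder listing earlier in this thread, the function names are hypothetical, and the script only reports files without deleting anything.

```python
import re
from pathlib import Path

# Assumed naming of the intermediate images, based on the listing in this
# thread ("page1.png" ... "page 360.png"); adjust if your files differ.
PAGE_PNG = re.compile(r"^page[ _-]?(\d+)\.png$", re.IGNORECASE)


def leftover_page_pngs(folder):
    """Return the page PNGs found directly in `folder`, sorted by page number."""
    matches = []
    for entry in Path(folder).iterdir():
        m = PAGE_PNG.match(entry.name)
        if m and entry.is_file():
            matches.append((int(m.group(1)), entry))
    return [path for _, path in sorted(matches, key=lambda item: item[0])]


def total_size_bytes(paths):
    """Sum of the file sizes, in bytes."""
    return sum(p.stat().st_size for p in paths)
```

Calling leftover_page_pngs() on the attachment folder and passing the result to total_size_bytes() shows how much space the leftovers take, so you can review the list before removing anything by hand.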
https://www.zotero.org/support/adding_items_to_zotero#pdfs
There is indeed no metadata that Zotero can recognize in your image PDF, that's why the parent was not created automatically. This means that you'll need to add at least some minimal information manually. Another possibility would be to find the book in a library catalog, import the metadata from there and attach the PDF to that record.
I will make a note about this for a future version of the plugin. Maybe it would be a good idea to check for such a case, and create some kind of parent if there is none.
[Edited to add] I'm not saying that creating a parent item is the final solution, but it will at least keep all output elements together. It will then be easier to verify if something doesn't work as it should.
HTML generation is not enabled. Still, the plugin produces 5 HTML files each time. I select and delete them, and they are sent to the trash. I delete them from there too, but the HTML files still remain in the folder where the input and output PDFs are (I checked with Total Commander). This is related to Zotero, not the plugin. How can I correct it?
https://s3.amazonaws.com/zotero.org/images/forums/u587761/bzzvb8e3180zit3n7e7k.png
If Zotero does not delete files physically, it becomes a big problem over time, making the data folder extremely large very quickly.
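To see which attachment folders are taking the space, a short script can total the size of each subfolder. This is a sketch under one assumption: that attachments live in a "storage" directory inside the Zotero data directory, with one subfolder per attachment. The function names are hypothetical and nothing is modified on disk.

```python
from pathlib import Path


def folder_sizes(storage_dir):
    """Map each direct subfolder name of `storage_dir` to its total size in bytes."""
    sizes = {}
    for sub in Path(storage_dir).iterdir():
        if sub.is_dir():
            sizes[sub.name] = sum(
                f.stat().st_size for f in sub.rglob("*") if f.is_file()
            )
    return sizes


def largest(sizes, n=10):
    """Return the n largest (folder name, size in bytes) pairs, biggest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]
```

Running largest(folder_sizes(...)) on the storage directory lists the heaviest attachment folders, which is where leftover OCR output would show up first.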
HTML files created while the option is not selected: I'll check.
Deleting attachments vs. deleting files: there is actually something we can do in the plugin to improve that (the files would be deleted when you empty the Zotero trash). It is part of some new code that is under review at the moment. It is meant to address a different problem, but I see it would also have a positive impact here :-)
...
It took more than 5 hours to generate the output, which is 3.73 GB in size.
It also attaches the images and other files as regular attachments instead of linked files, which (1) makes it compatible with group libraries and (2) ensures that they are deleted on the disk if the user deletes them in Zotero.