Zotero OCR file size

Hi all Zotero OCR users,

Thanks to some great tutorials I managed to install Zotero OCR. However, the output files it produces are so massive that using it is pointless, since my hard disk would quickly fill up (a book that was 14 MB became 295 MB after OCR). I understand that OCR might increase file size, but other OCR programs manage file size much better.
Is there a way to set the output file size somewhere in Zotero OCR?
Thanks for your help.
  • There is an open issue: https://github.com/UB-Mannheim/zotero-ocr/issues/42. It has not been addressed yet.
  • Ah, OK, great, thank you! It would be good if at least the GitHub page (or better, the Zotero plugins documentation page) contained a warning about this. Installing it took me a lot of work (because of the issues mentioned in various install tutorials), and now it is unusable for me. It is also important to alert people, as some may not be aware of this issue, use OCR liberally, and then suddenly realize that their hard disk is full.
    I will wait for some kind of solution, then.
  • Hello there. I see the issue is still open. I started using Zotero OCR a few months ago and have maxed out my Zotero cloud storage. I was wondering if anyone has any tips for how to deal with this - how do you use this plug-in without maxing out?
  • You can reduce the output file size if you set a lower resolution in the Zotero-OCR preferences, but don't expect a miracle: unfortunately you can gain maybe 10-20%, not an order of magnitude.
    This is really a consequence of the underlying OCR engine; a real improvement would need very fundamental changes. We're considering it, but the roadmap hasn't been decided yet.
  • @aborel gotcha. Do you mind walking me through which of the Zotero OCR preferences will do that? Is it just the "output pdf dpi" setting, or will adjusting the other settings make a difference? These are my settings at the moment:

    https://s3.amazonaws.com/zotero.org/images/forums/u4655716/2af6alaftfedlt408irw.png

    Also, I had a bit of trial and error when I first started using this tool, so there are a few files that I OCR'd more than once. Although I deleted them from my Zotero library (and emptied the trash), these files appear multiple times when I search my computer's "storage" folder. Is there any way to delete the extra copies? Will removing them from my storage folder cause problems?
  • Yes, the output dpi value is the main adjustment variable. If you reduce it (let's say 200 instead of 300), the pdftoppm tool used by the plugin will produce slightly smaller image files, which in turn yield a somewhat smaller output PDF. The output PDF will be of lower quality, but it should still be quite readable.

    The overall size of the data in your library can also be reduced if you unselect the "save intermediate PNGs" and "save output as a HTML/ocr" preferences; these aren't really useful once you are sure the plugin is working properly. You can also decide to "overwrite the initial PDF with the output", but I don't recommend it: in case something goes wrong, it is safer to keep the original, non-OCRed file.

    The other parameters don't have any impact on the file size.

    ---
    Multiple copies of OCR'd files? I don't think I've heard of this before. If you can report the exact steps to reproduce this behavior, I'll be happy to take a look.

    I hope this helps. Don't hesitate to ask for more details or clarifications if necessary.
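
To check how much of the storage is actually taken up by intermediate PNGs and HTML output compared with the PDFs themselves, one could total the bytes per file extension. The following is only a minimal sketch in Python, not something provided by Zotero or the plugin. It assumes the data directory is in the default location (`~/Zotero`, with a `storage` subfolder); adjust the path if yours differs. `size_by_extension` is a hypothetical helper name. The script only reads file sizes and changes nothing.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(root):
    """Return the total size in bytes per lowercase file extension under root."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            totals[path.suffix.lower()] += path.stat().st_size
    return dict(totals)


if __name__ == "__main__":
    # Assumption: default Zotero data directory; change this if yours is elsewhere.
    storage = Path.home() / "Zotero" / "storage"
    if storage.is_dir():
        totals = size_by_extension(storage)
        # Print the largest categories first.
        for ext, total in sorted(totals.items(), key=lambda kv: -kv[1]):
            print(f"{ext or '(none)':10} {total / 1e6:10.1f} MB")
```

If `.png` or `.html` files turn out to account for a large share, unselecting the corresponding preferences mentioned above is probably the cheapest saving.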
  • I'll see if I can figure out how I did that in the first place and report back :)
    Does the "import the resulting PDF as a copy instead of as a file link" setting make a difference as to whether the new PDF is stored in the Zotero storage?
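
For tracking down the repeated files mentioned earlier in the thread, one possible starting point is to list PDFs in the storage folder that are byte-for-byte identical. This is a rough sketch under stated assumptions, not a feature of Zotero or the plugin: the storage path is assumed to be the default `~/Zotero/storage`, and `find_duplicates` is a hypothetical helper name. It only lists candidates and deletes nothing. Repeated OCR runs may not produce identical bytes, so copies that differ even slightly will not be reported.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, pattern="*.pdf"):
    """Return groups (lists of paths) of files under root with identical content."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            by_hash[file_digest(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


if __name__ == "__main__":
    # Assumption: default Zotero data directory; change this if yours is elsewhere.
    storage = Path.home() / "Zotero" / "storage"
    if storage.is_dir():
        for group in find_duplicates(storage):
            print("Identical files:")
            for path in group:
                print(f"  {path}")
```

The resulting list could also serve as the "exact steps" evidence requested above when reporting the behavior.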