Zotero and OCR

StarPicard · September 3, 2018

Hello, I use Zotero as my daily companion to manage and store information of all kinds. It works perfectly for almost everything I do, but there is one feature that I miss.

Many of my lecturers at the university provide parts of their lecture notes as PDF files. However, these are only scanned images. To make these PDF files useful in Zotero, it would be very helpful to have text recognition running in the background that processes the PDF files automatically. I have projects like OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) in mind.

My question would be whether there is any ambition or even interest to integrate such functionality into Zotero. Since I need this functionality myself, I would also be willing to put work into the implementation, as far as I am able to do so.

bwiernik · September 3, 2018

The Zotero developers would have to comment on adding this feature to Zotero itself, but you could certainly implement this as a plugin. Take a look at the Zotero Sample Plugin or my Zotero DOI Manager plug-in for simple examples.

adamsmith · September 3, 2018

Yeah, I'd recommend doing this as a plugin. It would certainly be very useful, but others might prefer different, non-free OCR workflows (e.g. Acrobat Pro).

zuphilip · September 3, 2018

I also thought about a Zotero plugin for OCRing attachments some time ago, but have not yet had time to start on anything. We have some projects about OCR and are especially working with/on the open source software Tesseract, which is behind the tool you mentioned (we also use OCRopus and ABBYY).

There could be an overlapping interest for such a feature/plugin for Tropy as well.

@adamsmith Can the Acrobat OCR also be scripted?

adamsmith · September 3, 2018

@zuphilip All the professional Adobe apps have JavaScript scripting enabled, yes, but I'm not sure whether it is possible to call that from a different app. That is, I know you can write a JavaScript automation tool to run inside Adobe, but I'm not sure you can run it from Zotero (or similar).

StarPicard · September 3, 2018

@bwiernik Implementing this as a plugin sounds like a really good idea. To be honest, I did not know about this possibility. I will take a first look at it.

Some facts about OCRmyPDF: it is not only an OCR tool. It generates a searchable PDF file from a PDF file that contains only images of the text, and in the new PDF file the text sits at the same position as the text in the image.
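
To illustrate, here is a minimal sketch of how a script might call OCRmyPDF on a single scanned file. It assumes the `ocrmypdf` command is installed and on the PATH; the function names and file names are only placeholders for illustration, and nothing here is Zotero-specific.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng"):
    """Return the argument list for one OCRmyPDF run.

    --skip-text leaves pages that already contain text untouched,
    -l selects the Tesseract language pack to use.
    """
    return ["ocrmypdf", "--skip-text", "-l", language, str(src), str(dst)]


def add_text_layer(src, dst, language="eng"):
    """Run OCRmyPDF and return the path of the searchable PDF."""
    # Assumption: ocrmypdf is installed separately; fail early if it is not.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on the PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)
    return Path(dst)
```

Calling `add_text_layer("scan.pdf", "scan_ocr.pdf", language="deu")` would write a new PDF next to the original rather than overwriting it, which seems the safer default for attachments.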

stevenmonrad · November 5, 2018

We use Acrobat itself to OCR convert non-text PDF images. It works well.

pseudomorph · November 5, 2018

I think a plugin like this would be an excellent idea, particularly if it produced good quality OCR documents using an open solution.

I work almost exclusively on Linux and really lack a tool that produces good quality OCR documents that are searchable and have text that is easy to annotate and copy, in the way that proprietary tools such as Adobe's do.

My current workflow uses a Windows application running under Wine to achieve this.

laurence80386 · August 29, 2019

Did you get anywhere with this? I just had the same idea and would be happy to contribute.

zuphilip · August 29, 2019

Just a few days ago I worked on this again and now have a first prototype of a Zotero plugin using Tesseract OCR: https://github.com/UB-Mannheim/zotero-ocr . There are several TODOs, hard-coded things and prerequisites that have to be installed, but it is already working for me. It is possible to test it yourself (but maybe not yet easy).
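
For anyone curious what a Tesseract-based step can look like outside of Zotero, here is a rough sketch. It is not the plugin's code, only an illustration under assumptions: `pdftoppm` (from Poppler) and `tesseract` are installed and on the PATH, and the helper names are made up for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf, prefix, dpi=300):
    """pdftoppm call that renders each PDF page to a PNG named <prefix>-N.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def tesseract_command(image, out_base, language="eng"):
    """tesseract call that writes a searchable PDF to <out_base>.pdf."""
    return ["tesseract", str(image), str(out_base), "-l", language, "pdf"]


def ocr_pages(pdf, language="eng"):
    """Render every page of a scanned PDF and OCR it; return one PDF per page."""
    workdir = Path(tempfile.mkdtemp(prefix="ocr-"))
    subprocess.run(render_command(pdf, workdir / "page"), check=True)
    results = []
    # pdftoppm zero-pads page numbers, so sorting the names keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(tesseract_command(image, out_base, language), check=True)
        results.append(out_base.with_suffix(".pdf"))
    return results
```

The sketch stops at one searchable PDF per page; merging them back into a single file and attaching the result to the Zotero item are left out on purpose.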