Full text indexing of djvu files

myurkin · August 3, 2008

I really like that Zotero indexes the full text of all attached PDFs. However, a significant part of my collection is in DjVu format (it compresses scanned images better than PDF, which is especially important for books). All of these files also contain an OCR text layer, which could potentially be indexed.

So is it planned to add support for indexing this format in the future? I guess it could be done analogously to PDFs, using some other utility in place of pdftotext to extract the text (though I cannot point to any such utility right now).

noksagt · August 3, 2008

The GPLed djvused would do it. I don't know how popular djvu really is, and it seems like full-text indexing should be end-user extensible (unrtf, antiword/wv, and other text extraction tools would seem to be similarly useful).
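
To make the "end-user extensible" idea concrete, here is a minimal Python sketch of a table that maps file extensions to text extraction commands. It is only an illustration, not how Zotero works: the names EXTRACTORS, build_command, and extract_text are hypothetical, and the command lines (in particular `djvused FILE -e print-pure-txt`) are assumptions that should be checked against the tools actually installed.

```python
import os
import subprocess

# Hypothetical, user-editable registry: file extension -> command template.
# "{src}" is replaced by the path of the attachment. The flags shown are
# assumptions; verify them against the installed versions of each tool.
EXTRACTORS = {
    ".pdf": ["pdftotext", "{src}", "-"],
    ".djvu": ["djvused", "{src}", "-e", "print-pure-txt"],
    ".rtf": ["unrtf", "--text", "{src}"],
    ".doc": ["antiword", "{src}"],
}


def build_command(path, extractors=EXTRACTORS):
    """Return the argument list for extracting text from path, or None."""
    ext = os.path.splitext(path)[1].lower()
    template = extractors.get(ext)
    if template is None:
        return None  # no extractor registered for this file type
    return [arg.replace("{src}", path) for arg in template]


def extract_text(path, extractors=EXTRACTORS):
    """Run the registered extractor and return its stdout as text."""
    cmd = build_command(path, extractors)
    if cmd is None:
        return None
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

With a table like this, supporting a new format would mean adding one entry rather than changing the indexing code.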

dstillman · August 5, 2008

The main problem at the moment is that support for calling external processes in Mozilla is currently very limited. Until IPC support is added (which is a sad, seven-year-old ticket), we have no way of getting stdout from processes, which means either they have to be able to write output to files or we have to use shell scripts to do the redirection (for OS X/Linux—I'm not sure this is even possible with batch files on Windows). Binaries on Windows (even command-line ones) also generally need to be modified to not pop up a command prompt window. These are the reasons we distribute modified versions of pdftotext/pdfinfo, and they'd likely be issues with other tools.

Ticket created, though.

noksagt · August 5, 2008

or we have to use shell scripts to do the redirection (for OS X/Linux—I'm not sure this is even possible with batch files on Windows).

You can use '>' stdout redirection on Windows (at least in 2K, XP, and Vista), OS X, and Linux. If you use Windows scripting (as below), you can capture stdout to a string and dump that to a file in the script if you'd like.

Binaries on Windows (even command-line ones) also generally need to be modified to not pop up a command prompt window.

We've had to address this in refbase & other programs too. VBS scripting may be a reasonable work-around (particularly if you need wrapper scripts to run the executables with anyway).

dstillman · August 5, 2008

Thanks for the VBS link.

So assuming IPC won't happen anytime soon, we could probably bundle two generic runner scripts—a VBS for Windows and a shell script for OS X/Linux—and just use those to launch programs (with the path passed as a parameter) and redirect the output to a file.
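
The generic-runner idea can be sketched roughly as follows. This is written in Python purely for illustration (the proposal above is a VBS script plus a shell script), and the names run_to_file and main, as well as the argument order, are hypothetical. The runner launches the given program and sends its stdout to a file, so the caller never has to read from the process directly.

```python
import subprocess
import sys


def run_to_file(program, args, out_path):
    """Launch program with args and redirect its stdout to out_path.

    Returns the exit code of the launched program.
    """
    with open(out_path, "wb") as out:
        completed = subprocess.run(
            [program, *args],
            stdout=out,                 # the redirection the caller cannot do
            stderr=subprocess.DEVNULL,  # assumption: error output is discarded
        )
    return completed.returncode


def main(argv):
    """Hypothetical command line: runner OUTPUT_FILE PROGRAM [ARGS...]"""
    if len(argv) < 3:
        print("usage: runner OUTPUT_FILE PROGRAM [ARGS...]", file=sys.stderr)
        return 2
    return run_to_file(argv[2], argv[3:], argv[1])
```

Saved as a script that ends with `sys.exit(main(sys.argv))`, it could be invoked as `runner out.txt pdftotext paper.pdf -`, after which the caller reads out.txt.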

ajlyon · February 19, 2011

The IPC ticket has finally been closed, although it hasn't yet been rolled into Firefox. Code at http://hg.mozilla.org/ipccode/ -- not sure what we'd have to do to make this part of Zotero now.