PDF text screwed after sync to WebDAV server

uweruss · November 5, 2013

Hi,

I've run into the following problem: I scanned, OCRed, and attached a PDF document to an item in my Zotero library (Zotero Standalone for Mac). The recognized text inside the PDF is fine as long as the file stays on my hard drive. But after I sync my library, the recognized text inside the PDF is screwed up, i.e. turned into unreadable characters. This does not happen to PDF documents I downloaded from databases, which already include text information. It only seems to happen to files I manually applied text recognition to.

What might be the cause: So far, it seems the problem is due to the way the WebDAV server (box.net) stores the PDF. The problem does not occur when I sync a PDF manually into my Box documents folder, i.e. NOT via the WebDAV file sync in Zotero. So it must be connected to the way Zotero stores the attached files on the WebDAV server.

Unfortunately, I have no idea why this happens only to manually added OCR information and, more importantly, how I could fix it. Maybe switching to another WebDAV server would solve the problem. But before I move my whole database, I wanted to check whether someone has a solution.

Has anyone had a similar experience, or does anyone know how to fix this?

All the best
Uwe

uweruss · November 5, 2013

Hi folks,

I figured out what the problem was myself! It was simply that one particular OCR output style didn't work, namely ClearScan. After I ran OCR on the same document using the Searchable Image output (in Adobe Acrobat), the text remained intact even after syncing it to the WebDAV server.

So, just in case someone else comes across this problem, that is how to solve it!

All the best
Uwe

Slackenerny · November 11, 2013

Hmmm ... that is a bit concerning, as ClearScan gives, by a long way, the most compact file output when doing OCR on an old PDF or a scanned document. ClearScan basically replaces the bitmap in a scanned document with a vector font created on the fly, so the text layer is not overlaid on top of an image file, but is encoded in the document, as though you had created the PDF in Word. It's not perfect, and when you zoom in close the letters look pretty rough, but the results are usually cleaner, print better, and have a smaller file size than the Searchable Image option.

I've been syncing my library with a WebDAV server at uni for years, and haven't had any problems with ClearScan PDFs. I'd hate to see my literally thousands of PDFs get corrupted! I hope this is just a one-off!

adamsmith · November 11, 2013

The WebDAV sync is likely a red herring. Neither Zotero nor the WebDAV server alters synced files in any way; the only thing that happens is that they're zipped and unzipped.
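
One way to check this for a given attachment is to compare checksums before and after a zip round trip. The sketch below is only an illustration: it uses Python's standard `zipfile` module, not Zotero's own code, and the helper names (`sha256`, `zip_round_trip`), the file name `scan.pdf`, and the sample bytes are all made up for the example.

```python
import hashlib
import io
import zipfile


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def zip_round_trip(name: str, data: bytes) -> bytes:
    """Zip `data` in memory under `name`, then unzip it again."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return zf.read(name)


# Stand-in bytes; in practice, read the real PDF with open(path, "rb").read()
original = b"%PDF-1.4\n" + bytes(range(256)) * 4
restored = zip_round_trip("scan.pdf", original)

# Zip compression is lossless, so the digests match
print(sha256(original) == sha256(restored))  # prints True
```

If the digest of the local PDF matches the digest of the copy retrieved after syncing, the file itself was not changed in transit, and the garbled text would have to come from how the PDF's text is read rather than from the sync.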

mjtowen · January 5, 2014

Confirming that I have had the same problem when extracting annotations from ClearScan PDFs.

ClearScan OCR gives me:
"I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h ' e h &?U.h B ^ [ h [;. SU?MRU+@&Hh P&[^U.h R2h [;.h 7R+-?M&7.h M&eh ' . h U.SU.Y.P[.+h APh R[:.U" ( :1)

Searchable Image OCR gives me:
"As it brought new life to his intellectual interests it also awakened a new feeling for life itself through the mediation of the woman." ( :1)

I will stick to Searchable Image if I need to, but ClearScan is much clearer :0) and 20-60% smaller, depending on the original.

Slackenerny: what's your workflow? Is that what is getting your ClearScans to extract OK, or what are you doing that I'm not? I too have 100s of PDFs converted to ClearScan.

Michael
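
For anyone with hundreds of converted files, extracted text like the two samples above could be screened automatically. What follows is a crude heuristic sketch, not anything Zotero or Acrobat provides: the function name `looks_garbled`, the 0.5 threshold, and the "alphabetic token containing a vowel" rule are all assumptions chosen for illustration, and the rule only makes sense for English text. The garbled sample is a shortened excerpt of the ClearScan output quoted above.

```python
import re

# A token counts as word-like if it is purely alphabetic
# (optionally followed by one punctuation mark) and contains a vowel.
WORD_RE = re.compile(r"[A-Za-z]+[.,;:!?]?")
VOWEL_RE = re.compile(r"[aeiouyAEIOUY]")


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag text when fewer than `threshold` of its tokens are word-like."""
    tokens = text.split()
    if not tokens:
        return False  # nothing extracted, so nothing to judge
    word_like = sum(
        1 for tok in tokens
        if WORD_RE.fullmatch(tok) and VOWEL_RE.search(tok)
    )
    return word_like / len(tokens) < threshold


# Shortened excerpt of the ClearScan extraction
garbled = "I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h"
# The Searchable Image extraction
clean = ("As it brought new life to his intellectual interests it also "
         "awakened a new feeling for life itself through the mediation "
         "of the woman.")

print(looks_garbled(garbled), looks_garbled(clean))
```

A check like this only tells you which files extract badly; it says nothing about why, and it will misjudge text that is heavy on numbers, formulas, or non-English words.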

Slackenerny · March 29, 2014

My workflow is pretty simple. I either scan a document (600 dpi, 2 bit black and white) and save it as a TIFF file (using CCITT G4 compression), which I then use to create a PDF, or I just start with a downloaded PDF, which can have almost any combination of resolution and compression.

I then open it in Acrobat X Pro and run the ClearScan OCR. I have yet to encounter any significant problems, either with my own scans or with downloaded PDFs. Text extraction has always worked fine in Acrobat, PDF-XChange Viewer and Foxit, both immediately after creation and after storing in Zotero, either through WebDAV or by linking to a file stored in Dropbox.

So I don't think I can offer anything to help troubleshoot this issue, sorry!