PDF text screwed after sync to WebDAV server
Hi,
I've experienced the following problem: I scanned, "OCRed" and attached a PDF document to an item in my Zotero library (Zotero standalone for Mac). The recognized text inside the PDF is fine as long as the file stays on my hard drive. But after I've synced my library, the recognized text inside the PDF is screwed, i.e. turned into unreadable characters. This does not happen to PDF documents I downloaded from databases, which already include text information. It only seems to happen to files I manually applied text recognition to.
What might be the cause: So far, it seems as if the problem is due to the way the WebDAV server (box.net) stores the PDF. The problem does not occur when I copy a PDF manually into my box documents folder, i.e. NOT via the WebDAV file sync in Zotero. So it seems to be connected to the way Zotero stores the attached files on the WebDAV server.
Unfortunately I have no idea why this happens to manually added OCR information only and, more importantly, how I could fix it. Maybe switching to another WebDAV server would solve the problem. But before I move my whole database, I wanted to check whether someone has a solution for this.
Has anyone had a similar experience, or does anyone know how to fix this?
All the best
Uwe
I figured out what the problem was myself! It was simply one particular OCR output style that didn't work, namely the ClearScan type. After I ran OCR on the same document using the Searchable Image output (in Adobe Acrobat), the text stayed intact even after syncing to the WebDAV server.
So, just in case someone else comes across this problem, this is how to solve it!
All the best
Uwe
I've been syncing my library with a WebDAV server at uni for years, and haven't had any problems with ClearScan PDFs. I'd hate to see my literally thousands of PDFs get corrupted! I hope this is just a one-off!
clearscan OCR gives me:
"I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h ' e h &?U.h B ^ [ h [;. SU?MRU+@&Hh P&[^U.h R2h [;.h 7R+-?M&7.h M&eh ' . h U.SU.Y.P[.+h APh R[:.U" ( :1)
searchable image OCR gives me:
"As it brought new life to his intellectual interests it also awakened a new feeling for life itself through the mediation of the woman." ( :1)
I will stick to searchable image if I need to, but clearscan is much clearer :0) and 20-60% smaller depending on the original.
Slackenerny: what's your workflow, if that's what is getting your clearscans to extract OK, or what are you doing that I'm not? I too have 100s of PDFs converted to clearscan.
Michael
I then open in Acrobat X Pro and run the ClearScan OCR. I have yet to encounter any significant problems, either with my own scans or downloaded PDFs. Text extraction has always worked fine in Acrobat, PDF-XChange Viewer and Foxit, both immediately after creation and after storing in Zotero, either through WebDAV or by linking to a file stored in Dropbox.
So I don't think I can offer anything to help troubleshoot this issue, sorry!
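For anyone who wants to check a batch of ClearScan PDFs for the kind of garbling quoted earlier in the thread, here is a rough Python sketch. It does not read PDFs itself: it assumes you have already extracted the text with whatever tool you use. The function names `letter_ratio` and `looks_garbled` and the 0.8 threshold are my own assumptions, not anything Zotero or Acrobat provides, and the heuristic (share of letters among non-space characters) is tuned only on the two samples posted above.

```python
def letter_ratio(text: str) -> float:
    """Share of letters among the non-whitespace characters of text."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    letters = sum(1 for ch in visible if ch.isalpha())
    return letters / len(visible)


def looks_garbled(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical heuristic: flag text whose letter share falls below threshold.

    The default threshold is a guess based on the two samples from this
    thread; adjust it for other languages or for number-heavy documents.
    """
    return letter_ratio(text) < threshold


# The two extraction samples quoted in the thread.
clearscan_sample = (
    "I P h [:?Yh M&P'Yh +U.&Mh [:.h 7R+-?M&7.h ?Yh U.SU.Y.P[.+h ' e h &?U.h "
    "B ^ [ h [;. SU?MRU+@&Hh P&[^U.h R2h [;.h 7R+-?M&7.h M&eh ' . h "
    "U.SU.Y.P[.+h APh R[:.U"
)
searchable_image_sample = (
    "As it brought new life to his intellectual interests it also awakened "
    "a new feeling for life itself through the mediation of the woman."
)

if __name__ == "__main__":
    for name, sample in [
        ("clearscan", clearscan_sample),
        ("searchable image", searchable_image_sample),
    ]:
        verdict = "garbled" if looks_garbled(sample) else "ok"
        print(f"{name}: letter ratio {letter_ratio(sample):.2f} -> {verdict}")
```

Running the same check on text extracted before and after a sync should at least show whether the text layer changed, though it says nothing about why.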