PDF text searchability sometimes lost

Hi all,

This problem has been vexing me for a while, but since I've been unable to figure out why or even consistently when it happens, I've been hesitant to post about it. I'm not sure if this is related to Zotfile or something else entirely.

I am using Zotfile to move and rename my OCR scanned pdfs. Often the searchability of texts linked to through Zotero stops functioning. Text extraction using Zotfile returns a jumble of symbols. I can't figure out why this happens sometimes and not others.

An example: After realizing the loss of searchability in one of my pdfs, I downloaded another version of the same text. I opened it and confirmed the searchability of the new file. I then copied the file name of the original pdf, deleted it, and changed the name of the new pdf to match the original file so that the Zotero link would be maintained. After opening the new pdf through the link, the searchability was lost.

I'm sorry that I can't bring more precision to my description. The only thread I've found that resembles my problem is this one: https://forums.zotero.org/discussion/33061/pdf-text-screwed-after-sync-to-webdav-server/

Any suggestions or help would be greatly appreciated!
  • do you have an example PDF where that problem occurs?
  • Here is a link to one example: https://dl.dropboxusercontent.com/u/10940981/Said-1979-Orientalism.pdf

    Thanks!
  • What is the file path of a file that works and a file that doesn't?
  • Actually, nvm. That PDF is messed up. If you open it in a PDF reader, can you select text and then copy-paste? (I get symbols). Something goes wrong between you downloading the file and re-attaching it. After each step in your process, open the file in a PDF reader and see if you can still copy text.
  • Hi aurimas,
    yeah, no go with the copy-pasting function. I also get symbols. After a bit more experimenting, I have realized that this happens to some files whether or not they are linked through Zotero.

    It seems to be related to highlighting a file in Preview (Mac). Before highlighting, the text copies and pastes correctly. After highlighting and saving, the text is corrupted. I can highlight and save the same file in Acrobat and the text remains uncorrupted, but the first time I save a new highlight made in Preview, the text is corrupted.

    It doesn't happen with every file though, and I don't know what the determining variable is.

    I guess this is no longer a question for the Zotero forums, but if anyone happens to have any idea why this might be happening and how I can fix it, I would be much obliged!
  • edited July 7, 2014
    In case anyone else who has been struggling with this problem finds this thread, I've now found a couple of other mentions of this problem in other forums. Unfortunately I have not found any solutions. It appears that this problem with Preview has existed for a long time and that Apple has simply not cared to fix it. I still don't understand why it happens with some files and not others.

    Bottom line: Preview invisibly corrupts the OCR text on some pdfs.

    https://forums.adobe.com/thread/1165888

    https://discussions.apple.com/thread/4487157

    No response on the official apple forum... *sigh*

    Finally, a couple possible workarounds (disclaimer: I have not tried either of them yet):
    http://n8henrie.com/2013/06/how-to-remove-corrupt-ocr-data-from-a-pdf/
    http://www.verypdf.com/app/pdf-repair-mac/try-and-buy.html
  • I finally discovered the source of this problem: Mac Preview (which is built on the Quartz pdf engine in Mac OS X) has a conflict with the embedded fonts used by ClearScan OCR in Adobe Acrobat. This problem first surfaced in ~2009, and is still active in current versions of Mac OS X (Yosemite, 10.10.5) and Adobe Acrobat (XI, 11.0.12). The problem does not occur for pdfs that use the "Searchable Image" OCR option in Acrobat. The problem may be initiated on the Mac system, but it causes problems for anyone who uses the pdf after that point. The pdf is still readable, but the OCR text is corrupted, which makes the damage difficult to detect. The biggest problem is that the file is no longer searchable, neither locally on your machine nor on the web (e.g., via Google).

    One can reproduce the problem by starting with a pdf created from raster images of a text document. Use Acrobat with the ClearScan option to OCR the pdf, and then save the pdf. If you check the properties of the document, you will see a bunch of embedded fonts, with a common prefix of FD. Those were generated by ClearScan to provide a high quality rendering of the text. If you copy the text in the pdf and paste it into a word processor, then you will see a fairly accurate copy of the text in the original raster images.

    Now, close the file in Acrobat and open it in Mac Preview. You will find that copy-paste still works. The OCR corruption occurs when you save the file in Preview. Several people have suggested that the annotation tool in Mac Preview is involved, but it is actually the save step that causes the problem, regardless of any annotation done before the save.

    You can use the Linux pdffonts tool to examine the fonts in the original Acrobat OCR pdf and in the Preview version of the pdf. You will find that Preview changes the font names and, even more puzzling, turns off the text encoding for the fonts (see the third column from the right, labeled "uni", in the pdffonts output).

    After the pdf has been saved in Preview, there is no easy way to recover the OCR text, other than re-frying the document (saving as raster images and recreating and OCRing the pdf). This problem has been active since at least 2009. Others have submitted bug reports to Apple, but it has yet to be resolved.
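
    The pdffonts comparison described above can be scripted. What follows is a minimal sketch, not a tested diagnostic: the function name count_fonts_without_unicode is made up for illustration, and it assumes the usual pdffonts table layout (two header lines, then one row per font that ends with the "emb", "sub", and "uni" flags followed by the object number and generation). It prints how many fonts have "no" in the "uni" column.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): reads pdffonts output on stdin
# and prints the number of fonts whose "uni" column is "no".
# Assumed layout: 2 header lines, then rows ending with
#   emb sub uni object-number generation
# so the "uni" flag is the third field from the end of each row.
count_fonts_without_unicode() {
    awk 'NR > 2 && NF >= 3 && $(NF - 2) == "no" { n++ } END { print n + 0 }'
}
```

    For example, run "pdffonts original.pdf | count_fonts_without_unicode" and then the same command on the copy saved in Preview (both file names are placeholders). A higher count for the Preview copy would be consistent with the behavior described above.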
  • An important challenge associated with the OCR corruption problem is identifying the pdfs that have been corrupted. The bash script below searches a directory of pdf files and checks each one against four tests, including a test for corrupted OCR text. That last test calculates the combined frequency of the 10 most common letters in English (e, t, a, o, i, n, s, h, r, d) as a percentage of all alphabetic characters in the extracted text. The expected frequency in natural text is 74 percent, and the frequency in random text is 38 percent (10 of 26 letters). I have found that the midpoint, 56 percent, is a good discriminator for corrupted files.


    #!/bin/bash
    #==================================================================
    # findBadPdfs.sh
    # This script checks all of the pdf files in directory $D0, and
    # copies those that might need to be fixed to directory $D1.
    # The script checks for 4 problems, and reports these problems
    # to stdout using the following labels (in caps):
    # 1) >>NOT REGULAR FILE: bad file or filename
    # 2) >>PDFTOTEXT ERROR: error reported by pdftotext
    # 3) >>NO TEXT: pdf contains no OCR text
    # 4) >>CORRUPT TEXT: OCR text for the pdf is garbled
    # Dependencies:
    # pdftotext is a program included in xpdf
    # http://www.foolabs.com/xpdf/download.html
    # Linux commands: wc, bc, and tr
    #==================================================================
    #... Source directory
    D0="/PDF_directory"

    #... Target directory for broken pdfs
    D1="/PDF_broken"
    mkdir -p "$D1"

    #... Set IFS variable to account for filenames with spaces
    SAVEIFS=$IFS
    IFS=$(echo -en "\n\b")

    #... Loop through .pdf files in directory $D0
    for fullName in "$D0"/*.pdf
    do
        if test -f text.tmp ; then
            /bin/rm text.tmp
        fi
        iBadFile=0
        shortName=$(basename "$fullName")
        #... Check if pdf file is a regular file
        if [ ! -f "$fullName" ] ; then
            echo ">>NOT REGULAR FILE: $shortName"
            /bin/cp "$fullName" "$D1"/"$shortName"
            continue
        fi
        #... Convert to text and save error message, if present
        err=$(pdftotext "$fullName" text.tmp 2>&1)
        if [ -n "$err" ] ; then
            iBadFile=1
        fi
        #... Make sure text.tmp exists even if pdftotext wrote nothing,
        #    so the checks below report NO TEXT instead of a shell error
        touch text.tmp
        #... Check text content of pdf
        # Count number of alpha characters
        sumAlphas=$(tr -cd "[:alpha:]" < text.tmp | wc -c)
        if [ "$sumAlphas" -lt 10 ] ; then
            iBadFile=1
            echo ">>NO TEXT: $shortName"
        else
            #... Calculate percent frequency of common alpha characters
            # Natural occurrence in text is 73.8 percent for 10 most
            # frequent characters. Random occurrence is 38.5 percent.
            # Midpoint is a good cutoff = 56 percent
            sumFreq=$(tr -cd 'etaoinshrdETAOINSHRD' < text.tmp | wc -c)
            percentFreq=$(echo "scale=1; 100 * $sumFreq / $sumAlphas" | bc)
            if [ 1 -eq "$(echo "$percentFreq < 56" | bc)" ] ; then
                iBadFile=1
                echo ">>CORRUPT TEXT: $shortName Frequency = $percentFreq percent"
            fi
        fi
        #... If iBadFile == 1, copy file to broken directory
        if [ 1 -eq "$iBadFile" ] ; then
            /bin/cp "$fullName" "$D1"/"$shortName"
        fi
        #... Echo error message from pdftotext, if one was generated
        if [ -n "$err" ] ; then
            echo ">>PDFTOTEXT ERROR: $shortName"
            echo "$err"
        fi
    done
    #... Clean up
    if test -f text.tmp ; then
    /bin/rm text.tmp
    fi

    #... Return IFS variable back to its original value
    IFS=$SAVEIFS
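
    The letter-frequency test at the heart of the script can also be tried on its own. This sketch is an illustration under stated assumptions, not part of the original script: the function names common_letter_percent and looks_garbled are hypothetical, the text is read from stdin rather than extracted by pdftotext inside the function, and integer arithmetic replaces bc. The default cutoff of 56 percent is the one suggested above.

```shell
#!/bin/bash
# Hypothetical helpers, not part of findBadPdfs.sh.

# Print the integer percentage of alphabetic characters on stdin that are
# among the 10 most common English letters (e t a o i n s h r d).
# Prints nothing and returns 1 if the input has no alphabetic characters.
common_letter_percent() {
    local text alphas common
    text=$(cat)
    # $(( ... )) strips the padding that some wc versions add to counts
    alphas=$(( $(printf '%s' "$text" | tr -cd '[:alpha:]' | wc -c) ))
    common=$(( $(printf '%s' "$text" | tr -cd 'etaoinshrdETAOINSHRD' | wc -c) ))
    [ "$alphas" -gt 0 ] || return 1
    echo $(( 100 * common / alphas ))
}

# Succeed (exit status 0) if the text on stdin falls below the cutoff
# percentage (first argument, default 56), i.e. looks like garbled OCR text.
# Input without any letters is reported as "not garbled" (status 1),
# since that is the separate NO TEXT case.
looks_garbled() {
    local cutoff=${1:-56} pct
    pct=$(common_letter_percent) || return 1
    [ "$pct" -lt "$cutoff" ]
}
```

    A possible use, with some.pdf as a placeholder file name: pdftotext some.pdf - | looks_garbled && echo "possibly corrupt". Very short documents or text in other languages may fall below the cutoff without being corrupted, so a hit is a prompt to inspect the file rather than proof of damage.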
  • Update: Apple claims to have fixed the Preview PDF-corruption bug with Sierra 10.12.3 - see here: http://www.documentsnap.com/ocr-text-macos-sierra-preview/
  • edited April 20, 2017
    Thanks for the heads-up about a possible fix for the Preview PDF-corruption bug.
    The link is actually a discussion about corruption of OCR information in pdf documents generated with the ScanSnap scanner and software. I did not find any information there about statements from Apple about this problem.

    UNFORTUNATELY, the corruption bug that I described in September 2015 is still present in the most recent versions of macOS Sierra (10.12.4), Preview 9.0 (909.17), and Adobe Acrobat (XI, 11.0.20). I reported the problem as a bug to Apple, but to the best of my knowledge there has been no fix yet.

    The problem is that Preview continues to overwrite some types of OCR information in pdf files. My experience with this problem is limited to the OCR information generated using ClearScan in Adobe Acrobat. I include here a test I devised for the problem (repeated from my Sept 2015 message above).

    "One can reproduce the problem by starting with a pdf created from raster images of a text document. Use Acrobat with the ClearScan option to OCR the pdf, and then save the pdf. If you check the properties of the document, you will see a bunch of embedded fonts, with a common prefix of FD. Those were generated by ClearScan to provide a high quality rendering of the text. If you copy the text in the pdf and paste it into a word processor, then you will see a fairly accurate copy of the text in the original raster images.

    Now, close the file in Acrobat and open it in Mac Preview. You will find that copy-paste still works. The OCR corruption occurs when you save the file in Preview. Several people have suggested that the annotation tool in Mac Preview is involved, but it is actually the save step that causes the problem, regardless of any annotation done before the save."


  • thanks for the clarification: I'm giving up on Preview and moving to Adobe Acrobat DC for Mac - more attractive to me now that it's easy to highlight in different colours without laboriously changing the colour each time via Properties.

    it really is stupefying that Apple hasn't fixed this issue in Preview by now...
  • edited May 5, 2017
    update (previous comment deleted): as @foret37 suggests above, the OCR corruption *does* remain a problem in Sierra 10.12.4, albeit for fewer pdfs than previously - see here: https://discussions.apple.com/thread/7905824?start=0&tstart=0. I've tested this myself and it is still a problem.