PDF text searchability sometimes lost
Hi all,
This problem has been vexing me for a while, but since I've been unable to figure out why or even consistently when it happens, I've been hesitant to post about it. I'm not sure if this is related to Zotfile or something else entirely.
I am using Zotfile to move and rename my OCR-scanned PDFs. Often, text search stops working in PDFs that I open through their Zotero links, and text extraction using Zotfile returns a jumble of symbols. I can't figure out why this happens sometimes and not others.
An example: After realizing the loss of searchability in one of my PDFs, I downloaded another version of the same text. I opened it and confirmed that the new file was searchable. I then copied the file name of the original PDF, deleted the original, and renamed the new PDF to match it so that the Zotero link would be maintained. After opening the new PDF through the link, the searchability was lost.
I'm sorry that I can't bring more precision to my description. The only thread I've found that resembles my problem is this one: https://forums.zotero.org/discussion/33061/pdf-text-screwed-after-sync-to-webdav-server/
Any suggestions or help would be greatly appreciated!
Thanks!
Yeah, no go with the copy-pasting function; I also get symbols. After a bit more experimenting, I have realized that this happens to some files whether or not they are linked through Zotero.
It seems to be related to highlighting a file in Preview (Mac). Before highlighting, the text copies and pastes correctly. After highlighting and saving, the text is corrupted. I can highlight and save the same file in Acrobat and the text remains uncorrupted, but the first time I save a new highlight made in Preview, the text is corrupted.
It doesn't happen with every file though, and I don't know what the determining variable is.
I guess this is no longer a question for the Zotero forums, but if anyone happens to have any idea why this might be happening and how I can fix it, I would be much obliged!
Bottom line: Preview invisibly corrupts the OCR text on some pdfs.
https://forums.adobe.com/thread/1165888
https://discussions.apple.com/thread/4487157
No response on the official Apple forum... *sigh*
Finally, a couple of possible workarounds (disclaimer: I have not tried either of them yet):
http://n8henrie.com/2013/06/how-to-remove-corrupt-ocr-data-from-a-pdf/
http://www.verypdf.com/app/pdf-repair-mac/try-and-buy.html
One can reproduce the problem by starting with a pdf created from raster images of a text document. Use Acrobat with the ClearScan option to OCR the pdf, and then save the pdf. If you check the properties of the document, you will see a bunch of embedded fonts, with a common prefix of FD. Those were generated by ClearScan to provide a high quality rendering of the text. If you copy the text in the pdf and paste it into a word processor, then you will see a fairly accurate copy of the text in the original raster images.
Now, close the file in Acrobat and open it in Mac Preview. You will find that copy-paste still works. The OCR corruption occurs when you save the file in Preview. Several people have suggested that the annotation tool in Mac Preview is involved, but it is actually the save step that causes the problem, regardless of any annotation done before the save.
You can use the pdffonts command-line tool to compare the fonts in the original Acrobat OCR pdf with those in the Preview version of the pdf. You will find that Preview changes the font names but, even more puzzling, it turns off the text encoding (the ToUnicode map) for the fonts (see the third column from the right, labeled "uni", in the pdffonts output).
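For anyone who wants to script that comparison, here is a minimal sketch that counts the fonts whose "uni" column reads "no". The function name is made up for illustration, and it assumes that each font row of the pdffonts output ends with the fields emb, sub, uni, object number, generation. Some pdffonts builds print extra columns, so check the header line of your own output first.

```shell
#!/bin/bash
# Count fonts whose "uni" column reads "no" in pdffonts output (read from stdin).
# Assumption: every font row ends with: emb sub uni <object number> <generation>,
# so "uni" is the third whitespace-separated field from the right.
# The first two lines (column header and dashes) are skipped.
count_fonts_without_unicode() {
    awk 'NR > 2 && NF >= 5 && $(NF-2) == "no" { n++ } END { print n + 0 }'
}

# Usage (file names are placeholders):
#   pdffonts original.pdf | count_fonts_without_unicode
#   pdffonts saved_in_preview.pdf | count_fonts_without_unicode
```

If the count goes from zero for the Acrobat file to non-zero for the copy saved in Preview, that matches the behavior described above.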
After the pdf has been saved in Preview, there is no easy way to recover the OCR text, other than re-frying the document (saving as raster images and recreating and OCRing the pdf). This problem has been active since at least 2009. Others have submitted bug reports to Apple, but it has yet to be resolved.
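A rough sketch of the re-frying route, for those without Acrobat. This is not the ClearScan workflow described above: it swaps in pdftoppm and pdfunite (from poppler) and Tesseract for the OCR step, the helper name and the 300 dpi setting are my own choices, and I have only outlined it, so treat it as a starting point. The result is a rasterized pdf, so the crisp ClearScan fonts are lost.

```shell
#!/bin/bash
# Rasterize a pdf and OCR it again (hypothetical helper, not from this thread).
# Assumes pdftoppm, tesseract, and pdfunite are installed and on the PATH.
# Usage: refry_pdf input.pdf output.pdf
refry_pdf() {
    local src="$1" dst="$2" tmp img status
    [ -f "$src" ] || { echo "refry_pdf: no such file: $src" >&2; return 1; }
    tmp=$(mktemp -d) || return 1
    # Render every page to a 300 dpi PNG named page-<number>.png
    pdftoppm -r 300 -png "$src" "$tmp/page" || { rm -rf "$tmp"; return 1; }
    # OCR each page image into a one-page searchable pdf next to it
    for img in "$tmp"/page-*.png; do
        tesseract "$img" "${img%.png}" pdf || { rm -rf "$tmp"; return 1; }
    done
    # Join the pages; this relies on pdftoppm zero-padding the page
    # numbers so that the glob expands in page order
    pdfunite "$tmp"/page-*.pdf "$dst"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Run it on a copy of the damaged file, for example `refry_pdf broken.pdf fixed.pdf` (placeholder names), and compare the copied text against the page images before replacing anything.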
#!/bin/bash
#==================================================================
# findBadPdfs.sh
# This script checks all of the pdf files in directory $D0, and
# copies those that might need to be fixed to directory $D1.
# The script checks for 4 problems, and reports these problems
# to stdout using the following labels (in caps):
# 1) >>NOT REGULAR FILE: bad file or filename
# 2) >>PDFTOTEXT ERROR: error reported by pdftotext
# 3) >>NO TEXT: pdf contains no OCR text
# 4) >>CORRUPT TEXT: OCR text for the pdf is garbled
# Dependencies:
# pdftotext is a program included in xpdf
# http://www.foolabs.com/xpdf/download.html
# Linux commands: wc, bc, and tr
#==================================================================
#... Source directory
D0="/PDF_directory"
#... Target directory for broken pdfs (-p: no error if it exists)
D1="/PDF_broken"
mkdir -p "$D1"
#... Set IFS variable to account for filenames with spaces
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
#... Loop through .pdf files in directory $D0
for fullName in "$D0"/*.pdf
do
    #... Start with an empty text.tmp, so that a failed pdftotext
    #    run cannot leave the file missing or stale
    : > text.tmp
    iBadFile=0
    shortName=$(basename "$fullName")
    #... Check if pdf file is a regular file
    if [ ! -f "$fullName" ] ; then
        echo ">>NOT REGULAR FILE: $shortName"
        /bin/cp "$fullName" "$D1"/"$shortName"
        continue
    fi
    #... Convert to text and save error message, if present
    err=$(pdftotext "$fullName" text.tmp 2>&1)
    if [ -n "$err" ] ; then
        iBadFile=1
    fi
    #... Check text content of pdf
    #    Count number of alpha characters (strip the padding that
    #    some versions of wc add around the count)
    sumAlphas=$(tr -cd "[:alpha:]" < text.tmp | wc -c | tr -d ' ')
    if [ "$sumAlphas" -lt 10 ] ; then
        iBadFile=1
        echo ">>NO TEXT: $shortName"
    else
        #... Calculate percent frequency of common alpha characters
        #    Natural occurrence in text is 73.8 percent for 10 most
        #    frequent characters. Random occurrence is 38.5 percent.
        #    Midpoint is a good cutoff = 56 percent
        sumFreq=$(tr -cd 'etaoinshrdETAOINSHRD' < text.tmp | wc -c | tr -d ' ')
        percentFreq=$(echo "scale=1; 100 * $sumFreq / $sumAlphas" | bc)
        if [ 1 -eq "$(echo "$percentFreq < 56" | bc)" ] ; then
            iBadFile=1
            echo ">>CORRUPT TEXT: $shortName Frequency = $percentFreq percent"
        fi
    fi
    #... If iBadFile == 1, copy file to broken directory
    if [ 1 -eq "$iBadFile" ] ; then
        /bin/cp "$fullName" "$D1"/"$shortName"
    fi
    #... Echo error message from pdftotext, if one was generated
    if [ -n "$err" ] ; then
        echo ">>PDFTOTEXT ERROR: $shortName"
        echo "$err"
    fi
done
#... Clean up
/bin/rm -f text.tmp
#... Return IFS variable back to its original value
IFS=$SAVEIFS
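To try the letter-frequency test on a single file before running the whole script, the same idea can be wrapped in a small function. This is only a sketch: the function name is made up, and it uses awk instead of bc for the division, so the result is rounded to one decimal place rather than truncated. The cutoff itself comes from the numbers in the script comments: 10 of 26 letters is about 38.5 percent for random text, and the midpoint between 38.5 and 73.8 is about 56.

```shell
#!/bin/bash
# Print the percentage of alphabetic characters on stdin that are one of
# the ten common letters e, t, a, o, i, n, s, h, r, d (either case).
# Prints 0.0 when the input contains no alphabetic characters.
percent_common_letters() {
    local text alphas common
    text=$(cat)
    # Count all letters, then only the ten common ones
    alphas=$(printf '%s' "$text" | tr -cd '[:alpha:]' | wc -c | tr -d '[:space:]')
    common=$(printf '%s' "$text" | tr -cd 'etaoinshrdETAOINSHRD' | wc -c | tr -d '[:space:]')
    if [ "$alphas" -eq 0 ] ; then
        echo "0.0"
    else
        awk -v c="$common" -v a="$alphas" 'BEGIN { printf "%.1f\n", 100 * c / a }'
    fi
}

# Usage (file name is a placeholder; "-" sends the text to stdout):
#   pdftotext some_file.pdf - | percent_common_letters
```

A value well below 56 suggests garbled OCR text, while ordinary English text should land near the 73.8 percent figure.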
The link is actually a discussion about corruption of OCR information in pdf documents generated with the ScanSnap scanner and software. I did not find any information there about statements from Apple about this problem.
UNFORTUNATELY, the corruption bug that I described in September 2015 is still present in the most recent versions of macOS Sierra (10.12.4), Preview 9.0 (909.17), and Adobe Acrobat (XI, 11.0.20). I reported the problem as a bug to Apple, but to the best of my knowledge there has been no fix yet.
The problem is that Preview continues to overwrite some types of OCR information in pdf files. My experience with this problem is limited to the OCR information generated using ClearScan in Adobe Acrobat. I include here a test I devised for the problem (repeated from my Sept 2015 message above).
"One can reproduce the problem by starting with a pdf created from raster images of a text document. Use Acrobat with the ClearScan option to OCR the pdf, and then save the pdf. If you check the properties of the document, you will see a bunch of embedded fonts, with a common prefix of FD. Those were generated by ClearScan to provide a high quality rendering of the text. If you copy the text in the pdf and paste it into a word processor, then you will see a fairly accurate copy of the text in the original raster images.
Now, close the file in Acrobat and open it in Mac Preview. You will find that copy-paste still works. The OCR corruption occurs when you save the file in Preview. Several people have suggested that the annotation tool in Mac Preview is involved, but it is actually the save step that causes the problem, regardless of any annotation done before the save."
It really is stupefying that Apple hasn't fixed this issue in Preview by now...