Zotfile is importing highlighted text (with formatting errors) instead of corrected text in annotations

Many PDF files have text formatting errors or bad OCR results, so when I make annotations I correct the errors. But when I try to extract the annotations into Zotero, it is not getting my corrected annotation, it is getting the messed-up original highlighted text. So I have to go through sometimes hundreds of annotations to correct the formatting again. How can I get Zotfile to only bring in my annotations, and ignore the original text?
  • How exactly are you correcting the text? I'm doubtful you'll be able to do much about ZotFile's behavior here, but there might be a way to change what/how you're correcting in a way that ZotFile picks it up.
  • I copy the excerpt before highlighting it, and then paste it into the annotation. When the formatting is bad, I fix it manually in the annotation. Here's an example of the text as currently formatted in a PDF file: "S i mp l y to take the informatio out of the context in which it arose and use it generall does not solve the problem...".

    When I create an annotation and fix the formatting it looks like this: "Simply to take the information out of the context in which it arose and use it generally does not solve the problem...".

    Zotfile is not extracting the annotation, it is extracting the highlighted text. How can I make it extract the annotation only?
  • Zotfile extracts three types of PDF annotations: highlighted text, text in comment bubbles, and underlined text. I don't believe there is a way to easily stop one of those from being extracted.
  • Can it bring in annotations connected to highlighted text? If so, it would be easy enough to delete the original text and keep only the annotation.
  • I believe it does, but you would need to try it.
  • I have tried it, and it does not. If I have highlighted text with an associated annotation, Zotfile extracts only the highlighted text, not the annotation.
  • I'm afraid that's not going to change in ZotFile any time soon. Depending on your PDF reader, you can fix the OCR'd text directly rather than just the annotation. That would obviously work in ZotFile (and would also fix copy & paste and search in the PDF proper).
  • For anyone who is still having this problem with extracting your personal comment from underlined annotations:

    First, you need to understand how annotations are stored by tools that follow the standard format (e.g. Adobe Acrobat) and save your annotations directly into the metadata of the original PDF file (unlike Skim, which apparently writes to a second file). Each such annotation is actually composed of two different pieces of metadata: the "markup" and the "content".

    The "markup" part is a copy of the exact text that you either highlighted or underlined. There is no markup part for a Sticky Note since it is just a free-floating annotation.

    The "content" part is a copy of your personal comment that you type. For instance, it is the part that you see on the right side of the screen in the Comments toolbar if you're using Adobe Reader.

    So for rgfuller's example above, the markup is the messed up OCR text that he or she highlighted in the PDF and the content is his or her correctly retyped text in the comment sidebar.

    For whatever reason, Zotfile currently works like this:

    Sticky Notes: only the content part is extracted (which makes sense because sticky notes don't markup text)

    Highlighting: both parts are extracted (maybe this was fixed after the comments above?)

    Underlining: only the markup part is extracted. Extraction of annotations by Zotfile will not return your content (personal comment)!

    The sort of good news is that if you can handle some text file editing and zip file manipulation, it is not too hard to switch the functionality between extraction of highlights and underlines to retrieve your personal comments from underlined annotations. Please see: https://forums.zotero.org/discussion/comment/348732. It is unfortunately not a fix to get both highlights and underlines to work as expected simultaneously, but it is a workaround to at least extract what you need.

    That post also has a way to essentially turn off the extraction of the markup portion as requested by rgfuller above.

    Hope that helps someone!
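
For illustration, the markup/content split and the per-type behavior described in the last post can be sketched in a few lines of Python. This is only a toy model of what the thread reports, not ZotFile's actual code: the `Annotation` class, the `EXTRACTED_PARTS` table, and both functions are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical model of one PDF annotation as described in this thread."""
    kind: str                # "note", "highlight", or "underline"
    markup: Optional[str]    # text that was highlighted/underlined; None for sticky notes
    content: Optional[str]   # the personal comment typed by the reader


# Which parts are extracted for each annotation type, restating the post above.
# Assumption: this table mirrors the reported behavior, not ZotFile's source.
EXTRACTED_PARTS = {
    "note": ("content",),
    "highlight": ("markup", "content"),
    "underline": ("markup",),
}


def extract(annotation: Annotation) -> List[str]:
    """Return the pieces of text that would be extracted for this annotation."""
    parts = EXTRACTED_PARTS[annotation.kind]
    texts = [getattr(annotation, part) for part in parts]
    return [text for text in texts if text]  # skip empty or missing parts


def extract_content_only(annotation: Annotation) -> List[str]:
    """The behavior asked for in the original question: comment only, no markup."""
    return [annotation.content] if annotation.content else []


if __name__ == "__main__":
    garbled = "S i mp l y to take the informatio out of the context"
    fixed = "Simply to take the information out of the context"
    underline = Annotation("underline", markup=garbled, content=fixed)
    print(extract(underline))               # only the garbled markup comes back
    print(extract_content_only(underline))  # only the corrected comment comes back
```

In this model, the original complaint corresponds to `extract` returning the garbled markup for an underline, while `extract_content_only` is the behavior the workaround in the linked post is meant to approximate.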
