Problems with text in beta annotation extraction

kdb_research · January 29, 2022

The new annotation function produces text that is not as readable as the old Zotfile extraction tool.

* Importantly, the new function is not handling line breaks well.

From the Zotfile extraction:

Chapter 4 A burgeoning credit economy (note on p.104)

"In the 1920s, outstanding debt again doubled, increasing from $3.3 billion in 1920 to over $7.6 billion in 1929. About half as much debt was outstanding in 1933 ($3.9 billion), after which time debt again grew rapidly, doubling in just six years." (Olney 1991:104)

"Debt as a percentage of income increased only gradually before World War I, as seen in figure 4.1, but then doubled in the 1920s." (Olney 1991:109)

"Cash purchase of many major durable goods commanded a large share of the average household's disposable income. In the 1920s, a Chevrolet cost about 20 percent of annual household income, and a Chrysler could cost over 60" (Olney 1991:113)

From the new function:

(Olney, 1991, p. 104) Chapter 4 A burgeoning credit economy

“In the 1920s, outstanding debt again doubled, increas- 0 C\l 0 ,.... m @ @ @ @ @ g g ,.... ,.... ,.... much debt was outstanding in 1933 ($3 .9 billion), after which time debt .,...; Q) O> O> O> >- ,.... ,.... ,.... ,.... ,.... ,.... ,.... ,.... ,.... ,.... ,.... again grew rapidly, doubling in just six years.” (Olney, 1991, p. 104)
In the 1920s, outstanding debt again doubled, increas-
ing from $3.3 billion in 1920 to over $7.6 billion in 1929. About half as
much debt was outstanding in 1933 ($3.9 billion), after which time debt
again grew rapidly, doubling in just six years.

“1 Debt as a percent::::> .2 O>(;:t> (/) <13 t:: ·- ~ c: > 0 al .s age of income increased only gradually before World War I, as seen in o_ EE (.) <13 "ffi y; . -- figure 4.1, but then doubled in the 1920s.” (Olney, 1991, p. 109)
. 1 Debt as a percent-
age of income increased only gradually before World War I, as seen in
figure 4.1, but then doubled in the 1920s.

kdb_research · February 4, 2022

I can withdraw this. It seems to be caused by extracting annotations made with a different PDF viewer. Using the built-in PDF viewer works fine on the same PDF.

dstillman · February 4, 2022

Zotero shouldn't be any worse at extracting annotations made with external PDF readers, though — if anything it should be better. Are you saying there's a difference?

kdb_research · February 4, 2022

Yes, there are important differences. Line breaks are the obvious ones, but also weird note duplications.

Using PDFxchange and Zotfile:

"our government is destroying two vital instruments of that growth-the system of contract rights and the large corporation." (Jensen and Meckling 1978:31)

"The courts have often taken the lead in revoking private rights, especially in the civil rights arena" (Jensen and Meckling 1978:31)

Using PDFxchange and the built-in extractor:

Annotations(2/4/2022, 9:14:00 AM)

“our government is destroying two vital instruments of that growth-the system of contract rights and the large corporation.” our government is destroying two vital instruments of that growth-the system of contract rights and the large corporation.

“The courts have often taken the lead in revoking private rights, especially in the civil rights arena” The courts have often taken the lead in revoking private rights, especially in the civil rights arena

dstillman · February 4, 2022

The strings after the quotes should be the comments, not the highlighted text. Do you see those strings in the comment field for those annotations in the built-in viewer?

What's an example of the line break issue?

kdb_research · February 4, 2022

There's an option in PDFXchange to insert the highlighted text into the highlight field. We don't see that as a comment, but if you click on the highlight you see the text. This was useful for exporting comments. Probably a very niche problem.

The breaks seem to be in the duplicated comment text. See below:

Annotations(2/4/2022, 9:29:30 AM)

“If a foreign creditor is so kind as to wait his time and buy the bullion as it comes into the country, he may be paid without troubling the Bank or distressing the money market. The German Government has recently been so kind”
If a foreign creditor
is so kind as to wait his time and buy the bullion
as it comes into the country, he may be paid
without troubling the Bank or distressing the
money market. The German Government has
recently been so kind

dstillman · February 4, 2022

"There's an option in PDFXchange to insert the highlighted text into the highlight field."

Into the comment field, you mean? If you're only adding these PDFs to Zotero you should turn that off — Zotero does its own job of identifying the text selected by the highlight, and the comment field is for user-entered text. And Zotero properly handles line breaks in highlighted text, but it doesn't do anything to line breaks in user-entered comments, which would be added by PDF-XChange in this case.

I'm guessing ZotFile automatically ignores (whitespace-normalized?) comments that match the highlighted text (and this suggests it does something like that). We've encountered this before with other PDF readers, so we've created a ticket to do something similar when parsing annotations.
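
For illustration only, here is a minimal Python sketch of the kind of whitespace-normalized comparison described above. It is not ZotFile's or Zotero's actual code: the function names `normalize_whitespace` and `comment_duplicates_highlight` are hypothetical, and treating "match" as exact equality after collapsing whitespace is an assumption.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including line breaks) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def comment_duplicates_highlight(highlight: str, comment: str) -> bool:
    """Return True when the comment is just the highlighted text again.

    Assumption: a comment counts as a duplicate only if it is non-empty and
    equal to the highlight once whitespace has been normalized.
    """
    normalized_comment = normalize_whitespace(comment)
    if not normalized_comment:
        return False
    return normalized_comment == normalize_whitespace(highlight)


# Example modeled on the thread: the comment copy carries hard line breaks.
highlight = "If a foreign creditor is so kind as to wait his time and buy the bullion"
comment = "If a foreign creditor\nis so kind as to wait his time and buy the bullion"
is_duplicate = comment_duplicates_highlight(highlight, comment)
```

With a check like this, a parser could drop the comment when `is_duplicate` is true and keep genuinely user-entered notes untouched.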

kdb_research · February 21, 2022

Thanks. If there's a way to test ignoring comments that match the highlighted text, please let me know.

dstillman · April 21, 2022

Zotero 6.0.5, available now, ignores comments if they closely match the highlighted text.
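
The thread does not say how Zotero decides that a comment "closely" matches, so the following Python sketch is only one plausible way to do it, not Zotero's implementation. The function name `closely_matches`, the use of `difflib.SequenceMatcher`, and the 0.9 threshold are all assumptions chosen for illustration. A similarity ratio tolerates small differences such as a word hyphenated across a line break in the comment copy.

```python
import re
from difflib import SequenceMatcher


def closely_matches(highlight: str, comment: str, threshold: float = 0.9) -> bool:
    """Return True when the comment is nearly identical to the highlight.

    Assumptions: whitespace is collapsed first, and "close" means a
    SequenceMatcher similarity ratio at or above `threshold` (hypothetical).
    """
    a = re.sub(r"\s+", " ", highlight).strip()
    b = re.sub(r"\s+", " ", comment).strip()
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


# The comment copy keeps the line-end hyphenation ("increas-" / "ing").
highlight = "In the 1920s, outstanding debt again doubled, increasing from $3.3 billion in 1920"
comment = "In the 1920s, outstanding debt again doubled, increas-\ning from $3.3 billion in 1920"
near_duplicate = closely_matches(highlight, comment)
```

An exact comparison would treat this pair as different because of the leftover hyphen, which is why a tolerance-based check is sketched here instead.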

kdb_research · April 22, 2022

Thanks. Works for me. Much appreciated.