Exponential growth: Too many fields to update??
I have written a Ph.D. dissertation of ca. 1,600 pages with about 7,000 Zotero fields and roughly 3,500 entries in my bibliography. I have split the dissertation into 60 files of about 20 pages each. Within each single doc file (MS Word 2003) I can use Zotero perfectly and update all fields.
Yet as soon as I combine all files into one large Word file, Zotero breaks down. I have tried to update the entire document on several PCs: either Zotero reports "out of memory" or, after days, it simply stops responding. Zotero needs up to 3 GB of memory while updating the fields.
I think the problem is the disambiguation of inline citations, e.g. Blum 2015a vs. Blum 2015b, or P. Blum 2015 vs. R. Blum 2017. To do this, each field has to be compared to every other field. This happens after the debug line "Integration: style.processCitationCluster"; at its end, many, many individual citationIDs are listed inside []. If 10 fields take roughly 10 seconds to update, then, given this quadratic growth, the full 7,000 fields will take weeks of processing. No wonder any machine eventually gives up.
I have chosen (i.e. created, based on CMoS) an inline author-date citation style that only disambiguates by year (Blum 2015a vs. Blum 2015b), as it always adds initials to names.
I need Zotero to perform an update of all fields in one document in order to create a full bibliography. Obviously, I cannot combine the single bibliographies of all 60 files manually. I can, however, update all Zotero fields in the single files beforehand.
Does anybody have an idea of how to proceed? As this is a major work of some importance to me, I would be incredibly grateful for any kind of remark!
version => 5.0.52, platform => Win32, oscpu => Windows NT 5.1, locale => de, appName => Zotero, appVersion => 5.0.52, extensions => Zotero LibreOffice Integration (5.0.14.SA.5.0.52, extension), Zotero Word for Windows Integration (5.0.8.SA.5.0.52, extension), Shockwave Flash (21.0.0.182, plugin, disabled)
Regarding year versus initial disambiguation: any disambiguation like that is going to slow down updates; the exact format doesn't really matter.
Try to update to the Zotero Beta to see if the issue is fixed:
https://www.zotero.org/support/dev_builds
The “cluster” error you are getting suggests that one of your citations is getting corrupted in the cut-and-paste process. Try combining and updating the files one at a time to see if you can locate which citation is corrupt, then replace it.
There is a fairly painless workaround we can do if you are under a time crunch. Insert bibliographies into the individual files (or a few smaller combined files) and use Unlink Citations to convert these to plain text (save a copy of the files with live citations). Then cut and paste the documents and bibliographies together. Select all of the now-combined bibliographies and use Word’s sort function to sort them alphabetically.
I am not yet in a hurry (submission is due in December), but I should start thinking about these issues now.
What would be your estimate of how long it could take for all fields to update on a typical office machine (8 GB RAM, 2.4 GHz Intel)?
@c-sander Your best bet is to do it on a more recent machine that runs Windows 10 and a newer version of Office (at least 2013), has more RAM (at least 8 GB would be good) and a better CPU. There might be some exponential component to the updates, although it is likely not very steep: updating 10 fields on a new machine should take a couple of seconds at most, and even up to 1,000 fields should not take more than 15 minutes. I haven't seen anyone working with a document with so many citations, so you are definitely in unexplored waters, but with a beefier machine you should be able to update the doc.
For me, it does not matter whether it takes 1 or 10 hours, but in principle I cannot see why Zotero cannot compile the bibliography on any machine. I have uploaded a shortened version of the log (http://heimat.de/home/c-sander/Zotero_Log.txt). As you can see, the last "Integration: style.processCitationCluster" entry already has hundreds of entries in its [] list -- the growth described above. This final entry (after which Zotero broke down) is only at page 160 of the 1,500 pages of my entire file, so we can imagine how long these [] lists will get as it proceeds.
Is it certain that Zotero can address more than 2 GB of memory? (In Photoshop, e.g., this has to be configured.)
Is there no way to avoid passing all previous citationIDs in [] to the style.processCitationCluster command?
In the log that you have posted Zotero seems to have processed nearly all (6425 out of 6571) citations in your document. It is quite unfortunate that it stops there and it's unclear why. That part of the process is Zotero working out all the disambiguations and formatting of citations, after which it then still has to write those back into the document.
You could try running the update on this machine after having removed the last portion of the document with 200ish citations and see if that completes (or at least gets further along the process than it does now).
Zotero does not always stop at the same citation: this varies not only from PC to PC (I have tested it on 4 PCs [WinXP, Win7 and Win10, though none had more than 8 GB RAM]), but also from trial to trial. But RAM seems to be a major issue, as every PC has its own, let's say, "Zotero memory limit". The machine (4 GB RAM) that produced the partially uploaded log stopped at 1.7 GB; another machine with 8 GB RAM stopped when Zotero reached 3 GB. It seems that the swap file is not used at all.
Thus, it does not seem to be an issue with one particular field (also, no problems occur when updating the small files individually).
I will send you a link to a full log from the same machine asap.
I hope this issue is of general interest to you as it addresses benchmark and performance issues.
Now it stopped at a different field, at around the same point in time.
Here you see the performance when started: http://heimat.de/home/c-sander/z1.jpg
Here you see the performance when stopped: http://heimat.de/home/c-sander/z2.jpg
Zotero remained open and no error message occurred. The progress bar also remained, but did not respond.
For the posted log I used the regular CMoS author-date style, German, I think. However, the problem occurs regardless of the citation style. I modified the CMoS style to get rid of the disambiguation (http://heimat.de/home/c-sander/dissertation-sander.csl). Yet citeproc performs this task regardless.
To me, the memory problem is not much of a surprise. Think about what an each-to-each comparison means: on the order of 7,000² operations, all temporarily stored in RAM.
I think the easiest way to fix it would be to write batches to a swap file. This would slow down the process but make it more stable.
But I'm no developer. I'm sure you will succeed -- maybe too late for me, but the next case will come, as Zotero is used more and more.
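The pairwise cost sketched above can be checked with a quick calculation (a sketch only; that the processor actually performs one comparison per pair is my assumption):

```shell
# n fields require n*(n-1)/2 unique pairwise comparisons -- quadratic, not exponential.
for n in 10 100 1000 7000; do
  echo "$n fields -> $(( n * (n - 1) / 2 )) comparisons"
done
# 7000 fields -> 24496500 comparisons
```

Going from 10 to 7,000 fields multiplies the pair count by roughly 500,000, which would match the observation that the small files update quickly while the combined document does not.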
The large lists that you see in the trace are citationID values. The processor maintains a list of citations in sequence for three purposes: to identify when citations need to be regenerated, for disambiguation; to identify the position at which a citation has been inserted or modified, for back-referencing; and to identify when a citation and its items have been removed from the document, for bibliography maintenance. With a stripped-down, totally static style, the first two purposes go away, but we still need the tracking list accompanying each insert for the third purpose.
Generating a citation for each of the 3,113 unique items reflected in the log file takes 7 seconds on my system here. So it's the incremental insert operations (in the processor) done by processCitationCluster() that seem to be the sticking point. I'll work at getting a large chunk of your citation set working in a browser environment (locally). That will allow use of memory profiling tools that may turn up a leak in the processor.
If things work as expected, you should see a significant speed improvement, and the job should not collapse before completion. We can't be certain that it will work for you, since you will be the first to trial the code on a job of this size, but the prospect of success is at least reasonably good. If you give it a try, let us know how it works for you.
Unfortunately, the same thing happened as before: everything looked fine while composing the bibliography for about 90 minutes, until Zotero climbed beyond 3 GB of RAM. At 3.2 GB, Zotero was terminated, so the debug log was lost as well. Is there a way to access past logs, or to store them in case the process terminates suddenly?
I will try it on a different machine, too, just to be sure. But I fear it won't make a difference.
Also, @admasven may have further thoughts.
@fbennett The consumption is negligible.
https://www.lifewire.com/how-to-redirect-command-output-to-a-file-2618084
C:\Programme\Zotero\zotero.exe -ZoteroDebugText >C:\zotero.log
The log file is created, but nothing is written into it. What am I missing?
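One possible cause, assuming Zotero behaves like other Windows GUI programs: debug messages may go to stderr rather than stdout, and a plain `>` only captures stdout. Redirecting both streams from a Command Prompt is worth a try (the path below is the one from your command; adjust as needed):

```shell
REM cmd.exe syntax: 2>&1 sends stderr to the same file as stdout
"C:\Programme\Zotero\zotero.exe" -ZoteroDebugText > C:\zotero.log 2>&1
```

If the file stays empty even then, the process may be detaching from the console entirely.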
If I run '/cygdrive/c/Program Files (x86)/Zotero/zotero.exe' -ZoteroDebugText in Cygwin, where do I find the output after Zotero closes?
Sorry, this is very basic, but I can't figure it out by myself. Could you give me a hint? Thanks!
If I try to save the log from the terminal, the file is not created (out of memory). If I select all and copy to the clipboard, Zotero is terminated immediately.
There are some differences since Porpachi:
The "out of memory"-error reoccurs.
After this error, the CPU usage of Z remains high (30%), but RAM freezes
Zotero windows can be activated but remain black (Hotkeys still apply)
I have made some screenshots:
http://heimat.de/home/c-sander/z3.jpg
http://heimat.de/home/c-sander/z4.jpg
http://heimat.de/home/c-sander/z5.jpg
http://heimat.de/home/c-sander/z6.jpg
http://heimat.de/home/c-sander/z7.jpg
Unless there is a way for the log to be written to a file automatically and continuously, surviving even an abrupt termination of Zotero, I don't know how to provide the log at all.
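Assuming -ZoteroDebugText writes its output to the launching terminal, piping it through tee would give exactly that: tee copies each line to a file as it arrives, so the on-disk log survives even if Zotero is killed mid-run (a sketch; the Cygwin path matches the one used earlier in this thread):

```shell
# Cygwin: 2>&1 merges stderr into stdout; tee writes the stream to disk line by line,
# so /cygdrive/c/zotero.log is preserved even after an abrupt termination.
'/cygdrive/c/Program Files (x86)/Zotero/zotero.exe' -ZoteroDebugText 2>&1 | tee /cygdrive/c/zotero.log
```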