Exponential growth: Too many fields to update??
I have written a Ph.D. dissertation of ca. 1,600 pages with about 7,000 Zotero fields and roughly 3,500 entries in my bibliography. I have split the dissertation into 60 files of about 20 pages each. Within each single doc file (MS Word 2003) I can use Zotero perfectly and update all fields.
Yet as soon as I combine all files into one large Word file, Zotero breaks down. I have tried to update the entire document on several PCs: either Zotero reports "out of memory" or, after days, it simply stops responding. Zotero needs up to 3 GB of memory while updating the fields.
I think the problem is the disambiguation of inline citations, e.g. Blum 2015a vs. Blum 2015b, or P. Blum 2015 vs. R. Blum 2017. To do this, each field has to be compared to every other field. This happens after the debug line "Integration: style.processCitationCluster"; at its end, many, many individual citationIDs are listed inside []. If 10 fields take roughly 10 seconds to update, then, given this quadratic growth, the full 7,000 fields will take weeks of processing. No wonder any machine eventually gives up.
I have chosen (i.e. created, based on CMoS) an inline author-date citation style that only disambiguates by year (Blum 2015a vs. Blum 2015b), as it always adds initials to names.
I need Zotero to perform an update of all fields in one document in order to create a full bibliography. Obviously, I cannot combine the single bibliographies of all 60 files manually. I can, however, update all Zotero fields in the single files beforehand.
Does anybody have an idea of how to proceed? As this is a major work of some importance to me, I would be incredibly grateful for any kind of remark!
version => 5.0.52, platform => Win32, oscpu => Windows NT 5.1, locale => de, appName => Zotero, appVersion => 5.0.52, extensions => Zotero LibreOffice Integration (5.0.14.SA.5.0.52, extension), Zotero Word for Windows Integration (5.0.8.SA.5.0.52, extension), Shockwave Flash (21.0.0.182, plugin, disabled)
Regarding year versus initial disambiguation: any disambiguation like that is going to slow down updates; the exact format doesn't really matter.
Try to update to the Zotero Beta to see if the issue is fixed:
https://www.zotero.org/support/dev_builds
The “cluster” error you are getting suggests that one of your citations is getting corrupted in the cut-and-paste process. Try combining and updating the files one at a time to see if you can locate which citation is corrupt, then replace it.
There is a fairly painless workaround we can do if you are under a time crunch. Insert bibliographies into the individual files (or a few smaller combined files) and use Unlink Citations to convert these to plain text (save a copy of the files with live citations). Then cut and paste the documents and bibliographies together. Select all of the now-combined bibliographies and use Word’s sort function to sort them alphabetically.
I am not yet in a hurry (submission is due in December), but I should start thinking about these issues now.
What would be your estimate of how long it could take for all fields to update on a typical office machine (8 GB RAM, 2.4 GHz Intel)?
@c-sander Your best bet is to do it on a more recent machine that runs Windows 10 and a newer version of Office (at least 2013), has more RAM (at least 8 GB would be good) and a better CPU. There might be some exponential component to the updates, although it is likely not very steep: updating 10 fields on a new machine should take a couple of seconds at most, and even up to 1,000 fields should not take more than 15 minutes. I haven't seen anyone working with a document with so many citations, so you are definitely in unexplored waters, but with a beefier machine you should be able to update the doc.
For me, it does not matter whether it takes 1 or 10 hours, but in principle I cannot see why Zotero cannot compile the bibliography on any machine. I have uploaded a shortened version of the log (http://heimat.de/home/c-sander/Zotero_Log.txt). As you can see, the last "Integration: style.processCitationCluster" entry already has hundreds of entries in its [] list -- the growth described above. This final entry (after which Zotero broke down) is only at page 160 of the 1,500 pages of my entire file, so we can imagine how long these [] lists will get as it proceeds.
Is it certain that Zotero can address more than 2 GB of memory? (In Photoshop, e.g., this has to be configured.)
Is there no way to avoid passing all previous citationIDs in [] to the style.processCitationCluster command?
In the log that you have posted Zotero seems to have processed nearly all (6425 out of 6571) citations in your document. It is quite unfortunate that it stops there and it's unclear why. That part of the process is Zotero working out all the disambiguations and formatting of citations, after which it then still has to write those back into the document.
You could try running the update on this machine after having removed the last portion of the document with 200ish citations and see if that completes (or at least gets further along the process than it does now).
Zotero does not always stop at the same citation: this varies not only from PC to PC (I have tested it on 4 PCs [WinXP, Win7 and Win10, though none had more than 8 GB RAM]), but also from trial to trial. But RAM seems to be a major issue, as every PC has its own, let's say, "Zotero memory limit". The machine (4 GB RAM) that produced the partially uploaded log stopped at 1.7 GB; another machine with 8 GB RAM stopped when Zotero reached 3 GB. It seems that the swap file is not used at all.
Thus, it does not seem to be an issue with one particular field (also, no problems occur when updating the small files individually).
I will send you a link to a full log from the same machine asap.
I hope this issue is of general interest to you as it addresses benchmark and performance issues.
Now it stopped at a different field, at around the same point in time.
Here you see the performance when started: http://heimat.de/home/c-sander/z1.jpg
Here you see the performance when stopped: http://heimat.de/home/c-sander/z2.jpg
Zotero remained open and no error message occurred. The progress bar also remained, but did not respond.
For the posted log I used the regular CMoS author-date style, German, I think. However, the problem occurs regardless of the citation style. I modified the CMoS style to get rid of the disambiguation (http://heimat.de/home/c-sander/dissertation-sander.csl). Yet citeproc performs this task regardless.
To me, the memory problem is not much of a surprise. Think about what an each-to-each comparison means: on the order of 7,000² operations, all temporarily stored in RAM.
I think the easiest way to fix it would be to write batches to a swap file. This would slow down the process but make it more stable.
But I'm no developer. I'm sure you will succeed -- maybe too late for me, but the next case will come, as Zotero is used more and more.
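The pairwise cost sketched above can be checked with a quick calculation (a sketch only; that the processor actually performs one comparison per pair is my assumption):

```shell
# n fields require n*(n-1)/2 unique pairwise comparisons -- quadratic, not exponential.
for n in 10 100 1000 7000; do
  echo "$n fields -> $(( n * (n - 1) / 2 )) comparisons"
done
# 7000 fields -> 24496500 comparisons
```

Going from 10 to 7,000 fields multiplies the pair count by roughly 500,000, which would match the observation that the small files update quickly while the combined document does not.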
The large lists that you see in the trace are citationID values. The processor maintains a list of citations in sequence for three purposes: to identify when citations need to be regenerated, for disambiguation; to identify the position at which a citation has been inserted or modified, for back-referencing; and to identify when a citation and its items have been removed from the document, for bibliography maintenance. With a stripped-down, totally static style, the first two purposes go away, but we still need the tracking list accompanying each insert for the third purpose.
Generating a citation for each of the 3,113 unique items reflected in the log file takes 7 seconds on my system here. So it's the incremental insert operations (in the processor) done by processCitationCluster() that seem to be the sticking point. I'll work at getting a large chunk of your citation set working in a browser environment (locally). That will allow use of memory profiling tools that may turn up a leak in the processor.
If things work as expected, you should see a significant speed improvement, and the job should not collapse before completion. We can't be certain that it will work for you, since you will be the first to trial the code on a job of this size, but the prospect of success is at least reasonably good. If you give it a try, let us know how it works for you.
Unfortunately, the same thing happened as before: everything looked fine while composing the bibliography for about 90 minutes, until Zotero climbed beyond 3 GB of RAM. At 3.2 GB, Zotero was terminated, so the debug log was lost as well. Is there a way to access past logs, or to store them in case the process terminates suddenly?
I will try it on a different machine, too, just to be sure. But I fear it won't make a difference.
Also, @admasven may have further thoughts.
@fbennett The consumption is negligible.
https://www.lifewire.com/how-to-redirect-command-output-to-a-file-2618084
C:\Programme\Zotero\zotero.exe -ZoteroDebugText >C:\zotero.log
The log file is created, but nothing is written into it. What am I missing?
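One possible cause, assuming Zotero behaves like other Windows GUI programs: debug messages may go to stderr rather than stdout, and a plain `>` only captures stdout. Redirecting both streams from a Command Prompt is worth a try (the path below is the one from your command; adjust as needed):

```shell
REM cmd.exe syntax: 2>&1 sends stderr to the same file as stdout
"C:\Programme\Zotero\zotero.exe" -ZoteroDebugText > C:\zotero.log 2>&1
```

If the file stays empty even then, the process may be detaching from the console entirely.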
If I run '/cygdrive/c/Program Files (x86)/Zotero/zotero.exe' -ZoteroDebugText in Cygwin, where do I find the output after Zotero closes?
Sorry, this is very basic, but I can't figure it out by myself. Could you give me a hint? Thanks!
If I try to save the log from the terminal, the file is not created (out of memory). If I select all and copy to the clipboard, Zotero is terminated immediately.
There are some differences since Porpachi:
The "out of memory"-error reoccurs.
After this error, the CPU usage of Z remains high (30%), but RAM freezes
Zotero windows can be activated but remain black (Hotkeys still apply)
I have made some screenshots:
http://heimat.de/home/c-sander/z3.jpg
http://heimat.de/home/c-sander/z4.jpg
http://heimat.de/home/c-sander/z5.jpg
http://heimat.de/home/c-sander/z6.jpg
http://heimat.de/home/c-sander/z7.jpg
Unless there is a way for the log to be written to a file automatically and continuously, surviving even an abrupt termination of Zotero, I don't know how to provide the log at all.
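Assuming -ZoteroDebugText writes its output to the launching terminal, piping it through tee would give exactly that: tee copies each line to a file as it arrives, so the on-disk log survives even if Zotero is killed mid-run (a sketch; the Cygwin path matches the one used earlier in this thread):

```shell
# Cygwin: 2>&1 merges stderr into stdout; tee writes the stream to disk line by line,
# so /cygdrive/c/zotero.log is preserved even after an abrupt termination.
'/cygdrive/c/Program Files (x86)/Zotero/zotero.exe' -ZoteroDebugText 2>&1 | tee /cygdrive/c/zotero.log
```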