Large projects

First, let me heap praise on everybody involved in the development of CSL/Zotero: It's great, I use CSL/Zotero regularly (at least when I'm not using LaTeX/BibLaTeX), have prepared some styles myself and recommend it to my colleagues.

There are, however, a few problems with Zotero that boil down to a simple fact: It's not quite good enough to be really helpful with large documents including many, many references. And the irony is that, of course, large documents including many, many references are *exactly* what you would want to use reference management software for. To exaggerate a bit unfairly: Zotero is excellent with documents that don't require the effort (short papers with, say, ten to twenty citations) but can become quite bad when used for the book-size projects where it could make all the difference.

I can only imagine the difficulty and complexity of making CSL/Zotero work seamlessly with all the features it has to offer. So please don't get me wrong and don't think I'm ungrateful for saying this but I think the following two issues need to be addressed for Zotero to become the tool it should be...

1. It needs to get a lot faster when dealing with large projects. Adding, editing and refreshing citations is painfully slow when there are more than one or two hundred of them.

2. It needs an easy-to-use mechanism to manage the links between citations and the corresponding items in the library. These links can break (apparently for different reasons) and there is no convenient way to deal with broken ones. (I imagine a mechanism that lets you run through all citations in a document from beginning to end and re-assign "broken" links to probable items in the library or delete them with simple keystrokes.)
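The mechanism imagined in point 2 could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the dictionaries, the `item_key` and `title` fields, and both function names are hypothetical and do not reflect how Zotero actually stores citation links. It walks through the citations in order, flags those whose key is missing from the library, and ranks probable replacements by title similarity.

```python
from difflib import SequenceMatcher


def find_broken(citations, library):
    """Return the citations whose item key is missing from the library."""
    return [c for c in citations if c["item_key"] not in library]


def suggest_matches(citation, library, limit=3):
    """Rank library keys by title similarity to the citation's stored title."""
    def score(item_key):
        return SequenceMatcher(
            None,
            citation["title"].lower(),
            library[item_key]["title"].lower(),
        ).ratio()

    return sorted(library, key=score, reverse=True)[:limit]


# Hypothetical data, for illustration only.
library = {
    "A1": {"title": "A History of Writing Machines"},
    "B2": {"title": "Notes on Footnotes, Second Edition"},
}
citations = [
    {"id": 1, "item_key": "A1", "title": "A History of Writing Machines"},
    {"id": 2, "item_key": "ZZ9", "title": "Notes on Footnotes Second Edition"},
]

for c in find_broken(citations, library):
    # A real tool would ask the user to confirm, re-assign, or delete here.
    print(c["id"], "->", suggest_matches(c, library, limit=1))
```

A real tool would still need the user to confirm each suggestion, since title similarity alone can pick the wrong edition or a similarly named item.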

I am in the process of preparing for publication an edited collection of 30 contributions with a total of 1500+ citations and 500+ bibliographic items--a scope certainly not uncommon in the humanities. The contributors sent me their separate texts: a hodgepodge of references and bibliographies (inline, footnotes, author-year, author-title etc.). I had to decide whether to harmonize the formatting of references and bibliographies by hand or to use Zotero. I chose the latter and then put in a lot of work to

a) adapt one of my custom-made CSL styles to the publisher's guidelines,

b) generate the library of 500+ items and

c) replace all manual references in all texts with Zotero citations.

Now, obviously, a) and b) don't have anything to do with the problems mentioned above. But they are factors to be considered when you have to decide whether it is economical to use a reference management software or not.

All in all, I spent several days "migrating" the references to Zotero--and somewhere along the way, things went wrong (for which I am partially to blame). To cut a long story short: I ended up with 1500+ broken links. This is bad. But even if it had not happened: I suspect I actually spent more time fiddling around with Zotero than it would have taken me to clean up the references and bibliography by hand.

It hurts me to say this because I have a lot of admiration for all the hard work you are putting into Zotero: If I have to do another project of comparable size, I probably won't use Zotero. It is a great tool--but I think it must become better still.
  • You can look at this:
    http://zotero-odf-scan.github.io/zotero-odf-scan/
    Our prime intention wasn't to deal with large documents, but we feel that it will be quite useful for that purpose, too. Since the formatting is done via scan, there are no delays, and since the links to items are embedded in plain text, they're quicker to search and replace.

    That said, items in Zotero really don't typically become unlinked - in your case you imported/exported a library, something we warn against at every possible occasion.

    Speed in very large documents is an issue, though (and if I understand correctly, this is only going to get so much better because of limitations in the communication between Zotero and word processors).
    For most purposes, though, I don't see why keeping things in chapters until the last moment isn't an option.
  • edited July 9, 2013
    I think there is scope for significant speed improvements in word processor integration on large documents. The citation processor is capable of efficiently targeting specific citations for update based on changes in form, using an internal registry. As I understand it (Simon may correct me if I'm wrong), the algorithm currently used for document updates reconstructs the registry from scratch for each inserted citation, which creates a significant burden on large documents.

    This approach has had the advantage of avoiding too much reliance on the black-box internals of the citation processor (which is now a project separate from Zotero proper, with code that is admittedly not the most transparent). The processor has been quite stable of late, however, and in time we may see moves to make use of its update logic, if Simon and Dan conclude that there are advantages to doing so -- as adamsmith notes, there are fixed burdens with reading and writing to a large document through the APIs available to Zotero that may swamp potential speed gains.

    Note: This is not correct. Please see Simon's comments down-thread for more accurate information on the behaviour of the plugin. -- FB
  • We currently reconstruct the registry from scratch the first time a document is used within a session and when pressing the "Refresh" button, which, as Frank says, we should be able to speed up in a future Zotero release. When adding/editing citations, we only call processCitationCluster once, which should be fast.

    If adding/editing citations is slow, then it would help us to have a debug ID for the second citation inserted in a large document after restarting Zotero. It would also be useful to know which word processor you're using.
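The behaviour Simon describes (one full rebuild on first use in a session or on "Refresh", then a single update per added or edited citation) can be pictured with a toy model. This is a sketch under assumptions, not the plugin's code: the `Session` class and its methods are hypothetical, and the single increment in `insert` merely stands in for one `processCitationCluster` call.

```python
class Session:
    """Toy model that counts how many citations are processed per operation."""

    def __init__(self, document_citations):
        self.document = list(document_citations)
        self.registry = None   # built lazily on first use in the session
        self.processed = 0     # running count of per-citation work

    def _rebuild(self):
        # Full reconstruction: touch every citation in the document.
        self.registry = {}
        for cid in self.document:
            self.registry[cid] = True
            self.processed += 1

    def insert(self, cid):
        if self.registry is None:
            # First use of the document in this session.
            self._rebuild()
        # Stand-in for a single update call for the new citation.
        self.document.append(cid)
        self.registry[cid] = True
        self.processed += 1

    def refresh(self):
        # Pressing "Refresh" rebuilds everything again.
        self._rebuild()


session = Session(range(865))      # illustrative document size
session.insert("new-1")
after_first = session.processed    # 865 rebuilt + 1 inserted
session.insert("new-2")
print(after_first, session.processed - after_first)
```

In this model the first insertion costs work proportional to the document size, while the second costs one unit, which is why a debug ID for the second inserted citation is the informative one.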
  • edited May 10, 2013
    Word:mac 2011 (14.3.2)

    EDIT: I will provide a debug ID as proposed by Simon once I have assembled the thirty documents into one big file.
  • Re the "need for speed": While splitting up large documents is often recommended as a cure-all for everything wrong with Word, it can also be impractical (think global search and replace), causes additional work (assembly of individual files) or even problems (think headers and footers, changing styles[1]) or simply feels "wrong" to some people. I think software should conform to users' expectations (as well as possible) and not force users to adopt strategies to cope with its shortcomings (although it seems that MS Word is as much to blame here as Zotero). So I'm glad to hear that we may see progress in that area.

    Re broken links: To say that it's the user's fault is certainly a correct assessment of the problem (in my case, at least). Unfortunately, it's not a solution. With your help, I was able to avert disaster. But seeing how it *can* go wrong, I feel confident saying that--with Zotero's growing user base--it *will* go wrong again. And when it does, you'll have some pretty miserable Zotero users. That's why I'd love to have some kind of mechanism to check/re-assign/delete links.

    Again, thank you all for your hard work and kind help.

    [1] Yes, I know you should do that in the template file.
  • In the past I thought that the reference style could in-/decrease the speed of Zotero. So I use a short reference style when working in a large document and can change this when finished. I recognize that this thought is very subjective. Have no idea if this really works.
  • Generally author-date styles will be fastest since they require the least amount of updating of other citations.
  • Yes, true! This is what I mean.
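The point about author-date styles can be illustrated with a toy model in Python. It is a deliberately simplified sketch: the two render functions are hypothetical stand-ins for real CSL styles, the note style only distinguishes full, short and ibid forms, and the author-date version ignores year-suffix disambiguation, which can also force updates in practice. The model counts how many existing citations change their rendered form when one new citation is inserted.

```python
def render_note(keys):
    """Toy note style: full on first use, ibid if repeated at once, else short."""
    seen, out, prev = set(), [], None
    for k in keys:
        if k == prev:
            out.append(f"{k}:ibid")
        elif k in seen:
            out.append(f"{k}:short")
        else:
            out.append(f"{k}:full")
        seen.add(k)
        prev = k
    return out


def render_author_date(keys):
    """Toy author-date style: the form never depends on neighbouring citations."""
    return [f"({k})" for k in keys]


def others_changed(render, keys, index, new_key):
    """Count existing citations whose rendering changes after one insertion."""
    before = render(keys)
    after = render(keys[:index] + [new_key] + keys[index:])
    after_others = after[:index] + after[index + 1:]
    return sum(b != a for b, a in zip(before, after_others))


keys = ["smith", "smith", "jones", "smith"]   # hypothetical citation sequence
print(others_changed(render_note, keys, 1, "jones"))
print(others_changed(render_author_date, keys, 1, "jones"))
```

In this example the insertion forces two other note-style citations to be re-rendered and none of the author-date ones.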
  • Dear fbennet and Simon,

    I will echo tillheilmann's praise for Zotero. I love it. I have been using it for years now and just submitted my first large manuscript using it. I had been keeping each chapter separate, so I hadn't noticed any significant slowness when inserting or editing a citation. But when I put it all together, I did notice it took at least 45 seconds to a minute when I edited or inserted a citation. But I only had to do this once or twice.

    But now I have the full manuscript in one document, and I would like to avoid breaking it up again (reviewers will put in comments, etc., so keeping it as one is a better option now).

    I am also confused:

    fbennet wrote:

    "the algorithm currently used for document updates reconstructs the registry from scratch for each inserted citation, which creates a significant burden on large documents."

    I get this. This is what I experience.

    But then Simon replied:

    "We currently reconstruct the registry from scratch the first time a document is used within a session and when pressing the "Refresh" button, which, as Frank says, we should be able to speed up in a future Zotero release. When adding/editing citations, we only call processCitationCluster once, which should be fast."

    So, am I to understand that the slow processing will only happen the first time the document is used in my session? But then after that, it will be fast again? If that is the case, great.

    Or is it that EACH CITATION prompts a reconstruction of the entire registry?

    I have a bit of lag time as the manuscript gets reviewed, so I can play with it a bit before I have to do serious edits.

    What do you all recommend -- that I break it up again (I guess pull it apart, save each chapter, edit them, then put it all back together again, which entails doing all the Word coding again); or should I work from the large document?

    Thanks for everything,

    Dan
  • Dear fbennet and Simon,

    OK, so I just did a test.

    It seems that it has to rewrite the registry every time I edit or insert a new citation. I am able to type in other parts of the document while it rewrites the registry, but I can kind of tell it is working in the background as the cursor flickers a bit every now and then.

    So, I think my best bet is to break it all up again when I have to edit the manuscript.

    I read this (fbennet):

    "The processor has been quite stable of late, however, and in time we may see moves to make use of its update logic, if Simon and Dan conclude that there are advantages to doing so -- as adamsmith notes, there are fixed burdens with reading and writing to a large document through the APIs available to Zotero that may swamp potential speed gains."

    If possible, I would prefer that the whole registry is not rewritten each time a citation is edited or inserted. But rather, perhaps we could manually hit "refresh" to do that. That way we can work in large documents and then every 10-20 minutes hit refresh and get an update.

    I know there are probably other concerns, but I, too, would like to be able to work with a large document. This is exactly where Zotero is so powerful.

    With all that said, hitting "Create Bibliography" and having it create the bibliography was amazing. Zotero, even if we have to continue to split up large documents, is a huge time saver.

    Thanks,

    Dan
  • Hi once again,

    I am still writing about large documents here, so the last two posts before this will give the context.

    So, I have a debug ID.

    The Debug ID is D1791349867

    I am using Zotero Standalone with Word 2007. I am on a PC.

    I went in and edited the first citation (which called up the Style Preference Box), chose my Style, clicked ok.

    So, that was the FIRST change.

    I think Simon was saying that after this FIRST change, all other changes should not rewrite the whole registry.

    So, I did a SECOND change (edited a citation - it did not call up the Style Preference Box). As you can see, it ran 4180 lines and took a good 25 seconds to complete.

    If that is normal, then so be it, I will split up my document. If it is not supposed to rewrite the entire registry on the SECOND change, then maybe there is a bug?

    It may be normal, if so, ignore all this. If not, then the debug ID might show what is going on.

    Thanks,

    Dan
  • edited July 9, 2013
    The number of lines of debug output isn't a good measure of what's actually happening, since in that debug output there's less than 1 ms elapsed between most of the lines. The registry is not actually being updated there, and I'm not sure what's taking 25 seconds. In the debug log, I see:
    (3)(+0002082): Integration: Retrieved 865 fields in 2.082; 415.4658981748319 fields/second
    (3)(+0000000): Integration: Updated session data for 865 fields in 0.388; 2229.381443298969 fields/second
    After this, it takes a few hundred milliseconds to generate the citation, and then it generates the bibliography. I'm left wondering whether the remaining 22 seconds are all in bibliography generation, since there wouldn't be a message when that's complete. Are things faster if you remove the bibliography?
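The arithmetic behind this reading of the log can be checked with a short Python sketch. The field count and the two timings are taken from the debug lines quoted above; the 25-second total is the approximate figure reported earlier in the thread, so the remainder is only a rough estimate.

```python
fields = 865
retrieve_s = 2.082         # "Retrieved 865 fields in 2.082"
update_s = 0.388           # "Updated session data for 865 fields in 0.388"
observed_total_s = 25.0    # approximate total reported by the user

retrieve_rate = fields / retrieve_s    # fields per second while retrieving
update_rate = fields / update_s        # fields per second while updating
accounted_s = retrieve_s + update_s    # time explained by the two log lines
unaccounted_s = observed_total_s - accounted_s

print(f"retrieve: {retrieve_rate:.1f} fields/s")
print(f"update:   {update_rate:.1f} fields/s")
print(f"accounted for: {accounted_s:.2f} s, unaccounted: {unaccounted_s:.2f} s")
```

The two logged steps explain only about 2.5 seconds, which leaves roughly 22 seconds that the log does not attribute to anything.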
  • Thanks for the reply Simon.

    I deleted the extensive bibliography, and it still takes time.

    As I look at the box you pasted, yes, it takes about 2 seconds for the insert citation box to disappear. Then I can edit text, but I can't get into another Zotero citation for about 45 seconds.

    Could it be Word is repaginating or something?

    Or: when I insert a Zotero citation, I first create the footnote using the "Insert Footnote" from Word, THEN I insert the Zotero citation. Could that be an issue?

    I started doing this because when I use "Insert citation" in the main text, the cursor doesn't automatically go into the footnote below, and I have to scroll down to put it in, whereas when I do "Insert Footnote" using Word, the cursor goes right into the footnote text, then I hit "Insert citation."

    Could that be creating a problem?

    Again, thanks,

    d
  • Hi Simon,

    OK, so I did some tests. I think it might be Word repaginating or something.

    I think I understand what Zotero does.

    When I insert a NEW citation, it takes the 25-30 seconds for the citation to appear. Then the cursor flickers for another 25-30 seconds (which I think is Word doing something).

    When I edit an already created citation, it takes the 2-3 seconds for the citation to update, then the cursor flickers again for another 25-30 seconds.

    So, I think it is Word that is causing the delay.

    I wonder if I had not done "Insert Footnote" using Word THEN "Insert Citation" using Zotero (that is, inserting the citation within the Footnote), but rather just used "Insert Citation" if this would not have happened?

    But I think I am stuck with this for now, as I did ALL my citations the first way.

    Again, thanks for your help.

    take care,

    Dan
  • Hi Simon,

    I write this so others that have questions might get them answered here.

    It is Word. It repaginates every time I work within a footnote.

    I can turn off repagination only in Draft view.

    OK, so that is the problem.

    Thanks for the help.

    d
  • Thanks for the update. This topic is probably deserving of some more comprehensive documentation.
  • @dmichon: Thanks for your patient and careful followup.

    @Simon + @dmichon: Sorry about the confusion caused by my comment above. I've amended it with a note to look down-thread for better information from Simon.