Merge - auto function?

How about a right-click option on the Duplicate Items folder - 'Merge all'?

It's just that I've got a couple of thousand items in Duplicates (how?? I can't be _that_ stupid) and I'm just click-click-clicking through them. Some are of different types which need reassigning, so they can't be merged automatically, but the others - well, that's what computers are for, isn't it?
  • (Having a very large number of duplicates would most likely happen if you at some point imported a large file of references (e.g., an export from Web of Science or from another tool like EndNote or Mendeley) and either imported the file multiple times or had the same items in several different imported files. It's also not implausible that you've slowly accumulated a fair number of duplicates if you've been using Zotero for years and never merged them before.)

    The main reason there isn't a Merge All function at the moment is that merging can lead to data loss if the two versions of an item have different content in their fields (e.g., one has only author initials rather than full names). Such conflicts require human input to choose which version of each field to keep. That said, it would be nice if there were at least an option to merge all truly identical items.
  • I've noticed the options when merging items with differing entries, of course, but I'd assumed that in general Merge kept as much as possible from both/all items - in which case losing data shouldn't be a problem?
  • That's not how merge currently works, no -- it treats one item as the master and you have to manually change what you accept from the other one. I'm also not sure that "more" is always "better," so that approach might not work even if it were implemented.
  • But it keeps all those tags I’ve added, I hope?
  • It keeps tags and notes from all merged items, yes.
  • edited November 5, 2017
    (And also preserves all attachments from all merged items.)
  • All that said, I do wonder if an auto merge option -- perhaps with a clear warning pop-up -- wouldn't be a good addition.

    There are essentially two flavors this could take:
    1. Merging all duplicates. The most intuitive option, but this has the particular problem of false positives, which currently can't be marked as such -- and merging items that are not actually identical can be quite the disaster, leading e.g. to missing items in the database and potentially to the wrong item being cited in a document (because Zotero cites the merged item). So yes, very convenient, but the risk here is quite high.

    2. Merging all _selected_ duplicates. You have 100 items that you _know_ are duplicates and you want to merge. Being able to do so with 1 click instead of 100 seems convenient, but this is tricky from a UX point of view -- currently clicking "Merge" with multiple items selected merges them all into one item. This is different and possibly confusing. Whether that one is worth the trouble is probably a question of how often such a function would be helpful.

    I guess I'm not sure about this -- I only get manually added duplicates and those are easily enough managed. The target group for the auto option would seem to be people doing systematic reviews etc., so bwiernik, you may have a clearer sense of what would be helpful.
  • I don’t think (1) is a good idea for exactly the reasons you mention. Even if there were a way to mark non-duplicates, I worry that many users would not realize that was possible or that this was a recommended step, leading to bad data and consequences.

    (2) would be great, particularly for systematic review merging. For this to work, I think there would need to be a check-mark column in the duplicates view that would let users check which duplicate sets to apply merging to (one check box per duplicate set). (Potential duplicates would also have to always be shown together, unlike now.) There should be a "select all visible" check box in the top row, with a pop-up warning the user to verify that all selected items are truly duplicates (with an option to not show it again). This would certainly save me a bunch of time.
  • I appreciate that merging the wrong items could be catastrophic - an Unmerge option might help here, even if it would involve more bandwidth.

    And a warning, with an explanation of what is about to happen, would be appropriate.
  • A popup window which lists the duplicate entries side by side could help. You could keep and edit one version of the duplicates on either the left or the right side of the window, and then keep the version you are satisfied with by clicking a button like "Keep this version".
  • I have hundreds if not thousands of duplicates. An auto-merge function (which could be optionally activated) would be very welcome!
  • edited May 24, 2018
    There are at least two problems with any automatic merge system for duplicate control. 1) Zotero's duplicate detection is quite (some say overly) sensitive -- it can flag merely similar items as duplicates (false positives). 2) Even true duplicates may not be exact duplicates, in that one may be missing metadata fields contained in the other, or one may contain better metadata than the other.

    It is important to assess the quality of each item before you merge them.

    Also, carefully read the November 5th comments by @adamsmith and @bwiernik above. Improper merging can lead to disaster.

    One more thing: it is the opposite of helpful to ask the same thing in multiple threads. The developers and key volunteers read them all. I (far from a developer) answered you in one of the threads, but other people with concerns similar to yours may not find my reply because they don't look at all the relevant threads.
  • It seems to me that the primary argument against automatically merging potential duplicates (or against not adding duplicates to a library in the first place) is based on edge cases. In reality, I don't think there are very many occasions on which an algorithm would not outperform me as I click through merge duplicates. The benefit of usability in 90% of use cases seems to far outweigh the negligible benefit the rare user obtains from curating a pure and perfect database of entries. How many users legitimately go through their duplicates individually, checking them against the authority of the publisher's website to make sure everything is aligned?

    In my experience, Zotero serves primarily as a way to organize and access papers across computers and devices. In the end, I will only cite a small fraction of the entries that go into Zotero. When they are cited, I simply review the citations for errors, which needs to be performed regardless of the purity of the database.

    The daily usability of the product would be improved tremendously by automatic merging (perhaps based on a slightly different algorithm). Any edge cases identified could then be brought to the attention of the developers for possible improvement of the algorithm. There would of course continue to be some errors or inconsistencies, but they would be a fraction of the current number of potential duplicates.
  • As I say above, I think having some option to merge automatically would be nice, but the obstacle isn't "edge cases" -- as long as there are a decent number of false positives in duplicate detection (and you can look through the many threads asking for a feature to mark items as non-duplicate as evidence that there are), auto merge is just incredibly risky:

    It would mean items inexplicably disappearing from people's databases (because they're merged with the wrong item) and, in a worst-case scenario, the wrong reference ending up in an article/manuscript (because the correct one was incorrectly merged into a different one). So the trade-off isn't between convenience and a pristine database, but between convenience and lost data plus a potentially broken scientific record, which I hope you'll agree should be weighed quite heavily.

  • I see your points, and I see the threads complaining about false positives, but as a user my reading of those threads is actually different. I don't think the number of threads requesting non-duplicate marking is representative of the total number of false positives identified; I think it is representative of the total number of user interactions with a small number of false positives. The whole reason false positives are annoying to me is not that they are common, but that I spend so much time de-duplicating true positives that I am constantly forced to interact with the rare false positives.

    If I perform a systematic review search and want to add 500 references to Zotero, I have no way of knowing whether some references are already in Zotero without leaving the webpage, individually searching for each reference in Zotero, then returning to the webpage and adding it. That is 500 individual database searches, which is essentially time-prohibitive.

    The only alternative solution I have at present is to bring all references into Zotero, where I typically end up with >10% true duplicates (i.e. >50 duplicate sets and >100 individual items listed as duplicates). But because I now have to review hundreds of true positives, I also have to interact with all the old false positives that remain listed as duplicates (due to the separate issue of not being able to mark false positives). The number of false positives is still less than 1% of the entire database, but I have to interact with them every time I add any batch of references to Zotero, which is on a daily basis. Yes, there is an issue with marking false positives, but it primarily comes up because of the underlying workflow issue, which is orders of magnitude larger in my experience.

    There are alternatives to address (but not solve) the issue that provide users with greater autonomy (and thus a better user experience):
    1. Allow the user to specify whether they want to use auto-merge (this could even be a hidden setting).
    2. Allow the user to specify how they want to auto-merge by giving them control over the fields to merge on (again, this could be a hidden setting).
    3. Assuming equal weighting, allow the user to specify a threshold for matching -- e.g. more than 5/8 fields matching triggers an auto-merge (see the sketch after this list).
    4. Use a lower threshold for merging potential duplicates that come from the same web source or database.
    5. Have the Zotero connector search the Zotero database to see if there is a high probability that the reference is already there. If there is a high-probability match, display this information to the user before they add the reference to the database (see, for example, Sente, Paperpile, etc.).
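
    As an illustration of (3), here is a rough sketch of the kind of field-overlap check I mean (plain JavaScript, not Zotero's actual duplicate detector; the field names and the threshold are just examples):

    // Score two items by the fraction of compared fields that match exactly,
    // and treat them as auto-mergeable above a user-chosen threshold.
    const FIELDS = ['title', 'DOI', 'year', 'volume', 'pages', 'publicationTitle', 'ISSN'];
    const THRESHOLD = 5 / 8;  // "more than 5/8 fields", as suggested above

    function fieldMatchScore(a, b) {
        let compared = 0, matched = 0;
        for (const f of FIELDS) {
            if (!a[f] && !b[f]) continue;  // skip fields that are empty in both items
            compared++;
            if (String(a[f] || '').trim().toLowerCase() === String(b[f] || '').trim().toLowerCase()) {
                matched++;
            }
        }
        return compared ? matched / compared : 0;
    }

    const autoMergeable = (a, b) => fieldMatchScore(a, b) > THRESHOLD;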

    Thanks for any consideration you give this!

  • Your concerns are certainly valid, but I think you are making a bigger deal of the inconvenience than is warranted. I’ve done a dozen systematic reviews with Zotero, often with 5000-8000 initial items. Merging duplicates after running the searches typically has taken me around 10-15 minutes at most.
  • Of the options you suggest, 5 is a definite yes and the only reason that's not implemented yet is technical (ticket is here: https://github.com/zotero/zotero/issues/1007). I think 2-4 are too complex and likely to cause as much trouble as they do good.

    Some version of 1 makes sense, but our risk/benefit analysis differs there, likely because your focus is on the systematic review scenario.

    I'd personally rather tell 1000 researchers that they'll need to do some more manual work than I'd tell 1 researcher that Zotero, automatically and unbeknownst to them, switched out a reference in one of their papers.
  • edited February 23, 2019
    I want to contribute to this discussion, to talk about my user experience as a relatively recent user of Zotero, having moved over from another reference management program.

    It was inexplicable to me that Zotero required me to manually merge identical duplicates.

    When I viewed these duplicates, every single field was identical across the two records except the date-time when it was created in Zotero. (Incidentally, note that you have to already know how Zotero displays non-identical duplicates in order to tell that what you are looking at is an identical duplicate.) There is absolutely no human intelligence to be applied to the question of whether two identical records should be merged. By definition, the content in the fields is identical. If I have previously referenced one in a manuscript, then it would have been equally good if I had referenced the other. The only rational decision is to click Merge.

    This case—records where every single content field is identical—is absolutely crying out for automatic merge. There is *zero* reason why a human decision should be made here (a sketch of what "identical" could mean in practice is at the end of this comment). Provided enough fields are completed that the record does represent an actual reference, there is no case where merging such records could present a problem. If there are PDFs attached to each and they are not identical, then just attach both—let the user decide which to keep when they view them. If necessary, change the underlying code so that Zotero supports multiple attachments, if it doesn't already.

    There are plenty of ways a user can end up with identical records. Key among them, which is what happened in my case, is importing multiple previous databases -- in my case from other referencing software -- that were project-specific but referenced overlapping literature. In terms of UX feedback, let me tell you I was gobsmacked that Zotero imported 100% identical records and just linked them together as likely duplicates, but failed to merge them as, I suggest, any user would have expected. This has nothing to do with edge-case risks of what might happen if nearly identical records were merged. That discussion is irrelevant to a decision about merging identical records.

    The entire, defining feature of reference management software is to reduce manual drudgery around referencing for people writing academic documents. Fifteen minutes a human spends merging records that are algorithmically provably identical is fifteen minutes that should never have had to be wasted. My $0.02: automatically merging demonstrably identical records is squarely within Zotero's reason for existing.
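
    To make "identical" concrete, here is a rough sketch (plain JavaScript, not Zotero code; the ignored-field names are assumptions) of the test I have in mind: compare every content field and ignore only the bookkeeping metadata.

    // True if two exported records agree on every content field.
    // dateAdded/dateModified (and the item key/version) are bookkeeping, not bibliographic content.
    const IGNORED = new Set(['dateAdded', 'dateModified', 'key', 'version']);

    function isExactDuplicate(a, b) {
        const fields = new Set([...Object.keys(a), ...Object.keys(b)].filter(f => !IGNORED.has(f)));
        for (const f of fields) {
            if (JSON.stringify(a[f] ?? '') !== JSON.stringify(b[f] ?? '')) {
                return false;  // any differing content field means not identical
            }
        }
        return true;  // safe to merge without human input
    }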
  • I totally agree with babbage

    When putting together multiple bibliographic databases we HAVE TO remove duplicates in an automatic manner.
    Zotero should provide an option for doing this automatically. Obviously there is no optimal way to do it, and some mistakes will occur. But THIS IS THE ONLY WAY!

    Zotero should provide some choices and people will choose depending on what they want. Among the obvious choices Zotero could provide are:
    1) keep only the more recent record
    2) keep the biggest record (+ PDF if attached)
    3) keep the one that has more fields filled, or simply fill its empty fields from the others ...
    etc.


    It is inexplicable to me that Zotero does not provide such a tool!!!
  • I agree as well. Migrating to Zotero from other systems required me to merge disparate databases (BibTeX, EndNote, ReadCube) with redundancies, and with 1000+ duplicates it's insane to ask me to merge by hand. Tell me it's a risky, irreversible decision, but give me a workflow to do it.
  • I second AlanGrossfield's comment. I am trying to migrate from ReadCube, whose export functions yield different, incomplete outputs.
    Exporting a BibTeX file from the web app yields:

    @article{charness2016,
    title = {{The effect of charitable giving on workers’ performance: Experimental evidence}},
    author = {Charness, Gary and Cobo-Reyes, Ramón and Sánchez, Ángela},
    journal = {Journal of Economic Behavior \& Organization},
    abstract = {We investigate how ... not paying anything at all.},
    volume = {131},
    pages = {61--74},
    issn = {0167-2681},
    eissn = {1879-1751},
    doi = {10.1016/j.jebo.2016.08.009},
    year = {2016},
    rating = {3},
    keywords = {\#ExperimentalEconomics,\#BehavioralEconomics,\#RET,\#RealEffortTask,\#RealEffortExperiment,\#LabExperiment,\#Incentives,\#Charities}
    }

    while exporting a BibTeX file from the desktop app yields:

    @article{Charness_The_2016,
    author={Charness, Gary and {Cobo-Reyes,} Ram{\'o}n and S{\'a}nchez, {\'A}ngela},
    pages={61-74},
    abstract = {We investigate how ... not paying anything at all.},
    title={The effect of charitable giving on workers’ performance: Experimental evidence},
    doi={10.1016/j.jebo.2016.08.009},
    issn={0167-2681},
    volume={131},
    year={2016},
    file={/Users/waloszec/Documents/ReadCube Media/Charness et al-2016-J Econ Behav Organ.pdf}
    }

    To summarize:
    - neither of the two includes the following ReadCube features: "flags", "color labels", and "reading status (read/unread)"
    - the export from the web app includes "rating" and "keywords", while the export from the desktop app includes "file".
    To move as much information as possible from ReadCube to Zotero I would have to import both BibTeX versions, which leaves me with a duplicate pair for each article. Being able to merge both versions easily and quickly would be ideal (not to mention that the same article gets imported several times because it is part of different compilations ...).

    The "rating" given by ReadCube appears in the "Extra" field in Zotero as
    "ZSCC: NoCitationData[s0]
    Citation Key: charness2016
    bibtex*[eissn=1879-1751;rating=3]"
    – which is not very helpful.

    Can anyone recommend a workaround to transfer ReadCube's "flags", "color labels", and "reading status" to Zotero, as they are not included in the BibTeX or RIS export from ReadCube?

    Many thanks in advance!
  • Where did you get the ReadCube desktop app? I can't find where to download it.

    Does the desktop app really export {Cobo-Reyes,} with the comma inside the braces? That's pretty bizarre.
  • There used to be a desktop app for ReadCube, which they apparently took off the market for some time to focus on the web app.

    This is from the desktop export as bibtex:
    author={Charness, Gary and {Cobo-Reyes,} Ram{\'o}n and S{\'a}nchez, {\'A}ngela},
    so just as you said.
  • Bizarre.

    Without the desktop app there's not much I can do; it's possible to get the raw JSON data from the web app (although they seem to have removed the easy menu entry to do it) but I don't see a way to easily get the collections and attachments.

    That's the nice thing about web apps for those who provide them. By default, your tenants can't move.
  • I use Zotero desktop and I merge the duplicates by sorting on the Notes column and just clicking the merge button over and over. Not automatic, but I can live with mechanical. (Sorting by Title or Creator will throw off the auto-highlighting from time to time.)
  • This is absolutely awful. Hundreds of duplicates. After an hour, I'm still in the papers that start with "A something something".

    Maybe I can start over from scratch? Reimport all my pdfs, and somehow sync the tags (extracted from Mendeley) to the files without making duplicate entries? Any thoughts or hints on this?
  • Hi ckemere, I had thousands (I was doing a systematic review), and figuring out the mechanics made the task somewhat bearable (~two days). You know that if you click on one entry, its duplicate(s) will be highlighted as well. The sorting I mentioned above allowed automatic highlighting and I just had to keep clicking the merge button. (The highlighting will slip every once in a while, though.)

    Programmatically, this design makes sense to me. Zotero is a great piece of software and the community is awesome!
  • edited April 28, 2020
    I had to recompile from source to allow the newer entries to take precedence over the older ones when merging. Now I'm into the B's. The problem I find is that Mendeley/PubMed often classified conference papers as journal articles for some reason, so I still have to do some manual clicking to normalize those entries.

    I wish someone with a bit more JavaScript knowledge would post a hack that, e.g., presses the merge button repeatedly until it can't be pressed any more.
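
    A rough, untested sketch of such a hack, for Tools → Developer → Run JavaScript (run as an async function, with the Duplicate Items view open). The element id, the selection call, and the delay helper are assumptions that may differ between Zotero versions, so back up your database and try it on a few items first:

    // Repeatedly select the top duplicate set and press the Merge button.
    var win = Zotero.getMainWindow();
    var pane = win.ZoteroPane;
    var mergeButton = win.document.getElementById('zotero-duplicates-merge-button');  // assumed element id
    for (let i = 0; i < 200; i++) {                 // hard cap as a safety limit
        try {
            pane.itemsView.selection.select(0);     // select the first row (a duplicate set)
            await Zotero.Promise.delay(500);        // let the merge pane load
            mergeButton.click();                    // same as pressing "Merge N items"
            await Zotero.Promise.delay(500);        // let the duplicates list refresh
        } catch (e) {
            break;                                  // stop when there is nothing left to merge
        }
    }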
  • (For the record, I have about 4000. It's still awful.)