Duplicate detection?

fbennett · February 6, 2011

I put in some time working on duplicates detection and management in the multilingual client this weekend, and I'm happy to report that it now appears to be fast and serviceable. It uses an algorithm similar to that currently implemented in the hidden option code in the trunk, but the scan runs entirely in SQL. My first cut at this last summer was also done in SQL, but the code was triggering a table scan that slowed things down tremendously. I can't say that I've learned much about SQL in the interim, but this morning I got lucky. For small numbers of target entries to be checked, it now returns almost immediately, even with significant data sets (testing was done against a database with 1200 entries).

The multilingual branch is an experiment with live code, and shouldn't be installed lightly; but I'm pretty chuffed that this has worked out so well. If anyone feels inclined to set up a separate Firefox profile for testing and take it for a spin, the relevant warnings (which should be taken seriously) can be found here, along with links to the project overview and the installer. If you do give it a try, please post feedback, requests for guidance, and rotten tomatoes back to this thread.

migugg · February 10, 2011

brilliant! thanks for all this work. Hope it will soon end up as part of zotero proper.
I tried it. Installing was no problem on ff3.6, mac OSX 3.5.5.
I first downloaded some duplicates on purpose from both worldcat and google books (i.e. books that I already had in my library).
Worked perfectly, they were all returned as yellow.

EXCEPT: I usually have the library sorted by author name. After downloading the duplicates, the library was sorted by a logic that I cannot decipher, but it still keeps the author column highlighted, which would indicate that it is sorted by author. That must be a bug. See screenshots here:

https://www.wuala.com/migugg/zotero?key=XvUeHGxhSb4x

Also, when I then ctrl-click on a yellow-highlighted item, the option "duplicate:mark item as new" is greyed out, but I assume it should not be. However, it was not greyed out on any item that was not highlighted yellow as a duplicate. Or did I misunderstand something here? (see other screenshot)

My library is more than 6000 items. Running duplicate detection by ctrl-clicking on the library was no problem, and it returned results almost immediately (maybe 2 secs).
However, the only two duplicates it detected that I did not add on purpose were false duplicates (see screenshots). One of them, by the author "Kochan", is even the sole item by this author (I find this strange, because I assume the algorithm searches first for author names, but I may be mistaken on this).

Even more confusing: after selecting "mark item as new" on an item that has no duplicate ("Rottenburg 2008"), the duplicates view adds several items (Hisch/Berg/Stolze) that it detects as duplicates of that item. I find this confusing. It should show these items from the beginning and somehow indicate that they are supposed duplicates of Rottenburg (i.e. it should show them below the suspect entry, and not in alphabetical order, even if this is the chosen sort order; duplicates should override the sort order, otherwise it is extremely confusing when there are lots of duplicates).

When I then select "use as master" the alleged duplicates disappear from the duplicates view, but they are not deleted from the library.
see screenshot "duplicate still in library" that was taken after the above procedure.

I am not sure whether these are bugs or whether I am not following the intended procedure. If the latter, then I must say the UI is not very intuitive.

I would suggest that when running "duplicates view" all duplicates including their alleged masters are shown. (Then the user should simply decide on his own what to do with these. )
(In my view, the algorithm catches too many entries. It would be better if it only caught those with identical authors.)

I hope this helps.
best
migugg

klobato · February 14, 2011

Hi,
So how do I go about doing a search for duplicates?

fbennett · February 14, 2011

@migugg: Thanks for this feedback. It no doubt adds to the confusion that I haven't produced anything like a manual for this yet. :)

One point that might help clear things up a little is that all newly imported items are marked yellow (yellow = possible duplicate). The duplicates view then compares these items (only) with other content in the database.

The default scan is a fuzzy search against titles. If the list is not sorted by title, you're right, it should be forced to title as the sort key; a list sorted by author would not be useful. Better still would be a sort that places potential duplicates close to one another. I'll think on this.
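For readers trying to picture what a fuzzy title scan does, here is a rough Python sketch. It is not the code in the multilingual client (that scan runs in SQL); the function names, the use of `difflib`, and the 0.9 threshold are all assumptions chosen for illustration. It compares only the newly marked titles against the rest of the library, as described above.

```python
from difflib import SequenceMatcher


def normalize(title):
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(title.lower().split())


def title_matches(new_titles, library_titles, threshold=0.9):
    # Hypothetical helper: return (new, existing) pairs whose normalized
    # titles have a similarity ratio at or above the threshold.
    matches = []
    for new in new_titles:
        for existing in library_titles:
            ratio = SequenceMatcher(
                None, normalize(new), normalize(existing)
            ).ratio()
            if ratio >= threshold:
                matches.append((new, existing))
    return matches
```

A title-only comparison like this will pair any two items with near-identical titles, whoever the authors are, which is why the sort order and the choice of heuristic both matter.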

When I get some time and other things slow down, I'll take a look at the setup again, and use your comments for usability checks. Again, this is very helpful stuff.

fbennett · February 14, 2011

@migugg: Actually, looking at your screenshots, I think you might have missed something (not surprising, given the complete lack of documentation). After selecting "duplicates view", the context menus for individual colored items will change, offering the possibility of merging yellow items into non-colored partners, if present. Red items in the duplicates view are detected (possible) duplicates that have no non-new partner to which they can be merged. In this case, you need to choose one of the items and set it as the "master". The partners will then turn yellow, and you can merge them into the master individually after moving or deleting any attachments they have.

The key point is that this is not intended to be totally automatic; the color-coding shows the system's guess about duplicate partners, and the user deals with them via the context menus.

The only thing that can be done automatically in a batch is to clear green items in the duplicates view: these are new (originally yellow) items that have no apparent duplicate partners, and can therefore be confirmed as unique. Once they are cleared, they will not be checked again unless a new entry is added to the database that appears to be a duplicate partner to them. In that case, they would appear with the partner in the "duplicates view", and the new partner's context menu would offer the possibility of merging to the older "master" item.
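The color rules above can be summarized in a small Python sketch. This is only my reading of the description in this post, not the client's actual logic; the function name and its arguments are hypothetical.

```python
from typing import List


def classify(is_new: bool, partners_are_new: List[bool]) -> str:
    # is_new: whether the item is flagged as new (to be checked).
    # partners_are_new: one flag per detected duplicate partner,
    # True if that partner is itself flagged as new.
    if not is_new:
        return "white"   # existing item; can act as a master
    if not partners_are_new:
        return "green"   # new, no apparent partners; can be cleared in a batch
    if not all(partners_are_new):
        return "yellow"  # new, with a non-new partner to merge into
    return "red"         # new, but every partner is also new; pick a master


def clearable(colors: List[str]) -> List[int]:
    # Indexes of items that could be confirmed as unique in one batch.
    return [i for i, color in enumerate(colors) if color == "green"]
```

Once one of a red group is set as the master it stops being "new", so its partners would then classify as yellow under this sketch.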

klobato · February 14, 2011

Hi,
I have not fully followed but am happy to hear somebody is working hard on this. Given the integration of pdf files into the user library I think it is important that there should be a merging option.
Also I get a feeling that whilst some people like to use zotero for research papers based on other research papers, others may want to use it in a broader context where snapshots of webpages are used. This can be a very interesting tool for people who teach media or languages (such as my future wife) or I can imagine it being useful for people who research about and on the web. However this may cause a bit of a contradiction in terms, as scientific research papers tend to follow pretty rigid rules as to what you can and should reference. Maybe there should be some sort of preference setting to allow for the library to work in one way or the other?
All the best and I'll keep trying at Zotero and see if it works for me and my students.

migugg · February 15, 2011

klobato: it is not clear how this relates to duplicate detection. If it does, explain how. If it does not, open a new thread. But as far as I understand, what you hint at is not a problem at all. People can save whatever they want in zotero, and they can cite with any style that is available or create their own style if needed.

migugg · February 15, 2011

fbennett: thanks very much, indeed I misunderstood this. I will have a look again. But one point: why do you only do duplicate detection based on the title? I think using only the title ends up with far too many duplicates. Just think of people who work a lot with book reviews. Reviews of the same book inevitably have the same title. I would actually have assumed that it should use a combination of author and title.
And many thanks for the great work! Looking forward to using this on a regular basis.

fbennett · February 15, 2011

Using the title is just a placeholder function, based on the original heuristic in the "hidden option" duplicates detection code provided by the core developers. Simon has mentioned a number of more sophisticated possibilities that could be tied into the same framework.

pedrobrasil · March 3, 2011

Adding some thoughts to this duplicate detection discussion. It has come to my attention that, although not yet available in a stable version, there is a tool to avoid importing duplicates. Recently I have been planning a systematic review. For those not familiar with systematic reviews, it is necessary to run the same search strategy in remote databases such as EMBASE, MEDLINE, SCOPUS, ISIWeb etc. It is not hard to imagine that, done this way, duplicates will arise from the different databases. Although it is not that important, systematic reviews usually report a flow diagram of how many references were found in each database and how many were included/excluded under each inclusion/exclusion criterion. Thus, it would be nice to know how many duplicates were excluded at the end. If Zotero avoids importing duplicated references, I would not be able to know how many there were. Is it possible to make this "stop importing duplicates" tool optional? Or could anyone think of a workaround for this issue?

cumuluss · March 4, 2011

Dear all who developed the "Multilingual Zotero with Duplicates Detection". It is great work. I tested it a bit and it works really well, but unfortunately only on small libraries. I also tested it on one with 3000 entries, and the duplicates view was completely empty even though there are some duplicates. Is that known?

fbennett · March 4, 2011

Did you mark any entries to be checked as "new" as shown in the screencast? If no items are flagged as new (yellow), you'll always get an empty view. The code was tested with a library of 1200 entries before release, selecting all items in the library and running "Duplicates view". Processing the view took 5 minutes or so on an Atom laptop (that is, a very slow machine), but duplicates were returned. The size of library shouldn't have any effect on the number of matches; it will just take longer to process.

cumuluss · March 4, 2011

Hi Frank, thanks for your answer. I did not notice that it applies only to new ones. I tried it again and then it worked. It actually took a while in spite of a faster computer. But then came an error message, and a report ID was generated (Report ID 27242597). I hope it is correct to post it here.
Thanks C.

fbennett · March 4, 2011

That's the right thing to do with a Report ID generally speaking, but the multilingual version is my own experimental thing at this point, and not supported by the core developers. I'm not sure what we do in this situation ...

adamsmith · March 4, 2011

couldn't Dan or Simon just paste the relevant section here? That's what they do for ajlyon with translators.

dstillman · March 4, 2011

There are no Zotero errors in there.

fbennett · March 4, 2011

On a separate note, selecting all items and opening Duplicates View is pretty demanding on the system. Since it must compare everything with everything, the number of operations required expands dramatically on a large library. The number of operations is (number of new items selected) x (number of items in library). With all 3,000 items selected in a 3,000-item library, that works out to 9 million comparisons. You can break the task into smaller pieces, though. Selecting 100 items for checking, say, will produce 300,000 comparisons, which is more manageable and will produce the same result for those items. That would allow you to spread the task over time.
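To make the arithmetic concrete, here is a small Python sketch. The helper names are made up for illustration and have nothing to do with the client code; it just multiplies the two counts and splits a selection into batches.

```python
def comparison_count(new_selected, library_size):
    # Each selected (new) item is compared against every item in the library.
    return new_selected * library_size


def batches(items, size):
    # Split a list of items into consecutive chunks of at most `size`.
    return [items[i:i + size] for i in range(0, len(items), size)]


library = list(range(3000))             # stand-in for 3,000 item IDs
all_at_once = comparison_count(len(library), len(library))
per_batch = comparison_count(100, len(library))
chunks = batches(library, 100)
```

Here `all_at_once` is 9,000,000 and `per_batch` is 300,000. Note that batching into 30 chunks of 100 does not reduce the total amount of work; it only lets you spread it over several sessions.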

You've got a Catch-22 snag, of course, because the only way to unmark items at the moment is through the Duplicates View. This is not a happy situation, and I'll look at introducing a context menu selection for unmarking items in the main listing. In the meantime, you'll have to start again with a fresh copy of your zotero.sqlite.

cumuluss · March 5, 2011

Thanks a lot. I will try the instructions and start with a new zotero.sqlite. And as you said it would be good to have a possibility to unmark items.

cumuluss · March 5, 2011

Update: I tried it. The duplicates view shows me the duplicates, but the option "use existing partner" is disabled for some of the duplicates, even when I first made one of them into a master item. If I duplicate the master item within Zotero, then it works for the resulting duplicate.

ajlyon · March 5, 2011

cumuluss's experience with not having the "use existing partner" option is something I ran into as well when testing the duplicates view.

fbennett · March 5, 2011

The "use existing partner" option should only appear on items that are color-coded in the Duplicates View. Items without a color highlight (white items) are masters, and cannot be merged. Are you failing to get the "use existing partner" selection on yellow color-coded items?

(The menus arrangement and naming conventions could use some attention, obviously.)

ajlyon · March 5, 2011

I can't exclude the possibility that I was just doing things wrong. I'll play with it some more and report back with more concrete, and therefore useful, feedback.

ajlyon · March 5, 2011

Also, the nature of my research makes real duplicates exceedingly rare, so I haven't had call to use the feature extensively.

cumuluss · March 5, 2011

Yes, just as you said: I'm failing to get the "use existing partner" selection on yellow color-coded items. Even if I make a red-coded item (one out of two) into a master, I cannot merge the second (now yellow) one into the master.

And a second thing: some of the new features of the Release Candidate 1 (2.1) have disappeared (extra item window, added view options in the item context menu).

fbennett · March 5, 2011

@cumuluss,

Thanks for the clarification. I'll look at it; post if you notice a pattern to the failure.

Re missing features of 2.1, thanks for pointing this out; some heavy code refactoring was needed to keep up with changes on the trunk, and some things seem to have gotten lost in the shuffle. I'll try to get these features back in there.

fbennett · March 5, 2011

Looking through the changelog for 2.1beta, I don't see a mention of an extra item window. When you get a chance, could you clarify that on this separate thread?

cumuluss · March 6, 2011

Hi Frank, I posted something on the other thread. On the first point: for now, I would say there is not really a pattern. It does not work if I mark some items within zotero as new, and sometimes also if I import some new items (that are duplicates) into zotero. But I will test it a bit more and let you know.

fbennett · March 6, 2011

The mystery of the missing "use existing partner" selection has been resolved. The UI for duplicates detection was written last summer, and when I found it rather slow, I set it aside, until dusting it off a month or so ago during the push to get multilingual even with 2.1. In the intervening months, I'm afraid I'd forgotten about some of my own design decisions.

The system won't delete an item that has file attachments on it. It's not a permanent state of affairs -- I'm sure if we put our heads together we can come up with much more elegant solutions -- but the idea is to avoid accidental data loss when tearing through a large number of duplicate items. I'm almost certain that this is what is "breaking" things for both of you. If you move the attachment across to the item into which you are merging by hand, the "use existing partner" selection should wake up.

Playing around with it, I see that the current behavior is inconsistent; only file attachments block a merge, but there could be equally important information in a note (in fact the risk there is if anything higher); yet you can merrily delete items with hundreds of notes attached to them, and the system won't say "boo". You can recover them from the trash, but after they are emptied from trash, it would be bye-bye to all that work product.
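The current rule can be put in a few lines of Python. This is only a sketch of the behavior as described here, with a made-up `Item` class and helper names, not the client's code: file attachments block a merge, notes do not, and moving the attachments to the master unblocks it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    # Hypothetical stand-in for a library item.
    title: str
    file_attachments: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def merge_allowed(duplicate: Item) -> bool:
    # Only file attachments block the merge; notes are NOT checked,
    # which is the inconsistency described above.
    return not duplicate.file_attachments


def move_attachments(duplicate: Item, master: Item) -> None:
    # Move the files across by hand so the merge option becomes available.
    master.file_attachments.extend(duplicate.file_attachments)
    duplicate.file_attachments.clear()
```

A more consistent check would presumably look at `notes` as well, but what the right behavior should be is exactly the open question.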

So be careful out there. :) Suggestions on what we might do with the merge selection interface would be very welcome. I have some ideas, but there are lots of possibilities, and I probably haven't thought of most of them.

cumuluss · March 6, 2011

Yes, you are right. This was the case. I deleted the attachment and then it worked. And of course it can be a problem, especially when several people work with the same library and they have their own annotations in the pdf or even notes.
My suggestion would be a simple one: just a warning message saying that you are about to delete some attachments, with the possibility to cancel the merge. Afterwards you can copy all the necessary attachments into the master and then merge again.
Or maybe, though it is not a really elegant option, copy all attachments into the master without a warning.

s marple · April 8, 2011

Hi,
I have been using zotero since 2004 maybe? A long time and numerous projects' worth at any rate. In general I think zotero rocks, but I need a duplicates solution. I have waited as long as possible, but I need to turn my final dissertation in this week. Can anyone help me? I am frankly too database language illiterate and too exhausted to follow much of the discussion presented above.
So my question is this:
(1) Is there a social-scientist-friendly solution? I am not completely code/programming phobic, just not so literate as to follow a dev-level discussion
OR
(2) do i need to manually go through my dissertation refs?

thank you for any answer and or support
//s