Duplicate detection?
The multilingual branch is an experiment with live code, and shouldn't be installed lightly; but I'm pretty chuffed that this has worked out so well. If anyone feels inclined to set up a separate Firefox profile for testing and take it for a spin, the relevant warnings (which should be taken seriously) can be found here, along with links to the project overview and the installer. If you do give it a try, please post feedback, requests for guidance, and rotten tomatoes back to this thread.
I tried it. Installing was no problem on Firefox 3.6, Mac OS X 3.5.5.
First I deliberately downloaded some duplicates from both WorldCat and Google Books (i.e. books that I already had in my library).
Worked perfectly, they were all returned as yellow.
EXCEPT: I usually sort the library by author name. After downloading the duplicates, the library was sorted by a logic that I cannot decipher, yet the author column is still highlighted, which would indicate that it is sorted by author. That must be a bug. See screenshots here:
https://www.wuala.com/migugg/zotero?key=XvUeHGxhSb4x
Also, when I then ctrl-click on a yellow-highlighted item, the option "duplicate:mark item as new" is greyed out, but I assume it should not be. However, it was not greyed out on any item that was not highlighted yellow as a duplicate. Or did I misunderstand something here? (See other screenshot.)
My library has more than 6000 items. Running duplicate detection by ctrl-clicking on the library was no problem, and it returned results almost immediately (maybe 2 secs).
However, the only two duplicates it detected that I did not add on purpose were false duplicates (see screenshots). One of them, by the author "Kochan", is even the sole item by this author (I find this strange, because I assume the algorithm searches first for author names, but I may be mistaken on this).
Even more confusing: after I selected "mark item as new" on an item that has no duplicate, "Rottenburg 2008", the duplicates view added several items (Hisch/Berg/Stolze) that it detects as duplicates of that item. I find this confusing. It should show these items from the beginning and somehow indicate that they are supposed duplicates of Rottenburg, i.e. it should show them below the suspect entry rather than in alphabetical order, even if that is the chosen sort order. In other words, duplicate grouping should override the sort order; otherwise it is extremely confusing when there are lots of duplicates.
When I then select "use as master", the alleged duplicates disappear from the duplicates view, but they are not deleted from the library.
See the screenshot "duplicate still in library", which was taken after the above procedure.
I am not sure whether these are bugs or whether I am not following the intended procedure. If the latter, then I must say the UI is not very intuitive.
I would suggest that when running "duplicates view", all duplicates including their alleged masters are shown. (Then the user can simply decide what to do with them.)
(In my view, the algorithm catches too many entries. It would be better if it only caught those with identical authors.)
I hope this helps.
best
migugg
So how do I go about doing a search for duplicates?
One point that might help clear things up a little is that all newly imported items are marked yellow (yellow = possible duplicate). The duplicates view then compares these items (only) with other content in the database.
The default scan is a fuzzy search against titles. If the list is not sorted by title, you're right, it should be forced to title as the sort key; a list sorted by author would not be useful. Better still would be a sort that places potential duplicates close to one another. I'll think on this.
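To make the idea concrete, here is a rough sketch of a fuzzy title scan that checks only the newly marked items against the rest of the library, and then lists each new item with its suspected partners directly below it. This is not the branch's actual code: the dict layout, the function names, the use of Python's `difflib`, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(title):
    # Lowercase and replace punctuation so trivial differences do not matter.
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in title.lower())
    return " ".join(kept.split())

def similarity(a, b):
    # Ratio between 0.0 (nothing in common) and 1.0 (identical after normalizing).
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_partners(new_items, existing_items, threshold=0.85):
    """Map each new item's id to the ids of existing items with similar titles."""
    partners = {}
    for new in new_items:
        partners[new["id"]] = [
            old["id"] for old in existing_items
            if similarity(new["title"], old["title"]) >= threshold
        ]
    return partners

def grouped_order(new_items, existing_items, partners):
    """Sort new items by title, placing each one's suspected partners right after it."""
    by_id = {item["id"]: item for item in existing_items}
    ordered, seen = [], set()
    for new in sorted(new_items, key=lambda item: normalize(item["title"])):
        ordered.append(new)
        for pid in partners[new["id"]]:
            if pid not in seen:
                ordered.append(by_id[pid])
                seen.add(pid)
    return ordered
```

Because only titles are compared, two different works with near-identical titles would still be flagged; an author check could be layered on top of `find_partners` if that proves too noisy.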
When I get some time and other things slow down, I'll take a look at the setup again, and use your comments for usability checks. Again, this is very helpful stuff.
The key point is that this is not intended to be totally automatic; the color-coding shows the system's guess about duplicate partners, and the user deals with them via the context menus.
The only thing that can be done automatically in a batch is to clear green items in the duplicates view: these are new (originally yellow) items that have no apparent duplicate partners, and can therefore be confirmed as unique. Once they are cleared, they will not be checked again unless a new entry is added to the database that appears to be a duplicate partner to them. In that case, they would appear with the partner in the "duplicates view", and the new partner's context menu would offer the possibility of merging to the older "master" item.
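As a rough illustration of that batch step, the sketch below treats a marked item with no partners as green and clears all green items in one go. The function names and the `partners` mapping (item id to a list of suspected partner ids) are hypothetical, and it assumes that marked items with partners simply stay yellow.

```python
def classify(marked_ids, partners):
    # Green: marked item with no apparent partner. Yellow: partner(s) found.
    return {
        item_id: ("yellow" if partners.get(item_id) else "green")
        for item_id in marked_ids
    }

def clear_green(marked_ids, partners):
    """Confirm all green items as unique; return the ids that stay marked."""
    colors = classify(marked_ids, partners)
    return {item_id for item_id in marked_ids if colors[item_id] == "yellow"}
```

Anything left in the returned set still needs a decision from the user via the context menus.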
I have not fully followed, but am happy to hear somebody is working hard on this. Given the integration of PDF files into the user library, I think it is important that there should be a merging option.
Also, I get the feeling that while some people like to use Zotero for research papers based on other research papers, others may want to use it in a broader context where snapshots of webpages are used. This can be a very interesting tool for people who teach media or languages (such as my future wife), and I can imagine it being useful for people who research about and on the web. However, this may cause a bit of a contradiction in terms, as scientific research papers tend to follow pretty rigid rules as to what you can and should reference. Maybe there should be some sort of preference setting to allow the library to work in one way or the other?
All the best and I'll keep trying at Zotero and see if it works for me and my students.
And many thanks for the great work! Looking forward to using this on a regular basis.
Thanks C.
You've got a Catch-22 snag, of course, because the only way to unmark items at the moment is through the Duplicates View. This is not a happy situation, and I'll look at introducing a context menu selection for unmarking items in the main listing. In the meantime, you'll have to start again with a fresh copy of your zotero.sqlite.
(The menus arrangement and naming conventions could use some attention, obviously.)
And a second thing: some of the new features of Release Candidate 1 (2.1) disappear (the extra item window, and the view options added to the item context menu).
Thanks for the clarification. I'll look at it; post if you notice a pattern to the failure.
Re missing features of 2.1, thanks for pointing this out; some heavy code refactoring was needed to keep up with changes on the trunk, and some things seem to have gotten lost in the shuffle. I'll try to get these features back in there.
The system won't delete an item that has file attachments on it. It's not a permanent state of affairs -- I'm sure if we put our heads together we can come up with much more elegant solutions -- but the idea is to avoid accidental data loss when tearing through a large number of duplicate items. I'm almost certain that this is what is "breaking" things for both of you. If you move the attachment across to the item into which you are merging by hand, the "use existing partner" selection should wake up.
Playing around with it, I see that the current behavior is inconsistent; only file attachments block a merge, but there could be equally important information in a note (in fact the risk there is if anything higher); yet you can merrily delete items with hundreds of notes attached to them, and the system won't say "boo". You can recover them from the trash, but after they are emptied from trash, it would be bye-bye to all that work product.
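The inconsistency can be put in a few lines. The sketch below is only a model of the behavior described here, with hypothetical field names (`attachments`, `notes`); the second function is one possible stricter rule, not something the branch currently does.

```python
def can_merge_current(item):
    # Behavior as described: only file attachments block the merge.
    return not item.get("attachments")

def can_merge_strict(item):
    # Possible stricter rule (an assumption): notes block the merge too.
    return not item.get("attachments") and not item.get("notes")
```

An item carrying only notes passes the first check and would be deleted on merge, while the second check would hold it back until the notes are dealt with.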
So be careful out there. :) Suggestions on what we might do with the merge selection interface would be very welcome. I have some ideas, but there are lots of possibilities, and I probably haven't thought of most of them.
My suggestion would be a simple one: just a warning message saying that you are about to delete some attachments, with the option to cancel the merge. You can then copy all the necessary attachments into the master and merge again.
Or maybe, not a really elegant one, but copy all attachments into the master without warning.
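Both options fit in one small function, roughly like this. It is only a sketch to show the two behaviors side by side: the item layout, the `confirm` callback, and the `deleted` flag are made up for the example and are not Zotero's real API.

```python
def merge(master, duplicate, confirm=None):
    """Merge `duplicate` into `master`; return False if the user cancels."""
    files = duplicate.get("attachments", [])
    if confirm is not None:
        # Option 1: warn first; the user may cancel and move files by hand.
        if files and not confirm(len(files)):
            return False
    else:
        # Option 2: silently carry every attachment over to the master.
        master.setdefault("attachments", []).extend(files)
    duplicate["deleted"] = True
    return True
```

With `confirm` given, a cancelled merge leaves both items untouched, so the attachments can be copied by hand before trying again.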
I have been using Zotero since 2004 maybe? A long time and numerous projects' worth, at any rate. In general I think Zotero rocks, but I need a duplicates solution. I have waited as long as possible, but I need to turn my final dissertation in this week. Can anyone help me? I am frankly too database-language illiterate and too exhausted to follow much of the discussion presented above.
So my question is this:
(1) Is there a social-scientist-friendly solution? I am not completely code/programming phobic, just not literate enough to follow a dev-level discussion
OR
(2) Do I need to manually go through my dissertation refs?
Thank you for any answer and/or support
//s