checking for duplicate entries after importing

I have a large bibliography that I import, and I would like to see whether some of the imported entries are already in my library. If you click an entry while holding Ctrl (on Windows), you can see whether it is already present in another collection, but doing this for hundreds of imports is tedious.

Is there a more elegant way of checking for entries with identical fields?

I looked into the documentation but, to my surprise, did not find any hint of an answer to my question.

TIA
ftr
  • no - search for "duplicate detection" and you'll find (subjectively) a gazillion threads on this, essentially all stating that it's planned but harder to do than it may seem.
  • Computational linguists face a similar problem in duplicate document detection. The documents will not necessarily be exactly the same, but there are statistical methods to detect document similarity.

    One difference is that documents are likely to be considerably larger than bibliographic entries, which makes identification of duplicate documents somewhat easier (roughly, there is more evidence to decide by). Still, I'd be surprised if statistical methods didn't help with biblio entries: string edit distance on authors' names (or more likely last names) and titles, date comparison (allowing that Aristotle's books might have odd publication dates), and so on, followed by a statistical combination of the results. You would then present to the user the pairs above a certain similarity threshold.

    The other aspect of the problem is that it would likely be slow: N**2 in the number of entries. (You couldn't just compare entries that alphabetize together by some field, because e.g. multiple authors might sort differently, and 'van der Hulst' might be sorted under 'v', 'd', or 'h', depending on how it was entered in the field; the string edit distance would have to be sensitive to this, too.) Handling permutations of names where the family name does not always come last would also be necessary.

    All in all, a messy problem, but not, I believe, an impossible one, provided you're willing to sort through the results by hand; a rough sketch of the idea follows below.
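
    A rough, hypothetical sketch of that pairwise-comparison idea (not anything Zotero actually does): the field names, weights, and the 0.8 threshold below are made-up assumptions, and the string similarity is just Python's standard-library SequenceMatcher rather than a proper edit distance.

        from difflib import SequenceMatcher
        from itertools import combinations

        def sim(a, b):
            """Normalized string similarity in [0, 1] (1.0 = identical)."""
            return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

        def entry_similarity(e1, e2):
            """Weighted similarity of two entries: dicts with hypothetical
            'last_names', 'title' and 'year' fields."""
            # Sort last names so that author-order differences don't matter.
            names = sim(" ".join(sorted(e1["last_names"])),
                        " ".join(sorted(e2["last_names"])))
            title = sim(e1["title"], e2["title"])
            year = 1.0 if e1.get("year") == e2.get("year") else 0.0
            return 0.4 * names + 0.5 * title + 0.1 * year

        def likely_duplicates(entries, threshold=0.8):
            """Naive O(N**2) pass: return scored candidate pairs above the
            threshold, for a human to review by hand."""
            pairs = []
            for (i, e1), (j, e2) in combinations(enumerate(entries), 2):
                score = entry_similarity(e1, e2)
                if score >= threshold:
                    pairs.append((score, i, j))
            return sorted(pairs, reverse=True)

        library = [
            {"last_names": ["van der Hulst"], "title": "Syllable Structure", "year": "1999"},
            {"last_names": ["Van der Hulst"], "title": "Syllable structure", "year": "1999"},
        ]
        for score, i, j in likely_duplicates(library):
            print(f"{score:.2f}: entry {i} looks like a duplicate of entry {j}")

    Even this naive version is quadratic in the number of entries, which is exactly the performance problem mentioned above; in practice you'd probably block candidates first (say, by year) and only score pairs within a block.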
  • mcswell - actually, identifying duplicates isn't the issue that makes this hard for Zotero. The question is how to deal with merging and deletion. Remember that document citations rely on the unique identifiers and will break if you delete the "wrong" duplicate.

    One way around that would indeed be to do it at the moment the new item gets added, but I guess the idea would be to try to do it right from the start.
  • I would guess that identifying dupes would be hard, too (and the more I thought about it, the harder it seemed).

    As for the identifiers: why not delete one of the entries, but keep its identifier as a pointer to the other entry? That is, there would be two tables of identifiers: one table whose rows are the biblio entries themselves, and another whose rows have only two columns: the id of the deleted entry and the id of the "kept" entry. (The second column might change if you found a third duplicate entry.) A search for a document citation would then be over the join of the two tables.

    (I have no idea how the Zotero db is organized internally; the above is just a guess as to how this might work. But I think the general point is to have the uid for the deleted entry point to the uid of a non-deleted entry.) A small sketch of what that could look like follows below.
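
    For what it's worth, here is a minimal sketch of that redirect-table idea using an in-memory SQLite database; the table and column names are invented for illustration and have nothing to do with how Zotero actually stores things.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE items (
                item_id INTEGER PRIMARY KEY,
                title   TEXT
            );
            -- Maps the id of a deleted duplicate to the id of the entry that was kept.
            CREATE TABLE merged_items (
                deleted_id INTEGER PRIMARY KEY,
                kept_id    INTEGER NOT NULL REFERENCES items(item_id)
            );
        """)

        def merge(deleted_id, kept_id):
            """Delete one duplicate but keep a pointer so old citations still resolve."""
            conn.execute("DELETE FROM items WHERE item_id = ?", (deleted_id,))
            conn.execute("INSERT INTO merged_items VALUES (?, ?)", (deleted_id, kept_id))

        def resolve(item_id):
            """Follow redirects until reaching an id that still exists in items."""
            row = conn.execute("SELECT kept_id FROM merged_items WHERE deleted_id = ?",
                               (item_id,)).fetchone()
            return resolve(row[0]) if row else item_id

        conn.executemany("INSERT INTO items VALUES (?, ?)",
                         [(1, "Syllable Structure"), (2, "Syllable structure")])
        merge(deleted_id=2, kept_id=1)   # a document cites item 2...
        print(resolve(2))                # ...which now resolves to the kept item, 1

    The lookup follows redirects recursively, so it still works if the kept entry is itself merged away later; alternatively, as suggested above, the kept_id column could be rewritten whenever a further duplicate is found, so a single lookup would suffice.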