Analyze/Show (Non-)Duplicates

cb102 · April 19, 2016

I have two folders (on the same level) with publications in them.
My theory is that there are duplicates between those folders, but not everything is duplicated: some items are unique and appear in only one of the folders.

I want to show that.

I want to know how many duplicates are in each of the folders and how many unique items are in each of the folders.

adamsmith · April 19, 2016

Export both to CSV and match on some key identifiers (DOI, ISBN, or just title)?
That can't be done in Zotero, and I'd argue it goes beyond what Zotero should do.
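
A minimal sketch of that export-and-match idea, for illustration only. It assumes each folder was exported to CSV with columns named `DOI`, `ISBN` and `Title` (the column names, and the helper names `normalize`, `record_key`, `load_keys` and `compare`, are assumptions made for this example). Each record is keyed by the first identifier it has, so an item that carries a DOI in one export but not in the other will not be matched.

```python
import csv
import io


def normalize(value):
    """Lowercase and keep only letters and digits, for loose matching."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def record_key(row):
    """Return (field, value) for the first non-empty identifier: DOI, ISBN, Title."""
    for field in ("DOI", "ISBN", "Title"):  # assumed column names
        value = normalize(row.get(field) or "")
        if value:
            return (field, value)
    return None


def load_keys(csv_text):
    """Collect the set of matching keys found in one CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {key for key in map(record_key, reader) if key is not None}


def compare(csv_a, csv_b):
    """Count keys present in both exports and keys present in only one of them."""
    keys_a, keys_b = load_keys(csv_a), load_keys(csv_b)
    return {
        "in_both": len(keys_a & keys_b),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }
```

To use it, read each exported file into a string (for example with `open(path, encoding="utf-8").read()`) and pass both strings to `compare`.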

adamsmith · April 19, 2016

Well, actually, I take this back partly. I do think that duplicates (and unfiled items) should be available as a search condition, just like the conditions other saved searches use, and with that this could be feasible (maybe?).

cb102 · April 19, 2016

One parameter of a "systematic" search is to use more than one database.
In my example there is "MEDLINE via Web of Science" and "MEDLINE via PubMed" (which IS different!).
And I will add two more databases to that search, too.

Isn't Zotero used for systematic literature searches?

Handling duplicates is elementary for that.

And Zotero already offers an item in the root named "Eintragsdubletten" (German! It could be "entry duplicates" in English).
So it is already possible to do it, just not between two sub-folders.

Gurdas_Sandhu · April 19, 2016

**edited** Thought again and realized this won't work, because the items are duplicates but NOT the exact same item.

Tag all items in sub-folder 1 with some unique tag, let's call it "tag 1", and all items in sub-folder 2 with another unique tag, called "tag 2". These tags MUST be unique and created just for this comparison.

Then, go to folder 1 and select tag 2. This will show you items in folder 1 that also exist in folder 2. Or go to folder 2 and select tag 1. Or, create a saved search showing items that have BOTH tags.

Not very elegant, but may be a workable solution?

adamsmith · April 19, 2016

Well, but something like what Gurdas says might be possible. If you tag the whole search import, you can count the number of duplicates in each search.
Also, note that merging items would keep tags/collections, so if you merge duplicates, you could run a search for items in both collections/with both tags and get a list of all items that were duplicates.
Not sure how much more complex the type of data analysis you want to do on this is. I still think that for actual data analysis, you're better off using export (people have done some nice things in R with .csv export from Zotero).
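
The merge-then-search idea can be illustrated with a small sketch. It assumes every merged item is represented simply by the set of database tags it ended up with. The tag name "Embase" and the function names are hypothetical, chosen only for this example.

```python
def count_overlap(merged_items, tag_a, tag_b):
    """Count merged items that carry both tags, whatever other tags they have."""
    return sum(1 for tags in merged_items if tag_a in tags and tag_b in tags)


def count_only(merged_items, tag, all_tags):
    """Count merged items that carry `tag` and none of the other database tags."""
    others = set(all_tags) - {tag}
    return sum(
        1 for tags in merged_items if tag in tags and not (others & set(tags))
    )


# Hypothetical library after merging: one set of tags per remaining item.
merged = [
    {"WoS", "PubMed"},
    {"WoS"},
    {"PubMed", "Embase"},
    {"WoS", "PubMed", "Embase"},
]
wos_pubmed_overlap = count_overlap(merged, "WoS", "PubMed")
wos_unique = count_only(merged, "WoS", ["WoS", "PubMed", "Embase"])
```

Because `count_overlap` ignores any further tags, it also covers the case of comparing two result sets while a third one is present.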

cb102 · April 21, 2016

If I understand @Gurdas_Sandhu correctly, it won't work. The tagging mechanism doesn't know how to compare the entries. I don't see how this should work.

The other point is that @adamsmith and other readers here don't respond to my arguments.

Just to be clear: This thread isn't just about finding a solution for MY problem. It is about improving Zotero for everyone, which is my primary goal. Zotero is the only realistic and free (in the sense of FOSS) alternative to proprietary literature managers.

adamsmith · April 21, 2016

You ask how something is possible and we're trying our best to see how it might be. If it already is, that's certainly preferable to implementing new features, which is always costly in terms of coding, maintenance, GUI space, and documentation.

The only specific feature that you request is:
"I want to know how many duplicates are in each of the folders and how many unique items are in each of the folders."

That is, in fact, possible with tags.
1. Perform the search in WoS and place items in a saved search or a collection.
2. Select all search results and tag them with WoS (either dragging to tag or using a number for a colored tag). This will also show you the total number of items returned as part of the search.
3. Do the same for all other databases you want to search.
4. Go to duplicate view. Filter by tag WoS. This gives you the WoS search results that are _not_ unique, and their number.
5. Subtract that number from the total from step 2 and you have the number of unique results.

That may sound a little complicated, but it's really 2mins, tops.
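
As a sketch of the arithmetic in steps 2, 4 and 5: suppose each item is written down as a pair of its database tag and whether it shows up in the duplicate view. The input format and the function name `summarize` are assumptions for this illustration; Zotero does not produce such a list by itself.

```python
from collections import Counter


def summarize(items):
    """items: iterable of (database_tag, appears_in_duplicate_view) pairs."""
    totals = Counter()
    non_unique = Counter()
    for tag, is_duplicate in items:
        totals[tag] += 1          # step 2: total items per database tag
        if is_duplicate:
            non_unique[tag] += 1  # step 4: items listed in the duplicate view
    return {
        tag: {
            "total": totals[tag],
            "non_unique": non_unique[tag],
            "unique": totals[tag] - non_unique[tag],  # step 5
        }
        for tag in totals
    }
```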

Certainly possible that that doesn't give you what you need -- I'm not an expert in systematic reviews -- but then you'd have to be more explicit about what it is that you need.

And please refrain from ad hominem accusations. If I wasn't interested in figuring this out, I wouldn't post here. I have lots of other things to do.

cb102 · April 21, 2016

To your number 4:
That doesn't give me the number. The duplicate view doesn't show me the count of items in it by default. I first have to select all items to get something like "xyz items selected".

I should add this use case to the wiki. Any suggestions on which section of https://www.zotero.org/support/ would be best for that?

But I still have a problem with that tag "solution". Maybe it is just because I am not really getting into that tag thing. E.g. I have 3 or more database search results (each with a subfolder). So there are some publications that exist 3 or more times. How could I find out how many duplicates there are between folders 1 and 2 while ignoring folder 3?

And the important one: How can I see what a "duplicate" is for Zotero? Which fields are compared in the duplicate view? Could I modify that? The context menu of the duplicate view doesn't offer that.

cb102 · April 21, 2016

Ah, I am slow, but I am getting into this.
For the question about 3 or more folders, you could use saved searches based on the tags.

It works but is far from usable. It is a user interface question, and I have to think about how this could be done more elegantly. I have a colleague who would be interested in that task.

How would you prefer we present our suggestions about that? Internal mail? GitHub issue, ...? It should be something that allows uploads.

cb102 · April 21, 2016

Another practical question:
How do I "join" the duplicates? I have three folders. I copied all items into a single folder and added the tag "__all". So I have the items of the first three folders in a fourth one, tagged with "__all".

Now I can see the duplicates in the duplicate view with the "__all" tag activated.
But I only see a way to join/merge each entry individually, step by step. I want to do this automatically. Hundreds of merges need to be done.

adamsmith · April 21, 2016

"And the important one: How can I see what a 'duplicate' is for Zotero? Which fields are compared in the duplicate view? Could I modify that? The context menu of the duplicate view doesn't offer that."

That might actually be difficult because Zotero's Duplicate detection wasn't really designed for analysis but just for de-cluttering your database, so it's a black box (an open source one, so not technically black, but you get the sense).
Roughly speaking, it first checks for the presence of a unique ID (DOI or, for books, ISBN) and automatically marks those as duplicate. It then does a fuzzy matching on title and author(s). And no, can't be modified.
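
To make that rough description concrete, here is a sketch of a two-stage check of the same general shape. This is NOT Zotero's actual code: the record layout (dicts with `doi`, `isbn`, `title`, `authors`), the use of Python's `difflib`, and the 0.9 threshold are all assumptions picked for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, turn punctuation into spaces, collapse runs of whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_duplicate(a, b, threshold=0.9):
    # Stage 1: a shared identifier (DOI or ISBN) settles it immediately.
    for field in ("doi", "isbn"):
        if a.get(field) and b.get(field):
            if normalize(a[field]) == normalize(b[field]):
                return True
    # Stage 2: otherwise require both title and authors to be very similar.
    title_ok = similarity(a.get("title", ""), b.get("title", "")) >= threshold
    authors_ok = similarity(a.get("authors", ""), b.get("authors", "")) >= threshold
    return title_ok and authors_ok
```

The threshold decides how many false positives and false negatives you get, which is one reason any automatic matching of this kind needs checking by eye.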

Writing up the workflow: Personally I put things like this on my blog rather than in the Zotero documentation. There's https://www.zotero.org/support/tips_and_tricks, but that's shorter bits. I don't think having /support/systematic_review is out of the question, but I'd want a couple of interested people. If it's just you writing something up, then, should you lose interest/time, it's just going to sit in the documentation and become outdated/forgotten.

Making proposals: Here's really the best place. Link out to screenshots. Don't expect anything to happen fast, though, especially as it concerns GUI. Given upcoming changes for Zotero, I don't think anyone is going to be interested in doing GUI stuff until Zotero moves to its new platform.

Auto- (or mass-) merge duplicates: Doesn't currently exist (partly because it hasn't been implemented, partly out of concern for false positives). You're not the first person to ask for it, though, and it probably should be done. Don't think it'll be quick, though.

cb102 · April 21, 2016

"until Zotero moves to its new platform" ???

I don't have a blog. ;) No time, no talent. :D

Sounds like there are plans about that duplicate thing. So is there an official ticket I can subscribe to?

bwiernik · April 21, 2016

I'd gladly contribute to and help maintain a "systematic review workflow" piece if cb102 puts one together. The majority of my research is systematic reviews/meta-analyses.

cb102 · May 18, 2016

I found a workaround: nothing automatic, just manual. But it will do the job for now. And it doesn't prevent us from developing a nice solution in the future for managing results from systematic research!

You have searched two databases and imported the result of each into its own folder (collection).

MyBib
|- Result Database A
|   |- Publication01
|   |- Publication02
|- Result Database B
    |- Publication03
    |- Publication04

So there are four independent data entities/objects in the database.
Let's assume 02 and 03 have the same title, author and publication year: this means they are duplicates!

Copy(!) all results into one folder:

|- All results with duplicates
    |- Publication01
    |- Publication02
    |- Publication03
    |- Publication04

Export the complete folder to a RIS file.
Create a new folder "All results without duplicates" and import the RIS file into it.

Now you have created new data entities independent of the first ones. Sort by title and secondarily by author to spot the duplicates by eye. In that folder you can delete the duplicates without affecting the other folders.
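
The by-eye step could also be supported by a small script run on the exported RIS file. The sketch below is only an illustration under assumptions: it reads the common RIS tags `TI`/`T1` (title), `AU`/`A1` (author) and `PY`/`Y1` (year), treats two records as duplicates when normalized title, first author and year are all equal, and uses helper names invented for this example.

```python
import re
from collections import defaultdict

# A RIS line looks like "TI  - Some title": two-character tag, two spaces, hyphen.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def normalize(text):
    """Lowercase, turn punctuation into spaces, collapse runs of whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def parse_ris(text):
    """Split RIS text into records; each record maps a tag to a list of values."""
    records, current = [], defaultdict(list)
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # skip blank lines and continuation lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end of record
            if current:
                records.append(dict(current))
            current = defaultdict(list)
        else:
            current[tag].append(value)
    return records


def first(record, *tags):
    """Return the first value found under any of the given tags, else ''."""
    for tag in tags:
        if record.get(tag):
            return record[tag][0]
    return ""


def duplicate_groups(records):
    """Group records by (title, first author, year); keep groups larger than one."""
    groups = defaultdict(list)
    for record in records:
        key = (
            normalize(first(record, "TI", "T1")),
            normalize(first(record, "AU", "A1")),
            first(record, "PY", "Y1")[:4],
        )
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]
```

Exact matching on normalized fields will miss duplicates whose titles or author names are spelled differently across databases, so the result is a starting list to review, not a final answer.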

That's it, I think.

It is not automatic and a lot of manual work (if you have 7 databases with 3000 publications in total), but it is OK.

If you want to know which database a publication comes from, you can search for its title in the folder "All results with duplicates" and, for each entry with the same title, use the ALT key to highlight the database folder it is in.

What do you think?

bwiernik · May 18, 2016

I think you probably have a few more steps than needed. I would recommend this:

1. Your first step--save the results for each search in a separate collection.
2. Open the duplicate items view and merge duplicates. Important--do not delete items, merge them.
3. After merging duplicates, the main library view (across collections) will show you the final list/number of unique items.
4. To get the total number of original items, including duplicates, open each database collection and sum up the number of items in each. Because you merged, not deleted, duplicates, each duplicate will still be present in each of its original source folders.

cb102 · May 18, 2016

Technically you are right, but your use case is not realistic and doesn't fulfill the needs of systematic research.

I don't create an empty/fresh profile to have an empty Zotero project. If I could create Zotero projects independent of my Firefox/Mozilla profile, I would be glad. But currently that is not possible. So I have to deal with this inside my regular production library, which includes a damn lot of other publications. That is why the duplicate view doesn't make sense: it shows all items and cannot be restricted to a set of sub-folders/collections.

The other problem: when you re-edit items (e.g. to correct some data from an import or whatever), this would affect the item in the database result folder, too. That is a no-go.

The original imported items from each database are raw and should never be touched! This is very important for documentation!
For example, you can analyze data quality between databases, e.g. PubMed sucks extremely compared to Web of Science.
Never modifying the raw data is a big rule in all types of research! During the process, or at or after the end, you never know whether you will want to go back into your data to answer some new questions.

By the way: this would be one of the feature items of a systematic research solution for Zotero. Don't merge to eliminate duplicates. Create a new entry out of some duplicates without touching them.

adamsmith · May 18, 2016

I'm just going to post
https://scholar.google.com/citations?user=PeT8rDwAAAAJ&hl=en
and suggest that you take it a little easy on lecturing people with extremely successful research careers on what research entails...

Let's keep this focused on the type of functionality you would find useful for _your_ research instead of overly broad generalizations.

bwiernik · May 18, 2016

Thanks adamsmith.

@cb102

For your first point, I recommend that you create a new Zotero Group for each systematic review you conduct. This way, each review can have a unique library that is separate from any other project you do. I do this for each of my reviews, and it is really helpful for keeping everything neatly separated. This is even helpful if you are the only member of the group.

With regard to maintaining the original metadata, I think there are differences in reporting traditions across fields here. In psychology, analyses regarding the quality of metadata in different databases aren't common, and it is considered sufficient to document the number of studies that came from each database as well as which studies came from which database (retaining the original poor quality metadata is not considered necessary--in my particular field this is especially the case because many sources are not indexed by any database, such as test publisher internal data).

Given that your reporting needs are greater, I see that merging will not necessarily work for you. I have a few thoughts on how you might streamline your workflow if they meet your needs.

1 - Do you need to have the original raw metadata constantly available in Zotero? If not, it might work for you to export the untouched results from each database collection and simply archive them for later use if necessary (I believe that Zotero RDF will be the least lossy export format, @adamsmith please correct me if I'm wrong.) Then, you can proceed with the merge duplicates workflow I describe above.

2 - If you really do need to have the untouched items readily available in Zotero, I would recommend you import each database into its own collection, then, in the main Group library view, sort by Title then Author (or some other pair of variables that will put the duplicates next to each other). Then, for each set of duplicate articles, you can choose the one with the best metadata, Duplicate it and correct the metadata, then apply a tag to indicate that this corrected item is the master item of the set (you can set this tag to be one of the library's colored tags to give you a visual indicator of its status). For items that are not duplicates, you could just add the colored tag if no metadata corrections need to be made. To keep track of the duplicate item sets, you can select them all together in the main library view and then use the Zutilo function to relate them all together (this way, you can hop between different versions of the item quickly--you can also add a tag to each item in the database collections so you can quickly see which database each version came from).

3 - One other alternative would be to create another Group library for the review. After you import all of the items, rather than exporting to Zotero RDF, you can go to the main library view, select all the items, then drag them to the other library, where you can do all the merging and further analyses. This isn't as flexible with regard to being able to compare the raw versions to the merged corrected version, but it would keep your libraries less cluttered.

cb102 · May 18, 2016

@bwiernik Thanks for your constructive answer.

I see we have different needs, and I handle systematic research not only as a researcher; I do it as a computer scientist, too. I want to record as much data as I can during the research process. This includes the tiny steps I described. Let's say I am kind of paranoid and pedantic when it comes to data management.

Using Groups would mean I have to use an account on Zotero.org?
I don't use external clouds, for security reasons. But it is a good time to ask whether I can run Zotero Server (or whatever it is called) as my own instance?

But using Groups isn't useful for me at all. I want to have all literature together in one project, accessible all the time without switching between project files or Mozilla profiles. I have scientific, technical and even cooking recipes in one project. ;)

Of course I keep the raw export files (RIS, pseudo RIS, XML, whatever) from the databases. But it is easier to access and manage them in software like Zotero.

adamsmith · May 18, 2016

Yes, you need to create groups on zotero.org and you need to sync in order for them to appear in your local copy of Zotero, so if that's not an option, neither are groups.

The Zotero server code is open and some people do run the server locally, but it's not trivial. I think the most comprehensive instructions and code are at
http://git.27o.de/dataserver/about/
There is a good deal of discussion on the zotero-dev group (and any discussion related to setting up the server should take place there).

cb102 · May 19, 2016

What do you mean by "zotero-dev group"? A mailing list? I cannot find one on zotero.org. Or a category in this forum? The "developers" category is "(closed)".

I just want to ask how I should interpret this part of the description on the dataserver site:
"support group file synchronization using a local S3 compatible storage service."

This sounds to me like a contradiction in terms.
"local S3 storage device"
Does it mean that my own Zotero server instance still uses Amazon servers to store data?

adamsmith · May 19, 2016

Zotero dev:
https://groups.google.com/forum/#!forum/zotero-dev

"support group file synchronization using a local S3 compatible storage service."
It's local and does what AWS does

cb102 · May 19, 2016

AWS stands for "Amazon Web Services".
So what part of it is local?

To me it sounds like there is still contact with Amazon servers.

adamsmith · May 19, 2016

I don't know how else to put this.
It is a local server that replaces the functionality for which Zotero uses Amazon Web Services.
It's also not as if you could set up a server that secretly communicates with AWS. You would notice that, starting with the fact that you would need an AWS account. But in this case you don't...

cb102 · May 19, 2016

Then the text should specify it that way. "Amazon", "S3" and "AWS" are generally not very trust-inspiring strings.