Duplicates across collections imported from multiple databases (meta-analysis)

I'm starting work on a meta-analysis and have chosen several relevant databases (DBs). For the final batch, I expect to be left with around 15-20 studies to be meta-analysed.

Guidelines such as PRISMA recommend that authors report the number of initial results in each DB (unclear if before/after filters have been applied), and how many of those overlap between DBs (number of duplicates). Thus, in this case the aim is not to merge duplicates but to count them and count also the unique items.

So, on to my questions:
1) Once results from each DB are imported into a collection, how can I arrive at the number of duplicates that exist between all of them, AND
2) .. how can I create a new collection formed of unique items from across all collections?
3) Some of my DBs (e.g. Google Scholar, ExLibrisPrimo) don't have an option to export *all* search results to Zotero (or even as an XML/XLS file). The 'Zotero Item Selector' button in Chrome allows me to only save as many items as the DB allows to fit on one results page, so only 10-or-so items at a time can be imported. Is there a better way?

I'd guess no one feature of Zotero specifically exists for these aims, and that a creative workaround would have to be found. Many thanks in advance to this great community.
  • The workflow I recommend is that you make a collection for each database that you import from. Then, use the Duplicate Items special collection to merge duplicates across each collection. The result will be a single set of non-duplicate items, many of which will be members of multiple collections.
    http://zotero.org/support/collections_and_tags#special_collections

    Regarding importing whole search results: for Google Scholar, first save the items to your "My Library" by clicking the star icon under each result, then export them all in one go using the Export button on the My Library page. You can also search using the Publish or Perish program and export from there. Similar systems exist for other databases.
  • Thanks for suggesting this. Unless I misunderstand, this would require that I manually merge every pair of duplicates (or however many there are), which would mean a lot of work for collections with hundreds of results. Selecting all under Duplicate Items and choosing Merge NNN items is probably not the automated solution I'm after, since I guess this would merge all items from all duplicate pairs into a single item.

    Also, after I've merged all duplicates, it seems to me each collection will retain its individual (unique) items, but how can I merge them all into a new collection, since it will no longer be of interest which DB they originated from?
  • @longtalker Define what _exactly_ is a duplicate. What about very similar records except that one has info that the other doesn't have? What about records that have _different_ information for the same field but that are otherwise identical? Fully automated duplicate detection and merging is very difficult bordering on impossible -- at least if accuracy / precision is valued. While there are ways to match records 1 to 1 such as with a DOI, different databases can have different degrees of completeness for items such as author names, etc. There will almost always need to be human decision-making involved in the selection and merging process.
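To illustrate the 1-to-1 matching point above, here is a minimal sketch (plain Python on hypothetical exported records, not a Zotero feature): records sharing a DOI can be paired automatically, while records without one are set aside for the human decision-making described above.

```python
from collections import defaultdict

def split_by_doi(records):
    """Group records by DOI; records without one need manual review.

    `records` are hypothetical plain dicts as you might get from a
    CSV/RIS export, e.g. {"doi": "...", "title": "...", "source": "..."}.
    """
    by_doi = defaultdict(list)
    no_doi = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()  # normalize case
        if doi:
            by_doi[doi].append(rec)
        else:
            no_doi.append(rec)
    # A DOI seen in more than one record is an automatic duplicate set
    duplicates = {d: recs for d, recs in by_doi.items() if len(recs) > 1}
    return duplicates, no_doi

records = [
    {"doi": "10.1000/abc", "title": "Study A", "source": "PubMed"},
    {"doi": "10.1000/ABC", "title": "Study A", "source": "Scopus"},
    {"doi": "", "title": "Study B", "source": "Scholar"},
]
dups, manual = split_by_doi(records)
print(len(dups), len(manual))  # 1 duplicate set, 1 record for manual review
```

Records landing in the manual pile are exactly the ones where completeness differs between databases, which is the hard part.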
  • @DWL-SDCA You are right, I know this is far from trivial. Manually, I would just look at the key fields (first author, year, and title). Is this how you would do it as well, i.e. manually go through each pair and decide to merge/not merge?

    It seems to me Zotero's algorithms are quite good (certainly better than what I'd achieve by hand - and faster to boot), but there is still no Merge All option.

    And what about merging the individual collections after duplicates across all have been removed?
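For the key-field approach mentioned above (first author, year, title), a rough sketch of how one might count unique items and cross-database duplicates by a normalized key, illustrative only, on hypothetical exported records; it will miss the hard cases already discussed:

```python
import re
from collections import Counter

def key(rec):
    """Normalized (first-author surname, year, title) key for rough matching."""
    author = rec["first_author"].split(",")[0].strip().lower()
    title = re.sub(r"[^a-z0-9 ]", "", rec["title"].lower())  # strip punctuation
    return (author, rec["year"], " ".join(title.split()))

def count_overlap(*collections):
    """Count unique items and cross-database duplicates by normalized key."""
    counts = Counter(key(rec) for coll in collections for rec in coll)
    unique = len(counts)
    duplicates = sum(n - 1 for n in counts.values())  # extra copies only
    return unique, duplicates

pubmed = [{"first_author": "Smith, J.", "year": "2019", "title": "A trial of X"}]
scholar = [{"first_author": "smith", "year": "2019", "title": "A Trial of X."},
           {"first_author": "Doe, A.", "year": "2020", "title": "Y revisited"}]
print(count_overlap(pubmed, scholar))  # (2, 1): 2 unique items, 1 duplicate
```

This gives the PRISMA-style counts (total unique, total duplicates removed) without merging anything, so it complements rather than replaces Zotero's own duplicate handling.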
  • You would not need to merge the individual collections. The general library view for a library will show you all the items in a library regardless of collection membership. I recommend making a new Group library for each systematic review/meta-analysis so that your collections and tags for your review can be separate from your general library.

    Regarding Merge All, there is no such function, but I personally suggest not using such a system to avoid inadvertently merging false positives. I’ve done a dozen meta-analyses with thousands of hits with Zotero. Merging all of the duplicates with Zotero takes 15-20 minutes.
  • @bwiernik Thank you so much, this is really useful. You are right, creating a new library just for this meta-analysis is much cleaner, as I would not want the thousands of items imported during this process to "mingle" with the items in my general My Library, which have notes, custom file names, etc., that I wouldn't want to risk being inadvertently changed!

    I did not see any New Library option (only New Collection), so I assumed I'd have to work within My Library. I'll be doing this alone, so I'm not sure whether a Group library is the most appropriate way to define a new library?
  • Yes, make a new Private group. You don’t have to invite anyone else to it (but I usually do all of my RAs and other collaborators working on the project).
  • Thank you very much for those hints, they've been very helpful.
  • Hi, this is what I have learned from the forums and my own systematic reviews:

    1) Once results from each DB are imported into a collection, how can I arrive at the number of duplicates that exist between all of them, AND

    I created individual collections in my library for each database, under a "Dissertation" collection. (I did not use a Group library as advocated above, because I did not see that tip earlier. I will on my next SR!)

    I then tagged all the items in each collection as "dbs:Scholar", "dbs:PUBMED", and so on. (Select one article in the collection and add the new tag, then start typing in the tag box at the bottom left so that the new tag shows; that's why I used the "dbs:" prefix, so that all the relevant tags appear together. Then press Ctrl+A to select all the articles, and drag and drop them onto the tag.)

    Only then did I go and merge all the duplicates. The result is that the duplicates carry both tags, so when I go into the Scholar collection, I can select the dbs:PUBMED tag and see all the articles that are present in both Scholar and PubMed.

    The problem is that some duplicates are not recognised properly, so make sure you also go to the main library view and check for them manually.

    2) .. how can I create a new collection formed of unique items from across all collections?

    Now that they are all merged, you just create a new collection and copy the items over from the database collections. The duplicates are considered one item and won't appear twice.


    3) Some of my DBs (e.g. Google Scholar, ExLibrisPrimo) don't have an option to export *all* search results to Zotero (or even as an XML/XLS file). The 'Zotero Item Selector' button in Chrome allows me to only save as many items as the DB allows to fit on one results page, so only 10-or-so items at a time can be imported. Is there a better way?

    For Google Scholar, use https://harzing.com/resources/publish-or-perish — it's amazing; it gives you all the Google Scholar results. Otherwise you cannot import too many, because Google stops you (too many connections or something).

    For other ones, you could ask your university library if they offer this service.

    Hope it helps,

    Costas
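The tag-based bookkeeping Costas describes in step 1 can also be tallied in one pass once the merged library is exported (e.g. to CSV). Here is a hedged sketch on hypothetical records whose "dbs:" tags survived merging; the field names are assumptions, not a Zotero export format:

```python
from collections import Counter
from itertools import combinations

def overlap_table(items):
    """Count items per database tag and per pair of database tags.

    `items` are hypothetical dicts with a "tags" list; after merging,
    a duplicate carries the "dbs:" tag of every database it came from.
    """
    per_db = Counter()
    per_pair = Counter()
    for item in items:
        dbs = sorted(t for t in item["tags"] if t.startswith("dbs:"))
        per_db.update(dbs)
        per_pair.update(combinations(dbs, 2))  # every pair of source DBs
    return per_db, per_pair

items = [
    {"tags": ["dbs:PUBMED", "dbs:Scholar"]},   # a merged duplicate
    {"tags": ["dbs:PUBMED"]},
    {"tags": ["dbs:Scholar", "review"]},       # non-dbs tags are ignored
]
per_db, per_pair = overlap_table(items)
print(per_db["dbs:PUBMED"], per_pair[("dbs:PUBMED", "dbs:Scholar")])  # 2 1
```

The per-pair counts are the overlap numbers PRISMA asks for, and the per-database counts should match the size of each original import.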
  • Thanks a lot Costas, will follow those tips, am grateful to you for them!