Tips for speeding up duplicate merging?

jenlnorvell · April 28, 2014

I have a large Zotero database (around 35K references) and probably 3,000 to 4,000 duplicates. This is by design because we need to search multiple databases with similar search terms for a meta-analysis and then weed out duplicates.

I've emptied the trash and also hidden the tag window to speed up switching between folders (based on another forum discussion). I'm curious:

1) In general, are there any other things I should do to help Zotero work more quickly (experiencing slow opening of the database and slow navigating between folders)?

2) Specifically for duplicates, is there any way to speed up duplicate merging? Currently it is taking 30-40 seconds after clicking merge to complete the request and pull up the next duplicate. We have the staff to manually merge the duplicates, but the lag time between merges is maddening and is slowing down the progress of the project.

We'd like to stick with Zotero for these types of projects but I'm wondering if the size of the library (and future libraries) is always going to be a hindrance or if there are other things I could be doing to improve the process.

Thanks for all your work!

adamsmith · April 28, 2014

How did you import those items? And what's the total item count, including notes and attachments?
https://www.zotero.org/support/kb/item_count

jenlnorvell · April 28, 2014

They were imported from RIS files and, for PubMed, from XML. Total item count is 738,373.

adamsmith · April 28, 2014

that's not a typo? You really have 700k+ items in your library? I'm kind of surprised this is working at all.

One thing that would likely _massively_ improve speed is to delete all the notes automatically generated during RIS import and then delete the corresponding tag. See the Tip at the bottom here:
https://www.zotero.org/support/kb/importing_records_from_endnote#importing_into_zotero

adamsmith · April 28, 2014

Similarly, if you don't need the links to PubMed (since you do have the PMID anyway), search for and delete all the attached PubMed links. I'm still wondering what's going on, though: getting to 700k items with 35k references requires about 20 notes or attachments per item. Where do all of those come from?
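As a rough check of that estimate, here is a minimal sketch using the 738,373 total reported above and treating 35,000 as the reference count (an approximation, since the original post only says "around 35K"). The function name is hypothetical.

```python
def children_per_reference(total_items, references):
    """Average number of child items (notes, attachments) per reference.

    Assumes every item that is not a top-level reference is a child item.
    """
    return (total_items - references) / references


# 738,373 total items; roughly 35,000 references (approximate figure).
print(round(children_per_reference(738_373, 35_000), 1))  # prints 20.1
```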

jenlnorvell · April 28, 2014

No typo. That's the first time I've looked at the total item count. I knew it was a larger library but thought it was around the same size as what some others were working with (from looking at the 35K number).

I'm working on deleting the notes with the _RIS import tag, but it's taking a while to even load them, which makes sense. As I looked through, I found several PsycINFO references with 50+ notes, including a journal article with 147 notes. I can send the citations for a few of these larger ones if needed. Is this normal? The majority of my library is PsycINFO references.

adamsmith · April 28, 2014

I found several PsycINFO references with 50+ notes, including a journal article with 147 notes. (...) Is this normal?

No, that sounds like a bug. An item shouldn't have more than 1-2 notes (and ideally none) on import.
From which version (i.e. which database provider) of PsycINFO did you import and how exactly? And yes, a reference would be helpful.

jenlnorvell · April 28, 2014

ProQuest is the provider. I used the Export/Save option in ProQuest and saved as an RIS file.

Reference #1 (147 notes):
There's an elephant in the room: The impact of early poverty and neglect on intelligence and common learning disorders in children, adolescents, and their parents.
Bigelow, Brian. Developmental Disabilities Bulletin 34.1-2 (2006): 177-215.

Reference #2 (130 notes):
Early Relationships and Their Internalization.
Akhtar, Salman. In The American psychiatric publishing textbook of psychoanalysis, edited by Person, Ethel S., Cooper, Arnold M., Gabbard, Glen O., 39-55. Arlington, VA, US:American Psychiatric Publishing, Inc, 2005.

adamsmith · April 28, 2014

Thanks. That's helpful, but not good news.
ProQuest puts every reference in the bibliography in a separate note (N1 tag). We really have no way to filter those out during regular RIS import, since that's where notes are supposed to go. When using the URL bar icon, we can customize import for specific sites, but not for generic import from RIS etc. It doesn't look like ProQuest allows you to customize RIS export either (e.g. to not contain those references).

@aurimas - I don't think you have access to this, so I've put a sample RIS here:
https://gist.github.com/adam3smith/11381032
maybe you have an idea.

edit: FWIW, using the URL bar icon does fix this, but I assume you don't want to do that for systematic review.

edit2: haven't looked, but I assume going through EBSCO you'd also avoid this particular issue.

aurimas · April 28, 2014

The version of RIS that I was able to download didn't have the bibliography in it, so I was a bit confused as to what could be going on. I don't think there is anything we can do about it. Users should generally use the URL bar icon; that's the best option we can give.

@jenlnorvell, depending on whether you need any of your notes or not, you could just delete them all using a saved search (though it would take some time and you probably want to do it in batches of a few thousand, emptying the Trash each time).

Alternatively, if you don't need any of your notes and you have not used any of the citations in a document, you can export all of your citations into Zotero RDF without notes, clear your library (we'll guide you through this, if that's your choice), and re-import it.

aurimas · April 28, 2014

FWIW, using the URL bar icon does fix this, but I assume you don't want to do that for systematic review.

I believe you can choose the contents of the RIS on export (my only choice is "Citation, Abstract, and Indexing" though), so maybe not.

adamsmith · April 28, 2014

It would also be possible to pre-process the RIS pretty easily and remove those notes in the future. A simple sed script or even a regex-capable text editor like Notepad++ would do the job.
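A minimal sketch of that kind of pre-processing, written in Python rather than sed. It assumes the usual RIS layout, where each field starts with a two-character tag followed by two spaces and a hyphen (for example `N1  - `), and where any line without such a prefix continues the previous field. The function names and file paths are hypothetical, and this has not been checked against ProQuest's actual export.

```python
import re

# A field line starts with a two-character tag, two spaces, and a hyphen,
# e.g. "N1  - some text" or a bare "ER  -" that ends a record.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: |$)")


def strip_ris_fields(lines, drop_tags=frozenset({"N1"})):
    """Return the lines with every field whose tag is in drop_tags removed.

    Untagged lines are treated as continuations of the previous field,
    so multi-line notes are dropped together with their first line.
    """
    kept = []
    dropping = False
    for line in lines:
        match = TAG_LINE.match(line)
        if match:
            # A new field starts here; decide whether to drop it.
            dropping = match.group(1) in drop_tags
        if not dropping:
            kept.append(line)
    return kept


def clean_ris_file(src_path, dst_path, drop_tags=frozenset({"N1"})):
    """Write a copy of src_path to dst_path without the dropped fields.

    Assumes the file is UTF-8 encoded; adjust if the export uses
    a different encoding.
    """
    with open(src_path, encoding="utf-8") as src:
        cleaned = strip_ris_fields(src, drop_tags)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(cleaned)
```

Calling `clean_ris_file("export.ris", "export_clean.ris")` would write a copy without N1 notes. Other tags can be passed in `drop_tags` as well, for example `{"N1", "KW"}` to also drop keywords (KW is the standard RIS keyword tag). One caveat: a note continuation line that happens to look like a tag line would be misread as a new field, so it is worth spot-checking the output before importing.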

adamsmith · April 28, 2014

I believe you can choose the contents of the RIS on export (my only choice is "Citation, Abstract, and Indexing" though), so maybe not.

You'd think and hope so, but I only get that one option, too. My guess is that it's ProQuest's not-so-subtle way of undermining the competition (they make RefWorks). In RW export you do get a choice.

aurimas · April 28, 2014

That's too bad. I guess editing RIS is the only way to go (or, if you're into editing RIS files, then it would probably be easier to just edit the translator. You'd only have to change N1:"notes" to N1:"__ignore" on line 185)

adamsmith · April 28, 2014

(that should be "or, if you're into editing javascript files"...)

aurimas · April 28, 2014

(that should be "or, if you're into editing javascript files"...)

Well, I think editing that single line in the translator is far easier than coming up with a reliable regexp that will remove all (multiline) notes from a large file.

edit: FWIW, using the URL bar icon does fix this, but I assume you don't want to do that for systematic review.

Do you see an option in ProQuest to export all search results? I can't find it. If you can only do 1 page at a time anyway, then using the URL bar icon is not that bad.

I was thinking that if there is an option to export all, then we could offer that also via URL bar. We could probably do it for PubMed as well.

adamsmith · April 28, 2014

I think editing that single line in the translator is far easier than coming up with a reliable regexp

I wasn't disagreeing with you, but unless I'm misreading your post, it doesn't make sense as written: you're saying the alternative to editing RIS is editing RIS, so I assumed it was a typo.

Do you see an option in ProQuest to export all search results? I can't find it.

Nope, only the option to change to 100 results at a time. That should make this pretty quick via the URL bar icon, too.

adamsmith · April 28, 2014

So getting back to jenlnorvell:
Our best recommendation would be to use the URL bar icon for import instead of RIS export. You can disable the "attach PDF" option if you don't need the full text, which will speed things up further.

The 2nd best option would be to modify the RIS import translator as specified by aurimas above. Drastically reducing the number of notes in your library should speed things up a _ton_.

jenlnorvell · April 28, 2014

Ok. I'm a little confused about the options, but maybe this info will help you guys point me in the right direction.

ProQuest is really our only option (which is very unfortunate for a number of reasons) because it is our university's provider for several of the databases we are using. It took me ages to get the references out of ProQuest: I had to go page by page, selecting all records on a page before exporting, and pages frequently failed to load. I don't see another option for choosing the contents of the export other than "Citation, Abstract, and Indexing". I have all the RIS files saved, though, so maybe pre-processing the RIS files before importing would work if I start over?

Also, I don't think I need the notes and haven't used the citations in any documents.

It would be great to salvage this existing library if possible. We've already spent a good bit of time merging duplicates. Can you tell me more about the option to export to Zotero RDF?

jenlnorvell · April 28, 2014

Sorry, didn't see several of the recent comments...

adamsmith · April 28, 2014

So for future reference, since you have to go page-by-page anyway, instead of using export, click on the Zotero folder icon in the URL bar:
https://www.zotero.org/support/getting_stuff_into_your_library#web_translators

That won't help you now, though. About RDF: have you been syncing your computer? That would make this harder.
To prepare for using Zotero RDF, click on "Export Library" in the gears menu, select "Zotero RDF" as the format and make sure to unselect notes and, probably, files for the export. The export will take some time and you may get an unresponsive script message (if you do, click continue), so I suggest you let it run while you do something else. Once you have the export, I'd want to hear what aurimas thinks would be the best way to get it back in.

jenlnorvell · April 28, 2014

I haven't been syncing. Ok, I'll start the RDF export.

adamsmith · April 28, 2014

oh, also - are you running Standalone or the Firefox add-on?

jenlnorvell · April 28, 2014

Standalone

aurimas · April 28, 2014

@jenlnorvell, another question... do you need the tags/keywords that come with these items from ProQuest?

About RDF: have you been syncing your computer? That would make this harder.

The reason adamsmith is saying this is that deleting the library and then importing a new one would have to sync back to the server, which would take time (though there may be a way around this).

I haven't been syncing.

Are you sure? Can you check on zotero.org to make sure that it does not appear there?

The library you are working on is a group library, correct? Do you have other group libraries? Do you have anything in your personal library?

@Dan, does resetting to/from server work for group libraries?

dstillman · April 28, 2014

jenlnorvell: While it's true that you haven't synced recently, you do have 30K items in your personal library on the server. Let us know whether the library you're doing this in is a personal library or a group library. (The former would be a lot simpler.)

@Dan, does resetting to/from server work for group libraries?

Well, resetting from the server wipes out the whole local database. But resetting to the server only clears the remote personal library. (If you've made unsynced local group changes it can cause some conflicts with group libraries due to the way it works currently, though.)

adamsmith · April 28, 2014

Oh, if this is a group library you'll need to export it differently, sorry about that.

jenlnorvell · April 28, 2014

This isn't a group library. I turned off syncing quite a while ago because of the size of the library and removed my Zotero account info from the preferences so that it would not sync with my Zotero account.

aurimas · April 28, 2014

Well, resetting from the server wipes out the whole local database. But resetting to the server only clears the remote personal library. (If you've made unsynced local group changes it can cause some conflicts with group libraries due to the way it works currently, though.)

Grrr. That would be bad, but since you say jenlnorvell only has items in the local library, I'm a bit hopeful this can work out.

aurimas · April 28, 2014

@jenlnorvell, another question... do you need the tags/keywords that come with these items from ProQuest?

Could you answer this as well? We could do some pre-processing to delete tags if you don't need them, which would speed things up a lot.

jenlnorvell · April 28, 2014

Oh sorry! No, I don't need the tags or keywords.