Tips for speeding up duplicate merging?

aurimas · April 28, 2014

OK, here's my take on what you should do (let Dan/adamsmith OK this before you proceed).

1. Export your library as Zotero RDF without notes or file attachments.

2. Close Zotero.

3. Back up your current Zotero library: https://www.zotero.org/support/zotero_data#backing_up_your_zotero_library

4. Delete your existing Zotero library. Locate the library as per the instructions in the link above and delete the zotero.sqlite file along with the "storage" folder.

5. Modify the RDF translator to not import tags. Find the RDF.js file in the "translators" folder inside your Zotero data directory. Open the RDF.js file in a text editor. Find line 1156 (should say "newItem.complete()"). Right _before_ this line, add "newItem.tags = [];"

6. Restart/Open Zotero. Delete the single item that should be in the library and empty trash.

7. Fill in your sync details in Zotero preferences, but _uncheck_ sync automatically.

8. Reset your library _to_ server under Preferences -> Sync -> Reset -> "Restore to Zotero Server" -> Reset...

9. Import your RDF file via Gear menu -> Import...

10. You can now enable automatic sync. It will help to sync periodically, instead of syncing a large number of changes at once.

11. Reset your translators via Preferences -> Advanced -> Files and Folders -> Reset Translators...

12. Restart Zotero.

(Edited to speed up resetting to server)
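
For anyone who would rather script the edit in step 5 than do it by hand, here is a minimal Python sketch. It is only an illustration, not part of Zotero: the helper names are made up, and the line number 1156 comes from the instructions above and may differ in other versions of the translator. The sketch refuses to change anything unless the target line really contains "newItem.complete()".

```python
from typing import List


def insert_before_line(lines: List[str], line_no: int, expected: str, new_line: str) -> List[str]:
    """Return a copy of `lines` with `new_line` inserted before 1-based `line_no`.

    Raises ValueError if that line does not contain `expected`, so the edit
    is refused when the file differs from what step 5 assumes.
    """
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} is out of range")
    target = lines[line_no - 1]
    if expected not in target:
        raise ValueError(f"line {line_no} does not contain {expected!r}")
    # Reuse the target line's leading whitespace so the new statement lines up.
    indent = target[: len(target) - len(target.lstrip())]
    return lines[: line_no - 1] + [indent + new_line] + lines[line_no - 1:]


def patch_translator_text(text: str, line_no: int = 1156) -> str:
    """Insert 'newItem.tags = [];' right before the newItem.complete() line.

    `line_no` defaults to the line quoted in step 5 (an assumption about the
    translator version in use).
    """
    patched = insert_before_line(
        text.splitlines(), line_no, "newItem.complete()", "newItem.tags = [];"
    )
    return "\n".join(patched) + "\n"
```

To use it, read RDF.js from the "translators" folder into a string, pass it to patch_translator_text, and write the result back while Zotero is closed. Resetting translators in step 11 undoes the change either way.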

adamsmith · April 28, 2014

yes, that's exactly what I'd have advised.
Steps 7 and 8 are not technically necessary if you're not interested in syncing at this time. Even with the cleaned-up library, the sync may be too big to go through smoothly, so if the reset fails, just remove your username and password and keep working. Just remember that if you ever want to sync again in the future, the first thing you need to do is a Reset --> Restore to Server. Large syncs will work better in future Zotero versions.

aurimas · April 28, 2014

Good point. Edited instructions to perform reset with an empty library. No need to worry about resetting it in the future if you don't want to sync files at this point.

jenlnorvell · April 28, 2014

Thanks for the step-by-step instructions!

However, I'm having trouble getting past step 1. When I try to Export my library as Zotero RDF, I get a pop-up message that says "An error occurred while trying to export the selected file." It happens pretty quickly and a file is created in the specified location but there is nothing there. Any ideas?

aurimas · April 28, 2014

Reset your translators via Preferences -> Advanced -> Files and Folders -> Reset Translators... Then restart Zotero. Go to Preferences -> General -> click Update Now.

Try exporting again. If it doesn't work, submit a Report ID https://www.zotero.org/support/reporting_bugs

adamsmith · April 28, 2014

actually, first just try right-click in "My Library" on the left and Export Library from there. Sometimes that makes a difference.

jenlnorvell · April 28, 2014

I reset the translators as described and also tried right clicking My Library to export. Still not exporting the Zotero RDF file.

ReportID: 1989798613

adamsmith · April 28, 2014

@Dan - what's the error?

dstillman · April 28, 2014

out of memory

dstillman · April 28, 2014

How many total items do you have left?

aurimas · April 28, 2014

Also, are the items sorted into collections? You can export them one collection at a time (top level collections should be ok depending on how many items you have in each). You'll just have multiple RDF files, but you should be able to get your collection info back on re-import.

adamsmith · April 28, 2014

I don't think it'll be much fewer than above - the whole point of the export/import operation was to reduce the number.
How many collections do you have? Would it be feasible to export them one-by-one (top level collections that is)?

dstillman · April 28, 2014

I don't think it'll be much fewer than above - the whole point of the export/import operation was to reduce the number.

Oh, right. Well, I suspect a whole lot of operations in Zotero will fail with a 700K-item library — that's approximately an order of magnitude larger than the largest library we've ever seen. I'm actually shocked that it's at all usable.

jenlnorvell · April 28, 2014

Sounds like I should be winning some kind of award for getting this library to semi-function.

Yes, I've got 29 collections. I'll try exporting a single collection.

aurimas · April 28, 2014

Yes, I've got 29 collections. I'll try exporting a single collection.

If you succeed in exporting all collections, don't forget to also export "Unfiled Items" (special collection just above Trash). Select it (hopefully that doesn't take forever), then select one item in the center pane, press Ctrl+A, right-click, Export.

dstillman · April 28, 2014

Also, just to clarify (maybe I missed this above), items don't exist in more than one collection, correct? (If they do they would be duplicated.)

aurimas · April 28, 2014

(Even if you didn't copy items to other collections yourself, they might still exist in multiple collections if some of the items have already been merged. You could just re-merge them afterwards, since that's what you will be doing anyway)

jenlnorvell · April 29, 2014

Each collection is from a different database, with the larger databases split into different years. I believe that the only duplicates between collections are there because of the duplicates that we have already merged.

Not sure that I am even going to be able to export by collections as they are now. I was successful on one of the small collections from ERIC (also hosted by ProQuest), but just got the same error on one of the smaller PSYCInfo collections (1076 parent items, 21180 total items). Report ID for that error is 75680549; I think it is also an out of memory error.

Maybe I can split the larger collections up and export that way. Is the "out of memory" error connected to the size of the whole library or just to the collection you are exporting? If it is still being affected by the gigantic library, maybe I can delete collections as I export them (I do have several backups of this library)?

Thanks for your continued help with this.

aurimas · April 29, 2014

Is the "out of memory" error connected to the size of the whole library or just to the collection you are exporting?

Should be affected by the current collection only, though Dan would be able to say for sure. At least that's the only way you would be able to export the smaller collection.

Maybe I can split the larger collections up and export that way.

Yes, you can try that. If you're thinking of splitting these up by sub-collections (instead of a sub-set of individual items within the top-level collection), keep in mind that you may (or may not) have items that are just within the top collection and not in the subcollections, so make sure to export those as well. Not sure if you have "recursive collections" hidden option turned on, but it would probably help in this case.

Also, keep in mind that if you "Remove collection" instead of "Delete collection and Items...", the items will end up at the root of your library. So if you want to be deleting collections as you go, you may want to export the "unfiled items" pseudo-collection first.

jenlnorvell · April 30, 2014

Thanks for all of your help!

Just to give a quick update: after exporting around 80 RDF files (I had to split up my collections into a few different files), and re-importing them as instructed above, the library is functioning so much better. Duplicates are taking about 5 seconds to merge now.

This might be something that you all should consider adding or making more clear for those of us unfortunate enough to be restricted to ProQuest for many of our main databases. The RIS files seemed like a good option for us because we could re-import them later if needed and it seemed quicker than using the folder icon at the time. The library was gradually getting slower as I imported these files, but I thought that was just typical of larger libraries. Obviously knowing what I know now, I should have inquired sooner to figure out what was going on. But it would be helpful to save others the pain of this in the future, especially those working with large libraries.

Thanks!

adamsmith · April 30, 2014

we'd like to let people know about this - any thoughts on how? E.g. was there any place in the documentation that you looked before using RIS import?

dstillman · April 30, 2014

How many total items do you have now, out of curiosity?

jenlnorvell · May 1, 2014

Actually, I think I spoke too soon.

I had imported about a third of my original library (the first three years of my search) so that I could go ahead and work on the duplicates for those years. We are exporting to Filemaker for abstract screening, so my main goal was just to get the first couple years of my search into Filemaker first. That all worked fine.

Now that I am trying to import the rest of my RDF files, I am getting an error message on several of the RDF files- saying the file is not in a supported format. The first set I imported worked fine (and others in this new batch) so I'm not sure what happened. Should I send one of these files or is there something I should look for?

As for where to post information about this issue, I remember reading through the known translator issues for ProQuest and being concerned that if I tried to pull down a large number of files it might stop working. This was part of the reason I chose to go with RIS. So I know that it's not a translator issue, but you might include a note there to caution people against using RIS with ProQuest also.

It also might be helpful to have a general section about how to troubleshoot a slow library- things to check for, when to know when something is wrong rather than just a product of a large library. This might be a good place for the information about tags and how they can slow things down as well.

aurimas · May 1, 2014

Should I send one of these files or is there something I should look for?

You can send it to support@zotero.org with a link to this thread. Dan will forward it to appropriate people from there.

If you don't care about the file being public, you can paste its contents (open in a text editor to copy) to https://gist.github.com/ or you can put it in your dropbox (or the like) and share the link here

jenlnorvell · May 1, 2014

Ok, just sent the email.

jenlnorvell · May 1, 2014

Also, here's the link: https://gist.github.com/anonymous/284709f701c5c94681e7

aurimas · May 1, 2014

Open your file locally and check the end of it. It looks like it is truncated. Might be a bug in Zotero export.
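
A quick way to check for truncation locally, sketched in Python with only the standard library. The function name is made up for illustration. It only tests whether the file is well-formed XML (a truncated export would not be) and reports where the parser gave up. Passing this check does not prove that Zotero will import the file.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def first_xml_error(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (line, column) of the first XML parse error, or None if well-formed."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple reported by the parser.
        return err.position
    return None
```

To use it, read the exported RDF file in binary mode (open(path, "rb").read()) and pass the bytes in. A result of None means the XML at least parses from start to end.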

adamsmith · May 1, 2014

@aurimas - no, it's not truncated. Open the raw link or download the file. This is only a github display issue.

There is a syntax error in the abstract on l. 612 of the file, but removing that doesn't seem to help, either. I'm not quite sure what's going on, but I can reproduce the issue.

jenlnorvell · May 1, 2014

Don't know if it would be helpful but I can post one or two of the others that gave the same messages.

adamsmith · May 1, 2014

I think I know what's going on - it's a bit technical (and you're doing everything right), I'll discuss with aurimas per e-mail and get back to you.