Problem importing a large dataset (500MB) - Report # 1783265735
We are building a large bibliographic dataset of several tens of thousands of scholarly publications in our field. We have merged and deduplicated exports from Dimensions, Google Scholar, The Lens, OpenAlex, Scopus and Web of Science (deduplication is based on a synthetic key, retaining the most promising value for each field of a record from across the sources). If I try to import the full dataset, the system becomes unresponsive and stays that way for hours (I've let it run for 11 hours), which also means I cannot submit a debug output log. I did take a few screenshots, though - see here and here. When I try to import a small subset (like here), the import does work, but even then the progress bar does not advance during 'Importing' until completion.
Does anybody have any ideas on how I could still get the entire dataset into Zotero, or how to debug this better? I do appreciate that Zotero's handling of large files has been improving over the years, but it would be nice if this could work better.
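For context, the merge/dedup step looks roughly like this - a toy sketch only, where the file names, column names, and the "best value" rule are made up for illustration and are not our actual pipeline:

import pandas as pd

# Toy sketch of the merge/dedup step described above. File names, column names,
# and the "best value" rule are purely illustrative.
frames = [pd.read_csv(p) for p in ["dimensions.csv", "openalex.csv", "scopus.csv"]]
merged = pd.concat(frames, ignore_index=True)

# Synthetic key: normalised DOI where available, else lowercased title plus year.
merged["key"] = (
    merged["doi"].str.lower().str.strip()
    .fillna(merged["title"].str.lower().str.strip() + "|" + merged["year"].astype(str))
)

def best(values: pd.Series):
    """Keep the longest non-empty value for each field across duplicate records."""
    values = values.dropna().astype(str)
    return max(values, key=len) if not values.empty else None

deduped = merged.groupby("key", as_index=False).agg(best)
deduped.to_csv("merged_deduped.csv", index=False)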
Each record in the merged file uses the WoS tagged format, like this:
PT J
PY 1998
UT 031-149-490-326-564
DI 10.1080/10758216.1998.11655776
TI Who’s minding the state?the failure of Russian security policy
SO Problems Of Post-Communism
VL 45
AU Blank S
TC 0.0
DT JOURNAL ARTICLE
AF Stephen Blank
JI PROBL. POST-COMMUNISM
J9 PROBL POST-COMMUNISM
IS 2
SC Security Policy; Political Science; Law And Economics; Law; State (Polity)
ER
Now, the fact that a smaller sample of the full dataset imports properly suggests to me that the problem probably doesn't lie with the format. And just FYI: we now work mostly directly in Python, running NLP on the abstract and full-text fields of our merged/deduplicated/filtered datasets, but we still like using Zotero for our footnotes and bibliographies. That's why we still produce these Ersatz RIS/WoS files, and it would be great if we could import them properly.
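For illustration, a record like the one above can be written from a Python dict along these lines - a simplified sketch, with the tag list trimmed to the sample, a made-up output file name, and not our actual exporter:

# Minimal sketch of emitting a WoS-style tagged record from a Python dict.
TAGS = ["PT", "PY", "UT", "DI", "TI", "SO", "VL", "AU", "TC", "DT", "AF", "JI", "J9", "IS", "SC"]

def write_record(rec, out):
    for tag in TAGS:
        value = rec.get(tag)
        if value not in (None, ""):
            out.write(f"{tag} {value}\n")
    out.write("ER\n\n")  # record terminator, blank line between records

# Illustrative record; in practice this comes from the merged/deduped dataset.
records = [{"PT": "J", "PY": 1998, "TI": "Who's minding the state? The failure of Russian security policy",
            "SO": "Problems Of Post-Communism", "AU": "Blank S", "DT": "JOURNAL ARTICLE"}]

with open("merged_subset.txt", "w", encoding="utf-8") as out:
    for rec in records:
        write_record(rec, out)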
BTW - I have a LOT of RAM (96GB) in my system, but Zotero only uses about 2GB of it and 10% of my CPU. Does 2GB sound right for importing a 500MB dataset? Is there a parameter that can be set to allow Zotero to use more resources (if that's the stumbling block)? Is there anything else I can do to debug this - or, better yet, to make it work?
But thanks again, Sebastian!
The format is straight WoS (Zotero wouldn't understand hybrid formats). Someone will need to look at the error report to see whether you just ran out of memory or whether there's an actual error -- it's certainly possible that you run into an issue somewhere around the 50,000th item; these tabbed formats are fragile.
Beyond that -- is there any reason you can't just import in chunks? Also, I did a back-of-the-envelope calculation: if this item (~380 bytes) is at all representative, a 500MB file would contain well over a million entries -- beyond anything Zotero has been tested for/with.
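Something along these lines would split the file at the ER record terminators so you can import it in pieces - a rough sketch only, with placeholder file names and an arbitrary 50k chunk size:

# Rough sketch of splitting the big tagged file at the ER terminators so it can be
# imported in pieces. File names and the 50k chunk size are just placeholders.
CHUNK_SIZE = 50_000

def flush(lines, part):
    with open(f"chunk_{part:03d}.txt", "w", encoding="utf-8") as dst:
        dst.writelines(lines)

def split_wos(path, chunk_size=CHUNK_SIZE):
    buffer, records, part = [], 0, 0
    with open(path, encoding="utf-8") as src:
        for line in src:
            buffer.append(line)
            if line.strip() == "ER":
                records += 1
                if records == chunk_size:
                    part += 1
                    flush(buffer, part)
                    buffer, records = [], 0
    if buffer:  # leftover records and any trailing lines
        flush(buffer, part + 1)

split_wos("merged_full.txt")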
You can look at the error report (and/or the debug output) yourself to get a sense of where things fail. I don't know how much the WoS import gets used, or for what types of materials, so it's certainly possible the error is on the Zotero side. It won't point you to a specific entry that's causing the failure, but depending on the error it should give you a decent clue of what's going on.
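If you want to rule out a malformed entry yourself first, a quick pass like this would flag records that are missing basic tags - again just a sketch, and the "required" tag set here is my guess rather than anything Zotero actually enforces:

# Quick sanity check: walk the file record by record and flag entries missing tags
# the importer probably needs. REQUIRED is a guess, not Zotero's actual rule.
REQUIRED = {"PT", "TI", "AU"}

def check(path):
    tags, start = set(), 1
    with open(path, encoding="utf-8") as src:
        for lineno, line in enumerate(src, 1):
            line = line.rstrip("\n")
            if line.strip() == "ER":
                missing = REQUIRED - tags
                if missing:
                    print(f"record starting at line {start} is missing {sorted(missing)}")
                tags, start = set(), lineno + 1
            elif len(line) > 2 and line[:2].isalnum():
                tags.add(line[:2])

check("merged_full.txt")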
You could also turn your JSON into CSL JSON? I don't think we have a validator for that, but it would be a nice, clean way of importing -- not sure whether it's more efficient, though.
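Roughly like this - a sketch only, where the input field names are whatever your JSON happens to use and only a handful of CSL variables are mapped, not the full schema:

import json

# Rough sketch of mapping one internal record to a CSL JSON item. Input field
# names are illustrative; the CSL side covers only a few common variables.
def to_csl(rec):
    item = {
        "type": "article-journal",
        "id": rec.get("doi") or rec.get("title"),
        "title": rec.get("title"),
        "container-title": rec.get("journal"),
        "volume": rec.get("volume"),
        "issue": rec.get("issue"),
        "DOI": rec.get("doi"),
        "issued": {"date-parts": [[int(rec["year"])]]} if rec.get("year") else None,
        "author": [{"family": a.get("family"), "given": a.get("given")}
                   for a in rec.get("authors", [])],
    }
    return {k: v for k, v in item.items() if v}

records = [{"title": "Who's minding the state? The failure of Russian security policy",
            "journal": "Problems of Post-Communism", "year": "1998",
            "authors": [{"family": "Blank", "given": "Stephen"}]}]

with open("items.csl.json", "w", encoding="utf-8") as out:
    json.dump([to_csl(r) for r in records], out, indent=2)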