Problem importing a large dataset (500MB) - Report # 1783265735

edited September 18, 2023

We are building a large bibliographical dataset of many tens of thousands of scholarly publications in our field. We have merged and deduplicated (based on a synthetic key, retaining the most promising field for each record from across the dataset) exports from Dimensions, Google Scholar, The Lens, OpenAlex, Scopus, and Web of Science. If we try to import the full dataset, the system becomes unresponsive and stays that way for hours (I've let it run for 11 hours), which also means I cannot submit a debug output log. I did make a few screenshots, though - see here and here. When I try to import a small subset (like here), it does work. But even in that case, the progress bar does not advance during 'Importing' until completion.

Does anybody have any ideas on how I could still get the entire dataset into Zotero, or how to debug this better? I do appreciate that Zotero's handling of large files has improved over the years, but it would be nice if this worked better.

  • UPD - I just got an error during another attempt that I was able to submit - report ID 1783265735. I would appreciate it if somebody could look into this.
  • You should try on Z7 if you aren't already. Also, what format are you importing?
  • I tried (and failed) in both. And the format is one that emulates RIS/WoS.
    Like this:
    PT J
    PY 1998
    UT 031-149-490-326-564
    DI 10.1080/10758216.1998.11655776
    TI Who’s minding the state? The failure of Russian security policy
    SO Problems Of Post-Communism
    VL 45
    AU Blank S
    TC 0.0
    AF Stephen Blank
    IS 2
    SC Security Policy; Political Science; Law And Economics; Law; State (Polity)

    Now, the fact that the smaller sample of the full dataset does import properly suggests to me that the problem probably doesn't lie there. And just FYI - we now work mostly directly in Python, doing NLP on the abstract AND full-text fields of our merged/deduped/filtered datasets. But we still like using Zotero for our footnotes and bibliographies. That's why we still make these Ersatz-RIS/WoS files - and it would be great if we could import them properly.
    BTW - I have a LOT of RAM (96GB) in my system, yet Zotero only uses 2GB of it, and 10% of my CPU. Does 2GB sound right for importing a 500MB dataset? Is there a parameter that can be set to allow Zotero to use more resources (if that's the stumbling block)? Is there anything else I can do to debug this? Or better yet - to make it work?
    But thanks again Sebastian!
  • Zotero 6 on Windows is a 32-bit app, so it's very limited in the RAM it can use, that's why I asked -- you should definitely try on Zotero 7 -- there's very little chance this can work in Z6 at all.

    The format is straight WoS (Zotero wouldn't understand hybrid formats). Someone will need to look at the error report to see if you just ran out of memory or if there's an actual error -- it's certainly possible that you run into an issue somewhere around the 50,000th item -- these tabbed formats are fragile.

    Beyond that -- any reason you can't just import in chunks? Also, I did a back-of-the-envelope calculation: if this item (~380 bytes) is somewhat representative, a 500MB file would have more than 1M entries -- beyond anything Zotero has been tested for/with.
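    Importing in chunks could be scripted easily enough. A minimal sketch in Python, assuming the tagged records are separated by blank lines (as in the sample above); the chunk size and filenames are illustrative:

    ```python
    # Sketch: split a WoS-style tagged export into smaller files for
    # separate imports. Assumes records are separated by blank lines;
    # chunk size and filename prefix are illustrative, not Zotero limits.
    from pathlib import Path


    def split_records(text):
        """Split the export into individual records on blank lines."""
        return [r.strip() for r in text.split("\n\n") if r.strip()]


    def write_chunks(records, chunk_size=10_000, prefix="chunk"):
        """Write chunk_size records per file, blank-line separated."""
        paths = []
        for i in range(0, len(records), chunk_size):
            path = Path(f"{prefix}_{i // chunk_size:04d}.txt")
            path.write_text("\n\n".join(records[i:i + chunk_size]) + "\n",
                            encoding="utf-8")
            paths.append(path)
        return paths
    ```

    Each resulting file can then be imported on its own, which also narrows down which chunk (if any) contains a record the importer chokes on.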
  • edited September 17, 2023
    I am using it in Z7... But yeah, I'll still try a few tricks like chunking it. But is there really no debugging tool that we can run a file through to identify which records it might choke on - so that we can then fix those? Also - we DO have this in JSONL format as well. Is there a way to 'massage' that (e.g. just retain the 'WoS'-field-tag-based fields) and then import THAT into Z7?
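    Absent an official validator, a rough home-grown sanity check could at least flag records with malformed tag lines. A sketch, assuming a "two-character tag, space, value" line shape like the sample above, with indented continuation lines allowed; this is a heuristic, not an official WoS grammar:

    ```python
    # Sketch: heuristic sanity check for a WoS-style tagged export.
    # Flags lines that don't look like "XX value" (tag, space, content).
    # Indented continuation lines are treated as allowed. This is a
    # home-grown heuristic, not an official validator.
    import re

    TAG_RE = re.compile(r"^[A-Z][A-Z0-9] .+$")


    def suspicious_records(text):
        """Return (record_index, bad_line) pairs for odd-looking lines."""
        problems = []
        for idx, record in enumerate(text.split("\n\n")):
            for line in record.strip().splitlines():
                if line.startswith("   "):  # continuation line, allowed
                    continue
                if line and not TAG_RE.match(line):
                    problems.append((idx, line))
        return problems
    ```

    Records it flags could then be fixed or dropped before the import, rather than letting a 500MB run fail partway through.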
  • I don't believe there's anything like a validator for those formats. That's one of the reasons those formats are so fragile.
    You can look at the error report (and/or debug output) yourself to get a sense of where things fail (I don't know how much the WoS import is used, or for what types of materials, so it's certainly possible the error is on the Zotero side). It won't tell you if a specific entry is the culprit, but depending on the error it should give you a decent clue of what's going on.

    You could turn the JSON into CSL JSON? I don't think we have a validator for that, but that'd be a nice clean way of importing -- not sure if it's more efficient, though.
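    The JSONL-to-CSL-JSON route could be sketched roughly as below. The source key names (TI, PY, DI, AF, ...) are an assumption here, mirroring the WoS tags in the sample above; they would need adjusting to the real keys in the JSONL:

    ```python
    # Sketch: convert JSONL (one record per line) into a CSL JSON array
    # that Zotero can import. Source keys (TI, PY, DI, AF, ...) are
    # assumed to mirror the WoS tags in the sample; adjust as needed.
    import json


    def to_csl(rec, idx):
        """Map one source record onto a CSL JSON item."""
        item = {
            "id": rec.get("UT") or f"item-{idx}",
            "type": "article-journal",
            "title": rec.get("TI", ""),
            "container-title": rec.get("SO", ""),
            "volume": rec.get("VL", ""),
            "issue": rec.get("IS", ""),
            "DOI": rec.get("DI", ""),
        }
        if rec.get("PY"):
            item["issued"] = {"date-parts": [[int(rec["PY"])]]}
        # AF holds full author names ("Stephen Blank" in the sample),
        # assumed here to be "; "-separated when there are several.
        authors = []
        for name in rec.get("AF", "").split("; "):
            if not name:
                continue
            parts = name.rsplit(" ", 1)
            if len(parts) == 2:
                authors.append({"given": parts[0], "family": parts[1]})
            else:
                authors.append({"literal": name})
        if authors:
            item["author"] = authors
        return item


    def jsonl_to_csl(lines):
        """Parse JSONL lines into a CSL JSON list."""
        return [to_csl(json.loads(line), i)
                for i, line in enumerate(lines) if line.strip()]
    ```

    Dumping the result with `json.dump(items, f)` gives a single CSL JSON file to try importing; whether it is faster than the tagged format in Z7 would have to be tested.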