Can't export large library/collection to RDF

I have looked for a couple of hours on the forums and the web and not found help ... if I missed a thread, I beg pardon and would be delighted to be redirected.

I have a VERY large project I'm working on - the library has 14,000 titles (with abstracts). I have NO files attached - just the reference data, a few custom tags, and perhaps 50 notes. The Zotero SQLite file in my FF profile is 113MB. I have the titles sorted into a couple of collections, one with about 10,200 titles, the other with 3,800+ titles.

I tried to export the library as RDF several times, but it hangs each time - the progress bar stops moving, FF gets stuck, and often the whole system locks up. I've let it run for hours and hours (i.e. overnight) and tried both my desktop and laptop (both FF add-on, both Linux Mint 15, one Cinnamon, one MATE) - the progress bar sticks at the same place (about 5% done on the whole library, about 25% on the collection of 3,800). I tried exporting the 10,200-article collection AND the 3,800-article collection as RDF and they both do exactly the same thing - hang. I then tried exporting a small collection (10 titles) and it did fine - just a few seconds. I viewed the RDF for this small collection and it looks like intact XML, so I assume it's okay.

Is there a limit to how large an RDF export can be? I cannot find info on the "upper limits." Would Standalone work better? I also have access to a Windows machine ... it would make me cringe to use that over Linux, but if I must, I'll hold my breath and try ;)

What I am doing is this: The project's subject area is medical and very poorly indexed in MeSH, so searches either miss articles or pull static. I don't want to miss articles, hence I started with a set of 14,000. I've "reviewed" the titles and 10,200 of them are not germane ... leaving me about 3,800 abstracts to review. I'll eventually pull in probably 150 - 400 articles. I don't want to "purge" the 10,200 titles and lose them forever, but I would like to get them out of my library for the moment so it's not so "bloated." I see several permutations of saving all/part of my library and deleting other parts as options to accomplish this. Export to RDF seemed like a good way to go ... if it would work? If I'm pushing Zotero too hard to make RDF out of 10-14k titles, okay, but I thought 3,800 was reasonable? If in fact my sets are too large, would RIS work? RIS seems not to be liked on the forums because not all data is represented? I do need both automatic and custom tags to persist, but attached PDFs are irrelevant at this point.

I'm on Zotero 4.0.12 and FF 23.0; I've turned off screensavers and power-saving/suspend, and nothing else is running while I try exports. I moved from Mendeley ~2-3 months ago, so I'm a Zotero newbie but trying to learn fast.

Thank you,
jdc
  • There is no theoretical limit on the size of an RDF export, but there may be practical limits. 4k doesn't seem that big, though - I just exported 2.5k items in less than 10 seconds.
    My guess is this hangs on one particular entry. One thing to try would be to not export the collection, but rather to select all items in the collection in the middle panel and export those - there have been a fair number of reports in the past where that has helped.
    If it doesn't, the next step would be to see whether this is because of one particular item. Try exporting about half of the collection and then the other half. If one of them exports, keep narrowing it down by splitting further into halves (see the sketch at the end of this reply).

    If anything, Windows would do worse here; its memory management is notoriously bad.
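
    A minimal sketch of that halving procedure in Python, assuming a hypothetical export_succeeds(items) check - in practice each "test" is just a manual export of the selected items in the middle pane:

        # Bisection over a list of items to isolate one that makes the export hang.
        # export_succeeds() is hypothetical: it stands in for "run the RDF export
        # on this subset and see whether it completes".
        def find_bad_item(items, export_succeeds):
            while len(items) > 1:
                half = len(items) // 2
                first, second = items[:half], items[half:]
                if not export_succeeds(first):
                    items = first        # the problem item is in the first half
                elif not export_succeeds(second):
                    items = second       # the problem item is in the second half
                else:
                    return None          # both halves export fine on their own
            return items[0]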
  • Thanks for the suggestions and thoughts. I have been preoccupied with email problems at the University and other projects.

    I've tried smaller groups (1,000-2,000) and different batches of refs out of several different collections in case there was one ref the process was getting hung up on. RDF export doesn't work with more than, say, 500-800. RIS and BibTeX can actually do all 14,000, though very, very slowly. I used 3 machines: Linux Mint 14 MATE 32-bit, Mint 15 Cinnamon 64-bit, Mint 15 MATE 64-bit. I also tried an old Win XP box - same. All are on FF 23, but I put Zotero Standalone on one Linux box and the Windoze box - same thing, except anything on Windoze, and Standalone on Linux, crashes outright, while Linux+FF stalls/hangs. Actually the system hangs so badly I can't use that instance of X, or even do Ctrl+Alt+Fn to switch consoles and kill processes to get X freed up or shut down. It'll be hung that way for hours and hours ... overnight is all I am patient enough to stand - after a couple of 8+ hour goes, I hit the power button.

    I was curious, so I opened up the system monitor while I tried. On **all** systems, RAM use slowly climbs to 90+%, at which point things fail, then memory drops back to 10-15%, which is where it was at baseline. Two machines have 4GB of RAM, the older one has 3GB. I know these files are large, but my whole Zotero SQLite DB is 113MB ... so even with 400-500MB of system overhead I'm a little confused how 113MB of Zotero data swells to well over 3,000MB. Again, I have no PDFs - it's all citations, but each has an abstract, so that adds up. It acts almost like there is a memory leak? It takes way more refs than most of my colleagues have in their Zotero libraries to hit what seems to be the critical 90+% of system RAM and hang, so no one else I talk to has seen this happen.

    Unless someone has thoughts, I'm thinking I've just jumped high enough that I've hit my head on the theoretical ceiling and found that, yes, it does exist?

    As I have no attached PDFs but want to preserve a few notes and some tags, I guess BibTeX would be my next best export mode? My goal, again, is to offload 10,000+ of these refs so Zotero is only playing with 4k, BUT still be able to reload/move the whole set of refs later if I need to re-look at something. Again, being new to Zotero, I'm not sure what plan B should be.

    Thanks again for your quick reply and the help; sorry it took a couple days to get back ... lots going on.
  • As I said - something isn't working right for you. 500-800 is certainly not even close to a theoretical limit; I can export 2,500 items as RDF in 10 seconds. Maybe Simon or Aurimas have thoughts on what's going on.

    Do you have mainly journal article items, with maybe a couple of books & chapters? Then RIS or BibTeX will both work just fine (a sample record is sketched at the end of this reply). If anything, I'd say RIS is more reliable because it's more clearly defined as a standard.
    The main advantages of Zotero RDF come to bear when you're looking at more unusual item types or when you want to keep collection hierarchies.
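
    For reference, a minimal sketch of what a journal-article record with tags and a note could look like in RIS (field tags per the RIS format; the exact mapping Zotero uses on export may differ):

        TY  - JOUR
        TI  - Example article title
        AU  - Doe, Jane
        AB  - Abstract text goes here.
        KW  - custom tag one
        KW  - automatic tag two
        N1  - A short note attached to the item.
        ER  -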
  • Thanks for your thoughts, Adam; I agree, something is definitely not right. Any way to get input from the guys you mentioned?

    I've been playing with it and unless there is something I'm missing, it looks to me like a memory leak.

    Even 2,500 items to RDF will run for minutes with system RAM slowly climbing to around 90%, then Zotero freezes and RAM drops to the level of system idle. 15-20+ minutes later, I frequently have to kill FF via the system monitor or a terminal command. RIS export does the same thing, but can export more refs before it jams up, I guess because it's more compact?

    I'm puzzled why 3 different versions of Linux, plus Windows, on 3 physically distinct machines all act the same with both Standalone and the FF plugin. Makes me think it's the ref set or the app. I've checked DB integrity and it's okay. I've tried multiple distinct subsets of refs in case one is jamming it - no dice. I've got just a feeeeew hours in the set, as I've reviewed and tagged all 14,000 titles ;) so just re-downloading from PubMed is a hard pill to swallow ... not to mention downloading 14,000 articles 200 at a time.

    Interestingly, Mendeley can suck up the sqlite file and spit out a BibTeX or RIS in the sort of timeframe Adam described for Zotero; it did all 14,000 in less time than Zotero spends just thinking about ~1,000 for me.

    I like the flow of using Zotero better than other ref managers, so if I can get this figured out I'd like to use it for this project.

    Anybody else? Any other info I can add? I haven't looked for an error log, as I'm not sure it would do *me* any good, but if someone wants that I can look for it...

    I guess if there's no one else who can weigh in or throw suggestions my way, I'll thank Adam for all the input, and I may just have to go back to my prior ref manager for now.

    jdc
  • Can you provide a Debug ID for an export that takes longer than it should? (It doesn't need to take forever—just long enough to demonstrate the problem.)
  • Thanks Dan. I got the instructions for turning on debug and will do it right now. It takes a few minutes for the system to freeze/crash, but it is predictable - any RDF export over a few hundred will do it to me.

    I'll be back in a bit when it's crashed and submitted to Zotero server.

    Thanks,

    justin
  • Actually ... it may be a bit ... I'm playing with RIS and think I may have managed to get an RIS export of a bit over 10,000 refs. I want to see if I can pull them all back in - turns out it's still running/importing. I have a ~3,800-ref set I'd like to export as RDF but it fails - when the RIS import is done, I'll turn on debug, do an RDF export, and upload it to the server.

    thanks,

    jdc
  • Dan -

    I'm having trouble getting logging to cooperate.

    I enabled logging, did an export, and it hung as usual. I closed the Zotero tab but it would not restart Zotero - it said there was an ongoing Zotero process and I had to wait, but it never came back up. To get Zotero going again, I closed FF and restarted, but that seemed to flush the log, as it said 0 lines? I *rebooted* and repeated, but left the preferences window open while I did the export, and I see between 30k and 40k lines logged (I have tried a few times). However, after it hangs and I click Submit, the progress indicator goes back and forth but never seems to end. I let it sit almost an hour last night doing this before I shut it down. I've tried the "View" log button after it crashes but before submitting - I get a new FF window but it's blank.

    First, am I missing something whereby I'm screwing it up?

    I assume <40,000 lines of log should upload via cable internet in at most a couple of minutes, so giving it 40-60 minutes was enough?

    Second, does the log persist somewhere I could go digging in directories and open it in a text or hex editor?

    Third, any problem with me logging and "submitting" without trying to hang it?

    (Currently, based on the link you sent, I've set the flag for debug output to true in about:config, started FF from the CLI, and am running an export. RAM use is escalating like it usually does, but much more slowly - probably from the overhead of outputting to the terminal. I'll see if I catch anything that way, but it looks like such a massive output that my buffer will only hold a fraction.)

    Thanks
  • You can either click View Output and save the text, or generate real-time debug output (except on Windows, where it's too slow), and then zip the text and email it to support@zotero.org with a link to this thread. If you go the real-time route, you should enable extensions.zotero.debug.time in addition to the other setting mentioned on that page.
  • You can probably adjust the terminal's scrollback buffer. But again, we just need enough to demonstrate the problem.
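
    If the terminal's scrollback buffer is the limiting factor, one workaround (just a sketch, not an official Zotero procedure; it assumes firefox is on your PATH) is to launch Firefox so that everything it prints is also captured to a file:

        # Capture all of Firefox's terminal output (which includes the real-time
        # debug output) in a file, so the scrollback limit doesn't matter.
        import subprocess
        with open("zotero-debug.log", "w") as log:
            subprocess.run(["firefox"], stdout=log, stderr=subprocess.STDOUT)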
  • Just sent the files to support@zotero.org. The first attempt had a delivery failure - maybe too big (570k)? The attachment was a zip of 2 txt files and 2 screengrabs showing the RAM escalation. I put it on my Google Drive, set permissions to "anyone with the link", and sent you the link.

    thanks for help,

    jdc
  • I had a similar problem. It was a single article with a non-ASCII character somewhere.
    I don't remember how I fixed it; maybe it wrote part of the RDF and I searched from the last working article, or I did a bisection.
    For 14,000 articles that would still be only 14 tests.
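    (That count checks out: 2^13 = 8,192 < 14,000 ≤ 16,384 = 2^14, so ⌈log₂ 14,000⌉ = 14 halvings are enough to isolate a single item.)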