Importing another database into Zotero

sdspieg · October 16, 2013

I'm not quite sure where to post this, so let me try it here. For a research project, we are currently crawling various official government websites into a database format containing the following fields: the URL, the title, the source (e.g. Brazilian Ministry of Foreign Affairs), the raw text, and the date of publication. Could anybody please advise whether (and if so, how) we could get that data into Zotero? This would be great for our bibliography, as well as for the text mining we want to do with these sites (through Papermachines). So: which database format should we then use, how should we name the fields for them to be compatible with Zotero, how do we import them, etc. Thanks much!

noksagt · October 16, 2013

You will need to transform that data into a standard bibliographic format that Zotero can import.

This page offers reasonable advice regarding file formats in the unAPI section:
https://www.zotero.org/support/dev/exposing_metadata

adamsmith · October 16, 2013

So the way I understand you is that you don't want to publish the database (if you do, see noksagt's advice) but rather just import it into your personal copy of Zotero.
If I understand that correctly, the easiest format to construct from an existing database is probably bibtex along the lines described here:
https://forums.zotero.org/discussion/25120/importing-from-excel-to-zotero/#Item_10 It's a flat format that's not very sensitive to linebreaks and the like.
To find out correct field names, just construct one entry as you want it in Zotero, export as bibtex and use that as a template.
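As a rough illustration of that template approach, here is a minimal Python sketch that turns one database row into a bibtex entry. The record keys (url, title, source, date), the @misc entry type, and the choice of bibtex fields are assumptions made for the example, not something Zotero prescribes; check them against an entry you exported from Zotero yourself before relying on them.

```python
import re
from datetime import date

# Characters that BibTeX/LaTeX treats specially; each gets a backslash prefix.
# Backslashes already present in the input are not handled in this sketch.
_SPECIAL = re.compile(r"([&%$#_{}])")


def escape(value: str) -> str:
    """Collapse whitespace and backslash-escape LaTeX special characters."""
    collapsed = " ".join(value.split())
    return _SPECIAL.sub(r"\\\1", collapsed)


def to_bibtex(record: dict, key: str) -> str:
    """Render one crawled record as a BibTeX @misc entry.

    The record keys and the field mapping below are hypothetical; adapt them
    to the template exported from Zotero.
    """
    published = date.fromisoformat(record["date"])  # expects YYYY-MM-DD
    fields = {
        # Extra braces keep the title's capitalization as written.
        "title": "{" + escape(record["title"]) + "}",
        # Extra braces mark the source as a single institutional author.
        "author": "{" + escape(record["source"]) + "}",
        "year": str(published.year),
        "month": str(published.month),
        # Assumption: the URL is passed through unescaped.
        "url": record["url"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}\n"


if __name__ == "__main__":
    example = {
        "url": "http://example.org/press/1",
        "title": "Trade & Aid",
        "source": "Brazilian Ministry of Foreign Affairs",
        "date": "2013-10-16",
    }
    print(to_bibtex(example, "item1"))
```

Writing the output of to_bibtex for every row into a single .bib file should give you something Zotero can import in one go; the citation keys only need to be unique within that file.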

sdspieg · October 16, 2013

Well, it's not so much that we don't WANT to publish the database, but I do not know the legal ramifications of that. So I think we will indeed want to just import it into our Zotero group (probably without the actual html, as some of these sites are enormous). Thanks both, we will try the bibtex-route.

sdspieg · November 8, 2013

Another question. We have now succeeded in importing our database into Zotero through bibtex. We have also successfully run Papermachines on a subset of a little over a thousand entries. But could somebody please tell us whether there are any limitations that we may run into? I.e., how many items could successfully be added to Zotero AND be processed by Papermachines? Our items are generally not that big, as they are just the raw 'cleaned' text from a bunch of official government websites from which we scraped the datestamp. But we are talking about 10s of thousands of items per site. So if anybody has any thoughts on that, we'd love to hear them. Thanks

sdspieg · November 25, 2013

Does nobody have an idea on this one?

adamsmith · November 25, 2013

I know that the JSTOR corpus they developed Papermachines on is pretty substantial, but no specific ideas about limits - neither Zotero nor PM has been built with explicit limits.

sdspieg · November 25, 2013

Thanks. And is it 'normal' then, that importing a corpus of 20'000 (mostly relatively small) items takes about an hour? With Firefox repeatedly 'not responding'?

adamsmith · November 25, 2013

Depends on a number of factors, but in general terms, yes.

sdspieg · November 25, 2013

And which might be critical factors that could be remedied?

It doesn't appear to be related to the specs of the pc. E.g. the pc from which I'm doing this runs on x64 Windows 8 on a recent AMD 8-core @ 3.5 GHz with 8 gig of RAM and on a fast SSD drive - and it does NOT extract. Whereas my notebook, which is an x64 Windows 7 machine with an Intel i7 CPU @ 2.67 GHz with 6 gig of RAM and also a fast SSD drive, does.

So if you have any suggestions of what we could play with in order to fix this, we'd be very grateful.

aurimas · November 25, 2013

Not much you can "play with", but it could be Windows 8 vs Windows 7.