Extract the PDF file location for an item

rileykav · September 14, 2023

My goal is to be able to export the entire database from Zotero more or less automatically (in case support is discontinued, I want to move to a new system, etc.).

Currently I can export the library metadata as one large JSON file, which I can easily split per paper/item. Similarly, I can easily extract all of the PDF files to some folder outside of Zotero (for example, using a basic shell script to copy all of the PDF files out of the `~/Zotero/storage/` folder).
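For illustration, the copy step described above can also be done from Python while keeping the storage folder name attached to each file. This is only a sketch: `copy_pdfs` is a hypothetical helper name, and it assumes each attachment sits one level down, as `storage/<code>/<file>.pdf`.

```python
import shutil
from pathlib import Path


def copy_pdfs(storage_dir: Path, dest_dir: Path) -> dict[str, list[str]]:
    """Copy every PDF one level below storage_dir into dest_dir.

    Each copy is prefixed with the name of the folder it came from
    (assumed to be the 8-character code), so the link is not lost.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied: dict[str, list[str]] = {}
    # Note: this pattern only matches a lowercase ".pdf" extension.
    for pdf in sorted(storage_dir.glob("*/*.pdf")):
        code = pdf.parent.name
        target = dest_dir / f"{code}_{pdf.name}"
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        copied.setdefault(code, []).append(target.name)
    return copied
```

A call might look like `copy_pdfs(Path.home() / "Zotero" / "storage", Path("zotero_pdfs"))`. Keeping the folder code in the copied filename means the files can later be matched against anything that reports the same code.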

However, I have no way to link the two. I could set up a script to loop through each entry in the JSON file, search for the title in the PDF folders, and then use the authors to narrow the search down in case of common titles. But this is prone to breaking if the title does not always exactly match the filename (maybe it's too long, or contains special characters which the filename has recast, etc.).

Given that Zotero knows where each PDF is stored, and the 8-character alphanumeric code seems to be used as a unique identifier, this should be the most logical way to link the item metadata and the PDF file. But I have yet to find a way in Zotero to even access this identifier. There is even an ID with a URL containing two different alphanumeric codes, neither of which matches that of the PDF file itself.

So the main question is: *How can Zotero tell me what the storage ID for each PDF is?*

adamsmith · September 14, 2023

Why would you not just export to Zotero RDF, which includes file attachments (as do formats like RIS and BibTeX, which would likely be what you'd use to transfer to a different tool)?

You can look at the export code to see how export translators access the file attachments if you do want to script a custom solution (e.g. here: https://github.com/zotero/translators/blob/master/Zotero RDF.js ), but this honestly seems like a waste of time. Zotero is already set up to maximally protect you against any sort of lock-in, with 3 different ways to access metadata and files (export, web API, direct sqlite access).
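As a rough illustration of the direct sqlite route, the sketch below joins attachments to their parent items. Everything about the schema here is an assumption rather than something confirmed in this thread: that `items` has `itemID` and `key` columns, that `itemAttachments` has `itemID`, `parentItemID`, `contentType` and `path`, and that stored files use a `storage:` prefix in `path`. The helper name `attachment_paths` is made up. It is probably safest to run something like this against a copy of the database rather than the live file.

```python
import sqlite3
from pathlib import Path

# ASSUMED schema: items(itemID, key) and
# itemAttachments(itemID, parentItemID, contentType, path).
QUERY = """
SELECT parent.key, att.key, ia.path
FROM itemAttachments AS ia
JOIN items AS att ON att.itemID = ia.itemID
LEFT JOIN items AS parent ON parent.itemID = ia.parentItemID
WHERE ia.contentType = 'application/pdf'
ORDER BY att.key
"""


def attachment_paths(db_path, storage_dir: Path):
    """Return (parent key, attachment key, file path) for each stored PDF."""
    con = sqlite3.connect(str(db_path))
    try:
        rows = con.execute(QUERY).fetchall()
    finally:
        con.close()
    prefix = "storage:"  # ASSUMED marker for files kept in the storage folder
    result = []
    for parent_key, attachment_key, path in rows:
        if path and path.startswith(prefix):
            filename = path[len(prefix):]
            result.append(
                (parent_key, attachment_key, storage_dir / attachment_key / filename)
            )
    return result
```

If the assumptions hold, the attachment's own key is the folder name under `storage/`, and the parent key identifies the bibliographic item it belongs to, which would explain seeing two different codes for one paper.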

rileykav · September 14, 2023

I did look at RDF, but I was not familiar with the syntax involved; it looks vaguely HTML-like, and it's nothing I know how to parse in a Python script. For BibLaTeX it's a similar story, though more straightforward: I still would not have a perfectly simple way to parse that data. Tbh I don't understand why all exports don't include the option to export the attachments alongside the metadata...

Anyway, I was able to resolve my issue. I ended up using the CSV export, which includes attachment location information (this CSV file seems weird to me, with quotation marks around every item, and it broke my interpreter for ages), and eventually got Python's pandas to read the info correctly. Now I can just pass the file name format I want and the file location in Zotero to my shell and archive the PDF and metadata usefully.
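A minimal sketch of that CSV approach, using Python's standard `csv` module instead of pandas so it has no dependencies. The column names `Key`, `Title` and `File Attachments`, and the use of `;` to separate multiple attachment paths, are assumptions about the export format and should be checked against an actual export; `attachments_by_key` is a hypothetical helper name.

```python
import csv


def attachments_by_key(csv_file):
    """Map each item key to its title and list of attachment paths.

    csv_file is any open text file (or file-like object) holding the export.
    """
    mapping = {}
    for row in csv.DictReader(csv_file):
        raw = row.get("File Attachments") or ""  # ASSUMED column name
        # ASSUMED: several attachments are joined with ';'. A path that
        # itself contains ';' would be split incorrectly here.
        paths = [p.strip() for p in raw.split(";") if p.strip()]
        mapping[row["Key"]] = {"title": row["Title"], "paths": paths}
    return mapping
```

When reading from disk, opening the file with `open(path, newline="", encoding="utf-8-sig")` is a reasonable default: `newline=""` is what the `csv` documentation recommends, and `utf-8-sig` tolerates a byte-order mark at the start of the file if one is present.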

Finally, as a small suggestion (though I don't know the ins and outs of this): it would be nice to have some custom export function. Just give me access to all the metadata as variables in some JavaScript or similar box and let me write the string format myself. Or, if not that, let me customise what goes into the export files, as each of the formats contains a different amount of info (why does CSV include the attachment location but not JSON??).

adamsmith · September 14, 2023

The JSON is *CSL* JSON, i.e. JSON output intended for citation processors, which obviously have no use for the attachment files.

And Zotero export translators are basically this:

> it would be nice for some custom export function, just give me access to all the metadata as variables in some javascript or similar box and let me write the string format myself.

You can add custom ones easily.

For the rest:
RDF is a standard XML format, and there's of course an XML parsing library in Python: https://docs.python.org/3/library/xml.etree.elementtree.html
And Zotero's CSV export is valid. Quotation marks around every field are necessary since basically every field can have commas in it. That's perfectly standard CSV; I'm surprised pandas would struggle with it.
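To make the XML route concrete, here is a hedged sketch using `xml.etree.ElementTree`. The element names and namespaces are assumptions about what a Zotero RDF export looks like (attachments as `z:Attachment` elements carrying an `rdf:resource` child with the file path, and items pointing at them through `link:link`); compare them with a real export before relying on this. `link_attachments` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ASSUMED namespaces for a Zotero RDF export.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "z": "http://www.zotero.org/namespaces/export#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "link": "http://purl.org/rss/1.0/modules/link/",
}
RDF = "{" + NS["rdf"] + "}"


def link_attachments(rdf_text):
    """Return a list of (item title, [attachment file paths])."""
    root = ET.fromstring(rdf_text)

    # First pass: attachment id -> file path.
    files = {}
    for att in root.findall("z:Attachment", NS):
        res = att.find("rdf:resource", NS)
        if res is not None:
            files[att.get(RDF + "about")] = res.get(RDF + "resource")

    # Second pass: follow each item's link:link references.
    linked = []
    for item in root:
        refs = [l.get(RDF + "resource") for l in item.findall("link:link", NS)]
        paths = [files[r] for r in refs if r in files]
        if paths:
            title = item.findtext("dc:title", default="", namespaces=NS)
            linked.append((title, paths))
    return linked
```

The two-pass structure avoids depending on whether attachments appear before or after the items that reference them.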

aborel · September 14, 2023

Of course, there's a Python library for RDF as well:
https://rdflib.readthedocs.io/en/stable/

There are a few code snippets using it to work with Zotero, for example https://gist.github.com/rlskoeser/d18c19a8351d97ca933b64fd26048b98

rileykav · September 14, 2023

I did not know this about the JSON. It is still odd to me that citation export is mixed in with exporting the metadata itself, since the two are done for very different purposes and intents.

I have no idea what a Zotero export translator is. I believe you that it does exactly what I want, but it is not exactly user friendly (e.g. finding it, or knowing it is an option).

I didn't doubt the RDF; I was simply commenting on the fact that Zotero gave me all these other options for exporting, and I chose the ones I was comfortable working with. The CSV is valid, as you say, and pandas wasn't the trouble; it was Python's built-in csv parser and NumPy. Pandas actually just worked out of the box.

I don't mean to be combative in any of my comments, but too often I come across very useful software with an annoying/non-user-friendly data storage structure (e.g. what is with the weird ID in the storage, and why is it so difficult for the user to access?).

aborel · September 14, 2023

Your comments did not feel combative at all (at least to me), no worries about that!
We're just trying to point out possible solutions to achieve your goal, hopefully this is useful for you.