Extract the PDF file location for an item
My goal is to be able to export the entire database from Zotero more or less automatically (in case support is discontinued, I want to move to a new system, etc.).
Currently I can export the library metadata as one large JSON file, which I can easily split per paper/item. Similarly, I can easily extract all of the PDF files to some folder outside of Zotero (for example, using a basic shell script to copy out all of the PDF files in the `~/Zotero/storage/` folder).
However, I have no way to link the two. I could set up a script to loop through each entry in the JSON file, search for the title in the PDF folders, and then use the authors to narrow the search down in case of common titles. But this is prone to breaking if the title does not always match the filename exactly (maybe it is too long, or contains special characters that were replaced in the filename, etc.).
Given that Zotero knows where each PDF is stored, and the 8-character alphanumeric code seems to be used as a unique identifier, that code should be the most logical way to link the item metadata and the PDF file. But I have yet to find a way in Zotero to even access this identifier. There is even an id field with a URL containing two different alphanumeric codes, neither of which matches the code of the PDF file itself.
So the main question is *How can Zotero tell me what the storage ID for each PDF is?*
If you do want to script a custom solution, you can look at the export code to see how export translators access the file attachments (e.g. here: https://github.com/zotero/translators/blob/master/Zotero RDF.js ), but this honestly seems like a waste of time. Zotero is already set up to protect you as much as possible against any sort of lock-in, with three different ways to access metadata and files (export, web API, direct SQLite access).
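For the direct SQLite route, a minimal sketch in Python could look like the following. The table and column names (`items.key`, `itemAttachments.parentItemID`, `itemAttachments.path`, the `storage:` path prefix) and the `storage/<attachment key>/` folder layout are assumptions about the Zotero database, not something confirmed in this thread, so check them against your own `zotero.sqlite` first. The function name `pdf_locations` is made up for illustration.

```python
import sqlite3

# Assumed schema (not confirmed in this thread): `items` holds each item's
# 8-character key, and `itemAttachments` links an attachment item to its
# parent item and stores the file path, e.g. "storage:paper.pdf".
QUERY = """
SELECT parent.key AS parent_key,
       att.key    AS attachment_key,
       ia.path    AS path
FROM itemAttachments AS ia
JOIN items AS att    ON att.itemID = ia.itemID
JOIN items AS parent ON parent.itemID = ia.parentItemID
WHERE ia.contentType = 'application/pdf'
"""


def pdf_locations(conn):
    """Return (parent_key, attachment_key, relative_path) for stored PDFs."""
    results = []
    for parent_key, attachment_key, path in conn.execute(QUERY):
        # Stored files are assumed to live in storage/<attachment key>/<name>.
        if path and path.startswith("storage:"):
            filename = path[len("storage:"):]
            results.append(
                (parent_key, attachment_key,
                 f"storage/{attachment_key}/{filename}")
            )
    return results
```

To try it, work on a copy of the database made while Zotero is closed, e.g. `pdf_locations(sqlite3.connect("zotero-copy.sqlite"))`, where the file name is just a placeholder.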
Anyway, I was able to resolve my issue. I ended up using the CSV export, which includes the attachment location (this CSV file seemed weird to me, with quotation marks around every field, and it broke my parser for ages), and eventually got Python's pandas to read the info correctly. Now I can just pass the file name format I want and the file location in Zotero to my shell, and archive the PDF and metadata in a useful form.
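For anyone wanting to do the same archiving step without a shell, here is a rough sketch in Python. The `Author_Year_Title.pdf` naming scheme and the helper names `safe_name` and `archive_pdf` are my own choices for illustration, not anything Zotero provides.

```python
import re
import shutil
from pathlib import Path


def safe_name(text, max_len=80):
    """Replace characters that are awkward in filenames and cap the length."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("_")
    return cleaned[:max_len] or "untitled"


def archive_pdf(source, archive_dir, author, year, title):
    """Copy one PDF to archive_dir under an Author_Year_Title.pdf name."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (safe_name(f"{author}_{year}_{title}") + ".pdf")
    # copy2 keeps the original file's timestamps where the platform allows it.
    shutil.copy2(source, target)
    return target
```

Called once per row of the export, with `source` set to the attachment path from that row, this leaves the original Zotero storage folder untouched.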
Finally, a small suggestion, though I don't know the ins and outs of this: it would be nice to have some custom export function. Just give me access to all the metadata as variables in some JavaScript (or similar) box and let me write the string format myself. Or, failing that, let me customise what goes into the export files, as each of the formats contains a different amount of info (why does CSV include the attachment location but JSON doesn't?).
And Zotero export translators are basically this: You can add custom ones easily.
For the rest:
RDF is a standard XML format, and there's of course an XML parsing library in Python: https://docs.python.org/3/library/xml.etree.elementtree.html
And Zotero's CSV export is valid. Quotation marks around every field are necessary, since basically every field can have commas in it. That's perfectly standard CSV; I'm surprised pandas would struggle with it.
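As a small illustration of why the quoting matters, here is a sketch using Python's built-in `csv` module on a made-up two-line sample. The column names (`Key`, `Title`, `File Attachments`) and the sample path are assumptions for the example, so compare them with the header of a real export. `storage_folder` is a hypothetical helper that takes the folder name directly above the file.

```python
import csv
import io

# Made-up sample in the fully quoted style described above; the column names
# are assumptions about the export header, not copied from a real export.
SAMPLE = (
    '"Key","Title","File Attachments"\n'
    '"AAAA1111","Graphs, trees, and forests",'
    '"/home/me/Zotero/storage/BBBB2222/Graphs.pdf"\n'
)


def read_rows(text):
    """Parse quoted CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def storage_folder(path):
    """Return the folder name directly above the file (the storage code)."""
    parts = path.replace("\\", "/").split("/")
    return parts[-2] if len(parts) >= 2 else ""


rows = read_rows(SAMPLE)
```

The comma inside the title stays part of one field because the field is quoted. For a real file, opening it with `newline=""` is what the `csv` documentation recommends, and `encoding="utf-8-sig"` strips a byte-order mark if the file starts with one.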
https://rdflib.readthedocs.io/en/stable/
There are a few code snippets using it to work with Zotero, for example https://gist.github.com/rlskoeser/d18c19a8351d97ca933b64fd26048b98
I have no idea what a Zotero export translator is. I believe you that it does exactly what I want, but it is not exactly user friendly (e.g. to find, or even to know it is an option).
I didn't doubt the RDF; I was simply commenting on the fact that Zotero gave me all these other options for exporting, and I chose the ones I was comfortable working with. The CSV is valid, as you say, and pandas wasn't the trouble: it was Python's built-in csv parser and NumPy. Pandas actually just worked out of the box.
I don't mean to be combative in any of my comments, but too often I come across very useful software with an annoying, non-user-friendly data storage structure (e.g. what is with the weird id in the storage folder, and why is it so difficult for the user to access?).
We're just trying to point out possible solutions to achieve your goal; hopefully this is useful for you.