Importing files with structured filenames: "title @author $year.pdf"

sypianski · May 4, 2021

Hello community. I have a large collection of PDFs, some of which I scanned myself (I'm a historian and we often use old articles). All of them are named in the format "title @author $year.pdf" for easy lookup through Spotlight/Alfred. Is it possible to take advantage of this naming system and import them into Zotero as partly-formatted items (with title, year and last name of the author), with the original PDFs as attachments? Presumably something similar to what https://github.com/retorquere/zotero-folder-import does, but with filenames instead of a folder structure.

emilianoeheyns · May 4, 2021

Easiest (or maybe rather fastest) I think would be to have a fairly simple python or whatnot script generate a RIS file from these filenames, and then import that. If I'm not mistaken, the PDF file path should go in an L1 line.

emilianoeheyns · May 4, 2021

Something like this should mostly do it: https://gist.github.com/c8d125e083d2a1222c3e73fd0639205c

adamsmith · May 4, 2021

yup, that's a good idea, so the end result should look something like

TY  - JOUR
L1  - /Users/myfolder/title @author $year.pdf
AU  - author
TI  - title
PY  - year
ER  - 
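
For anyone who wants to see the idea in code, here is a minimal sketch of turning one filename into such a record. It is not the script from the gist; the regex, the name ris_record and the example path are assumptions made for illustration, and it assumes the caller passes an absolute path to the PDF.

```python
import os
import re

# Assumed pattern for "title @author $year.pdf" (not taken from the gist).
FILENAME_RE = re.compile(r'^(?P<title>.+?)@(?P<author>.+?)\$(?P<year>.+?)\.pdf$')

def ris_record(pdf_path):
    """Return a RIS record for one PDF, or None if its name does not match."""
    m = FILENAME_RE.match(os.path.basename(pdf_path))
    if m is None:
        return None
    return '\n'.join([
        'TY  - JOUR',
        'L1  - ' + pdf_path,                   # path of the PDF to attach
        'AU  - ' + m.group('author').strip(),
        'TI  - ' + m.group('title').strip(),
        'PY  - ' + m.group('year').strip(),
        'ER  - ',
    ]) + '\n'

# Hypothetical example path, for illustration only.
print(ris_record('/Users/myfolder/Some title @Smith $1999.pdf'))
```

Files whose names do not follow the scheme simply give None here, so they can be skipped rather than imported with empty fields.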

sypianski · May 5, 2021

Thank you, now I only have to learn to code! :)

emilianoeheyns · May 5, 2021

If you're on a Mac, the Python script will just work. If you're on Windows, I can take a stab at converting it to VBScript.

sypianski · May 5, 2021

Thank you! I've never executed a Python script – is this the way to do it? https://www.pythoncentral.io/execute-python-script-file-shell/

emilianoeheyns · May 5, 2021

It depends a bit on what operating system you use -- if I know that I can give you detailed instructions.

emilianoeheyns · May 5, 2021

(if it's a one-time thing and you're on windows, I can transform the output of a dir /a-d /b /s for you, but that's really for a one-shot)

sypianski · May 6, 2021

Thank you! macOS Big Sur. But you can also redirect me somewhere, I don’t want to steal your time. I think I can adapt the script, I'm just not sure how to execute it.

emilianoeheyns · May 6, 2021

No worries.

Download the script from https://gist.githubusercontent.com/retorquere/722d8c2308f93d3637f2576fcfc6d41b/raw/8c283aaecdb474c3b05ceb8cb8dd7aa61b472918/ris.py (changed from before) and save it somewhere that's easy for you to remember. Your Downloads folder should be fine for now.
The script will have to be run from a terminal command line. I find it easiest to get to a command line at a particular place (which will matter later) by adding it to the Finder services menu: https://www.howtogeek.com/210147/how-to-open-terminal-in-the-current-os-x-finder-location/ (the bit below "Adding a Terminal Shortcut to the Services Menu")

Once that is in place, let's say that your PDFs live at Documents/My Academic stuff/very important and Documents/My Academic stuff/frivolous:

Go to Documents/My Academic stuff with Finder and use "New Terminal at Folder". The terminal will pop up. Without the Finder service you can also press cmd-space, type "terminal", and then type cd ~/'Documents/My Academic stuff', which will achieve the same. (The ~ has to stay outside the quotes, otherwise the shell will not expand it to your home folder.)
Type python ~/Downloads/ris.py 'very important' frivolous
This will create Documents/My Academic stuff/very important/very important.ris and Documents/My Academic stuff/frivolous/frivolous.ris which you can import.

You can add paths to folders of PDFs as you please, or run it one folder at a time. The outcome will be the same.
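
To give an idea of what happens per folder, here is a rough sketch of the behaviour described above: one .ris file per folder, named after the folder. It is not the actual ris.py. The function name write_ris_for_folder and the regex are assumptions, it only looks at PDFs directly inside the folder (no subfolders), and it is written for Python 3.

```python
import os
import re

# Assumed pattern for "title @author $year.pdf"; the real ris.py may differ.
PATTERN = re.compile(r'^(.+?)@(.+?)\$(.+?)\.pdf$')

def write_ris_for_folder(folder):
    """Write <folder>/<folder name>.ris and return the path of that file."""
    folder = os.path.abspath(folder)
    out_path = os.path.join(folder, os.path.basename(folder) + '.ris')
    with open(out_path, 'w', encoding='utf-8') as out:
        for name in sorted(os.listdir(folder)):
            m = PATTERN.match(name)
            if m is None:
                continue  # skip files that do not follow the naming scheme
            title, author, year = (g.strip() for g in m.groups())
            out.write('TY  - JOUR\n')
            out.write('L1  - ' + os.path.join(folder, name) + '\n')
            out.write('AU  - ' + author + '\n')
            out.write('TI  - ' + title + '\n')
            out.write('PY  - ' + year + '\n')
            out.write('ER  - \n\n')
    return out_path
```

Calling write_ris_for_folder('very important') from within Documents/My Academic stuff would then write very important/very important.ris, and processing several folders is just a loop over their names.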

emilianoeheyns · May 6, 2021

Note that unlike my folder importer linked to earlier, this will not cause Zotero to do metadata lookup. What's in the RIS is what you get.

sypianski · May 6, 2021

Thank you, it works! Except that... actually I have "®" not "$" in my filenames. So I replaced it accordingly in your regex and I get this:

SyntaxError: Non-ASCII character '\xc2' in file /Users/jakub/Downloads/ris.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

How does one declare encoding? If it's complicated, I can batch-change all ® to $.

adamsmith · May 6, 2021

Could you paste your line 11 here? I suspect you missed something in the code

(you can define utf-8 encoding by putting
# -*- coding: utf-8 -*-
or
# coding: utf-8

in the 2nd line of the script, but I don't think that should be necessary)

sypianski · May 6, 2021

m = re.match(r'^(.+?)@(.+?)\®(.+?).pdf$', os.path.basename(pdf))

adamsmith · May 6, 2021

m = re.match(r'^(.+?)@(.+?)®(.+?).pdf$', os.path.basename(pdf))

(The \ before the $ sign is an "escape" character, needed because $ is a special character in a regex. It's not needed for ®. If it doesn't run after this, add the # coding: utf-8 line.)
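
If you'd rather not think about which markers need a backslash, here is a small sketch that builds the pattern from whatever marker you use. The helper name filename_pattern is made up for this example and is not part of ris.py. It is written for Python 3, where the coding line is harmless; under Python 2 a source file needs that line as soon as it contains any non-ASCII character such as ®.

```python
# -*- coding: utf-8 -*-
import re

def filename_pattern(year_marker):
    """Build the filename regex for a given year marker such as '$' or '®'."""
    # re.escape takes care of any escaping the marker needs.
    return re.compile(r'^(.+?)@(.+?)' + re.escape(year_marker) + r'(.+?)\.pdf$')

m = filename_pattern('®').match('Some title @Smith ®1999.pdf')
print([g.strip() for g in m.groups()])  # ['Some title', 'Smith', '1999']
```

The same helper called with '$' gives a pattern equivalent to the original one, so switching markers is a one-character change.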

sypianski · May 6, 2021

I got the same error, but I added the encoding in the second line and it works. Thank you both! Also for helping me to break the psychological barrier “oh no, coding is hard to start”. Now I want to experiment more. :)

emilianoeheyns · May 6, 2021

NP