Importing files with structured filenames: "title @author $year.pdf"

Hello community. I have a large collection of PDFs, some of which I scanned myself (I'm a historian and we often use old articles). All of them are named in the format "title @author $year.pdf" for easy lookup through Spotlight/Alfred. Is it possible to take advantage of this naming system and import them into Zotero as partly formatted items (with title, year and last name of the author), with the original PDFs as attachments? Presumably something similar to what https://github.com/retorquere/zotero-folder-import does, but with filenames instead of a folder structure.
  • Easiest (or maybe rather fastest), I think, would be to have a fairly simple Python or whatnot script generate a RIS file from these filenames, and then import that. If I'm not mistaken, the PDF file path should go in an L1 line.
  • yup, that's a good idea, so the end result should look something like

    TY  - JOUR
    L1  - /Users/myfolder/title @author $year.pdf
    AU  - author
    TI  - title
    PY  - year
    ER  - 
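A minimal sketch of how such a record could be generated from one filename. This is not the script linked later in the thread; it assumes names of the form "title @author $year.pdf", and the helper names `parse_filename` and `ris_record` are made up for illustration. RIS tags are conventionally written as the two-character tag, two spaces, a hyphen and a space.

```python
import os
import re

# Assumed pattern: "title @author $year.pdf".
# The "$" is escaped because it is a regex metacharacter.
FILENAME_RE = re.compile(r'^(.+?)\s*@(.+?)\s*\$(.+?)\.pdf$', re.IGNORECASE)


def parse_filename(path):
    """Return (title, author, year) for a matching filename, or None."""
    m = FILENAME_RE.match(os.path.basename(path))
    if m is None:
        return None
    return tuple(part.strip() for part in m.groups())


def ris_record(path):
    """Build one RIS record whose L1 line points back at the PDF."""
    parsed = parse_filename(path)
    if parsed is None:
        return None
    title, author, year = parsed
    lines = [
        'TY  - JOUR',          # item type, as in the example above
        'L1  - ' + path,       # path to the PDF attachment
        'AU  - ' + author,
        'TI  - ' + title,
        'PY  - ' + year,
        'ER  - ',              # end of record
    ]
    return '\n'.join(lines) + '\n'


# Hypothetical example file, purely for illustration.
print(ris_record('/Users/myfolder/Old article @Smith $1923.pdf'))
```

Files that do not match the pattern yield `None`, so a caller can skip or report them instead of writing a broken record.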
  • Thank you, now I only have to learn to code! :)
  • If you're on a Mac, the Python script will just work. If you're on Windows, I can take a stab at converting it to VBScript.
  • Thank you! I've never executed a Python script – is this the way to do it? https://www.pythoncentral.io/execute-python-script-file-shell/
  • It depends a bit on what operating system you use -- if I know that I can give you detailed instructions.
  • (If it's a one-time thing and you're on Windows, I can transform the output of a dir /a-d /b /s for you, but that's really for a one-shot.)
  • Thank you! macOS Big Sur. But you can also redirect me somewhere, I don’t want to steal your time. I think I can adapt the script, I'm just not sure how to execute it.
  • No worries.

    1. Download the script from https://gist.githubusercontent.com/retorquere/722d8c2308f93d3637f2576fcfc6d41b/raw/8c283aaecdb474c3b05ceb8cb8dd7aa61b472918/ris.py (changed from before) and save it somewhere that's easy for you to remember. Your Downloads folder should be fine for now.
    2. The script will have to be run from a terminal command line. I find it easiest to get to a command line at a particular place (which will matter later) by adding it to the Finder services menu: https://www.howtogeek.com/210147/how-to-open-terminal-in-the-current-os-x-finder-location/ (the bit below "Adding a Terminal Shortcut to the Services Menu")

    Once that is in place, let's say that your PDFs live at Documents/My Academic stuff/very important and Documents/My Academic stuff/frivolous:

    1. Go to Documents/My Academic stuff with Finder and use "New terminal at folder". The terminal will pop up. Without the Finder service you can also press cmd-space, type "terminal", press Return, and then type cd ~/'Documents/My Academic stuff', which achieves the same. (The ~ has to stay outside the quotes, otherwise the shell will not expand it to your home folder.)
    2. Type python ~/Downloads/ris.py 'very important' frivolous
    3. This will create Documents/My Academic stuff/very important/very important.ris and Documents/My Academic stuff/frivolous/frivolous.ris which you can import.

    You can add paths to folders of PDFs as you please, or run it one folder at a time. The outcome will be the same.
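To make the folder-to-RIS naming concrete, here is a rough sketch of the kind of driver loop such a script might use. It is not the linked ris.py: `ris_path_for` and `pdfs_in` are hypothetical helper names, and the sketch assumes only PDFs sitting directly in each folder are considered (no recursion into subfolders).

```python
import os
import sys


def ris_path_for(folder):
    """Map a folder to its output file: 'frivolous' -> 'frivolous/frivolous.ris'."""
    folder = os.path.normpath(folder)  # drops a trailing slash, if any
    return os.path.join(folder, os.path.basename(folder) + '.ris')


def pdfs_in(folder):
    """Absolute paths of the PDFs directly inside folder, sorted by name."""
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith('.pdf'))
    return [os.path.abspath(os.path.join(folder, n)) for n in names]


def main(folders):
    # One RIS file per folder given on the command line.
    for folder in folders:
        print('%s: %d PDF(s) found' % (ris_path_for(folder), len(pdfs_in(folder))))


if __name__ == '__main__':
    main(sys.argv[1:])
```

Absolute paths are used because the RIS file has to point at the PDFs regardless of where Zotero is started from. Quoting 'very important' on the command line keeps the shell from splitting it into two arguments at the space.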

  • Note that unlike my folder importer linked to earlier, this will not cause Zotero to do metadata lookup. What's in the RIS is what you get.
  • Thank you, it works! Except that... actually I have "®" not "$" in my filenames. So I replaced it accordingly in your regex and I get this:

    SyntaxError: Non-ASCII character '\xc2' in file /Users/jakub/Downloads/ris.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

    How does one declare encoding? If it's complicated, I can batch-change all ® to $.
  • Could you paste your line 11 here? I suspect you missed something in the code

    (you can define utf-8 encoding by putting
    # -*- coding: utf-8 -*-
    or
    # coding: utf-8

    in the first or second line of the script, but I don't think that should be necessary)
  • m = re.match(r'^(.+?)@(.+?)\®(.+?).pdf$', os.path.basename(pdf))
  • m = re.match(r'^(.+?)@(.+?)®(.+?).pdf$', os.path.basename(pdf))

    (The \ before the $ sign is an "escape" character, because $ is a special character in a regex. It's not needed for ®. If it doesn't run after this, add the # coding: utf-8 line.)
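A small sketch of the two points above, escaping and the encoding declaration. It is not taken from ris.py: `filename_pattern` is a hypothetical helper, and unlike the line quoted above it also escapes the dot before "pdf". Using `re.escape` means you do not have to remember which marker characters need a backslash.

```python
# -*- coding: utf-8 -*-
# The line above is what Python 2 needs once the source file contains a
# non-ASCII character such as the registered sign; Python 3 assumes UTF-8.
import re


def filename_pattern(year_marker):
    """Compile a pattern for 'title @author <marker>year.pdf'.

    re.escape() adds a backslash where the marker needs one ("$" does),
    so the same code works for "$" and for the registered sign.
    """
    return re.compile(r'^(.+?)@(.+?)' + re.escape(year_marker) + r'(.+?)\.pdf$')


# Hypothetical filenames, purely for illustration.
print(filename_pattern('$').match('title @author $1923.pdf').groups())
print(filename_pattern('®').match('title @author ®1923.pdf').groups())
```

The captured title and author keep their trailing space here; strip them before writing them into a RIS record.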
  • I got the same error, but I added the encoding in the second line and it works. Thank you both! Also for helping me to break the psychological barrier “oh no, coding is hard to start”. Now I want to experiment more. :)
  • NP