Zotero strips Mac extended attributes on import

danhicks · October 29, 2014

On a recommendation from a friend, I'm considering migrating my research library from Papers to Zotero. I have ~3,000 PDFs in my Papers library, most of which have Skim notes and many of which have OpenMeta tags.

Both OpenMeta tags and Skim notes are stored in the file's extended attributes rather than in the file itself (unlike standard PDF annotations, which are embedded in the file). Since extended attributes are filesystem-specific, I don't expect them to survive, e.g., email. But they should be preserved by in-filesystem manipulations, e.g., copying the file from one place to another.
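
For anyone who wants to see which extended attributes a given PDF carries, here is a minimal sketch. It assumes the macOS `xattr` command-line tool, which prints one attribute name per line when given a file path. The function name `list_xattr_names` and the `run` parameter are made up for this example and are not part of Zotero, Skim, or OpenMeta.

```python
import subprocess

def list_xattr_names(path, run=subprocess.check_output):
    """Return the extended-attribute names attached to `path`.

    Assumption: the macOS `xattr` tool is on the PATH. `run` is a
    hypothetical hook so the parsing can be exercised without a Mac.
    """
    output = run(['xattr', path])
    # check_output returns bytes on Python 3; decode before splitting
    if isinstance(output, bytes):
        output = output.decode('utf-8')
    # one attribute name per non-empty line
    return [line.strip() for line in output.splitlines() if line.strip()]
```

Calling `list_xattr_names('/path/to/some.pdf')` before and after an import should show whether the attributes were dropped.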

However, whenever I import a file into Zotero (Standalone, 4.0.23), all of the extended attributes are stripped out, erasing the Skim notes and OpenMeta tags. I searched the Zotero documentation and forums, and found no mention of anything like this. Is this behavior known?

dstillman · October 29, 2014

There's a chance this will be fixed in the next major version of Zotero, which will use a new set of file methods provided by Mozilla. If those don't help, there's nothing we can do about it (other than report it to Mozilla).

In any case, I've created an issue to keep track of this.

danhicks · October 30, 2014

Thanks for the quick reply. It's too bad I'll have to wait for the next version. (Or script a workaround.) But at least I know the issue has been noted. Thanks!

danhicks · December 8, 2014

I finally had time to script a workaround. The Python code below assumes a specific file structure (one folder per author) and human-readable filenames, both of which I set up using Zotfile. The basic idea is to import from a RIS or BIB file as usual, rename attachments with Zotfile, and then run the script to re-copy all of the PDFs using a filesystem command. It's a bit clunky, but maybe it'll be useful for someone else before the next major version of Zotero is released.


from fuzzywuzzy import fuzz            # fuzzy string matching; see https://github.com/seatgeek/fuzzywuzzy and http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
import os.path as osp                # working with paths
from os import chdir, listdir        # change working directory and list directory contents
from subprocess import check_call, CalledProcessError    # run a system command as a subprocess; exception raised when it fails

# set some constants
papers = '/Users/dhicks1/Google Drive/Papers'    # <- source of PDFs
zotero = '/Users/dhicks1/Zotero/attachments'    # <- target PDFs
threshold = 80        # <- minimum match quality

# get the contents of the zotero folder.  given the file structure described above, these are author-name folders.
authors = listdir(zotero)

copied = 0
errors = 0
# iterate through every author name
for author in authors:
    # construct the absolute path
    zot_author = osp.join(zotero, author)
    # if it's not a folder, we skip it
    # note that this means we skip attachments without creator metadata! 
    if not osp.isdir(zot_author):
        print(author + ' is not a folder')
        continue
    # find the corresponding papers folder
    pap_author = osp.join(papers, author)
    if not osp.exists(pap_author):
        print('Papers folder ' + author + ' not found')
        errors += 1
        continue
        
    #print(author)
    
    # get the contents of the two author folders; filter so we're only dealing with PDFs
    zot_files = listdir(zot_author)
    zot_pdfs = [x for x in zot_files if osp.splitext(x)[-1].lower() == '.pdf']
    pap_files = listdir(pap_author)
    pap_pdfs = [x for x in pap_files if osp.splitext(x)[-1].lower() == '.pdf']
    
    # for each PDF on the zotero side, find a matching PDF on the papers side
    for zot_pdf in zot_pdfs:
        # first, use fuzz to score filename comparisons
        percs = []
        for pap_pdf in pap_pdfs:
            percs += [fuzz.token_set_ratio(zot_pdf, pap_pdf)]
        #print(percs)
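        # second, pick out the highest score, and make sure it's above the threshold
        # (fall back to 0 when the papers folder has no PDFs; max() of an empty list raises ValueError)
        max_perc = max(percs) if percs else 0
        if max_perc > threshold: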
            match = pap_pdfs[percs.index(max_perc)]
            print('\t Match ' + zot_pdf + ' to \n\t       ' + match)
            try:
                # use check_call to have the OS copy the papers version to the zotero side
                # flag -p ensures that the metadata (incl. extended attributes) are copied as well
                # (no way to copy just metadata AFAIK)
                check_call(['cp', '-p', osp.join(pap_author, match), osp.join(zot_author, zot_pdf)])
                copied += 1
            except CalledProcessError:
                # cp returns >0 if the copy attempt fails
                # check_call throws an exception if the call returns >0
                print('*** Copy failed for ' + zot_pdf)
                errors += 1
        else:
            # we land in this branch if none of the filename comparisons scores passed the threshold
            print('*** No match for ' + zot_pdf)
            errors += 1
# all finished!  print totals
print('PDFs copied: ' + str(copied))
print('PDFs with errors: ' + str(errors))
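
After a run, it may be worth spot-checking that the attributes really made it across. The sketch below compares attribute names on a source file and its copy. It again assumes the macOS `xattr` command-line tool (one attribute name per line); `xattr_names`, `missing_xattrs`, and the `run` parameter are hypothetical names for this example only.

```python
import subprocess

def xattr_names(path, run=subprocess.check_output):
    """Return the set of extended-attribute names on `path` (assumes macOS `xattr`)."""
    output = run(['xattr', path])
    if isinstance(output, bytes):
        output = output.decode('utf-8')
    return set(line.strip() for line in output.splitlines() if line.strip())

def missing_xattrs(source, copy, run=subprocess.check_output):
    """Return a sorted list of attribute names on `source` that `copy` lacks."""
    return sorted(xattr_names(source, run) - xattr_names(copy, run))
```

An empty list from `missing_xattrs(papers_pdf, zotero_pdf)` suggests the copy kept every attribute name. It only compares names, not values, so it is a sanity check rather than proof.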