Zotero strips Mac extended attributes on import
On a recommendation from a friend, I'm considering migrating my research library from Papers to Zotero. I have ~3,000 PDFs in my Papers library, most of which have Skim notes and many of which have OpenMeta tags.
Both OpenMeta tags and Skim notes are stored in the file's extended attributes, rather than in the file itself (as with standard PDF annotations). Since extended attributes are filesystem-specific, I don't expect them to be preserved by, e.g., email. But they should be preserved when doing in-filesystem manipulations, e.g., copying the file from one place to another.
However, whenever I import a file into Zotero (Standalone, 4.0.23), all of the extended attributes are stripped out, erasing the Skim notes and OpenMeta tags. I searched the Zotero documentation and forums, and found no mention of anything like this. Is this behavior known?
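One rough way to confirm which attributes go missing is to list the attribute names on the original PDF and on the imported copy, then compare the two lists. The sketch below assumes macOS's `xattr` command-line tool (which prints one attribute name per line when given a file) and Python 3.7 or later. The function names are hypothetical, made up for this illustration, and the paths are whatever pair of files you want to compare.

```python
import subprocess


def parse_xattr_names(output):
    """Turn the output of `xattr <file>` (one name per line) into a set of names."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def list_xattrs(path):
    """List a file's extended attribute names.

    Assumption: the macOS `xattr` command is on the PATH.
    """
    result = subprocess.run(['xattr', path], capture_output=True,
                            text=True, check=True)
    return parse_xattr_names(result.stdout)


def lost_xattrs(before, after):
    """Return, sorted, the attribute names present before but missing after."""
    return sorted(set(before) - set(after))


def report_lost(original, imported):
    """Print the attributes found on `original` that are absent from `imported`."""
    missing = lost_xattrs(list_xattrs(original), list_xattrs(imported))
    if missing:
        print('Missing on ' + imported + ': ' + ', '.join(missing))
    else:
        print('No attributes lost')
    return missing
```

Calling `report_lost` with the path of a PDF in the Papers folder and the path of the same PDF after import should print the names of any attributes that did not survive.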
In any case, I've created an issue to keep track of this.
from fuzzywuzzy import fuzz  # fuzzy string matching; see https://github.com/seatgeek/fuzzywuzzy and http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
import os.path as osp  # working with paths
from os import listdir  # list directory contents
from subprocess import check_call, CalledProcessError  # run a system command as a subprocess

# set some constants
papers = '/Users/dhicks1/Google Drive/Papers'  # <- source of PDFs
zotero = '/Users/dhicks1/Zotero/attachments'  # <- target PDFs
threshold = 80  # <- minimum match quality

# get the contents of zotero. in line with the notes above, these are author names.
authors = listdir(zotero)
copied = 0
errors = 0

# iterate through every author name
for author in authors:
    # construct the absolute path
    zot_author = osp.join(zotero, author)
    # if it's not a folder, we skip it
    # note that this means we skip attachments without creator metadata!
    if not osp.isdir(zot_author):
        print(author + ' is not a folder')
        continue
    # find the corresponding papers folder
    pap_author = osp.join(papers, author)
    if not osp.exists(pap_author):
        print('Papers folder ' + author + ' not found')
        errors += 1
        continue
    #print(author)
    # get the contents of the two author folders; filter so we're only dealing with PDFs
    zot_files = listdir(zot_author)
    zot_pdfs = [x for x in zot_files if osp.splitext(x)[-1].lower() == '.pdf']
    pap_files = listdir(pap_author)
    pap_pdfs = [x for x in pap_files if osp.splitext(x)[-1].lower() == '.pdf']
    # for each PDF on the zotero side, find a matching PDF on the papers side
    for zot_pdf in zot_pdfs:
        # first, use fuzz to score filename comparisons
        percs = [fuzz.token_set_ratio(zot_pdf, pap_pdf) for pap_pdf in pap_pdfs]
        #print(percs)
        # second, pick out the highest score, and make sure it's above the threshold of 80
        # (if the papers folder has no PDFs, percs is empty, so treat that as a score of 0)
        max_perc = max(percs) if percs else 0
        if max_perc > threshold:
            match = pap_pdfs[percs.index(max_perc)]
            print('\t Match ' + zot_pdf + ' to \n\t ' + match)
            try:
                # use check_call to have the OS copy the papers version to the zotero side
                # flag -p ensures that the metadata (incl. Extended Attributes) are copied as well
                # (no way to copy just metadata AFAIK)
                check_call(['cp', '-p', osp.join(pap_author, match), osp.join(zot_author, zot_pdf)])
                copied += 1
            except CalledProcessError:
                # cp returns >0 if the copy attempt fails
                # check_call throws an exception if the call returns >0
                print('*** Copy failed for ' + zot_pdf)
                errors += 1
        else:
            # we land in this branch if none of the filename comparison scores passed the threshold
            print('*** No match for ' + zot_pdf)
            errors += 1

# all finished! print totals
print('PDFs copied: ' + str(copied))
print('PDFs with errors: ' + str(errors))
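Before running the script on a whole library, it may help to try the matching step on a handful of filenames. The sketch below pulls that step into a standalone function. The names `best_match` and `similarity` are hypothetical, and `similarity` is a stand-in scorer built on the standard library's `difflib`, not `fuzz.token_set_ratio`. It returns different scores, so the threshold of 80 would likely need retuning if it were used for real. Passing `fuzz.token_set_ratio` as the `scorer` argument should reproduce the script's own scoring.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: NOT fuzz.token_set_ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, threshold=80, scorer=similarity):
    """Return the candidate that scores highest against `name`.

    Returns None if no candidate scores strictly above `threshold`,
    which includes the case where `candidates` is empty.
    On a tie, the earliest candidate wins.
    """
    best, best_score = None, threshold
    for candidate in candidates:
        score = scorer(name, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Returning `None` rather than raising keeps the empty-folder case and the no-good-match case on the same code path, so the caller only has one condition to check.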