pdf import without metadata

danb · February 28, 2012

Hi,

I have several hundred files named
ISBN.pdf (like 123456789x.pdf), but they contain no metadata.
Currently I add each item by identifier, then add a stored copy of the PDF.
This is extremely cumbersome. Is there a better way to import these files?

Is there a way to provide the metadata in a simple extra file with the names and the identifiers, something like
123456789x.pdf 123456789x
xyz.pdf dx.doi.org/1234/567
and then point zotero to this file and have it import the pdfs and complete the entries from the identifiers?
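
To make the proposed file format concrete, here is a minimal sketch of how such a list could be read. The function names `parse_mapping` and `identifier_kind` are made up for illustration, and the DOI/ISBN guess is only a rough heuristic, not anything Zotero provides.

```python
from __future__ import print_function


def parse_mapping(text):
    """Return (filename, identifier) pairs from 'filename identifier' lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        # Split on the last run of whitespace so filenames may contain spaces.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # no identifier on this line
        pairs.append((parts[0], parts[1]))
    return pairs


def identifier_kind(identifier):
    """Guess whether an identifier looks like a DOI or an ISBN (heuristic)."""
    if 'doi.org/' in identifier or identifier.startswith('10.'):
        return 'doi'
    return 'isbn'


sample = """123456789x.pdf 123456789x
xyz.pdf dx.doi.org/1234/567
"""
for filename, identifier in parse_mapping(sample):
    print(filename, identifier_kind(identifier), identifier)
```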

Dan

adamsmith · February 28, 2012

No. AFAIK Zotero currently doesn't even allow importing lists of identifiers, though that really would be nice. Tying that to filenames in a list, on the other hand, strikes me as too rare a scenario to be worth addressing on Zotero's side.

Note that the metadata Zotero gets isn't taken from the file's tags (PDF tags aren't standardized enough to make that worthwhile), but by trying to look it up on CrossRef (if it finds a DOI) or on Google Scholar. So you could try dragging the files to Zotero and using Retrieve Metadata.

danb · February 28, 2012

I tried dragging them in, but mostly there is no metadata in the file itself, only the ISBN in the filename.
In general, the metadata import does not work well for me: too many articles have no metadata or, worse, get matched to the wrong metadata, so I would still need to check every entry manually.

I think it would be nice to have some form of simplified input to import legacy data. I assume many people have literature in folders with meaningful names. In that case it would be easy to write a batch file in any language to create the import file with the ISBNs and have zotero complete the data.
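
As a rough sketch of such a batch script, assuming the files are named after their ISBN as in my case: the helper names `lines_for` and `build_import_list` and the output format are hypothetical, since no such import file is supported by Zotero today.

```python
import os
import re

# The whole file stem must be an ISBN-10 (9 digits plus a digit or X)
# or an ISBN-13 (13 digits). No checksum is verified here.
ISBN_STEM = re.compile(r'^(\d{9}[\dXx]|\d{13})$')


def lines_for(names):
    """Turn a list of file names into 'filename identifier' lines."""
    lines = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext.lower() == '.pdf' and ISBN_STEM.match(stem):
            lines.append('%s %s' % (name, stem))
    return lines


def build_import_list(folder):
    """Scan one folder (non-recursively) and return the import lines."""
    return lines_for(os.listdir(folder))
```

Calling `build_import_list('.')` and writing the result to a text file would give the kind of list I described in my first post.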

adamsmith · February 28, 2012

I think you're imagining this to be easier than it is.
In any case - it won't happen anytime soon. If it's going to happen, it's more likely going to be from a third party than from zotero's core dev team. Obviously patches or plugins are always welcome.

Improving automatic pdf metadata retrieval, on the other hand, is something where you're more likely to see improvements, though obviously only for pdfs with OCRd text.

danb · March 2, 2012

Thanks for your comments. It seems the only way to get external files into Zotero is via RDF using the import translator, so I will write an external script that queries OCLC and creates an RDF file with all the information.
Someone already came up with that solution:
http://forums.zotero.org/discussion/6769/basic-python-script-to-convert-a-list-of-urls-into-zotero-rdf-for-import/

danb · September 13, 2013

I have created a Python library that communicates with Zotero using MozRepl. This should make bulk import of large existing libraries with a given folder structure easier.

The idea is to do the metadata import in Python. This makes it much easier to interact with external libraries and programs, find ISBNs in the filename, or extract XMP metadata with cb2bib or similar.

For example, if I know that all articles in one folder are from a certain journal, I can limit the search to this single journal.

Also, I've noticed that ISBNs are often wrong, or that the extract-metadata function finds the wrong entry. For this reason I've decided to keep all files in their original place, so that the structure is not accidentally lost. Each item added returns a report, including all Zotero IDs for items and all attachments; this report can be saved alongside the original entry if desired.
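
One cheap way to catch some wrong ISBNs before any lookup is to verify the check digit. The sketch below uses the standard ISBN-10 and ISBN-13 checksum rules; it is not part of the library in the gist, and the function names are made up for illustration.

```python
def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum (weights 10..1, sum divisible by 11)."""
    isbn = isbn.replace('-', '').replace(' ', '').upper()
    if len(isbn) != 10 or not isbn[:9].isdigit():
        return False
    if not (isbn[9].isdigit() or isbn[9] == 'X'):
        return False
    total = sum((10 - i) * int(c) for i, c in enumerate(isbn[:9]))
    total += 10 if isbn[9] == 'X' else int(isbn[9])
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 checksum (weights 1,3,1,3,..., sum divisible by 10)."""
    isbn = isbn.replace('-', '').replace(' ', '')
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0
```

A passing checksum only shows that the number is well formed, not that it belongs to the file at hand, so the looked-up entry still needs a manual check.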

I have implemented all item types relevant for me, but it is very easy to add the rest.

The basic usage example is


with zotero() as z:
    zoteroItem=book(creators = [author('first', 'last')], title='Title')
    # notes, attachments and tags can be added as well
    z.addItem(zoteroItem)

A longer example is at the end of the library.

I took great care to read back and verify every single value, but of course it may be possible that I've accidentally sidestepped internal Zotero checks.

For example, I create all creators from scratch 'var creator = new Zotero.Creator; ', I don't know if this is the right way, or if I should first check if a creator with identical data already exists.

The code can be found at https://gist.github.com/danbe/6547077

A longer example is below (the strange unicode strings did not render correctly on the preview, but the program runs correctly on my machine).
I'd be happy about feedback:


# -*- coding: utf-8 -*-
 
if __name__=='__main__':
    # Connect to Zotero
    with zotero() as z:
        
        # First create a standalone note (unrelated to next entry)
        z.addNote(text='This is the text of a standalone note', parent=None)

        # Create library item and populate the fields
        book1 = book(
                abstractNote=u'''
This is a long abstract with some unicode content.
По оживлённым берегам

''',
                accessDate='2013-01-01',
                archive='Archive',
                archiveLocation='Archive Location',
                callNumber='Call 1234',
                creators=[
                            author(u'Ð˜Ð²Ð°Ð½ Ð˜Ð²Ð°Ð½Ð¾Ð²Ð¸Ñ‡', u'Ð˜Ð²Ð°Ð½Ð¾Ð²'),
                            author('AAuthor First 2', 'Author Last 2'),
                            
                            contributor('Contributor First 1', 'Contributor Last 1'),
                            editor('Editor First 1', 'Editor Last 1'),
                            editor('Editor First 2', 'Editor Last 2'),
                            editor('Editor First 3', 'Editor Last 3'),
                            seriesEditor('S. Editor First 1', 'S. Editor Last 1'),
                            translator('Translator First 1', 'Translator Last 1'),
                            translator('Translator First 2', 'Translator Last 2'),
                            translator('Translator First 3', 'Translator Last 3'),
                    ],
                date='1234-12-34',
                edition='1st Edition',
                #extra='Extras', # Default value None -> ''
                ISBN='12345678x',
                language=u'Русский',
                libraryCatalog='library catalog',
                numberOfVolumes='10 Volumes',
                numPages='3141',
                place='Publisher place',
                publisher='Publisher',
                rights='Rights',
                series='Series',
                seriesNumber='Series Number 1',
                shortTitle='A short title',
                title='Title',
                url='http://a.b.c',
                volume='12')

        # Add the book to the library
        # Files are linked and stay where they are
        report = z.addItem(
                book1, 
                attachmentList=['book toc.pdf', 'book chapter 2.pdf', 'cover1.png', 'cover2.bmp'], 
                tags=['defaultTag_type0', ('tag2', 0), ('tag_type_1', 1)],
                notes=['A simple text note.', u'Без муки нет науки.',
                '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head> <title>HTML Note</title> <style type="text/css" media="all"> </style> </head> <body> <h1>Heading 1 (h1)</h1> <h2>Heading 2 (h2)</h2> <h3>Heading 3 (h3)</h3> <h4>Heading 4 (h4)</h4> <h5>Heading 5 (h5)</h5> <h6>Heading 6 (h6)</h6> </body> </html>']) 



        if yamlExists: # save report for each entry
            with open('report.yaml', 'wb') as f:
                printd( yaml.dump(report))
                yaml.dump(report, f)


        # Add an article
        article = journalArticle(
                abstractNote=u'This is a long abstract for an article. It has some unicode'*5,
                accessDate='2013-09-09',
                archive='Article archive',
                archiveLocation='Nowhere',
                callNumber='123 Article',
                creators=[
                            author(u'Ð˜Ð²Ð°Ð½ Ð˜Ð²Ð°Ð½Ð¾Ð²Ð¸Ñ‡', u'Ð˜Ð²Ð°Ð½Ð¾Ð²'),
                            author('AAuthor First 2', 'Author Last 2'),
                            author('Author First 3', 'Author Last 3'),
                            contributor('Contributor First 1', 'Contributor Last 1'),
                            contributor('Contributor First 2', 'Contributor Last 2'),
                            contributor('Contributor First 3', 'Contributor Last 3'),
                            editor('Editor First 1', 'Editor Last 1'),
                            translator('Translator First 1', 'Translator Last 1'),
                            translator('Translator First 2', 'Translator Last 2'),
                    
                    ],
                date='1234-45-79',
                DOI='10.1.1.168.4008',
                extra='Extra notes for article',
                #ISSN='',
                issue='Issue 4',
                journalAbbreviation='ABC',
                language='Language',
                libraryCatalog='No Catalog',
                pages='1234564645',
                publicationTitle='Some Example Article',
                rights='Unknown',
                series='Article Series',
                seriesText='Series Text',
                seriesTitle='Series Title',
                shortTitle='A short title',
                title='The real title',
                url='citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.168.4008',
                volume='Vol 123')

        report = z.addItem(
                article, 
                attachmentList=['book toc.pdf', 'book chapter 2.pdf', 'cover1.png', 'cover2.bmp'], 
                tags=['tag1_0', ('tag2_0', 0), ('tag3_1', 1)],
                notes=['A simple text note.', u'Без муки нет науки.'])

ptinits · April 6, 2014

Dear danb,

It looks very interesting and may well solve the problems that I have. However, could you provide some more precise instructions on how to use it? I have some experience with Python, but I am otherwise by no means a programmer.

So far I have managed to install Yaml, install and start MozRepl, turn on Telnet in Windows, and make contact with MozRepl.

I am not sure exactly how the script should be accessed. Now I just tried running it, but I get a syntax error already on the first except line, just before yamlExists = False.

If I skip the check and just import yaml, then I get another syntax error elsewhere. But this is probably not the way this file should be used at all. So I'm wondering if you may be able to give some hints.

I am using Python 3.3.

The ultimate goal is to import a bib or ris file into Zotero so that it does not copy the files to its own directory, but will just import the links.

Any chance you may be able to help me? Many thanks! P

danb · May 20, 2014

Hi ptinits,

I am sorry, I was traveling and missed your question.

You are using the library correctly.

The script works like this:

You start Firefox with MozRepl. MozRepl makes the internal Javascript API available over telnet.

If you run my file zotero.py, it will connect over telnet to Firefox and add a few items as a test, the part under if __name__=='__main__'.

Can you post the exact error message that you get? It might be related to Python 3; I am still using Python 2.
In Python 3, print is a function, so you have to replace print 'hello' with print('hello'), i.e. put parentheses around the argument. For example:
print('yaml is not available, cannot store the protocol to file.')
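
As a sketch of an optional yaml import that runs under both Python 2 and Python 3: the exact lines in zotero.py may differ, and the suggestion that the failing line uses the old `except ImportError, e:` form is only a guess, since the error message has not been posted yet.

```python
from __future__ import print_function  # makes print() a function on Python 2

try:
    import yaml
    yamlExists = True
except ImportError:  # the Python 2 only form 'except ImportError, e:' fails on 3
    yamlExists = False
    print('yaml is not available, cannot store the protocol to file.')
```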

Please feel free to ask if you are still interested.

Importing bib or ris should work and be very easy; this is exactly what I wrote the script for.

I wrote a ris library as well, I can share that if you are interested.
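
Until then, here is a minimal sketch of reading RIS records in plain Python. It is not the ris library mentioned above, `parse_ris` is a made-up name, and it assumes the file path sits in an L1 field, which may not match your export. Continuation lines without a tag are simply ignored.

```python
import re

# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r'^([A-Z][A-Z0-9])  -(?: (.*))?$')


def parse_ris(text):
    """Return a list of records, each a dict mapping a tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # blank line or untagged continuation line
        tag = match.group(1)
        value = (match.group(2) or '').strip()
        if tag == 'ER':  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


sample_ris = (
    "TY  - BOOK\n"
    "AU  - Last, First\n"
    "TI  - Title\n"
    "SN  - 123456789X\n"
    "L1  - books/123456789x.pdf\n"
    "ER  - \n"
)
records = parse_ris(sample_ris)
```

Each record could then be turned into a book(...) or journalArticle(...) item and passed to z.addItem with the L1 paths as attachmentList, so the files stay linked where they are.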

Dan