Retrieving metadata from archive.org

marga_kempner · July 31, 2015

About half of the PDFs that I've imported into Zotero from the Internet Archive (archive.org) fail to retrieve their metadata when I use Zotero's "Retrieve Metadata for PDF" function. I am downloading the .pdf and .mrc files of documents from archive.org and would like each PDF to be linked with the metadata that belongs to it. Alternatively, is there a file type from archive.org that I can download that would make the metadata for the PDF automatically appear in Zotero? Many thanks in advance for any help!

aurimas · July 31, 2015

Can you link to some articles/pages you are downloading from archive.org? Zotero's Retrieve Metadata works by looking for ISBNs and DOIs in PDFs. If none are found, it then looks up some text on Google Scholar, which is geared mostly towards academic papers. If the PDFs you're downloading do not have the above identifiers, are not OCR'd, or are not academic papers, then it's likely that Zotero won't find metadata for them.
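
To make the identifier lookup concrete, here is a minimal sketch of what scanning extracted PDF text for a DOI or ISBN could look like. It is not Zotero's actual implementation: the function name findIdentifiers and both regular expressions are assumptions chosen for illustration only.

```javascript
// Hypothetical helper, not Zotero's code: look for the first DOI and the
// first ISBN in text that has already been extracted from a PDF.
function findIdentifiers(text) {
  // A DOI starts with "10.", a 4-9 digit registrant code, a slash, then a suffix.
  const doiMatch = text.match(/\b10\.\d{4,9}\/[^\s"<>]+/);
  // An ISBN-10 or ISBN-13 after the word "ISBN", with optional hyphens or spaces.
  const isbnMatch = text.match(
    /\bISBN(?:-1[03])?:?\s*((?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx])\b/i
  );
  return {
    // Drop sentence punctuation that may trail the DOI.
    doi: doiMatch ? doiMatch[0].replace(/[.,;)]+$/, "") : null,
    // Normalize the ISBN by removing hyphens and spaces.
    isbn: isbnMatch ? isbnMatch[1].replace(/[- ]/g, "") : null,
  };
}

const sample = findIdentifiers("See doi:10.1000/xyz123. ISBN 978-0-306-40615-7");
console.log(sample.doi, sample.isbn);
```

A scanned book whose text contains neither kind of identifier returns null for both fields, which is the situation in which the lookup described above has nothing to work with.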

I think we used to have a translator tailored for archive.org, but it seems that it's not working with the updated interface. We'll take a look at fixing that.

archive.org does offer metadata as MARCXML, which is a very rich metadata format and we should be able to import it, but I see that we currently do not. We'll look into fixing that as well.

We should have something working over the weekend and I'll report back here when something changes. Thanks for letting us know.

aurimas · July 31, 2015

Actually, you should be able to import the mrc files (either via Import or Import from Clipboard, which may be more convenient, because it won't create a separate collection). After you import, you can drag-drop the PDF file onto the metadata.

marga_kempner · July 31, 2015

Here are a few examples of files whose PDFs have been unable to retrieve metadata in Zotero:
https://archive.org/details/agrariantenures00unkngoog
https://archive.org/details/ahandbooktoland01earlgoog
https://archive.org/details/bodykechapterinh00normuoft
https://archive.org/details/lettertoabsentee19wigg

If there were a translator tailored for archive.org, that would be excellent! I found one online but it didn't work, probably because it isn't working with the updated interface, as you said.

Because of the scale of my project (~10,000 documents), I'm hoping to find a way to collate the metadata with the PDFs without manual drag-and-drop. I started doing this by hand, but the MARC records lose their unique identifier (e.g. bodykechapterinh00normuoft) as soon as they are imported into Zotero, while the PDFs lacking metadata keep it. Matching each MARC record with its PDF therefore means opening the PDF, scrolling down a few pages, reading the title, and then tracking down the MARC file to which it belongs. If there were a way to link the PDF with metadata automatically, I would be extremely grateful!
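
Since the identifier survives in the file names on disk, one way to avoid opening each PDF would be to pair the files by name before importing anything. The sketch below is only an illustration: the function name pairByIdentifier is hypothetical, and it assumes the downloads are named "<identifier>.pdf" and either "<identifier>.mrc" or "<identifier>_meta.mrc", which may not match every download.

```javascript
// Hypothetical sketch: group downloaded file names by archive.org identifier.
// Assumes names like "<identifier>.pdf" and "<identifier>_meta.mrc".
function pairByIdentifier(fileNames) {
  const pairs = new Map();
  for (const name of fileNames) {
    // Capture the identifier, ignoring an optional "_meta" suffix.
    const match = name.match(/^(.+?)(?:_meta)?\.(pdf|mrc)$/i);
    if (!match) continue; // skip unrelated files
    const [, identifier, ext] = match;
    const entry = pairs.get(identifier) || { pdf: null, mrc: null };
    entry[ext.toLowerCase()] = name;
    pairs.set(identifier, entry);
  }
  return pairs;
}

const pairs = pairByIdentifier([
  "bodykechapterinh00normuoft.pdf",
  "bodykechapterinh00normuoft_meta.mrc",
  "lettertoabsentee19wigg.pdf",
]);
for (const [identifier, files] of pairs) {
  console.log(identifier, files.pdf, files.mrc);
}
```

Any entry with a null pdf or mrc value marks a document whose counterpart is missing and would still need to be located by hand.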

Thanks again for your help!

adamsmith · August 3, 2015

The Zotero Internet Archive translator now works again. It won't, however, auto-attach PDFs: some of them are so big that I don't think users in general would be happy about that.

If you're interested and somewhat technically fearless, I could give you a couple of lines of code to add to a custom version of the translator, though, that would download PDFs automatically.

marga_kempner · August 5, 2015

Thank you for your help.

Unfortunately, the translator still doesn't seem to be working for me. When I upload PDFs from archive.org to Zotero, no metadata appears. I've tried all of the different kinds of file types that archive.org offers for its documents (.pdf, .xml, .txt, .djvu, .epub, .jp2) but none of these seem to include metadata either when I upload them to Zotero. Do you know what might be wrong?

If you were able to send me the code to add to a custom version of the translator to download PDFs automatically, I would be extremely grateful!

adamsmith · August 5, 2015

no, that's not how translators work -- what does work is going to
https://archive.org/details/agrariantenures00unkngoog and clicking the "Save to Zotero" icon.

aurimas · August 5, 2015

It won't, however, auto-attach PDFs: some of them are so big that I don't think users in general would be happy about that.

Since archive.org provides PDF size information on the page, do you think it would be reasonable to automatically download PDFs smaller than X MB? We could throw the actual threshold into hidden preferences.

adamsmith · August 5, 2015

it's a neat idea, but I'm not a fan of hidden behavior that seems irregular/unpredictable to users.

We could just add a hidden pref for adding PDFs per se, though, along the lines of supplements for other sites.

marga_kempner · August 5, 2015

Wonderful, it's great to have the metadata from archive.org read by Zotero! Because of the volume of my work, I'm looking to attach PDFs to the metadata automatically. Did you mention that there are a few lines of code with which I could customize the translator to download the PDFs as well? Where can I find the code for the translator itself? As you mention, a hidden preference for adding PDFs could work too.

adamsmith · August 5, 2015

The translator is the file called "Internet Archive.js" in the translator folder of your Zotero data directory:
https://www.zotero.org/support/zotero_data

You'll need to do 5 things to the file (edit in any good text editor):

1. Right after

for (i in tags) {
   newItem.tags.push(tags[i]);
}

insert this custom code:

// For books, rewrite the API URL into a download URL and attach the PDF
if (itemType == "book") {
   var pdfurl = apiurl.replace(/details(\/[^/]+)&output=json/, "download$1$1.pdf");
   newItem.attachments.push({"url": pdfurl, "title": "Internet Archive Fulltext PDF", "mimeType": "application/pdf"});
}

2. Change the priority of the translator from 100 to 99 at the top

3. Rename the translator (Label) to something like Internet Archive (Custom)

4. Change any character in the translatorID

5. Save the customized translator under a different filename (e.g. using the same as the label) in the translators folder of the Zotero data directory.

Restart Firefox (or Zotero & your Browser) and you should have your custom translator with PDF download working.
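
To see what the replace call in step 1 produces, here is a small worked example. It assumes that apiurl has the form "https://archive.org/details/<identifier>&output=json", which is what the regular expression implies; the wrapper function buildPdfUrl is a hypothetical name used only so the rewrite can be tried outside the translator.

```javascript
// Hypothetical wrapper around the replace call from step 1.
// Assumes apiurl looks like "https://archive.org/details/<identifier>&output=json".
function buildPdfUrl(apiurl) {
  // $1 captures "/<identifier>"; using it twice yields
  // "download/<identifier>/<identifier>.pdf".
  return apiurl.replace(/details(\/[^/]+)&output=json/, "download$1$1.pdf");
}

const apiurl = "https://archive.org/details/agrariantenures00unkngoog&output=json";
console.log(buildPdfUrl(apiurl));
// prints "https://archive.org/download/agrariantenures00unkngoog/agrariantenures00unkngoog.pdf"
```

If the URL does not match the pattern, replace returns the string unchanged, so the attachment would point at the API URL rather than at a PDF. The rewrite also assumes that the PDF is named after the identifier, which may not hold for every item.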

aurimas · August 5, 2015

(Let's enable this under a hidden pref)

marga_kempner · August 5, 2015

When I click the Zotero icon in Google Chrome to download PDFs from archive.org to Zotero standalone, I am getting a message that says "An error occurred while saving this item. Check Known Translator Issues for more information." I am getting the same message when I try the same action on a different computer and different account. Could the Internet Archive translator be malfunctioning?

marga_kempner · August 5, 2015

(I had not tried adding the code you suggested to the different computer's translator in the Zotero data directory before I started getting this error message, so I don't think it's something wrong with the custom code you gave me.)

adamsmith · August 5, 2015

works for me -- is that before or after trying the custom modifications? And on which URL?
edit: overlapped, but I'm still interested in the URL question.

marga_kempner · August 5, 2015

Many thanks for your help! The translator for Internet Archive is now working, but I haven't yet gotten the custom translator to download PDFs for books. A few that I've tried to do this with include (all have the option to download the book as a PDF):
https://archive.org/details/jstor-2212307
https://archive.org/details/tenureoflandinir00duff
https://archive.org/details/jstor-1814000

I followed your instructions for writing a custom translator and have saved it as a Rich Text Document in my Zotero translators directory. Do you have any ideas about why it might not be downloading PDFs?

marga_kempner · August 5, 2015

(I would be happy to send or post the custom translator somewhere -- unfortunately, the space of this forum does not allow for pasting the entire code.)

zuphilip · August 5, 2015

You can copy and paste your code at https://gist.github.com/ and post the resulting link here in the forum.

karnesky · August 5, 2015

@marga: you don't want to save it as rich text. This should be plain text and should have the .js extension. See some more at https://www.zotero.org/support/dev/translators

I'd be in favor of a Zotero-wide pref for PDF limit size. This is hardly the only site that sometimes serves very large PDFs & it seems somewhat arbitrary to treat this particular site differently.

marga_kempner · August 5, 2015

Thanks! The custom translator can be found here: https://gist.github.com/anonymous/dcf9e42dc5b095b95861

I saved the translator as a JavaScript file (.js) but it still doesn't download PDFs into Zotero standalone.

adamsmith · August 5, 2015

do you see it trying to download the PDF? (it initially appears and then there's a red cross in front?)

marga_kempner · August 5, 2015

no, it doesn't try to download a PDF.

adamsmith · August 5, 2015

when you hover over the URL bar icon, does it show your new translator name, including the (custom)? If not (as I suspect), it's not using that translator.

First, make sure you're saving in plain text format (if need be, by downloading from GitHub via the raw link). Then restart all relevant software & try again.

dstillman · August 5, 2015

I'd be in favor of a Zotero-wide pref for PDF limit size. This is hardly the only site that sometimes serves very large PDFs & it seems somewhat arbitrary to treat this particular site differently.

I agree that a global pref — visible, I think, as a dependent of the current files pref — makes more sense here. To actually check the file size across all sites, though, we'd need to either make HEAD requests before all file downloads or bail on downloads based on the Content-Length header, which may or may not be possible in our current download methods. (The former would be cleaner, but we'd have to deal with HEAD not being supported by the site or network — I guess by failing open? — and it's possible we'd have people hitting access limits more quickly if sites didn't properly account for HEAD.)
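
A rough sketch of the Content-Length check, including the "fail open" behavior when the size cannot be determined, might look like this. It is not Zotero's download code: the function names isWithinSizeLimit and headAllowsDownload are hypothetical, and the HEAD request uses the standard fetch API purely for illustration.

```javascript
// Hypothetical sketch, not Zotero's download code.
// Decide from a Content-Length header value whether a file is small enough.
function isWithinSizeLimit(contentLength, limitBytes) {
  const size = Number.parseInt(contentLength, 10);
  // Missing or malformed header: fail open and allow the download.
  if (!Number.isFinite(size) || size < 0) return true;
  return size <= limitBytes;
}

// Issue a HEAD request first and apply the check to its Content-Length.
async function headAllowsDownload(url, limitBytes) {
  try {
    const response = await fetch(url, { method: "HEAD" });
    // Servers that reject HEAD are treated as "size unknown".
    if (!response.ok) return true;
    return isWithinSizeLimit(response.headers.get("content-length"), limitBytes);
  } catch (err) {
    // Network errors on the HEAD request also fail open.
    return true;
  }
}

console.log(isWithinSizeLimit("5242880", 10 * 1024 * 1024));
console.log(isWithinSizeLimit(null, 10 * 1024 * 1024));
```

Failing open keeps current behavior for sites that do not support HEAD, at the cost of occasionally downloading a large file whose size was never reported.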

adamsmith · August 5, 2015

@Dan, aurimas, karnesky -- default to 10MB?

aurimas · August 5, 2015

@Dan, aurimas, karnesky -- default to 10MB?

No, I was agreeing with you that it should be either on or off (unless you're not talking just about archive.org anymore) and it should be off by default.

dstillman · August 5, 2015

If we want to enable PDF downloads from archive.org for now I think we can enable it with a hard-coded limit in the translator just to avoid saving ridiculously large files. I'm not in favor of adding a translator-specific pref for this, even temporarily. People can always drag in a PDF if necessary.

aurimas · August 5, 2015

To actually check the file size across all sites, though, we'd need to either make HEAD requests before all file downloads or bail on downloads based on the Content-Length header, which may or may not be possible in our current download methods. (The former would be cleaner, but we'd have to deal with HEAD not being supported by the site or network — I guess by failing open? — and it's possible we'd have people hitting access limits more quickly if sites didn't properly account for HEAD.)

Before I forget, we can also let translators indicate the attachment size, since many websites display that information on the page.

dstillman · August 5, 2015

Oh, yeah, that's a good idea.

adamsmith · August 5, 2015

If we want to enable PDF downloads from archive.org for now I think we can enable it with a hard-coded limit in the translator just to avoid saving ridiculously large files.

OK, as I say above, I find this a little on the opaque side of things but sure, why not. Same question, though -- what's ridiculous? 5, 7, or 10 MB?

dstillman · August 5, 2015

This is just meant to be temporary to allow saving by default until there's a proper solution. A translator-specific pref isn't really any better.

I'm fine with 10 for the limit.
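
As a rough illustration of a hard-coded limit combined with the size shown on the page, the sketch below parses a size label and compares it against 10 MB. Everything in it is an assumption made for the example: the function names, the label format (something like "4.2 MB" or "38.5M"), and the choice to skip the attachment when the size is unknown.

```javascript
// Hypothetical sketch of a hard-coded attachment limit; not the translator's code.
const MAX_PDF_BYTES = 10 * 1024 * 1024; // the 10 MB limit discussed above

// Parse a size label such as "4.2 MB" or "38.5M" into bytes (assumed format).
function parseDisplayedSize(label) {
  const match = /^\s*(\d+(?:\.\d+)?)\s*([KMG])B?\s*$/i.exec(label);
  if (!match) return null; // unrecognized label
  const multipliers = { K: 1024, M: 1024 ** 2, G: 1024 ** 3 };
  return Math.round(parseFloat(match[1]) * multipliers[match[2].toUpperCase()]);
}

// Attach the PDF only when the size is known and within the limit.
function shouldAttachPdf(label) {
  const bytes = parseDisplayedSize(label);
  return bytes !== null && bytes <= MAX_PDF_BYTES;
}

console.log(shouldAttachPdf("4.2 MB"), shouldAttachPdf("38.5M"));
```

Skipping the attachment when the size cannot be read is the conservative choice here, since a PDF can always be dragged in by hand afterwards.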