Retrieve metadata from ProQuest PDFs

y.sapolovych · September 1, 2018

Many ProQuest databases can export full texts. Whether in PDF, TXT or RTF, the exports include metadata, BUT at the very end of the document. For example, in a PDF it looks like this:
https://s15.postimg.cc/gpx1kkf8r/proquest.png

The problem here is that Zotero can't retrieve this metadata automatically. It's a shame, as it is really well-structured and complete. Just in case: I couldn't find an option to move it to the beginning of the document. The first page does have a small header with author, title and date of publication, but that's all there is to it. So:

1. Isn't there some way to get metadata from ProQuest PDFs? Perhaps I'm missing something? Maybe there's a workaround?
2. In case there is - another thing. PQ also provides the ability to save articles in batches of 20, 50 and 100, but they are saved as a single file, and the metadata table is shown at the end of each respective text. Is there a way to read one PDF and get multiple items? I, of course, doubt it. Fortunately, PDFs have bookmarks and can be easily split, but it'd still be so cool to just load one PDF and get multiple entries...

bwiernik · September 1, 2018

Zotero has a translator for the ProQuest website, so the Zotero button should generally work. Can you post a specific URL that is not working?

y.sapolovych · September 2, 2018

The translator works fine with single pages, but it can't save multiple snapshots - instead of full texts I get failed captchas.

bwiernik · September 2, 2018

Oh, that is a different issue entirely. ProQuest has fairly aggressive protection from automated scrapers. If you are trying to import multiple pages of results in rapid succession using the Zotero button, you will likely look like a bot to ProQuest and get blocked. Instead, check the items from the search results you want and use ProQuest’s export button to download a RIS file to import.

You won’t get PDFs initially, but you can still add them automatically. The feature is currently in the Zotero beta, so install that. Then, select a small batch of items without PDFs, right click, and choose Find Available PDFs. Zotero will then download and attach the PDFs that you have access to at the item DOI/URL.

See:
https://www.zotero.org/support/getting_stuff_into_your_library#large-scale_imports_from_databases

y.sapolovych · September 2, 2018

Thank you, I've already tried the new functionality and think it is splendid (albeit still quite raw). I'm also aware that ProQuest (like most aggregators) is quite harsh on these things. But the translator actually handles saving ProQuest PDFs just fine.

The issue here is that most full texts I run across are HTML pages, and while the connector can save them one at a time, it still can't do batches. I'm not talking about successive (either quick or slow-paced) downloads - it fails even on the first batch of 20 or so in two days (or even the first one I've tried in about half a year).

So I thought that PQ's built-in export feature, like you suggested, might do. BUT it can't download both the texts and a Zotero-readable collection file (e.g. in RIS) simultaneously. The latter does not have attached texts (only linked web pages). The text files DO have metadata inside them, but Zotero can't read it.
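For anyone who wants to try turning that embedded metadata into something Zotero can import, here is a minimal Python sketch that reads "Label: value" lines from a TXT export and writes an RIS record. The field labels in `LABEL_TO_RIS`, the semicolon-separated author list and the function names are all assumptions for illustration, not a documented ProQuest format, so check them against your own export first.

```python
import re

# Hypothetical mapping from export field labels to RIS tags.
# The labels are assumptions; adjust them to what your files contain.
LABEL_TO_RIS = {
    "Title": "TI",
    "Author": "AU",
    "Publication title": "T2",
    "Publication date": "PY",
    "DOI": "DO",
}

# A line such as "Publication title: Example Journal"
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z ]*?):\s+(.*\S)\s*$")


def parse_metadata(text):
    """Collect known 'Label: value' lines; later lines overwrite earlier ones,
    so a table at the end of the file wins over a header on the first page."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m and m.group(1) in LABEL_TO_RIS:
            fields[m.group(1)] = m.group(2)
    return fields


def to_ris(fields, ris_type="JOUR"):
    """Render the parsed fields as a single RIS record."""
    lines = [f"TY  - {ris_type}"]
    for label, tag in LABEL_TO_RIS.items():
        value = fields.get(label)
        if not value:
            continue
        if tag == "AU":
            # Assumption: multiple authors are separated by semicolons.
            values = [v.strip() for v in value.split(";")]
        elif tag == "PY":
            # Keep only the four-digit year, if one is present.
            year = re.search(r"\d{4}", value)
            values = [year.group(0)] if year else []
        else:
            values = [value]
        lines.extend(f"{tag}  - {v}" for v in values if v)
    lines.append("ER  - ")
    return "\n".join(lines) + "\n"
```

This only handles one article per file, and a body line that happens to look like "Title: ..." would be picked up as well, so the output would need checking before import.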

I've tried a 'middle ground approach' - exporting both and then syncing. So I found out that Mendeley (not Zotero:( ) is capable of merging items in a collection with multiple files - provided the latter have names similar to respective entries. Though after several hours of turmoil I was not exactly successful - most items did merge properly, but about 1/3 didn't. So this is not an option sadly.
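The name-matching idea can also be sketched outside of any reference manager. The Python below pairs entry titles with exported file names using `difflib` from the standard library. It is an illustration only, not how Mendeley does it: `match_files`, `normalize` and the 0.6 cutoff are hypothetical choices, and the sketch assumes the file names resemble the titles.

```python
import difflib
import re
from pathlib import Path


def normalize(s):
    """Lowercase and keep only alphanumeric words, so punctuation,
    underscores and spacing do not affect the comparison."""
    return " ".join(re.findall(r"[a-z0-9]+", s.lower()))


def match_files(titles, filenames, cutoff=0.6):
    """Map each title to the most similar unused file name,
    or to None when nothing reaches the similarity cutoff."""
    remaining = {normalize(Path(f).stem): f for f in filenames}
    result = {}
    for title in titles:
        best = difflib.get_close_matches(
            normalize(title), list(remaining), n=1, cutoff=cutoff
        )
        # Remove a matched file so it cannot be assigned twice.
        result[title] = remaining.pop(best[0]) if best else None
    return result
```

Matching is greedy in the order the titles are given, so near-duplicate titles can still end up with each other's files. Anything mapped to None is left for manual attachment.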

adamsmith · September 2, 2018

OK, but once you're looking at HTML, the retrieve metadata function doesn't currently seem to apply and the scenario where you'd want to use that is too rare to warrant the effort.

But if ProQuest does have an HTML but not a PDF full text, Zotero should try to get that. Do you have an example?

y.sapolovych · September 3, 2018

If you (excuse me if I misunderstood) mean retrieving metadata from local files - I'm not talking about HTML here, but PDF. You can export PQ full texts in PDF, RTF or TXT - I've tried all options, and the 'retrieve metadata' function is triggered only by importing PDFs to a collection. But it still fails to get it from ProQuest PDFs ('No matching references found'). And it's a pity, as they contain metadata in table form (as on the screenshot in the opening post).

I guess I could send you several PDFs if you wanted to take a look.

philosophical · April 29, 2022

Hi, when I download a PDF from the ProQuest ebook website, Zotero says it cannot retrieve metadata. Could you please tell me how to retrieve metadata from a PDF downloaded from ProQuest?

AbeJellinek · April 29, 2022

You should generally be using the Connector's Save to Zotero button to save ProQuest e-books rather than adding PDFs manually and retrieving metadata. Is there a page where that isn't working? Post the link and we'll take a look.

philosophical · April 29, 2022

Using the connector it indeed works. Thanks