automating mass-import from PDFs

  • edited June 17, 2012
    Look, I'm sorry I offended you yesterday; I felt you were being unnecessarily pushy and impatient with people who were trying to be helpful. There is no need to react with hostility to everything I say - I'm not here to aggravate you, so let's just forget about those exchanges and get along with improving the retrieve metadata feature.

    I'm not contradicting myself on the goals: I think (and said) any solution should not _only_ work for cases like yours, but it certainly _should_ work for cases like yours. We're getting a pretty solid number of reports on the GS lock-out problem and reducing them should certainly be a goal of improving the retrieve metadata feature.

    Timing/throttling these better should be done (and core-devs agree). Unfortunately we don't know how Google determines when to lock an IP out, so I also think at least reducing the reliance on GS should be a goal.
    Constructively, taking your suggested order of queries, I wouldn't get rid of the CrossRef DOI query that currently precedes the GS query. Where we can get the data via DOI we should.

    Edit: I agree - and say so above - that what you propose is clearly better than the status quo, but while we're at it I'd really like to - if not fix, at least alleviate - the lock-out problem, too.
  • We can probably come up with a way to detect Google lockouts. I'm not sure how to get that information from the translator to recognizePDF. If we do detect it, we can probably notify the user and wait for a set amount of time before continuing. Slow, but it would probably be important to have this feature even if we manage to trick GS into not locking us out in the first place. I'll take a closer look at this tonight.
  • If someone is aware of another good database encompassing many fields of study and containing full text indices of articles that we can use instead of Google Scholar, that would help improve the metadata retrieval process.
    JSTOR indexes and searches full text. Unfortunately, they have modified their site so that without access users don't even get metadata, so this would only work for users with institutional access.
    ScienceDirect, SpringerLink, Taylor & Francis - and I assume many other journal publishers - index & search full text.
  • edited June 17, 2012
    adamsmith, you're sorry, but you still label me hostile. More valid logic is that, if that were the case, I'd surely be wasting my time offering suggestions and possible solutions despite the channel noise, e.g. http://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/#Comment_127908 . Due to hostility blinding syndrome, I can see plenty of "don'ts" from you but I can't see your proposed solution for the current issue that I could implement. Could you repeat it, please?

    You're right though, let's get along and forget all that. Before I go completely blind.

    GS is there for a reason. It provides a public service. Not using it is in my opinion a bad idea since, to my knowledge, there is no other full-text engine that comes close to its vastness and search results quality. I'm pretty sure Google is basing its lockout decision on the rate of requests per second coming from the same IP, possibly its regularity too. Since it's a public service, Google does expect multiple requests from the same IP within a period of time. The lockout then can be mitigated along the lines of what I was suggesting above:

    - repeated requests from Zotero (e.g. mass imports) could have an inter-request delay of N seconds
    - the delay can also be randomized with a lower/upper bound so that the Google servers are happy that it follows normal usage -- it would emulate human behaviour (see the sketch after this list)
    - try a few other engines (e.g. Scopus) before GS? Randomize their order when doing mass-imports?
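
    Something along these lines, just to illustrate the delay idea (a sketch only; the bounds are arbitrary):

    # sketch: wait a random 5-15 seconds between successive GS queries
    MIN_DELAY=5
    MAX_DELAY=15
    for pdf in *.pdf; do
        # ... extract the search string and query GS for this pdf here ...
        sleep $(( MIN_DELAY + RANDOM % (MAX_DELAY - MIN_DELAY + 1) ))
    done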

    We have good links with Google so I could find out the lockout criteria from them. How long is the lockout anyway?

    Implementing a better metadata retrieval for drag-and-dropped pdfs in Zotero is what we should concentrate on. This GS lockout problem is actually a side issue that could be discussed separately. It's not a show-stopper and can be fixed independently in my opinion.

    As for the DOI, sure, Zotero can look for a DOI first and then continue with the second part of my algorithm above.

    My post above details an algorithm that for me worked surprisingly well for a number of different pdfs I threw at it:

    http://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/#Comment_127908

    I'm going to write a shell script that implements this and post it here if people want to try it with their pdfs and provide feedback for the devs. Maybe it'll be a pushy and hostile step forward for Zotero, or maybe Mendeley would be more interested and appreciative to hear of it. They are missing this feature too.

    p.s. I was actually thinking to purchase more space on Zotero before recommending it to my research group. This thread has been enlightening. Thanks, adamsmith.
  • I'm sorry. I tried to smooth the waves, apparently without success.

    I think your effort is appreciated and worthwhile. I think improving the feature is important, that's why I've stuck around here.
    I tried to contribute knowledge where I could - including pointing you to relevant parts of the code and raising issues that had come up for others in the past so you wouldn't have to repeat their mistakes - I'm sorry you just took the latter as don'ts - I figure pointing out what _won't_ work is a way of saving (you) time, too.

    From how your proposal developed - if I understood your first posts correctly, you hadn't initially envisioned using Google Scholar for this at all - you got something out of this discussion that was helpful to you as well, so I really do wish we could get over the fact that we apparently rubbed each other the wrong way - online communication can be tricky.
    So can we now actually try to get along?
    (I don't work for Zotero btw., so unless you want to base your purchasing decisions on an obnoxious community power user and occasional contributor that should be irrelevant).

    As for the proposal - I think what you outlined sounds good in general.
    As I said, I'd be happy if we can _reduce_ the reliance on GS, but I agree with you that we likely won't be able to take it out without losing accuracy.

    Some thoughts:
    - I think it would be worthwhile testing whether pdf-extract does actually outperform pdftotext - have you experimented with both or is this just about binary pdfs (how common are those?)? (reason being: pdftotext is faster(?), smaller & already used by Zotero for full-text indexing.)

    - I think that all of this will work better if it's patched into Zotero directly rather than run through a shell script - not least because I'm not sure if the javascript api I point you to above allows you to access translator functions - I wasn't quite following what you were trying to do when I linked that. (As you'll see it's not super well documented and the existing documentation is for data access and manipulation.) But I may be missing something in how you're approaching the script..

    - Your third step is close-ish to what Zotero already does (it tries multiple times with different snippets, exactly like you suggest), so you may be able to just use some of that existing code (see more below).

    - There are some other things in that code that are based on past experience that will likely be helpful: e.g. it blacklists the page that google pre-pends for google books, it removes quotation marks etc.

    - What the current js does is to search for lines that are around the median line length in the file - my guess would be that this is likely to beat out your idea of starting at p.2 for working papers and other individually published pdfs that often have a couple of pages up front before the "actual" text starts - might be worth considering. I'd guess that it makes little difference for journal articles either way.

    Aurimas links to the existing code above - it starts at l. 218 of recognizePDF. Hope that's helpful.
  • edited June 17, 2012
    From how your proposal developed - if I understood your first posts correctly, you hadn't initially envisioned using Google Scholar for this at all - you got something out of this discussion that was helpful to you as well, so I really do wish we could get over the fact that we apparently rubbed each other the wrong way - online communication can be tricky.
    I've already pretty much solved my mass-import issue without GS. Since then I've kept contributing to this thread towards a feature useful in more general cases, i.e. for Zotero as a product. Slightly obvious from my later posts, I'd say.
    So can we now actually try to get along?
    It was finally getting interesting, with mentions of rubbing and feelings of regret, and now you want to end it so abruptly. Very cruel.
    - I think it would be worthwhile testing whether pdf-extract does actually outperform pdftotext - have you experimented with both or is this just about binary pdfs (how common are those?)? (reason being: pdftotext is faster(?), smaller & already used by Zotero for full-text indexing.)
    pdftotext failed on all my binary encoded pdfs, which are predominant. I advise staying away from it. pdf-extract did not fail.

    Are you sure Zotero uses pdftotext for full-text indexing? pdftotext produces no output on this pdf yet Zotero appears to have full-text indexed it. Try it.
    - I think that all of this will work better if it's patched into Zotero directly rather than run through a shell script - not least because I'm not sure if the javascript api I point you to above allows you to access translator functions - I wasn't quite following what you were trying to do when I linked that. (As you'll see it's not super well documented and the existing documentation is for data access and manipulation.) But I may be missing something in how you're approaching the script.
    Well, right now I lack the time to learn and figure out the Zotero API (I spent a while sorting out my mass-import issue using other means), and you just provided good encouragement regarding code documentation.

    I was hoping that general users might try a shell script quicker than fiddling with a patched Zotero -- could be the other way around though. The idea was that we'd want to see if my method is actually sufficiently accurate by receiving more feedback.

    Should I ask who was pushy for wanting to patch Zotero? (it's that syndrome again)
    - Your third step is close-ish to what Zotero already does (it tries multiple times with different snippets, exactly like you suggest), so you may be able to just use some of that existing code (see more below).
    That sounds good, I'll have a look. I presume that is in "recognizePDF.js" as mentioned above by @aurimas. I have my doubts about the current Zotero approach though (as outlined by @aurimas above). It failed for me many times and there can be several reasons for that -- but I do need to perform more testing. Is Zotero filtering out non-sane words (reference numbers, figure/table text, etc)? Google Scholar indexes full text, but may also perform similar filtering, and its pdf-to-text conversion is also not perfect. Putting quotes around the search string can make it worse -- although I know the reasons behind it -- since that exact string may not be found if GS's pdf2text conversion and indexing do not match Zotero's pdf2text conversion and search string construction. In my method, I'm trying to overcome these obstacles and search GS as is, without quotes around the search string. If GS indexed improperly (e.g. concatenated lines caused two words to appear as one) then such a search is more likely to get a good match. Worked until now for me, but more testing is needed, obviously.
    - There are some other things in that code that are based on past experience that will likely be helpful: e.g. it blacklists the page that google pre-pends for google books, it removes quotation marks etc.

    - What the current js does is to search for lines that are around the median line length in the file - my guess would be that this is likely to beat out your idea of starting at p.2 for working papers and other individually published pdfs that often have a couple of pages up front before the "actual" text starts - might be worth considering. I'd guess that it makes little difference for journal articles either way.
    Yes, identifying a good starting point for body text is important. The average line length method seems good. I'm not that sold on it though. Maybe a better method than both would be to start looking half-way through the document (total pages / 2 + 1). That is pretty much bound to be outside those first pages ... unless it's a one-page article :)
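
    For example, something along these lines (a rough sketch of the idea, assuming pdfinfo is available alongside pdftotext):

    # sketch: start text extraction half-way through the document
    pages=$(pdfinfo article.pdf | awk '/^Pages:/ {print $2}')
    start=$(( pages / 2 + 1 ))
    pdftotext -raw -f "$start" article.pdf -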
    Aurimas links to the existing code above - it starts at l. 218 of recognizePDF. Hope that's helpful
    It is indeed. aurimas mentioned it too above.

    The second part of this feature (using the acquired GS meta info to search consecutively in a list of publishers/engines to get more accurate metadata) is where I'd spend more time than someone familiar with the code. Maybe they can point me further to what parts of code are dealing with that.
  • Are you sure Zotero uses pdftotext for full-text indexing?
    Yes. It includes slightly modified versions, but IIRC the only difference is that the versions packaged with Zotero don't open a terminal window. (The installed versions are listed in Zotero under preferences --> search)

    I can't explain the difference between running pdftotext from the commandline and Z's ability to index a pdf, but if a pdf is indexed by Zotero it is through pdftotext - Simon is in charge of that at core dev so he'd know more.
    I was hoping that general users might try a shell script quicker than fiddling with a patched Zotero -- could be the other way around though. The idea was that we'd want to see if my method is actually sufficiently accurate by receiving more feedback.
    Among the more technically sophisticated users who are also sufficiently present here on the forum we have a good cross-sample of disciplines - I think having half a dozen people trying this out will actually provide pretty good feedback - good enough certainly to put it in a trunk version of Zotero which will then receive wider testing.
    The second part of this feature (using the acquired GS meta info to search consecutively in a list of publishers/engines to get more accurate metadata) is where I'd spend more time than someone familiar with the code. Maybe they can point me further to what parts of code are dealing with that.
    In l. 364 (DOI/CrossRef) and l. 416 (GS) of the recognizePDF code, you see Zotero calling the respective translators. This will currently only work with translators that are marked and implemented as "search" translators (I link to Worldcat above as an example; afaik GS, CrossRef, and COinS are the only others currently).
    To turn a translator into a search translator it needs to be marked as such and it needs a detectSearch and doSearch function.
    http://www.zotero.org/support/dev/translators/coding#search_translators
    (I know, that's not a lot of documentation - Aurimas and I have worked a lot with translators so we could just adapt a couple.)

    For a start it's probably easiest to just take the first GS result and don't query the user (though I wouldn't mind asking the user in the final product, either) - I think we should just be able to get that using var title = item.title somewhere around the current l. 419 and then just write another translator call further down.
  • I can't explain the difference between running pdftotext from the commandline and Z's ability to index a pdf, but if a pdf is indexed by Zotero it is through pdftotext - Simon is in charge of that at core dev so he'd know more.
    Zotero uses a modified version of pdfinfo, but we use a standard (albeit old) version of pdftotext. I'm not sure what you mean by "binary encoded pdfs," but I have yet to see anything that does a better job of extracting text from a PDF than pdftotext. OTOH, what we're looking for is really something that returns text similar to Google Scholar, regardless of how accurately that text reflects what's actually in the PDFs.

    Zotero filters out figure captions by using only lines close to the median line length. If you don't use quotes around the search string, your false positive rate will be much higher. However, I'd be interested to learn of PDFs for which our current approach fails and something else succeeds.

    I agree that there's value in doing a search at another database after a Google Scholar lookup. However, I don't particularly like the idea of making the user select the database(s) to use for lookup, because I think the added complexity will mean that 90% of users don't end up using the feature. I think the low-hanging fruit is to follow links from Google Scholar to publisher websites, and then run the translators on the publisher websites. Another option is to try a search on the publisher's website based on journal title.

    It may also be worth looking at other tools for extracting PDF metadata (Yao and Pitman has one list). I'm somewhat averse to SVM/CRF-based tools because they don't seem to work all that well, and they will either require bundling a large chunk of platform-specific code with Zotero or rewriting a large chunk of code in JavaScript. (The former isn't a huge deal, as long as the platform-specific code comes in the form of a small executable or library rather than a Python/Ruby/Perl interpreter + a bunch of modules.) However, it looks like pdfmeat has a simple technique for grabbing abstracts with a regexp that often works and might be worth implementing.
  • edited June 18, 2012
    @Simon: I posted above a link to a pdf for which pdftotext produces no output at all for me: this one

    Binary encoded -- I meant binary charset. If you're using linux, try "file -bi filename.pdf" to get the mime type. You'll see that for the pdf I linked above it's "application/pdf; charset=binary" while for non-binary pdfs it's just "application/pdf". I seem to have predominantly binary ones ...
    I have yet to see anything that does a better job of extracting text from a PDF than pdftotext
    Like I said above, CrossRef's pdf-extract does better for me, and doesn't choke on binary pdfs (it actually worked on everything I threw at it). It's slow though.
    Zotero filters out figure captions by using only lines close to the median line length. If you don't use quotes around the search string, your false positive rate will be much higher. However, I'd be interested to learn of PDFs for which our current approach fails and something else succeeds.
    Sure, read my post above. Zotero fails brilliantly on a book on Kalman filters (it happily fetches a chemistry article). The page I linked above is from that book. I can pass on the entire pdf if you give me your email.

    The method I described earlier (see below for script) works. In fact, it gave me better results overall than Zotero on the pdfs I tried. Apart from the flaky pdftotext implementation, there are many things that can go wrong when adding quotes without better filtering the actual search string.

    I wrote a script which takes a slightly different approach. See below.
    I think the low-hanging fruit is to follow links from Google Scholar to publisher websites, and then run the translators on the publisher websites. Another option is to try a search on the publisher's website based on journal title.
    That's pretty much what we've been discussing until now.

    I just quickly wrote a bash script that implements my method detailed earlier: https://www.dropbox.com/s/7xwuf3vemtlhhsf/pdf-meta.sh

    Syntax: pdf-meta.sh filename.pdf

    Execute without arguments for more options.


    (Edit: script changed to use pdf-extract.)

    The script does the following:

    - start from the 2nd page of the document to skip any boilerplate pages (you can specify the first page to start from in arguments)

    - find the median line length in nr of chars after eliminating very short lines

    - find all chunks of text that have consecutive lines within 10% of the median

    - find the biggest such chunk

    - concatenate the lines using space ' '

    - deal with hyphenated words by passing their concatenated non-hyphened version through aspell, the idea being that if Latex hyphenated it then it did so based on a dictionary too; this preserves words like "non-overlapping" but fixes words like "per- formance" (I left comments of alternative approaches to this for other languages)

    - search for the longest train of consecutive sane words, i.e. without funny chars or numbers (yes, numbers too to avoid funny inline maths that is not always extracted correctly) but keep harmless punctuation (, . ; etc)

    - extract the first 25 of these words or the maximum available (you can change this via arguments)

    You now have 25 words that are safe to use in search engines and that virtually guarantee a match if one exists, as they don't contain any problematic words that Google indexed differently. Without quoting the search string, I was able to find a book while searching on Google (not Scholar), which I couldn't do with quotes, nor with Zotero.
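
    In shell terms, the core of it boils down to something like this (a heavily simplified sketch - the actual script also handles the hyphenation fix via aspell and looks for the longest consecutive run of sane words):

    # simplified sketch of the search-string construction
    txt=$(pdftotext -raw -f 2 article.pdf - | grep -v '^.\{0,30\}$')   # start at p.2, drop very short lines
    # median line length in characters
    median=$(echo "$txt" | awk '{print length($0)}' | sort -n | awk '{a[NR]=$1} END {print a[int(NR/2)+1]}')
    # keep lines within 10% of the median and join them with spaces
    body=$(echo "$txt" | awk -v m="$median" 'length($0) >= 0.9*m && length($0) <= 1.1*m' | tr '\n' ' ')
    # take the first 25 "sane" words: letters plus harmless punctuation, no digits or odd symbols
    echo "$body" | tr ' ' '\n' | grep -E '^[A-Za-z]+[,.;:]?$' | head -n 25 | tr '\n' ' '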

    The same string can be used to search on normal Google, and then parse the results looking for publisher websites. Don't use quotes on normal Google, as the full-text index on Google is different (a subset) from the one on Google Scholar. This worked for me on a lot of pdfs and may be something to try to reduce the strain on Google Scholar.

    I'm curious to know how it works for your pdfs. Until now I got better results on a number of my pdfs but it needs testing on a lot more before drawing any conclusion.


    For my script, you'll need:

    - aspell (for hyphenation fix)
    - pdftotext (apt-get install poppler-utils; on windows download xpdfbin-win-3.03.zip)

    Regards.
  • Zotero's pdftotext (and pdfinfo) are in your Zotero data folder:
    http://www.zotero.org/support/zotero_data
  • edited June 18, 2012
    Binary encoded -- I meant binary charset. If you're using linux, try "file -bi filename.pdf" to get the mime type. You'll see that for the pdf I linked above it's "application/pdf; charset=binary" while for non-binary pdfs it's just "application/pdf". I seem to have predominantly binary ones ...

    [...]

    Like I said above, CrossRef's pdf-extract does better for me, and doesn't choke on binary pdfs (it actually worked on everything I threw at it). It's slow though.
    This doesn't make any sense to me. pdftotext (both the poppler version and the xpdf version) works fine on that PDF, and your chance of finding a PDF with only 7 bit characters or only ISO-8859-1 or UTF-8 encoded characters is essentially nil if they contain images or embedded fonts or use compressed streams, so pretty much all PDFs have a "binary charset." There may be something that happens to make your build of pdftotext choke and also happens to make file -bi return "charset=binary." One candidate is a compressed stream. (I don't know what's wrong with your pdftotext build, but it works for me.)
    Sure, read my post above. Zotero fails brilliantly on a book on Kalman filters (it happily fetches a chemistry article). The page I linked above is from that book. I can pass on the entire pdf if you give me your email.
    Books are hard because the first n pages don't necessarily contain unique content. We could definitely do a better job here. But you can email it to me at simon@zotero.org.
    Apart from the flaky pdftotext implementation, there are many things that can go wrong when adding quotes without better filtering the actual search string.
    My hunch is that you are more likely to end up with false positives if you don't add quotes, but I will look into this further.
    I think the low-hanging fruit is to follow links from Google Scholar to publisher websites, and then run the translators on the publisher websites. Another option is to try a search on the publisher's website based on journal title.

    That's pretty much what we've been discussing until now.
    As far as I can tell, you've been discussing searching an alternative database using the results from Google Scholar. My suggestion is to follow links from Google Scholar to other databases for which we have translators instead of using a second search.

    I'll take a look at your script and see how it compares to the current approach on my PDFs, although it looks like the biggest difference is the presence/absence of quotes.
    Can I please have a copy of your pdftotext binary?
    It's in your Zotero data directory.
  • edited June 18, 2012
    As far as I can tell, you've been discussing searching an alternative database using the results from Google Scholar. My suggestion is to follow links from Google Scholar to other databases for which we have translators instead of using a second search.
    Actually, unless I'm missing what you're trying to say, that's what I was saying as well. To use GS to fetch some (low quality) metadata first, and then to use that metadata to fetch fuller, higher quality metadata from other publishers. However, GS does not (always?) mention the publisher or location of a particular article to go directly there and fetch it so multiple ones have to be tried. What am I missing?
    I'll take a look at your script and see how it compares to the current approach on my PDFs, although it looks like the biggest difference is the presence/absence of quotes.
    Hmm, you may have missed the way I build and filter the search string. That's different from what aurimas described Zotero does now. I'd say that's the biggest difference. Do have a look and let me know.
    Can I please have a copy of your pdftotext binary?

    It's in your Zotero data directory.
    Silly me ... I was using pdftotext wrongly by not supplying a 2nd argument (stdout or file), so it was outputting a .txt file with the same filename without me noticing. I guess coffee is no substitute for sleep after all, since I used pdftotext heavily for my PhD thesis some years ago (which also has a binary charset) and it worked just fine.

    (Edit: removed paragraph about concatenated lines)

    I will adapt my script to use pdftotext as it's so much faster and, I discovered, actually works for some of the pdfs where pdf-extract chokes.
  • It does have a few quirks: it concatenates lines itself, but not always - it depends on the document. This sometimes ends up with a big standard deviation in line length, as text lines can be very long but few compared to the lines of garbage text from graphs/tables. I just tried it with some papers.
    Ignore that please, I forgot about the -raw switch to pdftotext.
  • edited June 18, 2012
    Actually, unless I'm missing what you're trying to say, that's what I was saying as well. To use GS to fetch some (low quality) metadata first, and then to use that metadata to fetch fuller, higher quality metadata from other publishers. However, GS does not (always?) mention the publisher or location of a particular article to go directly there and fetch it so multiple ones have to be tried. What am I missing?
    If you click the link that says "All n versions," for most PDFs, one of the versions will be the publisher's. Some listings are not the publisher, but it would be possible to prioritize which sites we use based on metadata quality. For PDFs included in CrossRef, we could also use CrossRef to resolve the publisher.
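
    For example, the DOI resolver's redirect will usually tell you where the publisher's copy lives (a two-line sketch with a placeholder DOI, using curl for brevity):

    # follow the dx.doi.org redirect chain and print the final (publisher) URL
    curl -Ls -o /dev/null -w '%{url_effective}\n' "http://dx.doi.org/10.xxxx/placeholder"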
    Hmm, you may have missed the way I build and filter the search string. That's different from what aurimas described Zotero does now. I'd say that's the biggest difference. Do have a look and let me know.
    I have my doubts regarding whether your other changes make a huge difference. You are using aspell to try to fix words that have been hyphenated. We just construct our search string to ignore them. (If Google Scholar indexed the PDF, rather than HTML, this will probably work better, and if not, it probably won't hurt.) You try to filter out lines that might be figure legends, but I think that for most PDFs the length of figure legend lines will differ greatly from the median, so I'm not sure this is a big issue in practice.
    It does have a few quirks: it concatenates lines itself, but not always - it depends on the document. This sometimes ends up with a big standard deviation in line length, as text lines can be very long but few compared to the lines of garbage text from graphs/tables. I just tried it with some papers.
    Use -layout.
  • However, GS does not (always?) mention the publisher or location of a particular article to go directly there and fetch it so multiple ones have to be tried. What am I missing?
    the difference is how to decide where to (try to) get the better metadata. I took you as saying that we would let users define a prioritized order of databases, which would then be queried with the data from GS - that's certainly what I had assumed until Simon's post.
    Simon suggests to just follow the GS link for the respective item, which goes right to the publisher or a database Zotero works with like JSTOR in what looks to me at least 90% of all cases and then translate from that page.
    If that strategy works - and looking at some random GS results it just might - it has two major advantages: 1. It doesn't require user input (such as the ranking of databases) and 2. It will work for all Zotero translators, even ones that a user hasn't thought of and ones that Zotero only supports via a universal translator such as Highwire 2.0 - that's especially useful in fields (like many humanities) that don't have an almost-universal database like IEEE Xplore.

    The downsides are that 1. Zotero may not have a translator for the site, 2. the user may not have access to that particular version of the paper and 3. GS may link to an old copy of the paper - e.g. a working paper version of an article etc.
    This last one would be the biggest concern.
  • edited June 18, 2012
    Use -layout.
    I just corrected my childish error above. It's actually -raw. Using -layout causes double columns to appear as double columns in the text, which breaks semantics when fetching a text line.
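
    For reference, the difference between the two on a two-column paper (output to stdout):

    pdftotext -layout paper.pdf -   # preserves the physical layout: the two columns sit side by side on each output line
    pdftotext -raw paper.pdf -      # content-stream order: for most papers each column comes out as its own block of lines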
    I have my doubts regarding whether your other changes make a huge difference. You are using aspell to try to fix words that have been hyphenated. We just construct our search string to ignore them. (If Google Scholar indexed the PDF, rather than HTML, this will probably work better, and if not, it probably won't hurt.) You try to filter out lines that might be figure legends, but I think that for most PDFs the length of figure legend lines will differ greatly from the median, so I'm not sure this is a big issue in practice.
    Yes, but you apparently retain everything in the line, which is not necessarily what GS indexes, and then a quoted search string fails. If you filter out words, then again a quoted search string is likely to fail. Lines of text in papers I write or have contain maths, references, funny symbols, etc. These still end up close to median length. I experienced problems with this yesterday as GS did not extract or index the same full text.

    My method worked fine since I concatenate all lines in the biggest contiguous block of "likely text", fix words, then look for as many consecutive words that are problem free (only text and harmless punctuation) as possible, from which I optionally select a contiguous subset. I then search that which is not (less) prone to errors due to different full-text indexing or text retrieval mismatching between what I and GS do. Searching with or without quotes worked for me just fine, precisely owing to the nature of the search string.

    It's fine if you don't wish to try it, I can have my own patched Zotero version :). Does still need testing though. I'm posting an updated script which uses pdftotext so no more pdf-extract + Ruby or ghostscript business. People can then try it much quicker.
  • edited June 18, 2012
    @adamsmith, @Simon:

    Ok, I assumed we wouldn't rely that much on what GS suggests as publishers. I still think Zotero should have a ranked list of publishers to try successively - not necessarily defined by the user (it can come with Zotero) - based on the GS suggestions. Some publishers offer more complete info than others and hence should be ranked higher. I was also thinking Zotero could try a 2nd, 3rd, etc. publisher in case some expected fields (e.g. abstract) are missing.

    This would come in quite handy when enhancing existing entries like @adamsmith was saying, e.g. right-click an entry > complete metadata ... Zotero could then either do the above, or present the user with a list of publishers that he can select (multiple selection should be allowed), which Zotero would then try in order of their ranking to fetch more/better metadata.

    Allowing the user to choose is a concept I believe in, rather than making the choice for him. There can easily be an option in the preferences to set it all to auto as Simon wants, so the user is not prompted. I'd rather have the manual way, and then Zotero can remember my choices and automatically preselect them next time I use the feature (it's likely I will use the same publishers), so it's only one extra click rather than all automatic with chances of failure, e.g. GS does not have good entries or points only to a homepage copy of the article.
  • edited June 18, 2012
    I still think Zotero should have a ranked list of publishers to try successively - not necessarily defined by the user (it can come with Zotero) - based on the GS suggestions.
    yes, I think that's the way to go:
    1. run through the sites that GS links to, and compare them to an internal & ranked list. If there's a hit, follow that.
    2. If not, follow the top-level link from GS.
    3. If Zotero doesn't recognize that, fall back to the GS bibtex.

    If we have a DOI we should run it through the resolver and run through 1-3 (where we use CrossRef data instead of GS bibtex for 3.).
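
    Just to illustrate step 1 in shell terms (the real thing would live in recognizePDF.js, and the domain ranking here is made up):

    # made-up ranking of databases we have good translators for
    ranked="jstor.org sciencedirect.com springerlink.com ieeexplore.ieee.org"
    # reads the candidate URLs from GS's "All n versions" list on stdin and
    # prints the highest-ranked match, falling back to the first URL (step 2)
    pick_version() {
        urls=$(cat)
        for domain in $ranked; do
            hit=$(echo "$urls" | grep -m1 "$domain") && { echo "$hit"; return; }
        done
        echo "$urls" | head -n1
    }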

    I was initially also thinking we would want the manual select of databases for item completion, yes. Right now I'm inclined to wait and see how well this works without it - having spent a lot of time answering user questions, everything that offers less chance to confuse users is a big plus...

    edit: we could put the list of databases in a hidden pref maybe - where dedicated users will be able to customize it, but others won't be confused or tempted by it.
  • edited June 18, 2012
    Updated script:
    https://www.dropbox.com/s/7xwuf3vemtlhhsf/pdf-meta.sh

    Syntax is: pdf-meta.sh article.pdf


    Execute it without any arguments to see more options. It should work on most systems, including cygwin/msys on Windows. No more pdf-extract or Ruby needed.

    It calls Google Scholar automatically. You need Firefox installed and in PATH, or specify its path in the script.

    I'm curious to hear your results if you try it, especially for pdfs for which Zotero fails.


    Here's an example of what I was talking about in my previous post: when the text line contains funny symbols or maths, an unfiltered quoted search string fails. Try the quoted search below. It fails:

    http://scholar.google.co.uk/scholar?q="the+CFO+𝜀+can+be+estimated+using+the+conventional+joint+methods+or+joint+ML+method+as+described+in+Section+III+and+IV,+respectively"&btnG=&hl=en&as_sdt=0,5

    The above script does its best to extract many consecutive words that are problem free, i.e. without funny chars or even numbers, after identifying the biggest block of likely text and fixing hyphenated words (GS does this too). This quoted search string works fine:

    http://scholar.google.co.uk/scholar?q="be+estimated+using+the+conventional+joint+methods+or+joint+ML+method+as+described+in+Section+III+and+IV,+respectively,"&btnG=&hl=en&as_sdt=0,5

    You can also see that GS has actually not indexed or extracted that epsilon symbol as text (look at the text preview), so unfiltered search strings do fail for this kind of thing.

    I'll look into implementing a Zotero patch with the rest of the functionality of fetching better metadata. I just wanted to first implement a better method of finding papers in Google.
  • edited June 18, 2012
    Use -layout.

    I just corrected my childish error above. It's actually -raw. Using -layout causes double columns to appear as double columns in the text, which breaks semantics when fetching a text line.
    I looked through the recognizePDF.js source code, and it looks like we presently use -layout and restrict search strings to the first column on each page. I don't remember why I chose to do this instead of using -raw (which I think I remember testing when I first implemented this); this is probably worth further exploration. However, because this cleaning step occurs before we search for a DOI, it probably explains why we sometimes fail to capture DOIs even when they exist. I've corrected this.
    Yes, but you apparently retain everything in the line, which is not necessarily what GS indexes, and then a quoted search string fails. If you filter out words, then again a quoted search string is likely to fail. Lines of text in papers I write or have contain maths, references, funny symbols, etc. These still end up close to median length. I experienced problems with this yesterday as GS did not extract or index the same full text.

    My method worked fine since I concatenate all lines in the biggest contiguous block of "likely text", fix words, then look for as many consecutive words that are problem free (only text and harmless punctuation) as possible, from which I optionally select a contiguous subset. I then search that which is not (less) prone to errors due to different full-text indexing or text retrieval mismatching between what I and GS do. Searching with or without quotes worked for me just fine, precisely owing to the nature of the search string.

    It's fine if you don't wish to try it, I can have my own patched Zotero version :). Does still need testing though.
    I agree that it's possible that other characters could cause problems. If we find that your approach improves our ability to detect PDFs without increasing the false positive rate, we should certainly change the current implementation. My concern is that we 1) make sure this is the case by testing it on a reasonably large corpus of PDFs from a diverse set of sources and 2) test each change individually to make sure that it's actually beneficial.

    Ok, I assumed we wouldn't rely that much on what GS suggests as publishers. I still think Zotero should have a ranked list of publishers to try successively - not necessarily defined by the user (it can come with Zotero) - based on the GS suggestions. Some publishers offer more complete info than others and hence should be ranked higher. I was also thinking Zotero could try a 2nd, 3rd, etc. publisher in case some expected fields (e.g. abstract) are missing.

    This would come in quite handy when enhancing existing entries like @adamsmith was saying, e.g. right-click an entry > complete metadata ... Zotero could then either do the above, or present the user with a list of publishers that he can select (multiple selection should be allowed), which Zotero would then try in order of their ranking to fetch more/better metadata.

    Allowing the user to choose is a concept I believe in, rather than making the choice for him. There can easily be an option in the preferences to set it all to auto as Simon wants, so the user is not prompted. I'd rather have the manual way, and then Zotero can remember my choices and automatically preselect them next time I use the feature (it's likely I will use the same publishers), so it's only one extra click rather than all automatic with chances of failure, e.g. GS does not have good entries or points only to a homepage copy of the article.
    I'm not opposed to this. With that said, I don't particularly want to implement, maintain, or support it unless it's absolutely necessary. The easiest thing to do would be to store a priority list in a hidden preference or a file, as adamsmith suggests above. Alternatively, if you want to implement this and can commit to maintaining it, I'd be willing to ship it.
  • edited June 18, 2012
    I agree that it's possible that other characters could cause problems. If we find that your approach improves our ability to detect PDFs without increasing the false positive rate, we should certainly change the current implementation. My concern is that we 1) make sure this is the case by testing it on a reasonably large corpus of PDFs from a diverse set of sources and 2) test each change individually to make sure that it's actually beneficial.
    I posted my updated script and also an example where the previous method fails but mine doesn't. You can easily see a generalization from that. See here:

    http://forums.zotero.org/discussion/23748/automating-massimport-from-pdfs/#Comment_128031

    You are perfectly right and I said so too that much more feedback and testing is needed. I'm very curious to see other people's feedback using this method, for many other types of papers. The script should run on pretty much all systems.
  • I played around with your bash script a little. Zotero did marginally better with the 5 test PDFs I tried (3 correct detections, 1 false positive, 1 miss vs. 2 correct detections, 1 false positive, 2 misses) but the sample size is very small and Zotero is using the DOI where available and performing up to 3 searches, so it may not be a fair comparison. It looks like your script succeeded in identifying the PDF that Zotero failed to identify, which may be a good sign. I'm happy to look at a larger sample if you can either integrate your modifications into Zotero or automate the Google Scholar lookup in your script.
  • No problem, I'll implement 3 searches (from 3 different text chunks to obtain 3 different sane search strings) and also look for a DOI so a comparison is more pertinent.

    What do you mean by automating GS lookup? Currently the script calls firefox to navigate to the GS results. I could use wget to parse and fetch the bibtex from GS.
  • edited June 20, 2012
    What do you mean by automating GS lookup? Currently the script calls firefox to navigate to the GS results.
    What I mean is that, at the moment, it would be very painful to test this on a large set of PDFs. You'd end up having to go through n Firefox windows and see whether the search ended up with the right item.
    I could use wget to parse and fetch the bibtex from GS.
    I don't think GS likes wget's user agent, and I think it also wants a cookie from you, but you could probably do this. It may be easier to modify recognizePDF.js at this point, especially if that's your eventual goal. You can use mozISpellCheckingEngine in place of aspell.
  • I don't think GS likes wget's user agent, and I think it also wants a cookie from you, but you could probably do this. It may be easier to modify recognizePDF.js at this point, especially if that's your eventual goal. You can use mozISpellCheckingEngine in place of aspell.

    wget can easily do all that and I've already implemented it. Will post the updated script later.

    # inspired by http://matela.com.br/pub/scripts/bibscholar-0.3
    # ($url is assumed to hold the Google Scholar query URL built earlier in the script)
    cook=`mktemp`
    ua="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5"
    # fetch a cookie from Google Scholar
    wget -q --user-agent="$ua" --spider --save-cookies="$cook" "http://scholar.google.com"
    # change the cookie (set the CF=4 preference so Scholar will serve bibtex)
    sed -i 's/\.scholar\.google\.com.*/&:CF=4/' "$cook"
    # fetch the ids of the first 3 results, if any
    ids=( `wget -q --user-agent="$ua" --load-cookies="$cook" "$url" -O - |
        egrep -o "q=info:[^:]+" | uniq | head -n3` )
    # get the corresponding bibtex entries
    for id in "${ids[@]}"
    do
        sleep 1
        wget -q --user-agent="$ua" --load-cookies="$cook" \
            "http://scholar.google.com/scholar.bib?$id:scholar.google.com/&output=citation" -O -
    done
  • edited June 22, 2012
    I updated the script to do 3 searches, but I have not implemented any DOI searches as I've been too busy. It is now fetching the bibtex and displaying it in the console for all 3 searches.

    https://www.dropbox.com/s/7xwuf3vemtlhhsf/pdf-meta.sh

    There were 2 bugs in my previous script that were affecting the building of the search string. Your results might improve now.
  • edited June 22, 2012
    Added DOI detection, but right now it's only showing the dx.doi.org url -- it's not parsing the redirected page. Should be ok as it's mostly a proof of concept.
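
    The detection itself is simple enough - roughly along these lines (a simplified sketch, not the exact code from the script):

    # look for the first DOI-looking string in the first couple of pages
    doi=$(pdftotext -raw -f 1 -l 2 article.pdf - | grep -oE '10\.[0-9]{4,}/[^[:space:]]+' | head -n1)
    [ -n "$doi" ] && echo "http://dx.doi.org/$doi"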

    Same url:
    https://www.dropbox.com/s/7xwuf3vemtlhhsf/pdf-meta.sh

    Execute without arguments for syntax and possible parameters. I added a few.

    Worked quite well on my pdfs. Awaiting feedback.
  • Hi,
    As the discussion title suggests, it should give more insight into how to do a mass import from PDFs. I read most of it and I kind of got lost in the discussion of how things can be improved. My question is how I, as a Zotero user, can currently deal with building my library from the tons of pdfs I have. Drag and drop is not an option (and is perhaps not functioning under Linux at the moment). Retrieving metadata for each and every file might be extremely time consuming. So can you give me any sense of what I can currently do? I may be able to do some testing (and give feedback to the authors and the forum) with some of the above scripts; however, this may require some HOW TO tips as using the command line is not my strong side.
    Regards
  • edited September 10, 2012
    I don't quite understand why drag&drop (or using "Store Copy of File" under the green plus) is not an option?

    I don't know how the original poster solved the problem of getting pdfs into Zotero, but s/he mentions having solved that locally, so maybe s/he has some pointers,
    but within the currently existing possibilities with basic Zotero, dragging the pdfs and retrieving metadata is the best way to go. You don't have to retrieve metadata for individual files, you can do that in batches.
    The improvements discussed here mainly concern the quality and reliability of the retrieve metadata feature.
  • Ok,
    I forgot to provide some more background. I plan to use a different directory for storing PDFs than /zotero/storage. So I thought that when I point Zotero to look into its new storage folder, it could build up a library without me really having to drag and drop files or click "Store Copy of File".
    The first step is to make Zotero see all my stored PDFs and the second is to get the metadata for all of them, so I'm trying to figure out the way to do that.