automating mass-import from PDFs

Hi,

I have quite a lot of PDFs that I haven't put in any reference manager (yes, imagine that). Searching for those on the web and then importing into Zotero would be the best option, as Zotero then fetches all the info about each one properly. Sadly, there are too many of them and my free time is too scarce.

I can drag and drop PDFs into Zotero and use the retrieve metadata feature. A few quirks with this:

- many PDFs do not have any embedded metadata, so Zotero can't find anything.
- the information retrieved from Google Scholar is often far from complete (no tags, no abstract, etc.)
- I'm still required to drag and drop, right-click, and choose "retrieve metadata"

Most of my PDFs file names follow the format "<author><short_year>-<title>.pdf". I'm happy to write a script to parse those and send the title to Zotero for automatic importing.
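A minimal sketch of such a parser, assuming filenames like "Smith09-A Great Paper.pdf" (the regex and the century cutoff are my own guesses, to be adjusted to the actual naming scheme):

```python
import re

# Pattern for "<author><short_year>-<title>.pdf", e.g. "Smith09-A Great Paper.pdf".
# Author: letters only; short year: two digits; title: everything before ".pdf".
FILENAME_RE = re.compile(r"^(?P<author>[A-Za-z]+)(?P<year>\d{2})-(?P<title>.+)\.pdf$")

def parse_filename(name):
    """Return (author, full_year, title), or None if the name doesn't match."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    yy = int(m.group("year"))
    # Arbitrary century cutoff: 2-digit years above 30 are assumed to be 19xx.
    year = 1900 + yy if yy > 30 else 2000 + yy
    return m.group("author"), year, m.group("title")

print(parse_filename("Smith09-A Great Paper.pdf"))
# ('Smith', 2009, 'A Great Paper')
```

The extracted title could then be fed to whatever search Zotero exposes.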

So:

- does Zotero (standalone or Firefox) support any command line functions?
- can I somehow define another search engine other than Google Scholar for Zotero to use when fetching data?

I'm happy to use intermediary tools to achieve high-quality batch importing, i.e. with tags (index terms), abstracts, etc. Mendeley seems to be a lot better at this, as it automatically fetches info when you drag and drop a PDF, but sadly it gets it wrong many times, or field names are truncated, so I don't quite trust it.

Any other suggestions anyone?

Thanks in advance.
  • You can't easily define another search engine, no. And even if you could - what would you use? There is no full-text search engine that produces better metadata than Google Scholar and has any comparable coverage - at least none that we're aware of.

    You can interface with Zotero locally using the javascript API:
    http://www.zotero.org/support/dev/client_coding/javascript_api

    If I understand you correctly, Mendeley isn't actually a lot better at this - the only difference is that you don't have to select "retrieve metadata". But if you do find Mendeley is better, it should be possible to move the data from Mendeley to Zotero (via BibTeX, I believe).

    So basically the answer is that there currently just is no automated way to get high-quality metadata for PDFs into Zotero. I don't know if any reference manager does this well - it sounds like Bookends for Mac might - but none of the free (Zotero, Mendeley, Wizfolio, Qiqqa), cheap (Papers), or commonly offered commercial (EndNote, RefWorks) managers are significantly better than Zotero at this. Most are worse.
    FWIW, PDFs for which Zotero finds a DOI on the first couple of pages will do quite a bit better - the data comes from CrossRef, but you still won't get abstracts and keywords.
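    (To illustrate what "finds a DOI" amounts to - a simple pattern match over the extracted page text. This regex is a simplification for illustration, not Zotero's actual pattern:)

```python
import re

# Rough illustration of a DOI scan over extracted page text.
# DOIs look like "10.<registrant>/<suffix>".
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s]+")

def find_doi(text):
    m = DOI_RE.search(text)
    # Trim punctuation that often trails a DOI in running text.
    return m.group(0).rstrip(".,;") if m else None
```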
  • > You can't easily define another search engine, no.
    > And even if you could - what would you use?

    Well, like most researchers I have a narrow field of research, so I would use IEEE Xplore, which has most of the papers for which I already have a PDF that I want to get into Zotero properly. I don't need Google Scholar for this; it quite often has poor results when automated (and poor BibTeX entries in general).

    > FWIW, PDFs for which Zotero finds a DOI on the first
    > couple of pages will do quite a bit better - the data
    > comes from CrossRef, but you still won't get abstracts
    > and keywords.

    Wait, wait. So Zotero scans the first two pages looking for a DOI then looks online in the CrossRef database to fetch info?

    Here are some suggestions based on the above -- if/when I have some time, I might actually give this a go myself if the JS API is flexible (I haven't checked it yet). Zotero already supports importing from lots of online publishers and databases, e.g. IEEE Xplore. It would be great if Zotero had either or both of the following features:

    - a search box on the menu bar that would search a chosen publisher/database (e.g. IEEE Xplore) and get the first match as a Zotero entry - rather than me having to manually browse to the publisher's site, search there, and then click the Zotero import icon. The search box would have a small dropdown menu to choose the publisher/database to search, and would remember it afterwards, like the current search box does. This feature could then be exposed in the API so that people can script it and perform batch imports like the one I'm trying to do now.

    - a button on the menu bar (not a menu > submenu > entry, too many clicks) to run the above search on the currently highlighted entry when it isn't yet fully imported, e.g. a drag-and-dropped PDF. The search should first try extracting metadata and, if that fails, use the actual file name (without extension, after beautifying it a bit by replacing [-_.] with spaces, etc.), since most of the time the file name is very close to the actual full title of the paper. The user should also be able to specify which publisher or online database to search. Zotero could (should?) even perform all this when a PDF is drag-and-dropped. There could be an additional option in the Preferences so that multiple entries are presented to the user corresponding to the search results, just like Zotero currently does with search results from Google Scholar, for instance.

    Should I make these suggestions in some other dedicated area (email)?

    Do you think you'd implement these anytime in the future? I think they would be really helpful. Managers like JabRef already have specific publisher/database search features. Zotero could be a lot better, as you already support importing from a lot more of them.

    Regards.
  • edited June 16, 2012
    Wait, wait. So Zotero scans the first two pages looking for a DOI then looks online in the CrossRef database to fetch info?
    Yes (I'm not sure about the number of pages - might be 2, 3, or 5, or just the first X words). That takes precedence over a Google Scholar search.

    I don't think Zotero is very inclined to create its own search interface, no. It's a _huge_ amount of work, because apart from GUI changes, the existing translators would all have to be rewritten to accommodate this. Also, the search interface could never be as good as what the websites already offer. I know other programs do that - but other programs also don't have the type/quality of browser plugin Zotero has.
    The ability to use a database's native interface is one of Zotero's core features/advantages - if you don't like to do that, maybe Zotero isn't right for you.

    No need for a separate thread for those suggestions - all the relevant people will read this - but my (unofficial) sense is that these are highly unlikely to happen in the next couple of years: Some improvements for the retrieve metadata feature are certainly desirable and likely to happen, but nothing close to the complete re-design you're proposing.
  • edited June 16, 2012
    But you'd only have to send the search string to the online publisher/database search engine, exactly the same way you'd do it with a browser, i.e. Zotero would send the same HTTP GET query as if it came from the browser on that webpage (*). The returned HTML would be the same as if the search was performed with the browser directly, and it would be fetched and parsed in the same way Zotero already does now.

    I don't see the huge amount of work at all to be honest. Surely, adding a search box or button on the GUI is not that much of a monstrous task. The translators would not need much alteration either, just a template HTTP GET query with a placeholder for the actual search string.
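    A sketch of what such a template could look like (the IEEE Xplore URL format here mirrors a search URL quoted elsewhere in this thread; treat it as illustrative rather than a maintained endpoint):

```python
from urllib.parse import quote_plus

# Per-database URL templates with a {query} placeholder.
# Only an illustration - real templates would live in the translators.
SEARCH_TEMPLATES = {
    "ieee": "http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText={query}",
}

def build_search_url(database, search_string):
    """Fill a database's search template with a URL-encoded query."""
    template = SEARCH_TEMPLATES[database]
    return template.format(query=quote_plus(search_string))

print(build_search_url("ieee", "multi-carrier burst contention"))
# http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=multi-carrier+burst+contention
```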

    If you really don't want to do it (why? it'd be so useful) then I could actually do this myself, as I have some experience with Firefox extensions and JS. Where can I find the necessary sources to implement this?

    (*) there is an associated issue in that the publisher may change the HTTP GET query format, so Zotero will have to be kept up to date. However, that happens rarely, and you must already be doing this anyway when parsing HTML pages, since publishers may change the HTML layout too.

    Regards.

    p.s. If I actually do it, would you consider bumping up my space on Zotero? :)
  • edited June 16, 2012
    For every database you want to use, you'd have to write a search function - here the example from Worldcat, e.g:
    https://github.com/zotero/translators/blob/master/Open%20WorldCat.js#L153

    I.e. this does mean extra code for every site you're interested in using.
    I personally think that long-term it would be nice to have at least for a couple of popular sites (JSTOR, IEEExplore, LoC, Wiley, T&F, Springerlink, EBSCO, Sciencedirect), but only if we have a way to work with it usefully, which brings me to:

    The GUI questions - before you put any work into that, I'd suggest you check on zotero-dev https://groups.google.com/forum/?fromgroups#!forum/zotero-dev what they'll consider implementing/accepting: I don't really see a full search interface happening (though I could be wrong) - GUI space being a major consideration here. Ideally there'd be a clever way to automate that, but I currently don't see how.

    What I could see is a way to complete existing but incomplete or low quality entries (e.g. retrieved from google scholar) using this: something like - you right-click on the title and select "complete using IEEEXplore" - that might even allow users to get the PDFs.
  • > something like - you right-click on the title and select
    > "complete using IEEEXplore" - that might even allow users
    > to get the PDFs.

    That's pretty much what I was suggesting above (the 2nd suggestion), although I was talking about a button to save users from right-clicking. Right-clicking would still be fine as long as you allow doing this for multiple selected/highlighted entries. Having this would be of real value, since Google Scholar results are quite poor.

    Since this would involve minimal GUI changes, then it could also be augmented with filename -> cleaned filename -> title -> metadata retrieval.

    Is Zotero currently using the "Title" metadata field in the PDF when retrieving metadata of drag and dropped PDFs?
  • Is Zotero currently using the "Title" metadata field in the PDF when retrieving metadata of drag and dropped PDFs?
    no. I think I remember Dan (Stillman) saying he didn't think that was a good idea, but it might be worth reconsidering - there was a discussion on this somewhere only a couple of months back, probably worth tracking down.
    One issue to consider is that currently Zotero is preventing false positives by searching for chunks of the full text rather than just the title.
  • This might be one of those "leave the choice to the user, don't make it for them" issues. I think the title should be used and then Zotero would present the user with a list of results and checkboxes, just as it does when clicking the icon on a search results page on Google Scholar.

    Would be extremely helpful since most of the time (virtually always in my case) the first result is the correct one when searching using the full title, not just a few words from the title. Granted, for mass-import you can just disable this if multiple entries are selected.

    In my case, I could script this, i.e. update the PDFs whose "Title" field is empty or doesn't match the article title (I have a lot of these) using the file name. Then drag-and-drop into Zotero and right-click to have Zotero fetch the first result of a title search for each of them.

    I'd actually like to try this. If it proves to work well then I'll post the patch. Could you please give me a head start by pointing me to the API zone and source files I need to implement a "title" based search using an alternate search engine, e.g. IEEE Xplore?

    > One issue to consider is that currently Zotero is
    > preventing false positives by searching for chunks
    > of the full text rather than just the title.

    Are you sure about that? That would require fetching the PDF from the remote site as well and then comparing (unless you're searching a full-text indexed database) ... for all PDFs I tried, it returns much too quickly.
  • > One issue to consider is that currently Zotero is
    > preventing false positives by searching for chunks
    > of the full text rather than just the title.

    Are you sure about that?
    yes. That's the beauty of Google Scholar - it is a de facto full-text indexed database.

    As for the files to change - obviously the translator for IEEE xplore. The retrieve pdf code is here:
    https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js
    but beyond that, post to zotero-dev - Dan would have to say. My sense is that this isn't something to do via the API, but rather as part of the core code.
  • edited June 16, 2012
    Thanks for the code entry point. How about the source code entry point for the "Retrieve Metadata for PDF" feature (from the right-click menu of a drag-and-dropped PDF)?

    I know you pointed me to the dev list, but it would save me some time re-explaining the issue there if you knew the above too. I'm happy to code beyond the API.
  • sorry, not sure - I'd have to search myself where the context menu is listed. You can just link to this from the dev list, though.
  • edited June 16, 2012
    This might interest you: https://github.com/zotero/zotero/issues/99
    (edit: http://forums.zotero.org/discussion/22882/)

    Maybe I'm wrong but lookup engines could help you here:
    If you are able to import very basic metadata (from filenames), you can try to write a lookup engine which would search for your articles in your favourite database (three clicks at the moment: select the item, click on the "locate" green arrow, select your lookup engine). Then you'll just have to add the result to Zotero (one click). Far from perfect, but with more coding you can improve/automate this workflow.
    Just my two cents.

    This reminds me of this thread too.
  • thanks Gracile - I had misremembered Dan's take on this, sorry, so that sounds promising.

    I think search translators (currently Worldcat, CrossRef, and Google Scholar only, I believe) - which already import data into Zotero and don't rely on open search syntax (which almost no site has implemented) - are a better way to go than lookup engines.
    Personally I hate EndNote's integrated search. It always retrieved a lot of articles that I didn't want. I would probably still prefer using the publisher websites for retrieving articles, but if people want to search from Zotero, I think we can actually integrate it pretty easily. Most translators support search result pages, so we can simply send queries from Zotero to whichever supported website the user chooses and display the "multiple" dialog so they can select what to import. Most of the functionality is there; we just need to build URLs for search queries. As I said though, native websites are probably still preferred, since you can browse abstracts, etc.
  • I completely agree on the internal search for regular usage - and I don't think people would use that all that much.

    My main motivation and IMHO the biggest gains for Zotero in overall usability would be
    1. Complete items with poor metadata: https://www.zotero.org/trac/ticket/1519 including pdf attachments which I think would be really neat (and, afaik, only Bookends is able to do the latter atm).
    2. Improve/facilitate retrieving metadata for pdfs, along the lines described by the OP.

    I think all changes should be aimed at that functionality. That means we probably shouldn't bother with a full search interface for integrated search (which I'd bet Dan would veto anyway), but rather use existing Zotero fields or pdf content as search terms.
    It's probably a good idea to agree on how we want this to work exactly before starting to code - not least because we'd want to wait for the input of core devs who have the last word.
    Since the "how should this look" needn't be technical I'd suggest keeping this here rather than on zotero-dev. Once we have agreed on the set-up we can hammer out details on github and zotero-dev.
  • edited June 16, 2012
    @aurimas: that's exactly what I suggested in a post above yours when describing the HTTP GET request. Virtually all the functionality is already there in Zotero, so when adding a PDF and then clicking on "retrieve metadata", the user could select which search engine to send the query to, using an HTTP GET query built from the title extracted from the PDF if nothing better is detected.

    This way people (like me) can also automate batch import jobs, which is currently quite a pain for all PDFs that do not already have proper metadata fields that Zotero can use to search ... and even when it does, the returned result is of quite poor quality because of the search engine used. Don't get me wrong, Google Scholar is great, but its BibTeX entries suck.

    If no metadata fields are found in the PDF, then Zotero can attempt to turn the filename into the search string to be sent to the search engine. Most of my PDF filenames actually contain the full paper title.

    I could have a go at this if you guys are not going to anytime soon. I don't have immediate free time either but if I manage to code it before you begin then you can decide what to do with it.
  • I'd really hash out the details before anyone starts on this - there is a lot of stuff that needs to be cleared up before people should start any coding work
    - do we take people to the search result page (aurimas' solution), or auto-import the first item?
    - where do we look in the pdfs for the title?
    - how do we prevent false positives?
    - how/where do people select the database they want to query?
    - for completing items: How do we deal with conflicts?
    - how/where is the search triggered?
    and I'm sure there are more.

    Sending the search queries via translators is really the easiest part here. Implementing it in a way that works and is intuitive is the main chunk of the work.
  • edited June 16, 2012
    Well, I have a ton of pdfs I'd like to import and start using. You guys can take your time.

    > - do we take people to the search result
    > page (aurimas' solution), or auto-import
    > the first item?

    I think both aurimas and I were saying Zotero should present the multiple-choice list with checkboxes representing the search results. I would further enhance this to automatically take the first result in case multiple entries (PDFs) are selected in Zotero when clicking on "retrieve metadata for PDF".

    > - where do we look in the pdfs for the title?

    I think I described this several times now. First, in the PDF metadata itself. Every PDF has embedded metadata (though the fields are often empty). Check your PDF viewer > document properties > metadata/description. E.g. Acrobat: http://bit.ly/M3jef7 . If that fails, then my suggestion was to use the filename, after cleaning it up with s/[^a-z0-9]/ /g or similar so that search engines get a cleaner search string. Zotero could even allow the user to edit the search string before it is sent to the search engine.
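    In script form, that cleanup could look like this (assuming runs of non-alphanumerics collapse into single spaces, per the substitution above):

```python
import re

def clean_filename(name):
    """Turn a file name into a plausible search string."""
    stem = name.rsplit(".", 1)[0]  # drop the extension, if any
    # s/[^a-z0-9]/ /g, with runs collapsed to a single space
    cleaned = re.sub(r"[^a-z0-9]+", " ", stem.lower())
    return cleaned.strip()

print(clean_filename("Smith09-A_Great.Paper.pdf"))
# smith09 a great paper
```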

    > - how do we prevent false positives?

    Let's not. I'd be interested in receiving the full list of results in the form of the multiple checkboxes dialog that Zotero extracts from the html search results page.

    > - how/where do people select the database they want to query?

    Preferences for starters, but a dropdown list on the menu bar would be much better as a second implementation stage.

    > - for completing items: How do we deal with conflicts?

    In what sense conflicts? You mean duplicates? The user would deal with that afterwards, it wouldn't be Zotero's fault.

    > - how/where is the search triggered?

    I'm not sure I follow. An HTTP GET request straight to the selected search engine, exactly the same way the browser would do it if I were to search for the same string on the search engine's webpage (e.g. "multi-carrier burst contention" would result in http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=multi-carrier+burst+contention ). Zotero would then parse it exactly the same way it already does when I press its "save to Zotero" icon.

    > and I'm sure there are more.
    >
    > Sending the search queries via translators is
    > really the easiest part here. Implementing it
    > in a way that works and is intuitive is the
    > main chunk of the work.

    I don't see the complexity to be honest but it's not my software. I will probably have a go at it for myself just to be able to mass-import all my pdfs. It's too much of a pain to do it manually using the conventional method.
  • "how/where is the search triggered?"
    I mean: How does the user tell Zotero s/he wants to search for something? I have absolutely no sense of the GUI vision behind all this yet.

    If you don't want to take the time to discuss & think through these issues with people who have used Zotero for much longer as well as the core devs who have written it that's obviously fine.
    It does, unfortunately, mean that what you code will likely only be marginally useful for inclusion in Zotero.
  • edited June 16, 2012
    I mean: How does the user tell Zotero s/he wants to search for something? I have absolutely no sense of the GUI vision behind all this yet.
    I made several suggestions in this thread about that:

    a) a search box (you said it's too much work),
    b) a button to trigger a search using the highlighted entry that a drag-and-dropped pdf produced (again, apparently too much work),
    c) via the "retrieve metadata for pdf" entry, which would automatically search using the specified search engine, using metadata fields from the PDF or the title via the filename -- which is what we've been discussing for the last several posts.
    If you don't want to take the time to discuss & think through these issues with people who have used Zotero for much longer as well as the core devs who have written it that's obviously fine.
    I'm too tired to debate this awkward comment or understand why we're discussing willingness issues when I clearly spent time in this thread and am willing to contribute. That said, I think I was clear in several posts above that I'd like to import a large number of PDFs faster than the snail's pace Zotero currently allows, and I'd like to do it soon. For that purpose, I can get to work right away and hack it to my liking; why should I wait weeks/months for people to decide how a GUI that I don't care about should look? It's primarily for my own use, with the potential of being of use to others too.

    I however made several suggestions re: the GUI for the features I proposed in case devs are interested to implement them in Zotero but you're not obliged to use any code I produce if you think it was either rushed or not properly discussed beforehand. It also doesn't mean I'm not willing to discuss implementation aspects for a Zotero adoption and adjust my code accordingly afterwards. But who knows how tired I'll be then ...
    It does, unfortunately, mean that what you code will likely only be marginally useful for inclusion in Zotero.
    I intend to implement c) above with IEEE Xplore. My code might already be working before you even discuss GUI placement. If I was part of the dev team then I'd use something like that or parts of it. But I'm not, and you're not obliged to use any of it. I'm open to discussion regarding a global Zotero implementation, but for the immediate purpose of importing my pdfs I'd rather get to work.
  • edited June 17, 2012
    normadize wrote:
    a search box on the menu bar that would search a chosen publisher/database (e.g. IEEE Xplore) and get the first match as a Zotero entry - rather than me having to manually browse to the publishers entry, search there and then click the Zotero import icon. The search box would have a small dropdown menu to choose the publisher/database to search in, and also remember it afterwards, like the current search box does. This feature could then be exposed in the API so that people can script it and perform batch imports like the one I'm trying to do now.
    aurimas wrote:
    I think we can actually integrate it pretty easily. Most translators support search result pages, so we can simply send queries from zotero to whichever supported website the user chooses and display the "multiple" dialog so they can select what to import. Most of the functionality is there. Just need to build urls for search queries. As I said though, native websites are probably still preferred, since you can browse abstracts, etc.
    I think we should move this part to another thread. While slightly related, it's an issue on its own. Core Zotero devs have mentioned previously that this feature will probably not be integrated, and if it were integrated the same way EndNote does it, where all matching results are imported (at least that's how it used to do it), then I completely agree with them. If the results, however, are presented to the user in the same dialog as the webpage "multiple" results, then I think it would be marginally useful and it wouldn't take a lot of effort to integrate.

    Related discussion

    Moving on to the actual issue at hand. Automatic PDF import:

    This is the current workflow (recognizePDF.js:238) (the code is not exactly in the order listed here, but the logical workflow is as follows):
    • OCR'ed text (or just whatever text content is present) is retrieved from the first 3 pages of the PDF

    • We look for a DOI in this text

    • If the DOI is found, we look it up using CrossRef translator

    • If there is no DOI, then we sort all the lines in the retrieved text, and find median length. We pick out all the lines that are +/- 4 characters from the median length.

    • Now from the lines we picked, we build a search string of at least 25 words. We look at each line, drop first and last word (since they could be partial words), put quotation marks around the rest of the line and add it to search string. Once we have enough words in the search string, we query Google Scholar.

    • If we get results, we store the first one. If not we retry the last step 3 more times or until we run out of good lines
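    In rough Python, the query-building step above looks like this (simplified for illustration; the real logic lives in recognizePDF.js):

```python
# Sketch of the Google Scholar query-building heuristic described above:
# keep lines near the median length, drop each line's first and last word,
# quote the rest, and stop once the query has enough words.
def build_query(lines, min_words=25, tolerance=4):
    lengths = sorted(len(l) for l in lines)
    median = lengths[len(lengths) // 2]
    # Keep lines whose length is within +/- tolerance of the median.
    candidates = [l for l in lines if abs(len(l) - median) <= tolerance]
    parts, words = [], 0
    for line in candidates:
        inner = line.split()[1:-1]  # drop first and last word (may be partial)
        if not inner:
            continue
        parts.append('"' + " ".join(inner) + '"')
        words += len(inner)
        if words >= min_words:
            break
    return " ".join(parts)
```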
    The following lists are likely incomplete and I will likely be repeating the issues/solutions mentioned above, but I'm listing them for completeness.

    Here is where we run into trouble:
    1. PDFs that don't have OCR text.

    2. PDFs that have pages prepended at the beginning, like copyright notices

    3. PDFs with short lines (I think)

    4. PDFs that are not indexed by Google Scholar

    5. Poor metadata from Google Scholar
    Some things that we could do better:
    1. Look at file names

    2. Try to detect more relevant lines in PDF

    3. Use additional databases for metadata retrieval
    The list can probably be expanded.

    I think the biggest issue is frequently the lines that Zotero picks up to build queries. I've seen it pick up publisher addresses, copyright language, etc. This mostly happened because there was little other text embedded in the PDF, so there's not much we can do about it. Using a different search engine would do us little good.

    Looking into file names has been suggested previously (discussion and github ticket) and is probably not a bad idea. It can even be further improved by allowing the user to define the format of the file name in a dialog once "retrieve metadata" is selected. I think this is worth implementing.

    Edit: per Dan's comment in the linked discussion, the feature is supposed to just work, though I personally don't see any harm in including a pop-up that allows you to select what parts of the PDF to use when retrieving metadata. As an advanced option, this could include a way to specify file name format.

    Looking at embedded metadata is also not a bad idea, and there is an open issue for it on github

    Using other databases may be beneficial in certain cases where all the papers can be found there and Google Scholar metadata is simply incomplete.

    Edit: There's been some talk about using MS Academic Search instead of Google Scholar, but I don't think it indexes the contents of papers, just their metadata. If someone is aware of another good database encompassing many fields of study and containing full text indices of articles that we can use instead of Google Scholar, that would help improve the metadata retrieval process. If we go with a pop-up dialog as discussed above, a more specialized database can be used and the user would be able to select it.

    But I think a much more beneficial feature, and something that has been on my mind for a long time is:

    adamsmith wrote:
    What I could see is a way to complete existing but incomplete or low quality entries (e.g. retrieved from google scholar) using this: something like - you right-click on the title and select "complete using IEEEXplore" - that might even allow users to get the PDFs.
    This can be done as a plugin and would certainly be helpful. Though this should also probably go into a different thread.
  • What I could see is a way to complete existing but incomplete or low quality entries (e.g. retrieved from google scholar) using this: something like - you right-click on the title and select "complete using IEEEXplore" - that might even allow users to get the PDFs.
    This could even be automated, right? If the user could specify a preferred database (e.g. PubMed for the life science folks) for metadata retrieval, Zotero could still use Google Scholar for the initial full-text search, but instead of saving the metadata of the first match, it could use that to query PubMed, and save the best match from there (if there is one).
  • edited June 17, 2012
    Beware of MS Academic Search. I use it but I do not depend upon it. I frequently find results with articles assigned to the wrong journal name and authors with somewhat similar (but not all that similar) names substituted for the correct author. Sometimes the first or last author is missing from the record. Sometimes there are authors listed in the MSAS record who are not listed on the print or online versions of the actual article.

    I find the service useful for finding articles that, because they were published in an unfamiliar journal, I wouldn't otherwise find. But I _always_ feel a need to verify the information by finding the article on the publisher's website. (this has nothing to do with the need to read the article before citing it -- only that the metadata for the article is often very wrong and trusting the flawed metadata can suggest you had not read the article.) I have found that the doi is usually correct.

    edit
    For example: a search using 'partner violence' will find articles from the journal Tradition: A Journal of Orthodox Jewish Thought. However, the articles were not published there but in other journals (Infant Mental Health Journal, Wiley; Child and Adolescent Social Work, Springer; and others). Some articles that were published in the Journal of Forensic and Legal Medicine were assigned instead to Desalination.


    My comments are not about problems with importing into Zotero. The errors are in the MSAS database.
  • edited June 17, 2012
    just as one example of why this is complex:
    > - where do we look in the pdfs for the title?
    I think I described this several times now. First, in the pdf metadata itself. Every pdf has embedded metadata (many times they are empty).
    If you check PDFs from IEEExplore, you'll find that many times the XMP title tag isn't blank. But it doesn't contain the paper title, it has:
    "Paper Title (use style: paper title)"

    This one for example - but this was the case for three out of four papers I randomly downloaded from IEEExplore.

    [1] J. You, X. Jiang, N. Wang, Z. Shen, Q. Ma, and W. Peng, “Study of without blankholder drawing for individual titanium implant forming,” in Mechanic Automation and Control Engineering (MACE), 2010 International Conference on, 2010, pp. 5766 –5769.

    The problem is, of course, that searching for this string on IEEExplore does produce results. As a consequence, depending on the setting, if you try to get metadata for ten of these papers you either get wrong metadata for eight of them or eight completely useless search results that you have to click away. Neither is acceptable. So no, just using the XMP title tag is not a viable option.

    I think using the filename is more promising. For IEEE the default is just a number, which at least doesn't produce false positives - but I'm not sure that's the case for other databases, too.
    This has to work in a way that doesn't just cover users who have saved their files in a specific way.

    @Rintze
    I like that thought in general. One downside is that it doesn't circumvent google scholar, which I'd really like to be able to do, because GS locks out people batch-importing papers.
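
    The title-tag sanity check described above could be sketched roughly as follows. This is only an illustration, not anything Zotero actually does; the function name and the list of template patterns are my own assumptions, seeded with the IEEE boilerplate string quoted above.

    ```python
    import re

    # Hypothetical sketch: reject embedded XMP/Info titles that are blank or
    # look like template leftovers, such as the IEEE string
    # "Paper Title (use style: paper title)" quoted above.
    TEMPLATE_PATTERNS = [
        re.compile(r"use style", re.IGNORECASE),
        re.compile(r"^untitled", re.IGNORECASE),
    ]

    def looks_like_real_title(title):
        """Return True if the embedded title seems usable as a search string."""
        if not title or not title.strip():
            return False
        if any(p.search(title) for p in TEMPLATE_PATTERNS):
            return False
        # Very short "titles" are usually junk (file names, ids, etc.).
        return len(title.split()) >= 3
    ```

    Even a filter like this only avoids the worst false positives; as noted above, it doesn't make the title tag a reliable source on its own.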
  • Maybe it's possible to identify the abstract within the PDFs? There are a lot more services that index abstracts (PubMed included, see e.g. http://bit.ly/LYkNir ).
  • I mailed this to the zotero-dev group before: http://labs.crossref.org/styled-6/pdf_extract.html. We can use similar logic for the abstract. For instance: take the first paragraph with more than 3 lines of text and the same font size as the main text (in case there is some sort of copyright block). Or we can try identifying the title and authors. It does get quite complicated, though.
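
    The font-size heuristic just described might look like this sketch. The `Paragraph` structure is my assumption, not pdf-extract's actual output format, and guessing the body size as the most common font size is also just one possible choice.

    ```python
    from collections import Counter, namedtuple

    # Hypothetical input shape: paragraphs already extracted from the PDF,
    # each with its text, font size, and number of lines.
    Paragraph = namedtuple("Paragraph", ["text", "font_size", "lines"])

    def guess_abstract(paragraphs):
        """Return the first >3-line paragraph set in the body font size, or None."""
        if not paragraphs:
            return None
        # Assume the body font size is the most common one in the document.
        body_size = Counter(p.font_size for p in paragraphs).most_common(1)[0][0]
        for p in paragraphs:
            if p.lines > 3 and p.font_size == body_size:
                return p.text
        return None
    ```

    A small copyright block (few lines, smaller font) and the title (large font, one line) would both be skipped, which is exactly the point of the heuristic.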
  • @aurimas - ah yes, I remember - that might be quite helpful here. They say it's open source, but I couldn't find a license; that might be a problem.
  • edited June 17, 2012
    @aurimas: I also pointed to CrossRef's PDF-EXTRACT tools for content extraction.

    Here's a technique that I just tried and worked surprisingly well. It's borrowed from a former colleague, Stephen Kell, who wrote a set of quite nifty scripts a long while ago: http://www.inf.usi.ch/postdoc/kells/goodies/research/bibtex/

    - get the full text from the PDF starting with page 2; skip the first page, as it might be boilerplate

    - crossref's pdf-extract can be used with "pdf-extract extract --sections", concatenating the <line> nodes (pdf-extract is very slow, though). I advise against ps2ascii from ghostscript, as it still has newline issues, and against pdftotext, as it doesn't work on binary-encoded PDFs. Ghostscript can be used to skip the first page before passing the file to pdf-extract.

    - extract the first 10 or so consecutive "sane" words, i.e. words without any weird characters; regex is your friend. Optionally, these can also be run through aspell to catch anything that doesn't spell-check -- this should take care of words inside figures, which end up concatenated and nonsensical. Continue until 10 truly "sane" consecutive words are found.

    - pass these as the search string to Google Scholar. This actually worked surprisingly well for me. At this point, Zotero may prompt the user with a multiple-choice list to select the best entry -- the first one by default. If none are found, those 10 words may have been bogus (from figures, not properly extracted, etc.), in which case the previous step can be repeated starting where those 10 words left off, or from the 3rd page. Repeat this a maximum number of times before giving up.

    - take the info from Google Scholar's first entry and use it to search a ranked list of search engines in order (e.g. 1. IEEE Xplore, 2. acm.org, 3. Springer, 4. CiteSeerX, etc.) until a match is found, then fetch all the metadata, including the PDF.

    I tried this approach manually on a few papers and it worked really well. I'm now building a shell script so that you guys can try it on some of your own PDF samples. You'll need ghostscript and crossref's pdf-extract (which in turn requires Ruby).

    @adamsmith: I'm trying to spend my time constructively, i.e. finding a method that works on a good set of samples, rather than hunting for counter-examples.
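
    The "sane words" step above could look something like this sketch. The regex (plain letters, hyphens allowed) and the threshold of 10 are taken straight from the description; the function name is mine.

    ```python
    import re

    # A word counts as "sane" if it is plain alphabetic, hyphens allowed.
    SANE = re.compile(r"^[A-Za-z][A-Za-z-]*$")

    def first_sane_run(text, n=10):
        """Return the first run of n consecutive sane words, or None."""
        run = []
        for word in text.split():
            if SANE.match(word):
                run.append(word)
                if len(run) == n:
                    return " ".join(run)
            else:
                run = []  # junk word (digits, symbols): restart the run
        return None
    ```

    Returning None when no run is found maps onto the fallback above: retry from further into the text, or from the 3rd page, up to some maximum number of attempts.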
  • yeah - that's what Rintze suggests above.
    (At the risk of being destructive again:) As I noted, it would be better to do this without Google Scholar. GS locks users out when it thinks it has detected a robot/automated data retrieval - that usually kicks in somewhere between 20 and 100 queries, and it's especially likely for your purpose of importing a large batch of PDFs at once.
    It would still be an improvement over the status quo, of course, where you also get locked out but only get the crappy GS data.
  • edited June 17, 2012
    Please point me to where Rintze described the full algorithm that I just laid out. I can't see it.

    You first said it should work in general, not for specific cases like mine. I provided an idea and an algorithm for exactly that. Now you seem to be saying pretty much the opposite. Let's stick to a normal usage case that covers most users rather than hunting for exceptions. Most users are not doing mass imports (*).

    For that purpose, GS can and, I think, should be used. It's pretty much the only full-text search engine that is that vast and somewhat reliable. Zotero uses it right now anyway, so the lockout argument doesn't really hold -- from that perspective, my algorithm above would be no different from what Zotero already does.

    Regardless of the method, there will always be examples for which it doesn't work. The idea is to find something that covers the majority of usage cases, if we are to continue this discussion. I'm no longer discussing my own mass import (which I've pretty much solved already anyway).

    (*) when doing mass imports, Zotero could simply introduce a delay of N seconds between importing each entry to avoid the lockout. Users will probably happily wait to get their database fully imported with high-quality metadata.
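
    The delay idea in (*) amounts to a simple throttle between lookups. Here's a minimal sketch; `fetch_metadata` is a hypothetical stand-in for whatever query would actually be issued, and the `sleep` parameter is injectable so the behaviour can be checked without actually waiting.

    ```python
    import time

    # Wait N seconds between metadata lookups so the search engine doesn't
    # flag the batch as a robot. No pause before the very first query.
    def batch_import(items, fetch_metadata, delay_seconds=5, sleep=time.sleep):
        results = []
        for i, item in enumerate(items):
            if i > 0:
                sleep(delay_seconds)
            results.append(fetch_metadata(item))
        return results
    ```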