Dissociate PDF

mel47 · April 22, 2013

Hi,
I use "metadata retrieval" to find data for many PDFs, but in certain cases the data is completely wrong. I'm looking for a way to correct this. If I delete the entry, the PDF is deleted along with it. How can I dissociate the PDF from the metadata attached to it?
Thanks
Mel

adamsmith · April 22, 2013

You can drag the PDF out of an item. It's a little finicky, but you can see where it's going to be placed, and it should obviously end up at the top level.
If this happens a fair amount we'd be interested to know about the PDF (e.g. a link if it's somewhere online) - retrieve metadata is designed to have a very low rate of false positives.

mel47 · June 4, 2013

Hi,
Sorry for the long delay, but I had many files and I was blocked many times by Google Scholar...

So from my 2000 PDFs, I get:
- many that were not found, but they are too old, not indexed, or all from the same journal (J Hepatol), which has really poor metadata extraction (also tested with JabRef)
- these 5 PDFs with various errors (wrong author, title, or journal): http://ubuntuone.com/42q7ZShd6NAgQAm9fSFMaX

Thanks for the drag tip, I had never seen that before.
Mel

adamsmith · June 4, 2013

Cool, thanks. Going through them one-by-one

- Groeneweg et al Hepatology 1998 28.pdf - picks up an ISBN from the bibliography. Maybe restrict ISBN more?

- Holroyd & Overdyke J Neuropsychiatry Clin Neurosci 2012 24(3).pdf
This is just bad luck. They published two nearly identical articles _and_ google scholar's data for the first one is flawed. Not much we can do about that.

http://scholar.google.com/scholar?q="had elevated ammonia. This occurred in" "visual perception, construction," "treatment of dementia with behavioral" "of VPA in this population, potential side" "this retrospective, chart-review study, all patients" &hl=en&lr=&btnG=

- Mike Garcia Mdel Ann Hepatol Jun 10 Suppl 2.pdf
This works correctly, just doesn't get complete data, because google scholar doesn't have anything beyond author and title. Nothing to be done.

- Filippini et al Reumatismo 2002 54(2).pdf
gets everything but the year right for me (and uses English title). Year comes from google scholar, but generally I think that's pretty good.

- Jalan J Hepatol 2010 Sep 53(3).pdf
is a review of the study Zotero retrieves and contains large amounts of verbatim text from the original study. I doubt we can do much about that.

So I'd call this three false positives. The first one I think we might be able to avoid, which would bring us down to two. In either case, we're looking at false positive rates in the range of 0.1-0.25 percent, which seems pretty reasonable to me.
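
For anyone checking the arithmetic, here is a minimal Python sketch. It assumes the roughly 2000 PDFs mentioned earlier in the thread as the denominator; the upper bound presumably corresponds to counting all 5 reported files as errors. The function name is just for illustration.

```python
def false_positive_rate_percent(false_positives: int, total: int) -> float:
    """Return false positives as a percentage of all PDFs processed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * false_positives / total


TOTAL_PDFS = 2000  # approximate count reported earlier in the thread

# 2 = after fixing the ISBN case, 3 = current false positives, 5 = all reported files
for count in (2, 3, 5):
    rate = false_positive_rate_percent(count, TOTAL_PDFS)
    print(f"{count} of {TOTAL_PDFS}: {rate:.2f}%")
```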

aurimas · June 4, 2013

- Groeneweg et al Hepatology 1998 28.pdf - picks up an ISBN from the bibliography. Maybe restrict ISBN more?

We could do a bit of a sanity check for ISBN. I don't think there are many books that are 3 pages long.
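
A rough sketch of what such a check might look like, in Python for illustration only. This is not Zotero's code: the function name and the 10-page threshold are made up, and a real cutoff would need testing against actual PDFs.

```python
# Hypothetical threshold, not taken from Zotero: PDFs shorter than this
# are assumed to be articles rather than books.
MIN_BOOK_PAGES = 10


def should_trust_isbn(page_count: int, min_book_pages: int = MIN_BOOK_PAGES) -> bool:
    """Return True if the PDF is long enough that an ISBN found in it
    plausibly identifies the document itself."""
    return page_count >= min_book_pages
```

With this, a 3-page PDF would have its ISBN ignored, while a 300-page one would not.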

adamsmith · June 4, 2013

I haven't checked, but I assume the current implementation looks for ISBNs on the first x pages?
How about first x pages _and_ first 25%?

Generally super-short books are likely very rare. My concern would be
1. Items where Zotero just isn't able to index much of the text, but does find the ISBNs
2. Reports or other potentially short items which may get an ISBN

Both of these are unlikely scenarios, but I'm not sure how unlikely.

aurimas · June 4, 2013

The thing about first 25% (did you mean 25% of pages or 25% of text?) is that you can have single page articles.

If you meant text, then maybe we can bump the number of extracted pages to, say, first 10 and then only search first 50% of text. That should skip references in most cases.
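
Roughly, the idea could be sketched like this (Python, illustration only). It assumes the page text has already been extracted into a list of strings; the function name, the parameters, and the loose ISBN pattern are all made up for this example and are not Zotero's actual implementation.

```python
import re

# Loose, illustrative pattern: "ISBN", an optional "-10"/"-13" suffix and colon,
# then 10 or 13 digits with optional hyphens or spaces between them.
ISBN_PATTERN = re.compile(
    r"ISBN(?:-1[03])?:?\s*((?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx])"
)


def find_isbns_in_leading_text(
    pages: list[str], max_pages: int = 10, text_fraction: float = 0.5
) -> list[str]:
    """Search for ISBNs only in the first `text_fraction` of the text
    taken from the first `max_pages` pages."""
    text = "\n".join(pages[:max_pages])
    cutoff = int(len(text) * text_fraction)
    leading = text[:cutoff]
    # Normalize matches by stripping hyphens and whitespace.
    return [
        re.sub(r"[-\s]", "", match.group(1)).upper()
        for match in ISBN_PATTERN.finditer(leading)
    ]
```

Cutting by text rather than by page count means a single-page article still has its second half (where the references usually sit) skipped.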

adamsmith · June 4, 2013

If you meant text, then maybe we can bump the number of extracted pages to, say, first 10 and then only search first 50% of text. That should skip references in most cases.

yeah, that's roughly what I had in mind.