Error getting metadata

Grode · January 15, 2015

It began with not being able to drag. Now I can drag, after installing (and later removing) the standalone version, though not its settings. I noticed something got copied, so that is probably still there.

Now... local files seem never to be able to pick up any metadata. And at the moment I have no luck finding an online file that can.

I just get an "oops, something happened" ... I'm not certain how to translate the Danish error message properly back into English.

adamsmith · January 15, 2015

Please provide the message in Danish as it appears.
Which Zotero version exactly (check under "About Zotero" in the gears menu)?
And as I requested in the other thread, ideally a PDF you're trying this with.

Grode · January 15, 2015

4.0.25.1

The messages differ: http://www.uvm.dk/~/media/Publikationer/2009/Folke/Faelles%20Maal/Filer/Faghaefter/matematik_31.pdf returns: Fandt ikke referencer, der matchede (roughly "No matching references found")

http://www.michaelfullan.ca/wp-content/uploads/2014/04/14_Spring_Maximizing-Impact-Handout.pdf returns: Der opstod en uventet fejl (roughly "An unexpected error occurred")
So did this one: http://www.reflexen.learning.aau.dk/digitalAssets/66/66266_en_paedagogisk-didaktisk_praesentation.pdf

And this: 118280219-Berinderjeet-Kaur-Yeap-Ban-Har-Manu-Kapur-Mathematical-Problem-Solving-Yearbook-2009-AME-Association-of-Mathematics-Educators-World-Scient.pdf (local file)

While another local file returns: PDF indeholder ikke tekst genkendt ved OCR (roughly "PDF does not contain text recognized by OCR")
but it does. Here is a copy from a random spot: School profiles

Grode · January 15, 2015

In the meantime Firefox changed to v35.0 and Zotero to 4.0.25.2. Do I have to test it all once again?

adamsmith · January 15, 2015

Update to 4.0.25.2.
I think that might fix the unexpected error (der opstod...).

The first and third errors mean what they say: Zotero doesn't find metadata for the first one, and the last one, I assume, is a scan or a read-protected PDF, so Zotero doesn't find any text.

Grode · January 15, 2015

Almost all solved. The last one is still an issue. At first I thought it was because it is a printout of slides (but with OCR and easily editable), but then I found more files that could not be read. For example, www.eva.dk/projekter/2013/undervisning-pa-mellemtrinnet/download-rapporten/motiverende-undervisning.-taet-pa-god-undervisningspraksis-pa-mellemtrinnet/download, which I have locally, and it cannot be found. Later it became more critical.

One of the files from this test, the topmost one, which could be read and referenced... well, I had it as a PDF file from the internet (I have downloaded it several times today), and although it is the same file I had just fetched metadata for, it will not fetch metadata this time.

Never mind, I am going to delete that one, but something looks strange. And I still have a lot of PDFs for which metadata apparently can't be found, but I get the feeling that on another day, at another time, it will be.

adamsmith · January 15, 2015

@aurimas - do you want to take a look at www.eva.dk/projekter/2013/undervisning-pa-mellemtrinnet/download-rapporten/motiverende-undervisning.-taet-pa-god-undervisningspraksis-pa-mellemtrinnet/download

looks like something unusual might be going on there that could use another look.

Grode · January 15, 2015

One more thing... hmm, new thread... not sure: When I do succeed in fetching metadata, the file data is removed. That is a little sad, because then I can't click to open the file from Zotero anymore.

adamsmith · January 15, 2015

Are you sure the file isn't just attached to the metadata? That's what's supposed to happen.

aurimas · January 15, 2015

@aurimas - do you want to take a look at www.eva.dk/projekter/2013/undervisning-pa-mellemtrinnet/download-rapporten/motiverende-undervisning.-taet-pa-god-undervisningspraksis-pa-mellemtrinnet/download

looks like something unusual might be going on there that could use another look.

Here's what's happening (some of these parameters are chosen arbitrarily and the logic could be improved overall, so it may be worth discussing).

For metadata extraction, Zotero takes text from the first 7 pages of a PDF (in the case of that PDF, it covers the ToC and the preface). It then looks for a DOI in the first 80 lines of that text (the PDF in question doesn't contain a DOI). Then it looks for an ISBN (again in the first 80 lines). In this case, there is an ISBN, 978-87-7958-796-0, which Zotero detects correctly, but it is unable to find it registered in the Library of Congress, WorldCat, or Lulu.
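The identifier step described above can be sketched roughly as follows. This is an illustration in Python, not Zotero's actual code: the regular expressions, the ISBN-13 checksum validation, and the names `find_identifier` and `is_valid_isbn13` are assumptions made for the example. Only the 80-line window and the DOI-before-ISBN order come from the description.

```python
import re

MAX_LINES = 80  # window size taken from the description above

# Assumed patterns, chosen for illustration only.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ISBN13_RE = re.compile(r"\b97[89](?:[\s-]?\d){10}\b")


def is_valid_isbn13(digits: str) -> bool:
    """Check the ISBN-13 checksum (weights 1, 3, 1, 3, ... sum to 0 mod 10)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def find_identifier(text: str):
    """Look for a DOI first, then an ISBN, in the first MAX_LINES lines."""
    head = "\n".join(text.splitlines()[:MAX_LINES])

    doi = DOI_RE.search(head)
    if doi:
        # Strip punctuation that commonly trails a DOI in running text.
        return ("doi", doi.group(0).rstrip(".,;)"))

    for match in ISBN13_RE.finditer(head):
        digits = re.sub(r"\D", "", match.group(0))
        if is_valid_isbn13(digits):
            return ("isbn", digits)

    return None


if __name__ == "__main__":
    print(find_identifier("Preface\nISBN 978-87-7958-796-0\n"))
```

Finding the ISBN is only half of the step: as in this case, the lookup against the catalogs can still come back empty, and the process then falls through to the full-text search.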

At this point, Zotero proceeds to look for lines of text that are suitable for a full-text search via Google Scholar. It does some magic, taking only the text in the first column of each line (i.e. not preceded by a tab), and only if that text is longer than 3 words (the column selection is a good idea for journal articles, which tend to be printed in multi-column format). If Zotero finds at least 20 such lines, it proceeds to query Google Scholar with some of them. The PDF in question contains only 19 lines that pass the cleanup, so Zotero assumes that the PDF is not OCR'ed and that those lines are just some junk, like "This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project", which could increase false-positive hits. Though maybe the language for the error could be changed to something like "Zotero could not find enough usable text to retrieve metadata. Make sure the PDF has been OCR'ed."
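A minimal sketch of that line filter, again in Python and again not Zotero's actual code. The function names and the exact way the first column is cut at the first tab are assumptions; the "longer than 3 words" and "at least 20 lines" thresholds are the ones described above.

```python
MIN_WORDS = 3   # a line must be longer than this many words
MIN_LINES = 20  # at least this many usable lines are required


def usable_lines(text: str) -> list:
    """Keep the first column of each line if it has more than MIN_WORDS words."""
    kept = []
    for raw in text.splitlines():
        # Assumption: columns are tab-separated, so the first column is
        # everything before the first tab.
        first_column = raw.split("\t", 1)[0].strip()
        if len(first_column.split()) > MIN_WORDS:
            kept.append(first_column)
    return kept


def has_enough_text(text: str) -> bool:
    """True if there are enough usable lines to attempt a full-text search."""
    return len(usable_lines(text)) >= MIN_LINES


if __name__ == "__main__":
    toc_line = "3\tMotivating teaching in practice\t17"
    prose_line = "This report examines teaching practice in grades four to six"
    print(usable_lines(toc_line))                   # ToC line is dropped
    print(has_enough_text("\n".join([prose_line] * 19)))
    print(has_enough_text("\n".join([prose_line] * 20)))
```

With a filter like this, a table-of-contents line whose first column is only a number contributes nothing, which is how a text-rich PDF can end up one line short of the threshold.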

Generally, this logic works OK, because having extracted text from 7 pages of a PDF, you would hope to have quite a few lines of text with more than 3 words. The ToC in this PDF kills a lot of lines, since the first column on each line is just a number. I think what would help in general is increasing the number of PDF pages extracted. I can't think of much penalty in terms of false-positive detection, but it does mean a bit more processing. Increasing the number of pages to, say, 15 or 20 shouldn't be a big issue though.

Again, though, this is not a general issue that's applicable to PDFs of journal articles, which is a major target of metadata retrieval.

adamsmith · January 15, 2015

Thanks. I didn't remember the 7-page part.
I would like us to do as well as possible on reports and working papers, not least because those are what I mainly use the function for (journal articles are imported via the URL bar icon). ToCs, title pages, impressums and the like aren't uncommon there. I think going up to 15 pages would be worth at least a test.

Grode · January 16, 2015

And more sources. For text in Danish a clever search will be in bibliotek.dk as all Danish texts will be in that database. You probably have to talk to them about how to access the ref data, but I believe it will be possible.

Grode · January 16, 2015

And about the file name being gone: it is included in the metadata, but that is an upload to Zotero, I believe... hmm... maybe I have to look. Or the global file... not sure.

Anyway, it does not look as if it points to my local file, and I would also like to keep that reference.

adamsmith · January 16, 2015

but that is an upload to Zotero, I believe

No, that's a local copy. By default it'd be in the Zotero data folder, but you can set this up in different ways. In either case, the file stays in the same place before and after Retrieve Metadata.

For text in Danish a clever search will be in bibliotek.dk as all Danish texts will be in that database.

Unlikely we'll do that. That's not a full-text search, so it wouldn't replace Google Scholar in the first place. As for the ISBN search, there are costs to adding too many different options, and my guess would be that most of what they have is covered in WorldCat (and the ISBN in question isn't found in bibliotek.dk either).