Bizarre Indexing and Metadata Problems

pensak · February 4, 2010

I suspected that there was something wrong with the PDF indexing, as the ratio of documents to words was very odd (4087 documents, each averaging over 10 pages, but only 308278 words indexed). So I decided to take a one-page Word document that I had written (575 words), convert it to PDF, and then import it into Zotero. By reflex, after I had imported the PDF, I requested a retrieve of metadata. Of course it should have returned failure, as the document was one I had just written and it had never been published anywhere. Imagine my surprise when I was told that said document was a book written by Saul Bellow entitled Seize the Day. When I checked the indexing statistics, they now read 4088 documents (correct) but 308280 words (it claims to have added just two words out of the entire document). Can anyone give me a clue what is going on and how to work around it? I'd really like to have my documents indexed, and would like a clue how Zotero decided that my one-pager had been written by Saul Bellow, who fortunately is no longer around to take insult at my woeful writing skills.

dstillman · February 4, 2010

Those are unique words, not total words. Did you try searching for words and/or phrases in the document?

If you provide a Debug ID for the Retrieve Metadata attempt, we can take a look.

pensak · February 4, 2010

OK. The Debug ID is D1632209700.

I'm still confused on the word count issue. Before I posted, I went to the Oxford English Dictionary and it said there were only 141,000 distinct words and a few tens of thousands of obsolete words. Together that does not come close to the 308,000 words the index summary says are there. Since you wrote the code you obviously know what is going on, but I am curious whether you count words with and without pluralizing suffixes etc. as one word or as different words.

Thanks

ajlyon · February 4, 2010

The word count is almost certainly a naive index of all the words that occur, counting different forms of the same word as distinct words. Note also that garbled text in OCR'ed PDFs and all sorts of odd non-word sequences of characters in HTML files could inflate the number. Toss in names, odd scientific terms, numbers, and non-English text, and 308,000 is believable.
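To make the distinction concrete, here is a minimal Python sketch of a naive unique-word index. It is an illustration under assumptions, not Zotero's actual indexing code: the `tokenize` and `index_stats` helpers and the sample texts are hypothetical, and the only normalization is lowercasing (no stemming), so "asset" and "assets" count as two entries, as do numbers and OCR garbage.

```python
import re


def tokenize(text):
    # Assumption: split on anything that is not a letter or digit and
    # lowercase the result. No stemming, so "asset" != "assets".
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def index_stats(documents):
    # Return (total tokens seen, number of distinct tokens) for a corpus.
    unique_words = set()
    total_words = 0
    for text in documents:
        tokens = tokenize(text)
        total_words += len(tokens)
        unique_words.update(tokens)
    return total_words, len(unique_words)


# Hypothetical sample texts; "m0del" stands in for an OCR error.
docs = [
    "The model prices assets. The models price an asset.",
    "Asset pricing in 2010: the m0del (OCR error) prices assets.",
]

total, unique = index_stats(docs)
print(total, unique)

# Adding a document made only of words already in the index raises the
# total count but leaves the unique count unchanged.
total_after, unique_after = index_stats(docs + ["The asset prices the model."])
print(total_after, unique_after)
```

In this sketch the last document contributes five tokens and no new unique words, which is the same effect as a 575-word page adding only two entries to an index that already holds hundreds of thousands.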

dstillman · February 5, 2010

Yep.

dstillman · February 5, 2010

The metadata issue here (and on the third one you posted) is due to the double-quotes in the original text, which Zotero is incorrectly passing through to Google Scholar in the middle of an already-quoted phrase. This causes Google Scholar to match on individual words rather than the full phrase. We'll fix this.
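A rough Python sketch of the quoting problem, for illustration only. The function names and the sample line are hypothetical, the quote handling is a simplified assumption about how a search engine pairs up quotes, and stripping the embedded quotes is just one possible fix, not necessarily the one that will ship.

```python
def build_phrase_query_naive(phrase):
    # Wraps the phrase in double quotes without touching any quotes that
    # are already inside it.
    return '"' + phrase + '"'


def build_phrase_query_stripped(phrase):
    # One possible fix (an assumption): replace embedded double quotes
    # with spaces and collapse whitespace before quoting the phrase.
    cleaned = phrase.replace('"', " ")
    return '"' + " ".join(cleaned.split()) + '"'


def quoted_segments(query):
    # Simplified model of quote pairing: after splitting on '"', the
    # pieces at odd positions are the ones enclosed by a quote pair.
    return query.split('"')[1::2]


# Hypothetical line of text taken from a PDF's first page.
line = 'the so-called "value premium" in asset returns'

naive = build_phrase_query_naive(line)
fixed = build_phrase_query_stripped(line)

print(quoted_segments(naive))  # phrase is broken into fragments
print(quoted_segments(fixed))  # phrase survives as a single unit
```

With the naive version, the embedded quotes close the outer phrase early, so the query no longer asks for the full line as one exact phrase.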

pensak · February 5, 2010

Thanks for the explanation. I've just submitted another one, ID 1551622069. The title is ECONOMETRIC ASSET PRICING MODELLING but it comes back as an article in the Journal of Forecasting entitled Forecasting Inflation Using Economic Indicators. I didn't see any double quotes in the text, so I thought I should submit it anyway, just in case there is something else in there which is confusing Google Scholar.

Do you want me to continue submitting these as I find them, or just wait until you have another update and see if they persist despite all the other things you are fixing? I do not want to inundate you with related problems.

I think Zotero is fabulous and am moving my entire cache of references to it.

dstillman · February 5, 2010

It's the Debug ID that we need to diagnose this, not the Report ID.

Feel free to post that one, and then maybe hold off until after the next release.

pensak · February 5, 2010

The Debug ID is D979060758. I'll just keep a record of anything else that fails and test them out on the next release.

Thanks