Retrieve PDFs Metadata > wrong metadata > source ?

alhb · March 11, 2016

Hello,
I hope I'm not wasting your time with my issue.
After Zotero retrieved the wrong metadata for a PDF, I would like to understand how Zotero retrieves PDF metadata.
My case: it's a freely downloadable PDF from the website of the French journal Glottopol:
http://glottopol.univ-rouen.fr/telecharger/numero_8/gpl8_04tran.pdf
If I save the PDF in Zotero and ask for metadata retrieval, it works, but the metadata I get refer to a cited work that is the subject of a review in that journal issue:
Bertucci M-M, Houdart-Merot V. Situations de banlieues: enseignement, langues, cultures. Lyon: Institut national de recherche pédagogique; 2005.

If I save the PDF from Google Scholar, I get the correct metadata:
Tran TD. SYSTEME DE RECHERCHE D’INFORMATION MEDICALE PAR CROISEMENT DE LANGUES: VIETNAMIEN-FRANÇAIS-ANGLAIS. [cité 11 mars 2016]; Disponible sur: http://glottopol.univ-rouen.fr/telecharger/numero_8/gpl8_04tran.pdf

I thought Zotero queried the Google Scholar database.
Does this mean that the wrong metadata comes from the journal? (If so, I will inform the journal's webmaster.)

Thank you for clarifying.

adamsmith · March 11, 2016

The reason for the wrong metadata is the ISBN on the first page. Zotero assumes (usually correctly) that ISBNs on the first page or two refer to the PDF itself, not to another work. It looks those up on WorldCat.
(And journals can affect how Zotero imports from their website, but never how Retrieve Metadata works, so no, don't report problems with that to the journal.)
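
To illustrate the idea, here is a rough Python sketch. It is not Zotero's actual code: the function names, the regular expression, and the sample strings in the usage note are made up for the example. It scans extracted page text for something that looks like a hyphenated ISBN, validates the check digit, and returns the first valid hit, which would then be used as the lookup key.

```python
import re
from typing import Optional

# Hypothetical pattern: "ISBN", optional "-10"/"-13" and colon, then a
# hyphenated or plain run of digits (a final X is allowed for ISBN-10).
ISBN_RE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9\-]{8,15}[0-9Xx])")


def is_valid_isbn(digits: str) -> bool:
    """Return True if the ISBN-10 or ISBN-13 check digit is correct."""
    if len(digits) == 10:
        if not re.fullmatch(r"[0-9]{9}[0-9X]", digits):
            return False
        # ISBN-10: weights 10..1, "X" counts as 10, sum divisible by 11.
        total = sum((10 - i) * (10 if c == "X" else int(c))
                    for i, c in enumerate(digits))
        return total % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        # ISBN-13: weights alternate 1 and 3, sum divisible by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3)
                    for i, c in enumerate(digits))
        return total % 10 == 0
    return False


def first_isbn(text: str) -> Optional[str]:
    """Return the first valid ISBN found in the text, without hyphens."""
    for match in ISBN_RE.finditer(text):
        candidate = match.group(1).replace("-", "").upper()
        if is_valid_isbn(candidate):
            return candidate
    return None
```

For example, first_isbn("Review. ISBN 0-306-40615-2, 2005") returns "0306406152". Nothing in such a scan can tell whether the ISBN belongs to the PDF itself or to a book the PDF merely discusses, which is the situation in this thread.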

alhb · March 11, 2016

Thank you for replying so quickly!

hamed_fcs · April 23, 2017

Here is an example of wrong metadata retrieval that is not related to a wrong ISBN:
https://www.researchgate.net/publication/262198643_althlyl_alaly_llwqf_walabtda_fy_nsws_allght_alrbyt_alhdytht_walklasykyt_Automatic_Analysis_of_Phrase-Break_Prediction_for_Arabic

This has been a recurring problem with ResearchGate.
There is no ISBN in the metadata, but the wrong information was retrieved from Google Scholar:

Text analysis and word pronunciation in text-to-speech synthesis
Type Journal Article
Author Mark Y. Liberman
Author Kenneth W. Church
URL https://www.researchgate.net/profile/Mark_Liberman2/publication/230876257_Text_Analysis_and_Word_Pronunciation_in_Text-to-Speech_Synthesis/links/56550e3508ae1ef9297700a4.pdf
Pages 791–831
Publication Advances in speech signal processing
Date 1992
Accessed 4/23/2017, 11:08:44 AM
Library Catalog Google Scholar
Date Added 4/23/2017, 11:08:44 AM
Modified 4/23/2017, 11:08:44 AM
Attachments
althlyl-alaly-llwqf-walabtda-fy-nsws-allght-alrbyt-alhdytht-walklasykyt-Automatic-Analysis-of-Phrase-Break-Prediction-for-Arabic.pdf

hamed_fcs · April 23, 2017

The metadata were retrieved by downloading the attachment into Zotero.

DWL-SDCA · April 23, 2017

Google Scholar openly acknowledges that their metadata is not curated and that, by the nature of how records are brought into the GS service, there will be errors. I strongly recommend that GS users follow the link to the source and import metadata from the original site. The original site should always provide metadata that is more accurate and complete than metadata taken directly from GS.

hamed_fcs · April 23, 2017

The file is downloaded from ResearchGate. Zotero automatically creates a parent item for the file and retrieves metadata from Google Scholar using details from the file, mostly the name and the ISBN if available, and possibly others I don't know about.
Most files available on ResearchGate can't be retrieved from their original sources; some metadata may be openly available there, but not the full paper.

bwiernik · April 23, 2017

You should use the Zotero button in the browser while on the Researchgate site. That will usually get much better metadata than through Google Scholar. Retrieve Metadata is almost never the best way to get item metadata:
https://www.zotero.org/support/getting_stuff_into_your_library

aborel · April 23, 2017

> Google Scholar openly acknowledges that their metadata is not curated

Do you have a reference for this? An authoritative statement would be extremely useful for me right now!

adamsmith · April 23, 2017

(all that said, Zotero doesn't grab any metadata for the file at https://www.researchgate.net/publication/262198643_althlyl_alaly_llwqf_walabtda_fy_nsws_allght_alrbyt_alhdytht_walklasykyt_Automatic_Analysis_of_Phrase-Break_Prediction_for_Arabic for me, so not sure what you're seeing)

hamed_fcs · April 23, 2017

That is strange, @adamsmith. Zotero definitely retrieves metadata for the item.
Here is another example, from just now:

https://www.researchgate.net/publication/280976086_An_Interactive_Speech_Web_Site_in_Arabic_and_English
Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition
Type Book
Author Daniel Jurafsky
Author James H. Martin
Series Prentice Hall series in artificial intelligence
Edition 2. ed., Pearson internat. ed
Place Upper Saddle River, NJ
Publisher Prentice Hall, Pearson Education Internat
ISBN 978-0-13-504196-3
Date 2009
Extra OCLC: 263455133
Library Catalog Gemeinsamer Bibliotheksverbund ISBN
Language eng
Short Title Speech and language processing
# of Pages 1024
Date Added 4/24/2017, 12:22:55 AM
Modified 4/24/2017, 12:22:55 AM
Tags:
Automatic speech recognition
Automatische Spracherkennung
Computational Linguistics
Computerlinguistik
Lehrbuch
Natural language processing (Computer science)
Notes:
Literaturverz. S. 945 - 994
Attachments
An-Interactive-Speech-Web-Site-in-Arabic-and-English.pdf

adamsmith · April 23, 2017

So in this case it's grabbing an ISBN from the bibliography. We may be able to try to prevent that -- it's only looking for ISBNs on the first 10 (I think) pages, so this wouldn't happen for regular length articles.
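
A small Python sketch of that page window and of one possible safeguard. This is only an illustration under stated assumptions, not Zotero's code: MAX_PAGES = 10 comes from the "(I think)" above, the "stop at a References heading" heuristic is just one hypothetical way to prevent the problem, and all names are made up.

```python
import re
from typing import List, Optional

MAX_PAGES = 10  # assumed scan window, taken from the estimate above

# Hypothetical patterns for the illustration.
ISBN_RE = re.compile(r"ISBN[:\s]*([0-9][0-9\-]{8,16}[0-9Xx])")
REFERENCES_RE = re.compile(r"^\s*(?:references|bibliography)\s*$",
                           re.IGNORECASE | re.MULTILINE)


def find_isbn(pages: List[str], stop_at_references: bool = False) -> Optional[str]:
    """Return the first ISBN-like string in the first MAX_PAGES pages.

    With stop_at_references=True, scanning ends at the first line that
    consists only of "References" or "Bibliography".
    """
    for page in pages[:MAX_PAGES]:
        heading = REFERENCES_RE.search(page) if stop_at_references else None
        text = page[:heading.start()] if heading else page
        match = ISBN_RE.search(text)
        if match:
            return match.group(1).replace("-", "").upper()
        if heading:
            break  # everything after the heading is bibliography
    return None
```

With a short paper whose reference list starts on page 3 and cites a book with ISBN 978-0-13-504196-3, find_isbn(pages) returns "9780135041963", while find_isbn(pages, stop_at_references=True) returns None. For a paper with ten or more pages before the bibliography, the window alone already keeps the cited ISBN out of reach.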

hamed_fcs · April 23, 2017

What I'm doing is downloading the file from the download link, then saving it to Zotero and retrieving metadata from the file.
@bwiernik, if you use the default Save to Zotero with embedded metadata, it returns nothing: it saves the web page and downloads no file.
If you use "DOI", it retrieves metadata from Google Scholar or another library catalog, and it may save the file only if the publisher makes it publicly available.

Downloading saves the file to Zotero and creates an item whose type is correct and whose title is correct most of the time, but there are too many mistakes in the authors and the publication date.

adamsmith · April 23, 2017

I'd avoid importing from ResearchGate whenever possible. Import the article via DOI or from the publisher, then manually attach the PDF. I'm pretty sure you'll save time and trouble in the long run.

Zotero should have relatively few false positives using retrieve metadata, but it'll often get details wrong, especially as there are frequently multiple versions of (essentially) the same paper that it has no way of distinguishing.

hamed_fcs · April 23, 2017

Academia.edu and ResearchGate are not normal publishers. More than 90% of the time, the articles there are posted by the author, whether or not the publisher allows it.
If it is allowed, you can sometimes find the original using the DOI or Google Scholar, but sometimes you can't, and you need to open the original site and maybe log in or create an account.
For some reason, the cooperation between Academia.edu or ResearchGate and Zotero is close to zero; maybe there is some competition, I don't know.
But DOI works on ResearchGate (not on Academia.edu), which is good: it means ResearchGate is treated as a library and there is a dedicated translator, which may need some adjustment.

adamsmith · April 23, 2017

I understand what RG and Academia.edu are. They just don't provide much, if anything, structured enough to base a translator on, which is why none exists.

hamed_fcs · April 23, 2017

@adamsmith that is correct in a way: you can't cite RG in a paper or thesis. But when you are looking for extensive knowledge to grasp a topic, and you are not paying tens of thousands of dollars every semester to a famous university that gives you access to anything you need, RG and Academia.edu are a must. You need them to find enough knowledge; after that, referencing is easy.
Normally, when you write a paper or thesis, you need to read at least 5-10 times as much as you will reference in your dissertation.

adamsmith · April 23, 2017

no, that's not what I mean. I really do understand the scholarly publication landscape. What I am saying is that RG and Academia don't provide enough information on their pages for Zotero to construct a reasonably well-working translator to import it.

Almost all publishers allow you to import the item's information in front of their paywall. What I'm advising you to do is to import that and then attach the PDF from wherever you get it -- RG, Academia, unpaywall button, or SciHub.

hamed_fcs · May 20, 2017

I tried to retrieve metadata for very old PDFs I downloaded ages ago. Zotero managed to retrieve metadata for 26 out of 47 PDFs. I haven't checked the accuracy of the metadata yet, but I noticed that the file name has a huge effect on the procedure:
duplicate files with different names get different results.
This is not an important issue for Zotero, but it suggests that some adjustment to the procedure could improve it. For some short-named PDFs, Zotero managed to retrieve the full article title.

adamsmith · May 20, 2017

That's just not possible. The code that runs for Retrieve Metadata never so much as loads the filename into memory. It only looks inside the file. Something else is going on.
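
One way to check whether the "duplicates" really are the same file is to compare content hashes. The Python sketch below is independent of Zotero, and its function names are made up for the example. If the hashes differ, the PDFs differ internally, and that, rather than the file name, could explain the different results.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded all at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True if both files are byte-for-byte identical, whatever their names."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_content() on two of the differently named copies shows whether they are byte-identical.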