Making a page generic connector friendly

bjohas · May 4, 2018

Hope this makes sense: Is it possible to adjust web-page metadata, so that it's picked up by the connector?

Say I want to add pages to my website X for papers I write (where the paper is linked from the page). Clearly it's not worth developing a connector for this website X, as it would hardly be used. So can I adjust the metadata for pages on my website X to be picked up by the existing connector?

Phrased another way: How does webpage metadata get collected by the Zotero connector?

Thanks!

adamsmith · May 4, 2018

https://www.zotero.org/support/dev/exposing_metadata

bjohas · May 4, 2018

Doh! Thank you!

bjohas · June 8, 2018

@adamsmith - would you happen to have an example for how to structure page metadata so that the connector identifies multiple publications? E.g. what would the metadata on https://www.educ.cam.ac.uk/centres/real/publications/ need to be so that all papers are picked up?

DWL-SDCA · June 8, 2018

@bjohas I looked at your other posts to this forum, and I see that you have the background knowledge needed to handle website programming, but do you have the skills and available time? Anyone who has contemporary web development skills should be able to do this for you or your center. The group that developed the website in your example should be able to handle the task.

If you are only requesting an example of a site that has implemented exposed metadata I recommend visiting my https://www.SafetyLit.org site, performing a literature search, and looking at the page source of any page with one or several records. SafetyLit provides both unAPI/MODS and GS/Highwire formatted metadata. SafetyLit is a free service presented without advertising. Our mission is to index the world's scholarly literature on safety.

bjohas · June 8, 2018

@DWL-SDCA Many thanks! I looked at your pages, and pages for individual papers have got meta/GS/Highwire, but pages with multiple papers (from the search) do not.

Question 1: Am I right in thinking that using <meta> tags to provide metadata will only work for individual items, irrespective of what RDF vocab you follow? In principle RDF could describe more than one paper...

Question 2: I assume this is also true for GS/Highwire? I.e. only one entry per page?

Question 3: What's the advantage/disadvantage of COinS vs. unAPI/MODS?

Many thanks!

noksagt · June 8, 2018

Question 3: What's the advantage/disadvantage of COinS vs. unAPI/MODS?

COinS

PRO

"Easy": Can be plain static HTML, editable on the page
Zotero will generate it for you
Other clients like LibX can use it as an article locator

CON

Extremely limited format. It was designed to locate a resource, not to fully describe it. Features like attached PDFs and notes won't work.

UnAPI

PRO

May provide multiple concurrent formats, including MODS.
Some reference-oriented web-apps may generate these bibliographic formats already.
These are richer and more robust

CON

A little heavier (needs a web app to respond to the various GET requests).

Both work and both can be implemented on the same site (see another example). Neither has become particularly popular or is a "maintained spec". Both end up being easyish to use for people familiar with web programming. I'm not sure of their respective abilities to be used by reference managers other than Zotero.
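To make the COinS option concrete, here is a minimal sketch in Python that builds one COinS span for a journal article. The function name `coins_span` and the sample values are made up for illustration; the `Z3988` class and the `ctx_ver`/`rft.*` keys are the usual COinS conventions as far as I know. Each item gets its own span, so a listing page could carry one per paper.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(atitle, jtitle, date, aulast, aufirst):
    """Return a COinS <span> describing one journal article (illustrative helper)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.date", date),
        ("rft.aulast", aulast),
        ("rft.aufirst", aufirst),
    ]
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # "&" separators are safe inside the title attribute.
    context_object = urlencode(fields)
    return '<span class="Z3988" title="{}"></span>'.format(
        escape(context_object, quote=True)
    )


if __name__ == "__main__":
    # Hypothetical sample record.
    print(coins_span("An example paper", "Journal of Examples", "2018", "Doe", "Jane"))
```

The output is plain static HTML, which is why no server-side component is needed for this route.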

bjohas · June 8, 2018

@noksagt That's very helpful - thank you. So COinS cannot have attached PDFs, but I assume UnAPI can?

"A little heavier (needs a web app to respond to the various get requests": So for UnAPI, I'd have to set up an UnAPI server, like http://...northwestern.../.../unapi.php?

noksagt · June 8, 2018

So COinS cannot have attached PDFs, but I assume UnAPI can?

Yes.

So for UnAPI, I'd have to set up an UnAPI server?

I'd call it a "service" rather than a "server", but yes.
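For a sense of how small such a service can be, here is a rough sketch in Python. It assumes, as I read the unAPI spec, three request shapes: no parameters (list all formats), `id` only (list formats for that item, with a 300 status), and `id` plus `format` (return the record). `RECORDS`, `FORMATS`, `unapi_response` and the identifier `paper-1` are hypothetical, and wiring this into an actual web framework is left out.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical store: identifier -> {format name: (content type, body)}.
RECORDS = {
    "paper-1": {
        "mods": (
            "application/xml",
            '<mods xmlns="http://www.loc.gov/mods/v3">'
            "<titleInfo><title>Example paper</title></titleInfo></mods>",
        ),
    },
}

# Formats this sketch offers: format name -> content type.
FORMATS = {"mods": "application/xml"}


def format_list(identifier=None):
    """Build the <formats> XML listing, optionally tied to one identifier."""
    id_attr = " id={}".format(quoteattr(identifier)) if identifier else ""
    entries = "".join(
        "<format name={} type={}/>".format(quoteattr(name), quoteattr(ctype))
        for name, ctype in sorted(FORMATS.items())
    )
    return '<?xml version="1.0" encoding="UTF-8"?><formats{}>{}</formats>'.format(
        id_attr, entries
    )


def unapi_response(params):
    """Map query parameters to (status, content type, body)."""
    identifier = params.get("id")
    fmt = params.get("format")
    if identifier is None:
        return 200, "application/xml", format_list()
    if identifier not in RECORDS:
        return 404, "text/plain", "unknown identifier"
    if fmt is None:
        return 300, "application/xml", format_list(identifier)
    if fmt not in RECORDS[identifier]:
        return 406, "text/plain", "unsupported format"
    content_type, body = RECORDS[identifier][fmt]
    return 200, content_type, body
```

Calling `unapi_response({"id": "paper-1", "format": "mods"})` returns the stored MODS record; a real service would look the record up in whatever database backs the site.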

bjohas · June 8, 2018

@noksagt Thank you so much, that's very helpful!

If you or @DWL-SDCA have thoughts about Q1, Q2, that would be appreciated too!

noksagt · June 8, 2018

Q1: If by "RDF" you mean "embedded RDF" then: I believe so. But you could return Zotero RDF or Bibliontology RDF or Dublin Core RDF via UnAPI too.

bjohas · June 8, 2018

OK, makes sense - thanks. I was hoping I could just add metadata for all papers as <meta> tags, but as far as I can see, it doesn't seem possible.

noksagt · June 8, 2018

Q2: I admit I'm less familiar with GS/Highwire (I've developed to every other standard/format listed in this thread, but not this one yet). But I believe that these embedded tags are also meant to describe the single page you're on instead of multiple resources referred to in the page. I also don't know if Zotero has a generic GS/HW translator (EDIT: unless it is part of the generic "embedded metadata" translator)? It isn't a separate option for any SafetyLit page.

bjohas · June 8, 2018

Yeah, that is how I see it too :) Thanks again!

bjohas · June 8, 2018

Follow-on off-topic-ish discussion here: https://forums.zotero.org/discussion/72277/are-public-zotero-libraries-indexed-by-google-scholar

DWL-SDCA · June 9, 2018

Q2: noksagt is correct that GS/Highwire is for single-item pages. That is why the SafetyLit multiple-item pages don't have that in the header.
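As an illustration of what a single-item page header might carry, here is a small Python sketch that renders GS/Highwire-style tags for one paper. The helper name `highwire_meta` and the sample values are hypothetical; the tag names (`citation_title`, `citation_author`, `citation_publication_date`, `citation_pdf_url`) are the ones listed in the Google Scholar inclusion guidelines, as far as I recall.

```python
from html import escape


def highwire_meta(title, authors, date, pdf_url=None):
    """Render Highwire-style <meta> tags for ONE item (illustrative helper)."""
    pairs = [("citation_title", title)]
    # One citation_author tag per author, in order.
    pairs += [("citation_author", author) for author in authors]
    pairs.append(("citation_publication_date", date))
    if pdf_url:
        pairs.append(("citation_pdf_url", pdf_url))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in pairs
    )


if __name__ == "__main__":
    # Hypothetical sample record.
    print(highwire_meta("An example paper", ["Doe, Jane", "Roe, Richard"], "2018/06/08"))
```

Because the tags sit in the page head and describe the page itself, there is no obvious way to repeat the set for a second paper on the same page.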

Q1: The unAPI/MODS function with SafetyLit is what allows Zotero multiple-item downloads from pages that list multiple items. You will not see the details in the page header because that occurs behind the scenes via an invisible service.
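On the page side, unAPI (as I understand the convention) needs only a pointer to the service in the head plus one empty marker per listed item; the records themselves are fetched from the service. A sketch in Python that generates that markup, with a hypothetical service URL and hypothetical identifiers:

```python
from html import escape


def unapi_server_link(service_url):
    """<link> element for the page head, pointing at the unAPI service."""
    return '<link rel="unapi-server" type="application/xml" title="unAPI" href="{}">'.format(
        escape(service_url, quote=True)
    )


def unapi_id_markers(identifiers):
    """One empty <abbr class="unapi-id"> marker per listed item."""
    return [
        '<abbr class="unapi-id" title="{}"></abbr>'.format(escape(ident, quote=True))
        for ident in identifiers
    ]


if __name__ == "__main__":
    # Hypothetical service URL and item identifiers.
    print(unapi_server_link("https://example.org/unapi"))
    for marker in unapi_id_markers(["paper-1", "paper-2"]):
        print(marker)
```

A multi-item listing would place each marker next to the corresponding entry, which is how one page can expose several records.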

Q1.5: Upon selecting items from multi-item pages and viewing the summary records, the system will also allow RIS to be downloaded for all checked items.

The technology (open access) has already been developed to implement these features on a site with bibliographic content. My web developers charged for very few hours of effort to include these features because they only adapted existing scripts to accomplish the tasks.

It appears that the bibliographic list on your site is more a static flat file while SafetyLit pages are php/SQL based and rendered dynamically. That might make a difference depending on how you manage your site content.

bjohas · June 9, 2018

Hi @DWL-SDCA, and thank you!

So any meta tags in HTML (whether Highwire or Dublin Core) describe only single items, because they effectively describe the page? Do you happen to know whether there's a Dublin Core equivalent to

<meta name="citation_pdf_url" content="http://www.example.com/content/271/20/11761.full.pdf">

?

DWL-SDCA · June 9, 2018

No I don't know. I recall that at one time I asked about including the URLs to theses and reports and learned that GS found that objectionable because universities and agencies too often break the links by changing their site structure. DOIs, however, are desirable. -- I suspect that GS is less interested in indexing my website (itself only an index) than in using my site to locate the full-text items that I have indexed.

bwiernik · June 9, 2018

Google Scholar won’t index a page if the PDF URL is not in the same subdirectory as the page URL.

bjohas · June 9, 2018

Hi @bwiernik - that's interesting. Is that documented in the spec? My site is running off a mediawiki, so that would be difficult!

bwiernik · June 9, 2018

I read it in their documentation somewhere, but don't recall where. I've noticed that WordPress sites don't get indexed or have their PDFs appear in Scholar even after multiple indexing requests through Google Dashboard, but static pages appear quickly.

bjohas · June 9, 2018

Interesting and very helpful facts!!

bjohas · June 9, 2018

To help me come up with a strategy for my own pages, I've started to write some of this up here: http://bjohas.de/go/metadata-and-zotero-api (Google Doc, ask for edit permissions if you want to contribute)

bwiernik · June 9, 2018

Item 2H. on the Google Scholar indexing guidelines:
https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

The "<meta>" tags normally apply only to the exact page on which they're provided. If this page shows only the abstract of the paper and you have the full text in a separate file, e.g., in the PDF format, please specify the locations of all full text versions using citation_pdf_url or DC.identifier tags. The content of the tag is the absolute URL of the PDF file; for security reasons, it must refer to a file in the same subdirectory as the HTML abstract.
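A quick way to check a page/PDF pair against that rule is sketched below in Python. The interpretation is my assumption: "same subdirectory" is taken to mean same scheme, same host, and same directory part of the path. The guideline does not spell this out, and the function name and the wiki-style URLs in the usage are hypothetical.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def same_subdirectory(page_url, pdf_url):
    """True if both URLs share scheme, host, and directory (assumed reading of the rule)."""
    page, pdf = urlsplit(page_url), urlsplit(pdf_url)
    same_origin = (page.scheme, page.netloc) == (pdf.scheme, pdf.netloc)
    return same_origin and dirname(page.path) == dirname(pdf.path)


if __name__ == "__main__":
    # Abstract page and PDF side by side in one directory.
    print(same_subdirectory(
        "http://www.example.com/content/271/20/11761.abstract",
        "http://www.example.com/content/271/20/11761.full.pdf",
    ))
    # Hypothetical wiki-style layout with uploads stored elsewhere.
    print(same_subdirectory(
        "http://www.example.com/wiki/Some_paper",
        "http://www.example.com/images/a/ab/Some_paper.pdf",
    ))
```

Under this reading, a site that stores uploads in a separate tree from its article pages would fail the check.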

bjohas · June 9, 2018

Thank you!

bwiernik · June 9, 2018

See also this page:
https://partnerdash.google.com/partnerdash/d/scholarinclusions#p:id=new&a=100323453