Google indexing and the look of the items pages

Dear Zotero,

I recently posted that I started to use Google CSE to search our Zotero libraries: Zotero Forums - Search Zotero with Google

It is convenient and works fast, but there is a problem with the result list. Please have a look here:

Search (with Google) Zotero library From Some Psychologists for "unspoken"

There is an entry in the result list there saying "Community Treatment Disorder". The snippet for this says

"Community treatment orders
Peter Levine: In an Unspoken Voice - references. Tips 2011. Violence. Trash. _RIS import. *Behavior An… *Behavioral … *CORRUPTION*. 1-Propanol..."
Unfortunately, most of that is unrelated to the entry, which you can see here: Zotero | Groups > From Some Psychologists > Library > Community treatment orders: current evidence and the implications ;-)

The reason for this is what Zotero gives Google while it indexes the library. I would suggest totally revamping the look of the items pages. I would even be glad to help with this since it is a major problem! (The look and feel can of course be much enhanced for human beings too.)

Kind regards,
  • I would be glad for a comment on this.
  • The reason for this is what Zotero gives Google while it indexes the library.
    I'm not sure what that's supposed to mean. We can't "give" Google anything different from what we show users.* (On a technical note, this does show that Google is rendering the page using JavaScript, since we don't actually include the collections list in the served HTML. So to the extent that they haven't always done that, or have gotten better at doing so, this behavior might be new.)

    It's possible that something like adding a <nav> tag around the collections list would make Google exempt text within from searches, but there's no guarantee.

    * I'm not sure if it's technically forbidden to show less content to Google — they're generally concerned with showing more content to bots than to humans. But I'd rather not risk our being penalized for serving different content to bots.
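For illustration, the <nav> idea could be sketched as follows. This is only a minimal sketch: the function name and the `aria-label` value are hypothetical, it is not Zotero's actual rendering code, and, as noted above, there is no guarantee that Google treats text inside <nav> any differently.

```python
from html import escape


def render_collections_nav(collection_names):
    """Return the collections list wrapped in a <nav> element.

    Hypothetical helper: the markup Zotero really serves may differ.
    """
    items = "\n".join(
        f"    <li>{escape(name)}</li>" for name in collection_names
    )
    return f'<nav aria-label="Collections">\n  <ul>\n{items}\n  </ul>\n</nav>'


if __name__ == "__main__":
    print(render_collections_nav(["Tips 2011", "Violence", "Trash"]))
```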
  • Thanks Dan.

    I did a quick search to see if Google handles <nav> differently. The answer from JohnMu here (2012-10-28) suggests that they do not handle it differently, unfortunately: !topic/webmasters/NvB9f5n0saU

    I thought this problem must be common so I did a search in Google. Starting here

    I landed on this page which seems to give a possible solution:

    It says that "non-malicious duplicate" is allowed.

    Now I think it is possible to give two versions of an item page, one with table of contents and one without, and use the technique here to tell Google how to index and which page to display in the search results:

    I think this can be tailored to something very useful.
  • edited January 13, 2014
    Sorry, but those links mostly aren't relevant to this issue. (Telling Google which URL is canonical just determines which URL it shows in search results, which has to be the version with collections that users get. It's not going to present a non-indexed version to users.)
  • I guess you are right that the indexed page is what Google will show in the search results. Perhaps that is a problem.

    However, to me that seems like a much smaller problem than the current problem with a lot of false hits when searching.

    A button to show the table of contents could easily be added. (In most cases I guess you will not want to show the table of contents.)

    In the long term a structured search through Google would be much better of course. (I sent them such a suggestion, which they will perhaps never notice. ;-) )
  • edited January 30, 2014
    Here are two other suggestions that might more easily solve the problem.

    1) The title that Google displays in the hit list is now taken from the <title>...</title>. This could be changed to the title of the reference.

    2) The reference title is now in <h2>...</h2>. This could be changed to <h1>...</h1>. I do not remember for sure, but I think Google will use the <h1> if it is available (and has some text in it).
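The two suggestions above could look roughly like this. It is a sketch under stated assumptions: the function name and the "title | library | Zotero" pattern are made up for illustration and are not how Zotero builds its pages. Whether Google then actually prefers the <h1> text is not confirmed.

```python
from html import escape


def render_title_and_heading(item_title, library_name):
    """Build a <title> and an <h1> that both carry the reference title.

    Hypothetical helper; the title pattern is an assumption.
    """
    safe_title = escape(item_title)
    safe_library = escape(library_name)
    title_tag = f"<title>{safe_title} | {safe_library} | Zotero</title>"
    heading_tag = f"<h1>{safe_title}</h1>"
    return title_tag, heading_tag


if __name__ == "__main__":
    title_tag, heading_tag = render_title_and_heading(
        "Community treatment orders: current evidence and the implications",
        "From Some Psychologists",
    )
    print(title_tag)
    print(heading_tag)
```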
  • Google says here that it will use "microdata" (and not RDFa): FAQ - Webmaster Tools Help

    Perhaps that is enough to solve the problem here?
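To make the microdata idea concrete, here is a minimal sketch of how an item could be marked up. The choice of the schema.org `ScholarlyArticle` type, the fields shown, and the function name are all assumptions for illustration; the author and year in the example are placeholders, not real data. It does not show what Zotero serves.

```python
from html import escape


def render_item_microdata(title, authors, year):
    """Return an HTML fragment describing one reference with microdata.

    Hypothetical helper using the schema.org ScholarlyArticle vocabulary.
    """
    author_spans = "\n".join(
        '  <span itemprop="author" itemscope itemtype="https://schema.org/Person">'
        f'<span itemprop="name">{escape(author)}</span></span>'
        for author in authors
    )
    safe_year = escape(str(year))
    return (
        '<div itemscope itemtype="https://schema.org/ScholarlyArticle">\n'
        f'  <h1 itemprop="name">{escape(title)}</h1>\n'
        f"{author_spans}\n"
        f'  <time itemprop="datePublished" datetime="{safe_year}">{safe_year}</time>\n'
        "</div>"
    )


if __name__ == "__main__":
    # Placeholder author and year, for illustration only.
    print(render_item_microdata("Community treatment orders", ["A. Author"], 2000))
```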
  • edited April 12, 2014
    This may be more current:

    [The following formats are supported by Google]

    > * Microdata (recommended)
    > * Microformats
    > * RDFa

    The Structured Data Testing Tool helps with testing microdata / microformats / RDFa:

    Structured Data Markup Helper helps with generating structured data markup:
  • Took a quick look to see if Google indexes JavaScript-generated content now. They do not say they do it here, but they do not deny it either. Still, they recommend static HTML for indexing, using what Google calls "pretty URLs" (for JavaScript-generated content) and "ugly URLs" (for static HTML content):

    Learn More - Webmasters — Google Developers
    What the user sees, what the crawler sees

    (Just putting this note here now.)
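As far as I understand the scheme, a "pretty URL" containing `#!` is mapped by the crawler to an "ugly URL" carrying an `_escaped_fragment_` query parameter, and the server answers that one with static HTML. A rough sketch of the mapping follows. The function name is hypothetical, and the set of characters left unescaped is a simplification on my part, not the exact list from Google's documentation.

```python
from urllib.parse import quote


def pretty_to_ugly(url):
    """Map a '#!' URL to its '_escaped_fragment_' form (simplified sketch)."""
    if "#!" not in url:
        return url  # nothing to map
    base, fragment = url.split("#!", 1)
    separator = "&" if "?" in base else "?"
    # Assumption: a conservative escaping; the real scheme escapes fewer characters.
    return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='=/:,;')}"


if __name__ == "__main__":
    print(pretty_to_ugly("http://www.example.com/ajax.html#!key=value"))
```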
  • This is a real problem for me. I have therefore decided to go for my plan B:

    I will give Google content it can index the way I want it. I use two PHP scripts for this. One presents the items like this:

    The other gives Google a sitemap:

    Before submitting this to some new Google CSEs I wonder if you have any opinion on this. Is there something missing in the formatted output above? (It has markup for, Facebook and Twitter.)

    (I think it would be pretty cool if Zotero could handle this. I mean if users would be able to upload some definition of output views. It can be made safe.)
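For anyone wanting to try the sitemap part of such a plan, a minimal sketch is given here. It is written in Python rather than PHP and is not the script mentioned above; the function name and the example URLs are hypothetical. It emits a plain sitemap in the sitemaps.org 0.9 format with one <loc> per item page.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(item_urls):
    """Return a sitemap XML string listing the given item page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in item_urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = url  # text is XML-escaped on output
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(build_sitemap([
        "https://example.org/items/1",
        "https://example.org/items/2",
    ]))
```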