Notes not associated with sources after RDF import

acrosman · March 17, 2013

I have an RDF file with a few thousand sources and associated notes that I'm trying to import into Zotero. The file validates, and the import runs, but Zotero does not associate the notes with the sources. What kind of problems can trigger Zotero to misbehave this way?

A little more background: the RDF file started life as a Scribe3 export. After reading through a couple of threads about the import problems on that migration path, I wrote a Python script to adjust the export file so that it is fully valid RDF (not just enough to pass Zotero's validation, but with all the changes needed to get RDFLib to parse the file). The script switches the file to UTF-8, removes invalid XML entities, updates the line breaks, changes RDF:ID= to RDF:about=, removes stray formatting tags, and so on. My plan is to get the formatting flags converted to something Zotero can read, or at least to get all those records flagged, and to get line breaks to import properly.

I know the resulting file is valid RDF and that the references between the sources and the notes are intact, because I also wrote an RDF-to-HTML converter to see the final result more clearly; all but about 520 notes should associate correctly. Earlier today I was able to get Zotero to import the file I generate, but after a few adjustments to the source file Zotero now refuses to associate notes at all. The changes I made were to add additional isReferencedBy elements to a couple of sources to deal with the apparent limit in Scribe's export of 50 notes per source (that's why I have 520 disconnected notes).

If anyone can help figure out what kinds of problems would prevent Zotero from associating notes and sources, I'd be very grateful.

aurimas · March 17, 2013

it would be helpful to see a single entry with associated notes in your adjusted RDF format. We fixed a bug yesterday that would miss very long notes on import, but this sounds different.

adamsmith · March 17, 2013

I don't think it's possible to diagnose that in any helpful way without a sample file.
Try to create a minimal RDF that's not working and post it somewhere.

(also note that RDF comes in many flavors. Zotero principally supports the DC and the Bibliontology implementations. Not everything that is valid RDF will import fully in Zotero).

acrosman · March 17, 2013

Per aurimas' request I'll try to copy all the elements of a source and its notes into the next post (or two). I tried to put them into this post, but it's too long.

Per AdamSmith's request, umm, how would you suggest I go about cutting down an 85,000+ line file to the minimum that triggers a bug in your software without guidance? Sounds like you've chased import bugs before, so some suggestions about what might trigger issues would be helpful. The best I can offer you without some help is to suggest that, like Scribe, Zotero doesn't like sources with more than 50 notes (since that's the only difference from the file that worked earlier today). I might be able to make better guesses if you could point me toward places you've seen problems in the past that are worth investigating.

Do you have a data generator for creating files with various complexities? Potentially I could write you one if you haven't done so yet, but in my past searching I've never found complete docs on the Zotero RDF format, and as best I can tell you pull from at least four namespaces, not two, which makes it a challenge to determine how Zotero handles the source files (not to mention that I know from past experience that your code doesn't validate against them, so your import standard is wider than I could determine from reading those standards anyway). I don't mind being part of the solution here, but some suggestions on where to look would be helpful.

acrosman · March 17, 2013

<rdf:Seq rdf:about="rdf:SA1">
<rdf:li rdf:resource="rdf:A11" />
</rdf:Seq>
<ns1:Person rdf:about="rdf:A11" ns1:givenname="David" ns1:surname="Hempton" />
<ns2:Address ns2:locality="New Haven" rdf:about="rdf:LO1" />
<ns1:Organization rdf:about="rdf:PU1" ns1:name="Yale University Press">
<ns2:adr rdf:resource="rdf:LO1" />
</ns1:Organization>
<ns3:LCC rdf:about="rdf:CN1" rdf:value="Owned" />
<ns4:Book dc:date="2005" dc:title="Methodism: Empire of the Spirit" ns4:pages="" rdf:about="item_1" ns5:itemType="book">
<dc:description>
</dc:description>
<ns4:authors rdf:resource="rdf:SA1" />
<dc:publisher rdf:resource="rdf:PU1" />
<dc:subject rdf:resource="rdf:CN1" />
<dc:subject>s1</dc:subject>
<dc:subject>s2</dc:subject>
<ns3:isReferencedBy rdf:resource="#item_973" />
<ns3:isReferencedBy rdf:resource="#item_974" />
<ns3:isReferencedBy rdf:resource="#item_976" />
<ns3:isReferencedBy rdf:resource="#item_977" />
<ns3:isReferencedBy rdf:resource="#item_978" />
<ns3:isReferencedBy rdf:resource="#item_979" />
<ns3:isReferencedBy rdf:resource="#item_980" />
<ns3:isReferencedBy rdf:resource="#item_981" />
<ns3:isReferencedBy rdf:resource="#item_982" />
<ns3:isReferencedBy rdf:resource="#item_983" />
<ns3:isReferencedBy rdf:resource="#item_984" />
<ns3:isReferencedBy rdf:resource="#item_985" />
<ns3:isReferencedBy rdf:resource="#item_986" />
<ns3:isReferencedBy rdf:resource="#item_987" />
<ns3:isReferencedBy rdf:resource="#item_988" />
<ns3:isReferencedBy rdf:resource="#item_989" />
<ns3:isReferencedBy rdf:resource="#item_990" />
<ns3:isReferencedBy rdf:resource="#item_991" />
<ns3:isReferencedBy rdf:resource="#item_992" />
<ns3:isReferencedBy rdf:resource="#item_993" />
<ns3:isReferencedBy rdf:resource="#item_994" />
<ns3:isReferencedBy rdf:resource="#item_995" />
<ns3:isReferencedBy rdf:resource="#item_996" />
<ns3:isReferencedBy rdf:resource="#item_2696" />
<ns3:isReferencedBy rdf:resource="#item_2700" />
<ns3:isReferencedBy rdf:resource="#item_2701" />
<ns3:isReferencedBy rdf:resource="#item_4467" />
</ns4:Book>

acrosman · March 17, 2013

There are several of these that are near-perfect matches. If you'd like, I can post them all, but the pattern continues and the post limit means it would require a long series of very similar posts. If there is some place to send a file, I'd be happy to do that instead.

<ns4:Memo rdf:about="item_973" ns5:itemType="note">
<rdf:value>[2] [B. Estimating Methodist Adherents]
"Historians conventionally multiply Methodist membership figures by between three and five to estimate adherents."
</rdf:value>
<dc:subject>member</dc:subject>
<dc:subject>none</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_974" ns5:itemType="note">
<rdf:value>[4] [T. Methodist Geographic Distribution]
End of 19th Century, <10% of Methodists lived in British Isles, <75% lived in the United States. More African American Methodists than European Methodists.
</rdf:value>
<dc:subject>member</dc:subject>
<dc:subject>none</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_976" ns5:itemType="note">
<rdf:value>[56] [T. Oral nature of Methodism]
Methodism was essentially oral, but has been reconstructed largely from written sources. See: the work of Leigh Eric Schmidt.
</rdf:value>
<dc:subject>hist</dc:subject>
<dc:subject>none</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_977" ns5:itemType="note">
<rdf:value>[64] [T. E.P Thompson]
"Thompson's view was that the uncertainty principle in the salvation theology of Arminian Methodism combined with a psychic compulsion to adapt to new economic environments produced a manic spirituality driven by fear of backsliding and craven acquiescence in the work discipline of industrial capitalism."
</rdf:value>
<dc:subject>hist</dc:subject>
<dc:subject>none</dc:subject>
<dc:subject>theology</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_978" ns5:itemType="note">
<rdf:value>[65-66] [Hannah Syng Bunting]
Counter example to E.P. Thompson's belief that Methodist deeply feared backsliding. Instead, she and other Methodists sought peace and purity.
Note: have a copy of her journal
</rdf:value>
<dc:subject>hist</dc:subject>
<dc:subject>none</dc:subject>
<dc:subject>sex</dc:subject>
<dc:subject>source</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_979" ns5:itemType="note">
<rdf:value>[70] [T. Hymns]
Methodist hymns generally ignored doctrine and focused the Christian life as a pilgrimage.
</rdf:value>
<dc:subject>none</dc:subject>
<dc:subject>theology</dc:subject>
</ns4:Memo>
<ns4:Memo rdf:about="item_980" ns5:itemType="note">
<rdf:value>[77] [B. Church building]
"Mark Noll's ingenious calculations have shown that by the 1850s the Methodists had constructed almost as many churches as there were post offices in the United States and employed almost as many ministers as there were postal workers. 'Considered together,' he writes, 'the evangelical churches employed nearly double the personnel, maintained nearly twice as many facilities, and raised at least three times the money as the post office. Moreover the churches delivered their message to more people than the postal services delivered letters and newspapers.'"
</rdf:value>
<dc:subject>econ</dc:subject>
<dc:subject>hist</dc:subject>
<dc:subject>none</dc:subject>
<dc:subject>pc</dc:subject>
</ns4:Memo>

acrosman · March 18, 2013

For what it's worth, I also have 500,000+ lines of debugging information. If you have suggestions for the kinds of things to look for in there, I'd love to narrow the search a bit.

adamsmith · March 18, 2013

you're aware that the code for the translator is available on github, yes?
I'm not saying you have to look through this, but if you want to find a solution yourself, that'd be the place to look.

For Zotero RDF _export_, Zotero uses
https://github.com/zotero/translators/blob/master/Zotero%20RDF.js

For Bibliontology RDF Import - the richest import (and I thought the default from Scribe?):
https://github.com/zotero/translators/blob/master/Bibliontology%20RDF.js

And for general RDF:
https://github.com/zotero/translators/blob/master/RDF.js

aurimas · March 18, 2013

Please post the RDF entry to https://gist.github.com/ or http://pastebin.com/ or something similar.

If you could create an RDF file with a single item in it, that would be the best. Then paste everything between (and including) the rdf tags. <rdf:RDF ...></rdf:RDF>

Per AdamSmith's request, umm, how would you suggest I go about cutting down an 85,000+ line file to the minimum that triggers a bug in your software without guidance?

First, I would just try with a single item. It sounds to me like this is a general issue and not specific to any one item. But if a single item imports correctly, perform a binary search. Keep cutting it in half (though it's a bit more difficult with RDF files if you have entries referencing the same notes/attachments) and you should find the item that's causing problems in.... ~17 trials. Still sounds like a lot of debugging.
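The halving procedure described above can be sketched in Python. This is only an illustration under stated assumptions: `find_failing_group` and `imports_ok` are hypothetical names, each "group" is assumed to be one source together with all of its notes (so that no reference is cut in half), and exactly one group is assumed to be responsible for the failure. In practice `imports_ok` would be a manual step: write the subset to a file, import it into Zotero, and check whether the notes attach.

```python
import math


def find_failing_group(groups, imports_ok):
    """Return the one group that makes the import fail.

    Assumes exactly one culprit, and that a subset fails to import
    if and only if it contains that culprit.
    """
    if not groups:
        raise ValueError("need at least one group to search")
    lo, hi = 0, len(groups)  # invariant: the culprit is in groups[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if imports_ok(groups[lo:mid]):
            lo = mid  # first half is clean, so the culprit is in the second
        else:
            hi = mid  # first half fails, so the culprit is in it
    return groups[lo]


# Stand-in for the manual "import into Zotero and look" step (hypothetical).
groups = [f"item_{n}" for n in range(1, 3001)]
culprit = "item_1234"


def fake_import(subset):
    return culprit not in subset


found = find_failing_group(groups, fake_import)

# Upper bound on halvings if the 85,000 lines were bisected line by line.
max_trials = math.ceil(math.log2(85_000))
```

Bisecting by source-plus-notes groups rather than by raw lines keeps every trial file internally consistent, and with a few thousand groups it needs fewer trials than the line-based estimate.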

adamsmith · March 18, 2013

if I read what s/he says correctly, this has worked with some similar files before though.
and this:

The best I can offer you without some help is to suggest that, like Scribe, Zotero doesn't like sources with more than 50 notes (since that's the only difference from the file that worked earlier today).

suggests it might be the same bug you fixed earlier today after all. In which case trying this with the branch xpi would probably be worth it.

aurimas · March 18, 2013

Here's your problem:
<ns3:isReferencedBy rdf:resource="#item_973" />
vs
<ns4:Memo rdf:about="item_973" ns5:itemType="note">

Your identifiers don't match up. Should be
<ns4:Memo rdf:about="#item_973" ns5:itemType="note">

Also, I assume that the quotation marks and apostrophes in the bib:Memo/rdf:value are actually escaped in your RDF and they just got converted when you pasted them here.

Everything imported correctly for me when I fixed those issues.
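A mismatch like this can be found mechanically before importing. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `find_dangling_references` and the namespace URIs bound to `ns3` and `ns4` in the sample are assumptions for illustration, since the pasted excerpt does not show its namespace declarations. The check compares the attribute strings literally, which mirrors the fix above; it does not perform full RDF URI resolution.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = f"{{{RDF_NS}}}about"
RESOURCE = f"{{{RDF_NS}}}resource"


def find_dangling_references(rdf_xml):
    """List isReferencedBy targets that no rdf:about attribute matches exactly."""
    root = ET.fromstring(rdf_xml)
    about_ids = {el.get(ABOUT) for el in root.iter() if el.get(ABOUT) is not None}
    dangling = []
    for el in root.iter():
        # Match on the local name so the namespace prefix/URI does not matter.
        if el.tag.rsplit("}", 1)[-1] == "isReferencedBy":
            target = el.get(RESOURCE)
            if target is not None and target not in about_ids:
                dangling.append(target)
    return dangling


# Namespace URIs for ns3/ns4 below are assumed, not taken from the thread.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns3="http://purl.org/dc/terms/"
         xmlns:ns4="http://purl.org/net/biblio#">
  <ns4:Book rdf:about="item_1">
    <ns3:isReferencedBy rdf:resource="#item_973" />
    <ns3:isReferencedBy rdf:resource="#item_974" />
  </ns4:Book>
  <ns4:Memo rdf:about="item_973"><rdf:value>note one</rdf:value></ns4:Memo>
  <ns4:Memo rdf:about="#item_974"><rdf:value>note two</rdf:value></ns4:Memo>
</rdf:RDF>"""

dangling = find_dangling_references(SAMPLE)
```

Here `dangling` contains only `#item_973`, the reference whose Memo lacks the leading `#`. As background, RDF/XML treats `rdf:ID="item_973"` as shorthand for `rdf:about="#item_973"`, so a rewrite from one attribute to the other that drops the `#` is one plausible way to end up with this mismatch.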

acrosman · March 18, 2013

Aurimas,

Thank you, fixing the ID issue seems to have resolved it. I wonder how I broke that after my earlier version... oh well.

As for escaping quote marks, apostrophes, and a few other basic characters within the memo values, that does not appear to be required. The escaping is actually present when I feed the file into the Python etree parser, but the values are no longer escaped when I dump them back out. They import properly into Zotero, and I can't really think of a reason they shouldn't (since they are valid in HTML under most conditions these days).
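The round-trip behaviour described here can be reproduced with the standard-library parser. This is a small sketch, assuming `xml.etree.ElementTree` is the "etree" in question; the element name `value` and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Input with quotes, apostrophes, and a less-than sign all escaped.
escaped = "<value>&quot;Thompson&apos;s view&quot; &lt;10%</value>"

element = ET.fromstring(escaped)
text = element.text  # entities are decoded on parse

# On output, only the characters that XML text content requires are re-escaped.
round_tripped = ET.tostring(element, encoding="unicode")
```

After parsing, `text` holds the plain characters. When serialized again, the quotes and apostrophe stay unescaped, which is legal in XML text content, while the `<` is written back as `&lt;`.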

I just need to build a good method for preserving line breaks and restoring formatting. But I have a sample RDF export from Zotero that helpfully provides enough of a model to make that work.

Thanks.

aurimas · March 18, 2013

Sorry, the main problem with escaping for me was the < signs. Apostrophes and quotes should be fine, as you say.

Notes are formatted as HTML, so use either <p> or <br/> to preserve newlines.
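One way to apply this to plain-text note bodies is sketched below in Python. The function name `note_text_to_html` and the rule it uses (blank lines separate paragraphs, single newlines become `<br/>`) are assumptions for illustration, not something Zotero prescribes. Literal `<`, `>`, and `&` in the note text are escaped first so they are not mistaken for markup.

```python
import re
import xml.etree.ElementTree as ET
from html import escape


def note_text_to_html(text):
    """Convert a plain-text note to simple HTML, preserving line breaks."""
    normalized = text.replace("\r\n", "\n").strip()
    paragraphs = [p for p in re.split(r"\n\s*\n", normalized) if p.strip()]
    return "".join(
        "<p>" + escape(p.strip(), quote=False).replace("\n", "<br/>") + "</p>"
        for p in paragraphs
    )


note = "[65-66] [Hannah Syng Bunting]\nCounter example.\n\nNote: <10% remain"
html_note = note_text_to_html(note)

# Storing the HTML as element text lets the XML library escape the markup
# for the RDF file; parsing the file back yields the same HTML string.
value = ET.Element("value")  # stand-in for the rdf:value element
value.text = html_note
restored = ET.fromstring(ET.tostring(value, encoding="unicode")).text
```

Letting the XML library escape the HTML when it writes the file avoids hand-escaping the tags and keeps the RDF well-formed even when a note contains a bare `<`.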