Export collections as CSV files for data analysis

edited September 16, 2019
Dear all,

I hope this topic has not been covered elsewhere, but I searched the forums for "csv" and did not find what I was looking for.

I am having the following problems:

I want to export a ZOTERO collection with currently 2173 entries to a CSV file for data analysis in Python.

I have used the export to CSV option in the stand-alone version, but when I open the CSV file in EXCEL to check what it looks like, it has 10165 rows (each with their own key or document ID) rather than 2174 (which is what I would have expected). Individual books seem to come up more than twenty times although this is not the case in the collection itself. And there might be other issues.

Just to be sure, I wrote a Python script that simply counts the number of rows in my CSV file, but when the counter reaches 87, the script breaks and gives me the following error notification:

>>>
RESTART: U:\Dokumente\DH technologies\HGIS\GIS with Python\Python Geocoder\Geocoding in Python_testfiles\GeocodeCSV.py
87
Traceback (most recent call last):
File "U:\Dokumente\DH technologies\HGIS\GIS with Python\Python Geocoder\Geocoding in Python_testfiles\GeocodeCSV.py", line 21, in
for row in data:
File "C:\Users\mobarget\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7936: character maps to
>>>

What I eventually want to do is geocode the places of publication and use my tags for data analysis, but if I cannot even get my data out of ZOTERO as they should be, I won't be getting anywhere with that.

A trial script to retrieve specific information such as publication years from individual rows of data worked fine but broke after circa 65 results. This was the error notification I got there:

>>>
Traceback (most recent call last):
File "U:\Dokumente\DH technologies\HGIS\GIS with Python\Python Geocoder\Geocoding in Python_testfiles\GeocodeCSV.py", line 20, in
for row in data:
File "C:\Users\mobarget\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7936: character maps to
>>>

Your help is very much appreciated! If you think it is better to use a different file format (e.g. JSON) for the analysis I would like to do, please let me know.

Best wishes, Monika




  • How are you triggering the export? I'd guess that you're exporting your entire library — which may have duplicates — rather than just the collection. To export a given collection, you need to right-click on the collection itself.

    (For Python help, you'll need to look elsewhere. It looks like you're parsing the file as Windows-1252 rather than UTF-8, and that's a control character that isn't valid in 1252. You should either parse as UTF-8 or just skip invalid characters.)
  • edited September 18, 2019
    Thanks for your immediate response.

    I am quite sure that I exported an individual collection and not the whole library. It is, in fact, a sub-folder in another collection. I right-clicked and chose "Sammlung exportieren". And in the section "Zeichenkodierung" (character encoding), I selected Unicode (UTF-8).

    I will ask one of my colleagues with more programming experience to check if there are any issues in my code.

    But I fear it is something more fundamental as the table already looks wrong in EXCEL. Could it be that it is a problem that my collection contains books as well as journal articles, manuscripts and maps? Does this cause some confusion in the CSV file and should I possible export each data type separately?

    Thanks again for your help.
  • You could alternatively just select all in the middle panel and right-click --> export selected items. It's certainly conceivable that Zotero would e.g. export the collection one level up.

    CSV export is definitely fine and works for all item types and mix of them. Excel is just notoriously bad at importing CSV -- I think you'll get better results by going through Data --> Get External Data --> From Text which allows you to specify encoding and delimiter and preview import.

    The problem with the encoding is indeed in your python code -- Zotero, as you note, exports as Unicode (utf-8) and your python script doesn't seem to expect/handle that.
  • edited September 18, 2019
    Awesome, exporting from the middle panel fixed it -- and my colleague figured out that I had put the Python UTF-8 encoder in the wrong place.

    So now my script is performing my line count alright.

    For anyone who is interested, I have put the code here as this forum ignores my indentation:

    https://github.com/MonikaBarget/DigitalHistory/blob/master/CountLinesCSV.py


    THANKS! :-)



  • Hi Monika,

    Ich habe Deine Einträge gelesen und habe das gleiche Interesse ... ist es Dir gelungen Deine Einträge auf einer geografischen Karte darzustellen?

    Falls ja wäre ich Dir sehr dankbar wenn Du mir zeigen könntest wie ?

    Herzliche Grüße
    Jakob
  • Ja, du kannst einfach den CSV Export nehmen und die Ortsangaben georeferenzieren. Das geht entweder mit einer Programmiersprache wie Python und einer API wie Google Maps oder Geonames. Aber auch Excel hat meines Wissens nach mittlerweile eine Geo-Coding Option:

    https://www.adventuresincre.com/auto-populate-latitude-longitude-excel/

    Bei historischen Daten ist die Fehlerquote hoch und man muss einige Angaben dann manuell eintragen.

    Sobald die Geodaten vorliegen, kann man die Excel/CSV Datei in GeoJSON konvertieren. Dazu gibt es Converter-Software, oder man arbeitet wieder mit einem Skript.

    Dann kann man die Datei als Layer in einem GIS Programm laden und auf einer Karte abbilden. Ich arbeite mit QGIS, das kostenlos ist

    Zum Thema Kartenerstellung habe ich hier zwei Tutorials für Einsteiger erstellt:

    https://dhlab.hypotheses.org/tag/geohumanities
Sign In or Register to comment.