Walk through folders

p.hacker · December 19, 2018

I've been dragged, kicking and screaming, into using Zotero.

Currently, my reference materials are organized in folders by subject, author, item.
EG:
* /eBook/HWA/American/Leslie King/Behavioral Graphology.pdf;
* /eBook/HWA/IGAS/Stockholm1977a.pdf;
* /eBook/HWA/A Graphological Bibliography.epub;

As a general rule:
* PDF metadata has been scrubbed;
* Books were originally published sans ISBN;
* Research articles do not have DOIs;

There are around 15,000 items in that /eBook/ folder.
There are around 100 top level folders /eBook/TopLevel Folder/.
Each top level folder has both individual files, and between 2 and 100 sub-folders /eBook/TopLevel Folder/sub-folder/.
Sub-level folders can have both individual files, and between 0 and 100 folders: EG:
* /eBook/TopLevel Folder/sub-folder/Third level folder/;
* /eBook/TopLevel Folder/sub-folder/document.pdf;

One of the videos on using Zotero implied that Zotero could start at /eBook/ and automatically add everything in every folder below that. However, the video is using icons I don't have. :(
Does this require an addon, and if so, which one? Or is this functionality that was in earlier versions but is not in 5.x? Or is this an implication that should not have been made?

If the latter, what is the fastest/simplest/easiest way to add those 15k items to Zotero.
FWIW, I currently have the items listed in an ODF document, with each item formatted using APA Style Manual, Third Edition.

emilianoeheyns · December 20, 2018

I fail to see what information we're supposed to extract from the "kicking and screaming" part. I don't reckon anyone on this forum is forcing you.

If you don't have structured information beyond the APA-formatted references, you can try putting that through https://anystyle.io/. Once you have them recognized properly, save to bibtex, and I can look at whipping up a script coupling them to the saved PDFs.

But manually (and it will involve lots of manual work) doing 15k items from unstructured data to any reference manager is going to be a chore any which way you look at it. This has nothing to do with Zotero.

adamsmith · December 20, 2018

> One of the videos on using Zotero implied that Zotero could start at /eBook/ and automatically add everything in every folder below that. However, the video is using icons I don't have. :(

This has never existed as functionality in Zotero or an add-on. Not sure what the video was showing.

The closest thing possible is to create a saved search for PDFs in /eBook and then drag all the PDFs to Zotero but
a) this won't keep the subfolder structure and
b) this won't, in many/most cases, automatically add citation data, given the constraints that you describe above

p.hacker · December 26, 2018

>I don't reckon anyone on this forum is forcing you.

The toolchain I use has thrown Zotero into the workflow process.

I'm hoping that the issues I'm running into are simply due to being unfamiliar with how to use Zotero.

> you can try putting that through https://anystyle.io/.

Every list I put through that site generated an error message. Usually something along the lines of "Oops, something has gone terribly wrong."

>This has never existed as functionality in Zotero or an add-on. Not sure what the video was showing.

I haven't been able to find that video again.

It looks like the easiest, most efficient way to get my existing bibliographies into Zotero is to add each item, one at a time, filling out all of the fields manually.

dstillman · December 26, 2018

> you can try putting that through https://anystyle.io/.

> Every list I put through that site generated an error message. Usually something along the lines of "Oops, something has gone terribly wrong."

It's working for me. If you're still getting an error, cut down what you're pasting in halves until you find the smallest possible section that fails, and then post it here. (AnyStyle isn't an official Zotero project, but the developer is on the Zotero team.)

> It looks like the easiest, most efficient way to get my existing bibliographies into Zotero is to add each item, one at a time, filling out all of the fields manually.

AnyStyle is the best way — we can just figure out why that's not working for you.

But note that, if you were going to add items from scratch, you wouldn't do it by filling out the fields manually — that's just not how you should add stuff to Zotero the vast majority of the time. In this case, you would likely copy DOIs or ISBNs from the bibliography into Add Item by Identifier in Zotero or copy URLs to your browser and save from there using the Zotero Connector.

emilianoeheyns · December 26, 2018

Also, you might get lucky and the PDFs will have metadata that can be fetched. Bulk import of a hierarchy of PDFs is still possible, and if you get me a copy of the PDFs or at least the folder structure, I can probably get you a zotero rdf that will import them. Mind that importing 15k items will take a *long* time.

But most of all, the people around here are looking to help you, yet you really make it clear you don't want to use Zotero, that you are forced by (to us) mysterious forces to do so anyway, that you have an enormous backlog of unstructured data, and you seem to imply that this state of affairs is somehow the fault of the Zotero team. Mind, we did not throw Zotero into your toolchain, nor did we force you to use that toolchain.

I recognize this must be frustrating work to do, but there simply does not exist any tool that will magically transform your unstructured data to structured data. Anystyle.io is the closest there is. If I am mistaken and there is indeed a tool that does this, we'll be happy to help you migrate off of that. But in the meantime, try to remember that we are not the ones that put you in this situation.

emilianoeheyns · December 26, 2018

You can try the python script at https://gist.githubusercontent.com/retorquere/7de109dc80d509941ada1fd88ed9fe12/raw/833105e8c68e61cac28840d3ee8cc592aa3fc1f2/dir2rdf.py . When provided with a directory name of the top level directory that holds the PDFs, it will put an RDF file at that directory that you can import, and collections will be imported as per the directory structure.

Import will be *very* slow though. Importing 37 PDFs this way took 60 seconds. If that scales linearly, importing 15k items would take between 6 and 7 hours, and since the translator isn't async, you will have to occasionally click "Wait" on the timeout popup you get. BibTeX is async but won't work for importing bare attachments.
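
The arithmetic behind that estimate can be written out as a small sketch. It assumes import time scales linearly with the number of items, which is only a guess; the function name is made up for illustration.

```python
def estimate_import_hours(sample_items, sample_seconds, total_items):
    """Extrapolate import time linearly from a small timed sample."""
    seconds_per_item = sample_seconds / sample_items
    return total_items * seconds_per_item / 3600


# 37 PDFs took 60 seconds; extrapolate to 15,000 items.
hours = estimate_import_hours(37, 60, 15000)
print(f"about {hours:.1f} hours")
```

If the import slows down as the library grows, the real figure would be higher than this.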

p.hacker · December 26, 2018

>But note that, if you were going to add items from scratch, you wouldn't do it by filling out the fields manually — that's just not how you should add stuff to Zotero the vast majority of the time. In this case, you would likely copy DOIs or ISBNs from the bibliography into Add Item by Identifier in Zotero or copy URLs to your browser and save from there using the Zotero Connector.

Less than a quarter of the material in my Ready Reference Library has either a DOI or ISBN associated with it. That is why I'm manually adding everything.

At least Zotero, unlike Calibre, doesn't assume that the ISBN I use is wrong and replace it when doing a metadata update.

p.hacker · December 26, 2018

>AnyStyle is the best way — we can just figure out why that's not working for you.

Here's an example of what I put in that didn't generate an error message.
However, the output is not usable.

---- start here ---

Jacobs, Eva
The Wittlich Character Diagram
NP: 1971

Karohs, Erika
Step By Step System of Handwriting Analysis CD
Pebble Beach CA: Karohs: 2006

King, Leslie
Getting Control of Your Life
Bountiful UT: Handwriting Consultants of Utah: 1972

King, Leslie
Measurement Gauge
Bountiful, UT: Handwriting Consultants of Utah: 1976

King, Leslie
Descriptive Definitions: Equal Weight Score Criteria: Part 1
Bountiful UT: Handwriting Consultants of Utah: 1977

King, Leslie
Descriptive Definitions: Equal Weight Score Criteria: Part 2
Bountiful UT: Handwriting Consultants of Utah: 1977

Knobloch, Hans
Die Legensgestalt der Handschrift
SaarBrucken, West-ost Verlag: 1950

---- end here ---

This is fairly typical of what I cite.

The structure is straightforward:
Author
Title
City, State: Publisher: Year of Publication.

In the first listing (Jacobs, Eva), I thought that the NP was tripping something up, because technically that line should be: "NP: NP: 1971". But as a single item, it works thusly, either way:

---- start here ---


<references><reference><authority>Jacobs, Eva</authority></reference><reference><authority>The Wittlich Character Diagram</authority></reference><reference><date>NP: 1971</date></reference></references>

---- end here ---

And with (NP:NP: 1971)

---- start here ---


<references><reference><authority>Jacobs, Eva</authority></reference><reference><authority>The Wittlich Character Diagram</authority></reference><reference><publisher>NP: NP:</publisher><date>1971</date></reference></references>

---- end here ---

It should be Title, not authority, but if the end result is correct, it doesn't matter.

But I'm baffled by what it does, when only the (Karohs Erika) data is used.

---- start here ---


<references><reference><publisher>Karohs, Erika</publisher></reference><reference><title>Step By Step System of Handwriting Analysis</title><publisher>CD</publisher></reference><reference><title>Pebble Beach</title><location>CA: Karohs:</location><date>2006</date>
</reference></references>

---- end here ---

With just the (King, Leslie) data, the results are equally odd:

---- start here ---


<references><reference><authority>King, Leslie</authority></reference><reference><authority>Getting Control of Your Life</authority></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1972</date></reference><reference><authority>King, Leslie</authority></reference><reference><authority>Measurement Gauge</authority></reference><reference><location>Bountiful, UT:</location><publisher>Handwriting Consultants of Utah:</publisher><date>1976</date></reference><reference><authority>King, Leslie</authority></reference><reference><title>Descriptive Definitions: Equal Weight Score Criteria:</title><note>Part 1</note></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1977</date></reference><reference><authority>King, Leslie</authority></reference><reference><title>Descriptive Definitions: Equal Weight Score Criteria: Part</title><date>2</date></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1977</date></reference></references>

---- end here ---

emilianoeheyns · December 26, 2018

(updated RDF script at https://gist.githubusercontent.com/retorquere/43cdd5d087d61a7b29fb173f8af11a6e/raw/36056b2b21321b87a6d1142a24c4b0141cb674ab/dir2rdf.py)

emilianoeheyns · December 26, 2018

First thing: I think anystyle.io expects one reference to be on one line. I got a lot better results when I entered

Jacobs, Eva. The Wittlich Character Diagram, NP: 1971

Karohs, Erika. Step By Step System of Handwriting Analysis CD, Pebble Beach CA: Karohs: 2006

King, Leslie. Getting Control of Your Life; Bountiful UT: Handwriting Consultants of Utah: 1972

King, Leslie. Measurement Gauge; Bountiful, UT: Handwriting Consultants of Utah: 1976

King, Leslie. Descriptive Definitions: Equal Weight Score Criteria: Part 1; Bountiful UT: Handwriting Consultants of Utah: 1977

King, Leslie. Descriptive Definitions: Equal Weight Score Criteria: Part 2. Bountiful UT: Handwriting Consultants of Utah: 1977

Knobloch, Hans. Die Legensgestalt der Handschrift. SaarBrucken, West-ost Verlag: 1950

(and you will see that "parse 21 references" changes into "parse 7 references").

You will almost always have to edit the results, but you can submit the edits as training data, and they will slowly get better as you do this.
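
A minimal sketch of that reshaping step, for anyone who wants to avoid joining the lines by hand. It assumes the references are stored as plain text with a blank line between entries; the function name and the ". " separator are my own choices, not something anystyle.io requires.

```python
def blocks_to_lines(text):
    """Join each blank-line-separated block into a single reference line."""
    references = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the block collected so far.
            references.append(". ".join(current))
            current = []
    if current:
        references.append(". ".join(current))
    return references


sample = """Jacobs, Eva
The Wittlich Character Diagram
NP: 1971

Karohs, Erika
Step By Step System of Handwriting Analysis CD
Pebble Beach CA: Karohs: 2006
"""

for reference in blocks_to_lines(sample):
    print(reference)
```

Each printed line can then be pasted into the parser as one reference.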

However

If your data is semi-structured as

Author
Title
City, State: Publisher: Year of Publication.

This becomes a different matter. That can be parsed and imported without using anystyle.io

p.hacker · December 26, 2018

>(updated RDF script at

I don't know if the error is due to something that purports to be a PDF but isn't, or to something else.

Looking at what has been processed, the script had just started walking through the files that are in Chinese, Japanese, Korean, or Vietnamese.


Traceback (most recent call last):
  File "pdf.import.2.py", line 112, in <module>
    f.write(minidom.parseString(ET.tostring(rdf, 'utf-8')).toprettyxml(indent='  '))
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

emilianoeheyns · December 26, 2018

Are you using python 2 or 3?

Also, Zotero journal articles don't have a "Place" field. Report does, for example. What type of item do you have in mind for the import?

emilianoeheyns · December 26, 2018

And on what OS are you doing this?

emilianoeheyns · December 26, 2018

Updated tree walker at https://gist.githubusercontent.com/retorquere/c3d89e8762e5346fce20629781fa2388/raw/020ed160fb0445e78ac546476446b6c22ae36cbe/dir2rdf.py

emilianoeheyns · December 26, 2018

Also, what to make of

Knobloch, Hans
Die Legensgestalt der Handschrift
SaarBrucken, West-ost Verlag: 1950

That doesn't have 3 fields in the last line. Seems to be place/date, but you did say

City, State: Publisher: Year of Publication.

p.hacker · December 26, 2018

>Are you using python 2 or 3?

Oops, I was using Python 2.7.13

I thought that running python by itself called Python 3.5.3.

Using Python3 it ends as expected.

>what OS are you doing this?

Depending upon the command used:
* Debian GNU/Linux 9 (Stretch);
* MX 18 Continuum;
* Linux version 4.15.0-1-amd64;
* Welcome to MX 17.1 (Horizon)! Powered by Debian;
* Linux 4.15.0-1-amd64 #1 SMP Debian 4.15.17-1~mx17+1 (2018-04-23) x86_64 GNU/Linux

emilianoeheyns · December 26, 2018

https://gist.githubusercontent.com/retorquere/1e3c48ee94e67e797d0c3e42fd8da6c7/raw/18c99c815b5229a4aeb63fd2b8e5f27f848226df/refs.py will parse the 3-line format you specified, given a text file (not ODF, you will have to save as plain-text) into CSL-JSON which can be imported into Zotero. It will assume book for the type, when "City, State" is not "NP", otherwise it will assume article. If Publisher is "NP", that will be ignored.
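
For readers who cannot fetch the gist, here is a rough sketch of the rules just described. It is not the linked refs.py; the function names are made up, "article" as the CSL type for NP items is my assumption, and a two-field last line (such as "NP: 1971") is treated as place and year with no publisher.

```python
import json


def parse_record(lines):
    """Turn one [author, title, imprint] record into a CSL-JSON-style dict."""
    author, title, imprint = (line.strip() for line in lines)
    family, _, given = author.partition(",")
    parts = [part.strip() for part in imprint.rstrip(".").split(":")]
    place, year = parts[0], parts[-1]
    publisher = ": ".join(parts[1:-1])  # empty when the line has only two fields

    item = {
        # Assumption: "NP" in the place slot means an article, otherwise a book.
        "type": "article" if place == "NP" else "book",
        "title": title,
        "author": [{"family": family.strip(), "given": given.strip()}],
    }
    if year.isdigit():
        item["issued"] = {"date-parts": [[int(year)]]}
    if place != "NP":
        item["publisher-place"] = place
    if publisher and publisher != "NP":
        item["publisher"] = publisher
    return item


def parse_references(text):
    """Split blank-line-separated text into records and parse each one."""
    items = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if len(lines) == 3:  # skip anything that is not a three-line record
            items.append(parse_record(lines))
    return items


sample = """King, Leslie
Measurement Gauge
Bountiful, UT: Handwriting Consultants of Utah: 1976

Jacobs, Eva
The Wittlich Character Diagram
NP: 1971
"""

print(json.dumps(parse_references(sample), indent=2))
```

Records that do not have exactly three lines are skipped silently here; a real run over 15k items would probably want to log them instead.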

emilianoeheyns · December 26, 2018

The Python version doesn't really matter to me (I tried the dir walker with both 2 and 3), but it's good to know which is in play, because which "python" is the default is system dependent.
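
One way to find out which interpreter is actually in play is to have the script report it when it starts. This is only a sketch, not part of the dir walker; the function names are made up.

```python
import sys


def describe_interpreter(version_info=sys.version_info):
    """Return a short label for the interpreter running this script."""
    return "Python {}.{}.{}".format(*version_info[:3])


def is_python3(version_info=sys.version_info):
    """True when the major version is 3 or later."""
    return version_info[0] >= 3


print(describe_interpreter())
if not is_python3():
    # Stop early with a readable message instead of a traceback later on.
    sys.exit("Started with Python 2; try running the script with python3.")
```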

Having it on Linux makes things easier for me. What is your system locale set to?

emilianoeheyns · December 26, 2018

The dir walker also does not care what's in the files, it only looks at the file name. So if an Excel file is called "gotcha.pdf", the dir walker will add it to the RDF as a PDF file. How that is handled on import I don't know but I suspect Zotero will shrug and proceed.
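
If you want to weed out such files before importing, a content check is easy to sketch. This is not something the dir walker does; it relies on the fact that PDF files normally begin with the bytes "%PDF-". The check is strict, so a genuine PDF with stray bytes before its header would be flagged too.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as folder:
    real = os.path.join(folder, "real.pdf")
    fake = os.path.join(folder, "gotcha.pdf")
    with open(real, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    with open(fake, "wb") as handle:
        handle.write(b"PK\x03\x04 not a PDF")
    print(looks_like_pdf(real), looks_like_pdf(fake))
```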

emilianoeheyns · December 26, 2018

All this leaves unaddressed how you want to couple the PDF files to the stuff you have in the ODF file.

emilianoeheyns · December 27, 2018

Updated dirwalker at https://gist.githubusercontent.com/retorquere/e4228b555e7820aad3d8cd0fc33e78e8/raw/fbf3de8e442a3e7c5c44a9353cdf1e9118dc2b64/dir2rdf.py -- previous versions would place all folders at the toplevel instead of nesting them.

emilianoeheyns · December 27, 2018

Updated dirwalker at https://gist.github.com/643dab61a2d5302982bf75dd2ede752e -- I missed that you wanted epubs as well as pdfs. This one wants you to be specific about what you want imported, so invocation is now

dir2rdf.py [folder path] [extension] [extension] [-extension] [-extension] ...

where [extension] is something like pdf or epub, and [-extension] is for files you want to ignore, in my case -pptx, -iso, etc. It will stop and complain if it finds an extension that it hasn't been told about. In my case, an invocation on my Downloads folder ended up being

python dir2rdf.py ~/Downloads/ -mkv -csv pdf -xlsx -py -mp4 -wnrfnd -ovpn -txt -deb -csv~ -docx# -ppt -rdf -json -zip -sh -odt# -qhto7a -odb -png docx -html -pptx -bib -ipynb -feather -iso

It might be easier to just clean up the folder before running the command. Note also that it will create collections for every folder it finds, regardless of whether there are any non-ignored files in them; you may also want to prune empty folders before running it.
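
The extension handling described above can be sketched roughly as follows. This is not the linked dir2rdf.py: the function names are made up, it only lists files per folder rather than writing RDF, and unlike the real script it leaves out folders that contain no wanted files. Files with no extension at all count as unlisted here, which is my assumption.

```python
import os
import tempfile


def split_extension_args(args):
    """Separate 'pdf'-style arguments (wanted) from '-pptx'-style ones (ignored)."""
    wanted = {arg.lower() for arg in args if not arg.startswith("-")}
    ignored = {arg[1:].lower() for arg in args if arg.startswith("-")}
    return wanted, ignored


def collect_files(root, wanted, ignored):
    """Map each folder (relative to root) to the wanted files directly inside it."""
    found = {}
    for folder, _subfolders, filenames in os.walk(root):
        keep = []
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lstrip(".").lower()
            if extension in wanted:
                keep.append(name)
            elif extension not in ignored:
                # Stop and complain about anything we were not told about.
                path = os.path.join(folder, name)
                raise ValueError("unlisted extension %r on %s" % (extension, path))
        if keep:  # folders with no wanted files are left out
            found[os.path.relpath(folder, root)] = keep
    return found


# Demonstration on a throwaway folder tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "HWA", "IGAS"))
    for parts in (("HWA", "A Graphological Bibliography.epub"),
                  ("HWA", "IGAS", "Stockholm1977a.pdf"),
                  ("HWA", "notes.txt")):
        open(os.path.join(root, *parts), "w").close()
    wanted, ignored = split_extension_args(["pdf", "epub", "-txt"])
    for folder, names in collect_files(root, wanted, ignored).items():
        print(folder, names)
```

Running something like this first gives a quick preview of which folders would end up with content, and which extensions still need to be listed or ignored.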