Walk through folders
I've been dragged, kicking and screaming, into using Zotero.
Currently, my reference materials are organized in folders by subject, author, item.
EG:
* /eBook/HWA/American/Leslie King/Behavioral Graphology.pdf;
* /eBook/HWA/IGAS/Stockholm1977a.pdf;
* /eBook/HWA/A Graphological Bibliography.epub;
As a general rule:
* PDF metadata has been scrubbed;
* Books were originally published sans ISBN;
* Research articles do not have DOIs;
There are around 15,000 items in that /eBook/ folder.
There are around 100 top level folders /eBook/TopLevel Folder/.
Each top level folder has both individual files, and between 2 and 100 sub-folders /eBook/TopLevel Folder/sub-folder/.
Sub-level folders can have both individual files, and between 0 and 100 folders: EG:
* /eBook/TopLevel Folder/sub-folder/Third level folder/;
* /eBook/TopLevel Folder/sub-folder/document.pdf;
One of the videos on using Zotero implied that Zotero could start at /eBook/ and automatically add everything in every folder below that. However, the video is using icons I don't have. :(
Does this require an addon, and if so, which one? Or is this functionality that was in earlier versions but is not in 5.x, or is this an implication that should not be present?
If the latter, what is the fastest/simplest/easiest way to add those 15k items to Zotero?
FWIW, I currently have the items listed in an ODF document, with each item formatted using APA Style Manual, Third Edition.
If you don't have structured information beyond the APA-formatted references, you can try putting that through https://anystyle.io/. Once you have them recognized properly, save to bibtex, and I can look at whipping up a script coupling them to the saved PDFs.
But manually (and it will involve lots of manual work) doing 15k items from unstructured data to any reference manager is going to be a chore any which way you look at it. This has nothing to do with Zotero.
The closest thing possible is to create a saved search for PDFs in /eBook and then drag all the PDFs to Zotero but
a) this won't keep the subfolder structure and
b) this won't, in many/most cases, automatically add citation data, given the constraints that you describe above
The toolchain I use has thrown Zotero into the workflow process.
I'm hoping that the issues I'm running into are simply due to being unfamiliar with how to use Zotero.
> you can try putting that through https://anystyle.io/.
Every list I put through that site generated an error message, usually something along the lines of "Oops, something has gone terribly wrong."
>This has never existed as functionality in Zotero or an add-on. Not sure what the video was showing.
I haven't been able to find that video again.
It looks like the easiest, most efficient way to get my existing bibliographies into Zotero is to add each item, one at a time, filling out all of the fields manually.
But note that, if you were going to add items from scratch, you wouldn't do it by filling out the fields manually — that's just not how you should add stuff to Zotero the vast majority of the time. In this case, you would likely copy DOIs or ISBNs from the bibliography into Add Item by Identifier in Zotero or copy URLs to your browser and save from there using the Zotero Connector.
But most of all, the people around here are looking to help you. Yet you make it clear that you don't want to use Zotero, that you are forced by (to us) mysterious forces to do so anyway, and that you have an enormous backlog of unstructured data, and you seem to imply that this state of affairs is somehow the fault of the Zotero team. We did not throw Zotero into your toolchain, and we did not force you to use it, mind.
I recognize this must be frustrating work to do, but there simply does not exist any tool that will magically transform your unstructured data to structured data. Anystyle.io is the closest there is. If I am mistaken and there is indeed a tool that does this, we'll be happy to help you migrate off of that. But in the meantime, try to remember that we are not the ones that put you in this situation.
Import will be *very* slow though. Importing 37 PDFs this way took 60 seconds. If that scales linearly, importing 15k items would take between 6 and 7 hours, and since the translator isn't async, you will have to occasionally click "Wait" on the timeout popup you get. BibTeX is async but won't work for importing bare attachments.
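As a quick check of that estimate, the linear extrapolation can be worked out directly. This only restates the numbers quoted above (37 PDFs, 60 seconds, roughly 15,000 items); whether import time really scales linearly is an assumption.

```python
# Linear extrapolation of the import time quoted above.
pdfs_timed = 37        # PDFs in the timed test import
seconds_timed = 60     # seconds that test import took
total_items = 15000    # approximate size of the /eBook/ library

seconds_per_item = seconds_timed / pdfs_timed
estimated_hours = total_items * seconds_per_item / 3600

print("{:.2f} s per item, about {:.1f} hours in total".format(
    seconds_per_item, estimated_hours))
# prints: 1.62 s per item, about 6.8 hours in total
```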
Less than a quarter of the material in my Ready Reference Library, has either a DOI or ISBN associated with it. That is why I'm manually adding everything.
At least Zotero, unlike Calibre, doesn't assume that the ISBN I use is wrong and replace it when doing a metadata update.
Here's an example of what I put in that didn't generate an error message.
However, the output is not usable.
---- start here ---
Jacobs, Eva
The Wittlich Character Diagram
NP: 1971
Karohs, Erika
Step By Step System of Handwriting Analysis CD
Pebble Beach CA: Karohs: 2006
King, Leslie
Getting Control of Your Life
Bountiful UT: Handwriting Consultants of Utah: 1972
King, Leslie
Measurement Gauge
Bountiful, UT: Handwriting Consultants of Utah: 1976
King, Leslie
Descriptive Definitions: Equal Weight Score Criteria: Part 1
Bountiful UT: Handwriting Consultants of Utah: 1977
King, Leslie
Descriptive Definitions: Equal Weight Score Criteria: Part 2
Bountiful UT: Handwriting Consultants of Utah: 1977
Knobloch, Hans
Die Legensgestalt der Handschrift
SaarBrucken, West-ost Verlag: 1950
---- end here ---
This is fairly typical of what I cite.
The structure is straightforward:
Author
Title
City, State: Publisher: Year of Publication.
In the first listing (Jacobs, Eva), I thought that the NP was tripping something up, because technically that line should be "NP: NP: 1971". But as a single item, it works either way:
---- start here ---
<references><reference><authority>Jacobs, Eva</authority></reference><reference><authority>The Wittlich Character Diagram</authority></reference><reference><date>NP: 1971</date></reference></references>
---- end here ---
And with (NP:NP: 1971)
---- start here ---
<references><reference><authority>Jacobs, Eva</authority></reference><reference><authority>The Wittlich Character Diagram</authority></reference><reference><publisher>NP: NP:</publisher><date>1971</date></reference></references>
---- end here ---
It should be Title, not authority, but if the end result is correct, it doesn't matter.
But I'm baffled by what it does, when only the (Karohs Erika) data is used.
---- start here ---
<references><reference><publisher>Karohs, Erika</publisher></reference><reference><title>Step By Step System of Handwriting Analysis</title><publisher>CD</publisher></reference><reference><title>Pebble Beach</title><location>CA: Karohs:</location><date>2006</date>
</reference></references>
---- end here ---
With just the (King, Leslie) data, the results are equally odd:
---- start here ---
<references><reference><authority>King, Leslie</authority></reference><reference><authority>Getting Control of Your Life</authority></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1972</date></reference><reference><authority>King, Leslie</authority></reference><reference><authority>Measurement Gauge</authority></reference><reference><location>Bountiful, UT:</location><publisher>Handwriting Consultants of Utah:</publisher><date>1976</date></reference><reference><authority>King, Leslie</authority></reference><reference><title>Descriptive Definitions: Equal Weight Score Criteria:</title><note>Part 1</note></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1977</date></reference><reference><authority>King, Leslie</authority></reference><reference><title>Descriptive Definitions: Equal Weight Score Criteria: Part</title><date>2</date></reference><reference><title>Bountiful UT: Handwriting Consultants of</title><location>Utah:</location><date>1977</date></reference></references>
---- end here ---
Jacobs, Eva. The Wittlich Character Diagram, NP: 1971
Karohs, Erika. Step By Step System of Handwriting Analysis CD, Pebble Beach CA: Karohs: 2006
King, Leslie. Getting Control of Your Life; Bountiful UT: Handwriting Consultants of Utah: 1972
King, Leslie. Measurement Gauge; Bountiful, UT: Handwriting Consultants of Utah: 1976
King, Leslie. Descriptive Definitions: Equal Weight Score Criteria: Part 1; Bountiful UT: Handwriting Consultants of Utah: 1977
King, Leslie. Descriptive Definitions: Equal Weight Score Criteria: Part 2. Bountiful UT: Handwriting Consultants of Utah: 1977
Knobloch, Hans. Die Legensgestalt der Handschrift. SaarBrucken, West-ost Verlag: 1950
(and you will see that "parse 21 references" changes into "parse 7 references").
You will almost always have to edit the results, but you can submit the edits as training data, and they will slowly get better as you do this.
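If the list is stored as fixed blocks of three lines, the one-line-per-reference form shown above can be produced with a short script. This is only a sketch: the function name join_records is made up for illustration, the ". " separator is an arbitrary choice, and it assumes every reference occupies exactly three non-empty lines.

```python
def join_records(text, lines_per_record=3, sep=". "):
    """Collapse fixed-size blocks of lines into one line per reference."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % lines_per_record:
        raise ValueError(
            "line count is not a multiple of {}".format(lines_per_record))
    return [
        sep.join(lines[i:i + lines_per_record])
        for i in range(0, len(lines), lines_per_record)
    ]


sample = """\
Jacobs, Eva
The Wittlich Character Diagram
NP: 1971
Karohs, Erika
Step By Step System of Handwriting Analysis CD
Pebble Beach CA: Karohs: 2006
"""

joined = join_records(sample)
for line in joined:
    print(line)
```

The six input lines come out as two references, one per line, which is the shape the parser expects.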
However
If your data is semi-structured as
Author
Title
City, State: Publisher: Year of Publication.
This becomes a different matter. That can be parsed and imported without using anystyle.io
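A minimal sketch of what such parsing could look like, assuming the three-line layout holds throughout. The function names and dictionary keys are hypothetical, and mapping them onto Zotero item fields is left open. The last line is split on colons; entries that do not yield all three parts (such as "NP: 1971") are flagged for manual review instead of being guessed at.

```python
def parse_imprint(line):
    """Split 'City, State: Publisher: Year' on colons."""
    parts = [p.strip() for p in line.strip().rstrip(".").split(":")]
    year = parts.pop()                       # last part is the year
    place = parts.pop(0) if parts else ""    # first part is the place
    publisher = ": ".join(parts)             # whatever is left in the middle
    complete = bool(place and publisher and year.isdigit())
    return {"place": place, "publisher": publisher,
            "year": year, "needs_review": not complete}


def parse_records(text):
    """Turn Author / Title / imprint blocks into a list of dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3:
        raise ValueError("expected a multiple of 3 non-empty lines")
    records = []
    for i in range(0, len(lines), 3):
        record = {"author": lines[i], "title": lines[i + 1]}
        record.update(parse_imprint(lines[i + 2]))
        records.append(record)
    return records


sample = """\
Jacobs, Eva
The Wittlich Character Diagram
NP: 1971
King, Leslie
Measurement Gauge
Bountiful, UT: Handwriting Consultants of Utah: 1976
Knobloch, Hans
Die Legensgestalt der Handschrift
SaarBrucken, West-ost Verlag: 1950
"""

records = parse_records(sample)
for r in records:
    print(r)
```

Because the title sits on its own line, colons inside titles (as in "Descriptive Definitions: Equal Weight Score Criteria: Part 1") do not interfere with splitting the imprint line.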
I don't know if the error is due to something purporting to be a PDF, but isn't, or something else.
In looking at what has been processed, the script had just started walking through the files that are in Chinese, Japanese, Korean, or Viet.
Traceback (most recent call last):
  File "pdf.import.2.py", line 112, in <module>
    f.write(minidom.parseString(ET.tostring(rdf, 'utf-8')).toprettyxml(indent=' '))
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
Also, Zotero journal articles don't have a "Place" field. Report does, for example. What type of item do you have in mind for the import?
Knobloch, Hans
Die Legensgestalt der Handschrift
SaarBrucken, West-ost Verlag: 1950
That doesn't have 3 fields in the last line. Seems to be place/date, but you did say
City, State: Publisher: Year of Publication.
Oops, I was using Python 2.7.13.
I thought that the plain "python" command invoked Python 3.5.3.
Using Python3 it ends as expected.
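The traceback is consistent with Python 2 implicitly decoding a byte string as ASCII when an attribute value contains non-ASCII characters. Under Python 3, where strings are Unicode, the same ElementTree and minidom calls go through. The sketch below is not the actual import script; the element names and the path are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical element and attribute names, with a non-ASCII path.
root = ET.Element("collection")
ET.SubElement(root, "attachment", {"path": "/eBook/筆跡/文書.pdf"})

raw = ET.tostring(root, encoding="utf-8")    # bytes in Python 3
pretty = minidom.parseString(raw).toprettyxml(indent="  ")

# Write with an explicit encoding so the result does not depend on
# the system locale.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "import.rdf")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(pretty)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
```

Passing encoding="utf-8" to open() is the part that makes the system locale irrelevant for the output file.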
>what OS are you doing this?
Depending upon the command used:
* Debian GNU/Linux 9 (Stretch);
* MX 18 Continuum;
* Linux version 4.15.0-1-amd64;
* Welcome to MX 17.1 (Horizon)! Powered by Debian;
* Linux 4.15.0-1-amd64 #1 SMP Debian 4.15.17-1~mx17+1 (2018-04-23) x86_64 GNU/Linux
Having it on Linux makes things easier for me. What is your system locale set to?
dir2rdf.py [folder path] [extension] [extension] [-extension] [-extension] ...
where [extension] is something like pdf or epub, and [-extension] is for files you want to ignore, in my case -pptx, -iso, etc. It will stop and complain if it finds an extension that it isn't told about. In my case, an invocation on my Download folder ended up being
python dir2rdf.py ~/Downloads/ -mkv -csv pdf -xlsx -py -mp4 -wnrfnd -ovpn -txt -deb -csv~ -docx# -ppt -rdf -json -zip -sh -odt# -qhto7a -odb -png docx -html -pptx -bib -ipynb -feather -iso
It might be easier to just clean up the folder before running the command. Note also that it will create collections for every folder it finds, regardless of whether there are any non-ignored files in them; you may also want to prune empty folders before running it.
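For a dry run before the real thing, the behaviour described above (one collection per folder, wanted and ignored extensions, a hard stop on anything unknown) can be sketched as follows. This is not the source of dir2rdf.py; the names scan and split_args are hypothetical, and the demo folder is a throwaway built only for the example.

```python
import os
import tempfile


def split_args(args):
    """Separate 'pdf' style arguments from '-pptx' style ones."""
    wanted = {a.lower() for a in args if not a.startswith("-")}
    ignored = {a[1:].lower() for a in args if a.startswith("-")}
    return wanted, ignored


def scan(root, wanted, ignored):
    """Return {relative folder: [kept files]} for every folder under root."""
    collections = {}
    for folder, _subdirs, files in os.walk(root):
        kept = []
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in ignored:
                continue
            if ext not in wanted:
                raise ValueError("unknown extension {!r}: {}".format(
                    ext, os.path.join(folder, name)))
            kept.append(name)
        # Every folder is recorded, even when nothing in it was kept.
        collections[os.path.relpath(folder, root)] = kept
    return collections


# Throwaway folder tree for the demo.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "HWA", "IGAS"))
os.makedirs(os.path.join(demo_root, "Empty"))
for rel in [("HWA", "IGAS", "Stockholm1977a.pdf"), ("HWA", "notes.txt")]:
    open(os.path.join(demo_root, *rel), "w").close()

wanted, ignored = split_args(["pdf", "epub", "-txt"])
result = scan(demo_root, wanted, ignored)
empty_folders = sorted(k for k, v in result.items() if not v)
print(result)
print("folders with nothing to import:", empty_folders)
```

Listing the folders that end up empty gives a pruning list to work from before running the actual import.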