Links broken following migration from Mendeley and attempt to clean up duplicates

leblancj · April 25, 2020

Hi,

I just switched from Mendeley and I'm still struggling with the concepts of the base folder, linked attachments folder and shared folders. Zotero is a wonderful initiative and product and I'm planning to stick with it. I think though that I have royally screwed up my migration and have lost most if not all of my attachment links. I'm hoping there's a simple solution.

Right now my data directory location is %SYSPROFILE%\Zotero and my base directory is "D:\Attention\Lit\_Zotero 2020\storage". Many files now have a link to my research assistant's folder "/Users/emily/Dropbox/ASDIT common/Lit/" on her Mac. This is because I had thousands of duplicate refs when I imported my records from Mendeley and I asked her to help me merge them.

I did this as follows:
1. I bought extra storage on Zotero so that I would have room for citation info + associated pdfs.
2. I synced my desktop library to my online Zotero account.
3. I asked her to install Zotero and login with my credentials so that she could merge duplicates.
4. She did this and happily duplicates are gone. However, when I look at refs on my computer, I see the link to pdfs to her folder "/Users/emily/Dropbox/ASDIT common/Lit/" not to mine.

QUESTIONS
1. If she set her linked attachments folder to the base of the Zotero storage folder and I set mine to the same, e.g., C:\Users\John\Zotero, would that solve the problem?

2. On my computer, should I set my base directory to be the same as the data directory? Right now I have two Zotero storage folders and I'm sure that leads to no end of trouble.

Thanks for your help with this.

dstillman · April 25, 2020

The Linked Attachment Base Directory is solely for linked-file attachments, not for stored-file attachments. It has nothing to do with the data directory — or 'storage' within that — and shouldn't be set to that. (I can't remember if Zotero prevents that, but it certainly should.)

> Right now my data directory location is %SYSPROFILE%\Zotero and my base directory is "D:\Attention\Lit\_Zotero 2020\storage".

(%USERPROFILE%, not %SYSPROFILE%)

I'm not really sure what D:\Attention\Lit\_Zotero 2020\storage is in your case, or what you have in the folder above it. 'storage' is the hardcoded name of a folder within the data directory. There's no reason for anything else to be called 'storage'. I'm not sure if you previously pointed the data directory at "_Zotero 2020", such that you have a zotero.sqlite file there as well and the random 8-character folder names that go in the real 'storage' folder in "_Zotero 2020\storage". If so, that would be quite a mess.

The Linked Attachment Base Directory causes linked files under it to be stored with relative paths, so she needs to set it to the folder she's going to share with you that contains all the linked files. You'll then set it to the location of the same folder on your computer, and the files will be accessible. So if she can access a linked file at /Users/emily/Dropbox/ASDIT common/Lit/foo.pdf, and you can access the same file at D:\Dropbox\Lit\foo.pdf, you'll both be able to access the file as long as the base directory is set properly on both of your computers.
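
To illustrate the relative-path idea, here is a minimal sketch in plain JavaScript (Node.js, not something to run inside Zotero). The helper names `toRelative` and `resolveOnWindows` are hypothetical, and this is not Zotero's actual implementation. It only shows why one stored relative path can resolve correctly on both machines when each base directory points at that machine's copy of the shared folder.

```javascript
const path = require('path');

// Hypothetical helper: the path that would be stored for a linked file
// sitting under the base directory on the Mac (relative to that directory).
function toRelative(macBaseDir, macFilePath) {
    return path.posix.relative(macBaseDir, macFilePath);
}

// Hypothetical helper: resolve the stored relative path against the base
// directory configured on the Windows machine.
function resolveOnWindows(winBaseDir, relativePath) {
    return path.win32.join(winBaseDir, relativePath);
}

const stored = toRelative(
    '/Users/emily/Dropbox/ASDIT common/Lit',
    '/Users/emily/Dropbox/ASDIT common/Lit/foo.pdf'
);
const resolved = resolveOnWindows('D:\\Dropbox\\Lit', stored);

console.log(stored);   // the relative path that both computers share
console.log(resolved); // the absolute path on the Windows machine
```

If either base directory pointed somewhere else (for example at a 'storage' folder), the same relative path would resolve to a location where the file does not exist.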

leblancj · April 25, 2020

Very helpful thanks and sorry, I meant %userprofile%.

I think when I first set up Zotero, I intended to sync attachments with Dropbox, so I pointed the storage folder to one that Dropbox would sync. That folder on the D: drive is only 300 MB vs. 1,300 MB at the default userprofile location. That's probably why I have two 'storage' folders. Fortunately, I only have one zotero.sqlite db and it's in the proper location at %userprofile%\zotero. It was also updated today and it passed the db

So how can I figure out if the storage folder on drive D:, to which the linked attachment base directory (LABD) is pointing, contains valid data or not? Is it just a bunch of references that aren't linked to anything or are they references from other locations on my hard drive that I somehow linked to Zotero rather than copying to the storage folder?

Is it safe to delete it? If it helps in my Zotero recovery, I'm willing to sacrifice that folder so I have a single Zotero storage folder even if I lose some refs.

dstillman · April 25, 2020

You can't point the 'storage' folder anywhere, though. You can point the data directory (which contains a 'storage' directory) somewhere (and if you put that in Dropbox Zotero will warn you not to do that, since it will corrupt your database).

You'd have to say what kinds of folders and files you have in '_Zotero 2020' and 'storage' on D: for us to tell you what to do. If there are just regular PDFs within there, they may be linked to attachments in your database. If all the PDFs that matter are on your assistant's computer and will be synced to your computer either via Zotero or via a cloud storage folder, then you can probably delete it.

leblancj · April 26, 2020

I really appreciate the time you're spending helping me sort this out. I hope you can help me find a quick way forward even if it means sacrificing some of my information (e.g., tags) that got imported from Mendeley.

re: 'Zotero 2020\Storage'
They are regular PDFs that exist elsewhere on my PC. Therefore, my inclination is to delete them and start from scratch.

I'm more worried about salvaging my main library in %userprofile%\Zotero\storage with thousands of refs. It looks like when the duplicates were merged, the Zotero metadata (if that's the right word) got merged but also all of the attachments, even though identical, were attached to the merged record. See this link for example that shows one Zotero ref, 4 valid attachments and 3 that are greyed out. Output from 'Duplicate cleaner Pro' (part of the screenshot) shows that these 7 pdfs exist in different sub-folders of storage:
https://bit.ly/2S8J5NG
Going through thousands of files and manually deleting duplicates is not worth my time. I see two ways forward and would appreciate your advice:

1. Use a program like duplicate cleaner to delete all of the duplicates. This is easy but what will happen in Zotero? I guess there's a risk that some links will be lost if the deleted files were the only ones linked to a particular Zotero record. Is that right?

2. Start from scratch and re-import all of my pdfs to Zotero. This would be easy since the majority of refs I care about are on my hard drive organized into hierarchical folders. I guess it would mean I would lose all of my tags though.

Thanks for your advice.

dstillman · April 26, 2020

(I'm a bit confused, since initially you were referring to linked files (stored outside the data directory), and now you seem to be referring exclusively to stored files.)

> It looks like when the duplicates were merged, the Zotero metadata (if that's the right word) got merged but also all of the attachments, even though identical, were attached to the merged record.

Yes, merging items doesn't currently merge attachments. (PDFs are often watermarked and not identical even from the same source, so we didn't bother implementing this initially. Having it merge identical files is generally planned, but there still might be different titles, filenames, tags, notes, etc., that would have to be somehow dealt with, so it's a somewhat complicated problem.)

If you don't have any groups where the same files exist, then yes, you could use a duplicate cleaner to delete duplicate files. You'd still then want to delete all the attachments in Zotero with missing files (indicated by an empty blue circle), but we could give you a short script to run that would delete all attachments without files.
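
For anyone curious what an external duplicate cleaner does at its core, here is a rough sketch in plain JavaScript (Node.js, run outside Zotero). It is not the tool used in this thread. The function names `listPdfFiles` and `findDuplicateGroups` are made up for illustration, and restricting the scan to .pdf files is my assumption. It only reports groups of byte-identical PDFs and deletes nothing, so it is best pointed at a backup copy of 'storage'.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Recursively collect every .pdf file under dir.
function listPdfFiles(dir) {
    const files = [];
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
        const full = path.join(dir, entry.name);
        if (entry.isDirectory()) {
            files.push(...listPdfFiles(full));
        } else if (entry.isFile() && entry.name.toLowerCase().endsWith('.pdf')) {
            files.push(full);
        }
    }
    return files;
}

// Group PDFs by the MD5 hash of their contents and return only the groups
// with more than one member. This is a dry run: no file is modified.
function findDuplicateGroups(rootDir) {
    const byHash = new Map();
    for (const file of listPdfFiles(rootDir)) {
        const hash = crypto.createHash('md5')
            .update(fs.readFileSync(file))
            .digest('hex');
        if (!byHash.has(hash)) {
            byHash.set(hash, []);
        }
        byHash.get(hash).push(file);
    }
    return [...byHash.values()].filter(group => group.length > 1);
}
```

Calling `findDuplicateGroups` on a copy of the 'storage' folder returns an array of arrays, each holding the paths of files with identical contents. Watermarked PDFs that differ by even one byte will not be grouped, which matches the point above about PDFs often not being identical.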

(While doing it by hand would still be annoying, note that you can click an item and press + to expand all items, which would make it a bit quicker to quickly select ranges of attachments with Ctrl and Shift.)

leblancj · April 27, 2020

Yes, it's confusing for me too because there are actually two problems: the linked attachment folder problem (on D:) and the duplicates problem. I think the linked file problem (on the Dropbox-accessible folder "D:\Attention\Lit\_Zotero 2020\storage") is not worth spending time on. I've learned that my duplicate attachments are in the Zotero data directory storage sub-folders (about 12,000 refs accumulated over a lifetime but with lots of duplicates!) and these are the ones I'd like to cull. I would be very grateful if you could provide me with a script that will delete all attachments without associated files.

Thanks!

dstillman · April 27, 2020

The linked file issue I explained above — it's just a question of setting the Linked Attachment Base Directory correctly on all computers and syncing. (But it's made more confusing by calling that directory 'storage' when it doesn't have anything to do with the actual 'storage' directory in the data directory.)

Here's a script you can run from Tools → Developer → Run JavaScript to add the tag "_missing file" to any attachments in My Library with a missing file. You can then click in the items list and do a Select All (Ctrl-A/Cmd-A), delete them, and empty the trash.

// Find every attachment item in My Library
var s = new Zotero.Search();
s.libraryID = Zotero.Libraries.userLibraryID;
s.addCondition('itemType', 'is', 'attachment');
var ids = await s.search();
await Zotero.DB.executeTransaction(async function () {
	for (let id of ids) {
		let item = Zotero.Items.get(id);
		// Only consider stored files; skip linked files and web links
		if (item.isFileAttachment() && !item.isLinkedFileAttachment()) {
			// Tag the attachment if its file is missing on disk
			if (!await item.fileExists()) {
				item.addTag('_missing file');
				// Save without changing the item's Date Modified
				await item.save({
					skipDateModifiedUpdate: true
				});
			}
		}
	}
});
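
If you'd rather preview what would be tagged before changing anything, the same check can be pulled out into a helper that only collects the matching attachments. This is a sketch, not an official snippet: `findMissingStoredAttachments` is a hypothetical name, and it relies only on the three item methods the script above already calls.

```javascript
// items: an array of objects exposing the same methods used in the script
// above (isFileAttachment, isLinkedFileAttachment, fileExists).
// Returns the stored-file attachments whose file is missing on disk,
// without tagging or saving anything.
async function findMissingStoredAttachments(items) {
    const missing = [];
    for (const item of items) {
        if (item.isFileAttachment() && !item.isLinkedFileAttachment()) {
            if (!await item.fileExists()) {
                missing.push(item);
            }
        }
    }
    return missing;
}
```

In Run JavaScript, one could presumably pass it `Zotero.Items.get(ids)` with the `ids` from the search above and return the length of the result to see how many attachments would be affected.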

leblancj · April 29, 2020

I think it worked; at least some files were identified though far fewer than I expected.
I ran it with "Run as async fn" ticked. I received a return value of ===>undefined<===

Here's a screenshot of the results when I filter on _missing file: https://www.dropbox.com/s/8zdezjwyxwiyzip/2020-04-28 213500 zotero _missing file script results.png?dl=0
There are only 4 attachments with the _missing file tag and all four are indented under an appropriate Zotero file. Doesn't this mean that they are attached?

Is the script okay?

leblancj · April 29, 2020

I ran it again with the debug console on in case this is helpful. Here's the output:
[JavaScript Warning: "unreachable code after return statement" {file: "resource://zotero/loader.jsm -> resource://zotero/bluebird/util.js" line: 201 column: 4 source: " eval(obj);
"}]
[JavaScript Error: "addon.getResourceURI is not a function" {file: "chrome://zotero/content/xpcom/prefs.js" line: 390}]
[JavaScript Error: "Error connecting to server. Check your Internet connection."]
(not sure what the last one's about; I'm able to sync)

dstillman · April 29, 2020

The script finds attachments that are missing files on disk, as indicated by empty blue circles on the right and a file-not-found dialog when you try to open them. It has nothing to do with being attached to a parent item.

Remember, the point of this script is to let you quickly delete useless attachment items after you use an external duplicate cleaner in the 'storage' directory to delete duplicate PDFs. Have you done that yet?

leblancj · April 29, 2020

I haven't run that yet because I'm worried that the software will delete a duplicate attachment that is correctly linked to a parent item and leave one that isn't. That's because it has no way of knowing (at least from file characteristics like name or date modified) which is correctly linked to a Zotero parent item. I want to make sure I have used the script results correctly. I think I need to:
1. delete all entries that say _missing file (open blue circle). This implies that all the pdfs that remain in Zotero storage are connected to a parent item. (I've done this)
2. Use the external software to delete all duplicates. This would be safe since the duplicates that remain are attached to a parent item. Is that correct? I will go ahead and delete the duplicates if you agree that this is the case.

dstillman · April 29, 2020

No, I think you're still misunderstanding the point of the script. You asked above if you can run a duplicate cleaner on the 'storage' directory. I said yes, but that it will leave attachment items in Zotero with missing files. I wrote this script for you to make it easy to tag and delete those attachments. There's no reason to run the script before running the duplicate cleaner.

The script doesn't do anything to find PDFs in 'storage' that aren't linked to items, but there's no reason to think those exist. This directory is managed by Zotero, so unless you force-quit Zotero in the middle of an import or copied in other files outside of Zotero, there's no reason there'd be orphaned files in there.

You should obviously make a backup of the entire directory before proceeding. If you notice that all attachments under an item are tagged with "_missing file", you can untag one of them and relocate the file manually by searching for it within the backup folder.
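
The "all attachments under an item are tagged" check can also be sketched in code. The following is plain JavaScript over simple records, not Zotero's API: the record shape `{ id, parentID, missing }` and the function name `parentsWithNoRemainingFile` are assumptions for illustration, with `missing` meaning the attachment carries the "_missing file" tag.

```javascript
// attachments: plain records of the form { id, parentID, missing }.
// Returns the parentIDs for which every attachment is missing its file,
// i.e. the items that would lose their only PDF if all tagged
// attachments were deleted.
function parentsWithNoRemainingFile(attachments) {
    // parentID -> true once any attachment with a file has been seen
    const hasFile = new Map();
    for (const att of attachments) {
        const seen = hasFile.get(att.parentID) || false;
        hasFile.set(att.parentID, seen || !att.missing);
    }
    return [...hasFile.entries()]
        .filter(([, ok]) => !ok)
        .map(([parentID]) => parentID);
}
```

Any parent returned by this function is a candidate for untagging one attachment and relocating its file from the backup before deleting the rest.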

leblancj · April 30, 2020

Yes I was more worried about orphaned PDFs in case the dup. software deleted ones that were linked to items and left ones that weren't. But I see your point; if they're in the Zotero storage folder and the database is not corrupted, they would have to be linked to a Zotero item.

I just ran the dups software and got rid of exact duplicates (files sharing the same MD5 hash). I then ran your script. I now have a helpful view where I can see all Zotero items and associated files. I see when I right-click on one with the open blue circle and select "show file" that it doesn't exist in a storage folder. However, if I left-double-click, Zotero is able to open the pdf and the icon goes from open to solid blue. It then creates a new storage sub-folder and puts the pdf there. Is it taking a copy from the pdf that is also linked to the same Zotero item? Does it do that so I can select the _missing file attachment I wish? (I notice that some of my missing files have full file names whereas the ones with the solid blue circle are just called "PDF".)

I'm using this feature to make sure that Zotero keeps the most useful filename associated with its item and removing the ones with the _missing file tag.

dstillman · April 30, 2020

> I now have a helpful view where I can see all Zotero items and associated files.

It's showing search matches. If you just want to see all child items, you can get that anytime just by clicking an item and pressing + on the keyboard.

> I see when I right-click on one with the open blue circle and select "show file" that it doesn't exist in a storage folder. However, if I left-double-click Zotero is able to open the pdf and the icon goes from open to solid blue.

The files are synced, so when they're missing and you double-click on them, Zotero will download them automatically.

The script really is for the very specific purpose I explained: selecting the attachment items associated with the files you deleted so that you can delete those attachments.

> i notice that some of my missing files have full file names whereas the ones with the solid blue circle are just called "PDF".

That's just the attachment title. The underlying files should all already be named based on the parent metadata. There's no reason to duplicate the filename in the attachment title, which is why files saved from translators don't do that. I'd really encourage you just to delete the selected attachments and not worry about this further.

leblancj · May 13, 2020

Thanks very much for your excellent and detailed help! Using your script above and your guidance, I was able to clean up the messy export/import from Mendeley.