How to find unlinked files

François Maurice · February 4, 2018

Hi,

Here is a solution for those of us who use the linked files feature instead of the attached (stored) files feature and want to know whether the files (mostly PDFs) in a given folder are linked to the Zotero database.

1- Export the database to BibTeX format.
2- Open the .bib file in JabRef.
3- Click the "Quality" menu and then "Find unlinked files..."
4- Select the folder where your files are.
5- Then click "Scan Directory".

This works because the export saves the file paths of the attachments of your Zotero items.
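For anyone who would rather script the same check than go through JabRef, here is a minimal Python sketch of the idea. It rests on an assumption that should be verified against your own export: that each attachment appears in the .bib file as file = {label:path:mimetype}, with several attachments separated by ";" and with literal ":", ";" and "\" inside a path escaped by a backslash. The function names (parse_file_field, linked_paths, find_unlinked) are made up for this illustration and are not part of Zotero or JabRef.

```python
import os
import re

# Assumed layout of the exported field (check your own .bib file):
#   file = {label:path:mimetype;label:path:mimetype}
FILE_FIELD = re.compile(r"^[ \t]*file[ \t]*=[ \t]*\{(.*)\},?[ \t]*$", re.MULTILINE)


def parse_file_field(value):
    """Return the paths found in the value of one BibTeX file field."""
    entries, fields, current = [], [], []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            current.append(value[i + 1])  # keep the escaped character literally
            i += 2
            continue
        if ch in ":;":
            fields.append("".join(current))
            current = []
            if ch == ";":  # end of one attachment entry
                entries.append(fields)
                fields = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    entries.append(fields)
    # The path is assumed to be the middle part of label:path:mimetype.
    return [f[1] for f in entries if len(f) >= 3]


def linked_paths(bib_text):
    """Collect every attachment path mentioned in the exported .bib text."""
    paths = set()
    for match in FILE_FIELD.finditer(bib_text):
        for path in parse_file_field(match.group(1)):
            paths.add(os.path.normcase(os.path.abspath(path)))
    return paths


def find_unlinked(bib_text, folder):
    """List files under folder that no exported item points to."""
    linked = linked_paths(bib_text)
    unlinked = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            full = os.path.normcase(os.path.abspath(os.path.join(root, name)))
            if full not in linked:
                unlinked.append(full)
    return sorted(unlinked)
```

Usage would be something like find_unlinked(open("library.bib", encoding="utf-8").read(), "/path/to/pdfs"), where both names are placeholders. Relative paths in the export are resolved against the current working directory here, so this only gives sensible results if the export contains absolute paths or you run it from the right folder.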

emilianoeheyns · February 13, 2018

https://github.com/retorquere/zotero-storage-scanner is now compatible with Zotero 5.0 and will tag attachments whose files are missing with "#broken" in Zotero.

Gurdas_Sandhu · February 13, 2018

@emilianoeheyns Is this solid, or are there any known issues that could potentially damage a library? Also, the description says the tool will "live updates two smart-folders #duplicates and #broken", and I'm wondering if that means it will create those tags and mark positives with them (like you say)?

emilianoeheyns · February 13, 2018

Uh, that must be contributed text, it doesn't look like how I'd phrase it myself. It will just (un)tag attachments, nothing else.

Normal people would say this is solid and there's no potential damage to your library, and certainly no known problems that would cause it. I see no way it could, it never has, and it uses only Zotero APIs to do its work. It doesn't touch the attachments in any other way than to add or remove these tags and then save the item into the DB, all using the same API Zotero would use to do this. If, however, I were to claim that there's no *potential* for damage, my autism (or the training in analytic epistemology, it's hard to distinguish the two) would assert itself and object that I can't state with absolute certainty that all potential for damage is excluded.

But yeah, normal people would say zero risk.

Gurdas_Sandhu · February 13, 2018

Thanks, that's assurance enough. Almost all my attachments are "link to file" type, though there are a few webpage snapshots. So, this tool will follow the path defined by each link and if a file does not exist at that path, it will tag my attachment (or link to attachment) with #broken, right?

I wonder if it actually opens the file or verifies that the file type is correct? Probably not, and that would be almost overkill in any case.

I'm not clear what the #duplicates does (or how it does the task).

emilianoeheyns · February 13, 2018

As a matter of fact, it doesn't even do that. It asks Zotero whether the path exists (which works for stored files, linked files, and snapshots alike) and then tags (or untags, if the problem has been fixed) the attachment as appropriate. It does not open the file in any way; the request to Zotero to resolve the path just tells me whether it exists.
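The tag/untag logic described above boils down to something like the following Python sketch. To be clear, this is not the plugin's code (the plugin goes through Zotero's own APIs); the attachment here is a hypothetical plain dict with a "path" and a set of "tags", and os.path.exists stands in for asking Zotero whether the path resolves.

```python
import os

BROKEN_TAG = "#broken"


def update_broken_tag(attachment, path_exists=os.path.exists):
    """Add or remove the #broken tag; return True if the tags changed.

    attachment is a hypothetical dict with "path" (str) and "tags" (set);
    it is not a Zotero object. The file itself is never opened.
    """
    missing = not path_exists(attachment["path"])
    had_tag = BROKEN_TAG in attachment["tags"]
    if missing and not had_tag:
        attachment["tags"].add(BROKEN_TAG)      # newly broken: tag it
    elif not missing and had_tag:
        attachment["tags"].discard(BROKEN_TAG)  # fixed since last run: untag it
    return missing != had_tag
```

Running it again after a path has been repaired removes the tag, which matches the "or untags, if the problem has been fixed" behaviour described in the post.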

I told you this doesn't do much :) it just automates what Zotero can already do.

#duplicates just tags attachments where you have two or more of the same type (so two PDFs, two Word documents, whatnot) under the same reference -- it's just the problem I needed fixed when I wrote the plugin.
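In rough Python terms, and again only as an illustration of the rule rather than the plugin's actual code, the decision looks like this. Each attachment is a hypothetical dict with a "parent" (the reference it sits under), a "type", and a set of "tags".

```python
from collections import defaultdict

DUPLICATE_TAG = "#duplicates"


def tag_duplicates(attachments):
    """Tag attachments that share both parent and type with another attachment.

    attachments: list of hypothetical dicts with "parent", "type", "tags" (set).
    """
    groups = defaultdict(list)
    for att in attachments:
        groups[(att["parent"], att["type"])].append(att)
    for group in groups.values():
        for att in group:
            if len(group) >= 2:
                att["tags"].add(DUPLICATE_TAG)
            else:
                att["tags"].discard(DUPLICATE_TAG)  # no longer a duplicate
```

Only the parent and the type go into the grouping key, so two PDFs under one reference are tagged regardless of what they contain.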

Gurdas_Sandhu · February 13, 2018

Still, very useful :)

Regarding the "#duplicates just tags attachments where you have two or more of the same type" - so, it only looks at file TYPE and not file name or byte size? Thus, if a top-level item has two PDFs (different file names and sizes), then they will be tagged with #duplicates?

Gurdas_Sandhu · February 13, 2018

One more thing, how fast is this? Ballpark execution time for a library with 3,000 top-level items each with one linked attachment?

emilianoeheyns · February 13, 2018

It doesn't look at file size or file names, just file types. I'm open to suggestions (but short on time, so no promises).

It should be fairly fast because it does so little. No idea on execution time. It should depend heavily on your system, but it's highly I/O-bound, so a slow disk would be a bigger bottleneck than a slow CPU. I think.

Gurdas_Sandhu · April 7, 2018

I ran the tool (back on Feb 13, 2018) and it did a great job of tagging attachments with broken file paths.

The duplicates feature worked as described, though it creates a lot of false positives given the decision rule. I have many top-level items with multiple attachments of the same file type. For example, a journal article might have two PDFs, one of which is the main manuscript while the other is the supporting information. Another example is when I have two webpages attached to the same top-level item.

Once I've fixed the broken file paths I will run the latest version of the tool.

emilianoeheyns · April 7, 2018

I'm open to discussion of other decision rules that can be expected to have reasonable performance.

Gurdas_Sandhu · April 7, 2018

I was going to suggest that duplicates check for filetype AND filename, but I did not since there is no GUI to turn on/off that additional filename criterion. Checking for both would have missed the few instances when I had duplicates but with different filenames. I'd rather have false positives than miss real instances.
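To make the trade-off concrete, here is a small Python sketch that compares the two rules on made-up data. The attachment dicts, the file names, and the function names are all invented for illustration; nothing here is taken from the plugin.

```python
from collections import Counter


def by_type(att):
    """Current rule: same parent and same file type."""
    return (att["parent"], att["type"])


def by_type_and_name(att):
    """Stricter rule: same parent, same file type, and same file name."""
    return (att["parent"], att["type"], att["name"])


def flagged(attachments, key):
    """Return the names of attachments whose key occurs at least twice."""
    counts = Counter(key(att) for att in attachments)
    return [att["name"] for att in attachments if counts[key(att)] >= 2]


attachments = [
    # manuscript plus supporting information: not real duplicates
    {"parent": "article1", "type": "pdf", "name": "manuscript.pdf"},
    {"parent": "article1", "type": "pdf", "name": "supporting.pdf"},
    # the same paper saved twice under different names: real duplicates
    {"parent": "article2", "type": "pdf", "name": "smith2018.pdf"},
    {"parent": "article2", "type": "pdf", "name": "Smith - 2018.pdf"},
    # the same paper saved twice under the same name: real duplicates
    {"parent": "article3", "type": "pdf", "name": "paper.pdf"},
    {"parent": "article3", "type": "pdf", "name": "paper.pdf"},
]

loose = flagged(attachments, by_type)
strict = flagged(attachments, by_type_and_name)
```

Here loose contains all six files (including the article1 false positives), while strict contains only the two paper.pdf entries and misses the article2 pair, which is the kind of case I would not want to lose.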

emilianoeheyns · April 7, 2018

It would not be uncommon for me to have two attachments with different names (e.g., after merging references).

cedricmontero · May 3, 2019

It would be very useful to have some sort of 'check unlinked files' tool directly in Zotero rather than this export procedure via JabRef. Is there a way to put it on the development to-do list?
Regards.