automate "retrieve metadata"/"create parent item"

Now that retrieve metadata is working a lot better, I wanted to bring up the idea of automatically running it when users add a top-level PDF again. This could also move into the direction of an oft-requested "watch folder".
I also still think with retrieve metadata failing we should create an empty parent item - and probably default that to journal article.

The only real downside I see is that a user may want to drag a PDF to an item but drags it to a standalone position and then has to wait for retrieve metadata to finish and move the PDF. That seems rare enough to ignore.

Thoughts, concerns?
  • edited April 16, 2014
    I also still think with retrieve metadata failing we should create an empty parent item - and probably default that to journal article.
    Or, (this would of course need to wait for 4.2) we could expand "document" to encompass all possible fields and use that.
    The only real downside I see is that a user may want to drag a PDF to an item but drags it to a standalone position and then has to wait for retrieve metadata to finish and move the PDF. That seems rare enough to ignore.
    In general, I think we need a right-click solution for this. Something like "Dissociate attachment"

    I'm all for automating this though.
  • I think we need a right-click solution for this. Something like "Dissociate attachment"
    hmm - on the one hand, the current solution is impossible - it's like a puzzle game to find the right spot. On the other hand, though, the right-click menu is already way overcrowded and this would be a pretty rare operation, no?
  • Not that rare. Retrieve metadata is not perfect
  • Now that retrieve metadata is working a lot better, I wanted to bring up the idea of automatically running it when users add a top-level PDF again.
    Yeah, I think I'm OK with this now.
    This could also move into the direction of an oft-requested "watch folder".
    I think doing this efficiently would require using platform-specific js-ctypes, to tie into inotify and similar. We don't want to use a polling approach (even with async file access), since it would affect battery life and there'd be a delay anyway. Alternatively we could have a button to manually scan a directory for files.

    In all of these cases, I'm not totally sure what the logic is: mtime since the last check? Then if a file is edited it gets imported again. But ctime probably isn't sufficient, since some software might set ctime based on wherever the file came from instead of the local time.
    I also still think with retrieve metadata failing we should create an empty parent item - and probably default that to journal article.
    If we do this, do we also need an option on child PDFs to regenerate parent metadata from the PDF? Or do you have to disassociate first, delete the parent item, and then re-recognize? (If the former, it would be handy for more things than just transient recognition failures; it could be helpful for imported libraries, etc. But it'd probably need to use the merge window to show the differences before overwriting.)

  • Yeah, I think I'm OK with this now.
    cool!
    If we do this, do we also need an option on child PDFs to regenerate parent metadata from the PDF? Or do you have to disassociate first, delete the parent item, and then re-recognize?
    I don't think we need the regenerate function for this. Presumably, if retrieve metadata fails, the user needs to do this by hand. That said, the feature is frequently requested and I think we'd want this in general for the reasons you say.

    As for the watched folder - I had never looked into that. ZotFile has a sort-of watched folder feature, not sure what Joscha is doing.
  • edited April 16, 2014
    Doesn't ZotFile just grab the latest file in the directory by date? (At least, I think it did a few years ago — from the description it looks like it might check for multiple files now.)
  • no, you're right, it's just for a single file, so that won't help.
  • edited April 16, 2014
    In all of these cases, I'm not totally sure what the logic is: mtime since the last check? Then if a file is edited it gets imported again. But ctime probably isn't sufficient, since some software might set ctime based on wherever the file came from instead of the local time.
    Zotero could move the attachment from the watch folder instead of copying it. The downside is someone who wants to import files as links and use the watch folder to organize their PDFs.***

    Alternatively, Zotero could keep track of which files have already been imported by file path and checksum (say MD5). This way, one could avoid re-importing after renaming the PDF or modifying its content somehow. In the case of renames and importing as link, Zotero could automatically adjust the links. In case of modifications and importing normally, Zotero could re-import (though that's possibly a bad idea). The downside is that this is somewhat complicated and ugly internally (I'd say it's worth the hassle for a smooth UX).
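
    A rough sketch of that bookkeeping, written for Node.js rather than Zotero's own environment. The `ImportRegistry` class and its method names are made up for illustration, the registry lives only in memory, and the file content is passed in as a string or Buffer instead of being read from disk:

```javascript
const crypto = require("crypto");

// Hypothetical registry of already-imported files, keyed both ways.
class ImportRegistry {
  constructor() {
    this.byHash = new Map(); // checksum -> path
    this.byPath = new Map(); // path -> checksum
  }

  // MD5 is used only as a change detector here, not for security.
  static checksum(content) {
    return crypto.createHash("md5").update(content).digest("hex");
  }

  // Returns "known", "renamed", "modified", or "new".
  classify(path, content) {
    const hash = ImportRegistry.checksum(content);
    if (this.byHash.has(hash)) {
      // Same content seen before: either the same file or a rename.
      return this.byHash.get(hash) === path ? "known" : "renamed";
    }
    // Unseen content: a known path means the file was edited in place.
    return this.byPath.has(path) ? "modified" : "new";
  }

  // Records the file, dropping stale entries for its old path or old hash.
  record(path, content) {
    const hash = ImportRegistry.checksum(content);
    const oldHash = this.byPath.get(path);
    if (oldHash !== undefined) this.byHash.delete(oldHash);
    const oldPath = this.byHash.get(hash);
    if (oldPath !== undefined) this.byPath.delete(oldPath);
    this.byHash.set(hash, path);
    this.byPath.set(path, hash);
  }
}
```

    With something like this, "renamed" would be the cue to adjust a link, and "modified" the case where re-importing is debatable. A file that is both renamed and edited would still look "new".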

    ***At some point in the future
  • Just a note on what zotfile does: I think what Dan refers to is the 'Attach new file' function in zotfile. The sort-of watch folder feature is different. It checks the watched folder and saves the mtime of the most recent file. If there is a file with a more recent mtime, zotfile shows a clickable and non-disruptive info window 'click here to add file to zotero'. This file is either added to the currently selected zotero item or (if no item is selected) as an independent item and retrieve metadata starts automatically.
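
    The check described above could look roughly like this. It is not zotfile's actual code: `findNewestSince` is a hypothetical helper, and the directory listing is assumed to have been read already into plain `{ name, mtime }` objects with `mtime` in milliseconds:

```javascript
// Returns the most recently modified file that is newer than the
// saved timestamp, or null if nothing has changed since then.
function findNewestSince(files, lastSeenMtime) {
  let newest = null;
  for (const file of files) {
    if (file.mtime > lastSeenMtime &&
        (newest === null || file.mtime > newest.mtime)) {
      newest = file;
    }
  }
  return newest;
}

// Example listing (made-up file names and timestamps).
const listing = [
  { name: "old.pdf", mtime: 1000 },
  { name: "paper.pdf", mtime: 3000 },
  { name: "notes.pdf", mtime: 2000 },
];
const candidate = findNewestSince(listing, 1500);
```

    Here `candidate` is the `paper.pdf` entry. After the user accepts or dismisses the prompt, the saved timestamp would be advanced to `candidate.mtime`. Note that an edited file gets a new mtime and so would be offered again, which is the problem raised earlier in the thread.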
  • Do you poll, or do you watch for a download event in Firefox? (The former, I guess, if it works in Standalone too?) If you do poll, how often?
  • I use a 'focus' event-listener for the 'zotero-items-tree':

    window.ZoteroPane.document.getElementById('zotero-items-tree').addEventListener('focus', Zotero.ZotFile.watchFolder, false);

    So whenever the 'zotero-items-tree' gets focus, I poll. The idea is that the pop-up only makes sense when the user is working with Zotero.
  • Hmm, OK. I don't think we'd go that route in Zotero. The mtime check is generally quick, but it still can affect performance and battery life, so I'd be uncomfortable doing it on every pane focus. We do mtime checks for file syncing and have had to add various complicated logic to reduce the number of checks (in addition to making them async), since they were causing problems for people with large libraries. This wouldn't be a big deal with a few files, but an unkempt downloads folder could easily have thousands of files in it.

    It looks like there's actually a more efficient way to do this on Windows, but I think on OS X and Linux we'd have to stat all the files individually:

    https://developer.mozilla.org/en-US/docs/JavaScript_OS.File/OS.File.DirectoryIterator_for_the_main_thread#Example.3A_Sorting_files_by_last_modification_date

    As an alternative, though, in Firefox we could just watch for downloads. I'm not sure what the current capabilities are for Chrome and Safari extensions — it's possible we could have one or both of the connectors watch for downloads and notify Standalone.
  • Most of my PDFs are grey literature and "get metadata" fails. But I would still like to have Zotero auto-create the parent item, even if only with the filename as title and nothing else, because I always want a parent item, and because I use zotfile, which requires one. So when I add a new PDF I have to remember to create a parent item and then remember to rename (because I always want zotfile to move the file to another folder and replace it with a link). I have to go through these two steps every single time I add a new PDF to Zotero, which is pretty fiddly.