Retrieving metadata from PDFs is very slow

quagman · June 26, 2020

Running v5.0.88. I've noticed that when I import a PDF into my Zotero library, the metadata retrieval is much slower than it used to be. It eventually works, but it takes a lot longer than previously. Any idea why?

dstillman · June 26, 2020

It depends on the upstream providers used for a specific PDF.

If you provide a Debug ID for an attempt that's slow, we can have a look.

quagman · June 26, 2020

D1579249704

The metadata retrieval took approximately 20 seconds.

dstillman · June 26, 2020

(4)(+0000001): INSERT OR IGNORE INTO fulltextItemWords (wordID, itemID) SELECT wordID, ? FROM fulltextWords JOIN indexing.fulltextWords USING(word) [22734]

(4)(+0001749): REPLACE INTO fulltextItems (itemID, version, synced, indexedPages, totalPages) VALUES (?, ?, ?, ?, ?) [22734, 0, 0, 11, 11]

(4)(+0000002): DELETE FROM indexing.fulltextWords

(3)(+0000002): Notifier.trigger('refresh', 'item', [22734]) queued

(4)(+0013629): Committed DB transaction zRzWHNRM

Unless that debug output wasn't representative, I don't think you're actually seeing a problem with metadata retrieval. The metadata retrieval appears to be extremely quick in that example. But there was a 14-second delay in indexing the PDF's full-text content, which is very slow.
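
For anyone who wants to locate this kind of delay in their own debug output, here is a minimal sketch. It assumes each line has the form `(level)(+NNNNNNN): message` and that the number is the time in milliseconds since the previous line, which is consistent with the roughly 14-second gap before the commit line above. The names `parse_debug_output` and `slow_steps` and the 1000 ms threshold are illustrative only; they are not part of Zotero.

```python
import re
from typing import List, NamedTuple

# Matches lines such as "(4)(+0013629): Committed DB transaction zRzWHNRM".
LINE_RE = re.compile(r"^\((\d+)\)\(\+(\d+)\):\s*(.*)$")


class LogLine(NamedTuple):
    level: int
    delta_ms: int  # assumed: milliseconds elapsed since the previous line
    message: str


def parse_debug_output(text: str) -> List[LogLine]:
    """Parse debug output, skipping blank or unrecognized lines."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            level, delta, message = match.groups()
            entries.append(LogLine(int(level), int(delta), message))
    return entries


def slow_steps(entries: List[LogLine], threshold_ms: int = 1000) -> List[LogLine]:
    """Return the entries that were logged after a gap of at least threshold_ms."""
    return [entry for entry in entries if entry.delta_ms >= threshold_ms]


# The five lines quoted in the post above.
SAMPLE = """
(4)(+0000001): INSERT OR IGNORE INTO fulltextItemWords (wordID, itemID) SELECT wordID, ? FROM fulltextWords JOIN indexing.fulltextWords USING(word) [22734]
(4)(+0001749): REPLACE INTO fulltextItems (itemID, version, synced, indexedPages, totalPages) VALUES (?, ?, ?, ?, ?) [22734, 0, 0, 11, 11]
(4)(+0000002): DELETE FROM indexing.fulltextWords
(3)(+0000002): Notifier.trigger('refresh', 'item', [22734]) queued
(4)(+0013629): Committed DB transaction zRzWHNRM
"""

if __name__ == "__main__":
    for entry in slow_steps(parse_debug_output(SAMPLE)):
        print(f"{entry.delta_ms / 1000:.1f} s before: {entry.message[:60]}")
```

Run against the excerpt, this flags the full-text `REPLACE` and the transaction commit, both of which belong to indexing rather than metadata retrieval.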

If you temporarily disable "Automatically retrieve metadata for PDFs" in the General pane of the Zotero preferences, you should be able to distinguish between the time it takes to add the PDF and have it show as indexed in the right-hand pane and the time it takes to retrieve metadata.

For the full-text indexing issue, how big is zotero.sqlite in your Zotero data directory? How many items are in your database?
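
If it helps to answer both questions at once, the following is a rough sketch rather than an official tool. It assumes the database has a table named `items` and counts every row in it, so the result includes attachments and notes and may be higher than the number shown in the Zotero window. The function name `database_stats` is hypothetical. It is safest to run this on a copy of zotero.sqlite, or with Zotero closed.

```python
import os
import sqlite3
from pathlib import Path
from typing import Tuple


def database_stats(db_path: str) -> Tuple[float, int]:
    """Return (file size in MB, number of rows in the `items` table)."""
    size_mb = os.path.getsize(db_path) / (1024 * 1024)

    # Open the file read-only so this script cannot modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # Assumption: a table named `items` holds one row per item.
        (item_count,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
    finally:
        conn.close()
    return size_mb, item_count


if __name__ == "__main__":
    import sys

    # Usage: python db_stats.py /path/to/copy/of/zotero.sqlite
    size, count = database_stats(sys.argv[1])
    print(f"{size:.1f} MB, {count} rows in items")
```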

quagman · June 29, 2020

You're right, it's the indexing that's taking a long time, not the metadata retrieval. My zotero.sqlite is 70MB in size, and I have over 10,200 items in my database.

The thing is, this long indexing delay happened rather suddenly. The delay didn't get longer gradually over time. Is there a fix for something like this?

quagman · July 1, 2020

I just decided to disable indexing since I don't really need that feature anyway. Metadata retrieval is fast again!

dstillman · July 2, 2020

This certainly shouldn't happen in a 70 MB database. Is this by any chance on an old computer with a spinning disk (i.e., not an SSD)?

quagman · July 2, 2020

It does have a spinning disk but the PC is not that old. It's an i5-7500 CPU with 16GB RAM.

To me, the puzzling thing is how the slowdown was rather sudden. It was working quickly before, then suddenly things slowed down.

quagman · April 27, 2021

The slow metadata retrieval problem has reappeared. I had swapped out the old hard drive for an SSD, and things were working well, but now it's slow again. I have disabled PDF indexing, but metadata retrieval is still slow. Here's a debug ID of a recent attempt:
D2112068734

Any help would be appreciated!

dstillman · June 22, 2021

(3)(+0000000): HTTP GET https://doi.org/10.[…]

(3)(+0010599): Translate: Could not find a result using DOI Content Negotiation -- trying next translator

(3)(+0000000): HTTP GET https://doi.org/10.[…] failed with status code 504

@quagman: Sorry I missed this at the time, but this appears to just be a 10-second timeout from Crossref. It wouldn't be a regular thing (it was instantaneous for me), and there's nothing we can do about it. If you're seeing it regularly, there could be a problem with your network, but this is almost certainly a remote problem on their end.

adamsmith · June 22, 2021

(IIRC, that was around the time Crossref was regularly struggling with API performance.)