Proposal: Parallel PDF Finding for Large Libraries
Hi all,
I have a library with ~12,000 items, many missing PDFs. When I run "Find Available PDF" on a large selection, it processes items one at a time, which takes hours. I've been experimenting with a patch to `attachments.js` that allows concurrent requests (defaulting to 3 in parallel), and it dramatically speeds up the process while still respecting the existing per-domain rate limiting.
## The Problem
In `addAvailableFiles()`, items are processed sequentially:
```javascript
processNextItem(); // starts ONE item, waits for completion, then next
```
For a library with thousands of items missing PDFs, this means waiting 1+ second per item even when they're from different domains that could be queried in parallel.
## Proposed Solution
Start N items concurrently (configurable, default 3):
```javascript
const MAX_CONCURRENT = Zotero.Prefs.get('findPDF.maxConcurrent') || 3;
for (let i = 0; i < MAX_CONCURRENT; i++) {
    processNextItem();
}
```
The existing per-domain rate limiting (`SAME_DOMAIN_REQUEST_DELAY`, `MAX_CONSECUTIVE_DOMAIN_FAILURES`) remains intact, so we're not hammering any single service. We're just allowing requests to *different* domains to happen in parallel.
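For clarity, here's the general shape of the pattern I'm using, reduced to a minimal standalone sketch. This is not Zotero's actual API; `processQueue` and `processItem` are hypothetical stand-ins for the queue-draining logic in `addAvailableFiles()`. Each "worker" pulls the next unclaimed index and awaits it, so at most N items are in flight at once:

```javascript
// Minimal worker-pool sketch (hypothetical names, not Zotero's API).
// Runs processItem() on every item with at most maxConcurrent in flight.
async function processQueue(items, maxConcurrent, processItem) {
    let index = 0;
    const results = [];

    // Each worker repeatedly claims the next index and processes it.
    // Claiming (index++) happens synchronously before any await, so
    // two workers never grab the same item.
    async function worker() {
        while (index < items.length) {
            const i = index++;
            results[i] = await processItem(items[i]);
        }
    }

    const workerCount = Math.min(maxConcurrent, items.length);
    await Promise.all(Array.from({ length: workerCount }, worker));
    return results;
}
```

In the real patch, the per-domain delay checks would still run inside each worker's `processItem` step, so concurrency only helps when the pending items span different domains.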
## Testing
I've tested this with:
- 1000+ items in a batch
- Various concurrency levels (3, 5, 10)
- Verified Unpaywall/Sci-Hub rate limits are still respected
With 3 concurrent requests, processing is roughly 3x faster with no increase in failures.
## Questions for the Team
1. Is this something you'd consider for Zotero core?
2. Should concurrency be user-configurable via hidden pref, or a fixed reasonable default?
3. Any concerns about UI responsiveness with concurrent progress updates?
Happy to submit a PR if there's interest. Thanks for considering!
---
*Note: I can also share the full patch diff if helpful.*