PDF text indexing failures
I have been trying to improve the extent of full-text indexing in my library, as about a third of my PDFs were showing as unindexed under Preferences\Search. Some of that is fixed by simply tweaking the max characters/pages settings there. But the reasons for some indexing failures are less obvious, and re-indexing fails on some items.
If I do an Advanced search for Pages 'greater than' my current setting for 'Maximum pages to index per file', many of the items returned seem to have odd entries for Pages. Their PDFs are often unindexed (but not all are). These odd page entries seem more common these days with online journal content, where the Pages value sometimes does not correspond to physical pages, e.g.
Proceedings of Meetings on Acoustics, Vol. 33, 035002 (2018)
So, does indexing look at the Pages figure in deciding which PDFs to index? If not, how does it check the page count?
Secondly, attempting re-indexing seems to fail on PDFs with diacritics in the filename. In the PDF's pane, clicking on the spinner next to 'No' pauses for a second but then remains at 'No' (right click Reindex Item also fails). If I then change the PDF file name to remove the diacritics, re-indexing is successful (changes to 'Yes' after the time taken to reindex).
However I do have PDFs with diacritics that were already indexed. So the problem seems specific to re-indexing.
-
dstillman (edited November 9, 2022)
"Pages" and "# of Pages" are different fields. For the page limit, it's just the number of pages in the PDF. It has nothing to do with those fields. The diacritics problem is a longstanding bug on Windows from the underlying platform Zotero is built on. It might be fixed in the next major version.
-
tim820 (edited November 10, 2022)
Thanks. I worked around the diacritic file name reindexing bug by first setting Zotfile's advanced option to remove diacritics from filenames when moving/renaming. However, reindexing still sometimes failed on PDFs with hyphens in the file name (though seemingly not always). Manually editing the file name usually fixed that.
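For anyone who wants to see in advance which attachments the diacritics workaround would affect, here is a minimal Python sketch. It is not Zotfile's or Zotero's code; the function names `strip_diacritics` and `plan_renames` are made up for illustration, and the folder you pass in is whatever directory holds your PDFs. It only reports the proposed names and renames nothing, since renaming attachment files outside Zotero may leave the item pointing at the old name.

```python
import unicodedata
from pathlib import Path


def strip_diacritics(name: str) -> str:
    """Return name with combining accents removed, e.g. 'é' -> 'e'.

    Letters with no decomposition (such as 'ø' or 'ß') are left unchanged.
    """
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is left, for a tidy result.
    return unicodedata.normalize("NFC", stripped)


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """List (current, proposed) paths for PDFs whose names contain diacritics.

    Hypothetical helper: it only builds the list and touches no files.
    """
    pairs = []
    for pdf in sorted(folder.rglob("*.pdf")):
        new_name = strip_diacritics(pdf.name)
        if new_name != pdf.name:
            pairs.append((pdf, pdf.with_name(new_name)))
    return pairs
```

Calling `plan_renames(Path("some/folder"))` gives a list you can print and then work through inside Zotero. Hyphens are left alone here, because the thread does not establish which hyphenated names fail.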