Zotero Connector: arXiv treated as publication

mjthoraval · August 14, 2022

When I import the following arXiv preprint, it extracts the metadata of the final publication instead of the preprint, together with the PDF file from arXiv:
https://arxiv.org/abs/1703.07038
-> Afkhami, S., Buongiorno, J., Guion, A., Popinet, S., Saade, Y., Scardovelli, R., & Zaleski, S. (2018). Transition in a numerical model of contact line dynamics and forced dewetting. Journal of Computational Physics, 374, 1061–1093. https://doi.org/10.1016/j.jcp.2018.06.078

Is this the expected behaviour? I have tried a few other arXiv articles that have a final publication DOI, and it seems to do the same.

It is nice that Zotero identifies the final publication. It could be useful to import the metadata of the final publication, but then it should be imported as a separate, additional item (related to the arXiv item), with the PDF file of the final publication instead of the file from arXiv. The import from arXiv should always create a Preprint item. I think the current mix of the arXiv file with the publication metadata is wrong: the arXiv preprint and the publication are different things, each with distinct metadata, DOI and PDF file.

adamsmith · August 14, 2022

> Is this the expected behaviour? I have tried a few other arXiv articles that have a final publication DOI, and it seems to do the same.

Yes, this is intended, and it was done in response to multiple requests. Since the PDF is labelled accordingly, I don't think having it attached to the publication is problematic -- it's not unlike adding green OA versions of articles, which Zotero also does automatically.

mjthoraval · August 14, 2022

Thank you for your reply. I have tried to go through some of the discussions leading to the current status. I found this one, which illustrates quite well the dilemma of dealing with arXiv publications:
https://forums.zotero.org/discussion/29556/papers-from-arxiv

I will try to summarize what I understand and give an updated perspective in light of recent developments in Zotero, which may not have been available when the requests and decisions were made. I would be glad to be pointed to other aspects that I have missed.

Strictly speaking, each version of an arXiv preprint constitutes a different item, which is also different from the final publication. Versions can differ in title and authors, and can change significantly as content is added or removed. The different arXiv versions share the same DOI, but can still be distinguished by the version suffix of the arXiv ID. So in theory, the best practice would be to keep a separate copy of each of them, with its own metadata.
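
To make the version distinction concrete, here is a minimal Python sketch of how a versioned arXiv ID could be split into its base ID and version number. The function name `split_arxiv_id` and the "v2" suffix in the example are assumptions for illustration only; the pattern covers new-style IDs such as 1703.07038, not the older "hep-th/9901001" form, and it is not how Zotero itself handles this.

```python
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits,
# optionally followed by a version suffix such as "v2".
ARXIV_ID = re.compile(r"(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?")


def split_arxiv_id(text):
    """Return (base_id, version) found in an arXiv ID or URL.

    version is None when the text carries no explicit version suffix.
    """
    match = ARXIV_ID.search(text)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {text!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


print(split_arxiv_id("https://arxiv.org/abs/1703.07038"))  # ('1703.07038', None)
print(split_arxiv_id("arXiv:1703.07038v2"))                # ('1703.07038', 2)
```

Two items with the same base ID but different version numbers would then be different versions of the same preprint.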

However, a preprint is usually intended to lead to a peer-reviewed publication. That publication should always be cited over the preprint, as peer-reviewed material carries more weight than a preprint. So as soon as the preprint material is published, it is better to use the published metadata.
So from a practical perspective, to fit how the majority of users work, Zotero has decided to prioritize the final publication when importing from preprint repositories. I see that this decision is actually consistent with the "Export BibTeX Citation" feature in arXiv, which exports the metadata of the final publication.

So, for a preprint that has not been published yet, there is the new Preprint item type to deal with it. And after publication, the final publication is preferred. This is nice, and I understand that it is probably the best compromise for most users.

But this still leaves a few issues:

1) The URL included in the import from arXiv is still the arXiv website. This is not consistent with the "Export BibTeX Citation" feature in arXiv, which shows the URL of the final publication. To be consistent with the logic of creating the proper metadata for citation, the import should also keep the URL of the final publication instead of the link to the arXiv website.

2) Considering that Zotero is able to import the PDF file of the final publication directly from the DOI (when it can be downloaded), I guess it could just as well do so when importing from arXiv? If the Zotero tools are already there to fully import the final publication anyway, the question becomes: is it really useful or suitable to keep any information about the preprint in an item focused on the final publication?

I agree that it is nice to keep the information about the preprint of a paper together with the final publication. But you still need to decide what is the best way to do it. You can consider two main strategies:

I) Keep all the information under the same item. In that case, it would be important to add the following features to store all this information properly:
a) Improved storage of a secondary URL to keep a clickable link to the arXiv URL, for these reasons.
b) Store the arXiv ID and/or the arXiv DOI. At the moment, only the arXiv ID is kept in the Extra field as "arXiv:1703.07038 [physics]".
c) Improved management of secondary attached files for these reasons. The "arXiv" labelling is nice, but it would need to be more robust when using automatic file renaming features or using different import sources.
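
As an illustration of point b), here is a hedged Python sketch that reads the arXiv ID from an Extra field like the one quoted above and derives the matching arXiv DOI and URL from it. The function name `arxiv_info_from_extra` and the returned dictionary keys are hypothetical, not part of Zotero; the DOI pattern `10.48550/arXiv.<id>` is the one arXiv uses for the DOIs it registers itself.

```python
import re


def arxiv_info_from_extra(extra):
    """Extract the arXiv ID from an Extra field and derive the arXiv DOI and URL.

    Returns None when no line of the field starts with "arXiv:".
    """
    match = re.search(r"^arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", extra, flags=re.MULTILINE)
    if match is None:
        return None
    arxiv_id = match.group(1)
    return {
        "arxiv_id": arxiv_id,
        # arXiv-issued DOIs follow the pattern 10.48550/arXiv.<id>
        "arxiv_doi": f"10.48550/arXiv.{arxiv_id}",
        "arxiv_url": f"https://arxiv.org/abs/{arxiv_id}",
    }


info = arxiv_info_from_extra("arXiv:1703.07038 [physics]")
print(info["arxiv_doi"])  # 10.48550/arXiv.1703.07038
print(info["arxiv_url"])  # https://arxiv.org/abs/1703.07038
```

With something like this, the secondary URL and the arXiv DOI would not need to be stored separately as long as the arXiv ID is kept.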

II) Split the different versions into different items, and add connections between them:
a) From this perspective, it would make more sense to import only the arXiv metadata and files when importing from arXiv.
b) For those who actually want to import the metadata of the final publication, it is already nearly as fast to go to the final publication page and import from there.
c) The main drawback of this process is that you still need to add the link between the preprint and the final publication manually. But it should probably be possible to add the 2 items at once from the arXiv import if necessary, and add the Related link automatically.
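
Strategy II could be sketched roughly as follows. This is a toy data model, not Zotero's API: the `Item` class, the `import_from_arxiv` function and the key format are all hypothetical. Only the item type names "preprint" and "journalArticle" mirror the ones Zotero uses.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    key: str            # hypothetical library key
    item_type: str      # "preprint" or "journalArticle"
    title: str
    url: str
    related: set = field(default_factory=set)  # keys of related items


def import_from_arxiv(arxiv_id, title, published_doi=None):
    """Create a preprint item and, if a publication DOI is known,
    a second item for the final publication, related to each other."""
    preprint = Item(
        key=f"arxiv:{arxiv_id}",
        item_type="preprint",
        title=title,
        url=f"https://arxiv.org/abs/{arxiv_id}",
    )
    items = [preprint]
    if published_doi:
        article = Item(
            key=f"doi:{published_doi}",
            item_type="journalArticle",
            title=title,
            url=f"https://doi.org/{published_doi}",
        )
        # Add the Related link in both directions.
        preprint.related.add(article.key)
        article.related.add(preprint.key)
        items.append(article)
    return items


items = import_from_arxiv(
    "1703.07038",
    "Transition in a numerical model of contact line dynamics and forced dewetting",
    published_doi="10.1016/j.jcp.2018.06.078",
)
for item in items:
    print(item.item_type, item.url, sorted(item.related))
```

Each item keeps its own URL and metadata, and the link between the two is created at import time rather than by hand.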

Depending on the strategy you choose, you will also need to solve the following issues:

3) How do you deal with an arXiv entry that has received a more recent version, or has been published in a peer-reviewed journal? There are probably nice ideas to take from the Retracted Items feature, including the identification of outdated items and some warning when you try to cite a preprint that has received an update or has been published in a peer-reviewed article.

4) When you have identified the outdated item, you will need to update it.
This could also use the arXiv or DOI lookup feature, either to create a new entry, to overwrite the old one with the new one, or to keep all versions in the same item.

5) If you already have the updated data in a separate item, you will need to decide the behaviour of the Duplicate Items and Merge Items features. At the moment, it is not possible to merge items of different types, so a new merging process will need to be designed.

6) Related to points 4 & 5, you will need to be able to identify whether you already have a more recent version of the item in your library.
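
For point 6, a minimal Python sketch of such a check is given below, assuming each library entry is a plain dictionary with hypothetical `arxiv_id`, `version` and `type` fields. It treats an entry as superseded by a later arXiv version of the same preprint, or by a journal article carrying the same arXiv ID. The example versions are made up for illustration.

```python
def find_superseding(item, library):
    """Return library entries that supersede `item`: a later arXiv version
    of the same preprint, or a published article with the same arXiv ID."""
    newer = []
    for other in library:
        if other is item or other["arxiv_id"] != item["arxiv_id"]:
            continue
        if item["type"] != "preprint":
            continue  # a published article is never treated as outdated here
        if other["type"] == "journalArticle":
            newer.append(other)
        elif other["type"] == "preprint" and other["version"] > item["version"]:
            newer.append(other)
    return newer


v1 = {"arxiv_id": "1703.07038", "version": 1, "type": "preprint"}
v2 = {"arxiv_id": "1703.07038", "version": 2, "type": "preprint"}
published = {"arxiv_id": "1703.07038", "version": None, "type": "journalArticle"}
library = [v1, v2, published]

print(len(find_superseding(v1, library)))         # 2
print(len(find_superseding(published, library)))  # 0
```

A warning similar to the one for retracted items could then be shown whenever this list is not empty.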

Eventually, both strategies I and II could be available simultaneously, with the preference decided by each user. At the moment, strategy I is probably the preferred one, keeping the option to manually enter the preprint information for the users who really need it. But strategy II may be the most comprehensive way to deal with the problem, and the one in which the additional features needed are easiest to implement.

Anyway, it is already nice the way it is, and everything discussed above is only a set of incremental improvements that may come in the future. Thanks!