Data models

CARomein · September 22, 2022

I do a lot of work training language models (Handwritten Text Recognition or OCR models) within various tools. Each model is then made public within that tool for others to use. I would like these models to be correctly referenced when people use them in their publications.

Is there any way of getting these recognised as publication outputs as well, perhaps with an item type distinct from Dataset (see https://www.zotero.org/support/dev/translators/datasets), or would you recommend something different?

For example, I would like to have fields for referencing the tool used (e.g. Transkribus, Kraken) and any ID number the model has within that system.

adamsmith · September 22, 2022

How is this typically cited? I'd have figured Software should work well for this.

CARomein · September 23, 2022

That is an interesting suggestion. I will definitely look into it. For about 5 years, historians and people in DH have used 'data models' but, admittedly, do not cite them, nor do they often show that they have been using tools to organise their sources/data. Again, because there is no 'type' for citing these data models, it is either not done or done inconsistently.

bwiernik · September 23, 2022

Can you say explicitly what you mean? From what I can see, “Software” is exactly the correct type for such an item.

CARomein · September 23, 2022

Well, where software, in my opinion, is the entire 'tool', this would be a component within that tool, created by an individual user.

So if I use Transkribus (transkribus.eu), that constitutes the software; but I am building a small piece (an HTR model) within that tool, which I share publicly for others to use. It is that component that I'd like to be able to reference properly (given that I have transcribed hundreds of pages to build such a Handwritten Text Recognition model).

adamsmith · September 23, 2022

I disagree -- just like a Python script is software even though it's built using other software (i.e. Python), an AI model built with software is still software. This is also how I see e.g. NLP models built on training data being cited in computer science, e.g.

I think there is a general issue that non-traditional outputs are rarely cited properly in academia, but I don't think that's due to reference managers but rather due to slow-changing norms.
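
To make the "Software" suggestion concrete, here is a minimal sketch of how an HTR model might be recorded as a CSL-JSON item of type `software`. The field mapping is only an assumption for illustration, not an established Zotero or CSL convention: the hosting tool goes in `publisher` and the model's ID within that tool goes in `number`. The function name `htr_model_to_csl` and all example values (title, names, ID) are hypothetical placeholders.

```python
import json


def htr_model_to_csl(title, family, given, year, tool, model_id, url=None):
    """Build a CSL-JSON-style dict describing an HTR model as software.

    Assumed mapping (illustrative only):
      tool     -> "publisher"  (the platform hosting the model)
      model_id -> "number"     (the model's ID within that platform)
    """
    item = {
        "type": "software",
        "title": title,
        "author": [{"family": family, "given": given}],
        "issued": {"date-parts": [[year]]},
        "publisher": tool,
        "number": str(model_id),
    }
    if url:
        item["URL"] = url
    return item


# Hypothetical example values, not a real model record.
example = htr_model_to_csl(
    title="Example HTR model for 17th-century Dutch hands",
    family="Doe",
    given="Jane",
    year=2022,
    tool="Transkribus",
    model_id=12345,
)
print(json.dumps(example, indent=2))
```

Whether a given citation style actually prints `publisher` or `number` for software items depends on the style, so the output is worth checking in the target style before relying on this layout.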