MetaData Tools

Alrich · March 13, 2025

I searched MetaData because I have pdfs which may have metadata on the internet, but not in the document. I want to use metadata lookup to add the metadata to my homemade pdfs.

This would be if I had to copy the text from the website, put it in Word and then create a PDF. The website offered a PDF but it was full of junk. So maybe that PDF has metadata and I want to add it to my PDF.

I have a program that allows me to add metadata to the PDF (BeCyPDFMetaEdit). I want to know: what data do these metadata servers request? Can I use whatever DOI or ISBN the PDF uses, or do I need the full title, author, etc.? By that point I might as well have entered the metadata myself.

I also encounter a lot of pdfs in the non-academic quarters (government, business and non-profits) which do not even trouble to get a DOI. Who issues these things anyway, and can we request DOIs to be assigned? I'm going to guess it costs money, but these folks must know that important white papers, reports and position papers need to be registered, for citation in papers. I guess we're not going to get the Library of Congress to write a rule this year...

Since there is so much diversity in this environment, we're all just trying to do the best we can. I would find it helpful if there were a widget that would scan the text of a pdf for metadata, or allow me to copy the portion of a web page with the name of the publisher or organization, and drop that copy into the widget to enter into fields. There should be html in the background to guide that assignment, and the user should expect to correct errors.

In my ideal Zotero iteration, Zotero writes to the PDF whatever we enter into the metadata fields. Since Zotero already reads PDFs, it could write to them too. Any PDF that gets dropped would automatically have a "new item" side bar, with unfilled fields. Entering a DOI might help track down metadata for a corresponding published PDF, or the user might need to fill in the fields manually or from the widget.

As an optional feature, Zotero would have a check box for any PDF lacking a DOI or ISBN, "Request registration".

Enough for MetaData. Look for new topic "merge".

aborel · March 13, 2025

In order to mint a DOI, a publisher (private or public, one-person or multinational corporation) must:

1) have a working contract with Crossref, DataCite, or possibly another DOI registration organization - which involves some kind of payment;
2) provide the necessary metadata to describe the digital object.

So there can't be a DOI without the metadata and some money. Who is supposed to supply that, in your ideal situation?

And PDFs don't actually contain a lot of structured metadata, so it is difficult to estimate how reliable an automatic extraction would be. Title: probably. Authors: I'm not so sure. Publisher: maybe. Number of pages: OK, that one should be easy :-). Document type: forget it. Etc. As for storing structured metadata into a PDF, there are certainly ways to do it in principle but I'm not sure there is a standard choice that you can expect to be recognized by widely used software (such as Zotero or others).

Alrich · March 13, 2025

Well, maybe the rules need to be relaxed or streamlined so that publishers can pay $20 for a single paper and register online? It's an ideal world, not the world we have. But for sure what you describe sounds clunky. No button in Zotero anyway!!

Otherwise, yes, that basic data would be important and useful. Well, often enough they have a DOI in the document. The DOI is metadata, right?

Anyway, I know you are busy and I appreciate that you take time to answer questions in the forum. Right now I need to go weed my Zotero library.

adamsmith · March 13, 2025

If there's a DOI in a paper, Zotero should be able to automatically grab the metadata when you drag the file to Zotero.

(And DOI registration has to require a membership because organizations issuing DOIs need to have some sort of plan to keep the identified object accessible, e.g. by updating a URL if an item/site moves. The actual cost of DOIs, especially Crossref ones, is quite low: https://www.crossref.org/fees/#annual-membership-fees )
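On the earlier question of what a metadata server needs: for a Crossref-registered DOI, the DOI alone is enough. The sketch below is only an illustration and is not how Zotero itself is implemented. It assumes Crossref's public REST API at https://api.crossref.org/works/ and the usual shape of its JSON reply; the function names (crossref_url, summarize, fetch_metadata) are made up for this example. DOIs registered elsewhere (e.g. DataCite) would need a different service.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Crossref's public REST API; only Crossref-registered DOIs resolve here.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the lookup URL for one DOI, percent-encoding unusual characters."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def summarize(message):
    """Pick a few citation fields out of the 'message' part of a Crossref reply."""
    authors = []
    for person in message.get("author", []):
        parts = [person.get("given"), person.get("family")]
        full_name = " ".join(p for p in parts if p)
        # Organizational authors may carry a single "name" field instead.
        authors.append(full_name or person.get("name", ""))
    titles = message.get("title") or [""]
    return {
        "title": titles[0],
        "authors": authors,
        "publisher": message.get("publisher", ""),
        "type": message.get("type", ""),
        "doi": message.get("DOI", ""),
    }


def fetch_metadata(doi, timeout=10):
    """Query Crossref over the network and return the summarized record."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as response:
        payload = json.load(response)
    return summarize(payload["message"])
```

Calling fetch_metadata with a DOI needs network access; summarize can be tried offline on a saved reply.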

aborel · March 13, 2025

It's fairly easy to distinguish a DOI in a string of text automatically: it is constructed with a precise structure, and a computer can use that with sufficient confidence. That's why it works.
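To make the "precise structure" concrete: a DOI starts with "10.", followed by a numeric registrant prefix, a slash, and a suffix. A minimal sketch of spotting that in text follows. The regular expression is a common heuristic (similar to patterns Crossref has suggested for modern DOIs), not an official grammar and not necessarily the one Zotero uses; find_dois is a hypothetical helper name.

```python
import re

# Heuristic: "10." + 4-9 digit prefix + "/" + suffix made of common DOI characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Sentence punctuation right after a DOI gets swept up by the pattern,
        # so trim it. (A DOI that really ends in one of these would be cut short.)
        doi = match.group(0).rstrip(".,;:)")
        if doi not in found:
            found.append(doi)
    return found
```

For example, find_dois("See https://doi.org/10.1000/xyz123.") picks out 10.1000/xyz123 and drops the sentence-ending period.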

And yes, there are commitments beyond just paying a few bucks for a DOI (which is actually the right order of magnitude for the price).

Alrich · March 13, 2025

@adamsmith: "If there's a DOI in a paper, Zotero should be able to automatically grab the metadata when you drag the file to Zotero."

Even if published in the text of the document? That makes it easier, and I think what that means is that if I include the putative DOI in my pdf, Zotero will identify the document I am trying to store.

adamsmith · March 13, 2025

Yes, Zotero looks for ISBNs and DOIs on the first couple of pages of PDFs.
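ISBNs can be recognized with reasonable confidence for a similar reason to DOIs: the last character is a check digit computed from the others. The sketch below is an illustration only, not Zotero's code. It assumes the text of the first pages has already been extracted from the PDF, and it only considers ISBNs written with hyphens or as unbroken digits; find_isbns and the two validators are hypothetical names.

```python
import re


def is_valid_isbn13(digits):
    """ISBN-13: digits weighted 1,3,1,3,... must sum to a multiple of 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def is_valid_isbn10(chars):
    """ISBN-10: values weighted 10 down to 1 must sum to a multiple of 11."""
    if len(chars) != 10 or not chars[:9].isdigit() or chars[9] not in "0123456789X":
        return False
    values = [int(ch) for ch in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    total = sum(value * (10 - i) for i, value in enumerate(values))
    return total % 11 == 0


# Candidate runs of digits and hyphens, optionally ending in X (ISBN-10 check digit).
CANDIDATE = re.compile(r"\b\d[\d-]{8,15}[\dXx]\b")


def find_isbns(text):
    """Return normalized ISBNs from text whose check digit is correct."""
    found = []
    for match in CANDIDATE.finditer(text):
        normalized = match.group(0).replace("-", "").upper()
        if is_valid_isbn13(normalized) or is_valid_isbn10(normalized):
            if normalized not in found:
                found.append(normalized)
    return found
```

The check digit is what lets a program reject most random digit strings (phone numbers, report numbers) that merely look like an ISBN.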

aborel · March 13, 2025

https://www.zotero.org/support/retrieve_pdf_metadata , section "How it works".

Alrich · March 22, 2025

The problem is that too many don't have a number at all. No help with that in this topic. Thank you for educating me!

aborel · March 22, 2025

The point of this last link was just information - so that you understand the current functionality, and some important differences from your suggestions.

Alrich · March 26, 2025

Of course, thank you.