Reverse keyword search?

sdspieg · September 7, 2021

[Slightly off-topic (AND TL;DR to boot :)), but this is such a great community that somebody may still have an answer OR may get inspired to create something like this]

One of the banes of our existence as researchers is the limited keyword-based search query syntax that many mainstream full-text publication databases (think ProQuest, EBSCO, etc.) still use. Some of them have started using more advanced NLP technologies like semantic similarity to identify 'similar' publications AFTER an initial keyword-based input, but I am not aware of any of them allowing for truly semantic search from the get-go.

The literature on (professional) keyword-based search suggests that a serious query that provides excellent precision AND recall takes weeks (if not months) to build. That has also been our experience. IF you're not just looking for some 'good' results from a simple keyword-based search query to write a paper or so, but are instead looking for a more sophisticated search query that leverages all available search operators (Boolean, proximity, fuzzy, lemmatizing, nested, etc.) to yield a corpus that has BOTH high precision AND high recall, THEN the current search methods really fall short.

So here comes my question: does anybody know of any code that would

allow a user to 'feed' a model a bunch of 'perfect' sentences (the types of sentences that we all pick up on when we look at the results page of a submitted search to determine whether or not an article is relevant), and

that would then generate a Boolean keyword-based search query (hopefully using all available search operators)?

So the workflow I'm thinking about here is the following:

researchers start by generating a bunch of 'brilliant sentences' - exactly the types of sentences they'd hope a search query would yield in these databases
- they could highlight them in a selection of 'excellent' papers they already have in some Zotero library (we've been using the new Zotero PDF-highlighting tool for this; see also here)
- they could just make up these sentences themselves
- they could also use some of the new (and amazing) NLG (natural language generation) tools (for an example, see here)

they'd feed those sentences to some algorithm that would spit out a sophisticated keyword-based search query

that search query is then what they would enter in the full-text academic databases to generate a corpus with high precision AND high recall

the results could then be
- downloaded into Zotero for bibliographical management purposes; and
- also used in a different format (e.g. JSON) for further NLP-enhanced analysis.
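
To make the middle step of this workflow a bit more concrete, here is a deliberately naive Python sketch of the 'sentences in, query out' idea. It is not existing code that I know of, just an illustration: the function names (`tokenize`, `truncate`, `build_query`), the tiny stopword list, the `*` truncation wildcard and the `NEAR/n` proximity spelling are all assumptions, and each database has its own operator syntax. It simply keeps, for every seed sentence, the terms shared by the most seed sentences and joins them with a proximity operator, then ORs the resulting clauses together.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real tool would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is", "are",
             "that", "this", "with", "by", "as", "or", "from", "at", "it"}


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic, non-stopword tokens."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS and len(w) > 2]


def truncate(term, stem_len=5):
    """Crude stand-in for lemmatizing: cut long words and append a wildcard."""
    return term[:stem_len] + "*" if len(term) > stem_len else term


def build_query(sentences, terms_per_sentence=2, window=10):
    """Turn seed sentences into one OR-of-proximity-clauses query string."""
    token_lists = [tokenize(s) for s in sentences]
    # In how many seed sentences does each term occur?
    df = Counter(t for tokens in token_lists for t in set(tokens))
    clauses = []
    for tokens in token_lists:
        # Prefer terms shared by many seed sentences; break ties alphabetically.
        ranked = sorted(set(tokens), key=lambda t: (-df[t], t))[:terms_per_sentence]
        if not ranked:
            continue
        clause = "(" + f" NEAR/{window} ".join(truncate(t) for t in ranked) + ")"
        if clause not in clauses:  # skip duplicate clauses
            clauses.append(clause)
    return " OR ".join(clauses)


seeds = [
    "Deterrence by denial reduces the probability of escalation.",
    "Nuclear deterrence failed to prevent escalation in the crisis.",
    "Sanctions rarely change the behaviour of authoritarian regimes.",
]
print(build_query(seeds))
# (deter* NEAR/10 escal*) OR (autho* NEAR/10 behav*)
```

Anything serious would obviously need much more than this (real lemmatizing, synonym expansion, nesting, and some way to measure precision and recall of the generated query against a known set of relevant papers), but it shows the shape of the tool I'm asking about.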

Thoughts? Suggestions?