Tracking if phrase exists within a list of terms - arrays

I am having difficulty finding a formula to do exactly what I am looking for.
I have two lists, one containing search phrases like ("Sound bars for tv") and another list that contains individual terms like ("TV", "Sound", "bars").
My goal is to check, for each search phrase, whether every word in it appears in the individual term list.
So for "Sound bars for TV", every one of those words would need to be in the term list for it to come back as TRUE. Also, and more complicated, if I have the search phrases "Soundbar" and "Sound Bar", these should both pass if both terms are in the list.
Any idea what the best way to approach this is?
I have tried the following unsuccessfully:
Individual terms = the list of terms like "TV", "Sound", "Bars"
Phrase = search phrases like "Sound bars for TV"
The goal would be to create a formula that says "Yes" every word in "Sound bars for TV" is within the Individual terms list.
=SUMPRODUCT(--ISNUMBER(SEARCH(individual terms,phrase)))=COUNTA(individual terms)
=IF(ISNUMBER(SEARCH(phrase,individual terms)), "Yes", "No")
=SUMPRODUCT(--ISNUMBER(SEARCH(individual terms,phrase)))>0

Let's pretend you have a data setup like this:
Column D was made into an Excel table (with Insert -> Table) and named tblTerms. This lets you add and remove terms from the list dynamically.
Now in cell B2 and copied down is this formula:
=SUMPRODUCT(--(COUNTIF(tblTerms[Search Terms],TRIM(MID(SUBSTITUTE(A2," ",REPT(" ",LEN(A2))),LEN(A2)*(ROW(A$1:INDEX(A:A,LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1))-1)+1,LEN(A2))))=0))=0
Note that you'll have to add "Soundbars" separately to the search terms list. There's not really any way for Excel to recognize individual words in a compound word, and attempting to do that would be extremely unwieldy, even with VBA.

This will parse the string, count the matches, and then compare that count to the number of "words" in the string. If they match, it will return "Yes".
=IF(SUMPRODUCT(COUNTIF(D:D,TRIM(MID(SUBSTITUTE(A1," ",REPT(" ",999)),(ROW($XFD$1:INDEX($XFD:$XFD,LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1))-1)*999+1,999))))=LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1,"Yes","No")
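For readers who find the nested formula hard to follow, the same idea can be sketched outside Excel: split the phrase on spaces and check that each word is in the term list. This is only an illustration, not a replacement for the formulas; the function name all_words_in_terms and the sample lists are made up for the example, and the comparison is case-insensitive to mirror COUNTIF.

```python
def all_words_in_terms(phrase, terms):
    """Return True when every space-separated word of phrase is in terms."""
    # Normalise the term list once; COUNTIF in Excel ignores case as well.
    known = {t.strip().lower() for t in terms}
    words = phrase.lower().split()
    # An empty phrase has no words to check, so report no match.
    return bool(words) and all(w in known for w in words)


# Hypothetical sample data: note that "for" has to be listed too.
terms = ["TV", "Sound", "bars", "for"]
print(all_words_in_terms("Sound bars for tv", terms))  # True
print(all_words_in_terms("Soundbar", terms))           # False until "Soundbar" is added
```

As in the first answer, a compound such as "Soundbar" only passes once it is added to the list as its own term.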

Related

How to eliminate duplicates in textjoin from arrayformula

I have a borrowed piece of code that works on Google Sheets except for duplicates.
The flat-file database is an index of technical articles in forty years of a monthly technical journal.
On a separate sheet there is a list of "tags" which represent keywords that might be used in searches.
This formula looks at three text fields in columns F, G, H - article title, article description, and core skills - compares every word to the list of tags, which is a named range called "tags", and uses TEXTJOIN to list all the "hits" - every word in the three columns that also exists on the tag list.
=TEXTJOIN(", ",TRUE,ArrayFormula(IF(ISNUMBER(SEARCH(tags,$F3369:$H3369)),tags,"")))
But it lists every instance despite there being duplicates. Here are a few examples of the results:
"cabinet parts, noises, string leveling, string leveling, tools, touchweight, touchweight, verticals"
"regulating, repetition, repetition"
"tuning, tuning, tuning"
I want to eliminate duplicates. Is this possible?
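The duplicates appear to arise because each tag is tested against each of the three columns separately, so a tag found in two columns is listed twice. A minimal sketch of the intended logic, written in Python rather than as a Sheets formula: test each tag once against all fields and record it at most once. The function name matching_tags and the sample row are assumptions for illustration only.

```python
def matching_tags(tags, fields):
    """Join the tags found in any field, each listed once, in tag-list order."""
    lowered = [f.lower() for f in fields]
    seen = set()
    hits = []
    for tag in tags:
        key = tag.lower()
        # Substring test per field, similar to SEARCH; skip tags already listed.
        if key not in seen and any(key in f for f in lowered):
            seen.add(key)
            hits.append(tag)
    return ", ".join(hits)


# Hypothetical row: "tuning" occurs in all three fields but is reported once.
row = ["Tuning tips", "Notes on tuning and repetition", "tuning"]
print(matching_tags(["regulating", "repetition", "tuning"], row))  # repetition, tuning
```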

solr search with whitespaces and without whitespaces

I want to search products in the documents both with and without whitespace, like "base ball" and "baseball".
If someone searches for "baseball", the result should fetch the records for both "baseball" and "base ball".
I am not able to do that, and I do not want to use synonyms for it.
I have used the filter class "WordDelimiterFilterFactory". To get those results I currently put keywords in the synonyms file, like sunglass for sun glass and keychain for key chain.
But there will be many more words like this, so it is difficult to find every word whose meaning stays the same after being split.
So I am looking for a solution where I don't have to use synonyms to get the desired result.
I've tried setting catenateWords='1', but that did not match the result either.
This is not possible without adding synonyms. You should add "base ball" as a synonym for "baseball".
The WordDelimiterFilterFactory is deprecated.
Even if you use WordDelimiterGraphFilterFactory, it's not possible.
generateWordParts: It splits words at camel case, like BaseBall, but that is not your case.
catenateWords: It also won't work in your case, as your word has no special character or hyphen separating the parts to be joined. E.g., "wi-fi" becomes "wifi".
So your data has to contain the separate words to be indexed. That is, if you don't want to use synonyms, you have to push both "baseball" and "base ball" for indexing; only then will you be able to search on these words.
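The point about catenateWords can be illustrated with a small sketch. This is a simplified stand-in written for this example, not Solr's implementation: it splits a single token on non-alphanumeric delimiters and adds the joined form, which is roughly what the answer describes for "wi-fi". The function name word_parts_with_catenation is hypothetical.

```python
import re


def word_parts_with_catenation(token):
    """Split one token on non-alphanumeric delimiters and add the joined form."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if len(parts) > 1:
        # Only a token that actually contained a delimiter gets a joined variant.
        return parts + ["".join(parts)]
    return parts


print(word_parts_with_catenation("wi-fi"))     # ['wi', 'fi', 'wifi']
print(word_parts_with_catenation("baseball"))  # ['baseball']
# After whitespace tokenization "base ball" is already two separate tokens,
# so each one is processed alone and nothing is joined.
print([word_parts_with_catenation(t) for t in "base ball".split()])  # [['base'], ['ball']]
```

Neither "baseball" nor "base ball" ever produces the other form in this sketch, which is why the answer falls back on synonyms or on indexing both spellings.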

Solr wildcard query on multiple words in text field

I'm searching for "foo" followed by "bar" in a text field named "doc".
My query needs to match the text "foo walks into a bar" but not "bar has place for foo"
I've seen a few similar questions, but no concrete answer.
Queries that don't work:
q=doc:foo*bar
q=doc:/.*foo.bar./
It seems that this is because each word in the text field is tokenized separately. Is there a way to get around this? (Note: I can't change the field type)
Have a look at the Surround Query Parser and at the Complex Phrase Query Parser.
The SurroundQParser enables the Surround query syntax, which provides
proximity search functionality.
There are two positional operators: w creates an ordered span query
and n creates an unordered one. Both operators take a numeric value
to indicate distance between two terms. The default is 1, and the
maximum is 99.
Note that the query string is not analyzed in any way.
Example:
{!surround} 3w(foo, bar)
This example would find documents where the terms "foo" and "bar" were
no more than 3 terms away from each other (i.e., no more than 2 terms
between them).
Regarding the Complex Phrase Query Parser, pay attention to the inOrder parameter, which lets you specify the order of the matched keywords.
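To make the ordered-proximity idea concrete, here is a minimal Python sketch of what an ordered span match checks: the first term must occur before the second, no more than a given number of positions away. It is an illustration on whitespace-split tokens, not how Solr evaluates the query, and the function name ordered_within is made up for the example.

```python
def ordered_within(text, first, second, max_distance):
    """True when `first` occurs before `second`, at most max_distance tokens apart."""
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    # Ordered match: the second term must come strictly after the first.
    return any(
        0 < j - i <= max_distance
        for i in first_positions
        for j in second_positions
    )


print(ordered_within("foo walks into a bar", "foo", "bar", 4))    # True
print(ordered_within("foo walks into a bar", "foo", "bar", 3))    # False
print(ordered_within("bar has place for foo", "foo", "bar", 99))  # False
```

By the definition quoted above, "foo walks into a bar" has three terms between "foo" and "bar", so a distance of 3 would be too small for that sentence and a value of at least 4 would be needed.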

Solr search on concatenated text

I would like to make a Solr query, which would for data like
{ "date": ..., "project": ..., "text": ... }
do in order:
1. Filter them by date range
2. Group by project, so that I get a single row per project with the text concatenated
3. On top of this, do a full-text search, with conditions like some word being present in the concatenated text or not
As a result, I want to get the projects whose related texts (for each project, all texts from the given date range concatenated) contain or don't contain some phrases, depending on a query parameter.
I need to have documents separated so I can filter them by date range, but after filtering I need to concatenate them by project field, so I can make a full-text search query on them as if I would keep whole text for given project together.
I was able to find that for related things it's possible to do something like:
&fq=date:[2013-07-17T00:00:00Z TO 2013-07-20T00:00:00Z]
&q=+text:mars -text:venus
I don't know how to do step 2, or how to do step 3 so that it's applied to the concatenated texts (at the end). I found that there's a grouping feature, but I don't know how to concatenate the text in each group so that I get a single entry per group to apply step 3 to.
Is it possible to do such a query in Solr? If so, how should it be done properly? If not, is it possible to do it efficiently with something other than Solr?
Thanks for help.
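To pin down the intended semantics, the three steps can be written as a plain Python sketch over in-memory records. This is not a Solr query and does not answer whether Solr can do it; it only shows the expected result. The function name projects_matching, the single include/exclude word, and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def projects_matching(docs, start, end, must_have, must_not_have):
    """Filter by date, concatenate text per project, then test the combined text."""
    combined = defaultdict(list)
    for doc in docs:
        # Step 1: keep only documents inside the date range.
        if start <= doc["date"] <= end:
            # Step 2: collect the text per project.
            combined[doc["project"]].append(doc["text"].lower())
    result = []
    for project, texts in combined.items():
        words = set(" ".join(texts).split())
        # Step 3: apply the word conditions to the concatenated text.
        if must_have in words and must_not_have not in words:
            result.append(project)
    return sorted(result)


docs = [
    {"date": date(2013, 7, 17), "project": "a", "text": "mission to mars"},
    {"date": date(2013, 7, 18), "project": "a", "text": "venus flyby"},
    {"date": date(2013, 7, 19), "project": "b", "text": "mars rover"},
    {"date": date(2013, 8, 1), "project": "b", "text": "venus probe"},
]
print(projects_matching(docs, date(2013, 7, 17), date(2013, 7, 20), "mars", "venus"))  # ['b']
```

Project "a" is excluded because "venus" appears in one of its in-range texts, while the "venus" text of project "b" falls outside the date range and is ignored.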

Does google search API eliminate stop words?

Consider the search query "I Love you" in the Google search API.
In this query, "I" and "you" are stop words, and they occur in almost every document. The keyword in this search is "Love", which is what should be searched for. So there must be a process to detect the stop words and eliminate them from the document list we feed to the API. Does Google do this automatically in their search API, or do we have to process the search query before firing it? If Google already uses an IDF (Inverse Document Frequency) table to eliminate (or de-prioritise) the stop words, how do they do it? If not, how can we eliminate those stop words? Does the algorithm (if any) work for other (vernacular) languages too?
Link to Google search API here
The Google full-text search API does not eliminate stop words.
If you perform a global search with the query "I Love you", you will only get documents that contain all three words, not documents that contain just the stop words.
The white space between words, quoted strings, numbers, and dates is
treated as an implicit AND operator.
If you want the same behaviour while searching within a field, here is one approach:
If you enclose your query in parentheses, the search will only return documents that contain all the words in the query.
For the case "I Love you", search query should be:
field_name = "(I Love You)"
or
field_name = "(I AND Love AND You)"
This way you will only get documents that contain all the words and not just stop words.
You can just search for the word "Love" in the index.
If you want to search for the word anywhere in the text, you can use wild card operator *
field_name = "Love*"
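If the query string is assembled in code, the explicit AND form shown above can be built from the words of the phrase. A minimal sketch, assuming the query is passed to the API as a plain string in the format quoted in this answer; the helper name all_words_query is hypothetical.

```python
def all_words_query(field_name, phrase):
    """Build a field query that requires every word of the phrase."""
    joined = " AND ".join(phrase.split())
    # Mirrors the format used in the answer: field_name = "(I AND Love AND You)"
    return f'{field_name} = "({joined})"'


print(all_words_query("field_name", "I Love You"))  # field_name = "(I AND Love AND You)"
```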
