How to eliminate duplicates in textjoin from arrayformula - arrays

I have a borrowed piece of code that works on Google Sheets except for duplicates.
The flat-file database is an index of technical articles in forty years of a monthly technical journal.
On a separate sheet there is a list of "tags" which represent keywords that might be used in searches.
This formula looks at three text fields in columns F, G, H - article title, article description, and core skills - compares every word to the list of tags, which is a named range called "tags", and uses TEXTJOIN to list all the "hits" - every word in the three columns that also exists on the tag list.
=TEXTJOIN(", ",TRUE,ArrayFormula(IF(ISNUMBER(SEARCH(tags,$F3369:$H3369)),tags,"")))
But it lists every instance despite there being duplicates. Here are a few examples of the results:
"cabinet parts, noises, string leveling, string leveling, tools, touchweight, touchweight, verticals"
"regulating, repetition, repetition"
"tuning, tuning, tuning"
I want to eliminate duplicates. Is this possible?
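The repeats most likely come from the shape of the comparison: SEARCH checks every tag against each of the three cells, so a tag found in two or three columns is emitted once per column. Below is a minimal Python sketch of the intended logic (not a Sheets formula), in which each tag is reported once if it occurs in any field. The function name and the sample row are hypothetical; matching is a case-insensitive substring test, which is how SEARCH behaves.

```python
def join_unique_hits(tags, fields, sep=", "):
    """Join every tag that occurs in at least one field, listing each tag once."""
    hits = []
    for tag in tags:
        needle = tag.lower()
        # One test per tag across all fields, instead of one test per (tag, field) pair.
        if needle and any(needle in field.lower() for field in fields):
            hits.append(tag)
    # dict.fromkeys drops repeats (e.g. a tag listed twice) while keeping order.
    return sep.join(dict.fromkeys(hits))


# Hypothetical row: title, description, core skills (columns F, G, H).
row = ["Tuning verticals", "Notes on tuning stability", "tuning, regulating"]
print(join_unique_hits(["regulating", "repetition", "tuning"], row))  # regulating, tuning
```

The same idea in Sheets would be to de-duplicate the hits (for example with UNIQUE) before TEXTJOIN joins them; that variant is untested here.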

Related

Tracking if phrase exists within a list of terms

I am having difficulty finding a formula to do exactly what I am looking for.
I have two lists, one containing search phrases like ("Sound bars for tv") and another list that contains individual terms like ("TV", "Sound", "bars").
My goal is to see if any of the search phrases match for each keyword within the individual term list.
So for "Sound bars for TV", I would need each of those words to be in the term list for it to come back as a TRUE. Also, and more complicated, if I have the search phrase "Soundbar" and "Sound Bar" these should both pass if both terms are in the list.
Any idea what the best way to approach this is?
I have tried the following unsuccessfully:
Individual terms = the list of terms like "TV", "Sound", "Bars"
Phrase = search phrases like "Sound bars for TV"
The goal would be to create a formula that says "Yes" every word in "Sound bars for TV" is within the Individual terms list.
=SUMPRODUCT(--ISNUMBER(SEARCH(individual terms,phrase)))=COUNTA(individual terms)
=IF(ISNUMBER(SEARCH(phrase,individual terms)), "Yes", "No")
=SUMPRODUCT(--ISNUMBER(SEARCH(individual terms,phrase)))>0
Let's pretend you have a data setup like this:
Column D was made into an Excel table (with Insert -> Table) and named tblTerms. This lets you add and remove terms from the list dynamically.
Now in cell B2 and copied down is this formula:
=SUMPRODUCT(--(COUNTIF(tblTerms[Search Terms],TRIM(MID(SUBSTITUTE(A2," ",REPT(" ",LEN(A2))),LEN(A2)*(ROW(A$1:INDEX(A:A,LEN(A2)-LEN(SUBSTITUTE(A2," ",""))+1))-1)+1,LEN(A2))))=0))=0
Note that you'll have to add "Soundbars" separately to the search terms list. There's not really any way for Excel to recognize individual words in a compound word, and attempting to do that would be extremely unwieldy, even with VBA.
This will parse the string into words, count how many of them appear in the term list, and compare that count to the number of "words" in the string. If the two numbers match, it returns "Yes".
=IF(SUMPRODUCT(COUNTIF(D:D,TRIM(MID(SUBSTITUTE(A1," ",REPT(" ",999)),(ROW($XFD$1:INDEX($XFD:$XFD,LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1))-1)*999+1,999))))=LEN(A1)-LEN(SUBSTITUTE(A1," ",""))+1,"Yes","No")
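For readers who find the nested MID/SUBSTITUTE/REPT construction hard to follow, here is a rough Python sketch of the same check: split the phrase on spaces and require every word to be in the term list. The function name is hypothetical, and the case-insensitive comparison is chosen to mirror COUNTIF.

```python
def all_words_in_terms(phrase, terms):
    """Return True only if every space-separated word of phrase is in terms."""
    term_set = {term.strip().lower() for term in terms}
    words = phrase.split()
    # An empty phrase has no words to check, so treat it as a non-match.
    return bool(words) and all(word.lower() in term_set for word in words)


terms = ["TV", "Sound", "Bars"]
print(all_words_in_terms("Sound bars TV", terms))      # True
print(all_words_in_terms("Sound bars for TV", terms))  # False, "for" is not a term
```

As with the formulas, a compound such as "Soundbar" only passes if it is itself in the term list.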

Solr search on concatenated text

I would like to make a Solr query, which would for data like
{ "date": ..., "project": ..., "text": ... }
do in order:
1. Filter them by date range
2. Group by project, so that I get a single row per project with the text concatenated
3. On top of this, do a full-text search, with conditions such as whether some word is or isn't in the concatenated text
As a result, I want to get the projects whose related texts (for each project, all texts from the given date range concatenated) contain or don't contain some phrases, depending on a query parameter.
I need to have documents separated so I can filter them by date range, but after filtering I need to concatenate them by project field, so I can make a full-text search query on them as if I would keep whole text for given project together.
I was able to find that for related things it's possible to do something like:
&fq=date:[2013-07-17T00:00:00Z TO 2013-07-20T00:00:00Z]
for the date filter (step 1), and
&q=+text:mars -text:venus
for the full-text condition (step 3).
I don't know how to do step 2, or how to do step 3 so that it's applied to the concatenated texts (at the end). I found that there's a grouping feature, but I don't know how to concatenate the text in each group so that I get a single entry per group to apply step 3 to.
Is it possible to do such a query in Solr? If so, how should it be done properly? If not, is it possible to do it efficiently with something other than Solr?
Thanks for help.
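To make the desired semantics concrete, here is a small client-side Python sketch of the three steps (filter by date, concatenate per project, then test required and forbidden words). It is not a Solr query and says nothing about whether Solr can do this natively; the function name, parameters, and sample documents are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def projects_matching(docs, start, end, must=(), must_not=()):
    """Return projects whose concatenated in-range text satisfies the word conditions."""
    texts_by_project = defaultdict(list)
    for doc in docs:
        # Step 1: keep only documents inside the date range.
        if start <= doc["date"] <= end:
            # Step 2: collect the texts per project.
            texts_by_project[doc["project"]].append(doc["text"])
    matches = []
    for project, texts in texts_by_project.items():
        words = set(" ".join(texts).lower().split())
        # Step 3: apply the conditions to the concatenated text.
        if all(w in words for w in must) and not any(w in words for w in must_not):
            matches.append(project)
    return sorted(matches)


docs = [
    {"date": date(2013, 7, 18), "project": "a", "text": "mars rover"},
    {"date": date(2013, 7, 19), "project": "a", "text": "venus probe"},
    {"date": date(2013, 7, 18), "project": "b", "text": "mars base"},
    {"date": date(2013, 8, 1), "project": "b", "text": "venus flyby"},
]
print(projects_matching(docs, date(2013, 7, 17), date(2013, 7, 20),
                        must=["mars"], must_not=["venus"]))  # ['b']
```

Project "a" is excluded because its in-range texts mention both words, while project "b" passes because its "venus" document falls outside the date range.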

How does appengine's data store query and index multi-value properties?

Let's say I have a Photo class containing a multi-valued property for tags and a date field.
I would like to allow the user to perform a query based on tags (using only an AND operator for more than one tag).
For example, let's say a user searches for a rainy day.
Select * from Photo where tag='clouds' AND tag='rainy'
How does the zig-zag merge work? I know that two scans are performed, and a Photo is returned when the keys from both scans point to the same entity. Does this happen in parallel, however? For example: while Search 1 finds the first photo that contains tag 'clouds', Search 2 finds the first photo that contains tag 'rainy'. When both searches are done, it becomes synchronous. Search 1 then continues its scan until it hits the same key as Search 2. Then, while the keys for each search are the same, the photo is returned, and the "cursor" is moved along one step for each search?
Secondly, does defining multiple indexes speed up this sort of query? For example, if I wanted to allow up to 4 tags, then I would need to define indexes such as:
Index(Photo)
Index(Photo, tag)
Index(Photo, tag,tag)
Index(Photo, tag,tag,tag)
Index(Photo, tag,tag,tag,tag)
Then, performing the same query above will be quicker?
Also, using our original query, let's say we have millions of photos tagged as cloudy, but only two tagged as rainy. Does this mean zig-zag will perform relatively slowly, since one of the searches will keep trying to find a match? Even worse, what if we have one million photos tagged "rainy" and one million tagged "cloudy", yet no single photo has both tags? Will defining the above indexes fix this issue?
Lastly, let's say a photo has 100 tags. Does that mean all the indexes above have to include EVERY combination of the 100 tags?
I know there are gotchas (such as an entity can only be indexed 5000 times, and a single multi-valued property can only be indexed 1000 times).
How does the zig-zag merge work?
You can check out the Google I/O video from 2009 on Building Scalable, Complex Apps on App Engine. Brett Slatkin explains how zig-zag merge works starting at 27 minutes. As he says, "I can't really explain it without showing how it works."
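Until you get to the video, here is a simplified Python sketch of the general idea as I understand it: each tag has its own scan, sorted by entity key, and the scan that is behind jumps forward to the other scan's current key. This is an illustration only, not App Engine's actual implementation; the function name is hypothetical and the keys are plain integers.

```python
from bisect import bisect_left


def zigzag_intersect(scan_a, scan_b):
    """Intersect two sorted lists of unique keys by leapfrogging between them."""
    i = j = 0
    matches = []
    while i < len(scan_a) and j < len(scan_b):
        if scan_a[i] == scan_b[j]:
            # Both scans point at the same entity: emit it and advance both cursors.
            matches.append(scan_a[i])
            i += 1
            j += 1
        elif scan_a[i] < scan_b[j]:
            # Scan A is behind: seek directly to the first key >= B's current key.
            i = bisect_left(scan_a, scan_b[j], i + 1)
        else:
            # Scan B is behind: seek directly to the first key >= A's current key.
            j = bisect_left(scan_b, scan_a[i], j + 1)
    return matches


clouds = [1, 3, 5, 7, 9]   # keys of photos tagged 'clouds'
rainy = [5, 6, 7, 100]     # keys of photos tagged 'rainy'
print(zigzag_intersect(clouds, rainy))  # [5, 7]
```

In this sketch the two scans take turns rather than running in parallel, and a very long list paired with a very short one is cheap because the long list is skipped over with seeks. The expensive case is two long lists whose keys interleave without ever matching.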

How to implement a complex token-matching algorithm in SOLR

Problem Description
I'm trying to implement a custom algorithm to match user provided free-text input, a company name such as "Ford Motor", against a reference data source consisting of 1.4 million company names.
The algorithm executes the following steps:
Step 1) Performs an "Exact Match", followed by "Begins Match" and finally "Contains Match" of user provided search input. Results from this step are also sorted in the same order.
Step 2) Performs a token by token match of search input with reference company name.
Every token is matched in the following order: Exact, Begins, Contains, Levenshtein Distance (< 0.2) and Refined Soundex.
E.g. if the user input is "Foord Motur Holding" and it's being matched against "The Ford Motor Holdings Company", then the first token "Foord" will match "Ford" based on the Soundex match, the second token "Motur" will match "Motor" based on the edit-distance algorithm, and the last token "Holding" will match "Holdings" via the Begins match.
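The per-token cascade in Step 2 can be sketched in Python roughly as follows. This is an illustration under assumptions, not the original implementation: the function names are hypothetical, the edit distance is normalized by the longer token's length (the question does not say how the 0.2 threshold is normalized, so results may differ from the example above), and the Refined Soundex step is left out.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify(token, ref_token, max_ratio=0.2):
    """Return the first technique that matches, or None if none does."""
    t, r = token.lower(), ref_token.lower()
    if not t or not r:
        return None
    if t == r:
        return "exact"
    if r.startswith(t):
        return "begins"
    if t in r:
        return "contains"
    # Assumed normalization: distance divided by the longer token's length.
    if levenshtein(t, r) / max(len(t), len(r)) < max_ratio:
        return "levenshtein"
    return None  # a phonetic (Refined Soundex) check would go here


print(classify("Holding", "Holdings"))  # begins
```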
Scoring:
Every token match is first scored on a scale that rates the matching technique, with Exact match being the best and Soundex being the worst.
The overall score is calculated, on a scale of 0-100%, by calculating a weighted average of individual token-match scores. Weights are assigned based on index-order of token i.e. the first token has highest weight and last token has lowest.
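A minimal Python sketch of this scoring scheme is shown below. The question only states the ordering (Exact best, Soundex worst; first token heaviest), so the concrete technique scores and the linearly decreasing weights are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Assumed per-technique scores; only their ordering comes from the description.
TECHNIQUE_SCORE = {
    "exact": 1.0,
    "begins": 0.8,
    "contains": 0.6,
    "levenshtein": 0.4,
    "soundex": 0.2,
    None: 0.0,  # token did not match at all
}


def overall_score(token_techniques):
    """Weighted average of token scores on a 0-100 scale, earlier tokens weigh more."""
    n = len(token_techniques)
    if n == 0:
        return 0.0
    # Assumed weights: n for the first token down to 1 for the last.
    weights = [n - i for i in range(n)]
    total = sum(w * TECHNIQUE_SCORE[t] for w, t in zip(weights, token_techniques))
    return 100.0 * total / sum(weights)


# "Foord Motur Holding": Soundex, edit distance, then Begins match.
print(round(overall_score(["soundex", "levenshtein", "begins"]), 2))  # 36.67
```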
My Partial Solution
I have implemented a simple schema in Solr to store the reference company names: a string field (called companyName), a simple text field (called companyText) copied from the string field, and another text field (called companySoundex), also copied from the string field, which uses PhoneticFilterFactory for Refined Soundex based matching.
I have been able to replicate step 1) in a single solr query.
For step 2) I plan to fire 3 parallel queries to solr server. First query performing a simple text search on companyText field, second query performing fuzzy match using ~ operator on companyText field and third query performing soundex match on companySoundex field. I plan to somehow combine the results from these 3 parallel queries to get desired final result.
Questions:
1) Is there a better way to replicate Step 2) of original algorithm?
2) Even if I go with my "three-parallel-queries" approach, how do I get the "right" sorting order, as in the original algorithm?
I guess the main problem is how to compare the Solr scores from these 3 entirely different queries when combining the results.
Thanks for reading this long question. Any help/pointers would be greatly appreciated.
Look at the DisMax query parser. http://wiki.apache.org/solr/DisMaxRequestHandler
For each separate query, you'll actually build up separate fields in the index for matching. Then use DisMax to combine the queries in a weighted fashion.
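As a rough sketch, a weighted DisMax request over the three fields from the question could be assembled like this in Python. `defType` and `qf` are standard DisMax parameters; the boost values, the returned field list, and the helper name are assumptions for illustration and would need tuning against real data.

```python
from urllib.parse import urlencode


def dismax_query_string(user_input):
    """Build the query-string part of a DisMax request for the company fields."""
    params = {
        "defType": "dismax",
        "q": user_input,
        # Assumed boosts: exact name field highest, phonetic field lowest.
        "qf": "companyName^10 companyText^5 companySoundex^1",
        "fl": "companyName,score",
    }
    return urlencode(params)


print(dismax_query_string("Ford Motor"))
```

The resulting string would be appended to the select handler URL of your Solr core, so that a single query produces one score per document.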
I suggest giving up on your 3 parallel queries approach now. Last time I looked into this it was impossible to relate scores from 2 separate queries. It just doesn't work. If you want a single set of results sorted by score, you have to figure out how to do this in a single query.
IMHO, this functionality cannot be achieved with the out-of-the-box handlers that Solr provides. You would be better off writing a custom query handler that handles and scores the results in this manner.

Difference in scoring between multivalued field and tokenized field

For example, I have several tags per document. I can:
1. index them as a single text string, splitting on spaces using WhitespaceTokenizer (example: "tag1 tag2 tag3")
2. add them separately to a single field name multiple times using KeywordAnalyzer, for example:
doc.addField("tags", "tag1");
doc.addField("tags", "tag2");
doc.addField("tags", "tag3");
Both approaches will work. The question is how different the scoring will be for those types of indexing (i.e. field normalization factor, tf/idf count, field length calculation, slope factor, etc.).
Lucene will concatenate all the values for a multivalued field behind the scenes anyway, so it would not be much different from your first case, if at all. If you use tags only as filters (give me all docs tagged with tag2), then you definitely won't see any difference.
I would think the multi-value would be more accurate.
Imagine a tokenized string "spider web developer" vs. a multi-value field with the values "spider" and "web developer".
A search for "web developer" would match both fields, but the match against the multi-value field could be seen as more accurate.

Resources