How does Autocomplete in TwoTerm mode find the second term? - azure-cognitive-search

When I have TwoTerm selected for autocomplete mode and I do a search with "john smith is" the first 5 results are
john smith is a
john smith is active
john smith is the
john smith is an
john smith is also
Can I expect all of these results to appear in an index field as a phrase, or how does autocomplete infer the next term (a, active, the, an, also)? With search mode set to all, a search returns no results for these phrases.

No, these results do not exist in the index but the phrases
is a
is active
is the
is an
is also
do exist in the index.
Please take a look at the blog post which explains the different autocomplete modes.
In TwoTerms mode, autocomplete takes the last incomplete token and returns two terms only if they exist as a phrase in the index. Only the last incomplete token is used; all other tokens are ignored.
If you are looking for a complete phrase match, I'd recommend issuing a search query instead of using suggest.
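The behaviour described above can be imitated with a toy model. This is only a sketch under simple assumptions (whitespace tokenization, lowercase matching, alphabetical ordering) and not how Azure Cognitive Search is implemented; the function name `two_term_suggestions` and the sample documents are made up for illustration.

```python
def two_term_suggestions(query, documents, top=5):
    """Toy model of TwoTerms mode: only the last query token is matched."""
    tokens = query.lower().split()
    if not tokens:
        return []
    *head, last = tokens  # everything before the last token is ignored for matching

    # Collect two-word phrases from the documents whose first word starts with the last token.
    phrases = set()
    for text in documents:
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            if first.startswith(last):
                phrases.add(f"{first} {second}")

    # The ignored tokens are simply echoed back in front of each matching phrase.
    prefix = " ".join(head)
    return [f"{prefix} {phrase}".strip() for phrase in sorted(phrases)[:top]]


# None of these hypothetical documents contains "john smith is".
documents = ["Smith is a plumber", "the server is active", "this is the end"]
suggestions = two_term_suggestions("john smith is", documents)
print(suggestions)
```

The suggestions all start with "john smith" even though no document contains that text, which is why a full-text search for the suggested phrase can come back empty.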

Related

Database structure, query vs iteration

(Basic question that probably is a duplicate, but I don't know what to search for, so feel free to edit this question with the proper database terms)
How would one retrieve groups of rows (family members) from a database table based on columns (last and first names) efficiently? E.g. from this
last first ...
doe john ...
doe jane ...
smith jimmy ...
smith ted ...
smith anna ...
to something like this (additional data omitted)
doe : [{first:john, ...}, {jane}],
smith: [{jimmy}, {ted}, {anna}]
Does this require first retrieving the common data (last name) with distinct or group by, and then iterating with an additional query (where last="smith") for each name?
I'd think that this naive approach is likely inefficient and that there are better solutions.
You need the GROUP_CONCAT() aggregate function:
SELECT last, GROUP_CONCAT(first) first
FROM tablename
GROUP BY last
or JSON_GROUP_ARRAY():
SELECT last, JSON_GROUP_ARRAY(first) first
FROM tablename
GROUP BY last
See the demo.
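As a rough illustration, the GROUP_CONCAT() query can be run from Python with the standard sqlite3 module against an in-memory database. This assumes SQLite (where GROUP_CONCAT() is built in); the helper name `group_families` is hypothetical, and the table and column names follow the query above.

```python
import sqlite3


def group_families(people):
    """Return {last: [first, ...]} using a single GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (last TEXT, first TEXT)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", people)
    rows = conn.execute(
        "SELECT last, GROUP_CONCAT(first) AS first "
        "FROM tablename GROUP BY last ORDER BY last"
    ).fetchall()
    conn.close()
    # GROUP_CONCAT joins the values with commas; split them back into a list.
    return {last: firsts.split(",") for last, firsts in rows}


families = group_families([
    ("doe", "john"), ("doe", "jane"),
    ("smith", "jimmy"), ("smith", "ted"), ("smith", "anna"),
])
print(families)
```

The order of names inside each group is not guaranteed by SQLite, so sort the lists if the order matters. Splitting on commas also assumes that first names contain no commas.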

Solr max results for particular field type

I am using Solr 4.10.3.
I have various fields in my schema, e.g. id, title, content, type, etc. Many of my docs have the same type value.
id | title | content | type
1 | pro | My | abc
2 | ver | name | ht
3 | art | is | abc
and so on.
When I query Solr, I want 10 results in total (the default), but at most two of them should be of type:abc. The other 8 results can be of any type except abc, and can include more than one type.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc. The rest of the query can be identical. Then combine the results before you show them to users.
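A minimal sketch of the two-query approach follows. It only builds the request parameters and combines already-fetched result lists, so no Solr client is involved. Putting the type restriction in an `fq` filter is an assumption (the answer only says "type:abc"), and the helper names `build_queries` and `combine` are hypothetical.

```python
from urllib.parse import urlencode


def build_queries(q, limited_type="abc", limited_rows=2, total_rows=10):
    """Return parameter dicts for the restricted query and the remainder query."""
    base = {"q": q, "wt": "json"}
    limited = dict(base, fq=f"type:{limited_type}", rows=limited_rows)
    rest = dict(base, fq=f"-type:{limited_type}", rows=total_rows - limited_rows)
    return limited, rest


def combine(limited_docs, other_docs, total_rows=10):
    """Concatenate the two result lists and cap the total length."""
    return (limited_docs + other_docs)[:total_rows]


limited, rest = build_queries("content:name")
print(urlencode(limited))
print(urlencode(rest))
```

Plain concatenation puts the type:abc documents first. If the combined list should be ordered by score or another field, sort it after combining.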
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
  merge(
    top(n=2,
      search(fl="id, title, content, type", q="type:abc", sort="title asc")
    ),
    search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
    on="title asc"
  )
)
What's going on here is that first you're finding at most two documents of type "abc", then you are finding all documents that are not of type "abc". Notice that both of these are sorted by "title asc" (it's important that they be sorted by the same field(s)). Then we merge these two streams on the field "title" in ascending order. What this does is, using the title field, pick the first document from one of the streams, then it picks the second document, then the third, and so on. By merging "on title" what we're doing is using the title field to decide which document comes first in the merged stream. The outer top will ensure that only 10 documents are returned. Because we limited the "type:abc" stream to have at most 2 documents the final merged stream will have at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you but it is critical with a merge that the incoming streams are sorted in the same order. And in fact, the Streaming API will validate that for you and reject the expression if that is not the case.
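The merge-then-top logic described above can be imitated in plain Python to see why at most two "abc" documents survive. This is only an in-memory analogy built on `heapq.merge`, not Solr's Streaming API; the function name `top_merged` and the sample documents are made up.

```python
import heapq
from itertools import islice
from operator import itemgetter


def top_merged(abc_docs, other_docs, n=10, max_abc=2, key="title"):
    """Merge two streams sorted on the same key and keep the first n documents."""
    by_key = itemgetter(key)
    # Inner top: at most max_abc documents from the restricted stream.
    abc_stream = islice(sorted(abc_docs, key=by_key), max_abc)
    other_stream = sorted(other_docs, key=by_key)
    # heapq.merge requires both inputs to be sorted on the same key,
    # mirroring the requirement on the incoming streams of merge().
    merged = heapq.merge(abc_stream, other_stream, key=by_key)
    # Outer top: cap the merged stream at n documents.
    return list(islice(merged, n))


abc_docs = [{"title": t, "type": "abc"} for t in ("pro", "art", "zed")]
other_docs = [{"title": t, "type": "ht"} for t in ("ver", "bob")]
result = top_merged(abc_docs, other_docs)
print([doc["title"] for doc in result])
```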

Vlookup array multiple columns

Excel wiz's,
I'm trying to build a report with a simple drop down list of names. Rather than try to explain in more detail, let me give you a sample dataset:
Table1:
Text | Person1 | Person2 | Person3
String here contains name(s) | Mike Smith | Robert Johnson | Suzy Q
Another string with name(s) | Dan Boy | John Michael | Bob Wise
Different string with name(s) | Robert Johnson | Suzy Q |
In my report sheet, I have a drop-down list of all the possible "persons" that I want to choose from, and I then want to return all matching values from the "Text" column in an array. I have been able to make it work with only one column using this formula, where C4 contains my choice in the drop-down list:
INDEX(Table1[#All],SMALL(IF(Table1[Person1]=$C$4,ROW(Table1[Person1])),ROW(1:1)),1)
The text column will contain all the names of the Person columns, but they are in a different case (all caps, can't change format for display purposes). Maybe a SEARCH function would be more useful? I'm not sure. I'm trying to avoid using a macro, but I am not completely opposed.
Let me know what you guys think, and thanks in advance!
Simply reorganize your table so that there's one row per name, then do a VLOOKUP on the name to get the matching list.
Person | Text
Mike Smith | String with names
Robert Johnson | String with names
Suzy Q | String with names
Dan Boy | Second string with names
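The one-row-per-name reorganization can be sketched outside Excel as well. The snippet below is an illustration in Python, not an Excel formula: it unpivots the wide table into a name-to-texts lookup. The function name `unpivot` is hypothetical, and placing the third row's two names in Person1 and Person2 is an assumption about the sample data.

```python
def unpivot(rows, person_columns=("Person1", "Person2", "Person3")):
    """Turn a wide table into {person: [text, ...]}, one entry per name."""
    lookup = {}
    for row in rows:
        for column in person_columns:
            name = row.get(column)
            if name:  # skip empty person cells
                lookup.setdefault(name, []).append(row["Text"])
    return lookup


table1 = [
    {"Text": "String here contains name(s)",
     "Person1": "Mike Smith", "Person2": "Robert Johnson", "Person3": "Suzy Q"},
    {"Text": "Another string with name(s)",
     "Person1": "Dan Boy", "Person2": "John Michael", "Person3": "Bob Wise"},
    {"Text": "Different string with name(s)",
     "Person1": "Robert Johnson", "Person2": "Suzy Q"},
]
lookup = unpivot(table1)
print(lookup["Robert Johnson"])
```

Once the data is in this shape, the column a name originally appeared in no longer matters, which is what makes a single-column lookup sufficient.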
Are you trying to build validations for teams, i.e. select a team and have the next drop-down list only the members of that team?
You can use OFFSET inside data validation. In one cell, add a list validation for the teams. In the other cell, create a list validation that uses an OFFSET formula to return the range of members for the selected team.
Edit: I'm not sure how to do this in a table, but this is how you would fill a range with VLOOKUP:
In the table with the entries, add a column with a serial number running from 1 to n.
Just below the drop-down box, enter the numbers 1 to n in order.
VLOOKUP the serial number in the table; that gives the row you are looking up.
For the column, use MATCH to find which column of the table contains the currently selected person.
Drag the formula down to fill n rows.

Solr hierarchical facets: how to get all 2nd-level values for the top N 1st-level values

I have a pair of multi-valued index fields, author and author_norm, and I've created a hierarchical facet field for them using the pattern described at https://wiki.apache.org/solr/HierarchicalFaceting#Indexed_Terms. The facet values look like this:
0/Blow, J
1/Blow, J/Blow, Joe
1/Blow, J/Blow, Joseph
1/Blow, J/Blow, Jennifer
0/Smith, M
1/Smith, M/Smith, Michelle
1/Smith, M/Smith, Michael
1/Smith, M/Smith, Mike
Authors are associated to article records, and in most cases an article will have many authors. This means that for a Solr query that returns 100+ articles, there will potentially be 1000+ authors represented.
My problem is that when I go to display this hierarchy to the user, due to my facet.limit and facet.mincount being set to sane values, I do not have the complete set of 2nd-level values, i.e., the 2nd level of my hierarchy will be cut-off at a certain point. I will have something like this:
Blow, J (30)
Blow, Joe (17)
Blow, Joseph (9)
Smith, M (22)
Smith, Michelle (14)
Smith, Michael (6)
I would like to also have the "Blow, Jennifer (4)" and "Smith, Mike (2)" entries in this list, but they're not returned in the response because mincount cutoff is 5. So I end up with a confusing display (17 + 9 != 30, etc).
One option would be to put a little "(more)" link at the bottom of every 2nd-level list and fetch the full set via ajax. I'm not crazy about this solution because it's asking users to work/click more than they really should have to, and also because I can't control the length of the initial 2nd-level listing; sometimes it'll be 3 names + "(more)", sometimes 2 or even 1. That's just ugly.
I could set mincount=1 and limit=-1 for just my hierarchical facet field, but that would be nuts because for a large query (100k hits) I would be fetching 100k+ values I don't need. I only need the full set of 2nd-level values for the top N 1st-level values.
So unless someone has a better suggestion, I'm assuming I need to do some kind of follow-up query. So after all that, here's what I'm really asking: is there a way to fetch these 2nd-level values in a single follow-up query? Given an initial Solr response, how can I get all the 2nd-level permutations of just the top N 1st-level values of my hierarchy?
Thanks!
PS, I'm using Solr 4.0.
You can set the limit (and mincount) separately for each level of a pivot facet:
facet.pivot=fieldA,fieldB&f.fieldA.facet.limit=3&f.fieldB.facet.limit=-1
Problems start when both levels are the same field (facet.pivot=fieldA,fieldA); in that case I would probably create a copy of fieldA as fieldB.
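As a sketch, the pivot request can be assembled programmatically. The snippet assumes Solr's per-field override form `f.<field>.facet.limit` and that these overrides are honoured at each pivot level, as the answer suggests; the helper name `pivot_params` is hypothetical, and fieldA and fieldB are the placeholder field names used above.

```python
from urllib.parse import urlencode


def pivot_params(q, parent_field, child_field, top_n):
    """Build parameters for a two-level pivot: top_n parents, unlimited children."""
    return {
        "q": q,
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.pivot": f"{parent_field},{child_field}",
        f"f.{parent_field}.facet.limit": top_n,
        f"f.{child_field}.facet.limit": -1,
    }


params = pivot_params("*:*", "fieldA", "fieldB", 3)
query_string = urlencode(params)
print(query_string)
```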

(Full-Text) Search And Database Design

This is a system architecture question on designing full-text search with (relational) database. The specific software I'm using are Solr and PostgreSQL, just FYI.
Suppose we are building a forum with two users Andy and Betty --
Post ID | User | Title | Content
--------|-------|-------------------|---------------------------
1 | Andy | Dark Knight rocks | Dark Knight rocks blah
2 | Betty | I love Twilight | Twilight blah blah
3 | Andy | Twilight sucks | Twilight sucks blah
4 | Betty | Andy sucks | Twilight rocks, Andy sucks
When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+knight".
Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index the user name and return "Andy" for "?q=a" and "Betty" for "?q=b". But what if we want to make our system smarter, so that it also takes the users' posts into account and returns "Betty" before "Andy" for "?q=twilight" because Betty mentions Twilight more than Andy does?
How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?
Faceting on User would return number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.
But it won't help if both wrote 15 posts about Twilight and Andy's were supposed to be more relevant; you will see all facet counts (15 and 15 in this case) even if you are paginating to see only, say, the top 5 results and Andy wrote 4 of them.
If the above solution is not good enough, consider a background job that writes documents of
type: suggest_user_type (so you can distinguish them by a `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (concatenate the user's latest 50 posts)
once a week. Then, if you query with
fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user
you get a sorted list of users based on relevance of concatted_posts with respect to twilight.
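The background job could look roughly like the sketch below. It only builds the per-user documents in memory; sending them to Solr is left out. The function name `build_user_docs` is hypothetical, and treating a higher post ID as "more recent" is an assumption.

```python
from collections import defaultdict


def build_user_docs(posts, max_posts=50):
    """Build one suggest document per user from that user's latest posts."""
    by_user = defaultdict(list)
    # Walk the posts from newest to oldest (assumes higher id means newer).
    for post in sorted(posts, key=lambda p: p["id"], reverse=True):
        if len(by_user[post["user"]]) < max_posts:
            by_user[post["user"]].append(post["content"])
    return [
        {
            "type": "suggest_user_type",  # lets an fq separate these from post documents
            "user": user,
            "concatted_posts": " ".join(contents),
        }
        for user, contents in sorted(by_user.items())
    ]


posts = [
    {"id": 1, "user": "Andy", "content": "Dark Knight rocks blah"},
    {"id": 2, "user": "Betty", "content": "Twilight blah blah"},
    {"id": 3, "user": "Andy", "content": "Twilight sucks blah"},
    {"id": 4, "user": "Betty", "content": "Twilight rocks, Andy sucks"},
]
user_docs = build_user_docs(posts)
print(user_docs)
```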
I believe term frequency is included in full-text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which down-weights terms that are common across documents.
There are other steps common to ranking text; you may want to have a look at the OpenNLP project if you're interested.
In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be that for very large systems the key is building an efficient index, then distributing it over a number of machines to scale performance. I would recommend reading up on PageRank and how Google developed its systems as a starting point.
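To make the term frequency and inverse document frequency idea concrete, here is a small textbook-style TF-IDF calculation. It is a simplified sketch, not the scoring formula Solr or Lucene actually uses. The token lists are invented, and "Carl" is a hypothetical third user added so that "twilight" does not appear in every document.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, all_docs):
    """Term frequency in one document times the log-scaled inverse document frequency."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf


docs = {
    "Andy": "dark knight rocks twilight sucks".split(),
    "Betty": "i love twilight twilight rocks".split(),
    "Carl": "dark knight rocks".split(),  # hypothetical user
}
all_docs = list(docs.values())
scores = {user: tf_idf("twilight", tokens, all_docs) for user, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A term such as "rocks" that occurs in every document gets an inverse document frequency of log(1) = 0, so it contributes nothing to the ranking.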
