Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+----+--------+--------------+-------------+
| ID | Domain | Path         | Content     |
+----+--------+--------------+-------------+
| 1  | 1.com  | /hello1.html | Hello dude  |
| 2  | 1.com  | /hello2.html | Hello man   |
| 3  | 1.com  | /hello3.html | Hello fella |
| 4  | 2.com  | /hello1.html | Hello sir   |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but give two results, while 2.com/hello1.html has very high relevance and only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.

It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
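For reference, a minimal request sketch (assuming the grouping field is named domain; the two score sorts shown are the defaults, spelled out for clarity):
q=hello&group=true&group.field=domain&sort=score+desc&group.sort=score+desc&group.limit=3
Here sort orders the groups by their top document's score, group.sort orders the documents inside each group, and group.limit controls how many documents come back per group.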
If this isn't what you want, your best options are to provide a different sort parameter or to post-process the results. For example, on one project I was involved in (where we had a very modest number of groups), we pulled the top three results for each group and, in post-processing, calculated our own sort order for the groups based on a combination of their scores and numFound values. This strategy might be prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups risk making the most relevant documents harder to find.

Related

Why do I get different results with these two queries in Application Insights Analytics?

I want to get the number of unique users in the last 24h. I came up with these two different queries.
pageViews
| where timestamp > ago(1d)
| summarize count() by user_Id
| count;

pageViews
| where timestamp > ago(1d)
| summarize makeset(user_Id)
| extend nb_users = arraylength(set_user_Id);
If I run them I get different results for the number of users. Why is that?
I suspect you're right and the issue is that by default makeset is limited to 128.
You can pass another parameter to makeset(user_id, 1000) to change the maximum set size.
However, if you're trying to find the number of distinct users, dcount(user_Id) would be the simplest way (although it's an approximation), while the first method you used (count() by user_Id, then count) gives the most accurate result.
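For example (sketches in the same query language; the 1000 cap is arbitrary and only fixes the problem if you have fewer than 1000 distinct users):
// raise the makeset cap so the set isn't truncated at 128
pageViews
| where timestamp > ago(1d)
| summarize makeset(user_Id, 1000)
| extend nb_users = arraylength(set_user_Id);

// simpler, but approximate
pageViews
| where timestamp > ago(1d)
| summarize dcount(user_Id);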

Solr max results for particular field type

I am using Solr 4.10.3.
I have various fields in my schema, e.g. id, title, content, type, etc. My documents are such that many of them share the same type value.
id | title | content | type
---+-------+---------+-----
1  | pro   | My      | abc
2  | ver   | name    | ht
3  | art   | is      | abc
and so on.
When I query Solr, I want 10 results in total (the default), but at most two of them of type:abc. The remaining 8 results can be of any type except abc, and more than one of them can share a type.
Is there any possible solution?
Make two queries: one with rows=2 and type:abc, and a second with rows=8 and -type:abc. The rest of the query can be identical. Then combine the results before you show them to users.
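In request terms, the pair might look like this (q=yourQuery is a placeholder; only the filter and row count differ):
q=yourQuery&fq=type:abc&rows=2
q=yourQuery&fq=-type:abc&rows=8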
EDIT: After some research on what comes next in Solr features, I believe that combining the results will become possible once the streaming expressions are part of Solr (maybe in 5.2). See https://issues.apache.org/jira/browse/SOLR-7377
Starting with Solr 5.2 you can accomplish this using the Streaming Expressions (and the Streaming API under the covers). The expression would be
top(n=10,
    merge(
        top(n=2,
            search(fl="id, title, content, type", q="type:abc", sort="title asc"),
            sort="title asc"
        ),
        search(fl="id, title, content, type", q="-type:abc", sort="title asc"),
        on="title asc"
    ),
    sort="title asc"
)
What's going on here is that first you're finding at most two documents of type "abc", then you're finding all documents that are not of type "abc". Notice that both streams are sorted by "title asc" (it's important that they be sorted by the same field(s)). Then we merge the two streams on the field "title" in ascending order: the title field decides which document comes next in the merged stream. The outer top ensures that only 10 documents are returned, and because we limited the "type:abc" stream to at most 2 documents, the final merged stream will contain at most 2 "type:abc" documents.
Obviously you can change the sort criteria to whatever is best for you but it is critical with a merge that the incoming streams are sorted in the same order. And in fact, the Streaming API will validate that for you and reject the expression if that is not the case.
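Note that in a real request, search() also takes the target collection as its first argument, and the expression is posted to the /stream handler, along these lines (the URL and collection1 are placeholders):
curl --data-urlencode 'expr=<the expression above>' http://localhost:8983/solr/collection1/stream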

Vector Space Model query - set of documents search

I'm trying to write code for VSM search in C. Using a collection of documents, I built a hashtable (inverted index) in which each slot holds a word along with its df and a pointer to a list in which each slot holds the name of a document (one in which the word appeared at least once) along with the tf (how many times the word appeared in that document). The user writes a query (and also chooses the weighting scheme qqq.ddd and the comparison method, but that doesn't matter for my question) and I have to print the documents that are relevant to it (from the most relevant to the least relevant). The examples I've seen only show the steps for a single document. For example: we have a collection of 1,000,000 documents (N = 1,000,000) and we want to compare
1 document: car insurance auto insurance
with the query: best car insurance
So in the example it creates an array like this:
Term      | Query tf | Document tf
----------+----------+-------------
auto      |    0     |      1
best      |    1     |      0
car       |    1     |      1
insurance |    1     |      2
The example also gives the df for each term, so using these values and the weighting and comparison methods it's easy to compare them: turn them into vectors by computing the 4 coordinates (one for each word in the array).
So in this example there are 1,000,000 documents, and to see how relevant the document is to the query we use each word that appears in the query or the document once (4 words). So we have to find 4 coordinates and then compare.
In what I'm trying to do there are about 8000 documents, each having from 3 to 50 words. So how am I supposed to compute how relevant a query is to each document? If I have
a query: ping pong
document 1: this is ping kong
document 2: i am ping tongue
To compare the query with document 1 I would use the words: this is ping kong pong (so 5 coordinates), and to compare the query with document 2 I would use the words: i am ping tongue pong (again 5 coordinates), and then, since I use the same comparison method, the one with the highest score is the most relevant? OR do I have to use the full union for both: this is ping kong i am tongue pong (8 coordinates)? So my question is: which is the right way to compare the query against all these 8000 documents? I hope I've managed to make my question easy to understand. Thank you for your time!
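One observation that may help: terms whose tf is 0 in both vectors add nothing to the dot product or to either norm, so a cosine comparison gives the same score whether you build the vectors over the per-pair union or over the full vocabulary. A minimal C sketch of the per-pair comparison (the hard-coded tf values are illustrative; in the real program they would come from the inverted index, and the weights would follow your chosen qqq.ddd scheme instead of raw tf):
#include <math.h>
#include <stdio.h>

/* Cosine similarity between two tf vectors of length n. */
double cosine(const double *q, const double *d, int n)
{
    double dot = 0.0, qnorm = 0.0, dnorm = 0.0;
    for (int i = 0; i < n; i++) {
        dot   += q[i] * d[i];
        qnorm += q[i] * q[i];
        dnorm += d[i] * d[i];
    }
    if (qnorm == 0.0 || dnorm == 0.0)
        return 0.0;                      /* empty vector: no similarity */
    return dot / (sqrt(qnorm) * sqrt(dnorm));
}

int main(void)
{
    /* union of terms for query "ping pong" vs document 1 "this is ping kong":
       this, is, ping, kong, pong */
    double q1[] = { 0, 0, 1, 0, 1 };     /* query tf    */
    double d1[] = { 1, 1, 1, 1, 0 };     /* document tf */
    printf("query vs doc1: %f\n", cosine(q1, d1, 5));
    return 0;
}
Scoring each of the 8000 documents this way and sorting by the result gives the ranking from most to least relevant.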

Cassandra, implementing high-cardinality indexes

As is known, Cassandra is great with low-cardinality indexes and not so good with high-cardinality ones. My column family contains a field storing a URL value.
Naturally, searching for this specific value in a big dataset can be slow.
As a solution, I've come up with the idea of taking the first characters of the URL and storing them in a separate column, e.g. test.com/abcd would be stored as the columns (te, test.com/abcd).
So when a search for a specific URL needs to be done, I can narrow it down by a factor of 26*26 by searching for "te" first and only then looking up the exact URL in the obtained result set.
Does it look like a working solution to reduce URL cardinality in Cassandra?
If you need this to be really fast, you probably want to consider having a separate table with the value that you are searching for as the column key. Key prefix searches are usually faster than column searches in BigTable implementations.
A problem with that is that a sequential scan is going to have to follow after you use the low-cardinality index, in order to finally arrive at the one specific URL queried.
As Chris Shain mentioned, you can build a separate column family to build an inverted index:
Column Family 'people'
ssn  | name | url
-----+------+-------------------------
1234 | foo  | http://example.com/1234
5678 | bar  | http://hello.com/world

Column Family 'urls'
url                     | ssn
------------------------+------
http://example.com/1234 | 1234
http://hello.com/world  | 5678
The downside is that you need to maintain the integrity of your manual index yourself.
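A minimal CQL sketch of such a manual inverted index (table names are illustrative; on older Thrift-based setups you would model the second table as its own column family):
-- lookups by ssn
CREATE TABLE people (
    ssn  text PRIMARY KEY,
    name text,
    url  text
);

-- inverted index: lookups by exact URL become fast key lookups
CREATE TABLE urls (
    url text PRIMARY KEY,
    ssn text
);

-- both writes must be kept in sync by the application
INSERT INTO people (ssn, name, url) VALUES ('1234', 'foo', 'http://example.com/1234');
INSERT INTO urls (url, ssn) VALUES ('http://example.com/1234', '1234');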

Database Design: how to store translated numbers?

This is a general DB design question. Assume the following table:
======================================================================
| product_translation_id | language_id | product_id | name   | price |
======================================================================
| 1                      | 1           | 1          | foobar | 29.99 |
----------------------------------------------------------------------
| 2                      | 2           | 1          | !##$%^ | &*()_ |
----------------------------------------------------------------------
(Assume that language_id = 2 is some language that is not based on Latin characters, etc.)
Is it right for me to store the translated price in the DB? While it allows me to display translations properly, I am concerned it will give me problems when I want to do mathematical operations on them (e.g. add a 10% sales tax to &*()_).
What's a good approach to handling numerical translations?
If you can programmatically convert "29.99" to "&*()_" then I'd put the price in the product table and leave its translation to the display layer. If you store it twice then you will have two obvious problems:
You will end up with consistency problems because you're storing the same thing in two different places in two different formats.
You will be storing numeric data in text format.
The first issue will cause you a lot of headaches when you need to update your prices, and your accountants will hate you for making a mess of the books.
The second issue will make your database hate you whenever you need to do any computations or comparisons inside the database. Calling CAST(string AS DECIMAL) over and over again has a cost.
You could keep the price in numeric form in the product table (for computation, sorting, etc.) and then have the localized translation in your translation table as a string. This approach just magnifies the two issues above, though. However, if you need humans to translate your numbers then it might be necessary. If you're stuck with this, you can mitigate the consistency problems by running a sanity checker of some sort after each update; you might even be able to wrap it in a trigger.
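A sketch of the recommended split (schema names are illustrative): the canonical numeric price lives on the product row, and the translation table carries only text.
-- price stays numeric, stored exactly once
CREATE TABLE product (
    product_id INT PRIMARY KEY,
    price      DECIMAL(10, 2) NOT NULL
);

-- translations carry no price; the display layer formats
-- product.price according to the requested locale
CREATE TABLE product_translation (
    product_translation_id INT PRIMARY KEY,
    product_id             INT NOT NULL REFERENCES product (product_id),
    language_id            INT NOT NULL,
    name                   VARCHAR(255) NOT NULL
);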
