Precision, Recall, ROC in Solr

My final task is building a search engine. I'm using Solr to access and retrieve data from an ontology, which will later be used as a corpus. I'm entirely new to all of these things (information retrieval, ontologies, Python, and Solr).
There's a step in information retrieval where you evaluate the query results. I'm planning to use precision, recall, and ROC scores for this evaluation. Is there any function in Solr I can use to calculate precision, recall, and ROC? Whether it's exposed in the Solr interface or only in the underlying code doesn't matter.

Unless I'm completely mistaken, precision and recall scores require you to know which documents are the appropriate ones to retrieve and display before comparing them to the documents the search engine actually returned. The search already returns what it thinks is the perfect match for your query, so it's up to you to evaluate that result against the expected result (meaning that you know which documents should have been returned).
If the search engine could decide by itself, it would always give 1 (n/n) for both precision and recall, as that would be the perfect result. If it could evaluate what those numbers would be, it wouldn't need to include them in the search result at all.
If you query for a certain term, Solr will give you all documents containing that term (and, if you want, variations of it, depending on your analysis chain). Tuning this relevancy is your task, and since it can't be done automagically (it depends on your business case), you'll have to perform the measurements yourself against an answer key you've already decided on.
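A minimal sketch of how you might compute precision and recall outside Solr, in Python: query the select handler over HTTP and compare the returned ids against a hand-made answer key. The core name, field names, and example ids below are assumptions for illustration, not anything Solr provides.

    # Minimal sketch: precision/recall against a hand-made answer key.
    # Assumptions: a local Solr core named "ontology" with a unique "id" field;
    # the relevant ids per query were judged manually beforehand.
    import requests

    SOLR_URL = "http://localhost:8983/solr/ontology/select"  # assumed core name

    # Answer key you have to build yourself: query -> set of relevant doc ids.
    ANSWER_KEY = {
        "protein binding": {"doc12", "doc57", "doc103"},
    }

    def evaluate(query, relevant_ids, rows=10):
        params = {"q": query, "rows": rows, "wt": "json", "fl": "id"}
        docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
        retrieved_ids = {d["id"] for d in docs}

        true_positives = len(retrieved_ids & relevant_ids)
        precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
        recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
        return precision, recall

    for q, rel in ANSWER_KEY.items():
        p, r = evaluate(q, rel)
        print(f"{q!r}: precision={p:.2f} recall={r:.2f}")

ROC analysis would work the same way: you need the per-document relevance labels first, then you can sweep a score threshold yourself.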

Related

How to manage ranking system in Solr

I have a Solr setup to implement a search engine.
The search engine works (or should work) using ranking.
At the same time, I'd like to show regularly purchased but low-ranked products on top of the results.
Is it possible to do this?
Solr is built on top of Lucene, and what we call ranking is known as scoring in the Lucene/Solr universe.
This "relevancy score" is computed based on several things that obviously depend on the index and the query, but the scoring formula is called Similarity:
Generally, the Query determines which documents match (a binary
decision), while the Similarity determines how to assign scores to the
matching documents.
Index: Scoring is very much dependent on the way documents are indexed (fieldType definitions, norms, etc.; index-time boosts will also affect scoring at query time).
Query: Lucene usually finds the documents that need to be scored based on boolean logic in the query specification, and then ranks this subset of matching documents via a retrieval model (similarity).
Similarity: This is how Lucene actually determines how to weight the matched terms.
In general, one doesn't have to tweak Similarity unless you have very specific and precise needs. When the matching works but the scoring doesn't, re-ranking the result set by adjusting query parameters (e.g. boost queries and functions, sorting, grouping) is sufficient in most cases.
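To make that concrete, here is a rough sketch of what adjusting ranking through query parameters (rather than Similarity) can look like over Solr's HTTP API, using Python's requests library. The core name and the field names (title, description, inStock, popularity) are made up for illustration.

    # Sketch of re-ranking via query parameters instead of touching Similarity.
    # Field names and the core name are assumptions.
    import requests

    params = {
        "q": "laptop",
        "defType": "edismax",
        "qf": "title^2 description",      # weight title matches higher
        "bq": "inStock:true^5",           # boost in-stock documents
        "bf": "log(popularity)",          # boost by a numeric popularity field
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/products/select", params=params)
    print(resp.json()["response"]["numFound"])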
Now, in order to show additional products on top of some results, you can use the Query Elevation Component:
The Query Elevation Component lets you configure the top results for a given query regardless of the normal Lucene scoring.
It is very useful in situations where you want to arbitrarily promote some content regardless of the user query, because that query might not necessarily match the content you want to promote, in which case it would not be possible to boost it to the top without OR-ing it into the main query in the first place.
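As a hedged sketch, assuming the elevation component is wired into a request handler (the example Solr configs expose one as /elevate) and that the documents to promote are known by id, a request could look like this; the core name and the ids are invented.

    # Sketch: promoting specific documents with the Query Elevation Component.
    # Assumes the component is enabled in solrconfig.xml and a handler such as
    # /elevate is registered; ids and core name are made up.
    import requests

    params = {
        "q": "memory",
        "enableElevation": "true",
        "forceElevation": "true",          # keep elevated docs on top even when sorting by other fields
        "elevateIds": "PROMO-1,PROMO-2",   # documents to pin to the top
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/products/elevate", params=params)
    for doc in resp.json()["response"]["docs"][:5]:
        print(doc.get("id"))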
Read also Solr Relevancy FAQ.

Getting stable SOLR scores

I run a query against a SOLR core and restrict the result using a filter
like fq: {!frange l=0.7 }query($q). I'm aware that SOLR scores do not
have an absolute meaning, but the 0.7 (just an example) is calculated
based on user input and some heuristics, which works quite well.
The problem is the following: I update quite a few documents in my core.
The updated fields are only meta data fields, which are unrelated to the
above search. But because an update is internally a delete + insert, IDF
and doc counts change. And so do the calculated scores. Suddenly my
query returns different results.
As Yonik explained to me here, this behaviour is by design. So my question is: what is the simplest,
most minimal way to keep the scores and the output of my query stable?
Running optimize after each commit should solve the problem, but I
wonder if there is something simpler and less expensive.
You really need to run optimize. When you optimize the index, Solr purges all deleted documents that are still physically present, which makes the query results stable. This happens because rebuilding this metadata every time a document is updated would be too expensive, so Solr only does it on optimize. There is a good way to see whether your index is more or less stable: the Solr admin API reports Num Docs and Max Doc. If Max Doc is greater than Num Docs, you still have some old (deleted) documents affecting your relevancy calculation. Optimizing the index makes these two numbers equal again, and once they are equal you can trust that IDF is being calculated correctly.
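A small sketch of that check and the follow-up optimize over Solr's HTTP APIs, assuming a local core named "mycore":

    # Sketch: checking whether deleted docs are skewing IDF, then optimizing.
    # The core name "mycore" and the host are assumptions.
    import requests

    BASE = "http://localhost:8983/solr"
    CORE = "mycore"

    status = requests.get(
        f"{BASE}/admin/cores", params={"action": "STATUS", "core": CORE, "wt": "json"}
    ).json()
    index = status["status"][CORE]["index"]
    print("numDocs =", index["numDocs"], "maxDoc =", index["maxDoc"])

    # If maxDoc > numDocs, deleted documents are still counted in the statistics.
    if index["maxDoc"] > index["numDocs"]:
        # Expensive: merges segments and purges deleted documents.
        requests.get(f"{BASE}/{CORE}/update", params={"optimize": "true", "wt": "json"})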

Optimize SOLR for retrieving all search results

Sometimes I don't need just the top X results from a SOLR query, but all results (running into millions). This is easily achievable by searching once with 0 rows as a request parameter, and then re-executing the search with the numFound from the result as the number of rows (*).
Of course we can sort the results by e.g. "id asc" to remove relevancy ranking; however, I would like to be able to disable the entire scoring calculation for these queries, as it is probably quite computationally intensive and we just don't need it in these cases.
My question:
Is there a way to make SOLR work in boolean mode and effectively run faster on these often slow queries, when all we need is just all results?
(*) I actually usually just do a paged query where a script walks through the pages (multi-threaded), to prevent timeouts on large result sets yet keep it as fast as possible, but this is not important for the question.
This looks like a related question, but apparently the user asked the wrong question and was only after retrieving all results: Solr remove ranking or modify ranking feature. This question is not answered there.
Use filters instead of queries; there is no score calculation for filters.
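As a sketch, that amounts to moving the real constraint from q into fq and matching everything with q=*:*; the core and field names below are assumptions.

    # Sketch: move the constraint from q (scored) to fq (not scored, cacheable).
    # Core and field names are assumptions.
    import requests

    params = {
        "q": "*:*",                 # match everything, no meaningful scoring work
        "fq": "category:logs",      # the actual constraint as a filter query
        "fl": "id",
        "rows": 1000,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    print(resp.json()["response"]["numFound"])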
There are a couple of things to be aware of:
Solr deep paging allows you to export a large number of results much more quickly (see the sketch after this list).
Using an export format such as CSV can be faster than using an XML format, simply because it is more compact and cheaper to produce.
And, as already mentioned, if you are exporting everything, put your queries into a filter query (fq) with caching off.
For very complex queries, if you can split them into several steps, you can actually assign different weights to the filters and have them execute in sequence. This lets you use a cheap first filter that gets rid of most of the results, and only then apply more expensive, more precise filters.
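Putting the deep-paging and filter advice together, a sketch of a cursorMark export loop might look like this (it assumes id is the uniqueKey field and that the core and field names are made up):

    # Sketch: exporting every match with cursor-based deep paging.
    # Requires a sort on the uniqueKey field; names here are assumptions.
    import requests

    SOLR = "http://localhost:8983/solr/mycore/select"
    params = {
        "q": "*:*",
        "fq": "{!cache=false}category:logs",   # filter, uncached, no scoring
        "sort": "id asc",                      # cursorMark needs the uniqueKey in the sort
        "fl": "id",
        "rows": 5000,
        "wt": "json",
        "cursorMark": "*",
    }

    while True:
        data = requests.get(SOLR, params=params).json()
        for doc in data["response"]["docs"]:
            print(doc["id"])
        next_cursor = data["nextCursorMark"]
        if next_cursor == params["cursorMark"]:   # no more pages
            break
        params["cursorMark"] = next_cursor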

open source ranking algorithms used by Solr

I am working on Solr and using Solr search. I want to know what ranking algorithm it uses when it returns the results of a query.
Solr uses Lucene Core, a text search library written in Java, for text search. This is the same project that also powers Elasticsearch, so everything here applies to Elasticsearch too.
The core ranking algorithm (also known as the similarity algorithm) is based on term frequency/inverse document frequency, or tf/idf for short. tf/idf takes the following factors into account:
(I've copied in a description of tf/idf below from the Elasticsearch documentation - the description would be identical for Solr, but this one is much better written and easier to understand.)
Term frequency
How often does the term appear in the field? The more often, the more
relevant. A field containing five mentions of the same term is more
likely to be relevant than a field containing just one mention.
Inverse document frequency
How often does each term appear in the index? The more often, the less
relevant. Terms that appear in many documents have a lower weight than
more uncommon terms.
Field norm
How long is the field? The longer it is, the less likely it is that
words in the field will be relevant. A term appearing in a short title
field carries more weight than the same term appearing in a long
content field.
You can find the specifics of the Lucene similarity scoring here: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
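As a toy illustration only (roughly following the shape of the classic Lucene defaults, not the exact formula in the link above), the three factors combine like this:

    # Toy illustration of the three factors described above; this is NOT the exact
    # Lucene formula (see the TFIDFSimilarity link), just the shape of it.
    import math

    def score(term_freq, docs_with_term, total_docs, field_length):
        tf = math.sqrt(term_freq)                               # more mentions -> more relevant
        idf = 1 + math.log(total_docs / (docs_with_term + 1))   # rarer terms weigh more
        field_norm = 1 / math.sqrt(field_length)                # shorter fields weigh more
        return tf * idf * field_norm

    # A term appearing twice in a 5-word title beats one mention in a 500-word body.
    print(score(2, 10, 1000, 5))    # relatively high
    print(score(1, 10, 1000, 500))  # relatively low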
Keep in mind that Solr/Lucene supports a rich set of functionality to alter this scoring. This is best read about here in the discussion on Lucene scoring.
If you want to read more about scoring and how to change it I'd start here:
http://wiki.apache.org/solr/SolrRelevancyFAQ
And then I would read up a bit on what a Function Query is:
FunctionQuery allows one to use the actual value of a field and
functions of those fields in a relevancy score.
Basically, it provides you with a relatively easy-to-use mechanism to adjust the relevancy score of a document as a function of the values within certain fields:
http://wiki.apache.org/solr/FunctionQuery
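For example, a hedged sketch of a function-query boost via edismax, favouring recent documents; the publish_date field and the core name are assumptions:

    # Sketch: nudging scores with a function query (edismax "boost"), e.g.
    # favouring recent documents. Field names and the core are assumptions.
    import requests

    params = {
        "q": "solr ranking",
        "defType": "edismax",
        "qf": "title body",
        # multiplicative boost that decays with document age (a classic recency recipe)
        "boost": "recip(ms(NOW,publish_date),3.16e-11,1,1)",
        "fl": "id,score",
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/articles/select", params=params)
    for doc in resp.json()["response"]["docs"][:5]:
        print(doc["id"], doc["score"])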

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say with XX% certainty that Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say that, based on its content, it is most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group its entries by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
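A rough sketch of that "known types" idea, with types kept as regular expressions and identifier-looking tokens wildcarded; the heuristics here are purely illustrative:

    # Sketch of the "known types" idea: each type is a regex with the variable parts
    # wildcarded; new lines either match an existing type or become a new one.
    import re

    known_types = [
        re.compile(r"^Change Transaction \S+ Assigned To Server \S+$"),
    ]

    def classify(line):
        for i, pattern in enumerate(known_types):
            if pattern.match(line):
                return i                       # matched an existing type
        # No match: derive a new type by wildcarding tokens that look like identifiers
        generalized = " ".join(
            r"\S+" if any(c.isdigit() for c in tok) else re.escape(tok)
            for tok in line.split()
        )
        known_types.append(re.compile(f"^{generalized}$"))
        return len(known_types) - 1

    print(classify("Change Transaction XYZ789 Assigned To Server GB47"))  # 0
    print(classify("Disk quota exceeded on volume v01"))                  # 1 (new type)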
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of two consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare a query in a similar way; the search engine may then give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
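A minimal sketch of that term/bigram/trigram extraction in Python (inside Solr/Lucene you would normally get the same effect from a shingle filter in the analysis chain):

    # Sketch: turning a machine-generated log line into unigrams, bigrams and
    # trigrams that can be indexed or used to build the similarity query.
    def ngrams(text, n):
        tokens = text.split()
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    line = "Change Transaction XYZ789 Assigned To Server GB47"
    terms = line.split() + ngrams(line, 2) + ngrams(line, 3)
    print(terms)
    # ... 'Assigned_To', 'To_Server', 'Server_GB47', 'Change_Transaction_XYZ789', ...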
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. However, the "information retrieval" approach is usually only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return you some similar existing log lines, without much quantitative information on how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
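For what it's worth, a minimal sketch of the clustering strategy using scikit-learn's TF-IDF vectorizer and k-means; the sample lines and the number of clusters are placeholders you would have to tune:

    # Sketch of the clustering approach: TF-IDF vectors over word n-grams, then k-means.
    # scikit-learn is assumed; the number of clusters is something you'd have to tune.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    logs = [
        "Change Transaction ABC123 Assigned To Server US91",
        "Change Transaction XYZ789 Assigned To Server GB47",
        "Disk quota exceeded on volume /var on host DB12",
    ]

    # Unigrams and bigrams of words; parameters would need tuning on real data.
    vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(logs)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    for line, label in zip(logs, km.labels_):
        print(label, line)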
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
