SOLR: size of term dictionary and how to prune it - solr

Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to reduce the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in Solr, or programmatically in SolrJ?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.

You can get this information on the Schema Browser page by clicking "Load Term Info", from the Luke admin handler (https://wiki.apache.org/solr/LukeRequestHandler), and also from the stats component (https://cwiki.apache.org/confluence/display/solr/The+Stats+Component).
To prune the index, you could facet on the field and collect the terms with low frequency. Then fetch the matching docs and update each document without those terms (this could be difficult because it depends on the analyzers and tokenizers of your field). Alternatively, you can use the Lucene libraries to open the index and do it programmatically.
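The "find the low-frequency terms, then rewrite the documents without them" idea can be sketched in plain Python. This is only a toy model: it assumes the documents are already tokenized and held in memory, and the names rare_terms, prune and min_df are made up for illustration. In a real setup the counts would come from Solr (for example from faceting on the field) and the rewrite would go through document updates.

```python
from collections import Counter

def rare_terms(docs, min_df=2):
    """Return the terms whose document frequency is below min_df."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    return {term for term, n in df.items() if n < min_df}

def prune(docs, stop):
    """Return a copy of docs with every term in `stop` removed."""
    return {doc_id: [t for t in tokens if t not in stop]
            for doc_id, tokens in docs.items()}

# Hypothetical, already-tokenized documents
docs = {
    "d1": ["solr", "index", "term"],
    "d2": ["solr", "term", "dictionary"],
    "d3": ["solr", "index", "zyzzyva"],
}
stop = rare_terms(docs, min_df=2)
pruned = prune(docs, stop)
```

Here the two terms that occur in a single document end up in stop, which plays the role of the temporary stop word list described in the question.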

You can check the number and distribution of your terms in the Admin UI, on the collection's Schema Browser screen. You need to click Load Term Info.
Or you can use Luke, which allows you to look inside the Lucene index.
It is not clear what you mean by 'remove'. For example, you can add the terms to the stopwords in the analyzer chain if you want to avoid indexing them.

Related

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it ranks results with an algorithm called TF-IDF.
So, in your case, the order can be explained by this formula.
One possible solution is to ignore the TF-IDF score and count each matching term in the document as one; then a document with 5 matches gets a score of 5, one with 4 matches gets 4, etc. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
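As a rough illustration of why this helps, the sketch below builds the same query string and models the resulting score as "one point per matched term". The helper names are hypothetical, the lowercasing stands in for whatever analyzer the field uses, and the scoring function is a simplification that ignores any other factors Solr may apply.

```python
def constant_score_query(field, terms, score=1):
    """Build a query where every clause carries a fixed score (^=)."""
    return " ".join(f"{field}:{t}^={score}" for t in terms)

def matched_clause_score(doc_tokens, terms):
    """Toy model of the score: one point per query term present in the doc."""
    present = {t.lower() for t in doc_tokens}  # assumption: lowercasing analyzer
    return sum(1 for t in terms if t.lower() in present)

terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
query = constant_score_query("text", terms)

# Hypothetical documents: one repeats two terms, one contains all five once
repetitive_doc = ["oil", "oil", "oil", "peak", "peak", "oil"]
complete_doc = ["julian", "cribb", "epa", "peak", "oil"]
scores = (matched_clause_score(repetitive_doc, terms),
          matched_clause_score(complete_doc, terms))
```

Under this model the document that repeats two terms scores 2, while the document containing all five terms scores 5, which is the ordering asked for in the question.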
Another solution, which requires some scripting, goes like this. First, run a query that requires all 5 terms, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for every combination of 4 terms out of 5 (if I'm not mistaken, that takes 5 additional queries), and so on, down to queries with a single mandatory clause. You then have the full results and only need to normalise the scores, or simply concatenate the result lists if you decide that docs matching 5 terms are always better than docs matching 4. Cons of this solution: a lot of queries, they need to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of matched terms.
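A minimal sketch of the scripting part, assuming the queries are simply generated as strings and sent to Solr by some other code (the function name mandatory_queries is made up for illustration):

```python
from itertools import combinations

def mandatory_queries(terms):
    """Yield (k, query) pairs, from all terms required down to a single one."""
    for k in range(len(terms), 0, -1):
        for combo in combinations(terms, k):
            yield k, " ".join(f"+{t}" for t in combo)

terms = ["Julian", "Cribb", "EPA", "peak", "oil"]
queries = list(mandatory_queries(terms))
four_term_queries = [q for k, q in queries if k == 4]
```

For 5 terms this produces 31 queries in total, 5 of them with four mandatory clauses. A document that matches all five terms also matches every smaller combination, so duplicates have to be dropped when the result lists are concatenated.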

Forward Index vs Inverted index Why?

I was reading about inverted indexes (used by text search engines like Solr, Elasticsearch, etc.) and, as I understand it (if we take "Person" as an example):
The attribute to Person relationship is inverted:
John -> PersonId(1), PersonId(2), PersonId(3)
London -> PersonId(1), PersonId(2), PersonId(5)
I can now search the person records for 'John who lives in London'
Doesn't this solve all the problems? Why do we have the forward (or regular database) index at all? In other words, in what cases is regular indexing useful? Please explain. Thanks.
The point that you're missing is that there is no real technical distinction between a forward index and an inverted index. "Forward" and "inverted" in this case are just descriptive terms to distinguish between:
A list of words contained in a document.
A list of documents containing a word.
The concept of an inverted index only makes sense if the concept of a regular (forward) index already exists. In the context of a search engine, a forward index would be the term vector; a list of terms contained within a particular document. The inverted index would be a list of documents containing a given term.
When you understand that the terms "forward" and "inverted" are really just relative terms used to describe the nature of the index you're talking about - and that really an index is just an index - your question doesn't really make sense any more.
Here's an explanation of inverted index, from Elasticsearch:
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
Inverted indexing is for fast full-text search. A forward index is less efficient to query, because the engine has to look through all entries to find a term, but it is very fast to build.
You can say this:
Forward index: fast indexing, less efficient queries
Inverted index: fast queries, slower indexing
But it always depends on context. Compare it with MySQL: MyISAM has fast reads, while InnoDB has fast inserts/updates and slower reads.
Read more here: https://www.found.no/foundation/indexing-for-beginners-part3/
In forward index, the input is a document and the output is words contained in the document.
{
doc1: [word1, word2, word3],
doc2: [word4, word5]
}
In the reverse/inverted index, the input is a word, and the output is all the documents in which the words are contained.
{
word1: [doc1, doc10, doc3],
word2: [doc5, doc3]
}
Search engines make use of reverse/inverted index to get us documents from keywords.
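The relationship between the two structures can be sketched in a few lines: the inverted index is built by walking the forward index once. The data reuses the Person example from the question (the extra values "Paris" and "Mary" are invented to fill out the records), and the function names are hypothetical.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn {doc_id: [words]} into {word: {doc_ids}}."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return dict(inverted)

def search(inverted, *words):
    """Return the ids of the documents that contain all the given words."""
    postings = [inverted.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

# Forward index: person id -> attribute values
forward = {
    1: ["John", "London"],
    2: ["John", "London"],
    3: ["John", "Paris"],
    5: ["Mary", "London"],
}
inverted = invert(forward)
hits = search(inverted, "John", "London")
```

Answering "John who lives in London" is then an intersection of two posting lists, whereas the same question against the forward index would require scanning every record.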

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that participate in the results list.
I assume that the only problem with this approach would be when no query is made, i.e., when one searches for "*:*". Then no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side. For a given result set based on a filter, we can exclude this filter for the facet using the local params syntax ({!tag} and {!ex}). On the client side, we can then compute the facet counts relative to the complete index (or to another subset defined by a filter). This would probably not work for result sets built by a query parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the greater flexibility of the JSON Facet API (json.facet), e.g. sorting on stats facets (such as avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type terms, the result count (df for the current result set) only needs to be divided by the df of that term for the index (docfreq). If the current result set is the complete index, facets should be sorted by count.
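The division described here is easy to prototype on the client side. The sketch below assumes two dictionaries are already available: the facet counts for the current result set and the document frequencies for the whole index. Both are filled with invented numbers, and the function name interestingness is hypothetical.

```python
def interestingness(result_counts, index_df):
    """Rank facet terms by (count in result set) / (df in the whole index)."""
    scored = {term: count / index_df[term]
              for term, count in result_counts.items()
              if index_df.get(term)}  # skip terms with unknown or zero df
    # Highest ratio first; ties broken alphabetically
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical counts: "the" is frequent everywhere, "solr" mostly in the results
result_counts = {"the": 90, "solr": 40, "lucene": 12}
index_df = {"the": 1000, "solr": 50, "lucene": 20}
ranked_terms = [term for term, _ in interestingness(result_counts, index_df)]
```

Sorted by raw count, "the" would come first; sorted by the ratio, it drops to the bottom because most of its occurrences lie outside the result set.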
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.

SOLR: (field-based) relative Term Frequency driving result order

We are consolidating all our collected content on a record in a single content field, which is the main source for SOLR. The problem is that for some records the content field has only 100K characters, for others 10M or more.
As a result, a search on any term will push 10M character records to the top of the result list.
We would like to limit/counterbalance that by introducing something like "relative term frequency", e.g. the number of occurrences divided by the total number of words in the content field.
Since we don't know what terms people will search on, (I think) we cannot calculate this at indexing time.
Any suggestions/ideas on how to do this?
You can start with a custom Similarity class.
This would allow you to modify the scoring parameters and factors.
You need to check the tf (term frequency) method and customize it.
The custom Similarity class can be referenced from the schema.xml file.
Check the Lucene DefaultSimilarity class for reference, which is the actual implementation.
Also check Changing Similarity
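To make the quantity from the question concrete, here is a small sketch of "relative term frequency" on tokenized text. It only illustrates the arithmetic; it is not the Lucene Similarity API, and the function name and the sample documents are invented.

```python
def relative_tf(tokens, term):
    """Occurrences of term divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)

# Hypothetical documents: 2 hits in 100 words vs. 20 hits in 10,000 words
short_doc = ["solr"] * 2 + ["filler"] * 98
long_doc = ["solr"] * 20 + ["filler"] * 9980

raw = (short_doc.count("solr"), long_doc.count("solr"))
relative = (relative_tf(short_doc, "solr"), relative_tf(long_doc, "solr"))
```

By raw count the long document wins (20 against 2), while by relative frequency the short document ranks higher, which is the counterbalancing effect the question is after.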

Getting facet count 0 in solr

I am using Solr search with faceting in my application. In my use case, the index files in the data dir keep changing.
The problem is that when I facet on a particular field, I get values from the indices that were previously in the data dir (and are not present currently). However, they are returned with a count of 0. I don't understand where the values from the previous indices are persisted, and why they are returned during a totally new search.
Though I can simply skip the facets with count 0, I understand that this can seriously hurt my scalability. Any pointers on how to not include the facets from previous searchers?
[Edit 1]: The current workaround I am using is to add facet.mincount=1 to my URL. But still, I guess this can hurt my performance.
I couldn't find a comment option and I don't have enough reputation to vote up!
I have the exact same problem.
We are using atomic updates with Solr 4.2.
I found some explanation here: http://collab.sakaiproject.org/pipermail/oae-dev/2011-November/000693.html
Excerpt:
To efficiently handle facets for multi-valued fields (like tags), Solr
builds an "uninverted index" (which you think would just be called an
"index", but I suppose that's even more confusing), which maps
internal document IDs to the list of terms they contain. Calculating
facets from this data structure just requires walking over every
document in the result set, looking up the terms it contains in the
uninverted index, and adding them to the tally for all documents.
However, there's a sneaky optimisation here that causes the zero
counts we're seeing. For terms that appear in more than 5% of
documents, Solr doesn't include them in the uninverted index (leaving
them out helps to keep the size in memory down, I guess), and instead
gets the count for these terms using a regular query against the
Lucene index. Since the set of "common" terms isn't specific to your
result set, and since any given result set won't necessarily contain
all of these terms, you can get back counts of zero.
It may not be from old index values but just terms that exist in more than 5% of documents?
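The behaviour described in the excerpt can be imitated with a toy model. This is not Solr's implementation, only a simplified sketch of the two counting paths the excerpt describes, with made-up names and data: rare terms are tallied from a document-to-terms map, while terms above the 5% threshold are counted one by one against the result set and therefore show up even when their count is 0.

```python
def facet_counts(doc_terms, result_set, big_term_ratio=0.05):
    """Toy facet counter with a separate path for very common terms."""
    n_docs = len(doc_terms)
    df = {}
    for terms in doc_terms.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1
    big = {t for t, n in df.items() if n / n_docs > big_term_ratio}

    # "Uninverted" map: doc -> terms, with the common terms left out
    uninverted = {d: [t for t in ts if t not in big]
                  for d, ts in doc_terms.items()}
    counts = {}
    for d in result_set:
        for t in uninverted[d]:
            counts[t] = counts.get(t, 0) + 1

    # Common terms are counted individually, so a 0 can appear here
    for t in big:
        counts[t] = sum(1 for d in result_set if t in doc_terms[d])
    return counts

# 40 hypothetical docs: "common" is in 10 of them (25%), "rare" in one (2.5%)
doc_terms = {i: [] for i in range(40)}
for i in range(10):
    doc_terms[i].append("common")
doc_terms[39].append("rare")

counts = facet_counts(doc_terms, result_set={39})
```

With a result set containing only document 39, the model returns a count of 1 for "rare" and a count of 0 for "common", even though "common" is a perfectly current term in the index.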
I think facet.mincount is not a workaround; you should use facet.mincount=1 to get only the facets with a non-zero count.
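import org.apache.solr.client.solrj.SolrQuery;

SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery("*:*");            // match all documents
solrQuery.addFacetField("foobar");    // facet on the "foobar" field
solrQuery.setFacetMinCount(1);        // skip facet values with a count of 0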

Resources