Is it possible to share SOLR index "term dictionaries" across fields

SOLR (Lucene) indices, like all inverted indices, use a term dictionary to assign an index to each term. Each field in the index generates its own term dictionary (which can be inspected in the SOLR admin tool).
I have a very large SOLR index, where each document has very many textual fields. All fields contain English text following a similar distribution.
In my case this is very wasteful: it maintains many very large term dictionaries (in memory) which are almost all the same, and as the number of distinct terms in the documents grows, these dictionaries grow very large.
I cannot combine all fields into a single search field because I need to run queries restricted over specific fields.
Is there a way to tell SOLR to use the same term dictionary for several fields?
(Afterthought: but perhaps if terms follow a Zipfian distribution, the amount of sharing between fields won't be significant anyway, as many terms will appear only once and hence only in one dict?)

Related

Solr Query Performance on Large Number Of Dynamic Fields

This question is a follow-up to my previous question: Is child documents in solr anti-pattern?
I am creating a new question on dynamic field performance as I did not find any recent relevant posts on this topic and felt it deserved a separate question here.
I am aware that dynamic fields are treated as static fields and performance-wise both are similar.
Further, from what I have read, in terms of memory, dynamic fields are not efficient. Say one document has 100 fields and another has 1000 (the max number of fields in the collection); Apache Solr will allocate a memory block to support all 1000 fields for all the documents in the collection.
I have a requirement where I have 6-7 fields that could be part of child documents, and each parent document could have up to 300 child documents, which means each parent document could have ~2000 fields.
What will be the performance impact on queries when we have such a large number of fields in the document?
That really depends on what you want to do with the fields and what the definitions of these fields are. With docValues, most earlier issues with memory usage for sparse fields (i.e. fields that only have values in a small number of the total number of documents) are solved.
Also, you can usually rewrite those dynamic fields to a single multiValued field for filtering instead of filtering on each field (i.e. common_field:field_prefix_value where common_field contains the values you want to filter on prefixed with a field name / unique field id).
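A minimal sketch of that rewrite using SolrJ; the core URL, the common_field name, and the prefixed values are all illustrative, not a fixed convention:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommonFieldExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) { // hypothetical core
            // Instead of many dynamic fields (color_s=red, size_s=xl, ...),
            // index one multiValued field whose values carry the field name as a prefix.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("common_field", "color_red");
            doc.addField("common_field", "size_xl");
            solr.add(doc);
            solr.commit();

            // Filtering on "color = red" becomes a filter on the prefixed value.
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("common_field:color_red");
            System.out.println(solr.query(q).getResults());
        }
    }
}
```

The point of the design is that a single field definition then covers every filterable attribute, so the schema no longer grows with each new dynamic field.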
Finally, it depends on how many documents you have in total. If you only have 1000 documents, it won't be an issue in any way. If you have a million, it used to be, depending on what you needed those dynamic fields for. These days it really isn't an issue, and I'd start out with the naive, direct solution and see if that works properly for your use case. It's rather hard to say without knowing exactly what these fields will contain, what the use case for the fields is, and the query profile of your application.
Also consider using a "side car" index if necessary, i.e. a special index with duplicated data from your main index to solve certain queries or query requirements. You pick which index to search based on the use case, and then return the appropriate data to the user.

Solr: Sort documents by number of fields

Is there a way sort documents in Solr by the number of fields in each document?
The Solr core in question has about 200 different fields, while not every field must be present in every document. To single out datasets that don't contain enough fields to be correct, I'd like to work through a *:* query sorted from the lowest number of fields per document upwards.
I didn't find anything on this specific use case. Most results I found were about the relevance of individual fields, which might not help here given the large number of fields in the core.
It might be possible by sorting on a function query. That function would return a value that would be higher the more fields the doc has. But I am afraid that function would be huge (and slow), as it would need to enumerate all fields in the function.
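To make that concrete, here is a sketch of such a function-query sort with three hypothetical field names; a real core with ~200 fields would need one exists() clause per field, which is exactly why the function gets huge:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class FieldCountSort {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        // One if(exists(...)) term per field -- unwieldy at ~200 fields.
        q.setSort("sum(if(exists(field_a),1,0),"
                + "if(exists(field_b),1,0),"
                + "if(exists(field_c),1,0))", SolrQuery.ORDER.asc);
        System.out.println(q);
    }
}
```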
By far the easiest thing would be to, at index time, add a 'nbFields' field containing the number of fields. Then you can sort easily on that one.
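A sketch of that index-time approach with SolrJ; the nbFields name comes from the answer above, while the core URL and field names are made up for illustration:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NbFieldsExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) { // hypothetical core
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title", "some title");
            doc.addField("body", "some body");
            // Count the fields just before sending the document.
            doc.addField("nbFields", doc.getFieldNames().size());
            solr.add(doc);
            solr.commit();

            // Documents with the fewest fields come first.
            SolrQuery q = new SolrQuery("*:*");
            q.setSort("nbFields", SolrQuery.ORDER.asc);
            System.out.println(solr.query(q).getResults());
        }
    }
}
```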

Storing Inverted Index

I know that inverted indexing is a good way to index words, but what I'm confused about is how search engines actually store them. For example, if the word "google" appears in documents 2, 4, 6, 8 with different frequencies, where should they be stored? Would a database table with a one-to-many relation do any good for storing them?
It is highly unlikely that full-fledged SQL-like databases are used for this purpose. First, it is called an inverted index because it is just an index: each entry is just a reference. It is no accident that non-relational databases and key-value stores came up as a favourite topic in relation to web technology.
You only ever have one way of accessing the data (by query word). That is why it's called an index.
Each entry is a list/array/vector of references to documents, so each element of that list is very small. The only other information stored besides the document ID would be a tf-idf score for each element.
How to use it:
If you have a single query word ("google") then you look up in the inverted index in which documents this word turns up (2,4,6,8 in your example). If you have tf-idf scores, you can sort the results to report the best matching document first. You then go and look up which documents the document IDs 2,4,6,8 refer to, and report their URL as well as a snippet etc. URL, snippets etc are probably best stored in another table or key-value store.
If you have multiple query words ("google" and "altavista"), you look into the II for both query words and you get two lists of document IDs (2,4,6,8 and 3,7,8,11,19). You take the intersection of both lists, which in this case is (8), which is the list of documents in which both query words occur.
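A toy version of that lookup-and-intersect flow, hard-coding the postings from the example; production engines keep postings sorted on disk and compressed, but the intersection logic is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BooleanRetrieval {
    public static void main(String[] args) {
        // Toy inverted index with the postings from the example above.
        Map<String, List<Integer>> index = Map.of(
                "google",    List.of(2, 4, 6, 8),
                "altavista", List.of(3, 7, 8, 11, 19));

        // AND query: intersect the two sorted postings lists.
        List<Integer> a = index.get("google");
        List<Integer> b = index.get("altavista");
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).equals(b.get(j))) {
                result.add(a.get(i));
                i++; j++;
            } else if (a.get(i) < b.get(j)) {
                i++;
            } else {
                j++;
            }
        }
        System.out.println(result); // [8]
    }
}
```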
It's a fair bet that each of the major search engines has its own technology for handling inverted indexes. It's also a moderately good bet that they're not based on standard relational database technology.
In the specific case of Google, it is a reasonable guess that the current technology used is derived from the BigTable technology described in 2006 by Fay Chang et al in Bigtable: A Distributed Storage System for Structured Data. There's little doubt that the system has evolved since then, though.
Traditionally, an inverted index is written directly to a file and stored on disk somewhere. If you want to do Boolean retrieval (either a document contains all the words in the query or not), the postings might look like the following, stored contiguously in the file:
Term_ID_1:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.
Term_ID_2:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.
Term_ID_N:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N
The term id is the id of a term, the frequency is the number of docs the term appears in (in other words, the length of the postings list), and the doc id is a document that contains the term.
Along with the index, you need to know where everything is in the file, so the mappings also have to be stored in another file. For instance, given a term_id, the map needs to return the file position of that term's postings so that it is possible to seek to that position. Since the frequency is recorded in the postings, you know how many doc_ids to read from the file. In addition, there need to be mappings from the IDs to the actual term/doc names.
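A compact sketch of that layout: postings written contiguously to one file, plus an in-memory map from term_id to the byte offset where its postings start. The binary format here is invented for illustration:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PostingsFile {
    public static void main(String[] args) throws Exception {
        Map<Integer, List<Integer>> postings = Map.of(
                1, List.of(2, 4, 6, 8),       // term_id 1
                2, List.of(3, 7, 8, 11, 19)); // term_id 2
        Map<Integer, Long> offsets = new HashMap<>(); // term_id -> file position

        // Write: frequency (postings length) followed by the doc ids.
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream("postings.bin"))) {
            for (Map.Entry<Integer, List<Integer>> e : postings.entrySet()) {
                offsets.put(e.getKey(), (long) out.size());
                out.writeInt(e.getValue().size()); // frequency tells us how much to read
                for (int docId : e.getValue()) out.writeInt(docId);
            }
        }

        // Read: seek to the term's offset, read the frequency, then that many doc ids.
        try (RandomAccessFile in = new RandomAccessFile("postings.bin", "r")) {
            in.seek(offsets.get(2));
            int freq = in.readInt();
            for (int k = 0; k < freq; k++) System.out.print(in.readInt() + " ");
        }
    }
}
```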
If you have a small use case, you may be able to pull this off with SQL by using blobs for the postings list and handling the intersection yourself when querying.
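If you do try the SQL route, the schema can stay trivial; a hypothetical sketch with the postings list serialized into a blob (this assumes the H2 driver is on the classpath, and the one-byte-per-doc-id encoding is purely illustrative; intersection still happens in application code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SqlPostings {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database purely for illustration.
        try (Connection c = DriverManager.getConnection("jdbc:h2:mem:ii")) {
            c.createStatement().execute(
                    "CREATE TABLE postings (term VARCHAR PRIMARY KEY, docs BLOB)");
            try (PreparedStatement ps =
                         c.prepareStatement("INSERT INTO postings VALUES (?, ?)")) {
                ps.setString(1, "google");
                ps.setBytes(2, new byte[] {2, 4, 6, 8}); // toy encoding: one byte per doc id
                ps.execute();
            }
        }
    }
}
```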
Another strategy for a very small use case is to use a term document matrix.
Possible Solution
One possible solution would be to use a positional index. It's basically an inverted index, but we augment it by adding more information. You can read more about it at Stanford NLP.
Example
Say the word "hello" appeared in docs 1 and 3, at positions (3,5,6,200) and (9,10) respectively.
Basic Inverted Index (note there's no way to find word freqs nor their positions)
"hello" => [1,3]
Positional Index (note we not only have freqs for each doc, but we also know exactly where the term appeared in the doc)
"hello" => [1:<3,5,6,200> , 3:<9,10>]
Heads Up
Will your index take up a lot more space now? You bet!
That's why it's a good idea to compress the index. There are multiple options for compressing the postings lists using gap encoding, and even more options for compressing the dictionary using general string compression algorithms.
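A sketch of gap encoding combined with variable-byte compression, a common pairing for postings lists; the byte layout here (7 data bits per byte, high bit marking the last byte of a gap) is one common choice, not the only one:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

public class GapEncoding {
    public static void main(String[] args) {
        // Sorted doc ids -> gaps: 1, 3, 2, 192 instead of 1, 4, 6, 198.
        List<Integer> docIds = List.of(1, 4, 6, 198);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIds) {
            int gap = id - prev;
            prev = id;
            // Variable-byte: 7 bits per byte, high bit set on the last byte.
            while (gap >= 128) {
                out.write(gap & 0x7F);
                gap >>>= 7;
            }
            out.write(gap | 0x80);
        }
        // Small gaps compress to a single byte each.
        System.out.println(out.size() + " bytes for " + docIds.size() + " doc ids");
    }
}
```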
Related Readings
Index compression
Postings file compression
Dictionary compression

solr clustering based on solr fields including geo-spatial location fields

Trying to use Carrot2 for result set clustering. I have a couple of questions with respect to this.
a) Can we cluster the documents in Solr/Lucene based on specific fields in Solr? E.g., cluster them based on name, person name, and geo-distance location (lat, long), with specific field weights?
b) My use case for clustering is not really online; it is more of a batch use case. Given that, do we still have the restriction of 1K max number of results?
Carrot2 performs clustering based only on the natural text of your documents. Person names would probably be too short for meaningful clustering; Carrot2 is not suitable for geo-distance and other numerical data.
The 1k restriction / recommendation is based on the design goal of Carrot2: to cluster small collections of texts (such as search results) fast enough that the process can be done on-line. Carrot2 does well for collections of around 1k documents, but will not scale very well beyond several thousand documents.

Limiting solr spellchecking based on attributes

I wonder if there is a way to limit the spellchecking to just a part of the index.
For example, I have an index containing different products used in different countries.
When a search is performed I limit the Solr query to just return the results for COUNTRY X; however, the suggestions that are returned are not limited to COUNTRY X. Instead I receive suggestions based on the whole index (since I only have one misspelling index).
I believe you can create a separate dictionary for each country to solve this, but here is the twist: I sometimes do a query where I want results back from COUNTRY_X and COUNTRY_Y, and thus also suggestions limited to those 2 countries. That would in turn require a dictionary index of its own, which seems a little too complicated, and the number of dictionary indexes would be large.
I'd try splitting the index per country, i.e. one index for country X and another for country Y. You can easily do this with a multi-core setup. This way each index gets its own dictionary.
When you want to search on multiple countries at once you run a distributed query over the indexes. Distributed support for the spell checking component is only available in trunk as of this writing though.
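Assuming hypothetical per-country cores named country_x and country_y, a distributed query with spellcheck enabled could look like the sketch below; whether the spellcheck component participates in distributed queries depends on your Solr version, as noted above:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MultiCountrySearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical per-country cores on one host.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/country_x").build()) {
            SolrQuery q = new SolrQuery("prodct"); // misspelled on purpose
            // Fan the query out over both country cores...
            q.set("shards", "localhost:8983/solr/country_x,localhost:8983/solr/country_y");
            // ...so spellcheck suggestions are built from those cores only.
            q.set("spellcheck", "true");
            System.out.println(solr.query(q));
        }
    }
}
```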
