Index size 400%+ growth: normal Solr instance vs SolrCloud

I'm experimenting with different infrastructure approaches and I'm surprised to notice the following.
I've indexed 1.3M documents (all fields indexed, stored, and some shingle-analyzed) using the DataImportHandler via a SQL query in Solr 4.4.
Approach 1: single Solr instance
Indexing time: ~10 minutes
Size of "index" folder: 1.6GB
Approach 2: SolrCloud with two index slices
Indexing time: ~11 minutes
Size of "index" folders: 1.6GB + 1.5GB = 3.1GB
Each index slice has around 0.65M documents, adding up to the original total count, which is expected.
Approach 3: SolrCloud with two shards (1 leader + 1 replica)
Indexing time: ~30 minutes
Size of "index" folders: leader (4.6GB) + replica (3.8GB) = 8.4GB (expected this to be 1.6GB × 2, but it is ~1.6GB × 5.25)
I've followed the SolrCloud tutorial.
I realize that there's some metadata (please correct me if I'm wrong), like the term dictionary, which has to exist in all the instances irrespective of slicing (partitioning) or sharding (replication).
However, approaches 2 and 3 show drastic growth (400%+) in the final index size.
Could you please provide some insight?

From the overall index size I suppose your documents are quite small. That is why the relative size of the term dictionary is big: for that number of documents it is pretty similar in each slice, so you have it twice. Therefore 1.6GB turns into 3.1GB.
As for approach 3: are you sure that it's a clean test? Any chance you have included the transaction log in the size? What happens if you optimize?
You can check what exactly adds to the size by looking at the index file extensions.
See here:
https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#file-names
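As a quick way to apply that page, you can sum the index files grouped by extension to see which structure dominates (e.g. .tim is the term dictionary, .fdt is stored fields). A minimal sketch; the index directory path is whatever your core's data/index folder is:

```python
import os
from collections import defaultdict

def size_by_extension(index_dir):
    """Sum the sizes of the files in a Lucene index directory, grouped by
    extension (.tim = term dictionary, .fdt = stored fields, ...),
    largest first."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            ext = os.path.splitext(name)[1] or name  # extension-less files keep their name
            totals[ext] += os.path.getsize(path)
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

Comparing the per-extension totals between the single-instance and SolrCloud setups should show whether the growth comes from the term dictionary, stored fields, or something else (like a transaction log kept elsewhere).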

Related

Solr: Document size inexplicably large

I updated to Solr 8.4.0 (from 6.x) on a test server and reindexed (this is an index of a complicated Moodle system, mainly lots of very small documents). It worked initially, but later ran out of disk space so I deleted everything and tried indexing a smaller subset of the data, but it still ran out of disk space.
Looking at the segment info chart, the first segment seems reasonable:
Segment _2a1u:
#docs: 603,564
#dels: 1
size: 5,275,671,226 bytes
age: 2020-11-25T22:10:05.023Z
source: merge
That's 8,740 bytes per document - a little high but not too bad.
Segment _28ow:
#docs: 241,082
#dels: 31
size: 5,251,034,504 bytes
age: 2020-11-25T18:33:59.636Z
source: merge
21,781 bytes per document
Segment _2ajc:
#docs: 50,159
#dels: 1
size: 5,222,429,424 bytes
age: 2020-11-25T23:29:35.391Z
source: merge
104,117 bytes per document!
And it gets worse, looking at the small segments near the end:
Segment _2bff:
#docs: 2
#dels: 0
size: 23,605,447 bytes
age: 2020-11-26T01:36:02.130Z
source: flush
None of our search documents will have anywhere near that much text.
On our production Solr 6.6 server, which has similar but slightly larger data (some of it gets replaced with short placeholder text in the test server for privacy reasons) the large 5GB-ish segments contain between 1.8 million and 5 million documents.
Does anyone know what could have gone wrong here? We are using Solr Cell/Tika and I'm wondering if somehow it started storing the whole files instead of just the extracted text?
It turns out that a 10MB English-language PowerPoint file, with mostly pictures and only about 50 words of text in the whole thing, is indexed (with metadata turned off) as nearly half a million terms, most of which are Chinese characters. Presumably, Tika has incorrectly extracted some of the binary content of the PowerPoint file as if it were text.
I was only able to find this by reducing the index by trial and error until there were only a handful of documents in it (3 documents but using 13MB of disk space); then Luke's 'Overview' tab let me see that one field (called solr_filecontent in my schema), which contains the indexed Tika results, has 451,029 terms. Then, clicking 'Show top terms' shows a bunch of Chinese characters.
I am not sure if there is a less laborious way than trial and error to find this problem, e.g. whether there's any way to find documents that have a large number of terms associated with them. (Obviously, it could be a very large PDF or something that legitimately has that many terms, but in this case, it isn't.) This would be useful because even if there are only a few such instances across our system, they could contribute quite a lot to the overall index size.
As for resolving the problem:
1 - I could hack something to stop it indexing this individual document (which is used repeatedly in our test data, otherwise I probably wouldn't have noticed), but presumably the problem may occur in other cases as well.
2 - I considered excluding the terms in some way but we do have a small proportion of content in various languages, including Chinese, in our index so even if there is a way to configure it to only use ASCII text or something, this wouldn't help.
3 - My next step was to try different versions to see what happens with the same file, in case it is a bug in specific Tika versions. However, I've tested with a range of Solr versions - 6.6.2, 8.4.0, 8.6.3, and 8.7.0 - and the same behaviour occurs on all of them.
So my conclusion from all this is that:
Contrary to my initial assumption that this was related to the version upgrade, it isn't actually worse now than it was in the older Solr version.
In order to get this working now I will probably have to do a hack to stop it indexing that specific PowerPoint file (which occurs frequently on our test dataset). Presumably the real dataset wouldn't have too many files like that otherwise it would already have run out of disk space there...

Why am I sometimes getting an OOM when fetching all documents from an 800MB index with 8GB of heap?

I need to refresh an index governed by Solr 7.4. I use SolrJ to access it on a 64-bit Linux machine with 8 CPUs and 32GB of RAM (8GB of heap for the indexing part and 24GB for the Solr server). The index to be refreshed is around 800MB in size and counts around 36k documents (according to Luke).
Before starting the indexing process itself, I need to "clean" the index and remove the documents that do not match an actual file on disk (e.g. a document had been indexed previously and has moved since then, so the user won't be able to open it if it appears on the results page).
To do so I first need to get the list of documents in the index:
final SolrQuery query = new SolrQuery("*:*"); // content fields are not loaded to reduce memory footprint
query.addField(PATH_DESCENDANT_FIELDNAME);
query.addField(PATH_SPLIT_FIELDNAME);
query.addField(MODIFIED_DATE_FIELDNAME);
query.addField(TYPE_OF_SCANNED_DOCUMENT_FIELDNAME);
query.addField("id");
query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index, not only the first ones

SolrDocumentList results = this.getSolrClient()
        .query(query)
        .getResults(); // this line sometimes gives OOM
When the OOM appears on the production machine, it appears during that "index cleaning" part and the stack trace reads :
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at org.noggit.CharArr.resize(CharArr.java:110)
at org.noggit.CharArr.reserve(CharArr.java:116)
at org.apache.solr.common.util.ByteUtils.UTF8toUTF16(ByteUtils.java:68)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:868)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:541)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:305)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:555)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:307)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:200)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:274)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:178)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:50)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:614)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
I've already removed the content fields from the query because there were already OOMs, so I thought that only retrieving "small" data would avoid OOMs, but they still occur. Moreover, when I started the project for the customer we had only 8GB of RAM (so a heap of 2GB), then we increased it to 20GB (heap of 5GB), and now to 32GB (heap of 8GB), and the OOM still appears, although the index is not that large compared to what is described in other SO questions (featuring millions of documents).
Please note that I cannot reproduce it on my less powerful dev machine (16GB of RAM, so 4GB of heap) after copying the 800MB index from the production machine to my dev machine.
So to me there could be a memory leak. That's why I followed the NetBeans post on memory leaks on my dev machine with the 800MB index. From what I see, I guess there is a memory leak, since run after run the number of surviving generations keeps increasing during the "index cleaning" (steep lines in the profiler graph).
What should I do? 8GB of heap is already a huge amount compared to the index characteristics, so increasing the heap does not seem to make sense, because the OOM only appears during the "index cleaning", not while actually indexing large documents, and it seems to be caused by the surviving generations, doesn't it? Would creating a query object and then applying getResults on it help the garbage collector?
Is there another method to get all the document paths? Or maybe retrieving them chunk by chunk (pagination) would help, even for that small number of documents?
Any help appreciated
After a while I finally came across this post. It exactly describes my issue:
An out of memory (OOM) error typically occurs after a query comes in with a large rows parameter. Solr will typically work just fine up until that query comes in.
So they advise (emphasis mine):
The rows parameter for Solr can be used to return more than the default of 10 rows. I have seen users successfully set the rows parameter to 100-200 and not see any issues. However, setting the rows parameter higher has a big memory consequence and should be avoided at all costs.
And this is what I see while retrieving 100 results per page:
The number of surviving generations has decreased dramatically, although the garbage collector's activity is much more intensive and computation time is way greater. But if this is the cost of avoiding OOM, this is OK (the program loses some seconds per index update, which can last several hours)!
Increasing the number of rows to 500 already makes the memory leak happen again (the number of surviving generations increases).
Please note that setting the row number to 200 did not cause the number of surviving generations to increase a lot (I did not measure it), but it did not perform much better in my test case (less than 2%) than the "100" setting.
So here is the code I used to retrieve all documents from an index (from Solr's wiki):
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
TL;DR: Don't use too large a number for query.setRows, i.e. not greater than 100-200, as a higher number is very likely to cause an OOM.

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr to run search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of the 4 results is a document containing only 2 of those words, multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct; the scoring algorithm is called TF-IDF.
So, in your case, the order can be explained by the TF-IDF formula.
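A sketch of that formula, in the form of Lucene's classic (TF-IDF) similarity, with query norms and boosts omitted for simplicity:

```latex
\operatorname{score}(q,d) \;=\; \operatorname{coord}(q,d)\,\sum_{t \in q} \operatorname{tf}(t,d)\cdot \operatorname{idf}(t)^2,
\qquad
\operatorname{tf}(t,d)=\sqrt{\operatorname{freq}(t,d)},
\qquad
\operatorname{idf}(t)=1+\log\frac{N}{\operatorname{docFreq}(t)+1}
```

Here N is the total number of documents; coord(q,d) rewards documents matching more query terms, but a high term frequency on a rare term can still outweigh it, which is what produces the ordering observed above.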
One possible solution is to ignore the TF-IDF score and count each hit in a document as one; then a document with 5 matches will simply get score 5, a document with 4 matches will get 4, etc. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
Another solution, which will require some scripting, is something like this: first you issue a query requiring all 5 elements, e.g. +Julian +Cribb +EPA +peak +oil; then you do the same for every combination of 4 elements out of 5 (if I'm not mistaken, that requires 5 additional queries), and so forth, until you reach queries with a single mandatory clause. Then you will have the full results, and you only need to normalise them or simply concatenate them, if you decide that 5-match docs are always better than 4-match docs. Cons of this solution: a lot of queries that need to be run programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of matched terms.
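The combination scheme above can be sketched quickly; this generates one query string per subset of the terms, from all five down to a single mandatory clause (the term list is just the example from the answer):

```python
from itertools import combinations

def build_queries(terms):
    """Build one +term query per combination, from the full term set down
    to single-term queries, in decreasing combination size."""
    queries = []
    for k in range(len(terms), 0, -1):
        for combo in combinations(terms, k):
            queries.append(" ".join("+" + t for t in combo))
    return queries

queries = build_queries(["Julian", "Cribb", "EPA", "peak", "oil"])
print(queries[0])    # +Julian +Cribb +EPA +peak +oil
print(len(queries))  # 31 (2^5 - 1 queries in total)
```

For 5 terms this is 31 queries, so running them programmatically (as the answer notes) is really the only practical option.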

Google App Engine - Search API index growth

I would like to know how I can estimate the growth (how much the size increases in a period of time) of an App Engine Search API (FTS) index, based on the number of entities inserted and the amount of information. For this I would like to know basically how the index size is calculated (what it depends on). Specifically:
When inserting new entities, is the growth (size) influenced by the number of previously existing entities (i.e. is the growth exponential)? For example, if I have 1,000 entities and I insert 10, the index will grow by X bytes. But if I have 100,000 entities and insert 10, will it increase by X or much more than X (exponentially, let's say 10×X)?
Does the number of fields (properties) influence the size exponentially? For example, if I have entity A with 2 fields and entity B with 4 fields (let's say identical in values, for mathematical simplicity), will the size increase, when adding entity B, be twice that of entity A or much more than that?
What other means can I use to find statistical information? Are there other tools in the App Engine Cloud Console, or can I do this programmatically?
Thank you.
You can check the size of a given index by running the code below.
import logging
from google.appengine.api import search

# Log the current storage usage of every index
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)

# Insert a batch of items, then check the size again
amount_of_items_to_add = 100
for x in range(amount_of_items_to_add):
    search_api_insert(data)  # placeholder for your own insert routine

# Rerun the loop to see how much the size increased
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)
This code is obviously not a complete working example, but you should be able to build a simple method that takes some data, inserts it into the Search API, and returns how much the used storage increased.
I have run a number of tests for different numbers of entities and different numbers of indexed properties per entity, and it seems that the growth of the index size reported by the API is not exponential; it is linear.
But the most interesting fact to know is that although the size reported is almost real-time, after deleting documents from the index it may take 12, 24, or even 36 hours to update.

SOLR: size of term dictionary and how to prune it

Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in Solr, or programmatically in SolrJ?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
You can get this information on the Schema Browser page by clicking "Load Term Info", via the Luke request handler (https://wiki.apache.org/solr/LukeRequestHandler), and also via the Stats component (https://cwiki.apache.org/confluence/display/solr/The+Stats+Component).
To prune the index, you could facet on the field and get the terms with low frequency. Then fetch the matching docs and update each document without that term (this can be difficult because it depends on the analyzers and tokenizers of your field). Alternatively, you can use the Lucene libraries to open the index and do it programmatically.
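A sketch of the first step, finding the rare terms, using Solr's TermsComponent over HTTP. The core URL, core name, and field name here are assumptions for illustration, and the /terms handler must be enabled in your solrconfig.xml:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

SOLR_CORE = "http://localhost:8983/solr/mycore"  # assumed host and core name

def parse_terms(flat):
    """The TermsComponent returns a flat JSON list alternating term and
    document frequency: [term1, freq1, term2, freq2, ...]."""
    return dict(zip(flat[::2], flat[1::2]))

def rare_terms(field, max_doc_freq=1, limit=1000):
    """List terms of `field` whose document frequency is at most
    max_doc_freq, via Solr's /terms handler (TermsComponent)."""
    params = urlencode({
        "terms": "true",
        "terms.fl": field,
        "terms.maxcount": max_doc_freq,  # upper bound on document frequency
        "terms.limit": limit,
        "terms.sort": "index",           # walk the term dictionary in order
        "wt": "json",
    })
    with urlopen(SOLR_CORE + "/terms?" + params) as resp:
        data = json.load(resp)
    return parse_terms(data["terms"][field])
```

To walk a large term dictionary in full you would page through it with terms.lower set to the last term returned; the list of rare terms then tells you which documents to reindex without those tokens.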
You can check the number and distribution of your terms in the Admin UI under the collection's Schema Browser screen; you need to click Load Term Info.
Or you can use Luke, which allows you to look inside the Lucene index.
It is not clear what you mean by 'remove'. You can add them to the stopwords in the analyzer chain, for example, if you want to avoid indexing them.
