Why am I sometimes getting an OOM when getting all documents from an 800MB index with 8GB of heap? - solr

I need to refresh an index governed by Solr 7.4. I use SolrJ to access it on a 64-bit Linux machine with 8 CPUs and 32GB of RAM (8GB of heap for the indexing part and 24GB for the Solr server). The index to be refreshed is around 800MB in size and contains around 36k documents (according to Luke).
Before starting the indexing process itself, I need to "clean" the index and remove the documents that no longer match an actual file on disk (e.g. a document was indexed previously and the file has moved since then, so the user would not be able to open it if it appeared on the results page).
To do so I first need to get the list of documents in the index:
final SolrQuery query = new SolrQuery("*:*"); // Content fields are not loaded to reduce memory footprint
query.addField(PATH_DESCENDANT_FIELDNAME);
query.addField(PATH_SPLIT_FIELDNAME);
query.addField(MODIFIED_DATE_FIELDNAME);
query.addField(TYPE_OF_SCANNED_DOCUMENT_FIELDNAME);
query.addField("id");
query.setRows(Integer.MAX_VALUE); // we want ALL documents in the index not only the first ones
SolrDocumentList results = this.getSolrClient()
        .query(query)
        .getResults(); // This line sometimes gives OOM
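For context, the cleanup that follows this query is roughly the sketch below; PATH_DESCENDANT_FIELDNAME and getSolrClient() are the same as above, while the file-existence check and the batch delete are just an illustration of the processing, not the exact production code:

final List<String> idsToDelete = new ArrayList<>();
for (final SolrDocument doc : results) {
    final String path = (String) doc.getFieldValue(PATH_DESCENDANT_FIELDNAME);
    // The document no longer matches a file on disk, so it must be removed
    if (path == null || !Files.exists(Paths.get(path))) {
        idsToDelete.add((String) doc.getFieldValue("id"));
    }
}
if (!idsToDelete.isEmpty()) {
    this.getSolrClient().deleteById(idsToDelete);
    this.getSolrClient().commit();
}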
When the OOM appears on the production machine, it appears during that "index cleaning" part and the stack trace reads:
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at org.noggit.CharArr.resize(CharArr.java:110)
at org.noggit.CharArr.reserve(CharArr.java:116)
at org.apache.solr.common.util.ByteUtils.UTF8toUTF16(ByteUtils.java:68)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:868)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:541)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:305)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:555)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:307)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:200)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:274)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:178)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:50)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:614)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
I've already removed the content fields from the query because there were already OOMs, so I thought fetching only "small" data would avoid OOMs, but they still occur. Moreover, when I started the project for the customer we had only 8GB of RAM (so a 2GB heap); we then increased it to 20GB (5GB heap), and now to 32GB (8GB heap), and the OOM still appears, even though the index is not that large compared to what is described in other SO questions (featuring millions of documents).
Please note that I cannot reproduce it on my less powerful dev machine (16GB RAM, so a 4GB heap), even after copying the 800MB index from the production machine.
So to me it looks like there could be a memory leak. That's why I followed the NetBeans post on memory leaks on my dev machine with the 800MB index. From what I see, I guess there is a memory leak, since run after run the number of surviving generations keeps increasing during the "index cleaning" (steep lines in the graph below):
What should I do? 8GB of heap is already a huge amount compared to the index characteristics, so increasing the heap further does not seem to make sense, especially since the OOM only appears during the "index cleaning", not while actually indexing large documents, and it seems to be caused by the surviving generations, doesn't it? Would creating a query object and then applying getResults on it help the garbage collector?
Is there another method to get all document paths? Or would retrieving them chunk by chunk (pagination) help, even for that small number of documents?
Any help appreciated

After a while I finally came across this post. It describes my issue exactly:
An out of memory (OOM) error typically occurs after a query comes in with a large rows parameter. Solr will typically work just fine up until that query comes in.
So they advise (emphasis mine):
The rows parameter for Solr can be used to return more than the default of 10 rows. I have seen users successfully set the rows parameter to 100-200 and not see any issues. However, setting the rows parameter higher has a big memory consequence and should be avoided at all costs.
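In practice that means paging through the results instead of asking for everything at once. The simplest form is start/rows paging; here is a rough sketch reusing the fields from my query above (the cursor-based loop shown further below is the better option for deep paging):

final SolrQuery query = new SolrQuery("*:*");
query.addField(PATH_DESCENDANT_FIELDNAME);
query.addField("id");
query.setRows(100); // small pages instead of Integer.MAX_VALUE

int start = 0;
long numFound;
do {
    query.setStart(start);
    final SolrDocumentList page = this.getSolrClient().query(query).getResults();
    numFound = page.getNumFound();
    // process the current page of documents here
    start += page.size();
} while (start < numFound);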
And this is what I see while retrieving 100 results per page:
The number of surviving generations has decreased dramatically, although the garbage collector's activity is much more intensive and the computation time is much greater. But if this is the cost of avoiding OOM, that's OK (the program loses a few seconds per index update, which can last several hours)!
Increasing the number of rows to 500 already makes the memory leak happen again (the number of surviving generations keeps increasing):
Please note that setting the row count to 200 did not cause the number of surviving generations to increase much (I did not measure it), but it did not perform much better in my test case (less than 2%) than the "100" setting:
So here is the code I used to retrieve all documents from an index (from Solr's wiki):
// some_query, r (the page size, e.g. 100), solrServer and doCustomProcessingOfResults
// are placeholders to be replaced by your own query, client and processing code.
// Cursor-based paging requires a sort on the uniqueKey field ("id" here).
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    // When the cursor stops moving, the last page has been reached
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
TL;DR: Don't use too large a value for query.setRows, i.e. nothing greater than 100-200, as a higher value is very likely to cause an OOM.

Related

Solr: Document size inexplicably large

I updated to Solr 8.4.0 (from 6.x) on a test server and reindexed (this is an index of a complicated Moodle system, mainly lots of very small documents). It worked initially, but later ran out of disk space so I deleted everything and tried indexing a smaller subset of the data, but it still ran out of disk space.
Looking at the segment info chart, the first segment seems reasonable:
Segment _2a1u:
#docs: 603,564
#dels: 1
size: 5,275,671,226 bytes
age: 2020-11-25T22:10:05.023Z
source: merge
That's 8,740 bytes per document - a little high but not too bad.
Segment _28ow:
#docs: 241,082
#dels: 31
size: 5,251,034,504 bytes
age: 2020-11-25T18:33:59.636Z
source: merge
21,781 bytes per document
Segment _2ajc:
#docs: 50,159
#dels: 1
size: 5,222,429,424 bytes
age: 2020-11-25T23:29:35.391Z
source: merge
104,117 bytes per document!
And it gets worse, looking at the small segments near the end:
Segment _2bff:
#docs: 2
#dels: 0
size: 23,605,447 bytes
age: 2020-11-26T01:36:02.130Z
source: flush
None of our search documents will have anywhere near that much text.
On our production Solr 6.6 server, which has similar but slightly larger data (some of it gets replaced with short placeholder text in the test server for privacy reasons), the large 5GB-ish segments contain between 1.8 million and 5 million documents.
Does anyone know what could have gone wrong here? We are using Solr Cell/Tika and I'm wondering if somehow it started storing the whole files instead of just the extracted text?
It turns out that a 10MB English-language PowerPoint file, with mostly pictures and only about 50 words of text in the whole thing, is indexed (with metadata turned off) as nearly half a million terms, most of which are Chinese characters. Presumably, Tika has incorrectly extracted some of the binary content of the PowerPoint file as if it were text.
I was only able to find this by reducing the index by trial and error until there were only a handful of documents in it (3 documents, yet using 13MB of disk space); Luke's 'Overview' tab then let me see that one field (called solr_filecontent in my schema), which contains the indexed Tika results, has 451,029 terms. Clicking 'Show top terms' shows a bunch of Chinese characters.
I am not sure if there is a less laborious way than trial and error to find this problem, e.g. whether there is any way to find documents that have a large number of terms associated with them. (Obviously, it could be a very large PDF or something that legitimately has that many terms, but in this case it isn't.) This would be useful because even if there are only a few such instances across our system, they could contribute quite a lot to the overall index size.
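For what it's worth, if term vectors were enabled on that field (termVectors="true" in the schema, which is not the default and is only an assumption here), a small Lucene program could scan the index and report documents with an unusually high number of unique terms; the index path and the 100,000 threshold below are placeholders:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class BigDocFinder {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/core/data/index")))) {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                // Returns null when no term vector is stored for this field/document
                Terms termVector = reader.getTermVector(docId, "solr_filecontent");
                if (termVector != null && termVector.size() > 100_000) {
                    System.out.println("doc " + docId + " has " + termVector.size() + " unique terms");
                }
            }
        }
    }
}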
As for resolving the problem:
1 - I could hack something together to stop it indexing this individual document (which is used repeatedly in our test data, otherwise I probably wouldn't have noticed), but presumably the problem may occur in other cases as well.
2 - I considered excluding the terms in some way, but we do have a small proportion of content in various languages, including Chinese, in our index, so even if there were a way to configure it to only use ASCII text or something, this wouldn't help.
3 - My next step was to try different versions to see what happens with the same file, in case it is a bug in specific Tika versions (a standalone Tika check is sketched below). However, I've tested with a range of Solr versions - 6.6.2, 8.4.0, 8.6.3, and 8.7.0 - and the same behaviour occurs on all of them.
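Here is the kind of standalone check I mean (a sketch; the file name is a placeholder and it assumes tika-core and tika-parsers on the classpath):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get("problem.pptx"))) {
            parser.parse(in, handler, metadata);
        }
        // About 50 words of real text should come out as a few hundred characters,
        // not hundreds of thousands
        System.out.println("Extracted " + handler.toString().length() + " characters");
    }
}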
So my conclusion from all this is that:
Contrary to my initial assumption that this was related to the version upgrade, it isn't actually worse now than it was in the older Solr version.
In order to get this working now, I will probably have to do a hack to stop it indexing that specific PowerPoint file (which occurs frequently in our test dataset). Presumably the real dataset wouldn't have too many files like that, otherwise it would already have run out of disk space there...

Redshift many small nodes vs less numbers of bigger nodes

Recently I have been facing cluster restarts (outside the maintenance window, at arbitrary times) in AWS Redshift that are triggered from the AWS end. They have not been able to identify the exact root cause of these reboots. The error the AWS team captured is "out of object memory".
In the meantime, I am trying to scale up the cluster to avoid this "out of object memory" error (as a blind try). Currently I am using the ds2.xlarge node type, but I am not sure which of the options below I should choose:
Many smaller nodes (increase number of nodes in ds2.xlarge)
Fewer larger nodes (change to ds2.8xlarge and have fewer nodes, each with more capacity)
Has anyone faced a similar issue in Redshift? Any advice?
Given this configuration, for better performance in this case you should opt for the ds2.8xlarge node type.
One ds2.xlarge node has 13 GB of RAM and 2 slices to perform your workload, compared with a ds2.8xlarge node, which has 244 GB of RAM and 16 slices.
Even if you choose 8 ds2.xlarge nodes, you get at most 104 GB of memory, against 244 GB in a single ds2.8xlarge node.
So you should go with the ds2.8xlarge node type to handle the memory issue, and you also get a large amount of storage.

Solr Memory Usage - How to reduce memory footprint for solr

Q - I am forced to set Java Xmx as high as 3.5g for my Solr app. If I keep this low, my CPU hits 100% and the response time for indexing increases a lot. I have also hit an OOM error when this value is low.
Is this too high? If so, can I reduce this?
Machine Details
4 G RAM, SSD
Solr App Details (Standalone solr app, no shards)
num. of Solr Cores = 5
Index Size - 2 g
num. of Search Hits per sec - 10 [IMP - All search queries have faceting..]
num. of times Re-Indexing per hour per core - 10 (it may happen at the same moment for all 5 cores)
Query Result Cache, Document cache and Filter Cache are all default size - 4 kb.
top stats -
VIRT RES SHR S %CPU %MEM
6446600 3.478g 18308 S 11.3 94.6
iotop stats
DISK READ DISK WRITE SWAPIN IO>
0-1200 K/s 0-100 K/s 0 0-5%
Try either increasing the RAM size or reducing the frequency of index rebuilds. If you are rebuilding the index 10 times an hour, then Solr may not be the right choice. Solr tries to give faster results by keeping the index files in the OS memory, and it will always use more than 90% of physical memory.

Index size 400%+ growth: Normal Solr instance vs SolrCloud

I'm experimenting with different infrastructure approaches and I'm surprised to notice the following.
I've indexed 1.3M documents (all fields indexed, stored, and some shingle-analyzed) using DataImportHandler via a SQL query in Solr 4.4.
Approach 1: Single Solr instance
Indexing time: ~10 minutes
Size of "index" folder: 1.6GB
Approach 2: SolrCloud with two index slices.
Indexing time: ~11 minutes
Size of "index" folders: 1.6GB + 1.5GB = 3.1GB
Each index slice has around 0.65M documents, adding up to the original total count, which is expected.
Approach 3: SolrCloud with two shards (1 leader + 1 replica)
Indexing time: ~30 minutes
Size of "index" folders: Leader (4.6GB), replica (3.8GB) = 8.4GB (expected this to be 1.6gb * 2, but it is ~1.6gb*5.25)
I've followed the SolrCloud tutorial.
I realize that there's some meta-data (please correct me if I'm wrong) like term dictionary, etc. which has to exist in all the instances irrespective of slicing (partition) or sharding (replication).
However, approach 2 and 3 show drastic growth (400%) in the final index size.
Could you please provide some insight?
From the overall index size I suppose your documents are quite small. That is why the relative size of the term dictionary is big - for that number of documents it is pretty similar in each slice, so you have it twice. That is how 1.6GB turns into 3.1GB.
As for Approach 3 - are you sure it's a clean test? Any chance you have included the transaction log in the size? What happens if you optimize?
You can check what exactly adds to the size by checking the index files extensions.
See here:
https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#file-names
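One quick way to do that check is to sum the file sizes per extension in the index directory and then map the extensions to formats using the link above; the sketch below is not part of the original answer, and the index path is a placeholder:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IndexSizeByExtension {
    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get("/var/solr/data/collection1/data/index"); // placeholder
        Map<String, Long> sizeByExt;
        try (Stream<Path> files = Files.list(indexDir)) {
            sizeByExt = files.filter(Files::isRegularFile)
                             .collect(Collectors.groupingBy(
                                 p -> {
                                     String name = p.getFileName().toString();
                                     int dot = name.lastIndexOf('.');
                                     return dot >= 0 ? name.substring(dot) : name;
                                 },
                                 Collectors.summingLong(p -> p.toFile().length())));
        }
        // Print the largest contributors first
        sizeByExt.entrySet().stream()
                 .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                 .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue() + " bytes"));
    }
}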

Solr setting physical memory less than index size and out of memory

Is there a setting in Solr 4.3.1 for the case where my memory is 15GB and my index size is 75GB? I always get an out-of-memory exception. I am updating the index in chunks of 50,000 documents, and when my index size reached 14.8GB I got an out-of-memory exception. I turned off all the caching in solrconfig.xml.
Is there any way I can push all of them to the index, or is it impossible in Solr? I also allocated 4GB to the JVM, which seems to be fine.
Let me know if there are any options.
By default, Solr 4.3 will mmap the index into memory; you may try another DirectoryFactory, such as SimpleFSDirectoryFactory.
See the DirectoryFactory section in solrconfig.xml.
