DSE SOLR OOMing - solr

We have had a 3-node DSE Solr cluster running and recently added a new core. After about a week of running fine, all of the Solr nodes are now OOMing. They fill up both the JVM heap (set at 8 GB) and the system memory. They are also constantly flushing the memtables to disk.
The cluster is DSE 3.2.5 with RF=3.
Here is the solrconfig from the new core:
http://pastie.org/8973780

How big is your Solr index relative to the amount of system memory available for the OS to cache file system pages? Basically, your Solr index needs to fit in the OS file system cache (the amount of system memory left over after DSE has started but before it has processed any significant amount of data).
Also, how many Solr documents (Cassandra rows) and how many fields (Cassandra columns) are populated on each node? There is no hard limit, but 40 to 100 million is a good guideline as an upper limit - per node.
And, how much system memory and how much JVM heap is available if you restart DSE, but before you start putting load on the server?

For RF=N, where N is the total number of nodes in the cluster or at least the search data center, all of the data will be stored on all nodes, which is okay for smaller datasets, but not okay for larger datasets.
For RF=n, this means that each node will have X/N*n rows or documents, where X is the total number of rows or documents across all column families in the data center and N is the number of nodes. X/N*n is the number that you should try to keep below 100 million. That's not a hard limit - some datasets and hardware might be able to handle substantially more, and some datasets and hardware might not even be able to hold that much. You'll have to discover the number that works best for your own app, but the 40 million to 100 million range is a good start.
In short, the safest estimate is to keep X/N*n under 40 million for Solr nodes. 100 million may be fine for some datasets and beefier hardware.
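A quick way to sanity-check that estimate (a back-of-the-envelope sketch; the example figures below are made up):

```python
def docs_per_node(total_docs: int, num_nodes: int, replication_factor: int) -> float:
    """Estimate the rows/documents stored per node: X / N * n."""
    return total_docs / num_nodes * replication_factor

# Example: 60M total documents on a 3-node cluster with RF=3
# means every node holds all 60M -- well above the 40M guideline.
per_node = docs_per_node(60_000_000, 3, 3)
print(f"{per_node:,.0f}")  # → 60,000,000
```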

As far as tuning, one common source of using lots of heap is heavy use of Solr facets and filter queries.
One technique is to use "DocValues" fields for facets since DocValues can be stored off-heap.
Filter queries can be marked as cache=false to save heap memory.
Also, the various Solr caches can be reduced in size or even set to zero. That's in solrconfig.xml.
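As a concrete sketch of those knobs (the element names are standard Solr; the field name `category` and the cache sizes are placeholders, not recommendations):

```xml
<!-- solrconfig.xml: shrink (or zero out) the Solr caches to save heap -->
<filterCache class="solr.FastLRUCache" size="64" initialSize="64" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="64" initialSize="64" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="64" initialSize="64" autowarmCount="0"/>

<!-- schema.xml: a facet field declared with docValues so it can be held off-heap -->
<field name="category" type="string" indexed="true" stored="false" docValues="true"/>
```

A filter query can be kept out of the filter cache per-request with local params, e.g. fq={!cache=false}crawl_year:2000.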

Related

Solr performance issues

I'm using Solr to handle search on a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is a Solr query used to get some data:
Full Solr request: http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1
When this query runs locally against about 50,000 documents, it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. I know it's naturally going to take much longer on the host, but I wonder if anyone has had the same issue and was able to get faster results. Note that I'm using two shards.
Waiting for your responses.
You're doing a number of complicated things at once: Date ranges, highlighting, faceting, and distributed search. (Non-solrcloud, looks like)
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
Check how many segments you have in Solr: the more segments, the worse the query response time.
If you have not set the merge factor in your solrconfig.xml, you will probably end up with close to 40 segments, which is bad for query response time.
Set your merge factor accordingly; if no new documents are going to be added, set it to 2.
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each index size.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
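A minimal sketch of that section (the values shown are illustrative defaults, not recommendations; note that in Solr 4.x the merge policy configuration largely replaces this knob):

```xml
<!-- solrconfig.xml -->
<mainIndex>
  <!-- lower mergeFactor = fewer, larger segments: faster queries, slower indexing -->
  <mergeFactor>10</mergeFactor>
  <!-- flush a new segment after this many buffered documents -->
  <maxBufferedDocs>1000</maxBufferedDocs>
</mainIndex>
```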
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.

solr multicore vs sharding vs 1 big collection

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
The data in the collection is an amalgamation of records from more than 1,000 customers. The number of documents per customer is around 100,000 on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size.
I have been thinking of splitting the data into multiple cores, one for each customer. This would allow me to manage the smaller collections easily and to create/update them quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Thanks
I had a similar sort of issue with multiple customers and a large amount of indexed data.
I implemented it with version 3.4 by creating a separate core for each customer,
i.e. one core per customer. Creating a core is somewhat like creating indexes or splitting the data, as we do in the case of sharding:
you are splitting the large indexed data into smaller segments.
Whatever search happens is carried out within the smaller indexed segment, so the response time is faster.
I have almost 700 cores created as of now, and it's running fine for me.
So far I have not faced any issues with managing the cores.
I would suggest going with a combination of cores and sharding.
It will help you achieve the following:
Each core can have a different configuration with different behavior, without impacting the other cores.
You can perform actions like update, load, etc. on each core independently.

How to decide Solr cloud shard per node?

We have 16 machines, each with 64 GB RAM and 4 cores. The index size is around 200 GB. Initially we decided to have 64 shards, i.e. 4 shards per node. We came to 4 shards per node because we have 4-core machines (4 cores can process 4 shards at a time). When we tested, the qtime of the queries was pretty high. We re-ran the performance test with fewer shards: once with 32 total shards (2 shards per node) and once with 16 total shards (1 shard per node). The qtime went down drastically (by up to 90%) for 16 shards.
So how is the number of shards per node decided? Is there a formula based on machine config and index volume?
One other thing you will want to review is the type and volume of queries you are sending to Solr. There is no single magic formula that you can use, my best advice would be to just test a few different alternatives to see which one performs the best.
One thing to keep in mind is the JVM size and index size per server. I think it'd be nice if you could cache the entire index in memory on each box.
Additionally, make sure you are testing query response time with the queries you will actually be running, not just made up things. Things like grouping and faceting will make a huge difference.

Which NoSQL Database for Mostly Writing

I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100K, but occasionally much larger or smaller) annotated with some metadata. Query performance is not the top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving to 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify: 1 Gbps is about 125 megabytes per second, i.e. roughly 1 gigabyte every 8 seconds. So you are filling a 1 TB hard drive every couple of hours?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low RAM to Data ratio. You want to keep at least primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.
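That rule of thumb is easy to turn into a rough check (the 1 GB of RAM per 5-10 GB of data ratio is from the answer above; using the 7.5 midpoint is my own assumption):

```python
def min_ram_gb(data_gb: float, gb_data_per_gb_ram: float = 7.5) -> float:
    """Rough MongoDB sizing: ~1 GB of RAM per 5-10 GB of data."""
    return data_gb / gb_data_per_gb_ram

print(round(min_ram_gb(100), 1))  # → 13.3 (GB of RAM for 100 GB of data)
```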
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Large indices defeat the purpose of using an index.
According to your numbers, at ~100 KB per document you are writing about 1,250 documents per second, or roughly 4.5 million documents per hour. Each document needs about 16+ bytes for an index entry: 12 bytes for the ObjectId + 4 bytes for the pointer into the 2 GB file + 1 byte for the pointer to the file + some amount of padding.
Let's say that every index entry needs about 20 bytes; then your index is growing at roughly 90 MB per hour, or over 2 GB per day. And that's just the default _id index.
Within a matter of weeks, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (This behavior is well documented for MongoDB.)
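Those numbers can be reproduced directly (a sketch assuming the figures from this thread: 1 Gbps ingest, ~100 KB documents, ~20 bytes per _id index entry; the 64 GB of RAM is picked purely for illustration):

```python
bytes_per_sec = 1_000_000_000 / 8        # 1 Gbps ingest = 125 MB/s
doc_size = 100 * 1000                    # ~100 KB per document
index_entry = 20                         # ~20 bytes per _id index entry

docs_per_sec = bytes_per_sec / doc_size                 # documents written per second
index_growth_day = docs_per_sec * index_entry * 86_400  # bytes of index added per day

print(round(docs_per_sec))               # → 1250
print(round(index_growth_day / 1e9, 2))  # → 2.16 (GB of _id index per day)

# With, say, 64 GB of RAM, the _id index alone outgrows memory in about a month.
print(round(64e9 / index_growth_day))    # → 30 (days)
```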
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes much faster than reads. That's probably what you're looking for.

Can Apache Solr Handle TeraByte Large Data

I have been an Apache Solr user for about a year. I used Solr for simple search tools, but now I want to use it with 5 TB of data. I assume the 5 TB of data will become 7 TB once Solr indexes it, given the filters I use. I will then add nearly 50 MB of data per hour to the same index.
1- Are there any problems using a single Solr server with 5 TB of data (without shards)?
a- Can the Solr server answer queries in an acceptable time?
b- What is the expected time for committing 50 MB of data to a 7 TB index?
c- Is there an upper limit on index size?
2- What suggestions do you offer?
a- How many shards should I use?
b- Should I use Solr cores?
c- What commit frequency do you suggest? (Is 1 hour OK?)
3- Are there any test results for this kind of large data?
I don't have the 5 TB of data available; I just want to estimate what the result will be.
Note: you can assume that hardware resources are not a problem.
If your sizes are for text, rather than binary files (whose extracted text would usually be much smaller), then I don't think you can expect to do this on a single machine.
This sounds a lot like Loggly, and they use SolrCloud to handle that amount of data.
OK, if they are all rich documents, then the total text size to index will be much smaller (for me it's about 7% of the starting size). Anyway, even with that decreased amount, I think you still have too much data for a single instance.
