I have a Solr server which indexes billions of log entries per day (50k/sec).
For performance reasons I prefer to switch to a different core every 60M documents (around every 20 minutes), since after that point the insert rate drops.
This means I'll have 72 cores per day.
Most searches will be only on the last day, but on some occasions I'll need to search 7-30 days back.
All cores share the exact same schema / settings.
Does Solr know to hold the schema / settings once for all cores, or will it need to load them for each core it opens for querying? (I'm planning to use transient cores and keep only 30 loaded at any given moment.)
I'm specifically concerned about the overhead of loading the schema/settings for tens or hundreds of cores when serving queries.
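As a sketch of that transient-core plan (core and configset names hypothetical, and assuming a Solr version with configset support), every core's core.properties could point at one shared configset so the schema lives on disk only once:

```
# core.properties for one 20-minute core (names hypothetical)
name=logs_day01_core42
configSet=logs_shared     # every core references the same schema/settings
transient=true            # core may be unloaded when the transient cache is full
loadOnStartup=false       # load only when a query or update arrives
```

The transientCacheSize setting in solr.xml would then cap how many cores stay loaded at once (30, in this plan). Note this shares the configuration files, not necessarily the in-memory schema objects, which are normally built per loaded core.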
Related
Can we increase Apache Solr performance when importing data from MySQL with the DataImportHandler?
Currently I am using:
4-core processor
16 GB RAM
50 GB HDD
1.2 million MySQL records
For now a full import of the data takes 20 minutes.
Usually the best way is to drop DIH (which is single-threaded and runs on a single node, so it won't scale easily).
By writing a small, custom indexer in a suitable language (or even by using the bundled post tool), you can run multiple instances of your indexer, index to different nodes (allowing your content to be processed in parallel) and keep multiple threads open to both your backend database and to Solr.
It's important that you don't use explicit commits when indexing from multiple processes or threads, since committing often will kill performance. Use commitWithin instead, telling Solr to automagically issue a commit after x seconds have passed. If you have full control over when all processes / threads have finished, you can issue the commit yourself at the end of the indexing process (unless you want documents to become visible while indexing; in that case use commitWithin).
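A minimal sketch of such a multi-threaded indexer, assuming a local Solr core named `mycore` and JSON documents (both hypothetical); note that every update request relies on commitWithin rather than an explicit commit:

```python
import json
import threading
from itertools import islice
from queue import Queue
from urllib import request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # hypothetical core
COMMIT_WITHIN_MS = 10000  # let Solr commit on its own within 10 seconds
BATCH_SIZE = 1000

def batches(docs, size=BATCH_SIZE):
    """Split an iterable of documents into lists of at most `size`."""
    it = iter(docs)
    while chunk := list(islice(it, size)):
        yield chunk

def post_batch(batch):
    """Send one batch as JSON; commitWithin avoids a hard commit per request."""
    url = "%s?commitWithin=%d" % (SOLR_UPDATE_URL, COMMIT_WITHIN_MS)
    req = request.Request(url, data=json.dumps(batch).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req).read()

def index_parallel(docs, workers=4):
    """Feed batches to a pool of threads, each with its own HTTP connection."""
    q = Queue(maxsize=workers * 2)
    def run():
        while (batch := q.get()) is not None:
            post_batch(batch)
    threads = [threading.Thread(target=run) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batches(docs):
        q.put(batch)
    for _ in threads:
        q.put(None)  # one poison pill per worker
    for t in threads:
        t.join()
```

Pointing different instances of this at different Solr nodes, and using several threads per instance against both MySQL and Solr, is what lets the content be processed in parallel.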
I have a Solr instance running with a couple of cores, each holding between 15 and 25 million documents. Normally the on-disk size of each core's index is around 30-50 GB, but one particular core's index keeps growing until the hard disk is full (reaching 200 GB and more).
When I look at the other indexes, all files are from the current day, but this one core also keeps files from 4-5 days ago (I guess the data gets duplicated on every import).
What could be causing such behavior and what should I look for when debugging it? Thanks.
We are currently using Couchbase for data caching and there is talk of doing cross-data center replication with it. However, we will need up to 1000 documents replicated to multiple locations every second. Documents will be between 2 and 64K each.
Is there anyone out there with XDCR experience who can tell me whether this is even feasible, or if we will have to use other means to replicate this data at that speed? The only "benchmark" in the Couchbase documentation implies that the XDCR rate is only about 100 TPS (149 ms to replicate 11 documents).
The replication rate of XDCR is limited by network bandwidth and latency first, then CPU and disk IO. Assuming you have enough bandwidth between the datacenters and your clusters are provisioned properly, Couchbase will replicate hundreds of thousands of documents per second, or more. It's a pretty simple experiment to run: just set up XDCR between two single-node clusters and use one of the load generator tools that come with Couchbase to create some traffic. (cbworkloadgen in the Couchbase bin folder, or cbc-pillowfight that comes with libcouchbase.)
There are several config settings you can play with to optimize throughput, such as increasing batch size, changing the optimistic replication threshold, etc.
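As a back-of-envelope check using the question's numbers (1000 docs/sec, up to 64 KB each), the payload bandwidth alone is the first thing to verify against the inter-datacenter link:

```python
def xdcr_bandwidth_mbit(docs_per_sec, doc_size_kb):
    """Worst-case payload bandwidth in Mbit/s, ignoring protocol overhead."""
    bytes_per_sec = docs_per_sec * doc_size_kb * 1024
    return bytes_per_sec * 8 / 1_000_000

# 1000 docs/sec at the 64 KB worst case needs roughly half a gigabit:
print(xdcr_bandwidth_mbit(1000, 64))  # 524.288
```

If the link can't sustain that (plus overhead), no amount of tuning the batch size or optimistic replication threshold will reach the target rate.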
I'm using Solr version 4 (with the Spring Data Solr API to index and retrieve documents) and I have to decide which strategy to apply for indexing my documents.
I hesitate between 2 strategies:
Launch a batch periodically to index all documents
Only Index the document when this one has changed
Which strategy is best? Maybe a mix? Or another approach entirely?
I have some ideas about the pros and cons of each, but I don't have much experience with Solr.
It depends on how long indexing all your documents takes and how soon you want your index to be updated.
We have several Solr cores - some have less than 100K very small docs and a full import via data import handler (with optimize=true) runs under 1 minute. We can tolerate delays of up to 15 minutes for them, so we run a full import for this core every 15 min.
Then there are cores at the other extreme with several million docs, each of fairly large size, and full indexing will take several hours to complete. For such cores, we have a changelog table in MySQL which only records the docs that changed and we do an incremental indexing only for those docs every few min.
Finally, there are cores that are in the middle, having about 500K docs of decent size, but on these we need atomic updates every 5 to 10 min for certain fields and full document update for certain docs every few min as well. We run delta imports for these. Full index itself takes about 1.5 to 2 hours to run, which we do nightly.
So the answer to your question really depends on what your requirements are.
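The changelog-driven incremental pass described for the large cores can be sketched like this (the changelog row shape and timestamps are hypothetical):

```python
def docs_to_reindex(changelog, last_run):
    """Return the distinct doc ids touched since last_run.

    `changelog` is an iterable of (doc_id, changed_at) rows, as read from
    a MySQL changelog table (schema hypothetical). Duplicate changes to the
    same doc collapse into a single reindex.
    """
    changed = {}
    for doc_id, changed_at in changelog:
        if changed_at > last_run:
            changed[doc_id] = max(changed_at, changed.get(doc_id, changed_at))
    return sorted(changed)
```

Each run then reindexes only those ids and records the new high-water mark, which is what keeps the per-run cost proportional to the change volume rather than the corpus size.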
We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
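For reference, those options live in the update handler section of solrconfig.xml:

```xml
<!-- solrconfig.xml: the auto-commit settings quoted above -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>500000</maxDocs>   <!-- commit after 500k pending docs -->
    <maxTime>600000</maxTime>   <!-- or after 10 minutes, whichever first -->
  </autoCommit>
</updateHandler>
```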
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once one is fully updated. However, this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
As per the bounty reason and the problem mentioned in the question, here is a solution from Solr:
Beginning with the 4.x versions, Solr has a capability called SolrCloud. Instead of the previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is elected as the new leader.
All in all, if you want to distribute your indexing process, SolrCloud handles that automatically, because there is one leader per shard and each leader is responsible for indexing its own shard's documents. When you send a query into the system, some Solr nodes (assuming there are more Solr nodes than shards) will not be busy indexing and will be ready to answer the query. Adding more replicas gives you faster query results (but causes more inbound network traffic when indexing, etc.).
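Moving from master/slave to SolrCloud means creating a collection through the Collections API; a minimal sketch of building that request (collection name and counts are hypothetical):

```python
from urllib.parse import urlencode

def create_collection_url(base, name, num_shards, replication_factor):
    """Build a Collections API CREATE request URL for a SolrCloud cluster."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return "%s/admin/collections?%s" % (base, params)

# e.g. 4 shards (4 indexing leaders) with one extra replica each for queries:
url = create_collection_url("http://localhost:8983/solr", "logs", 4, 2)
print(url)
```

The numShards/replicationFactor trade-off mirrors the point above: more shards spread indexing, more replicas spread query load.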
For those facing a similar problem: the cause in my case was having too many fields in the document. I used dynamic fields (*_t), and the number of fields grew pretty fast; once it reached a certain number, it just hogged Solr and commits would take forever.
Secondly, I spent some effort profiling, and it turned out most of the time was consumed by String.intern() calls. The number of fields in the document seems to matter: as it goes up, String.intern() gets slower.
The Solr 4 source appears to no longer use String.intern(), but a large number of fields still kills performance quite easily.
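A quick way to spot this kind of dynamic-field explosion before it reaches Solr is to count the distinct field names across a sample of documents (suffix `_t` as in the answer above; the threshold is whatever your deployment tolerates):

```python
def distinct_dynamic_fields(docs, suffix="_t"):
    """Count distinct dynamic field names (e.g. *_t) across a doc sample."""
    names = set()
    for doc in docs:
        for field in doc:
            if field.endswith(suffix):
                names.add(field)
    return len(names)

sample = [{"title_t": "a", "body_t": "b"}, {"title_t": "c", "id": 1}]
print(distinct_dynamic_fields(sample))  # 2
```

A count that keeps climbing run over run is the warning sign described here, independent of the total document count.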