Can we increase Apache Solr performance for importing data from MySQL with the DataImportHandler?
Currently I am using:
4-core processor
16 GB RAM
50 GB HDD
1.2 million MySQL records
Right now a full import takes about 20 minutes.
Usually the best approach is to stop using the DIH, which is single-threaded and runs on a single node, so it won't scale easily.
By writing a small, custom indexer in a suitable language (or even by using the bundled post tool), you can run multiple instances of your indexer, index to different nodes (allowing your content to be processed in parallel) and keep multiple threads open to both your backend database and to Solr.
It's important that you don't use explicit commits when indexing from multiple processes or threads, since committing often will kill performance. Use commitWithin instead, telling Solr to automagically issue a commit after x seconds have passed. If you have full control over when all processes/threads have finished, you can issue the commit yourself, i.e. at the end of the indexing process (unless you want documents to become visible while indexing, in which case use commitWithin).
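As a rough illustration, here is a minimal SolrJ sketch of that pattern (a sketch only, assuming SolrJ 6+ for HttpSolrClient.Builder; the core URL, field names, batch size and the 30-second commitWithin window are placeholders to adapt):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // One client per indexer process/thread, pointed at the target core
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {              // in reality, rows pulled from MySQL
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_t", "Title " + i);
                batch.add(doc);
            }
            // commitWithin (30 s here) instead of an explicit commit per batch;
            // issue a single hard commit only when the whole job is done.
            solr.add(batch, 30_000);
        }
    }
}

Running several such indexers in parallel, each reading a different slice of the MySQL table, is what replaces the single-threaded DIH run.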
Related
I am developing an indexing application using Solr. Our current system has two live cores and indexes only one core at a time. It has recently become apparent that the current indexing system will not work long term. One of the live cores needs to be split into two new cores. They will have some overlapping information, but different schemas. Both will need to be updated quickly whenever a new project is ingested into the database.
Is there a way to simultaneously update multiple Solr cores using SolrJ?
All cores are in the same Solr instance.
We are not using SolrCloud.
The core that needs to be split currently contains approx. 2,500,000 documents.
Any help is appreciated.
Since you are indexing many documents on a single core, I would assume the indexing process takes quite some time and uses all system resources (if configured correctly). In that case, parallel indexing on the same instance will not help, as your multiple threads will be sharing the same resources.
But what you could do is index another core on another instance and then replicate each core separately.
When you build a Solr client using SolrJ, it is specific to a core, not to your complete Solr instance. That said, you can have multiple processes updating any number of cores in your application.
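For example, a sketch of what that could look like (hypothetical core names, URLs and fields; assumes SolrJ 6+), updating two cores from separate threads of one application:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiCoreUpdater {
    public static void main(String[] args) throws Exception {
        // One SolrJ client per core; both cores live in the same Solr instance
        SolrClient coreA = new HttpSolrClient.Builder("http://localhost:8983/solr/coreA").build();
        SolrClient coreB = new HttpSolrClient.Builder("http://localhost:8983/solr/coreB").build();

        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> index(coreA, "projectA-"));
        pool.submit(() -> index(coreB, "projectB-"));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        coreA.commit();     // one hard commit per core at the end
        coreB.commit();
        coreA.close();
        coreB.close();
    }

    private static void index(SolrClient solr, String idPrefix) {
        try {
            for (int i = 0; i < 100; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", idPrefix + i);
                solr.add(doc, 60_000);   // commitWithin 60 s rather than a hard commit per document
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

As noted above, whether this actually speeds things up depends on whether the single instance still has spare CPU and I/O to share between the two cores.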
I have a performance concern and want a suggestion on which will be best: multi-core, or multi-instance (on different ports)? Let's look at my case first:
Currently I am running Solr with multiple cores and it is running OK. There is only one issue: sometimes it runs out of heap memory while processing facet fields, and then I have to restart Solr. (To minimize the number of restarts, I start Solr with a large heap: java -Xms1000M -Xmx8000M -jar start.jar)
I have an Amazon EC2 instance with 8 cores at 2.8 GHz, 15 GB RAM and an optimized hard disk.
I have many database tables (about 100) and have to create a different schema for each (which leads to creating a different core for each).
Each table has millions of documents, with 7-9 indexed fields and 10-50 stored fields per document.
My web portals should handle very high traffic (currently I get 10 requests/second, which may increase to 50-100/second). I know Solr can handle that; I mention it only so you know that I care about even the smallest performance issue.
I search Solr from PHP with cURL against a specific core, so searching against different Solr instances is not a problem either.
Question:
As far as I know, Solr handles one request at a time. So I think that if I create multiple instances of Solr and start them on different ports, my web portal can handle more requests at a time (if users search in different tables).
So, what would you suggest: multiple cores in a single Solr instance, or multiple instances with one or two cores each?
Is there any problem with having multiple Solr instances running on different ports?
NOTE: I can/may/will combine less-searched or small cores in one instance and heavy-traffic cores in a separate instance, or two or three heavy-traffic cores in one instance, etc., because creating a separate instance for each table (~100 here) would take too many hardware resources.
As I didn't get any answer for more than a week, and I have also tried many setups with Solr (and read some articles), I want to share my experience as an answer to my own question. It may help future readers. I tried on Server Fault as well with no success.
Solr can handle more than one request at a time.
I tested this by running a long query [qTime=7203, approx. 7 sec] and several small queries after the long one [qTime=30]; Solr responded to the small queries first, even though they were started after the long one.
This point largely answers the question: use a single Solr instance with multiple cores, and just assign plenty of memory to the JVM.
Other points:
1. Each Solr instance requires RAM, so running multiple instances requires more resources, which will be expensive. And if you are using facets or sort fields, you need to allocate even more RAM to each instance.
As you can see, in my case I need to start Solr with a large heap (8 GB). Compare the case of the Danish Web Archive, which uses multiple instances, allocates 9 GB of RAM to each, and has 256 GB of total RAM.
2. You can run multiple instances of Solr on different ports with java -Djetty.port=8984 -jar start.jar. Everything ran OK, but I hit one problem.
While indexing, it may throw a "not enough memory" error and the Solr instance will be killed. So you again need to start the second instance with a large heap, which leads to an even larger RAM requirement.
3. Solr resource requirements and performance problems can be understood here. According to this, a 64-bit environment and 12 GB of RAM are recommended for good performance. Solr optimization is explained here.
We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
We have approx 5 million documents in our index, which take up approx 550 GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15 GB RAM). We don't have a particularly high write throughput: about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25 minutes, during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once one is fully updated. However, this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
Given the bounty reason and the problem mentioned in the question, here is a solution from Solr:
Beginning with version 4.x, Solr has a capability called SolrCloud. Instead of the previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents, and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is elected as the new leader.
All in all, if you want to distribute your indexing process, SolrCloud handles that automatically: there is one leader for each shard, and each leader is responsible for indexing its own shard's documents. When you send a query into the system, there will be some Solr nodes (assuming there are more Solr nodes than shards) that are not responsible for indexing but are ready to answer the query. When you add more replicas you will get faster query results (but it will cause more inbound network traffic when indexing, etc.).
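For what it's worth, indexing into SolrCloud from SolrJ looks roughly like the sketch below. In the 4.x era the class was CloudSolrServer; this sketch assumes a more recent SolrJ (7.x/8.x) with CloudSolrClient, a collection named "mycollection" and a ZooKeeper at localhost:2181, all of which are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.Collections;
import java.util.Optional;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // The client talks to ZooKeeper, discovers the shard leaders, and routes
        // each document to the leader responsible for its shard.
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
            solr.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title_t", "example document");

            solr.add(doc, 10_000);   // commitWithin 10 s; avoid a hard commit per document
        }
    }
}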
For those facing a similar problem: the cause of my problem was that I had too many fields in the document. I used dynamic fields (*_t), and the number of fields grows pretty fast; when it reaches a certain number, it just hogs Solr and commits take forever.
Secondly, I put some effort into profiling, and it turned out most of the time was consumed by calls to String.intern(). It seems the number of fields in the document matters: when that number goes up, String.intern() seems to get slower.
The Solr 4 source no longer appears to use String.intern(), but a large number of fields still kills performance quite easily.
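For illustration only, this is the kind of indexing pattern that makes the field count explode (the attribute map and the *_t suffix convention here are just an example of open-ended dynamic fields):

import org.apache.solr.common.SolrInputDocument;

import java.util.Map;

public class DynamicFieldExplosion {
    // Every distinct attribute key becomes a new "*_t" field in the index,
    // so an open-ended set of keys means an ever-growing number of fields.
    static SolrInputDocument toDoc(String id, Map<String, String> attributes) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        attributes.forEach((key, value) -> doc.addField(key + "_t", value));
        return doc;
    }
}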
I've just installed Solr on my Rails application (using Sunspot).
I want Solr to re-index a couple of columns on one of my tables; the table is pretty big (~50M records).
What is the recommended batch size to use? Currently I'm using 1000 and it has been running for over a day.
Any ideas?
The batch size is not that important; 1000 is probably OK, though I wouldn't go any larger than that. It depends on the size of the documents and how many bytes of text are indexed for each one.
Are you committing after each batch? That can be slow. I load a 23M-document index with a single commit at the end. The documents are small (metadata for books), and it takes about 90 minutes. To get that speed, I needed to use a single SQL query for the load; using any subqueries made it about 10x slower.
I'm using the JDBC support in the DataImportHandler, though I may move to custom code that makes a DB query and submits batches.
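A rough sketch of that custom-code approach, assuming SolrJ plus a JDBC driver on the classpath; the JDBC URL, table, columns, core name and batch size are all placeholders (and for a 50M-row table you would also want a streaming or paged result set):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class JdbcBatchLoader {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost/library", "user", "password");
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/books").build();
             Statement st = db.createStatement()) {

            // One flat query, no subqueries
            ResultSet rs = st.executeQuery("SELECT id, title, author FROM book");

            List<SolrInputDocument> batch = new ArrayList<>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("author", rs.getString("author"));
                batch.add(doc);

                if (batch.size() == 1000) {   // send in batches of 1000, but don't commit yet
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();                    // single commit at the very end
        }
    }
}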
I've heard that the CSV input handler is very efficient, so it might work to dump your data to CSV, then load it with that handler.
I am implementing a search engine with Solr that imports at least 2 million documents per day.
Users must be able to search the imported documents ASAP (near real-time).
I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr shard mode). Each server indexes about 120 million documents and about 220 GB (total 500 GB).
I want to take incremental backups of the Solr index files during updates or searches.
After searching, I found rsync tools for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get an error (files vanished) during updates.
How can I solve this problem?
Note 1: Copying the files is really slow when they are very large, therefore I can't use that approach.
Note 2: Can I prevent index file corruption during updates if Windows crashes, the hardware resets, or any other problem occurs?
You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:
http://host:8080/solr/replication?command=backup&location=/home/jboss/backup
Obviously you could script that with wget+cron.
More details can be found here:
http://wiki.apache.org/solr/SolrReplication
The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.
Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.
Some ideas to work around it:
Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.