Solr indexing strategy

We have millions of documents in MongoDB that we are looking to index in Solr. Obviously, the first time we do this we need to index all the documents.
But after that, we should only need to index documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from a cron job? What do addDocument, commit, and optimize actually do? (I am using Apache_Solr_Service.)

If you're using Solr 3.x you can forget about optimize, which merges all segments into one big segment. Commit makes changes visible to new IndexReaders; it's expensive, so I wouldn't call it for each document you add. Instead of calling it from cron, I'd use autocommit in solrconfig.xml. You can tune the interval depending on how long you can wait for new documents to become visible in searches.
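For illustration, here is a minimal SolrJ sketch of that pattern (the question uses the PHP Apache_Solr_Service client, but the idea is the same): add documents in batches without committing per document, and let autocommit or a commitWithin deadline make them visible. The URL and field names below are placeholders.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own core or collection.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);           // hypothetical fields
                doc.addField("title_t", "Document " + i);
                batch.add(doc);
            }
            // Send the whole batch in one request and do NOT commit per document.
            // The commitWithin value (in ms) asks Solr to make the documents visible
            // within that window, playing the same role as autocommit in solrconfig.xml.
            solr.add(batch, 60_000);
        }
    }
}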

The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).

If you enable autocommit, you can be sure that any documents added via an update have been committed once the autocommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within those 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the index, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has claimed a business need to have it update faster than that.
This is with a 350,000-record index totalling roughly 10 GB in RAM.

Related

Solr: how to improve query speed after too many atomic updates

I'm using SolrCloud 7.4 with 3 instances (16 GB RAM each) and have 1 collection with 10M documents. At the start it was really fast, with almost no query taking more than 2 seconds.
Then I updated it with transaction data (e.g. popularity) from another Oracle database to make my collection more relevant. I simply loop over the transactions and use Solr atomic updates (set and inc) on about 1-10 fields (almost all of type float and long). But there are more than 300M transactions, so I apply the set/inc updates to the collection in batches of 10K transactions.
The 300M-transaction update only runs once; after that it is maybe 50K transactions per day, processed at 0:00.
In the end the collection still has 10M documents, but my queries have slowed down to almost 10 seconds.
Looking at the shard overview, each shard has 20+ segments and half of them consist of deleted documents.
Did I miss something here? Why did my query performance drop?
How do I speed it up again like before?
Should I copy and create a new collection and reindex my 10M-document collection (after the atomic updates from the 300M transactions) into the new collection?
The issue is caused by a large number of segments being created, mostly consisting of deleted documents. When you're doing an atomic update, the previous document is fetched, the value is changed, and the new document (with the new value) is indexed. This leaves the old document as deleted, while the new document is written to a new file.
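For reference, the kind of set/inc atomic update the question describes might look roughly like this from SolrJ (the URL, field names, and values are made up):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and field names.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/collection").build()) {
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", "doc-42");
            // "set" replaces the stored value, "inc" adds to it; under the hood Solr
            // fetches the old document, applies the change, and reindexes it, leaving
            // the old version behind as a deleted document in its segment.
            partial.addField("popularity_l", Collections.singletonMap("inc", 10));
            partial.addField("score_f", Collections.singletonMap("set", 3.5f));
            solr.add(partial);
        }
    }
}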
These segments are merged when the mergeFactor value is hit; i.e. when the number of segments gets high enough, they're merged into a new segment file instead of having multiple files around. When this merge happens, deleted documents are expunged (there is no need to write documents that no longer exist to a new file).
You can force this process to happen by issuing an optimize. While you can usually rely on mergeFactor to do the job for you (depending on the value of mergeFactor and your indexing strategy), for datasets where everything is updated in one go, such as once a night, issuing an optimize afterwards works fine.
The downside is that it requires extra processing (which would happen anyway if you just relied on mergeFactor, only not all at once), and up to 2x the current size of the index as temporary space.
You can perform an optimize by calling the update endpoint for your collection: http://localhost:8983/solr/collection/update?optimize=true&maxSegments=1&waitFlush=false
The maxSegments value tells Solr how many segments it's acceptable to end up with. The default value is 1, which is fine for most use cases.
While calling optimize has gotten a bad rep (since mergeFactor usually should do the work for you, and people tend to call optimize far too often), this is a perfectly fine use case for optimize. There are also optimization enhancements for the optimize command in 7.5, which will help avoid the previous worst case scenarios.
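If you're indexing through SolrJ rather than hitting the HTTP endpoint directly, the equivalent call might look roughly like this (the URL is a placeholder; the parameters mirror the ones in the URL above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class OptimizeAfterNightlyUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at the collection that received the atomic updates.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/collection").build()) {
            // optimize(waitFlush, waitSearcher, maxSegments): merge down to a single
            // segment and expunge deleted documents, equivalent in spirit to
            // ?optimize=true&maxSegments=1&waitFlush=false on the update endpoint.
            solr.optimize(false, false, 1);
        }
    }
}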

Frequency of Full reindex on SolrCloud

How often do I need to run full reindex on SolrCloud?
A full reindex takes more than 12 hours and we run it every night, but is it really necessary when the delta imports run correctly?
New data comes in at a rate of about 2000 documents per delta, with a delta every 30 seconds.
Total index size : 20GB
Solr: 6.5.2
If the delta runs correctly, there should be no need to run a full reindex at all. The exception might be if you have disabled merging while the index is live; in that case you might end up with an index that is very fragmented file-wise, and the reindex ends up building the complete set as a single index file instead. But that isn't usually how Solr is configured, and if it is, it's done for a reason.
So, if your delta is working correctly and you run Solr with fairly standard settings, you can safely skip reindexing unless you're starting over with an empty index (or the schema has changed). But make sure this also covers deletions: a reindex would probably not include deleted documents, so the question becomes whether your delta import handles deletions as well.
None of our Solr based services reindex at all - everything is done with live updates and a decent merge factor.

When to optimize a Solr Index

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So the timing of commits really depends on how quickly you want changes to appear on your site through the search engine. It is a heavy operation, however, so it should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It reorganizes the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Solr's index segments are write-once, so every time you reindex a document it marks the old document as deleted and then creates a brand new document to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count by going to the Solr statistics page and comparing the numDocs and maxDocs numbers. The difference between the two is the number of deleted (non-searchable) documents in the index.
Also, Optimize builds a whole new index from the old one and then switches to the new index when complete. The command therefore requires double the space to perform the action, so you will need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
Index Server / Search Server:
Paul Brown was right that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the search servers. You can set up the index server with multiple index endpoints.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day and run an optimize only once a day at most.
3- Commit should be set up as autoCommit in the solrconfig.xml file, tuned according to your needs (see the sketch below for a commitWithin alternative).
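To make point 3 concrete, here is a hedged SolrJ sketch of adding an ad with a commitWithin deadline instead of an explicit commit from the request path; the core URL and field names are placeholders, and the same pattern applies from any other client library.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddClassifiedAd {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL and field names.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/ads").build()) {
            SolrInputDocument ad = new SolrInputDocument();
            ad.addField("id", "ad-12345");
            ad.addField("title_t", "Bicycle for sale");

            // No explicit commit() here: ask Solr to make the ad searchable within
            // 5 minutes and let the autoCommit settings handle durability.
            solr.add(ad, 5 * 60 * 1000);
        }
    }
}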
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Solr - old transaction logs (tlogs) never deleted

First of all, I searched for a long time but found no solution, so now I'll describe my specific problem, trying to keep it short:
solr-spec 4.0.0.2012.10.06.03.04.33
one master, three slaves
around 70,000 documents in the index
the master is triggered to do a full import / generate a completely new index roughly once a day
the options used for the trigger are:
?command=full-import&verbose=false&clean=false&commit=true&optimize=true
the slaves poll the master for a new index; if the generation (GEN) increases (full import + hard commit as mentioned), they pull the new index
no autoCommit / autoSoftCommit set up
The problem is that on each hard commit the index (~670 MB) gets written to disk, once a day, but the old tlogs never get deleted.
As far as I have read, Solr keeps enough tlogs to be able to restore the last 100 changes to documents, am I right?
In my setup I am sure at least 100 documents (or data sets within the source database) are changed each day, so I don't understand why Solr never deletes the old tlogs.
I would be glad if someone could point me in the right direction; currently I have no clue what to try next. Also, I have not found a setup like this one described as having problems like this.
Thx ;)
First, you'll probably want to update your Solr version, as a few transaction log reference leaks have been fixed since 4.0.
A hard commit should usually remove old transaction logs, since the documents are written to disk in the index anyway (IIRC), which may indicate that you're getting bitten by some old references hanging around.
Another option would be to turn off the transaction log completely, since you only generate a completely new index on each run anyway and distribute that one.

How to configure Solr for improved indexing speed

I have a client program which generates 1-50 million Solr documents and adds them to Solr.
I'm using ConcurrentUpdateSolrServer for pushing the documents from the client, 1000 documents per request.
The documents are relatively small (few small text fields).
I want to improve the indexing speed.
I've tried increasing "ramBufferSizeMB" to 1 GB and "mergeFactor" to 25 but didn't see any change.
I was wondering if there are other recommended settings for improving Solr indexing speed.
Any links to relevant materials will be appreciated.
It looks like you are doing a bulk import of data into Solr, so you don't need to search any data right away.
First, you can increase the number of documents per request. Since your documents are small, I would even increase it to 100K docs per request or more and try.
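To illustrate the first point, here is a rough SolrJ sketch using ConcurrentUpdateSolrClient (the newer name for ConcurrentUpdateSolrServer); the URL, queue size, and thread count are placeholder values to experiment with on your hardware.

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the queue size and thread count are tuning knobs.
        try (ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                "http://localhost:8983/solr/mycore")
                .withQueueSize(100_000)   // buffer many small documents per request
                .withThreadCount(4)       // background threads flushing the queue
                .build()) {
            for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);           // hypothetical fields
                doc.addField("body_t", "small text " + i);
                solr.add(doc);                             // queued and sent in the background
            }
            // A single explicit commit at the end of the bulk load (see the second point below).
            solr.commit();
        }
    }
}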
Second, you want to reduce the number of times commits happen when you are bulk indexing. In your solrconfig.xml look for:
<!-- AutoCommit

     Perform a hard commit automatically under certain conditions.
     Instead of enabling autoCommit, consider using "commitWithin"
     when adding documents.

     http://wiki.apache.org/solr/UpdateXmlMessages

     maxDocs - Maximum number of documents to add since the last
               commit before automatically triggering a new commit.

     maxTime - Maximum amount of time in ms that is allowed to pass
               since a document was added before automatically
               triggering a new commit.

     openSearcher - if false, the commit causes recent index changes
               to be flushed to stable storage, but does not cause a new
               searcher to be opened to make those changes visible.
  -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
You can disable autoCommit altogether and then call a commit after all your documents are posted. Otherwise you can tweak the numbers as follows:
The default maxTime is 15 seconds, so an autocommit happens every 15 seconds if there are uncommitted docs. You can set this to something large, say 3 hours (i.e. 3*60*60*1000). You can also add <maxDocs>50000000</maxDocs>, which means an autocommit only happens after 50 million documents are added. After you post all your documents, call commit once manually or from SolrJ - it will take a while to commit, but this will be much faster overall.
Also after you are done with your bulk import, reduce maxTime and maxDocs, so that any incremental posts you will do to Solr will get committed much sooner. Or use commitWithin as mentioned in solrconfig.
In addition to what was written above, when using SolrCloud you may want to consider using CloudSolrClient with SolrJ. The CloudSolrClient class is ZooKeeper-aware and can route documents directly to the shard leaders, speeding up indexing in some cases.
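A hedged sketch of what that might look like; the ZooKeeper address and collection name are placeholders, and the Builder signature differs slightly between SolrJ releases.

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper host; SolrJ 7.x-style Builder shown here.
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build()) {
            solr.setDefaultCollection("mycollection"); // hypothetical collection name

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title_t", "Routed straight to the shard leader");

            // CloudSolrClient reads the cluster state from ZooKeeper and sends each
            // document to the leader of the shard that owns it, avoiding an extra hop.
            solr.add(doc);
            solr.commit();
        }
    }
}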
