Solr indexing issue (out of memory) - looking for a solution

I have a large index of 50 million docs, all on the same machine (no sharding).
I don't have an ID that would let me update only the affected docs, so for each update I must delete the whole index, reindex everything from scratch, and commit only at the end when indexing is done.
My problem is that every few index runs, my Solr crashes with an out-of-memory exception. I am running with 12.5 GB of memory.
From what I understand, until the commit everything is kept in memory, so I'm holding 100M docs in memory instead of 50M. Am I right?
But I cannot commit while I'm indexing, because I deleted all docs at the beginning, so I would be serving a partial index, which is bad.
Are there any known solutions for this? Can sharding solve it, or will I still have the same problem?
Is there a flag that would let me make soft commits without changing the served index until the hard commit?

You can use master-slave replication. Just dedicate one machine to do your indexing (the master Solr), and then, when it's finished, tell the slave to replicate the index from the master machine. The slave will download the new index, and it will only delete the old index if the download is successful, so it's quite safe.
http://wiki.apache.org/solr/SolrReplication
Another solution that avoids this replication set-up is to use a reverse proxy: put nginx or something similar in front of your Solr. Use one machine for indexing the new data and the other for searching, and have the reverse proxy always point at the one not currently doing any indexing.
If you do either of these, you can commit as often as you want.
And because it's generally a bad idea to do indexing and searching on the same machine, I would prefer the master-slave solution (not to mention that you have 50M docs).
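For reference, this replication is configured through the ReplicationHandler in solrconfig.xml. A minimal sketch, assuming a single core named core0 and placeholder host names:

```xml
<!-- On the master (indexing) machine -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On the slave (search) machine -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/core0/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

With replicateAfter set to commit, the slave only ever sees a fully committed index, which is exactly the "no partial index" behaviour the question asks for.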

An out-of-memory error can be addressed by giving more memory to your container's JVM; it has nothing to do with your cache.
Use better garbage-collection options, because the source of the error is the JVM's memory filling up.
Increase the number of threads, because when the thread limit for a process is reached, a new process is spawned (which has the same thread limit and memory allocation as the one before it).
Please also write about any CPU spikes and any other caching mechanisms you are using.
One more thing you can try: set all autowarm counts to 0; that will speed up commit time.
Regards,
Rajat
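As a sketch of the first point, heap size and the collector are JVM start-up flags for the container running Solr. The sizes and flags below are placeholders to tune for your own machine, not recommended values:

```
java -Xms4g -Xmx10g \
     -XX:+UseConcMarkSweepGC \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/solr/ \
     -jar start.jar
```

The heap-dump flags don't prevent the crash, but they give you something concrete to inspect after the next out-of-memory error.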

Related

Gracefully close Elasticsearch index

I need to update a very simple index setting in an Elasticsearch 5 cluster: index.query.default_field. I'm very surprised that this setting is not dynamic and that I need to close the index to update it. It looks strange to me because, from the Elasticsearch source code, this setting seems to affect only how incoming requests are processed, not data schemas or anything else that might be stored on disk or in in-memory caches.
But the main problem is that after closing the index and updating the setting, I reopen the index and Elasticsearch suddenly starts to re-replicate all primary shards in the index. This looks very strange because there have been no write requests to the index and the data has stayed unchanged for several weeks.
Are there any methods to gracefully close an index in such a way as to prevent this re-replication when we need to open the index again in the future?
Also, are there any reasons for index.query.default_field to be non-dynamic, or is this a mistake?
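For context, the close/update/open sequence in question looks like this against the Elasticsearch 5.x REST API (the index name and field value are placeholders):

```
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPUT  'localhost:9200/my_index/_settings' \
     -H 'Content-Type: application/json' \
     -d '{"index.query.default_field": "body"}'
curl -XPOST 'localhost:9200/my_index/_open'
```

It is the reopen at the end of this sequence that triggers the re-replication described above.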

When to optimize a Solr Index [duplicate]

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents into Solr, none of the changes you make will appear until you run the commit command. So timing the commit really depends on how quickly you want changes to appear on your site through the search engine. It is, however, a heavy operation, so it should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It reorganizes the index into segments (increasing search speed) and removes any deleted (replaced) documents. Solr's index segments are effectively write-once, so every time you reindex a document it marks the old document as deleted and then creates a brand-new document to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document counts by going to the Solr statistics page and comparing the numDocs and maxDocs numbers; the difference between the two is the number of deleted (non-searchable) documents in the index.
Also, Optimize builds a whole NEW index from the old one and then switches to the new index when complete, so the command requires double the space to perform the action. You will therefore need to make sure that the size of your index does not exceed 50% of your available hard-drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
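The numDocs/maxDocs bookkeeping described above amounts to a couple of lines of arithmetic (the numbers here are made up for illustration):

```python
def deleted_doc_count(num_docs: int, max_docs: int) -> int:
    """Deleted (non-searchable) docs still occupying space in the index."""
    return max_docs - num_docs

def deleted_ratio(num_docs: int, max_docs: int) -> float:
    """Fraction of the index occupied by deleted docs; a high ratio
    suggests an optimize would reclaim significant space."""
    if max_docs == 0:
        return 0.0
    return (max_docs - num_docs) / max_docs

# Example: 1.2M index entries, 1M of them live.
print(deleted_doc_count(1_000_000, 1_200_000))        # 200000
print(round(deleted_ratio(1_000_000, 1_200_000), 3))  # 0.167
```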
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the search servers. You can tune the index server to have multiple index endpoints.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day and then optimize only once a day at most.
3- Commit should be set to "autoCommit" in the solrconfig.xml file, and tuned there according to your needs.
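A minimal autoCommit sketch for solrconfig.xml; the thresholds are placeholders to tune for your write rate:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>    <!-- commit after this many pending docs -->
    <maxTime>3600000</maxTime>  <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```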
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Kyoto Tycoon: remove expired records from memory

We have a small setup of Kyoto Tycoon [Kyoto Tycoon 0.9.55 (2.18) on Linux (Kyoto Cabinet 1.2.75)], which is a fully in-memory DB, split into 3 shards with a master-slave architecture for each shard.
Presently we have an issue with expired records that stay in memory, so memory utilization keeps going up.
I checked this doc: http://fallabs.com/kyototycoon/spex.html#tips
There I found "ktremotemgr vacuum", which per the description performs a full GC operation.
But I was looking for another approach, such as a config parameter that takes care of removing expired records from memory.
Any help on this, please?
Thanks
kt will do this at random, and in some cases it is LRU-based. Yes, the memory utilization will go up for some time.
Following is some documentation from the same link.
In addition, automatic deletion by the capacity limit is performed at random. In that case, fresh records may also be deleted soon. So, setting effectual expiration time not to reach the limit is very important. If you cannot calculate effectual expiration time beforehand, use the cache hash database instead of the default stash database. The following setting is suggested.
$ ktserver '*#bnum=20000000#capsiz=8g'
Note that the space efficiency of the cache hash database is worse than that of the stash database. The limit should be up to 50% of the total memory size of the machine. However, automatic deletion by the "capsiz" parameter (not "ktcapsiz") of the cache hash database is based on the LRU algorithm, which prevents fresh records from sudden deletion.

How Indices Cope with MVCC?

Greetings Overflowers,
To my understanding (and I hope I'm not right), changes to indices cannot be MVCCed.
I'm wondering if this is also true for big records, as copies can be costly.
Since records are (usually) accessed via indices, how can MVCC be effective?
Do indices, for example, keep track of the different versions of MVCCed records?
Any recent good reading on this subject? Really appreciated!
Regards
The index itself can contain both versions of a record, which are pruned before being returned, so in this case the index alone can't be used to fetch the records (this is how MVCC is done by Postgres). InnoDB/Oracle keep only one version of the data/index and rebuild the older versions for older transactions using the undo section.
You won't have too many copies when the DB is in general use, as copies are periodically garbage-collected (in Postgres), and Oracle/InnoDB undo sections are reused when a transaction is aborted or committed. If you have too many long-running transactions, you will obviously have problems.
Indexes exist to speed up access: they find the record faster without touching all of them. The index need not be accurate on the first pass; you may need to look at the tuple to see whether it is valid in one particular transaction (as in Postgres). In Oracle or InnoDB even the index is versioned, so you can get data from the indexes themselves.
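A toy sketch of the Postgres-style check described above: the index can point at several versions of a tuple, and each candidate is filtered through a visibility check against the reading transaction's snapshot. All names here are illustrative, not actual Postgres internals:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleVersion:
    value: str
    xmin: int                    # txid that created this version
    xmax: Optional[int] = None   # txid that replaced it (None = still live)

def visible(t: TupleVersion, snapshot_txid: int) -> bool:
    """Simplified rule: a version is visible if it was created at or
    before our snapshot and not yet replaced as of it."""
    created = t.xmin <= snapshot_txid
    replaced = t.xmax is not None and t.xmax <= snapshot_txid
    return created and not replaced

# The "index" points at every version of the same logical row.
index_entry = [
    TupleVersion("old", xmin=5, xmax=9),   # replaced by txid 9
    TupleVersion("new", xmin=9),           # current version
]

def read(snapshot_txid: int) -> list:
    return [t.value for t in index_entry if visible(t, snapshot_txid)]

print(read(7))   # ['old']  -- transaction that started before the update
print(read(12))  # ['new']  -- transaction that started after it
```

This is why the index alone isn't enough in this scheme: both versions sit in the index, and the per-tuple check decides which one a given transaction actually sees.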
Read this to get a detailed idea of the two ways of implementing MVCC (Postgres and Oracle/InnoDB).
InnoDB MVCC and the comments there are useful too.
PS: I'm not an expert in MySQL/Oracle/Postgres internals; I'm still learning how things work.

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if i'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
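The batch-and-continue pattern described above can be sketched as a pure-Python simulation; on App Engine the enqueue would be a taskqueue add and the cursor a datastore query cursor, so everything below is an illustrative stand-in:

```python
from collections import deque

BATCH_SIZE = 10

markers = list(range(25))   # unapplied work items, oldest first
applied = []                # stands in for the materialized view
task_queue = deque()        # stands in for the App Engine taskqueue

def worker(cursor: int):
    """Process one batch starting at `cursor`; enqueue a continuation
    if anything remains. Continuations never overlap the previous
    batch, so there is no contention between tasks."""
    batch = markers[cursor:cursor + BATCH_SIZE]
    applied.extend(batch)   # transactionally applied in the real system
    next_cursor = cursor + len(batch)
    if next_cursor < len(markers):
        task_queue.append(next_cursor)

task_queue.append(0)        # the optimistically enqueued first task
while task_queue:
    worker(task_queue.popleft())

print(applied == markers)   # True: every marker applied once, in order
```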
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
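The scattering effect can be sketched as follows: hashing the incrementing sequence number yields row-key prefixes spread across the keyspace, instead of the monotonically increasing hot spot you get from raw sequence numbers (the choice of MD5 here is illustrative):

```python
import hashlib

def work_index(sequence_number: int) -> str:
    """Prefix derived from the hash of the sequence number, so that
    consecutive enqueues land on scattered Bigtable row prefixes."""
    return hashlib.md5(str(sequence_number).encode()).hexdigest()[:8]

sequential = [str(n).zfill(8) for n in range(5)]   # naive monotonic keys
scattered = [work_index(n) for n in range(5)]      # hashed prefixes

print(sequential)  # ['00000000', '00000001', ...] -- all share a prefix
print(scattered)   # e.g. ['cfcd2084', ...] -- spread across the keyspace
```

Because the hashed prefixes are uniformly distributed, sequential enqueues hit different tablet servers, which is the load-spreading property claimed above.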
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!
