Solr indexing - Master/Slave replication: how to handle a huge index and high traffic?

I'm currently facing an issue with Solr (more exactly with slave replication), and after spending quite a bit of time reading online, I find myself having to ask for some enlightenment.
- Does Solr have some limitation on the size of its index?
When dealing with a single master, when is the right moment to decide to use multiple cores or multiple indexes?
Are there any indications that, upon reaching a certain index size, partitioning is recommended?
- Is there a maximum size when replicating segments from master to slave?
When replicating, is there a segment size limit beyond which the slave won't be able to download the content and index it? What is the threshold at which a slave can no longer replicate when there is a lot of traffic to retrieve info and many new documents to replicate?
To be more factual, here is the context that led me to these questions:
We want to index a fair number of documents, but once the count exceeds a dozen million, the slaves can't handle it and start failing to replicate with a SnapPull error.
The documents are composed of a few text fields (name, type, description, and about 10 other fields of, say, 20 characters max).
We have one master and two slaves which replicate data from it.
This is my first time working with Solr (I usually work on web apps using Spring, Hibernate, etc., but without Solr), so I'm not sure how to tackle this issue.
Our idea for the moment is to add multiple cores to the master and have a slave replicating from each of these cores.
Is it the right way to go?
If it is, how can we determine the number of cores needed? Right now we're just going to try it, see how it behaves, and adjust if necessary, but I was wondering if there are any best practices or benchmarks on this specific topic.
Something like: "for this number of documents with this average size, x cores or indexes are needed..."
Thanks for any help in dealing with a huge number of documents of average size!
Here is a copy of the error that is being thrown when a slave is trying to replicate:
ERROR [org.apache.solr.handler.ReplicationHandler] - <SnapPull failed >
org.apache.solr.common.SolrException: Index fetch failed :
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:142)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:166)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
at java.lang.Thread.run(Thread.java:595)
Caused by: java.lang.RuntimeException: java.io.IOException: read past EOF
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:467)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more
Caused by: java.io.IOException: read past EOF
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:410)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:538)
at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:402)
at org.apache.lucene.index.DirectoryReader.isCurrent(DirectoryReader.java:791)
at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:404)
at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:352)
at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:413)
at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:424)
at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:35)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1049)
... 14 more
EDIT:
Following Mauricio's answer, the Solr libraries were updated to 1.4.1, but the error was still raised.
I then increased commitReserveDuration, and even though the "SnapPull failed" error seems to have disappeared, another one started being raised. I'm not sure why, as I can't seem to find much about it on the web:
ERROR [org.apache.solr.servlet.SolrDispatchFilter] - <ClientAbortException: java.io.IOException
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:370)
at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:323)
at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:396)
at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:385)
at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:183)
at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:89)
at org.apache.solr.request.BinaryResponseWriter.write(BinaryResponseWriter.java:48)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:322)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:837)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:640)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1286)
at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.IOException
at org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(InternalAprOutputBuffer.java:703)
at org.apache.coyote.http11.InternalAprOutputBuffer$SocketOutputBuffer.doWrite(InternalAprOutputBuffer.java:733)
at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:124)
at org.apache.coyote.http11.InternalAprOutputBuffer.doWrite(InternalAprOutputBuffer.java:539)
at org.apache.coyote.Response.doWrite(Response.java:560)
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:365)
... 22 more
>
ERROR [org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[SolrServer]] - <Servlet.service() for servlet SolrServer threw exception>
java.lang.IllegalStateException
at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:405)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:362)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:837)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:640)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1286)
at java.lang.Thread.run(Thread.java:595)
I still wonder what the best practices are for handling a big index (more than 20 GB) containing a lot of documents with Solr. Am I missing some obvious links somewhere? Tutorials, documentation?

Cores are a tool primarily used to host different schemas in a single Solr instance. They are also used as on-deck indexes. Sharding and replication are orthogonal issues.
You mention "a lot of traffic". That's a highly subjective measure. Instead, try to determine how many QPS (queries per second) you need from Solr. Also, does a single Solr instance answer your queries fast enough? Only then can you determine whether you need to scale out. A single Solr instance can handle a lot of traffic; you may not need to scale at all.
Make sure you run Solr on a server with plenty of memory (and make sure Java has access to it). Solr is quite memory-hungry; if you put it on a memory-constrained server, performance will suffer.
As the Solr wiki explains, use sharding if a single query takes too long to run, and replication if a single Solr instance can't handle the traffic. "Too long" and "traffic" depend on your particular application. Measure them.
Solr has lots of settings that affect performance: cache auto-warming, stored fields, merge factor, etc. Check out SolrPerformanceFactors.
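To make that concrete, here is a hedged sketch of a few of those knobs in solrconfig.xml; the element names are standard, but the values are placeholders to tune for your own workload:
<!-- solrconfig.xml fragments; illustrative values only -->
<indexDefaults>
  <!-- higher mergeFactor speeds up indexing but leaves more segments to search -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
<query>
  <!-- autowarmCount pre-populates each cache from the old searcher after a commit;
       large values slow down commits, 0 disables auto-warming entirely -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
</query>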
There are no hard rules here. Every application has different search needs. Simulate and measure for your particular scenario.
About the replication error, make sure you're running 1.4.1 since 1.4.0 had a bug with replication.
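For reference, commitReserveDuration (the setting increased in the edit above) lives in the master's replication handler configuration; a minimal sketch per the SolrReplication wiki, with a placeholder duration:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- keep the commit point reserved long enough for slow slaves to finish pulling -->
    <str name="commitReserveDuration">00:00:30</str>
  </lst>
</requestHandler>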

Related

Index on array element attribute extremely slow

I'm new to mongodb but not new to databases. I created a collection of documents that look like this:
{
  _id: ObjectId('5e0d86e06a24490c4041bd7e'),
  ...,
  match: [{
    _id: ObjectId('5e0c35606a24490c4041bd71'),
    ts: 1234456,
    ...
  }]
}
So there is a list of objects in each document, and within the list there might be many objects with the same _id field. I have a handful of documents in this collection, and my query that selects on particular match._id values is horribly slow. I mean unnaturally slow.
The query is simply this: {match: {$elemMatch: {_id: match._id}}}, and it literally hangs the system for about 15 seconds, returning 15 matching documents out of 25 total!
I put an index on the collection like this:
collection.createIndex({"match._id" : 1}) but that didn't help.
Explain says execution time is 0 and says it's using the index but it still takes 15 seconds or longer to complete.
I'm getting the same slowness in Node.js and in Compass.
Explain Output:
{"explainVersion":"1","queryPlanner":{"namespace":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef.lead","indexFilterSet":false,"parsedQuery":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"maxIndexedOrSolutionsReached":false,"maxIndexedAndSolutionsReached":false,"maxScansToExplodeReached":false,"winningPlan":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"inputStage":{"stage":"IXSCAN","keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]}}},"rejectedPlans":[]},"executionStats":{"executionSuccess":true,"nReturned":15,"executionTimeMillis":0,"totalKeysExamined":15,"totalDocsExamined":15,"executionStages":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"docsExamined":15,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]},"keysExamined":15,"seeks":1,"dupsTested":15,"dupsDropped":0}},"allPlansExecution":[]},"command":{"find":"lead","filter":{"match":{"$elemMatch":{"_id":"5e0c3560e5a9e0cbd994fa52"}}},"skip":0,"limit":0,"maxTimeMS":60000,"$db":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef"},"serverInfo":{"host":"Dans-MacBook-Pro.local","port":27017,"version":"5.0.9","gitVersion":"6f7dae919422dcd7f4892c10ff20cdc721ad00e6"},"serverParameters":{"internalQueryFacetBufferSizeBytes":104857600,"internalQueryFacetMaxOutputDocSizeBytes":104857600,"internalLookupStageIntermediateDocumentMaxSizeBytes":104857600,"internalDocumentSourceGroupMaxMemoryBytes":104857600,"internalQueryMaxBlockingSortMemoryUsageBytes":104857600,"internalQueryProhibitBlockingMergeOnMongoS":0,"internalQueryMaxAddToSetBytes":104857600,"internalDocumentSourceSetWindowFieldsMaxMemoryBytes":104857600},"ok":1}
The explain output confirms that the operation that was explained is perfectly efficient. In particular we see:
The expected index being used with tight indexBounds
Efficient access of the data (totalKeysExamined == totalDocsExamined == nReturned)
No meaningful duration ("executionTimeMillis":0 which implies that the operation took less than 0.5ms for the database to execute)
Therefore the slowness that you're experiencing for that particular operation is not related to the efficiency of the plan itself. This doesn't always rule out the database (or its underlying server) as the source of the slowness completely, but it is usually a pretty strong indicator that either the problem is elsewhere or that there are multiple factors at play.
I would suggest the following as potential next steps:
Check the mongod log file (you can confirm its location by running db.adminCommand("getCmdLineOpts") via the shell connected to the instance; see the shell sketch after this list). By default, any operation slower than 100ms is captured. This will help in a variety of ways:
If there is a log entry (with a meaningful duration) then it confirms that the slowness is being introduced while the database is processing the operation. It could also give some helpful hints as to why that might be the case (waiting for locks or server resources such as storage for example).
If an associated entry cannot be found, then that would be significantly stronger evidence that we are looking in the wrong place for the source of the slowness.
Is the operation that you gathered the explain for exactly the one that the application and Compass observe as slow? Were you connected to the same server and namespace? Is the explained operation simplified in some way, such as the original operation containing a sort, projection, collation, etc.?
As a relevant example that combines these two, I notice that there are skip and limit parameters applied to the command explained on a mongod seemingly running on a laptop. Are those parameters non-zero when running the application and does the application run against a different database with a larger data set?
The explain command doesn't include everything that an application would. Notably absent is the actual time it takes to send the results across the network. If you had particularly large documents that could be a factor, though it seems unlikely to be the culprit in this particular situation.
How exactly are you measuring the full execution time? Does it potentially include the time to connect to the database? In this case you mentioned that Compass itself also demonstrates the slowness, so that may rule out most of this.
What else is running on the server hosting the database? Is there a container or VM involved? Would the database or the underlying server be experiencing resource contention due to concurrency?
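A quick mongo shell sketch for that first suggestion (the command names are standard; the slowms value is just an example threshold):
// Find where mongod is writing its log (look under systemLog.path in the output)
db.adminCommand("getCmdLineOpts")
// Optionally lower the slow-operation threshold so the next run gets logged
// (level 0 leaves the profiler off but still controls the logging threshold)
db.setProfilingLevel(0, { slowms: 50 })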
Two additional minor asides:
25 total documents in a collection is extremely small. I would expect even the smallest hardware to be able to process such a request without an index unless there was some complicating factor.
Assuming that match is always an array, the $elemMatch operator is not strictly necessary for this particular query. I would not expect this to have a performance impact in your situation.
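For illustration, a minimal mongo shell sketch of the simplified form (the collection name and ObjectId are taken from the explain output above):
// Querying an array field directly matches any element of the array,
// so for a single condition this is equivalent to the $elemMatch form:
db.lead.find({ "match._id": ObjectId("5e0c3560e5a9e0cbd994fa52") })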

Interpretation of Steps (1/2) failed in Snowflake

I'm running a SQL query on Snowflake which has been running for more than 24 hours.
When I go to 'Profile Overview', I see Steps (1/2, Failed).
However, under the 'Details' tab, I see that the status of the query is still 'Running'.
Can someone please explain whether the query is still running or whether there has been some error?
The query must be retrying after failing on a first attempt. For example, it might have failed on a memory error; it would then retry internally twice (three tries in total) before either completing successfully or failing with an error (in case all tries fail).
You may reach out to Snowflake Support to look into this.
Your query is likely resource (memory) intensive. Things to check in your query include:
- a large sort, or a large number of sorts?
- a large GROUP BY (or many aggregates)?
- many window functions with ORDER BY?
For a query that has been running for many hours, it is likely best to dig deeper, either with the help of Support (open a case) or by looking for opportunities to revise the query pattern. To check the query's current status from SQL, see the sketch below.
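A hedged sketch for checking the status yourself, using Snowflake's QUERY_HISTORY table function (the query ID is a placeholder; EXECUTION_STATUS distinguishes running queries from failed ones):
-- Look up the live status of a specific query
SELECT query_id, execution_status, total_elapsed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = '<your-query-id>';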

Solr - Running out of heap memory while performing spellcheck.build

I'm using Solr with Tomcat as the servlet container. I've set up Solr to use only one core, and defined a DIH to import documents row by row from MySQL tables.
Everything is fine and works well: the documents get indexed correctly and I can search among them.
The problem is that I'm trying to use the suggester module, but I have a problem building whatever it needs to build the first time, using a URL like this:
http://user:pass@localhost:port/solr/corename/suggest?q=whatever&spellcheck.build=true
I left out one important piece of information: the data being imported is 4.7 million records right now.
At first it couldn't even build the spellcheck dictionary (if that is what it's building) for 1 million documents, because the JVM would run out of heap memory with the following message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.RuntimeException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:793)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:434)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
So I gradually increased the heap memory, and right now it's about 2 GB, which I suppose is a lot.
Of course the obvious solution is to increase Java's heap memory yet again, but I'm wondering if there's any way to divide and conquer the dictionary-building process?
Or any other solution, for that matter.
Thanks a lot
1) A parameter that can have a big impact on the spellcheck index size is "thresholdTokenFrequency". Adding the following parameter to your SpellCheckComponent configuration may be a remedy:
<float name="thresholdTokenFrequency">.01</float>
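For orientation, a minimal sketch of where that parameter sits in solrconfig.xml; the spellcheck field name and index directory below are placeholders, not taken from the question:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- placeholder: the field your spellcheck dictionary is built from -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- ignore terms that appear in less than 1% of documents -->
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>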
2) If the data in your spellcheck field is copied from several other fields, you may try to set up different SpellCheckComponents, each operating on a separate field.
I did not try this, and I'm afraid that merging the results from different SpellCheckComponents may be quite tricky.
Solr needs a lot of memory when building indexes such as the spellcheck index.
Of course, putting more and more memory into the machine is not the way to go.
I had a similar problem and found out that increasing the virtual memory solved it.
You can use ulimit -v to show the current state of the virtual memory.
In my case that was 14 GB for a 5 GB index (10 million docs), which was not enough.
So I put ulimit -v unlimited at the beginning of the Tomcat start script. That solved the problem for me.

Solr indexing issue (out of memory) - looking for a solution

I have a large index of 50 million docs, all running on the same machine (no sharding).
I don't have an ID that would let me update just the wanted docs, so for each update I must delete the whole index, index everything from scratch, and commit only at the end when I'm done indexing.
My problem is that every few index runs, my Solr crashes with an out-of-memory exception; I am running with 12.5 GB of memory.
From what I understand, until the commit everything is kept in memory, so I'm storing 100M docs in memory instead of 50M. Am I right?
But I cannot commit while I'm indexing, because I deleted all the docs at the beginning, and until indexing finishes I'd be running with a partial index, which is bad.
Are there any known solutions for this? Can sharding solve it, or will I still have the same problem?
Is there a flag that allows me to make soft commits that won't change the index until the hard commit?
You can use master-slave replication. Just dedicate one machine to do your indexing (the Solr master), and then, when it's finished, tell the slave to replicate the index from the master machine. The slave will download the new index, and it will only delete the old index if the download is successful, so it's quite safe.
http://wiki.apache.org/solr/SolrReplication
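A minimal sketch of that setup, following the wiki page above (the host name and poll interval are placeholders):
<!-- solrconfig.xml on the indexing (master) machine -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>
<!-- solrconfig.xml on the serving (slave) machine -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>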
Another solution that avoids this replication setup is to use a reverse proxy: put nginx or something similar in front of your Solr. Use one machine for indexing the new data and the other for searching, and make the reverse proxy always point at the one not currently doing any indexing.
If you do either of these, you can commit as often as you want.
And because it's generally a bad idea to do indexing and searching on the same machine, I would prefer the master-slave solution (not to mention that you have 50M docs).
An out-of-memory error can be addressed by giving more memory to your container's JVM; it has nothing to do with your cache.
Use better garbage collection options, because the source of the error is your JVM's memory filling up.
Increase the number of threads, because when the thread limit for a process is reached, a new process is spawned (with the same number of threads as the prior one and the same memory allocation).
Please also write about CPU spikes and any other type of caching mechanism you are using.
One thing you can try: set all cache auto-warm counts to 0; it would speed up commit time.
Regards,
Rajat

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 txns per day), each record around 100 KB to 200 KB in size. The format of the data is going to be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on a few common attributes, but there may be some arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and have so far found two approaches used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a MapReduce job (Hive/HBase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB - something like an Oracle query; it is okay for it to be slow, but when we get the data, everything that matched the query should be returned, or at least we should know that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure it is reliable - the index may be stale. (Is there a way to know the index was stale when we search, so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3, similar to indexes on Oracle DBs?)
The MR job always seems reliable (as it fetches data from S3 for each query); is that assumption right? Is there any way to speed up such a query - maybe partition the data in S3 and run multiple MR jobs based on each partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job, we had 1.4 million documents in the index. It was used for internal reporting, and it was performant (the most complex query took under a minute). I just asked a former coworker, and it's still doing fine a year later.
I can't speak to the map reduce software, though.
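For reference, a minimal sketch of triggering those operations over HTTP (the host, port, and update path assume a default-style install, not the poster's setup):
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'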
You should think about having one Solr core per week or month, for instance; this way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.
