I attempted to perform a full Solr reindex for our Cassandra cluster this past weekend. Two nodes were taking a lot longer than the other three; in fact they kept indexing for hours after the others were done. Finally they seemed to have finished: both of them showed "no" for the indexing field in the web console.
Unfortunately about an hour later one of those two nodes became completely unresponsive, and ultimately had to be restarted.
Today I'm looking at the nodes, and the three that didn't seem to have any problems all claim to have about 14.8 million docs, which is about what it should be. However, the two that were stuck, or took forever (including the one that ultimately became unresponsive), have only 9 and 7 million respectively. That is a huge discrepancy, which tells me they didn't complete correctly.
So, to resolve the issue I have two questions:
1) Since this was a full reindex, did the schema changes that prompted it actually take effect? In other words, is it only the indexing part that didn't finish, so that I can just run a regular in-place reindex to get everything back to the way it should be?
2) Assuming I don't have to run a full reindex, can I just run an in-place reindex on the two nodes that are out of whack? From a time perspective this would be ideal, as I'd have to do it after hours anyway, and it would hopefully finish overnight.
Just wondering how to proceed, as I haven't had this issue in the past.
Regarding your questions:
1) Yes, you can do a reload with in-place reindex by setting reindex=true, deleteAll=false.
2) Yes, you can run an in-place reindex on the failed nodes only by invoking a reload on each node and setting reindex=true, deleteAll=false, distributed=false.
Have a look at: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchReldCore.html
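For reference, a minimal sketch of what such a node-local, in-place reload might look like against the Solr core admin HTTP endpoint described in the linked page; the node address and the keyspace.table core name are placeholders for your cluster:

import requests

# Placeholder node address and Solr core name (keyspace.table).
SOLR_NODE = "http://10.0.0.1:8983"
CORE = "mykeyspace.mytable"

# reindex=true with deleteAll=false re-indexes in place, keeping the existing
# index searchable while it is rebuilt; distributed=false restricts the
# operation to the node you send the request to.
resp = requests.get(
    f"{SOLR_NODE}/solr/admin/cores",
    params={
        "action": "RELOAD",
        "name": CORE,
        "reindex": "true",
        "deleteAll": "false",
        "distributed": "false",
    },
)
resp.raise_for_status()

Run it (or the equivalent curl command from the documentation) once per out-of-sync node.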
Anyway, it would be good to first understand why those nodes failed: that kind of behaviour looks like an out of memory error, but are there any exceptions in your logs?
I'm new to mongodb but not new to databases. I created a collection of documents that look like this:
{
  _id: ObjectId('5e0d86e06a24490c4041bd7e'),
  ...,
  match: [{
    _id: ObjectId('5e0c35606a24490c4041bd71'),
    ts: 1234456,
    ...
  }]
}
So each document has a list of objects, and within the list there might be many objects with the same _id field. I have only a handful of documents in this collection, yet my query that selects on particular match._id values is horribly slow. I mean unnaturally slow.
The query is simply this: {match: {$elemMatch: {_id: match._id}}}, and it literally hangs the system for about 15 seconds while returning 15 matching documents out of 25 total!
I put an index on the collection like this:
collection.createIndex({"match._id" : 1}) but that didn't help.
Explain says execution time is 0 and that it's using the index, but the query still takes 15 seconds or longer to complete.
I'm getting the same slowness in Node.js and in Compass.
Explain Output:
{"explainVersion":"1","queryPlanner":{"namespace":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef.lead","indexFilterSet":false,"parsedQuery":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"maxIndexedOrSolutionsReached":false,"maxIndexedAndSolutionsReached":false,"maxScansToExplodeReached":false,"winningPlan":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"inputStage":{"stage":"IXSCAN","keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]}}},"rejectedPlans":[]},"executionStats":{"executionSuccess":true,"nReturned":15,"executionTimeMillis":0,"totalKeysExamined":15,"totalDocsExamined":15,"executionStages":{"stage":"FETCH","filter":{"match":{"$elemMatch":{"_id":{"$eq":"5e0c3560e5a9e0cbd994fa52"}}}},"nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"docsExamined":15,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15,"executionTimeMillisEstimate":0,"works":16,"advanced":15,"needTime":0,"needYield":0,"saveState":0,"restoreState":0,"isEOF":1,"keyPattern":{"match._id":1},"indexName":"match._id_1","isMultiKey":true,"multiKeyPaths":{"match._id":["match"]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"match._id":["[ObjectId('5e0c3560e5a9e0cbd994fa52'), ObjectId('5e0c3560e5a9e0cbd994fa52')]"]},"keysExamined":15,"seeks":1,"dupsTested":15,"dupsDropped":0}},"allPlansExecution":[]},"command":{"find":"lead","filter":{"match":{"$elemMatch":{"_id":"5e0c3560e5a9e0cbd994fa52"}}},"skip":0,"limit":0,"maxTimeMS":60000,"$db":"hp-test-39282b3a-9c0f-4e1f-b953-0a14e00ec2ef"},"serverInfo":{"host":"Dans-MacBook-Pro.local","port":27017,"version":"5.0.9","gitVersion":"6f7dae919422dcd7f4892c10ff20cdc721ad00e6"},"serverParameters":{"internalQueryFacetBufferSizeBytes":104857600,"internalQueryFacetMaxOutputDocSizeBytes":104857600,"internalLookupStageIntermediateDocumentMaxSizeBytes":104857600,"internalDocumentSourceGroupMaxMemoryBytes":104857600,"internalQueryMaxBlockingSortMemoryUsageBytes":104857600,"internalQueryProhibitBlockingMergeOnMongoS":0,"internalQueryMaxAddToSetBytes":104857600,"internalDocumentSourceSetWindowFieldsMaxMemoryBytes":104857600},"ok":1}
The explain output confirms that the operation that was explained is perfectly efficient. In particular we see:
The expected index being used with tight indexBounds
Efficient access of the data (totalKeysExamined == totalDocsExamined == nReturned)
No meaningful duration ("executionTimeMillis":0 which implies that the operation took less than 0.5ms for the database to execute)
Therefore the slowness that you're experiencing for that particular operation is not related to the efficiency of the plan itself. This doesn't always rule out the database (or its underlying server) as the source of the slowness completely, but it is usually a pretty strong indicator that either the problem is elsewhere or that there are multiple factors at play.
I would suggest the following as potential next steps:
Check the mongod log file (you can confirm its location by running db.adminCommand("getCmdLineOpts") via the shell connected to the instance; see the pymongo sketch after this list). By default any operation slower than 100ms is captured. This will help in a variety of ways:
If there is a log entry (with a meaningful duration) then it confirms that the slowness is being introduced while the database is processing the operation. It could also give some helpful hints as to why that might be the case (waiting for locks or server resources such as storage for example).
If an associated entry cannot be found, then that would be significantly stronger evidence that we are looking in the wrong place for the source of the slowness.
Is the operation that you gathered explain for the exact one that the application and Compass are observing as being slow? Were you connected to the same server and namespace? Is the explained operation simplified in some way, such as the original operation containing sort, projection, collation, etc?
As a relevant example that combines these two, I notice that there are skip and limit parameters applied to the command explained on a mongod seemingly running on a laptop. Are those parameters non-zero when running the application and does the application run against a different database with a larger data set?
The explain command doesn't include everything that an application would. Notably absent is the actual time it takes to send the results across the network. If you had particularly large documents that could be a factor, though it seems unlikely to be the culprit in this particular situation.
How exactly are you measuring the full execution time? Does it potentially include the time to connect to the database? In this case you mentioned that Compass itself also demonstrates the slowness, so that may rule out most of this.
What else is running on the server hosting the database? Is there a container or VM involved? Would the database or the underlying server be experiencing resource contention due to concurrency?
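For that first step, here is a minimal sketch of confirming the log location from Python with pymongo (assuming a local instance on the default port; adapt the connection string to your deployment):

from pymongo import MongoClient

# Placeholder connection string for the instance being investigated.
client = MongoClient("mongodb://localhost:27017")

# getCmdLineOpts reports the options mongod was started with, including
# systemLog.path if file logging was configured.
opts = client.admin.command("getCmdLineOpts")
print(opts.get("parsed", {}).get("systemLog", {}).get("path", "logging to stdout"))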
Two additional minor asides:
25 total documents in a collection is extremely small. I would expect even the smallest hardware to be able to process such a request without an index unless there was some complicating factor.
Assuming that match is always an array then the $elemMatch operator is not strictly necessary for this particular query. You can read more about that here. I would not expect this to have a performance impact for your situation.
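To illustrate that last aside, a hedged pymongo sketch of the equivalent query without $elemMatch; the lead collection name comes from the explain output, and the database name and ObjectId value are placeholders:

from bson import ObjectId
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["hp-test"]  # placeholder database name

target = ObjectId("5e0c3560e5a9e0cbd994fa52")  # placeholder value

# For a single equality condition on one field of the array elements,
# dot notation matches any element whose _id equals the target and can
# use the same multikey index on match._id.
docs = list(db.lead.find({"match._id": target}))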
I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I handle optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So when to run commit really depends on how quickly you want changes to appear on your site through the search engine. It is a heavy operation, though, so it should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It reorganizes the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Solr's index segments are write-once, so every time you reindex a document it marks the old version as deleted and creates a brand new document to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count by going to the Solr statistics page and comparing the numDocs and maxDocs numbers; the difference between the two is the number of deleted (non-searchable) documents in the index.
Also, optimize builds a whole new index from the old one and then switches to the new index when complete. The command therefore requires double the space to perform the action, so you will need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the search servers. You can set up the index server with multiple index endpoints.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not serving searches, you can set it up with a different memory footprint, index-warming commands, etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and then run an optimize only once a day at most.
3- Commit should be set up as "autoCommit" in the solrconfig.xml file, and tuned there according to your needs.
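As an illustration, a sketch of what that autoCommit block might look like inside the updateHandler section of solrconfig.xml; the thresholds below are placeholder values to tune for your own write volume and freshness needs:

<autoCommit>
  <!-- Commit after 10,000 buffered docs or 5 minutes, whichever comes first (placeholder values). -->
  <maxDocs>10000</maxDocs>
  <maxTime>300000</maxTime>
</autoCommit>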
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.
First of all, I searched for a long time but found no solution, so now I'll try with my specific problem, trying to keep it short:
solr-spec 4.0.0.2012.10.06.03.04.33
one master, three slaves
around 70,000 documents in the index
master gets triggered to do a full import / generate a complete new index ~ once a day
command line options for the trigger are:
?command=full-import&verbose=false&clean=false&commit=true&optimize=true
slaves check the master for a new index; if the generation (GEN) increases (full import + hard commit as mentioned), they pull the new index
no autoCommit / autoSoftCommit set up
The problem is that on each hard commit the index (~670 MB) gets written to disk, once a day, but the old transaction logs never get deleted.
As far as I have read, Solr keeps enough tlogs to be able to restore the last 100 changes to documents; am I right?
In my setup I am sure at least 100 documents (or data sets within the source database) are changed each day, so I don't understand why Solr never deletes old tlogs.
I would be glad if someone could point me in the right direction; currently I have no clue what to try next. I also have not found a setup like this one described as having problems like this.
Thx ;)
First, you'll probably want to update your Solr version, as there have been a few transaction log reference leaks fixed since 4.0.
A hard commit should usually remove old transaction logs, since the documents are written to disk in the index anyway (IIRC), which may indicate that you're getting bitten by some old references hanging around.
Another option would be to turn off the transaction log completely, since you generate a completely new index on each run anyway and distribute that one.
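As a sketch of that last option: the transaction log is controlled by the updateLog element inside the updateHandler section of solrconfig.xml, and commenting it out disables the tlog entirely (only sensible if, as you describe, the index can always be rebuilt from the source database):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Removing or commenting out updateLog disables the transaction log entirely. -->
  <!--
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  -->
</updateHandler>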
I am developing a test application that requires me to insert 1 million records in a Postgresql database but at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records. I've read that databases have a size cap, which is around 4 Gb, but I'm sure my database didn't even come close to this value.
So, what other reasons could be for why insertion stopped?
It happened a few times, once capping at 170872 records, another time at 25730 records.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
Thanks in advance!
JUST A QUICK UPDATE:
Indeed the problem isn't the database cap, here are the official figures for PostgreSQL:
- Maximum Database Size: Unlimited
- Maximum Table Size: 32 TB
- Maximum Row Size: 1.6 TB
- Maximum Field Size: 1 GB
- Maximum Rows per Table: Unlimited
- Maximum Columns per Table: 250-1600, depending on column types
- Maximum Indexes per Table: Unlimited
Update:
Error in log file:
2012-03-26 12:30:12 EEST WARNING: there is no transaction in progress
So I'm looking for an answer that fits this issue. If you can give any hints I would be very grateful.
I've read that databases have a size cap, which is around 4 Gb
I rather doubt that. It's certainly not true about PostgreSQL.
[...]at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records
Again, I'm afraid I doubt this. Unless your application has become self-aware, it isn't "refusing" to do anything. It might be crashing, or locking, or waiting for something to happen, though.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
I don't think you've looked hard enough. Obvious things to check:
Are you getting any errors in the PostgreSQL logs?
If not, are you sure you're logging errors? Issue a bad query to check.
Are you getting any errors in the application?
If not, are you sure you're checking? Again, check.
What is/are the computer(s) up to? How much CPU/RAM/Disk IO is in use? Any unusual activity?
Are any unusual locks being taken? (Check the pg_locks view.)
If you asked the question having checked the above then there's someone who'll be able to help. Probably though, you'll figure it out yourself once you've got the facts in front of you.
OK - if you're getting "no transaction in progress" that means you're issuing a commit/rollback outside of an explicit transaction. If you don't issue a "BEGIN", then each statement gets its own transaction.
This is unlikely to be the cause of the problem.
Something is causing the inserts to stop, and you've still not told us what. You said earlier you weren't getting any errors inside the application. That shouldn't be possible: if PostgreSQL is returning an error, you should be picking it up in the application.
It's difficult to be more helpful without more accurate information. Every statement you send to PostgreSQL will return a status code. If you get an error inside a multi-statement transaction then all the statements in that transaction will be rolled back. You've either got some confused transaction control in the application or it is falling down for some other reason.
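To illustrate that, a hedged Python sketch using psycopg2 (an assumption about the stack, since the application is Django-based) that wraps each batch of inserts in an explicit transaction and surfaces any error instead of letting the load stop silently; the connection string, table, and column names are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=test user=test")  # placeholder connection string

def insert_batch(rows):
    # psycopg2 opens a transaction implicitly on the first statement;
    # commit() ends it, and any error rolls back the whole batch.
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO records (payload) VALUES (%s)",  # placeholder table/column
                [(r,) for r in rows],
            )
        conn.commit()
    except psycopg2.Error as exc:
        conn.rollback()
        # Surface the failure instead of letting the insert stop silently.
        raise RuntimeError(f"batch failed: {exc}") from exc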
One possibility is that the OP is using SSL and the ssl_renegotiation_limit is being reached. In any case: set log_connections / log_disconnections to "on" and check the logfile.
I found out what the problem was with my insert command, and although it might seem funny, it's one of those things you never thought could go wrong.
My application is developed in Django and has a command that simply runs the file that does the insert operations into the tables.
i.e. in the command line terminal I just write:
time python manage.py populate_sql
The reason I use the time command is that I want to see how long the insertion takes to execute. Well, the problem was here. That command issued an error, an out-of-memory error, which stopped the insertion into the database. I found this while running the command with the --verbose option, which lets you see all the details of the command.
I would like to thank you all for your answers, for the things that I have learned from them and for the time you used trying to help me.
EDIT:
If you have a Django application in which you perform a lot of database operations, then my advice is to set the DEBUG variable in settings.py to False, because with DEBUG enabled Django keeps a record of every executed query in memory, which eats up a lot of it over time.
So,
DEBUG = False
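If you do need DEBUG enabled, a hedged alternative sketch is to clear Django's per-connection query log periodically during the bulk insert, since that log is what grows without bound while DEBUG is True; the chunking helpers here are placeholders:

from django.db import reset_queries

for i, chunk in enumerate(chunks):   # `chunks` is a placeholder for your batched rows
    insert_chunk(chunk)              # placeholder for whatever performs the actual INSERTs
    if i % 100 == 0:
        reset_queries()              # drop the accumulated SQL debug log to free memory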
And in the end, thank you again for the support Richard Huxton!
We have millions of documents in mongo we are looking to index on solr. Obviously when we do this the first time we need to index all the documents.
But after that, we should only need to index the documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from cron? And what do addDocument, commit, and optimize each do? (I am using Apache_Solr_Service.)
If you're using Solr 3.x you can forget about optimize, which merges all segments into one big segment. A commit makes changes visible to new IndexReaders; it's expensive, so I wouldn't call it for every document you add. Instead of calling it through cron, I'd use autoCommit in solrconfig.xml. You can tune the value depending on how long you can wait for new documents to show up in searches.
The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).
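The question mentions the PHP Apache_Solr_Service client; purely as an illustration of the add-now/commit-later pattern, here is a hedged sketch using pysolr, a different client chosen only for this example (the core URL is a placeholder):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")  # placeholder core URL

def index_doc(doc):
    # Called whenever a document changes: buffer the add, don't commit yet.
    solr.add([doc], commit=False)

def scheduled_commit():
    # Called from a scheduled job (or leave this to autoCommit in solrconfig.xml).
    solr.commit()

def nightly_optimize():
    # optimize() is the expensive segment merge; run it rarely, if at all.
    solr.optimize()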
If you set autocommit for your database, then you can be sure that any documents added via update have been committed once the autocommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within the 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the db, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has really claimed a business need to have it update faster than that.
This is with a 350,000 record db totalling roughly 10G in RAM.