First of all, I searched for a long time but found no solution, so now I'll try with my specific problem, trying to keep it short:
solr-spec 4.0.0.2012.10.06.03.04.33
one master, three slaves
around 70,000 documents in the index
the master gets triggered to do a full import / generate a complete new index roughly once a day
the parameters for the trigger command are (a sketch of the trigger call follows after this setup):
?command=full-import&verbose=false&clean=false&commit=true&optimize=true
the slaves poll the master for a new index; if the generation (GEN) increases (full import + hard commit as mentioned), they pull the new index
no autoCommit / autoSoftCommit set up
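For reference, a minimal sketch of how that daily trigger might be fired over HTTP, assuming the DataImportHandler is registered at /dataimport and the host/core names are placeholders:

import requests

MASTER = "http://solr-master:8983/solr/mycore"  # placeholder host and core

params = {
    "command": "full-import",
    "verbose": "false",
    "clean": "false",
    "commit": "true",
    "optimize": "true",
}

# Fire the DataImportHandler full import with the parameters listed above.
resp = requests.get(MASTER + "/dataimport", params=params, timeout=30)
resp.raise_for_status()
print(resp.text)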
The problem is that on each hard commit the index (~670 MB) gets written to disk, once a day, but the old transaction logs never get deleted.
As far as I've read, Solr keeps enough tlogs to be able to restore the last 100 changes to documents, am I right?
In my setup I am sure at least 100 documents (or data sets within the source database) are changed each day, so I don't understand why Solr never deletes old tlogs.
I would be glad if someone could point me in the right direction; currently I have no clue what to try next. I also did not find a setup like this one described as having problems like this.
Thx ;)
First you'll probably want to update your Solr version, as there have been a few transaction log reference leaks fixed since 4.0.
A hard commit should usually remove old transaction logs, as the documents are written to disk in the index anyway (iirc), which may indicate that you're getting bitten by some old references hanging around.
Another option would be to turn off the transaction log completely, since you only generate a complete new index each run anyway and distribute that one.
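If you want to confirm that old transaction logs really are piling up before changing anything, a quick sketch like this reports how many tlog files exist and how much space they use (the path is an assumption; point it at your core's data directory):

import os

TLOG_DIR = "/var/solr/data/mycore/data/tlog"  # placeholder path

files = sorted(os.listdir(TLOG_DIR))
total = sum(os.path.getsize(os.path.join(TLOG_DIR, f)) for f in files)

print("%d tlog files, %.1f MB total" % (len(files), total / 1024.0 / 1024.0))
for name in files[-5:]:  # show the most recent few
    print("  " + name)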
I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear until you run the commit command. So timing of when to run the commit command really depends on how quickly you want the changes to appear on your site through the search engine. However, it is a heavy operation, so it should be done in batches, not after every update.
Optimize: This is similar to a defrag command on a hard drive. It will reorganize the index into segments (increasing search speed) and remove any deleted (replaced) documents. Solr's index segments are write-once, so every time you re-index a document it marks the old document as deleted and then creates a brand new document to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document counts by going to the Solr Statistics page and looking at the numDocs vs. maxDocs numbers. The difference between the two numbers is the number of deleted (non-searchable) documents in the index.
Also, Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action, so you will need to make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of deleted documents.)
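As a rough illustration of "commit in batches, optimize rarely" (the pysolr client, URL, and field names here are assumptions, not part of the answer above):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/ads", timeout=60)  # placeholder core

def index_in_batches(docs, batch_size=1000):
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            solr.add(batch, commit=True)  # one commit per batch, not per document
            batch = []
    if batch:
        solr.add(batch, commit=True)

docs = ({"id": str(i), "title": "Ad %d" % i} for i in range(10000))
index_in_batches(docs)

# Expensive: rewrites the index into fewer segments; run it rarely (e.g. once a day).
solr.optimize()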
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the search servers. You can set up the index server with multiple index endpoints,
e.g.: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and then run an optimize only once a day at most.
3- Commit should be set up as "autoCommit" in the solrconfig.xml file, and tuned there according to your needs.
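A minimal sketch of point 3, where the client never commits explicitly and relies on autoCommit being configured in solrconfig.xml (pysolr and the 15-minute interval mentioned in the comment are assumptions):

import pysolr

# Assumes solrconfig.xml contains an <autoCommit> block, e.g. a maxTime of 900000 ms
# (commit at most every 15 minutes); the client just adds and never commits.
solr = pysolr.Solr("http://localhost:8983/solr/ads", timeout=30)  # placeholder core

def on_ad_saved(ad):
    # No commit here; the server-side autoCommit makes the change visible later.
    solr.add([{"id": str(ad["id"]), "title": ad["title"]}], commit=False)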
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.
I attempted to perform a full Solr reindex of our Cassandra cluster this past weekend. Two nodes seemed to be taking a lot longer than the other three; in fact they kept indexing for hours after the others were done. Finally they seemed to have finished; at least the web console said "no" in the indexing field for both of them.
Unfortunately about an hour later one of those two nodes became completely unresponsive, and ultimately had to be restarted.
Today I'm looking at the nodes, and the three that didn't seem to have any problems all claim to have about 14.8 million docs or so, which is about what it should be. However the two that were stuck, or took forever (including the one that ultimately became unresponsive), have only 9 and 7 million respectively. That is a huge discrepancy, which tells me they didn't complete correctly.
So, to resolve the issue I have two questions:
1) Since this was a full reindex, are the schema changes that were the reason for the full reindex in place and good? In other words, is it only the indexing part that didn't finish, so I can just run a regular in-place reindex to get everything back to the way it should be?
2) Assuming I don't have to run a full reindex, can I just run an in-place reindex on the two nodes that are out of whack? From a time perspective this would be ideal, as I'd have to do it after hours anyway, and it would hopefully finish overnight.
Just wondering how to proceed, as I haven't had this issue in the past.
Regarding your questions:
1) Yes, you can do a reload with in-place reindex by setting reindex=true, deleteAll=false.
2) Yes, you can run an in-place reindex on the failed nodes only by invoking a reload on each node and setting reindex=true, deleteAll=false, distributed=false.
Have a look at: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchReldCore.html
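For illustration, the per-node reload described in 2) could look roughly like this against the core admin HTTP API (the host and keyspace.table core name are placeholders; check the linked DataStax docs for the exact form your version expects):

import requests

params = {
    "action": "RELOAD",
    "name": "my_keyspace.my_table",  # placeholder DSE Search core name
    "reindex": "true",
    "deleteAll": "false",
    "distributed": "false",  # restrict the reindex to this node only
}

resp = requests.get("http://failed-node:8983/solr/admin/cores", params=params, timeout=60)
resp.raise_for_status()
print(resp.status_code)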
Anyway, it would be good to first understand why those nodes failed: that kind of behaviour looks like an out-of-memory error. Are there any exceptions in your logs?
We have the following problem at hand: we want to do a full reindex with 100% read availability during the process. The problem arises when deleting old documents from the index. At the moment we're doing something like this:
1) fetch all data from the db and update the Solr index via solrServer.add()
2) get all document ids that were updated and compare them with all the document ids in the index
3) delete all documents that are in the index but weren't updated
This seems to work but is there maybe a better/easier solution for this task?
The changes do not become visible until you commit. So you can issue the delete and then index all your documents; just make sure automatic commits are disabled. This obviously requires more memory.
Alternatively, you can add a separate field with a generational stamp (e.g. an increasing ID or timestamp). Then you issue a delete-by-query to pick up the leftover documents with the old generation.
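A rough sketch of that generational-stamp idea (the field name generation and the pysolr client are assumptions):

import time
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=60)  # placeholder core
generation = int(time.time())  # one value for this whole reindex run

def reindex(rows):
    # Stamp every re-indexed document with the current generation.
    docs = [dict(row, generation=generation) for row in rows]
    solr.add(docs, commit=True)

def purge_stale():
    # Everything not touched by this run still carries an older generation value.
    solr.delete(q="generation:[* TO %d]" % (generation - 1), commit=True)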
Finally, you can index into a new core/collection and then swap the active collection to point to the new one. Then you can just delete the old collection's directory.
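And a hedged sketch of that third option: index into a fresh core, then swap it in via the CoreAdmin SWAP action (core names are placeholders):

import requests

ADMIN = "http://localhost:8983/solr/admin/cores"

# ... index everything into the "items_build" core first ...

resp = requests.get(ADMIN, params={"action": "SWAP", "core": "items", "other": "items_build"}, timeout=30)
resp.raise_for_status()
# The old index now lives under "items_build" and its directory can be deleted later.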
It sounds like you may have a performance issue with the deletes. If you do this:
delete id:12345
delete id:23456
delete id:13254
then it is a lot slower than this:
delete id:(12345 OR 23456 OR 13254)
Collect the list of ids that need to be deleted, batch them in groups of 100 or so, and transform those batches into delete queries using parentheses and OR. I have done this with batches of deletes numbering several thousand, and it is much faster than stepping through one at a time.
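A small sketch of that batching approach (pysolr assumed; only the OR-query construction matters):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=60)  # placeholder core

def delete_in_batches(ids, batch_size=100):
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        query = "id:(" + " OR ".join(batch) + ")"
        solr.delete(q=query)  # one request per batch instead of one per id
    solr.commit()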
I am developing a test application that requires me to insert 1 million records into a PostgreSQL database, but at random points the insert stops, and if I try to restart the insertion process, the application refuses to populate the table with more records. I've read that databases have a size cap of around 4 GB, but I'm sure my database didn't even come close to that value.
So, what other reasons could be for why insertion stopped?
It has happened a few times, once stopping at 170,872 records, another time at 25,730 records.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
Thanks in advance!
JUST A QUICK UPDATE:
Indeed, the problem isn't a database cap; here are the official figures for PostgreSQL:
- Maximum Database Size Unlimited
- Maximum Table Size 32 TB
- Maximum Row Size 1.6 TB
- Maximum Field Size 1 GB
- Maximum Rows per Table Unlimited
- Maximum Columns per Table 250 - 1600 depending on column types
- Maximum Indexes per Table Unlimited
Update:
Error in log file:
2012-03-26 12:30:12 EEST WARNING: there is no transaction in progress
So I'm looking for an answer that fits this issue. If you can give any hints I would be very grateful.
I've read that databases have a size cap, which is around 4 Gb
I rather doubt that. It's certainly not true about PostgreSQL.
[...]at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records
Again, I'm afraid I doubt this. Unless your application has become self-aware, it isn't refusing to do anything. It might be crashing, or locking, or waiting for something to happen, though.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
I don't think you've looked hard enough. Obvious things to check:
Are you getting any errors in the PostgreSQL logs?
If not, are you sure you're logging errors? Issue a bad query to check.
Are you getting any errors in the application?
If not, are you sure you're checking? Again, issue a bad query to check.
What is/are the computer(s) up to? How much CPU/RAM/Disk IO is in use? Any unusual activity?
Are any unusual locks being taken? (Check the pg_locks view; a query sketch follows below.)
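For that lock check, something along these lines works against the pg_locks view (psycopg2 and the connection string are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
with conn, conn.cursor() as cur:
    # List locks that have been requested but not yet granted.
    cur.execute("SELECT pid, locktype, relation::regclass, mode FROM pg_locks WHERE NOT granted")
    for row in cur.fetchall():
        print(row)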
If you asked the question having checked the above then there's someone who'll be able to help. Probably though, you'll figure it out yourself once you've got the facts in front of you.
OK - if you're getting "no transaction in progress", that means you're issuing a commit/rollback outside of an explicit transaction. If you don't issue a "BEGIN", then each statement gets its own transaction.
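To make that concrete, here is a hedged psycopg2 sketch (the table and connection details are placeholders); at the SQL level, a COMMIT sent when no transaction block is open is exactly what produces that warning:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
conn.autocommit = True

cur = conn.cursor()
cur.execute("INSERT INTO items (name) VALUES (%s)", ("widget",))  # commits immediately
cur.execute("COMMIT")  # no transaction block is open -> the server logs the warning above

# The usual pattern: run related statements inside one explicit transaction.
conn.autocommit = False
cur.executemany("INSERT INTO items (name) VALUES (%s)", [("a",), ("b",), ("c",)])
conn.commit()  # the driver issued BEGIN implicitly, so this COMMIT is fine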
This is unlikely to be the cause of the problem.
Something is causing the inserts to stop, and you've still not told us what. You said earlier you weren't getting any errors inside the application. That shouldn't be possible: if PostgreSQL is returning an error, you should be picking it up in the application.
It's difficult to be more helpful without more accurate information. Every statement you send to PostgreSQL will return a status code. If you get an error inside a multi-statement transaction then all the statements in that transaction will be rolled back. You've either got some confused transaction control in the application or it is falling down for some other reason.
One possibility is that the OP is using SSL and the ssl_renegotiation_limit is being reached. In any case: set log_connections / log_disconnections to "on" and check the logfile.
I found out what the problem was with my insert command, and although it might seem funny, it's one of those things you never thought could go wrong.
My application is developed in Django and has a command that simply calls the file that does the insert operations into the tables.
i.e. in the command line terminal I just write:
time python manage.py populate_sql
The reason I use the time command is that I want to see how long the insertion takes to execute. Well, the problem was here: the command ran into an out-of-memory error, which stopped the insertion into the database. I found this while running the command with the --verbose option, which lets you see all the details of the command.
I would like to thank you all for your answers, for the things that I have learned from them and for the time you used trying to help me.
EDIT:
If you have a Django application in which you do a lot of operations with the database, then my advice is to set the DEBUG variable in settings.py to False, because it eats up a lot of your memory over time.
So,
DEBUG = False
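For context, with DEBUG = True Django appends every executed SQL statement to connection.queries, which grows without bound during a long bulk insert. If you cannot turn DEBUG off, a hedged workaround is to clear that list periodically (the model and batch size below are placeholders):

from django.db import reset_queries

def bulk_populate(model, rows, flush_every=10000):
    # Insert rows while keeping Django's debug query log from growing unbounded.
    for i, row in enumerate(rows):
        model.objects.create(**row)
        if i % flush_every == 0:
            reset_queries()  # clears django.db.connection.queries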
And in the end, thank you again for the support Richard Huxton!
We have millions of documents in Mongo that we are looking to index in Solr. Obviously, when we do this the first time we need to index all the documents.
But after that, we should only need to index the documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from cron? What do addDocument vs. commit vs. optimize do? (I am using Apache_Solr_Service.)
If you're using Solr 3.x you can forget the optimize, which merges all segments into one big segment. The commit makes changes visible to new IndexReaders; it's expensive, I wouldn't call it for each document you add. Instead of calling it through a cron, I'd use the autocommit in solrconfig.xml. You can tune the value depending on how much time you can wait to get new documents while searching.
The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).
If you set up autoCommit for your Solr index, then you can be sure that any documents added via an update have been committed once the autoCommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within those 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the index, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has really claimed a business need to have it update faster than that.
This is with a 350,000-record index totalling roughly 10 GB in RAM.
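A rough sketch of the incremental flow the question asks about: poll Mongo for changed documents and let the server-side autoCommit handle visibility (pymongo, pysolr, the field names, and the poll interval are all assumptions):

import time
import pymongo
import pysolr

ads = pymongo.MongoClient()["mydb"]["ads"]            # placeholder collection
solr = pysolr.Solr("http://localhost:8983/solr/ads")  # placeholder core

last_run = 0.0
while True:
    changed = ads.find({"updated_at": {"$gt": last_run}})
    docs = [{"id": str(d["_id"]), "title": d.get("title", "")} for d in changed]
    if docs:
        solr.add(docs, commit=False)  # no explicit commit; autoCommit makes them searchable
    last_run = time.time()
    time.sleep(60)  # poll once a minute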