Is it normal to need to restart Solr/Apache regularly? - solr

About once every 2 to 3 months, something in our Solr implementation breaks down. Most recently, the process we use to reindex our Solr cores broke. It's a console application that does just two things: 1) clear the indexes, 2) rebuild the indexes. That's it, and it does so by issuing HTTP requests to the server.
I was getting a 500 response when it tried to clear the 1st core. None of the other cores had a problem (except that they come after the 1st one, and since it is a synchronous process, nothing got reindexed). I spent a little time troubleshooting, but ultimately I just restarted Apache and it worked.
That seems to be the solution for everything. I wish I could remember previous issues I've run into, but something a little different seems to happen each time, and it's always much easier to restart Apache than to spend the time troubleshooting (plus it always happens in production, so spending hours troubleshooting isn't really a good option if I can fix it in seconds).
I wish these issues happened in staging or development, where I could take time to investigate further, but it's always in production. So I'm starting to wonder if I should just create a task to restart the Apache server every night.
When I restart it and it suddenly works normally without my having made a single change, I really have to wonder about the stability of Solr. Is this normal for others who use Solr?
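For reference, here is a minimal sketch of the "clear" step of a reindex process like the one described above. It assumes Solr's standard XML update handler at <base>/<core>/update; the base URL, the core names, and the function names are hypothetical. Unlike the synchronous process in the question, it records a failure on one core and carries on with the rest, so a 500 on the 1st core does not block the others.

```python
import urllib.request

# Delete-by-query body that matches every document in a core.
DELETE_ALL = b"<delete><query>*:*</query></delete>"


def clear_url(base_url, core):
    """Build the update URL that clears one core and commits the change."""
    return f"{base_url.rstrip('/')}/{core}/update?commit=true"


def post_xml(url, body, timeout=30):
    """POST an XML body and return the HTTP status code."""
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def clear_cores(base_url, cores, send=post_xml):
    """Clear each core in turn; record failures instead of aborting the run."""
    failed = {}
    for core in cores:
        try:
            status = send(clear_url(base_url, core), DELETE_ALL)
            if status != 200:
                failed[core] = f"HTTP {status}"
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            failed[core] = str(exc)
    return failed
```

Logging the returned dictionary (core name mapped to error text) on every run would also leave a record of what failed before the restart, which is the information that is currently being lost.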

Related

Spectre/Meltdown slowing down delphi service

I have a problem with the Spectre/Meltdown patch for Windows (it was released around Q1 last year). When the patch is active, my Delphi REST service is slowed down by a factor of about 15 (a request that normally takes 1 second takes about 15 seconds with the patches enabled). I have traced the slowdown to the database connection. Somehow the translation from parameters, after they have all been set, to the SQL text takes really long, and then the execution on the database itself takes a lot longer than usual. At first I worked around it by cutting the SQL statement down to a couple of rows, and it got faster, so more rows mean a lot more time. Roughly, each additional row in an update/insert statement adds 0.2-0.3 seconds to the transaction. As far as I can see, select statements work fine.
When I hit the same issue on other requests, and since the application is still in development, I turned off the patches, and everything got a lot faster. Now the administrator insists that the patches be turned on, and the problem is back.
Has anybody experienced something like this, or is there a possibility to exclude an application from being targeted by the patches? The strange thing is, I also have a client/server application that uses the same business logic. The client/server application is also slowed down, but only by a factor of about 2. That's the part I don't quite understand: with the same functions, it takes a lot longer from within the service than from the client/server application.
I should add that I am using Devart for the database connection, and it's an MSSQL server (2016). The service and the client/server application are written in Delphi XE7 (I am now trying to update to XE10.2, hoping that this will help).
Thanks

Google appengine, least expensive way to run heavy datastore write cron job?

I have a Google appengine application, written in Go, that has a cron process which runs once a day at 3am. This process looks at all of the changes that have happened to my data during the day and stores some meta data about what happened. My users can run reports on this meta data to see trends that have happened over several months. The process does around 10-20 million datastore writes every night. It all works just fine, but since I have started running it I have noticed a significant increase in my monthly bill from Google (from around $50/month to around $400/month).
I have just set up a very basic taskqueue that this runs in, and I have not changed the default settings at all. Is there a better way to run this process at night that could save me money? I have never messed around with the backends (which are now deprecated) or the modules API, and I know they've changed a lot of this stuff recently, so I'm not sure where to start looking. Any advice would be greatly appreciated.
Look at your instances at 3am. It might be that GAE spins up a lot of them to handle the job. You could configure your job to run less in parallel; it will take longer, but perhaps it will then need only 1 instance.
However, if your database writes are indeed the biggest factor this won't make a big impact.
You can try looking at your data models and indexes. Remember that each indexed field costs 2 writes extra, so see if you can remove indexes from some fields if you don't need them.
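To get a feel for how much unindexing fields could save, here is a rough estimate. It takes the figure above of 2 extra writes per indexed field at face value; the base cost of 2 writes per entity and the entity counts are assumptions for illustration, not numbers from the question.

```python
def nightly_writes(entities, indexed_fields, base_writes=2, writes_per_indexed_field=2):
    """Estimate billed write operations for one nightly run.

    base_writes is an assumed per-entity cost; writes_per_indexed_field
    follows the "2 writes extra" figure quoted in the answer.
    """
    return entities * (base_writes + writes_per_indexed_field * indexed_fields)


# Hypothetical run: 1,000,000 entities, first with 4 indexed fields, then with 1.
before = nightly_writes(1_000_000, indexed_fields=4)  # 10,000,000 writes
after = nightly_writes(1_000_000, indexed_fields=1)   # 4,000,000 writes
```

Under these assumptions, dropping three unneeded indexes cuts the write count by more than half, which is why the indexes are worth checking before anything else.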
One improvement you can make is to batch your write operations. You can use memcache for this (pay for the dedicated one, since it's more reliable). Write the updates to memcache, and once the buffer is about 900K, flush it to the datastore. This will reduce the number of writes to the datastore A LOT, especially if your metadata records are small.
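The buffering idea can be sketched as follows. Note that this sketch keeps the buffer in an in-memory list rather than in memcache, to stay self-contained; the class name and the flush callable (which stands in for a single datastore write of the whole batch) are hypothetical.

```python
import json


class WriteBatcher:
    """Accumulate small records and hand them to `flush` in large batches."""

    def __init__(self, flush, max_bytes=900_000):
        self._flush = flush          # callable that receives a list of records
        self._max_bytes = max_bytes  # flush before the serialized batch exceeds this
        self._pending = []
        self._size = 0

    def add(self, record):
        encoded = json.dumps(record, sort_keys=True).encode("utf-8")
        # Flush first if adding this record would push the batch over the limit.
        if self._pending and self._size + len(encoded) > self._max_bytes:
            self.flush()
        self._pending.append(record)
        self._size += len(encoded)

    def flush(self):
        """Write out whatever is pending; call once more at the end of the job."""
        if self._pending:
            self._flush(self._pending)
            self._pending = []
            self._size = 0
```

The saving comes from storing many small metadata records as one large entity instead of one entity each, so it only helps if the reports can read the data back in that batched form.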

Backend "Process moved to a different machine" and fails with error 500

I have a process that takes around five minutes to complete. It runs on a cron job every two hours in a backend instance.
Recently the process has started to fail; not every time but a few times a day. First thing that happens is that the memcache starts to throw exceptions:
04:21:13.640 com.google.appengine.api.memcache.LogAndContinueErrorHandler handleServiceError: Service error in memcache
com.google.appengine.api.memcache.MemcacheServiceException: Memcache get: exception getting 1 key (ItemFollowableCompleted:RegionUS:P8XD:0)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$RpcResponseHandler.handleApiProxyException(MemcacheServiceApiHelper.java:68)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$1.absorbParentException(MemcacheServiceApiHelper.java:109)
None of these are fatal exceptions but a few seconds later the process terminated without warning or shutdown message. Logs show
04:21:30.591 Process moved to a different machine.
and an error 500.
Is this a google infrastructure problem related to memcache or is there something in the app code that could be causing it?
No, it's not an error in Google infrastructure. Your process is expected to be moved among instances when needed (maintenance, more demand from your side, ...), and there's nothing you can do to prevent it.
Nonetheless there are a few things you could do to alleviate any effect this could have in your app.
See [1] for some suggestions on how to keep track of your pending jobs when your instance is shut down, and also have a look at background threads.
I'm guessing you're using Python; if not, look for the corresponding page for your language.
[1] https://developers.google.com/appengine/docs/python/backends/#Python_Backend_states
I have the same problem when I use ndb.put_multi() to load data. I tried a few things:
1. increase my backend machine size (I moved to B4_1G)
2. sleep between ndb.put_multi() calls (2 minutes for every 200 entities)
3. Dedicated memcache (1G)
1 and 2 were not very helpful, 3 seems to help.
I think rapid updates to ndb datastore affecting memcache is the root cause in my case. I could not find any other way besides paying for dedicated memcache.
I also ran into the "Process moved to a different machine" issue in a backend module.
The context is as follows:
Get the query result from one KIND
Iterate over each entity in the query result, doing some tasks and writing new entities to different KINDs
The "Process moved to a different machine" error happens about halfway through the iteration
After some experiments, I found it is due to "too many write transactions in one request". Everything is fine when the query result is small, but it causes problems when the result becomes larger.
The final solution I took was to use the Task Queue: the work to be done for one entity is treated as one task and put into a push queue. With that, the issue is gone.
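The shape of that solution can be sketched as follows. This is not the App Engine Task Queue API itself: enqueue, load, process, and save are hypothetical callables standing in for adding a task to the push queue, fetching an entity, deriving the new entities, and writing them.

```python
def enqueue_per_entity(entity_keys, enqueue):
    """Enqueue one small task per entity instead of processing all in one request."""
    count = 0
    for key in entity_keys:
        enqueue({"entity_key": key})
        count += 1
    return count


def handle_task(payload, load, process, save):
    """Task handler: load one entity, derive the new entities, and write them."""
    entity = load(payload["entity_key"])
    for new_entity in process(entity):
        save(new_entity)
```

Each request now performs only the writes for a single entity, so no request grows with the size of the query result, and a task that is interrupted can be retried on its own.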
Hope this will help :)

Appengine responses becoming slower?

My AJAX calls to AppEngine, which do some very basic logic (all the actual processing happens in the background, isolated from the frontend), tend to be at least 200% slower than they used to be. For example, they have suddenly been taking 3 seconds instead of one for the past week or so.
I am wondering if you guys had a similar experience or something changed in the meantime I am not aware of, quota wise maybe. I am using the free quota.
Thanks
Zac
To my knowledge there is no particular change going on, but we can't be sure. However, slow response times can have multiple root causes.
If you have no traffic on your application, then you might have zero instances running, so your request has to wait for an instance to start up.
If you have a lot of traffic, the request can take more time depending on your configuration. You need to fine-tune whether the request waits to be handled by an "overloaded" instance or whether another instance should start.
If you use an API maybe there is something wrong with it.
I would suggest you enable appstats in your app. It will show you what takes time in your request, so you will see clearly whether this is something on your side or not.

simple Solr deployment with two servers for redundancy

I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers,
to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery and test that those backups work, as an index could conceivably have been corrupted: there are no checksums happening internally in SOLR/Lucene. An index could get wiped, or some records could get deleted and merged away without you knowing it, and backups can be useful for recovering those records/docs at a later time if you need to perform an investigation.
Before you re-route traffic to the second instance I would run some queries to load caches and also to test and confirm the current index works before it goes online.
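A pre-cutover check along those lines might look like the sketch below. It assumes a multicore setup where the CoreAdmin handler is reachable at <base>/admin/cores and the JSON response writer (wt=json) is available; the base URL, core name, warm-up queries, and function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def reload_url(base_url, core):
    """URL that asks the CoreAdmin handler to reload one core."""
    query = urllib.parse.urlencode({"action": "RELOAD", "core": core})
    return f"{base_url.rstrip('/')}/admin/cores?{query}"


def select_url(base_url, core, q):
    """URL for a warm-up query that returns only the hit count."""
    query = urllib.parse.urlencode({"q": q, "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{query}"


def http_get(url, timeout=30):
    """GET a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def ready_for_traffic(base_url, core, warmup_queries, fetch=http_get):
    """Reload the core, then run warm-up queries; True only if every query returns hits."""
    try:
        fetch(reload_url(base_url, core))
        for q in warmup_queries:
            body = json.loads(fetch(select_url(base_url, core, q)))
            if body["response"]["numFound"] == 0:
                return False
    except (OSError, ValueError, KeyError):
        # Network/HTTP errors, unparseable JSON, or an unexpected response shape.
        return False
    return True
```

The load balancer would only be switched once ready_for_traffic returns True for every core. Choosing warm-up queries that are known to match documents makes a wiped or empty index show up as a failed check rather than as an outage.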
Isolate the updates to one location, one process, and one thread to ensure transactional integrity in the event of a cutover. Consistency can be difficult to manage, because SOLR does not use a vector clock to synchronize updates like some databases do. I personally would keep a copy of all updates, in order, in some other store separate from SOLR, just in case a small time window needs to be replayed.
In general, my experience with SOLR has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.

Resources