Nagios/Icinga checks running for a long time - nagios

I am currently running Icinga1 to monitor around ~6000 services.
On the Icinga dashboard, I see that the average check time is ~300 s, which means some of my checks are running slow. Unfortunately, because there are 6000 checks, I don't have a way to figure out which checks are running for more than a second.
Is there a way to figure out the checks which run for more than a certain duration (say 5 seconds), either from the classic UI or from the logs?

Try the wiki - performance tuning with Icinga1 is a big topic over there.
https://wiki.icinga.org/display/howtos/Identify+long+lasting+checks

I figured out that using icinga.cfg you can write host and service check performance data to files in custom formats.
Documentation for host performance data
Documentation for service performance data
You can also set up pnp4nagios to read this info and convert it into graphical reports.
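As a rough illustration, the performance data files can then be scanned for slow checks with a short script. The Python sketch below assumes a tab-separated service_perfdata_file_template whose fourth column is $SERVICEEXECUTIONTIME$ and a spool path of /var/spool/icinga/service-perfdata; both are assumptions, so adjust them to your configuration.

THRESHOLD = 5.0  # flag checks slower than 5 seconds

# Path and column layout are assumptions; match them to your perfdata template.
with open('/var/spool/icinga/service-perfdata') as perfdata:
    for line in perfdata:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # skip malformed lines
        host, service, exec_time = fields[1], fields[2], fields[3]
        try:
            if float(exec_time) > THRESHOLD:
                print('{}\t{}\t{}s'.format(host, service, exec_time))
        except ValueError:
            pass  # skip lines where the execution time is not a number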

You can consider using crontab to run the checks at different times to lower the load on the system.
Read more about crontab here - http://www.adminschoice.com/crontab-quick-reference

Related

Selenium Webdriver + Jmeter + StormRunner for Performance test

I wanted to try out an integration of Selenium, JMeter and StormRunner. My end goal is to do load testing with 'n' number of users on StormRunner.
What? - For example, I have a Selenium script and convert it into JMeter (I can get this information from many sources).
Then my JMeter script should be ready.
Then I upload the JMeter script to StormRunner, pass the necessary parameters through Jenkins and run the load test.
I really want an opinion here about feasibility and whether it is in the right direction or not.
The idea here is an automated load/performance test.
Selenium is a browser automation framework and JMeter acts on the HTTP protocol level, so your "automated" requirement might not be fulfilled, especially if your tests rely on client-side checks like sorting or waiting for an element to appear.
Theoretically, given you properly configure JMeter, it can behave like a real browser, but it still won't be executing client-side JavaScript.
If you're fine with this constraint, your approach is valid; if not, and the "automated functional test" requirement is a must, consider migrating to the TruClient Protocol instead.
Why wouldn't you convert your script to a native LoadRunner/StormRunner form of virtual user?
You should look at the value of what you are trying to achieve. The end value of a performance test is in the analysis. Analysis simply takes the timing records and the resource measurements produced during the test, brings them together on a common timestamp, and then allows you to analyze which resource "X" is being impinged when timing record "Y" is too long. This then points to some configuration or code which locks up on resource "X".
What is your path to value in your model? You speak about converting a functional test script into a performance one. Realistically, you should already know that your code "works for one" before you get to asking, "Does it work for many?" There is a change in script definitions which typically accompanies this understanding.
Where is your collection of resources noted? Which resources? On which hosts? This is the "path to value" problem: you need to have the resource measurements to diagnose the root cause of poor performance.
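To make the "common timestamp" idea concrete, here is a toy Python sketch that joins timing records with resource samples and flags the resources in play when a response is too slow; the data, the field names and the 2-second threshold are all made up for illustration.

timing_records = [
    {'ts': 100, 'transaction': 'login', 'elapsed': 0.4},
    {'ts': 160, 'transaction': 'search', 'elapsed': 3.2},
]
resource_samples = {
    100: {'cpu': 35, 'db_connections': 12},
    160: {'cpu': 97, 'db_connections': 85},
}

for record in timing_records:
    if record['elapsed'] > 2.0:  # timing record "Y" is too long
        # look up resource "X" measured at the same timestamp
        print('{} -> {}'.format(record['transaction'],
                                resource_samples.get(record['ts'])))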

NDB query().iter() of 1000<n<1500 entities is wigging out

I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
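As a rough sketch of that workaround, the query can be driven in explicit pages with cursors so that no single call runs past the problematic batch boundary; the page size of 500 and the helper name below are assumptions, not part of the original answer.

def fetch_all_in_pages(model, page_size=500):
    # Fetch entities of an ndb model in bounded pages, resuming from a
    # cursor each time, instead of consuming one long iterator.
    results = []
    cursor = None
    while True:
        # fetch_page returns (entities, next_cursor, more_flag)
        page, cursor, more = model.query().fetch_page(page_size, start_cursor=cursor)
        results.extend(entity.to_dict() for entity in page)
        if not more:
            break
    return results

For the example above, that would be called as fetch_all_in_pages(BarModel) instead of iterating over model.query().iter().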
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
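As a rough sketch of such an admin-only URL on App Engine, assuming webapp2 and the users API are available; the route, handler name and maintenance logic are placeholders.

import webapp2
from google.appengine.api import users

class MaintenanceHandler(webapp2.RequestHandler):
    def get(self):
        # Only signed-in application admins may trigger the task.
        if not users.is_current_user_admin():
            self.abort(403)
        # ... run the maintenance queries here instead of via the remote_api ...
        self.response.write('done')

app = webapp2.WSGIApplication([('/admin/maintenance', MaintenanceHandler)])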

Laravel Session File gigantic

I am using Laravel 5 for my web application.
Since running it for over a week, the sessions are stored as files with over 9 MB file size, instead of the 1 KB they used to be.
The CPU is running at 99% all the time and the server is not responding anymore. What causes this enormous file size and what do I need to do to reduce it?
Thanks!
You can play around with the session settings in config/session.php; specifically, the lottery setting might help you out.
You can also switch the session driver if your system is unable to cope with the files. Depending on what you actually store in your sessions and the size of your application, it might be beneficial to switch to a different session driver. Available options can be found here: http://laravel.com/docs/5.1/session#introduction

How to ensure that a bot/scraper does not get blocked

I coded a simple scraper whose job is to go to several different pages of a site, do some parsing, call some URLs that are otherwise called via AJAX, and store the data in a database.
The trouble is that sometimes my IP is blocked after my scraper executes. What steps can I take so that my IP does not get blocked? Are there any recommended practices? I have added a 5 second gap between requests, to almost no effect. The site is medium-big (I need to scrape several URLs) and my internet connection is slow, so the script runs for over an hour. Would being on a faster net connection (like on a hosting service) help?
Basically I want to code a well-behaved bot.
Lastly, I am not POST'ing or spamming.
Edit: I think I'll break my script into 4-5 parts and run them at different times of the day.
You could use rotating proxies, but that wouldn't be a very well behaved bot. Have you looked at the site's robots.txt?
Write your bot so that it is more polite, i.e. don't fetch everything sequentially, but add delays in strategic places.
Following the guidelines set in robots.txt is a good first step. There are tools such as import.io and morph.io, and there are also packages/plugins for servers. For example x-ray, a Node.js package, has options to assist in quickly writing responsible scrapers, e.g. throttle, delays, max connections, etc.
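As a rough illustration of both points, here is a minimal Python 3 sketch that honours robots.txt and waits between requests; the example.com URLs, the user agent name and the 10-second delay are placeholders.

import time
import urllib.request
import urllib.robotparser

USER_AGENT = 'MyPoliteBot'  # placeholder user agent

robots = urllib.robotparser.RobotFileParser()
robots.set_url('https://example.com/robots.txt')
robots.read()

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholders
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # respect Disallow rules
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request) as response:
        html = response.read()
    # ... parse the page and store the data here ...
    time.sleep(10)  # generous delay between requests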

How can I find why some classic asp pages randomly take a real long time to execute?

I'm working on a rather large classic ASP / SQL Server application.
A new version was rolled out a few months ago with a lot of new features, and I must have a very nasty bug somewhere: some very basic pages randomly take a very long time to execute.
A few clues:
It isn't the database: when I run the query profiler, it doesn't detect any long-running query.
When I launch the IIS Diagnostic tools, reqviewer shows that the request is in state "processing".
This can happen on ANY page
I can't reproduce it easily, it's completely random.
To give an idea of "a very long time": this morning I had a page take more than 5 minutes to execute, when it normally should be returned to the client in less than 100 ms.
The application can handle rather large uploads and downloads of files (up to 2 GB in size). This is also handled with a classic ASP script, using SoftArtisan FileUp. I don't think it can cause the problem though; we've had these uploads for quite a while now.
I've had the problem on two separate servers (in two separate locations, with different sets of data). One is running the application with good ol' SQL Server 2000 and the other runs SQL Server 2005. The web server is IIS 6 in both cases.
Any idea what the problem is or how to solve that kind of problem?
Thanks.
Sebastien
Edit :
The problem came from memory fragmentation. Some ASP pages were used to download files from the server. File sizes could range from a few KB to more than 2 GB. These variations in size induced memory fragmentation. The ASP pages could also take quite some time to execute (the time for the user to download the page, minus what is cached at IIS's level), which is not really standard for server pages that should execute quickly.
This is what I did to improve things :
Put all the download logic in a single ASP page with session state turned off
That allowed me to put that ASP page in a specific application pool that could be recycled every so often (downloads would no longer disturb the rest of the application)
Turn on the LFH (Low Fragmentation Heap), which is not enabled by default on Windows 2003, in order to reduce memory fragmentation
References for LFH :
http://msdn.microsoft.com/en-us/library/aa366750(v=vs.85).aspx
Link (there is a dll there that you can use to turn on LFH, but the article is in French. You'll have to learn our beautiful language now!)
I noticed the same thing on a classic ASP + AJAX application that I worked on. Using Timer, I timed the page load to be 153 milliseconds, but in the Firebug waterfall chart it randomly says 3.5 seconds. The Timer output is in the response, and the waterfall chart claims that it's Firefox waiting for a response from the server. Because the waterfall chart also shows the response, I can compare the waterfall chart to the timer, and there's a huge discrepancy 'every so often'.
Can you establish whether this is a problem for all pages or a common subset of pages?
If it's a subset, examine what these pages have in common; for example, they all use a specific COM DLL that other pages don't.
Does this problem affect multiple clients or just a few?
In other words, is there an issue with a specific browser/OS version?
Is this public or intranet?
Can you reproduce the problem from a client you own?
Is there any chance there are some full-text search queries running on SQL Server?
Because if so, and if SQL Server has no access to the internet, it may cause a 45-second delay every few hours or so when it tries to check the certificates (though this does not apply to SQL Server 2000).
For a detailed explanation of what I'm referring to, read this.
Are any other apps running on your web server? If so, is your problematic app in the same app pool as any of them? If so, try creating a dedicated app pool for it. Maybe one of the other apps is having a problem and is adversely affecting yours.
One thing to watch out for: if you have server-side debugging turned on in IIS, the web server will run in single-threaded mode.
So if you try to load a page and someone else has hit that URL at the same time, you will be queued up behind them. It will seem like pages take a long time to load, but it's simply because the server is doling out page requests in a single-file line and sometimes you aren't at the front of the line.
You may have turned this on for debugging and forgotten to turn it off for production.
