I want to share, store and monitor some data between different servers.
Here is what I need: a place where the manager writes commands, and where the workers read those commands and carry them out. The workers should also register themselves there so I can get some information about them. A TTL is also needed, so I can tell when a worker has stopped.
Etcd looks good, but I don't need that level of reliability, and it is too heavyweight for me. Is there something similar that runs on a single machine? Low latency is preferable, so that when the manager writes a command, the workers receive it quickly.
I've just started using Nagios to monitor a group of broadcast transmitters. Each transmitter is defined as a host, and each aspect of the transmitter I wish to monitor (RF forward, RF reflected, power supply voltages, etc) is defined as a service. In doing so, I can get an alarm if any of these aspects are out of tolerance, and can use the performance data to graph each aspect (using pnp4nagios, in this case).
To check the transmitters' telemetry data, I wrote some scripts, one to address the unique facilities of each make/model of transmitter involved. In keeping with the way I've seen other Nagios checks work, an argument to the script allows you to select which aspect you want reported.
At first I was content with this. It worked like any more-traditional use of Nagios I'd encountered. But then I hit a snag.
Because each service check is scheduled individually, diagnosing an alarm condition can be tricky, since the various services aren't all being checked at the same time - and therefore the set of values I'm looking at is unlikely to be time-aligned. If all the service check values were from the same moment in time, it would be easier to detect correlations (since the set of values would essentially be a snapshot).
My first thought would be to deal with this by running a single instance of a single command, which would return values for multiple services. This would also seem far more efficient than opening as many connection instances as there are services to be checked. From a scripting perspective, this is easily done. But from a Nagios config perspective, I don't know how (or if?) you'd do that.
I know I could also divorce the data collection from the Nagios check, caching the telemetry values all at once periodically, and feeding Nagios values from the cache. But I don't want to introduce added delays if I can help it.
Thoughts?
My first thought would be to deal with this by running a single instance of a single command, which would return values for multiple services. This would also seem far more efficient than opening as many connection instances as there are services to be checked. From a scripting perspective, this is easily done. But from a Nagios config perspective, I don't know how (or if?) you'd do that.
There's nothing strange about this from a Nagios perspective, because what you're essentially doing is writing your own plugin, and plugins can be as general or specific as you want them to be.
When writing your own plugin, it's good to remember:
Your script is responsible for all failures, so make sure you handle garbage responses, failed connections and whatever other errors you predict may happen in the plugin itself, and exit with appropriate error levels.
Since you may encounter errors you didn't expect, it probably makes sense to have the plugin write what it's doing to a log file, as well as what responses it got.
The plugin must use exit codes to alert Nagios correctly. If you want performance data, it needs to be given in the correct syntax. See the development guidelines.
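As a rough illustration of the points above, here is a minimal Python sketch of a plugin that reads every value in one pass and reports them together under a single service. The `read_telemetry` function, the value names and the thresholds in `LIMITS` are all hypothetical placeholders; only the exit codes (0 to 3) and the `label=value;warn;crit` performance-data layout follow the plugin guidelines.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical thresholds: name -> (warn_low, warn_high, crit_low, crit_high).
LIMITS = {
    "rf_forward": (900, 1100, 800, 1200),
    "rf_reflected": (0, 25, 0, 50),
    "psu_volts": (46, 50, 44, 52),
}


def read_telemetry():
    """Hypothetical stand-in for one connection that fetches every value at once."""
    return {"rf_forward": 950.0, "rf_reflected": 12.0, "psu_volts": 47.8}


def evaluate(readings, limits):
    """Return (exit_code, status_line) for one snapshot of readings."""
    worst = OK
    problems, perfdata = [], []
    for name, (w_lo, w_hi, c_lo, c_hi) in limits.items():
        value = readings.get(name)
        if not isinstance(value, (int, float)):
            # Missing or garbage value: report UNKNOWN instead of guessing.
            return UNKNOWN, f"TX UNKNOWN - no usable value for {name}"
        if not (c_lo <= value <= c_hi):
            state = CRITICAL
        elif not (w_lo <= value <= w_hi):
            state = WARNING
        else:
            state = OK
        if state != OK:
            problems.append(f"{name}={value}")
        worst = max(worst, state)
        # Perfdata ranges use "low:high", meaning "alert when outside this range".
        perfdata.append(f"{name}={value};{w_lo}:{w_hi};{c_lo}:{c_hi}")
    summary = ", ".join(problems) if problems else "all values in tolerance"
    return worst, f"TX {STATE_NAMES[worst]} - {summary} | {' '.join(perfdata)}"


def main():
    """Print the status line and return the exit code for Nagios."""
    try:
        code, line = evaluate(read_telemetry(), LIMITS)
    except Exception as exc:  # failed connection, parse error, and so on
        code, line = UNKNOWN, f"TX UNKNOWN - {exc}"
    print(line)
    return code
```

A real plugin would end by calling `sys.exit(main())` so that Nagios sees the exit code. Because all values come from one call to `read_telemetry`, the performance data on the status line is a time-aligned snapshot.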
I'm considering submitting the service data passively. It would solve all the problems I mentioned. But it would create a few minor new ones - now there's external processes to keep running, and it's a little outside the mainstream way of doing things (might put a future admin through a little pain to figure out how it works).
I don't think this is a better solution than writing your own plugin, unless the data is coming from nodes actively pushing it out.
For example, in an IoT context, the nodes you are monitoring may actually be sending passive check results directly to the Nagios instance. In that setting, passive checks make sense, because you just want to take whatever someone else gives you and take action when no results come in (freshness checking).
In your case, it sounds like writing your own script would take care of both the timing issue and whatever additional logic you want in your script. As far as Nagios is concerned, it only has to run the script on a schedule, watch the exit codes, and act as configured if it fails.
The application used by a group of 100+ users was built with VB6 and RDO. A replacement is coming, but the old one is still maintained. The users moved to a different building across the street, and the problems began. My view has been that the problem is bandwidth, but I've had to argue with others who say it is the database. Users regularly experience network slowness in the application, and also in workstation tasks in general. The application occasionally moves and indexes large audio files, as well as others.
Occasionally the database hangs. We have many top-end, robust SQL Servers, so it is not a server problem. What I figured out is that a transaction is begun on a connection but fails to complete properly because of a communication error. Updates from other connections are then blocked and keep stacking up, and users are down for half a day. What I've begun doing the moment I'm told of a problem, after verifying that the database is hung, is to set the database to single-user mode and then back to multi-user to clear the connections. The users must then all restart their applications.
Today I found out there is a bandwidth limit at their new location, which they regularly max out. I think the old location had a big pipe serving many people, whereas now they are on a small pipe serving a small number of people, which is also less tolerant of momentary high bandwidth demands.
What I want to know is exactly what happens to packets, both coming and going, when a bandwidth limit is reached. Also I want to know what happens in SQL Server communication. Do some packets get dropped? Do they start arriving more out of sequence? Do timing problems occur?
I plan to start controlling such things as file moves through the application. But I also want to know what configurations are usually present on network nodes regarding transient high demand.
This is a very broad question. Networking is key to good performance (especially with Availability Groups or any sort of mirroring setup). When transactions complete on the SQL Server, the results are placed in the output buffer. The app then needs to 'pick up' that data, clear its output buffer and continue on. I think (without knowing your configuration) that your apps aren't able to complete the round trip because the network pipe is inundated with requests, so the apps can't get what they need to successfully finish and close out. This causes havoc, as the network can't keep up with what the apps and SQL Server are trying to do. Then you have a 200-car pileup on a one-lane highway.
Hindsight being what it is, there should have been extensive testing of the network capacity before everyone moved across the street. Clearly that didn't happen, so you are left to do what you can with what you have. If the company can't get a stable network connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher-ups and explain to them the consequences of the reduced network capacity. Oftentimes, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup happens? I'm thinking it will be something along the lines of ASYNC_NETWORK_IO, which usually indicates that SQL Server is waiting on the app to come back and pick up its data.
What are my best options for logging 3k events per second from a C file? These are the options that come to mind. I can't decide which would be the most robust solution, with the fewest failure points, the highest reliability and the lowest latency.
Use a messaging server to relay events as they happen
Use syslog for logging
Use Unix pipe
Use of logging agents like fluent which will send events to analysis server
Write a log file locally, rotate it periodically, and ship it to the analysis server using something like rsync
Try syslog. No reason to make it too complicated. With syslog-ng you can do local logging through UDP, then set up the local syslogd to forward everything through TCP to a central syslog server. You might need to run without fsync on the central syslog server to keep up with that load (but test first); the risk that brings can be mitigated by forwarding everything to two separate machines. This gives you asynchronous performance locally and enough reliability that you should almost never lose events.
Another option I've used is to log events into Redis, Riak or some other NoSQL data store (I usually don't recommend them for anything complex, but event logging is right up their alley). Set up mirroring for redundancy and they should be able to keep up with far more than 3k events per second.
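To make the "local logging through UDP" step concrete, here is a small sketch in Python rather than C, purely for illustration. It uses the standard library's `logging.handlers.SysLogHandler`, which sends each record as one UDP datagram by default. The `events` tag, the `local0` facility and the default address are assumptions; a local syslog daemon would need to be configured to listen on that UDP port and forward over TCP.

```python
import logging
import logging.handlers


def make_event_logger(host="127.0.0.1", port=514):
    """Return a logger that sends each record to syslog as one UDP datagram.

    The host, port, facility and "events" tag are illustrative assumptions.
    """
    handler = logging.handlers.SysLogHandler(
        address=(host, port),  # UDP is the handler's default socket type
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    handler.setFormatter(logging.Formatter("events: %(message)s"))

    logger = logging.getLogger("events")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep events out of the root logger
    return logger
```

Calling `make_event_logger().info("worker started")` hands the datagram to the kernel without waiting for an acknowledgement. That is where the low local latency comes from, and also why the local daemon should do the reliable TCP forwarding.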
I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers, to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery, and test that those backups work: an index could conceivably become corrupted, as there are no checksums happening internally in SOLR/Lucene. An index could also get wiped, or some records could get deleted and merged away without you knowing it, and backups can be useful for recovering those records/docs later if you need to investigate.
Before you re-route traffic to the second instance I would run some queries to load caches and also to test and confirm the current index works before it goes online.
Isolate the updates to one location and process and thread to ensure transactional integrity in the event of a cutover as it could be difficult to manage consistency as SOLR does not use a vector clock to synchronize updates like some databases. I personally would keep a copy of all updates in order separately from SOLR in some other store just in case a small time window needs to be repeated.
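The reload-then-warm step suggested above could be scripted roughly as follows. This is only a sketch: it assumes the standby exposes Solr's CoreAdmin `RELOAD` action under `<base_url>/admin/cores` and a `select` handler under `<base_url>/<core>/select`, and that an HTTP 200 is a good enough sign of health. The function names, the base URL and the warm-up queries are hypothetical.

```python
import urllib.parse
import urllib.request


def reload_url(base_url, core):
    """URL for the CoreAdmin RELOAD action on one core."""
    query = urllib.parse.urlencode({"action": "RELOAD", "core": core})
    return f"{base_url}/admin/cores?{query}"


def warm_url(base_url, core, q):
    """URL for a query that loads caches without returning documents."""
    query = urllib.parse.urlencode({"q": q, "rows": 0})
    return f"{base_url}/{core}/select?{query}"


def http_ok(url, timeout=10):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def prepare_standby(base_url, core, warm_queries, fetch=http_ok):
    """Reload the core, then run warm-up queries; True only if all succeed."""
    if not fetch(reload_url(base_url, core)):
        return False
    return all(fetch(warm_url(base_url, core, q)) for q in warm_queries)
```

The load balancer would be switched only when `prepare_standby(...)` returns `True`. The `fetch` parameter exists so the sequence can be tested without a running Solr.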
In general, my experience with SOLR has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.
I am implementing a small database like MySQL. It's part of a larger project.
Right now I have designed the core database, by which I mean I have implemented a parser and can execute some basic SQL queries on my database. It can store, update, delete and retrieve data from files. So far that's fine. However, I want to make it available over the network.
I want more than one user to be able to access my database server and execute queries on it at the same time. I am working under Linux, so portability is not an issue right now.
I know I need to use sockets, which is fine. I also know that I need something like a thread pool, where I create a maximum number of threads up front and then, for each client request, wake up a thread and assign it to the client.
What I can't figure out is how all this is actually going to fit together. Where should I implement multithreading: on the client side or the server side? How should my parser be set up to take input from each of the clients separately (mostly via files, I think)?
If anyone has an idea about how I can implement this, please tell me, because I am stuck at this point in the project.
Thanks.. :)
If you haven't already, take a look at Beej's Guide to Network Programming to get your hands dirty in some socket programming.
Next I would take his example of a stream client and server and just use that as a single-threaded query system. Once you have got this down, you'll need to choose whether you're going to actually use threads or use select(). My gut says your on-disk database doesn't yet support parallel writes (maybe reads), so a single server thread servicing requests is likely your best bet for starters!
In the multiple client model, you could use a simple per-socket hashtable of client information and return any results immediately when you process their query. Once you get into threading with the networking and db queries, it can get pretty complicated. So work up from the single client, add polling for multiple clients, and then start reading up on and tackling threaded (probably with pthreads) client-server models.
Server side, as the server is the only party that understands the data. You need to design locks or come up with your own model to make sure that modifications don't affect the clients currently being served.
As an alternative to multithreading, you might consider an event-based, single-threaded approach (e.g. using poll or epoll). An example of a very fast (non-SQL) database which uses exactly this approach is Redis.
This design has two obvious disadvantages: you only ever use a single CPU core, and a lengthy query will block other clients for a noticeable time. However, if queries are reasonably fast, nobody will notice.
On the other hand, the single thread design has the advantage of automatically serializing requests. There are no ambiguities, no locking needs. No write can come in between a read (or another write), it just can't happen.
If you don't have something like a robust, working MVCC built into your database (or are at least working on it), knowing that you need not worry can be a huge advantage. Concurrent reads are not so much an issue, but concurrent reads and writes are.
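For illustration, here is a minimal sketch of such an event-based, single-threaded server, written in Python with the standard `selectors` module (which picks epoll on Linux). The class name, the newline-terminated protocol and the `handle_query` callback are assumptions standing in for your parser and query engine.

```python
import selectors
import socket


class SingleThreadServer:
    """One thread, one event loop; queries run strictly one after another."""

    def __init__(self, handle_query, host="127.0.0.1", port=0):
        self.handle_query = handle_query  # hypothetical: str query -> str reply
        self.sel = selectors.DefaultSelector()
        self.listener = socket.create_server((host, port))
        self.sel.register(self.listener, selectors.EVENT_READ)
        self.buffers = {}  # per-socket table of bytes not yet forming a line

    @property
    def port(self):
        return self.listener.getsockname()[1]

    def step(self, timeout=None):
        """Wait for ready sockets and service each of them once."""
        for key, _events in self.sel.select(timeout):
            if key.fileobj is self.listener:
                conn, _addr = self.listener.accept()
                self.sel.register(conn, selectors.EVENT_READ)
                self.buffers[conn] = b""
            else:
                self._read(key.fileobj)

    def _read(self, conn):
        try:
            data = conn.recv(4096)  # will not block: the socket is readable
        except ConnectionError:
            data = b""
        if not data:  # client went away
            self.sel.unregister(conn)
            conn.close()
            del self.buffers[conn]
            return
        parts = (self.buffers[conn] + data).split(b"\n")
        self.buffers[conn] = parts.pop()  # keep the incomplete tail
        for line in parts:
            reply = self.handle_query(line.decode(errors="replace"))
            # sendall may block on a slow client; a real server would
            # buffer output and wait for EVENT_WRITE instead.
            conn.sendall(reply.encode() + b"\n")

    def close(self):
        for key in list(self.sel.get_map().values()):
            self.sel.unregister(key.fileobj)
            key.fileobj.close()
        self.sel.close()
```

A main loop would simply call `server.step()` forever. Because `handle_query` is only ever called from that loop, two queries can never overlap, which is the serialization property described above.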
Alternatively, you might consider doing the input/output and syntax checking in one thread, and running the actual queries in another (query passed via a queue). That, too, will remove the synchronisation woes, and it will at least offer some latency hiding and some multi-core.
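The two-thread split can be sketched with Python's `queue` and `threading` modules. The I/O thread would call `submit` after checking syntax, and a single worker thread runs the queries in arrival order. The `QueryWorker` name and the `execute` callback are hypothetical stand-ins for the actual query engine.

```python
import queue
import threading


class QueryWorker:
    """Runs queries one at a time on a dedicated thread, in arrival order."""

    def __init__(self, execute):
        self._execute = execute  # hypothetical: runs one already-parsed query
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql):
        """Called from the I/O thread; returns a queue that will hold the result."""
        reply = queue.Queue(maxsize=1)
        self._jobs.put((sql, reply))
        return reply

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel pushed by stop()
                break
            sql, reply = job
            try:
                reply.put(("ok", self._execute(sql)))
            except Exception as exc:
                # Report the failure to the client rather than killing the worker.
                reply.put(("error", str(exc)))

    def stop(self):
        """Let queued jobs finish, then shut the worker thread down."""
        self._jobs.put(None)
        self._thread.join()
```

Only the worker thread ever touches the data files, so no locks are needed around them, while the I/O thread stays free to read and validate the next request.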