What are my best options for logging 3k events per second from a C program? The following options come to mind, but I can't decide which would be the most robust solution, with the fewest failure points, the highest reliability and the lowest latency.
Use a messaging server to relay events as they happen
Use syslog for logging
Use Unix pipe
Use a logging agent like fluent, which will send events to an analysis server
Write a log file locally and periodically ship it to the analysis server using something like rsync
Try syslog. No reason to make it too complicated. With syslog-ng you can do local logging through UDP, then set up the local syslogd to forward everything through TCP to a central syslog server. You might need to run without fsync on the central syslog server to keep up with that load (test first); the added risk of losing data can be mitigated by forwarding everything to two separate machines. This gives you asynchronous performance locally and enough reliability that you should almost never lose events.
Another option I've used is to log events into Redis, Riak or some other NoSQL data store (I usually don't recommend them for anything complex, but event logging is right up their alley). Set up mirroring for redundancy and they should be able to keep up with far more than 3k events per second.
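A minimal sketch of the syslog route, written in Python for brevity (from C the equivalent calls are openlog() and syslog() from <syslog.h>). It assumes a local syslog daemon listening on UDP 127.0.0.1:514; the logger name "events", the "myapp" tag, the LOCAL0 facility and the key=value layout are illustrative choices, not anything taken from the question.

```python
import logging
import logging.handlers


def format_event(event_type, **fields):
    """Render one event as a single key=value line (hypothetical layout)."""
    parts = [f"type={event_type}"]
    parts += [f"{key}={fields[key]}" for key in sorted(fields)]
    return " ".join(parts)


def make_event_logger(host="127.0.0.1", port=514):
    """Return a logger whose records go to a syslog daemon over UDP.

    Assumes a daemon is listening on host:port. A UDP send does not wait
    for the daemon to process the message, so the caller stays fast.
    """
    logger = logging.getLogger("events")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on reuse
        handler = logging.handlers.SysLogHandler(
            address=(host, port),
            facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
        )
        handler.setFormatter(logging.Formatter("myapp: %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling make_event_logger().info(format_event("login", user="alice")) would hand one datagram to the local daemon. Forwarding to the central server over TCP is then configured in the daemon, not in the application.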
I've just started using Nagios to monitor a group of broadcast transmitters. Each transmitter is defined as a host, and each aspect of the transmitter I wish to monitor (RF forward, RF reflected, power supply voltages, etc) is defined as a service. In doing so, I can get an alarm if any of these aspects are out of tolerance, and can use the performance data to graph each aspect (using pnp4nagios, in this case).
To check the transmitters' telemetry data, I wrote some scripts, one for each make/model of transmitter involved, to handle that model's particular interface. In keeping with the way I've seen other Nagios checks work, an argument to the script selects which aspect you want reported.
At first I was content with this. It worked like any more-traditional use of Nagios I'd encountered. But then I hit a snag.
Because each service check is scheduled individually, diagnosing an alarm condition can be tricky, since the various services aren't all being checked at the same time - and therefore the set of values I'm looking at is unlikely to be time-aligned. If all the service check values were from the same moment in time, it would be easier to detect correlations (since the set of values would essentially be a snapshot).
My first thought would be to deal with this by running a single instance of a single command, which would return values for multiple services. This would also seem far more efficient than opening as many connection instances as there are services to be checked. From a scripting perspective, this is easily done. But from a Nagios config perspective, I don't know how (or if?) you'd do that.
I know I could also divorce the data collection from the Nagios check, caching the telemetry values all at once periodically, and feeding Nagios values from the cache. But I don't want to introduce added delays if I can help it.
Thoughts?
My first thought would be to deal with this by running a single instance of a single command, which would return values for multiple services. This would also seem far more efficient than opening as many connection instances as there are services to be checked. From a scripting perspective, this is easily done. But from a Nagios config perspective, I don't know how (or if?) you'd do that.
There's nothing strange about this from a Nagios perspective, because what you're essentially doing is writing your own plugin, and plugins can be as general or specific as you want them to be.
When writing your own plugin, it's good to remember:
Your script is responsible for all failures, so make sure you handle garbage responses, failed connections and whatever other errors you predict may happen in the plugin itself, and exit with appropriate error levels.
Since you may encounter errors you didn't expect, it probably makes sense to have the plugin write what it's doing to a log file, as well as what responses it got.
The plugin must use exit codes to alert Nagios correctly. If you want performance data, it needs to be given in the correct syntax. See the development guidelines.
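The points above could be sketched roughly as follows. This is an illustration under assumptions, not the asker's script: the labels, the thresholds, the "TX" prefix and the read_telemetry callable are all hypothetical, and only upper limits are checked to keep it short. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "text | label=value;warn;crit" performance-data layout follow the plugin development guidelines.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(readings, limits):
    """Return (exit_code, output_line) for one snapshot of readings.

    readings: {label: value} collected in a single pass.
    limits:   {label: (warn, crit)} upper thresholds (hypothetical values).
    """
    missing = sorted(label for label in limits if label not in readings)
    if missing:
        return UNKNOWN, "TX UNKNOWN: no reading for " + ", ".join(missing)

    worst, problems, perfdata = OK, [], []
    for label in sorted(limits):
        warn, crit = limits[label]
        value = readings[label]
        state = CRITICAL if value >= crit else WARNING if value >= warn else OK
        if state != OK:
            problems.append(f"{label}={value}")
        worst = max(worst, state)
        perfdata.append(f"{label}={value};{warn};{crit}")

    summary = ", ".join(problems) if problems else "all values in tolerance"
    return worst, f"TX {STATUS_NAMES[worst]}: {summary} | {' '.join(perfdata)}"


def main(read_telemetry, limits):
    """Run the check once and exit with the code Nagios expects."""
    try:
        readings = read_telemetry()  # hypothetical: one connection, all values
    except Exception as exc:  # garbage responses, failed connections, ...
        print(f"TX UNKNOWN: {exc}")
        sys.exit(UNKNOWN)
    code, line = evaluate(readings, limits)
    print(line)
    sys.exit(code)
```

Because every value comes from the same read_telemetry() call, the performance data in one result line is a time-aligned snapshot.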
I'm considering submitting the service data passively. It would solve all the problems I mentioned. But it would create a few minor new ones - now there's external processes to keep running, and it's a little outside the mainstream way of doing things (might put a future admin through a little pain to figure out how it works).
I don't think this is a better solution than writing your own plugin, unless the data is coming from nodes actively pushing it out.
For example, in an IoT context, the nodes you are monitoring may actually be sending passive check results directly to the Nagios instance. In that setting, passive checks make sense, because you just want to take whatever someone else gives you and take action if no results come in (freshness).
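For reference, a passive submission might look roughly like the sketch below. It builds PROCESS_SERVICE_CHECK_RESULT external commands, one per service, all stamped with the same time. The host and service names are hypothetical, and the command-file path is only a common default; it depends on how Nagios was installed.

```python
import time


def passive_result_lines(host, results, now=None):
    """Build one external-command line per service from a single snapshot.

    results: {service_description: (return_code, plugin_output)}
    """
    stamp = int(time.time() if now is None else now)
    return [
        f"[{stamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"
        for service, (code, output) in sorted(results.items())
    ]


def submit(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the lines to the Nagios command pipe (path is an assumption)."""
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")
```

The service descriptions must match the ones defined in the Nagios configuration, and those services need passive checks enabled.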
In your case, it sounds like writing your own script would take care of both the timing issue and whatever additional logic you want in your script. As far as Nagios is concerned, it only needs to run the script on a schedule, watch the exit codes, and act as configured if the check fails.
I am looking for something to use as a simple service registry and am considering etcd. For this use-case availability is more important than consistency. Clients must be able to read/write keys to any of the nodes even when the cluster is split. Can etcd be used in this way? It doesn't matter if some of the writes are lost when things come back together as they will be quickly updated by service "I am alive" heartbeat timers.
I'm also new to etcd. What I have noticed is that when a network partition happens, reads still work on the nodes that are not in the main quorum, but those nodes may return inconsistent data.
As for writes, they fail with "Raft internal error".
I'm writing a multiprocess server in C and I'm wondering what the best tools are to debug and test my programs. Specifically, I want to see what is being sent to the client and vice versa.
Thank you for your help.
Every process should write a log. It's not exactly a debugging tool like gdb, but it is very useful.
Every log entry should contain a precise timestamp, the process ID, and the socket data. You can write the log to file(s), to a database, or maybe to a log server. Logging to a database (e.g. SQLite) is useful because it's easy to filter the log for a specific time range, a specific client connection, etc. It's also easy to merge the logs of different processes (SQLite: ATTACH DATABASE). On Linux, I would consider using syslog.
Specify different logging levels. Detailed logging helps to debug your code in the development phase. Basic logging will help you track down rare errors that emerge in the long term. Make sure you can turn logging on and off and set logging levels easily, without shutting down your server.
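A small sketch of the SQLite variant, in Python rather than C to keep it short (the C API would use sqlite3_open, sqlite3_prepare_v2 and friends). The table layout, the level names and the SocketLog class are assumptions made for illustration.

```python
import os
import sqlite3
import time

LEVELS = {"DEBUG": 10, "INFO": 20, "ERROR": 40}


class SocketLog:
    """Per-process log of socket traffic stored in SQLite (hypothetical schema)."""

    def __init__(self, path, min_level="INFO"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS log ("
            "ts REAL, pid INTEGER, level TEXT, peer TEXT, direction TEXT, data BLOB)"
        )
        self.min_level = min_level  # can be changed while the server runs

    def write(self, level, peer, direction, data):
        """Store one entry; return False if it is below the current level."""
        if LEVELS[level] < LEVELS[self.min_level]:
            return False
        self.db.execute(
            "INSERT INTO log VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), os.getpid(), level, peer, direction, data),
        )
        self.db.commit()
        return True

    def for_peer(self, peer):
        """Return (direction, data) for one client connection, oldest first."""
        rows = self.db.execute(
            "SELECT direction, data FROM log WHERE peer = ? ORDER BY ts, rowid",
            (peer,),
        )
        return rows.fetchall()
```

With one database file per process, the logs can later be combined for a query with ATTACH DATABASE, as mentioned above.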
I'm deploying the Apache Solr web app in two Tomcat 6 servers to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can replicate the data/index to a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery and test that those backups work. An index could conceivably become corrupted, as there are no checksums happening internally in SOLR/Lucene. An index could also get wiped, or some records could get deleted and merged away without you knowing it; backups can be useful for recovering those records/docs at a later time if you need to perform an investigation.
Before you re-route traffic to the second instance I would run some queries to load caches and also to test and confirm the current index works before it goes online.
Isolate the updates to one location, process and thread to ensure transactional integrity in the event of a cutover. Consistency could otherwise be difficult to manage, as SOLR does not use a vector clock to synchronize updates like some databases do. I personally would keep a copy of all updates, in order, separately from SOLR in some other store, just in case a small time window needs to be replayed.
In general, my experience with SOLR has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
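The reload-then-verify step before a cutover could be scripted along these lines. This is only a sketch: the base URL, the core name and the warm-up queries are placeholders, and it assumes a multi-core setup where the CoreAdmin handler is reachable at <base>/admin/cores and each core answers at <base>/<core>/select with wt=json.

```python
import json
import urllib.parse
import urllib.request


def reload_url(base, core):
    """CoreAdmin request asking the instance to reload one core."""
    return f"{base}/admin/cores?" + urllib.parse.urlencode(
        {"action": "RELOAD", "core": core}
    )


def select_url(base, core, query):
    """A cheap query (rows=0) used to exercise the reloaded index."""
    return f"{base}/{core}/select?" + urllib.parse.urlencode(
        {"q": query, "rows": 0, "wt": "json"}
    )


def http_get(url, timeout=30):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def prepare_standby(base, core, warm_queries, fetch=http_get):
    """Reload the core, then run warm-up queries; True if all succeed."""
    fetch(reload_url(base, core))
    for query in warm_queries:
        body = json.loads(fetch(select_url(base, core, query)))
        if body["responseHeader"]["status"] != 0:
            return False
    return True
```

The load balancer would only be switched over once prepare_standby() returns True. Passing fetch as a parameter makes the sequence testable without a running Solr.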
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.
I'm building a mobile application in VB.NET (compact framework), and I'm wondering what the best way is to approach potential offline interactions on the device. Basically, the devices have cellular and 802.11, but may still be offline (where there's poor reception, etc). A driver will scan boxes as they leave his truck, and I want to update the new location - immediately if there's network signal, or queued and handled later if it's offline. It made me think, though, about how to handle offline-ness in general.
Do I cache as much data on the device as I can so that it can be used when the device is offline - essentially, each device would have a copy of the (relevant) production data on it? Or is it better to disable certain functionality when it's offline, so as to avoid the headache of synchronization later? I know this is a pretty specific question that depends on my app, but I'm curious to see if others have taken this route.
Do I build the application itself to act as though it's always offline, submitting everything to a local queue of sorts that's owned by a local class (essentially abstracting away the online/offline thing), and then have the class submit things to the server as it can? What about data lookups - how can those be handled in a "Semi-live" fashion?
Or should I have the application attempt to submit requests to the server directly, in real time, and handle it if the request itself fails? I can see a potential problem of making the user wait for the timeout, but is this the most reliable way to do it?
I'm not looking for a specific solution, but really just stories of how developers accomplish this with the smoothest user experience possible, with a link to a how-to or heres-what-to-consider or something like that. Thanks for your pointers on this!
We can't give you a definitive answer because there is no "right" answer that fits all usage scenarios. For example if you're using SQL Server on the back end and SQL CE locally, you could always set up merge replication and have the data engine handle all of this for you. That's pretty clean. Using the offline application block might solve it. Using store and forward might be an option.
You could store locally and then roll your own synchronization with a direct connection, web service or WCF service used when a network is detected. You could use MSMQ for delivery.
What you have to think about is not what the "right" way is, but how your implementation will affect application usability. If you disable features due to lack of connectivity, is the app still usable? If you have stale data, is that a problem? Maybe some critical data needs to be transferred when you have GSM/GPRS (which typically isn't free) and more would be done when you have 802.11. Maybe you can run all day with lookup tables pulled down in the morning and upload only transactions, with the device tracking what changes it's made.
Basically it really depends on how it's used, the nature of the data, the importance of data transactions between fielded devices, the effect of data latency, and probably other factors I can't think of offhand.
So the first step is to determine how the app needs to be used, then determine the infrastructure and architecture to provide the connectivity and data access required.
I haven't used it myself, but have you looked into the "store and forward" capabilities of the CF? It may suit your needs. I believe it uses an Exchange mailbox as a message queue to send SOAP packets to and from the device.
The best way to approach this is to always work offline, then use message queues to handle sending changes to and from the device. When the driver marks something as delivered, for example, update the item as delivered in your local store and also place a message in an outgoing queue to tell the server it's been delivered. When the connection is up, send any queued items back to the server and get any messages that have been queued up from the server.
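The "always work offline" pattern described above might be sketched like this. It is written in Python purely as an illustration, not as VB.NET compact framework code; the Outbox class, the message fields and the send callable are hypothetical, and a real device would keep both the local store and the queue in durable storage rather than in memory.

```python
from collections import deque


class Outbox:
    """Local-first updates with an outgoing queue drained when online."""

    def __init__(self, send):
        self.send = send        # callable that raises OSError when offline
        self.local = {}         # stand-in for the on-device data store
        self.pending = deque()  # messages not yet accepted by the server

    def mark_delivered(self, box_id, location):
        """Update the local store first, then queue the change for the server."""
        self.local[box_id] = location
        self.pending.append({"box": box_id, "location": location})

    def flush(self):
        """Send queued messages in order; stop at the first failure."""
        sent = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except OSError:
                break  # still offline: keep the message for the next attempt
            self.pending.popleft()
            sent += 1
        return sent
```

The UI only ever calls mark_delivered(), so the driver never waits for a network timeout; flush() can run on a timer or whenever connectivity is detected. A message is removed from the queue only after send() returns, so a failure midway may resend it later, which means the server side should tolerate duplicates.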