"Optimum" sampling interval of sensor system - sampling

I am working with a system, S1, where many remote devices report to a central server at a constant frequency, f. Each device reports asynchronously from the rest of the system.
The (complete) state of S1 can be queried via a request-response API.
Is there an 'optimal' frequency for another system, S2, to query S1 that balances resource consumption and concurrency between S1 and S2?
A naive reading of Nyquist-Shannon leads me to 0.5f. Is there a better alternative?

You are best off sampling at f. Unless you are really concerned about the frequency response of your system (and if you are, then you also need to be really concerned about phase response, and asynchronous reporting means that you are not), you don't want to change sampling rates in the system. It is best to use the same sample rate throughout, rather than resampling.
If you resample f to 0.5f, then you need to properly implement a new low-pass filter (LPF) on the data to avoid aliasing, and that doesn't sound like what you want to do.
If your data changes very slowly compared to f, then you should probably reduce f itself if you want to reduce resource usage (i.e. broadcast or battery power).
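A minimal sketch of what sampling at f might look like on the S2 side, assuming a hypothetical request-response endpoint for S1's complete state (the URL, the handler, and the interval value here are made up for illustration):

    import json
    import time
    import urllib.request

    S1_STATE_URL = "http://s1.example.com/state"   # hypothetical endpoint
    REPORT_INTERVAL = 1.0                          # seconds, i.e. 1/f

    def poll_s1_forever(handle_state):
        """Query S1's complete state once per device-report interval (rate f)."""
        next_tick = time.monotonic()
        while True:
            with urllib.request.urlopen(S1_STATE_URL) as resp:
                handle_state(json.load(resp))
            # Schedule the next poll relative to the previous tick so that
            # drift does not accumulate.
            next_tick += REPORT_INTERVAL
            time.sleep(max(0.0, next_tick - time.monotonic()))

    # poll_s1_forever(lambda state: print(state))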

Related

Library design methodology

I want to make a "TRAP AGENT" library. The trap agent library keeps track of various parameters of the client system. If a parameter of the client system rises above a threshold, the trap agent library on the client side notifies the server about that parameter. For example, if CPU usage exceeds the threshold, it will notify the server that CPU usage has been exceeded. I have to measure 50-100 parameters (like memory usage, network usage, etc.) on the client side.
Now I have the basic idea about the design, but I am stuck on the overall library design.
I have thought of the following solutions:
I can create a thread for each parameter (i.e. each thread will monitor a single parameter).
I can create a process for each parameter (i.e. each process will monitor a single parameter).
I can classify the parameters into groups, e.g. the data usage parameter falls into the network group and the CPU and memory usage parameters fall into the system group, and then create a thread for each group.
Now the 1st solution looks better than the 2nd. But if I adopt the 1st solution, it may break down when I upgrade my library from 100 to 1000 parameters, because I would have to create 1000 threads, which is not good design (I think so; if I am wrong, correct me).
The 3rd solution is good, but response time will be high, since many parameters are monitored in a single thread.
Is there any better approach?
In general, it's a bad idea to spawn threads 1-to-1 for any logical mapping in your code. You can quickly exhaust the available threads of the system.
In .NET this is very elegantly handled using thread pools:
Thread vs ThreadPool
Here is a C++ discussion, but the concept is the same:
Thread pooling in C++11
Processes are also high overhead on Windows. Both designs sound like they would ironically be quite taxing on the very resources you are trying to monitor.
Threads (and processes) give you parallelism where you need it. For example, letting the GUI be responsive while some background task is running. But if you are just monitoring in the background and reporting to a server, why require so much parallelism?
You could just run each check, one after the other, in a tight event loop in one single thread. If you are worried about not sampling the values as often, I'd say that's actually a benefit. It does no help to consume 50% CPU to monitor your CPU. If you are spot-checking values once every few seconds that is probably fine resolution.
In fact, high resolution is of no help if you are reporting to a server. You don't want to denial-of-service-attack your server by making an HTTP call to it multiple times a second once some value trips a threshold.
NOTE: this doesn't mean you can't have a pluggable architecture. You could create some base class that represents checking a resource and then create subclasses for each specific type. Your event loop could iterate over an array or list of objects, calling each one successively and aggregating the results. At the end of the loop you report back to the server if any are out of range.
You may want to add logic to stop checking (or at least stop reporting back to the server) for some "cool down period" once a trap hits. You don't want to tax your server or spam your logs.
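As a rough sketch of that pluggable, single-threaded event loop with a cool-down period (the class names, thresholds, check interval, and the read_cpu_usage_percent and report_to_server helpers are all made up for illustration):

    import time
    from abc import ABC, abstractmethod

    class ResourceCheck(ABC):
        """Base class representing one monitored parameter."""
        cooldown_seconds = 60.0          # suppress repeat reports for a while

        def __init__(self):
            self._last_reported = float("-inf")

        @abstractmethod
        def sample(self):
            """Return (name, value, out_of_range) for this parameter."""

        def should_report(self, now):
            return now - self._last_reported >= self.cooldown_seconds

        def mark_reported(self, now):
            self._last_reported = now

    class CpuUsageCheck(ResourceCheck):
        def sample(self):
            usage = read_cpu_usage_percent()          # hypothetical helper
            return ("cpu_usage", usage, usage > 90.0)

    def monitor_loop(checks, report_to_server, interval=5.0):
        """Run every check in turn, then report any out-of-range values."""
        while True:
            now = time.monotonic()
            alerts = []
            for check in checks:
                name, value, out_of_range = check.sample()
                if out_of_range and check.should_report(now):
                    alerts.append((name, value))
                    check.mark_reported(now)
            if alerts:
                report_to_server(alerts)              # one call per loop pass
            time.sleep(interval)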
You can follow one of the approaches below:
1. Use two threads: one thread is dedicated to measuring the emergency parameters and the second thread monitors the non-emergency parameters. The response time for emergency parameters will then be low.
2. Use three threads: the first thread monitors the high-priority (emergency) parameters, the second monitors the intermediate-priority parameters, and the last monitors the lowest-priority parameters. Overall response time is improved compared to the first option (see the sketch after this list).
3. If response time is not a concern, you can monitor all the parameters in a single thread. But in this case response time becomes worst of all when you upgrade your library from 100 to 1000 parameters.
So in the 1st case there is a longer response time for non-emergency parameters, while in the 3rd case response time is definitely very high. Solution 2 is therefore the better choice.
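A minimal sketch of the priority-grouped variant (option 2), assuming each priority group is polled by its own thread at its own interval; the group names, intervals, and check callables are illustrative only:

    import threading
    import time

    def run_group(name, checks, interval, report):
        """Poll one priority group of checks at its own interval."""
        while True:
            for check in checks:
                value, out_of_range = check()
                if out_of_range:
                    report(name, value)
            time.sleep(interval)

    def start_priority_groups(groups, report):
        """groups: list of (group_name, [check callables], poll_interval_seconds)."""
        for name, checks, interval in groups:
            threading.Thread(target=run_group,
                             args=(name, checks, interval, report),
                             daemon=True).start()

    # Example wiring: emergency checks every second, low priority once a minute.
    # start_priority_groups(
    #     [("emergency", emergency_checks, 1.0),
    #      ("intermediate", intermediate_checks, 10.0),
    #      ("low", low_priority_checks, 60.0)],
    #     report=lambda name, value: print(name, value))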

Data compression techniques for power plant data

I have recently been studying data management on my own. After some reading, I still don't have the whole picture of how data flows from data acquisition into a database or warehouse.
In a power plant I have 1000 sensors installed, so I want to know what happens before the data is stored in the database. For instance, sensor data is sampled at 1 Hz, then with this big amount of data we need to do data compression, then send it to the database, I guess... So I want to know how all of this is done, especially the data compression: if the data are digital values with a time stamp, what kinds of compression techniques can be used, and how is data compressed in a Big Data setting?
The way OSIsoft PI does this is by checking how much a collected point has deviated from the previously stored point. If the deviation is small, the point gets "dropped", so only meaningful data is stored. When you ask for a value at a time for which no data exists, PI interpolates it.
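A toy sketch of that idea: a simple deadband filter on ingest plus linear interpolation on read. PI's real compression is more sophisticated than this, so treat it only as an illustration of the principle:

    def deadband_compress(samples, deviation):
        """Keep only samples that differ from the last stored value by more
        than `deviation`. samples: list of (timestamp, value) pairs."""
        stored = []
        for t, v in samples:
            if not stored or abs(v - stored[-1][1]) > deviation:
                stored.append((t, v))
        return stored

    def interpolate(stored, t):
        """Linearly interpolate a value at time t from the stored points."""
        for (t0, v0), (t1, v1) in zip(stored, stored[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        raise ValueError("t is outside the stored range")

    # Example: 1 Hz readings that barely move collapse to two stored points.
    raw = [(0, 10.00), (1, 10.01), (2, 10.02), (3, 12.50), (4, 12.49)]
    kept = deadband_compress(raw, deviation=0.1)   # [(0, 10.0), (3, 12.5)]
    print(interpolate(kept, 1.5))                  # 11.25, reconstructed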
Data can be compressed in many ways, from zipping it up to totally customised solutions. In fact, for power plant data like you are looking at, one of the larger systems is PI from OSIsoft. I used to work for a company that used it for 8 power stations. They have a totally bespoke database system where they store all their measurements. It is apparently optimised so that frequent readings from a sensor take up little space, and missing readings don't increase the space taken much. How they do it I have no idea - I expect it is proprietary and they won't tell people.
However, how data flows from sensor to database can be complex. Have a poke around the OSIsoft site - they have some material available.

If we make a number every millisecond, how much data would we have in a day?

I'm a bit confused here... I've been offered a place on a project that would involve an array of sensors, each giving off a reading every millisecond (yes, 1000 readings per second). A reading would be a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I have never worked with such big amounts of data. What do you think, how many MBs would the readings from one sensor amount to over a day? ... 4 (digits) x 1000 x 60 x 60 x 24 = 345,600,000 bits ... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold 23 days of info from 1 sensor, correct?
I understand that MySQL & PHP probably would not be able to handle it... what would you suggest, maybe some apps? Azure? Oracle?
A 3 or 4 digit number takes:
4 bytes if you store it as a string,
2 bytes if you store it as a 16-bit (0-65535) integer.
1000/sec -> 60,000/minute -> 3,600,000/hour -> 86,400,000/day
As string: 86,400,000 * 4 bytes = ~329 megabytes/day
As integer: 86,400,000 * 2 bytes = ~165 megabytes/day
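The same arithmetic as a quick sanity check (sizes in binary megabytes):

    samples_per_day = 1000 * 60 * 60 * 24          # 86,400,000 readings
    print(samples_per_day * 4 / 2**20)             # ~329.6 MB/day as 4-byte strings
    print(samples_per_day * 2 / 2**20)             # ~164.8 MB/day as 16-bit integers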
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database, and do an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
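A minimal sketch of the staging-plus-archive approach described above, using SQLite purely for illustration (the table names and schema are made up):

    import sqlite3

    def setup(conn):
        conn.execute("CREATE TABLE IF NOT EXISTS staging (ts INTEGER, value INTEGER)")
        conn.execute("CREATE TABLE IF NOT EXISTS archive (ts INTEGER, value INTEGER)")
        conn.commit()

    def insert_reading(conn, ts, value):
        # Cheap, frequent inserts go to the small staging table.
        conn.execute("INSERT INTO staging VALUES (?, ?)", (ts, value))
        conn.commit()

    def hourly_rollover(conn):
        # Run once an hour: bulk-copy staging into the archive, then clear it.
        with conn:
            conn.execute("INSERT INTO archive SELECT ts, value FROM staging")
            conn.execute("DELETE FROM staging")

    conn = sqlite3.connect("sensors.db")
    setup(conn)
    insert_reading(conn, 0, 818)
    hourly_rollover(conn)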
Most likely you will not need to keep data at such high resolution for a long time. You can use several options to minimize the volume. First, after some period of time you can collapse the detailed data into hourly min/max/avg values; you keep detailed info only for unstable situations that were detected, or for situations that by definition require detailed data. Also, many things can be turned into event logging. These approaches were implemented and successfully used a couple of decades ago in some industrial automation systems provided by the company I was working for at the time, when the available storage devices were many times smaller than what you can find today.
So, first, analyse the data you will be storing and then decide how to optimize its storage.
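A sketch of that hourly min/max/avg collapse, assuming readings arrive as (unix_timestamp, value) pairs:

    from collections import defaultdict

    def collapse_hourly(samples):
        """Reduce (timestamp, value) pairs to one (hour_start, min, max, avg) row per hour."""
        buckets = defaultdict(list)
        for ts, value in samples:
            buckets[ts // 3600].append(value)
        return [(hour * 3600, min(vs), max(vs), sum(vs) / len(vs))
                for hour, vs in sorted(buckets.items())]

    # Example: three readings in the same hour collapse to one summary row.
    print(collapse_hourly([(10, 818), (600, 1529), (1200, 901)]))
    # [(0, 818, 1529, 1082.666...)]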
Following MarcB's numbers, 2 bytes at 1 kHz is just 2 KB/s, or 16 kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach should be to construct a queue of sensor readings which the database can simply pop until it is clear. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1ms is not long to return, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
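A rough sketch of that queue arrangement (the sensor read and the bulk database insert are left as placeholders): the sensor side only ever appends, and a separate writer thread drains the queue in batches.

    import queue
    import threading

    readings = queue.Queue()                # cheap O(1) put from the sensor side

    def sensor_loop(read_sensor):
        while True:
            readings.put(read_sensor())     # must return well within 1 ms

    def writer_loop(bulk_insert, batch_size=1000):
        while True:
            batch = [readings.get()]                    # block for the first item
            while len(batch) < batch_size:
                try:
                    batch.append(readings.get_nowait()) # drain whatever is queued
                except queue.Empty:
                    break
            bulk_insert(batch)                          # one bulk write per batch

    # threading.Thread(target=sensor_loop, args=(my_read_sensor,), daemon=True).start()
    # threading.Thread(target=writer_loop, args=(my_bulk_insert,), daemon=True).start()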
If you do not need a relational database you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.

A.I.: How would I train a Neural Network across multiple machines?

So, for larger networks with large data sets, they take a while to train. It would be awesome if there was a way to share the computing time across multiple machines. However, the issue with that is that when a neural network is training, the weights are constantly being altered every iteration, and each iteration is more or less based on the last -- which makes the idea of distributed computing at the very least a challenge.
I've thought that for each portion of the network, the server could send maybe 1000 sets of data to train on... but... you'd have roughly the same computing time, as I wouldn't be able to train on different sets of data simultaneously (which is what I want to do).
But even if I could split up the network's training into blocks of different data sets to train on, how would I know when I'm done with that set of data? Especially if the amount of data sent to the client machine isn't enough to achieve the desired error?
I welcome all ideas.
Quoting http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation:
When multicore computers are used multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.
The training data is broken up into equally large batches for each of the threads. Each thread executes the forward and backward propagations. The weight and threshold deltas are summed for each of the threads. At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network.
which is essentially what other answers here describe.
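A highly simplified sketch of that batched, data-parallel scheme for a single weight vector (the gradient function and data are placeholders, and a real ANN would have one such array per layer): each worker computes its deltas against the same starting weights, the deltas are summed, and only then are the weights updated.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def batch_epoch(weights, data_chunks, gradient_fn, learning_rate=0.01):
        """One epoch of batch training spread across several workers."""
        with ThreadPoolExecutor(max_workers=len(data_chunks)) as pool:
            deltas = list(pool.map(lambda chunk: gradient_fn(weights, chunk),
                                   data_chunks))
        # Apply the summed deltas once, after all workers have finished.
        return weights - learning_rate * np.sum(deltas, axis=0)

    # Toy usage: a least-squares gradient on random chunks (purely illustrative).
    def toy_gradient(w, chunk):
        X, y = chunk
        return X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    w = np.zeros(3)
    chunks = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
    for _ in range(10):
        w = batch_epoch(w, chunks, toy_gradient)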
Depending on your ANN model you can exploit some parallelism across multiple machines by running the same model with the same training and validation data on each machine, but with different ANN properties for the different runs: initial values, ANN parameters, noise, etc.
I used to do this a lot to make sure I'd explored the problem space effectively and wasn't stuck in local minima. This is a very easy way to take advantage of multiple machines without having to recode your algorithm. Just another approach you might want to consider.
My assumption is that you have more than one training set, and that you have a gold standard. Also, I assume you have some way of storing the state of the neural network (whether it's a list of probability weights for each node, or something along those lines).
Using as many compute nodes in a cluster as you can, launch the program on a data set on each node. Save the results for each, and test against the gold standard. Whichever neural network state performs best, set it as the input for the next round of training. Repeat as often as you see fit.
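A sketch of that train-many, keep-the-best loop, with the training and scoring functions left as placeholders (the names train_fn, score_fn, and the commented usage are made up for illustration):

    from concurrent.futures import ProcessPoolExecutor

    def best_of_round(train_fn, score_fn, start_state, data_sets, gold_standard):
        """Train one copy of the network per data set (one per worker), score each
        result against the gold standard, and return the best resulting state."""
        with ProcessPoolExecutor() as pool:
            states = list(pool.map(train_fn,
                                   [start_state] * len(data_sets),
                                   data_sets))
        return max(states, key=lambda state: score_fn(state, gold_standard))

    # state = initial_network_state
    # for _ in range(number_of_rounds):
    #     state = best_of_round(train_network, accuracy, state, data_sets, gold)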
If I understand correctly, you're trying to figure out a way to train an ANN on a cluster of machines? As you stated, partitioning the network isn't the right approach, and as far as I know, is seemingly unfeasible for most models. A possible approach might be to partition the training sets and run local copies of your network, and then merge the results. An intuitive way to do this and gain some validation along the way would be with cross-validation. As you stated, knowing when the network has had the right amount of training is a problem, but that variability is a problem inherent to neural nets in general, not in parallelizing the work.
As you also stated, the updates that happen during each iteration of training are dependent on the current state of the weights, but without mixing up training sets/validation, you're likely overfitting. This is why CV is nice, because your training sets will all get a chance to play a role in the training, and the validating, across multiple samples.
If you do batch training, the weights are only altered after you have been through the entire dataset. You can compute the weight update vector for each data point in the set on a separate machine/core, add them up at the end, and then proceed with the next epoch.
Here is a link to a question about batch training.

How to store and compress data for real time data logging?

When developing software that records input signals (numbers) in real time, how can this data best be stored and compressed? Would an SQL engine be good for this, permitting fast data mining in the future, or are there other data formats that would be suitable or compressed enough for up to 1000 data samples per second?
I don't mind building in VC++ but ideas applicable to C# would be ideal.
It is hard to say without more info, such as what the source is, whether you will need to query the stored data, and so on.
But for 1000 samples/sec, you should probably look at holding a few seconds of data in memory and then writing them out in bulk to persistent storage on another thread. (A multi-processor machine is recommended.)
If you decide to do it via a managed language, keep the same data structure around for keeping the samples - so that the GC does not need to collect memory too often. You can get marginally better performance by using pointers and the unsafe keyword (provides direct access to the memory structure and eliminates bounds checking code for arrays).
I don't know how much CPU time is needed for you to collect each sample; and how time-critical it is to read each sample at a specified time (will they be buffered in the device you are reading from ?). If the sampling is time-critical, you have 1 ms per sample; and then you probably cannot afford the risk of the garbage collector kicking in, as it will block your thread for some time. In this case, I would go for an unmanaged approach.
SQL Server would easily be able to hold your data, or you could write it to a file. It mostly depends on what you need to do with the data at a later time. I don't know how much data each sample is, but let's assume it is 8 bytes. Then you have 8000 bytes per second of raw data to write - perhaps with some overhead, so it could be 10 kB/s. Most storage mechanisms I can think of will be able to write data at this speed. Just make sure to write on a different thread than the one that is doing the sampling.
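As a rough illustration of that buffer-then-bulk-write idea (a few seconds of samples held in memory, flushed as 8-byte doubles by a separate thread), here is a sketch; the acquisition side and file path are placeholders:

    import struct
    import threading
    import time

    _lock = threading.Lock()
    _buffer = []                      # a few seconds of samples held in memory

    def record(sample):
        """Called from the acquisition thread, once per sample."""
        with _lock:
            _buffer.append(sample)

    def flusher(path, period=2.0):
        """Every couple of seconds, swap the buffer out and write it in bulk."""
        global _buffer
        with open(path, "ab") as f:
            while True:
                time.sleep(period)
                with _lock:
                    batch, _buffer = _buffer, []
                if batch:
                    # 8 bytes per sample (doubles): ~8 kB/s of raw data at 1 kHz.
                    f.write(struct.pack(f"{len(batch)}d", *batch))
                    f.flush()

    # threading.Thread(target=flusher, args=("samples.bin",), daemon=True).start()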
You may want to look at time-series databases, rather than relational. These will be optimised to deal with the sort of data and usage you're considering.
Kx is a popular choice, as is Fame.
