Data compression techniques for power plant data - database

I have been studying data management on my own recently. After some reading, I still don't have the whole picture of how data flows from acquisition to a database or warehouse.
In a power plant I have 1000 sensors installed, and I want to know what happens to the data before it is stored in the database. For instance, sensor data is sampled at 1 Hz; with that volume we presumably need some data compression before sending it to the database. So I want to know how all of this is done, especially the compression: if the data are digital values with timestamps, what kinds of compression techniques can be used, and how is data compressed in a Big Data context?

The way OSIsoft PI does this is by checking how much a newly collected point has deviated from the previously stored point. If the deviation is small, the point gets "dropped", so only meaningful data is stored. When you ask for a value at a time for which no data exists, PI interpolates it.
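As a rough illustration of that deadband idea (OSIsoft's actual algorithm is more sophisticated and proprietary; the threshold here is an arbitrary assumption), a minimal Python sketch:

    # Minimal sketch of deadband-style compression as described above: points that
    # deviate from the last stored value by less than a threshold are dropped, and
    # queries for missing times are answered by linear interpolation.
    def compress(samples, deadband=0.5):
        """samples: iterable of (timestamp, value); returns the stored subset."""
        stored = []
        for t, v in samples:
            if not stored or abs(v - stored[-1][1]) >= deadband:
                stored.append((t, v))
        return stored

    def value_at(stored, t):
        """Linearly interpolate a value at time t from the stored points."""
        for (t0, v0), (t1, v1) in zip(stored, stored[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        raise ValueError("time outside stored range")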

Data can be compressed in many ways, from simply zipping it up to totally customised solutions. In fact, for power plant data like yours, one of the larger systems is PI from OSIsoft. I used to work for a company that used it for 8 power stations. They have a completely bespoke database system where they store all their measurements. It is apparently optimised so that frequent readings from a sensor take up little space, and missing readings don't increase the space taken much. How they do it I have no idea; I expect it is proprietary and they won't tell people.
However, how data flows from sensor to database can be complex. Have a poke around the OSIsoft site; they have some information available.

Related

InfluxDB: storing audio or video

We are looking at InfluxDB to store large numbers of streamed measurements (1-2 tera-samples). Additionally, we would like to store the audio and video streams corresponding to the measurements (not all of them, but many). To me at least this makes sense, since it is all time-based data, but I don't see any discussion of this online.
I imagine that the video data could be broken up into frames, and the audio data into 100 ms audio frames.
Has anyone tried this? Any recommendations?
Cheers.
Kevin
Most time-series databases are optimized for storing floating point values, with the occasional string here and there. Storing BLOBs beyond perhaps 1 KB is likely not a good use case for InfluxDB, although we haven't done much performance testing with larger binary data.
That said, I don't quite follow your use case. It seems more like you need to index audio and video, rather than store and analyze time series data. TSDBs aren't just optimized for storing things with time as the primary axis; they are also optimized for aggregating those values and looking for change over time. Your use case doesn't seem to involve any aggregation or pattern searching, just a simple look-up table by time.
I would think a NoSQL database would be just as good for this, or perhaps OpenTSDB, which is built on top of HBase.
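For that "look-up table by time" approach, a minimal sketch might keep the frames as files and only index their paths by timestamp (SQLite here is just for illustration; the table and column names are assumptions):

    # Sketch: index externally stored audio/video frames by capture time, rather
    # than pushing large BLOBs into a time-series database.
    import sqlite3

    conn = sqlite3.connect("media_index.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS media_frames (
            ts_ns     INTEGER NOT NULL,   -- capture time, nanoseconds since epoch
            kind      TEXT    NOT NULL,   -- 'audio' or 'video'
            file_path TEXT    NOT NULL,   -- frame stored on disk or object storage
            PRIMARY KEY (ts_ns, kind)
        )
    """)

    def add_frame(ts_ns, kind, file_path):
        conn.execute("INSERT INTO media_frames VALUES (?, ?, ?)", (ts_ns, kind, file_path))
        conn.commit()

    def frames_between(start_ns, end_ns):
        return conn.execute(
            "SELECT ts_ns, kind, file_path FROM media_frames "
            "WHERE ts_ns BETWEEN ? AND ? ORDER BY ts_ns",
            (start_ns, end_ns)).fetchall()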

How to store and retrieve large numbers of data points for graphical visualization?

I'm thinking about building a web-based data logging and visualization service. The basic idea is that at some timed interval something (e.g. a sensor) reports a value (e.g. temperature) to the server. The server records this value into a database. There would be a web-based UI that allows me to view this data on a time-based graph. Ideally this graph would have various resolutions (last 30 seconds, last week, last year, etc). In a super ideal world, I would be able to zoom into the data for any point in time.
The problem is that the sensors are going to generate enormous amounts of data. For example, a sensor that reports a value every 5 seconds will generate about 17k values a day (86,400 / 5 = 17,280). I'm imagining a system that has thousands of sensors. Over time, this becomes lots of data.
The naive solution is to throw this data into a relational database and retrieve it in the various ways I want, but that won't scale.
The simple solution is to reduce the amount of data by performing periodic roll-ups of the data. New data might go into a table that has data points every 5 seconds. Every hour, some system pumps this data into another table that has data points every minute and the original data is deleted. This repeats for a few levels. The downside to this is that the further back in time you go, the less detailed the data is. That's probably fine. I would imagine that I would need enormous amounts of hardware to support full resolution of data over all time as compared to a system with this sort of rollup.
Is there a better way to do this? Is there an existing solution? I have to imagine this is a fairly common problem.
You probably want a fixed-size, round-robin database like RRDTool: http://oss.oetiker.ch/rrdtool/
Graphite is also built on top of a similar datastore implementation: http://graphite.wikidot.com/
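For reference, a minimal sketch of the roll-up approach described in the question (this is not how RRDTool or Graphite work internally; the SQLite schema and names are illustrative assumptions):

    # Sketch: collapse raw 5-second readings older than an hour into per-minute
    # averages, then delete the raw rows. Run this periodically, e.g. hourly.
    import sqlite3, time

    conn = sqlite3.connect("telemetry.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS raw_5s    (sensor_id INTEGER, ts INTEGER, value REAL);
        CREATE TABLE IF NOT EXISTS rollup_1m (sensor_id INTEGER, ts INTEGER, value REAL);
    """)

    def roll_up(older_than_s=3600):
        cutoff = int(time.time()) - older_than_s
        with conn:  # one transaction: aggregate, then drop the raw rows
            conn.execute("""
                INSERT INTO rollup_1m (sensor_id, ts, value)
                SELECT sensor_id, (ts / 60) * 60, AVG(value)
                FROM raw_5s
                WHERE ts < ?
                GROUP BY sensor_id, ts / 60
            """, (cutoff,))
            conn.execute("DELETE FROM raw_5s WHERE ts < ?", (cutoff,))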

If we make a number every millisecond, how much data would we have in a day?

I'm a bit confused here... I'm being offered a place on a project where there would be an array of sensors, each giving off a reading every millisecond (yes, 1000 readings in a second). A reading would be a 3- or 4-digit number, for example 818 or 1529. These readings need to be stored in a database on a server and accessed remotely.
I have never worked with such big amounts of data. What do you think, how much would the readings from one sensor for a day be in terms of MB? 4 (digits) x 1000 x 60 x 60 x 24 = 345,600,000 bits... right? About 42 MB per day... doesn't seem too bad, right?
Therefore a DB of, say, 1 GB would hold 23 days of info from 1 sensor, correct?
I understand that MySQL and PHP probably would not be able to handle it... what would you suggest? Maybe some apps? Azure? Oracle?
A 3- or 4-digit number is:
4 bytes if you store it as a string, or
2 bytes if you store it as a 16-bit (0-65535) integer.
1000/sec -> 60,000/minute -> 3,600,000/hour -> 86,400,000/day
As string: 86,400,000 * 4 bytes = 329 megabytes/day
As integer: 86,400,000 * 2 bytes = 165 megabytes/day
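A quick sanity check of those figures (raw payload only, ignoring any per-row database overhead):

    # Raw payload per sensor per day, ignoring database/row overhead.
    samples_per_day = 1000 * 60 * 60 * 24       # 86,400,000
    print(samples_per_day * 4 / 2**20)          # as 4-byte strings: ~329.6 MiB
    print(samples_per_day * 2 / 2**20)          # as 16-bit integers: ~164.8 MiB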
Your DB may not perform too well under that kind of insert load, especially if you're running frequent selects on the same data. Optimizing a DB for large-scale retrieval slows things down for fast/frequent inserts. On the other hand, inserting a simple integer is not exactly a "stressful" operation.
You'd probably be better off inserting into a temporary database and doing an hourly mass copy into the main 'archive' database. You do your analysis/mining on that main archive table, with the understanding that its data will be up to 1 hour stale.
But in the end, you'll have to benchmark variations of all this and see what works best for your particular usage case. There's no "you must do X to achieve Y" type advice in databaseland.
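A minimal sketch of that staging-then-archive pattern (SQLite is used only for illustration; the table names are assumptions):

    # Sketch: append incoming readings to a small staging table, then move them
    # into the archive table in one transaction every hour.
    import sqlite3, time

    conn = sqlite3.connect("readings.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS staging (ts INTEGER, value INTEGER);
        CREATE TABLE IF NOT EXISTS archive (ts INTEGER, value INTEGER);
    """)

    def record(value):
        conn.execute("INSERT INTO staging VALUES (?, ?)", (int(time.time() * 1000), value))

    def hourly_copy():
        with conn:  # single transaction: copy, then clear the staging table
            conn.execute("INSERT INTO archive SELECT ts, value FROM staging")
            conn.execute("DELETE FROM staging")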
Most likely you will not need to keep data at such high resolution for a long time. You have several options for minimizing the volume. First, after some period of time you can collapse the detailed data into hourly min/max/avg values; you keep full detail only for unstable situations you have detected, or for situations that by definition require detailed data. Also, many things can be turned into event logging. These approaches were implemented and used successfully a couple of decades ago in some industrial automation systems provided by the company I was working for at the time, when the available storage devices were many times smaller than what you can find today.
So, first analyse the data you will be storing, and then decide how to optimize its storage.
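As a sketch of the min/max/avg collapse (using pandas purely for illustration; the column name and hourly frequency are assumptions):

    # Sketch: downsample high-rate samples into hourly min/max/mean summaries.
    import pandas as pd

    def collapse_hourly(df):
        """df: DataFrame with a DatetimeIndex and a 'value' column."""
        return df["value"].resample("1h").agg(["min", "max", "mean"])

    # Example with one hour of 1 ms samples:
    # idx = pd.date_range("2024-01-01", periods=3_600_000, freq="1ms")
    # df = pd.DataFrame({"value": range(len(idx))}, index=idx)
    # print(collapse_hourly(df))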
Following @MarcB's numbers, 2 bytes at 1 kHz is just 2 KB/s, or 16 kbit/s. This is not really too much of a problem.
I think a sensible and flexible approach would be to construct a queue of sensor readings which the database can simply pop until it is clear. At these data rates, the problem is not the throughput (which could be handled by a dial-up modem) but the gap between the timings. Any system caching values will need to be able to get out of the way fast enough for the next value to be stored; 1 ms is not long to return in, particularly if you have GC interference.
The advantage of a queue is that it is cheap to add something to the queue at one end, and the values can be processed in bulk at the other end. So the sensor end gets the responsiveness it needs and the database gets to process in bulk.
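A minimal Python sketch of that queue idea (the batch size, the sink, and the placeholder reading are assumptions):

    # Sketch: a sensor thread pushes readings onto a queue; a writer thread drains
    # the queue in batches so the database sees bulk inserts, not 1 kHz traffic.
    import queue, threading, time

    readings = queue.Queue()

    def sensor_loop():
        while True:
            readings.put((time.time(), 818))   # placeholder reading
            time.sleep(0.001)                  # ~1 kHz

    def store_batch(batch):
        pass  # replace with e.g. cursor.executemany(...) against your database

    def writer_loop(batch_size=1000):
        while True:
            batch = [readings.get()]           # block until at least one reading
            while len(batch) < batch_size:
                try:
                    batch.append(readings.get_nowait())
                except queue.Empty:
                    break
            store_batch(batch)

    threading.Thread(target=sensor_loop, daemon=True).start()
    threading.Thread(target=writer_loop, daemon=True).start()
    time.sleep(5)  # let the demo run briefly; a real service would run indefinitely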
If you do not need a relational database you can use a NoSQL database like MongoDB, or even a much simpler solution like JDBM2 if you are using Java.

Flexible storage and retrieval of motion capture data

I want to flexibly access motion capture data from C/C++ code. We currently have a bunch of separate files (.c3d format). We can expect the full set of data to be several hours long, tracking about 50 markers (4 floats each) per frame, sampled at 60 Hz. So we're probably looking at a couple of gigabytes of data.
I'd like to have a database that can hold the data, allowing it to be relatively rapidly retrieved, augmented, and modified. I'd like to be able to apply labels to the data and retrieve sequences of frames by label, by time indices (e.g., frames 400-2000, or every 30th frame), or by other potential criteria.
Does such a thing already exist? Could I do it with SQLite for example? Does anyone have an intuition for what kind of performance I might get?
Currently, I'm just loading one .c3d file at a time and processing it. I haven't yet begun to apply meta-data/labels to sequences. I'll be accessing the sequences for visualization, statistical analysis, and training for machine-learning.
If you need to store multi-gigabytes of data with a known schema you might want to look into a binary flat file database. Of those available, I would recommend HDF5. It is not a relational database like SQLite, but provides rich support for array and matrix data with excellent performance. It also includes MPI support, if you ever expand your machine-learning onto a cluster.
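A minimal sketch of what that could look like with HDF5 via h5py (the dataset layout, chunking, and label convention are assumptions, not a .c3d-specific schema):

    # Sketch: store motion-capture data as one (n_frames, n_markers, 4) float32
    # dataset and keep labelled frame ranges as attributes.
    import h5py
    import numpy as np

    n_frames, n_markers = 60 * 60 * 60, 50        # one hour at 60 Hz, 50 markers

    with h5py.File("mocap.h5", "w") as f:
        dset = f.create_dataset("markers",
                                shape=(n_frames, n_markers, 4),
                                dtype="float32",
                                chunks=(600, n_markers, 4),   # ~10 s per chunk
                                compression="gzip")
        dset[0:600] = np.random.rand(600, n_markers, 4)       # write one block
        dset.attrs["label:walk_cycle"] = [400, 2000]          # frames 400-2000

    with h5py.File("mocap.h5", "r") as f:
        walk = f["markers"][400:2000]        # frames 400-2000
        sparse = f["markers"][::30]          # every 30th frame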

How to store and compress data for real time data logging?

When developing software that records input signals (numbers) in real time, how can this data best be stored and compressed? Would an SQL engine be good for this, permitting fast data mining in the future, or are there other data formats that would be suitable or compressed enough for up to 1000 data samples per second?
I don't mind building in VC++ but ideas applicable to C# would be ideal.
It is hard to say without more info, such as what the source is, whether you will need to query the stored data, and so on.
But for 1000 samples/sec, you should probably look at holding a few seconds of data in memory, and then writing them out in bulk to persistent storage on another thread. (A multi-processor machine is recommended.)
If you decide to do it via a managed language, keep the same data structures around for the samples, so that the GC does not need to collect memory too often. You can get marginally better performance by using pointers and the unsafe keyword (which provides direct access to the memory structure and eliminates bounds-checking code for arrays).
I don't know how much CPU time is needed for you to collect each sample, or how time-critical it is to read each sample at a specified time (will they be buffered in the device you are reading from?). If the sampling is time-critical, you have 1 ms per sample, and then you probably cannot afford the risk of the garbage collector kicking in, as it will block your thread for some time. In this case, I would go for an unmanaged approach.
SQL Server would easily be able to hold your data, or you could write it to a file. It mostly depends on what you need to do with the data at a later time. I don't know how much data each sample is, but let's assume it is 8 bytes. Then you have 8000 bytes per second of raw data to write; perhaps with some overhead it could be 10 kB/s. Most storage mechanisms I can think of will be able to write data at this speed. Just make sure to write on another thread than the one doing the sampling.
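A small sketch of that buffer-then-flush idea, shown in Python for brevity, appending packed binary records through gzip (the record layout and file name are assumptions):

    # Sketch: buffer about one second of samples in memory, then append them to a
    # gzip-compressed binary log. Record layout: 8-byte timestamp + 8-byte double.
    import gzip, struct, time

    RECORD = struct.Struct("<qd")        # int64 timestamp (ms), float64 value
    buffer = []

    def add_sample(value):
        buffer.append(RECORD.pack(int(time.time() * 1000), value))

    def flush(path="samples.bin.gz"):
        # In a real logger, run this on a separate thread so sampling is not blocked.
        with gzip.open(path, "ab") as f:
            f.write(b"".join(buffer))
        buffer.clear()

    for i in range(1000):                # roughly one second of samples at 1 kHz
        add_sample(float(i))
    flush()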
You may want to look at time-series databases, rather than relational. These will be optimised to deal with the sort of data and usage you're considering.
Kx is a popular choice, as is Fame.
