I am working on a web-based monitoring project. There are nearly 50 sensors with a sample frequency of 50 Hz, and all the raw sensor data must be stored in the database. That means nearly 2,500 values per second, or roughly 200 million per day, must be handled, and the data must be kept for at least three years. My job is to build a web server that shows the real-time and historical sensor data from the database. A delay in the display is acceptable.
Which database should we choose for this application, SQL Server or Oracle? Can these databases stand up to such a huge number of I/O transactions per second?
How should we design the database structure for real-time and historical data? My idea is to use two databases: one stores the real-time data and the other stores the historical data. Incoming data is first stored in the real-time database, and at a fixed time (e.g. 23:59:59 every day) a SQL transaction transfers the real-time data to the historical database. Real-time displays read from the real-time database, and historical views read from the historical database. Is this feasible? And how should the transfer interval be chosen? I think one day is too long for this much data.
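For concreteness, a minimal sketch of the kind of transfer job I have in mind, written with pyodbc; the DSN, table and column names are placeholders, not a finished design:

from datetime import date
import pyodbc

cutoff = date.today()  # move everything recorded before the start of today

conn = pyodbc.connect("DSN=MonitoringServer")  # autocommit is off by default
cur = conn.cursor()
try:
    # Copy the finished rows into the historical database...
    cur.execute("""
        INSERT INTO HistoricalDB.dbo.SensorData (SensorId, SampleTime, Value)
        SELECT SensorId, SampleTime, Value
        FROM   RealtimeDB.dbo.SensorData
        WHERE  SampleTime < ?
    """, cutoff)
    # ...then remove them from the real-time database.
    cur.execute("DELETE FROM RealtimeDB.dbo.SensorData WHERE SampleTime < ?", cutoff)
    conn.commit()  # both statements succeed or neither does
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()

The same job could run every hour (or more often) just by changing the cut-off value, which would also address my worry that one day is too long.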
How should the time information be stored? A new value arrives every 20 milliseconds. Storing one timestamp per value makes the database huge, because the datetime type in SQL Server 2008 takes 8 bytes. Storing one timestamp per group of 50 values makes the database smaller, but displaying the data then costs extra time to pick each value out of the group of 50. How do we balance database size against the efficiency of reading the data?
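To make the trade-off concrete, here is a sketch of the two layouts I am comparing. Table and column names are placeholders, and I am reading "50 values + one time" as one row per sample tick holding all 50 sensors; generating the wide DDL from Python just avoids typing 50 columns by hand:

# Layout A: one row per reading, so every value carries its own timestamp.
narrow_ddl = """
CREATE TABLE SensorReadingNarrow (
    SampleTime datetime2(3) NOT NULL,  -- millisecond precision matches 20 ms samples
    SensorId   smallint     NOT NULL,
    Value      real         NOT NULL,
    CONSTRAINT PK_Narrow PRIMARY KEY CLUSTERED (SampleTime, SensorId)
)
"""

# Layout B: one row per sample tick, so one timestamp is shared by 50 values.
value_columns = ",\n    ".join(f"Sensor{i:02d} real NOT NULL" for i in range(1, 51))
wide_ddl = f"""
CREATE TABLE SensorReadingWide (
    SampleTime datetime2(3) NOT NULL PRIMARY KEY CLUSTERED,
    {value_columns}
)
"""

print(narrow_ddl)
print(wide_ddl)

Layout B stores roughly one fiftieth of the timestamps, while Layout A keeps it cheap to read a single sensor's history; which one wins depends on whether queries usually touch one sensor or all of them.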
I'm looking for the best database for my big data project.
We are collecting data from some sensors. Every row has about one hundred columns.
Every day we store some millions of rows.
The most common query retrieves data for one sensor over a date range.
At the moment I use a Percona MySQL cluster. When I query a range of a few days, the response is fast. The problem is when I query data for a whole month.
The database is perfectly optimized, but the response time is not acceptable.
I would like to replace the Percona cluster with a database that can run queries in parallel across all the nodes to improve response time.
With Cassandra I could partition the data across nodes (perhaps based on the current date), but I have read that Cassandra cannot read across partitions in parallel, so I would have to create a query for every day (I don't know why).
Is there a database that manages sharded queries automatically, so I can distribute data across all the nodes?
With Cassandra, if you split your data across multiple partitions, you can still read across partitions in parallel by executing multiple queries asynchronously.
Cassandra drivers help you handle this; see execute_concurrent from the Python driver.
Moreover, the Cassandra driver is aware of the data partitioning: it knows which node holds which data. So when reading or writing, it chooses an appropriate node to send each query to, according to the driver's load balancing policy (specifically with TokenAwarePolicy).
Thus, the client acts as a load balancer, and your request is processed in parallel by the available nodes.
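A minimal sketch of what that looks like with the DataStax Python driver, assuming a table partitioned by (sensor_id, day); the keyspace, table, column names and addresses are placeholders:

from datetime import date, timedelta

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],
    # TokenAwarePolicy routes each query to a replica that owns the partition.
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect("monitoring")

query = session.prepare(
    "SELECT ts, value FROM sensor_data WHERE sensor_id = ? AND day = ?"
)

# One (sensor_id, day) pair per partition for the requested month.
start, end = date(2016, 1, 1), date(2016, 2, 1)
days = [(42, start + timedelta(n)) for n in range((end - start).days)]

# Issues the per-day queries concurrently and collects (success, result) pairs.
results = execute_concurrent_with_args(session, query, days, concurrency=32)
rows = [row for success, result in results if success for row in result]
print(len(rows), "rows fetched across", len(days), "partitions")

cluster.shutdown()

So the month-long read in the question becomes roughly 30 small partition reads issued concurrently, each sent to a node that actually holds the data.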
I am a newbie to database systems and I was wondering what the difference is between a temporal database and a time-series database. I have searched the internet but have not found any comparison of the two.
A temporal database stores events that happen at a certain time or hold for a certain period. For example, a customer's address may change, so joining the invoice table with the customer table gives a different answer before and after the customer's move.
A time-series database stores time series, which are arrays of numbers indexed by time: for example, the evolution of the temperature with one measurement every hour, or a stock's value every second.
Time-series database: a time-series database is optimized for storing time-series data, i.e. data stored along with a timestamp so that changes can be measured over time. Prometheus is a time-series database used by SoundCloud, Docker and Showmax.
Real world uses:
Autonomous trading algorithms continuously collect data on market changes.
DevOps monitoring stores the state of a system over its running time.
Temporal databases contain time-sensitive data. That is, the data is stored with time indicators such as the valid time (the time for which the entry remains valid) and the transaction time (the time the data was entered into the database). Any database can be used as a temporal database if the data is managed correctly.
Real world uses:
Shop inventory systems keep track of stock quantities, times of purchase and best-before dates.
Industrial processes that depend on valid-time data during manufacturing and sales.
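To make the distinction concrete, here is a toy illustration in plain Python (no database; all names and values are made up): a temporal record answers "what was true at time T?", while a time series answers "what values were observed between T1 and T2?".

from datetime import date, datetime, timedelta

# Temporal data: a customer's address with a validity period (valid time).
address_history = [
    (date(2014, 1, 1), date(2015, 6, 30), "12 Old Street"),
    (date(2015, 7, 1), date(9999, 12, 31), "99 New Avenue"),
]

def address_as_of(when: date) -> str:
    """Return the address that was valid on a given date."""
    for valid_from, valid_to, address in address_history:
        if valid_from <= when <= valid_to:
            return address
    raise LookupError("no address valid on that date")

# Time-series data: one temperature measurement per hour.
start = datetime(2016, 3, 1)
temperatures = [(start + timedelta(hours=h), 20.0 + h * 0.1) for h in range(24)]

def temperatures_between(t1: datetime, t2: datetime):
    """Return all (timestamp, value) samples observed in [t1, t2)."""
    return [(ts, v) for ts, v in temperatures if t1 <= ts < t2]

print(address_as_of(date(2015, 3, 15)))                               # 12 Old Street
print(len(temperatures_between(start, start + timedelta(hours=6))))   # 6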
I have some devices that log data to a table every second. Each device produces 16 records per second, so as the number of devices grows the table will hold billions of records. I'm using SQL Server now, and sometimes even a simple record-count query takes seconds to execute.
We mostly need historical data as hourly averages, so we were processing the raw data each hour and converting it into hourly records, leaving only 16 records per device per hour. But now there is a requirement to fetch all the raw records in a given time range and process them, so we need to work with the big data directly.
Currently I use SQL Server. Can you suggest alternative approaches, or ways to deal with this much data in SQL Server or some other database?
I don't think that's too much for SQL Server. For starters, please see the links below.
https://msdn.microsoft.com/en-us/library/ff647793.aspx?f=255&MSPPError=-2147217396
http://sqlmag.com/sql-server-2008/top-10-sql-server-performance-tuning-tips
http://www.tgdaily.com/enterprise/172441-12-tuning-tips-for-better-sql-server-performance
Make sure your queries are tuned properly and the tables are indexed properly.
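For this workload, "indexed properly" usually means a composite index that matches the "one device over a time range" query, plus computing the hourly averages on the server rather than in application code. A hedged sketch with pyodbc, where the DSN, table and column names are placeholders:

import pyodbc

conn = pyodbc.connect("DSN=DeviceLogServer")
cur = conn.cursor()

# One-time: composite index supporting "all records for one device in a range".
cur.execute("""
    CREATE INDEX IX_DeviceLog_Device_Time
    ON dbo.DeviceLog (DeviceId, LogTime)
    INCLUDE (Value)
""")
conn.commit()

# Hourly averages computed server-side, so only the aggregates cross the network.
# DATEADD/DATEDIFF truncates LogTime to the start of its hour.
cur.execute("""
    SELECT DeviceId,
           DATEADD(hour, DATEDIFF(hour, 0, LogTime), 0) AS HourBucket,
           AVG(Value) AS AvgValue
    FROM   dbo.DeviceLog
    WHERE  DeviceId = ? AND LogTime >= ? AND LogTime < ?
    GROUP  BY DeviceId, DATEADD(hour, DATEDIFF(hour, 0, LogTime), 0)
    ORDER  BY HourBucket
""", (7, "2016-01-01", "2016-02-01"))

for device_id, hour_bucket, avg_value in cur.fetchall():
    print(device_id, hour_bucket, avg_value)

conn.close()

For the new requirement (raw rows over a time range), the same index lets SQL Server seek straight into the (DeviceId, LogTime) range instead of scanning the whole table.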
I developed industrial software that polls data from devices/RTUs and logs the data into a relational database every second. The software's HMI lets the user query this data from the database and present it as tables or charts.
The stored data can grow very fast. There can easily be 100 devices, each with 100 data points that need to be logged every second. That is 100 * 100 * 60 * 60 * 24 = 864,000,000 values per day. This industrial software is expected to run 24/7, all year long.
Here's the problem: because of the scale of the data, querying can be painfully slow. If I plot data for 3 months, the SQL query takes minutes.
My question is: is Hadoop (a distributed storage and analytics system) suitable for my application? Can I leverage the power of Hadoop to speed up data querying in my application, and how?
Note that data integrity in my application is very critical.
I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8-hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, e.g. "How many transactions of type A did each user run this month?" (they could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar system at present (a web domain crawling database) with similarly significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the CPU used per "write" and the speed at which the writes can be committed.
(Network performance is also often an early bottleneck.)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up to date the summaries need to be, and this is critical. I would try to push back from "real time" and suggest overnight (or, if you can get away with it, monthly) summary calculations.
Finally, we're using a 2-CPU, 4 GB RAM Windows 2003 virtual SQL Server 2005 machine and a single-CPU, 1 GB RAM IIS web server as our test system, and we can ingest 20 million records in a 10-hour period (the storage is RAID 5 on a shared SAN). We get ingest rates of up to 160 records per second, batched in blocks of 40 records per network round trip.
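Our actual ingest code isn't Python, but the batching idea, sketched here with pyodbc and placeholder DSN/table/column names, looks roughly like this: rows are buffered client-side and each network round trip carries a whole block.

import pyodbc

conn = pyodbc.connect("DSN=IngestServer", autocommit=False)
cur = conn.cursor()
cur.fast_executemany = True  # recent pyodbc versions pack parameters into bulk arrays

BATCH_SIZE = 40
buffer = []

def log_record(sensor_id, sample_time, value):
    """Queue one record; a full batch is flushed in a single round trip."""
    buffer.append((sensor_id, sample_time, value))
    if len(buffer) >= BATCH_SIZE:
        cur.executemany(
            "INSERT INTO dbo.SensorData (SensorId, SampleTime, Value) VALUES (?, ?, ?)",
            buffer,
        )
        conn.commit()
        buffer.clear()

# On shutdown, any remaining partial batch in `buffer` should be flushed the same way.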
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
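As a toy sketch of that aggregation shape (pure Python; on Hadoop this would be a streaming job or a few lines of Pig, and the field names are made up):

from collections import defaultdict
from datetime import datetime

def map_record(record):
    """Emit ((user_id, txn_type, month), 1) for one logged transaction."""
    user_id, txn_type, ts = record
    return ((user_id, txn_type, ts.strftime("%Y-%m")), 1)

def reduce_counts(pairs):
    """Sum the emitted 1s per key: transactions of each type, per user, per month."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

records = [
    ("alice", "A", datetime(2014, 2, 1, 9, 30)),
    ("alice", "A", datetime(2014, 2, 2, 14, 5)),
    ("bob",   "B", datetime(2014, 2, 1, 9, 31)),
]
print(reduce_counts(map(map_record, records)))
# {('alice', 'A', '2014-02'): 2, ('bob', 'B', '2014-02'): 1}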
Greenplum or Teradata would be a good option. These databases are MPP and can handle petabyte-scale data. Greenplum is a distributed PostgreSQL database and also has its own MapReduce. Hadoop may solve your storage problem, but it wouldn't be helpful for performing summary queries on your data.