Is Hadoop suitable for a real-time querying, high data integrity application?

I developed industrial software that polls data from devices/RTUs and logs the data into a relational database every second. The HMI of this software allows users to query the data from the database and present it as tables/charts.
Now, the stored data can grow very fast. Typically there can easily be 100 devices, each with 100 data points, all logged every second. We are talking about 100 * 100 * 60 * 60 * 24 = 864,000,000 data points per day. This industrial software is expected to run 24/7, all year long.
Here's the problem: because of the scale of the data, querying can be painfully slow. If I were to plot data for 3 months, the SQL query would take minutes.
My question is: is Hadoop (a distributed storage and analytics system) suitable for my application? Can I leverage the power of Hadoop to speed up the querying of data in my application? How?
Note that the data integrity in my application is very critical.
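
To make the workload concrete, here is a minimal JDBC sketch of the kind of three-month range query the HMI would issue. The schema (a samples table with device_id, point_id, ts and value columns) and the connection details are assumptions for illustration, not the actual application.

    import java.sql.*;

    public class TrendQuery {
        // Assumed schema: samples(device_id INT, point_id INT, ts TIMESTAMP, value DOUBLE PRECISION)
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/scada", "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT ts, value FROM samples " +
                     "WHERE device_id = ? AND point_id = ? AND ts BETWEEN ? AND ? " +
                     "ORDER BY ts")) {
                ps.setInt(1, 42);   // one device
                ps.setInt(2, 7);    // one data point on that device
                ps.setTimestamp(3, Timestamp.valueOf("2013-01-01 00:00:00"));
                ps.setTimestamp(4, Timestamp.valueOf("2013-04-01 00:00:00"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // One point logged every second for ~90 days is about
                        // 86,400 * 90 = 7.8 million rows for a single series.
                        plot(rs.getTimestamp("ts"), rs.getDouble("value"));
                    }
                }
            }
        }

        private static void plot(Timestamp ts, double value) { /* hand off to the HMI chart */ }
    }

Even a single series is millions of rows over three months, so without a composite index on (device_id, point_id, ts) and/or pre-aggregated rollups, the engine ends up scanning a large fraction of the table, which is why the query takes minutes.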

Related

How are real-time database systems different from time-series databases?

I understand that time-series databases like InfluxDB are used to store metrics or variables that change over time, e.g. sensor data or metrics like counters and timers.
How is that different from a real-time database, since time-series data is real-time in a sense? Can I use a time-series DB for real-time data, or vice versa? Or is there a database that handles both?
A time-series database is a database that stores values distributed over time (a timestamp + value array for each series).
A real-time database is a database that satisfies real-time guarantees and must meet certain time constraints and deadlines. E.g. the database system can guarantee that a query will be executed in no longer than 100 ms; if it takes longer, an error is raised (see the sketch below). Any database system can be real-time, e.g. a relational one or a KV-store. A good example of such systems are the ticker plants used by stock exchanges (NYSE, TSX or NASDAQ).
TLDR
Real-time database and time-series database are two different things.
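
As a rough, application-level illustration of that 100 ms deadline (not a true real-time DBMS guarantee), a query can be wrapped in a timeout so that missing the deadline is treated as an error. The ticks table and the connection handling are hypothetical.

    import java.sql.*;
    import java.util.concurrent.*;

    public class DeadlineQuery {
        private static final ExecutorService pool = Executors.newSingleThreadExecutor();

        // Run the query, but treat anything slower than 100 ms as an error,
        // mimicking a soft real-time constraint at the application level.
        static double lastPrice(Connection con, String symbol) throws Exception {
            Future<Double> f = pool.submit(() -> {
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT price FROM ticks WHERE symbol = ? ORDER BY ts DESC LIMIT 1")) {
                    ps.setString(1, symbol);
                    try (ResultSet rs = ps.executeQuery()) {
                        rs.next();
                        return rs.getDouble(1);
                    }
                }
            });
            try {
                return f.get(100, TimeUnit.MILLISECONDS);   // the deadline
            } catch (TimeoutException e) {
                f.cancel(true);                             // deadline missed -> error
                throw new IllegalStateException("real-time constraint violated", e);
            }
        }
    }

A real-time DBMS enforces this kind of deadline inside the engine (with admission control and predictable execution), whereas the sketch above only detects a miss after the fact.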

Database structure for 2000 data points per second with real-time display?

I am working on a Web-based monitoring project. There are nearly 50 sensors with a sample frequency of 50 Hz. All the raw sensor data must be stored in the database, which means nearly 2,500 data points per second and over 200 million per day. The data must be kept for at least three years. My job is to build a web server that shows the real-time and historical sensor data from the database. Some display delay is allowed.
Which database should we choose for this application, SQL Server or Oracle? Can these databases sustain such a huge number of I/O transactions per second?
How should the database structure be designed for real-time data and historical data? My idea is to have two databases: one stores real-time data, the other stores historical data. Incoming data is first written to the real-time database, and at some point (e.g. 23:59:59 every day) a SQL transaction transfers the real-time data to the historical database. Real-time displays then read the real-time database, and historical views read the historical database. Is this feasible? And how should I determine when to transfer the data? I think one day is too long for such a volume of data.
How should the time information be stored? A sample arrives every 20 milliseconds. With one timestamp per value the database grows huge, since the datetime type in SQL Server 2008 takes 8 bytes. With one timestamp per row of 50 values the database shrinks, but displaying a single sensor then requires pulling every value out of the 50-value row. How do I balance database size against the efficiency of reading the data?
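
For the last point, here is a rough size comparison of the two layouts (one timestamp per value vs. one timestamp per 50-value row). The byte counts are simplifying assumptions (8-byte datetime, 8-byte value, row overhead ignored), not measured SQL Server figures.

    public class StorageEstimate {
        public static void main(String[] args) {
            long valuesPerSecond = 50L * 50;                 // 50 sensors * 50 Hz = 2,500 values/s
            long secondsPerDay   = 24L * 60 * 60;
            long valuesPerDay    = valuesPerSecond * secondsPerDay;   // 216,000,000

            // Layout A: one row per value -> 8-byte datetime + 8-byte value per row
            long bytesPerValueRows = valuesPerDay * (8 + 8);

            // Layout B: one row per 20 ms tick -> one 8-byte datetime shared by 50 8-byte values
            long rowsPerDay    = 50L * secondsPerDay;        // 50 ticks per second
            long bytesWideRows = rowsPerDay * (8 + 50 * 8);

            System.out.printf("values/day: %,d%n", valuesPerDay);
            System.out.printf("timestamp per value    : %,d MB/day%n", bytesPerValueRows / 1_000_000);
            System.out.printf("timestamp per 50 values: %,d MB/day%n", bytesWideRows / 1_000_000);
        }
    }

With these assumptions the per-value layout comes to roughly 3.5 GB/day against roughly 1.8 GB/day for the wide-row layout, so the timestamp is about half the storage in the first case; the wide row halves the size at the cost of extracting individual sensors from each row when displaying.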

SSIS ETL vs RESTful Web Service vs Service Bus

I have databases across geographical locations and need to synchronize them in near real time.
As I understand it, SSIS ETL is suitable only for batch updates; real-time updates can be achieved with web services or a service bus.
Furthermore, only SSIS ETL can handle larger volumes.
I am looking for the limits on data velocity or volume beyond which web services or a service bus are no longer an option, and for a trade-off analysis.
What approach is suitable if the requirement is large volumes and near-real-time updates?
I'd suggest that you take a look at the SqlBulkCopy class. It lets you do fast high volume inserts (just inserts, not updates) from .Net code. So your code could grab a bunch of messages off the bus, and then insert them really quickly.
We are prototyping solutions to a problem similar to yours. SqlBulkCopy appears to be at least 10 times faster than normal insert statements, quite possibly more. It was the main, but not the only, factor in speeding up our process from taking 8 hours to taking only 15 minutes.
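
SqlBulkCopy itself is .NET-specific; as a rough Java analogue of the same pattern (drain a batch of messages from the bus, then write them in one round trip), JDBC batching looks like this. The readings table, the message shape and the batching policy are assumptions.

    import java.sql.*;
    import java.util.List;

    public class BusToDatabase {
        // Hypothetical message taken off the service bus.
        record Reading(long deviceId, Timestamp ts, double value) {}

        static void flush(Connection con, List<Reading> batch) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO readings (device_id, ts, value) VALUES (?, ?, ?)")) {
                for (Reading r : batch) {
                    ps.setLong(1, r.deviceId());
                    ps.setTimestamp(2, r.ts());
                    ps.setDouble(3, r.value());
                    ps.addBatch();            // queue the row locally, no round trip yet
                }
                ps.executeBatch();            // send the whole batch in one round trip
                con.commit();                 // one commit per batch, not per row
            } catch (SQLException e) {
                con.rollback();
                throw e;
            }
        }
    }

Batching and committing per block rather than per message is what gives the order-of-magnitude speed-up mentioned above; a driver-level bulk API (SqlBulkCopy, COPY, bcp) is usually faster still.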

A huge data storage problem

I'm starting to design a new application that will be used by about 50,000 devices. Each device generates about 1,440 records a day, which means over 72 million records will be stored per day. These records keep coming every minute, and I must be able to query this data from a Java application (J2EE). So it needs to be fast to write, fast to read, and indexed to allow report generation.
Devices only insert data, and the J2EE application will need to read it occasionally.
Now I'm looking at software alternatives to support this kind of operation.
Putting all of this data in a single table would be catastrophic, because I wouldn't be able to use it given the amount of data accumulated over a year.
I'm using Postgres, and database partitioning does not seem to be the answer, since I'd need to partition tables by month, or maybe take a more granular approach, by day for example.
I was thinking of a solution using SQLite: each device would have its own SQLite database, so the information would be granular enough for easy maintenance and fast inserts and queries.
What do you think?
Record only changes of device position - most of the time a device will not move: a car will be parked, a person will sit or sleep, a phone will be on an unmoving person or charging, etc. This would give you an order of magnitude less data to store (see the sketch after this answer).
You'll be generating at most about 1 TB a year (even without implementing the first point), which is not a very big amount of data. Averaged out, that is only on the order of tens of KB/s, which a single SATA drive can handle easily.
Even a simple unpartitioned Postgres database on modest hardware should manage to handle this. The only problem could be when you need to query or back up - this can be resolved with a Hot Standby mirror using Streaming Replication, a new feature in the soon-to-be-released PostgreSQL 9.0. Just query against / back up the mirror - if it is busy, it will temporarily and automatically queue the changes and catch up later.
When you really need to partition, do it on, for example, device_id modulo 256 instead of time. That way writes are spread out over every partition. If you partition on time, only one partition will be very busy at any moment and the others will be idle. Postgres supports partitioning this way very well. You can then also spread the load over several storage devices using tablespaces, which are also well supported in Postgres.
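
A minimal sketch of the "record only changes" idea from the first point above. The position type, the 10-metre movement threshold and the store() call are placeholders, not part of the original answer.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ChangeOnlyLogger {
        // Hypothetical fix reported by a device.
        record Position(double lat, double lon) {}

        private final Map<Long, Position> lastStored = new ConcurrentHashMap<>();

        // Called once per minute per device; writes a row only when the device actually moved.
        public void onSample(long deviceId, Position p, long epochSeconds) {
            Position prev = lastStored.get(deviceId);
            if (prev != null && distanceMeters(prev, p) < 10.0) {
                return;                        // parked / sitting / sleeping -> skip the row
            }
            lastStored.put(deviceId, p);
            store(deviceId, p, epochSeconds);  // the actual INSERT
        }

        private static double distanceMeters(Position a, Position b) {
            // Crude flat-earth approximation, good enough for a "did it move?" check.
            double dLat = (a.lat() - b.lat()) * 111_000;
            double dLon = (a.lon() - b.lon()) * 111_000 * Math.cos(Math.toRadians(a.lat()));
            return Math.hypot(dLat, dLon);
        }

        private void store(long deviceId, Position p, long epochSeconds) { /* INSERT ... */ }
    }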
Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.
Depending on the kind of queries that you need to run against your dataset, you might consider partitioning your remote devices across several servers, and then query those servers to write aggregate data to a backend server.
The key to high-volume tables is: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTs, and use partitioning for data that you will delete in the future - DROP TABLE is much faster than DELETE FROM TABLE! (A sketch of dropping a partition follows this answer.)
Table design and query optimization becomes very database-specific as you start to challenge the database engine. Consider hiring a Postgres expert to at least consult on your design.
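
To illustrate the DROP TABLE vs DELETE point, a hedged sketch of retiring an expired monthly partition over JDBC. The readings_YYYY_MM naming scheme is an assumption; with the inheritance-based partitioning of that Postgres era, each month would simply be its own child table.

    import java.sql.*;
    import java.time.YearMonth;
    import java.time.format.DateTimeFormatter;

    public class PartitionRetirer {
        // Drop the child table holding a whole expired month instead of running
        // a DELETE that would touch (and bloat) millions of rows and their indexes.
        static void dropMonth(Connection con, YearMonth month) throws SQLException {
            String table = "readings_" + month.format(DateTimeFormatter.ofPattern("yyyy_MM"));
            try (Statement st = con.createStatement()) {
                st.executeUpdate("DROP TABLE IF EXISTS " + table);   // e.g. readings_2010_06
            }
        }
    }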
Maybe it is time for a db that you can shard over many machines? Cassandra? Redis? Don't limit yourself to sql db's.
Database partition management can be automated; time-based partitioning of the data is a standard way of dealing with this type of problem, and I can't see any reason why it can't be done with PostgreSQL.
You have approximately 72m rows per day - assuming a device ID, datestamp and two floats for coordinates you will have (say) 16-20 bytes per row plus some minor page metadata overhead. A back-of-fag-packet capacity plan suggests around 1-1.5GB of data per day, or 400-500GB per year, plus indexes if necessary.
If you can live with periodically refreshed data (i.e. not completely up to date) you could build a separate reporting table and periodically update this with an ETL process. If this table is stored on separate physical disk volumes it can be queried without significantly affecting the performance of your transactional data.
A separate reporting database for historical data would also allow you to prune your operational table by dropping older partitions, which would probably help with application performance. You could also index the reporting tables and create summary tables to optimise reporting performance.
If you need low latency data (i.e. reporting on up-to-date data), it may also be possible to build a view where the lead partitions are reported off the operational system and the historical data is reported from the data mart. This would allow the bulk queries to take place on reporting tables optimised for this, while relatively small volumes of current data can be read directly from the operational system.
Most low-latency reporting systems use some variation of this approach - a leading partition can be updated by a real-time process (perhaps triggers) and contains relatively little data, so it can be queried quickly, but contains no baggage that slows down the update. The rest of the historical data can be heavily indexed for reporting. Partitioning by date means that the system will automatically start populating the next partition, and a periodic process can move, re-index or do whatever needs to be done for the historical data to optimise it for reporting.
Note: If your budget runs to PostgreSQL rather than Oracle, you will probably find that direct-attach storage is appreciably faster than a SAN unless you want to spend a lot of money on SAN hardware.
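
A minimal sketch of the periodic ETL described above: each run copies one time window of rows from the operational table into the reporting table with a single INSERT ... SELECT. The schema names and the window-based bookkeeping are assumptions.

    import java.sql.*;

    public class ReportingEtl {
        // Copy rows in [windowStart, windowEnd) to the reporting table; the caller
        // passes the previous run's end as the new start so nothing is copied twice.
        // The operational table is pruned separately, e.g. by dropping old partitions.
        static int refresh(Connection con, Timestamp windowStart, Timestamp windowEnd)
                throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO reporting.readings (device_id, ts, lat, lon) " +
                    "SELECT device_id, ts, lat, lon " +
                    "FROM operational.readings WHERE ts >= ? AND ts < ?")) {
                ps.setTimestamp(1, windowStart);
                ps.setTimestamp(2, windowEnd);
                return ps.executeUpdate();    // number of rows copied in this run
            }
        }
    }

Because the reporting copy only ever receives closed time windows, it can be heavily indexed and summarised without slowing down the inserts hitting the operational table.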
That is a bit of a vague question you are asking. And I think you are not facing a choice of database software, but an architectural problem.
Some considerations:
How reliable are the devices, and how well are they connected to the querying software?
How failsafe do you need the storage to be?
How much extra processing power do the devices have to process your queries?
Basically, spatial partitioning is a good idea. That does not exclude a temporal partition, if necessary. Whether you do that in Postgres or SQLite depends on other factors, like the processing power and the available libraries.
Another consideration would be whether your devices are reliable and powerful enough to handle your queries. Otherwise, you might want to work with a centralized cluster of databases instead, which you can still query in parallel.

Database solution for 200 million writes/day, monthly summarization queries

I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar process at present (a web domain crawling database) with the same significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the amount of CPU used per "write" and the speed at which the writes can be committed.
(Network performance is also often an early bottleneck.)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up-to-date the summaries need to be, and this is critical. I would try to push back from "realtime" and suggest overnight (or if you can get away with it monthly) summary calculations.
Finally, we're using a 2-CPU, 4 GB RAM Windows 2003 virtual SQL Server 2005 machine and a single-CPU, 1 GB RAM IIS web server as our test system, and we can ingest 20 million records in a 10 hour period (with storage on RAID 5 on a shared SAN). We get ingest rates of up to 160 records per second, batched in blocks of 40 records per network round trip.
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
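
To show why the summaries map naturally onto map/reduce, here is a bare-bones Hadoop MapReduce job counting transactions per (user, type, month). The input format, one CSV line per transaction of the form user,type,timestamp, is an assumption.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MonthlySummary {

        // Assumed input line: userId,transactionType,2011-06-17T13:45:00Z
        public static class SummaryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                String month = f[2].substring(0, 7);                       // e.g. "2011-06"
                ctx.write(new Text(f[0] + "\t" + f[1] + "\t" + month), ONE);
            }
        }

        public static class SummaryReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) total += c.get();
                ctx.write(key, new LongWritable(total));                   // user, type, month -> count
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "monthly-summary");
            job.setJarByClass(MonthlySummary.class);
            job.setMapperClass(SummaryMapper.class);
            job.setCombinerClass(SummaryReducer.class);   // counting is associative, so reuse the reducer
            job.setReducerClass(SummaryReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

This is essentially the word-count pattern, which is why the aggregation side of the problem is the easy part; the same aggregation is a few lines of Pig.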
Greenplum or Teradata would be a good option. These databases are MPP and can handle petabyte-scale data. Greenplum is a distributed PostgreSQL DB and also has its own MapReduce. Hadoop may solve your storage problem, but it wouldn't be helpful for performing summary queries on your data.
