What advantages does CnosDB have in storing or querying compared to other time series databases?
It would be nice to have an intuitive example, such as an IoT performance comparison.
Related
I've been researching Apache Hive for the past month, and all I've managed to find are articles explaining what Hive actually is, how to install it, and how to create tables in it.
I've never once found anything describing its actual practical use in industry, even in a small brand company. Is Hive really not that popular in industry compared to other data warehouses/databases?
Apache Hive was the first "SQL on Hadoop" framework: it translates your SQL queries into MapReduce jobs.
It's meant for batch-style processing rather than interactive response times. (I'll leave Hive on Spark, Hive on Tez, etc. outside of this discussion.)
We use Hive (along with Spark) in ELT pipelines to ingest and transform our raw datasets into "Data Vaults" and then further into Data Marts in our Hadoop environments. We have pretty much standardized on Parquet for those tables.
For BI dashboards, those Data Marts are queried with Impala. Some other production jobs use Spark SQL. Both Impala and Spark SQL are other "SQL on Hadoop" dialects (just like Hive) that can be used to access "big data"/Hadoop datasets.
That being said, we still use more traditional data warehouses (Oracle in our case) in the same projects, but we can only push a subset of the data there (because of the size/performance limitations of these traditional approaches).
To your question "even in a small brand company": I think if the key word here is "small", then you don't necessarily need Hive (or perhaps any other "big data" technology). If your datasets are small and don't warrant more scalable big-data technologies, you should be fine, and maybe even more productive in your development efforts, with more traditional databases.
We use Hive on Tez along with other tools like Spark, Sqoop, etc. for ETL to build data marts in a 15 PB warehouse.
I have never been able to join 50 billion rows of data in a single query on any database but Hive. Hive scales virtually without limit.
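To make the batch-query pattern above concrete, here is a minimal sketch (not a definitive recipe) of running a Hive join from Java over HiveServer2's JDBC interface. The host, database, user, and table names are invented for illustration, and the hive-jdbc driver is assumed to be on the classpath; on Parquet-backed tables of the sizes discussed here, the query compiles into MapReduce/Tez jobs and runs for minutes or hours, not milliseconds.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveBatchJoin {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hiveserver2.example.com:10000/datamart";
        try (Connection con = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = con.createStatement()) {
            // A typical batch-style join over (hypothetical) Parquet-backed tables.
            ResultSet rs = stmt.executeQuery(
                "SELECT c.customer_id, SUM(o.amount) AS total " +
                "FROM orders o JOIN customers c ON o.customer_id = c.customer_id " +
                "GROUP BY c.customer_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```

Impala and Spark SQL (via its Thrift server) expose similar JDBC endpoints, so much the same client code can be pointed at whichever engine fits the latency requirements.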
There are so many different databases:
relational databases
NoSQL databases
key/value
document stores
wide column stores
graph databases
and database technologies:
in-memory
column-oriented
All have their advantages and disadvantages.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
I am thinking about Hadoop, which has many functions for saving data in HDFS or accessing different databases for analytics.
Is it right to say that Hadoop can make it easier to choose the right database, because it can be used at first as data storage? So if I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?
First and foremost, Hadoop is not a database. It is a distributed filesystem.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
The choice of database for a project depends on these factors:
Nature of the data storage and retrieval
If it is meant for transactions, it is highly recommended that you stick to an ACID database.
If it is to be used for web applications or random access, then you have a wide variety of choices, from the traditional SQL databases to newer technologies that support HDFS as a storage layer, like HBase (a minimal HBase client sketch appears at the end of this answer). Traditional databases are well suited for random access, as they fully support constraints and indexes.
If analytical batch processing is the concern, the choice can be made among all the available options based on structural complexity and volume.
Data Format or Structure
Most SQL databases support structured data (data that can be formatted into tables); some extend their support beyond that, for example to storing JSON.
If the data is unstructured, especially flat files, storing and processing it can easily be done with big-data technologies like Hadoop, Spark, or Storm (see the short Spark sketch after this list). Again, these technologies only come into the picture if the volume is high.
Different database technologies suit different data formats. For example, graph databases are well suited for storing structures that represent relationships or graphs.
Size
This is the next big concern: the more data you have, the greater the need for scalability. It is better to choose a technology that supports a scale-out architecture (Hadoop, NoSQL) than one that can only scale up; otherwise storage could become a bottleneck later, when you plan to keep more data.
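As a small, hypothetical illustration of the flat-file point above: a minimal Spark job using the Java API that scans raw text files straight out of HDFS. The path and the ERROR filter are made up, and the job is assumed to be launched with spark-submit, which supplies the cluster settings.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlatFileScan {
    public static void main(String[] args) {
        // Master/executor settings are expected to come from spark-submit.
        SparkSession spark = SparkSession.builder().appName("flat-file-scan").getOrCreate();

        // Read raw, unstructured flat files directly from HDFS; each line becomes one row.
        Dataset<Row> lines = spark.read().text("hdfs:///raw/logs/*.log");

        // A simple full scan: count the lines that mention ERROR.
        long errors = lines.filter(lines.col("value").contains("ERROR")).count();
        System.out.println("error lines: " + errors);

        spark.stop();
    }
}
```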
I am thinking about Hadoop, which has many functions for saving data in HDFS or accessing different databases for analytics.
Yes, you can use HDFS as your storage layer and use any of the HDFS-supported databases to do the processing (the choice of processing framework, from batch to near real time to real time, is another decision to make). Note that traditional relational databases do not support HDFS as storage. Some NoSQL databases, such as MongoDB, also integrate with Hadoop.
If I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?
This could be tricky depending upon which database you want to pair with afterwards.
HDFS is not a POSIX-compatible filesystem, so you can't just use it as general-purpose storage and then deploy any DB on top of it. The database you deploy must have explicit support for HDFS. There are a few options: HBase, Hive, Impala, Solr.
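Since HBase keeps coming up as the random-access option on top of HDFS, here is a minimal client sketch of that pattern: write a cell by row key, then read it back. The table name, column family, and values are hypothetical; it assumes the HBase client library is on the classpath, an hbase-site.xml pointing at your cluster, and that the 'users' table with a 'profile' column family already exists.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user42", column family "profile", qualifier "email".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);

            // Random read by row key: the access pattern HBase is built for on top of HDFS.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```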
Could someone get into the technical details of why the NoSQL equivalents of MySQL joins are so much more expensive than SQL ones?
I'm not sure it is more expensive.
Joins in a non-distributed system are straightforward. The data's on the same server, and you simply go and find it.
In a distributed system, which just about every NoSQL database is, the data might be local or it might be remote. You have to do at least a calculation to determine the location of the data, then fetch it from the other servers.
It is a hard problem (maybe even a fool's errand) to optimize the cluster so that associated data lives on the "same server", because what counts as "associated" changes. Some distributed NoSQL databases, such as my company's Aerospike database, make a point of randomly distributing data so joins are all equal cost, even if slower. Some cases, like small, well-known tables with commonly read and infrequently written data, can easily be replicated on every server, but few NoSQL databases allow that (currently).
To create a comparison, you'd have to build a distributed SQL cluster using your favorite distributed SQL technology (such as MySQL Cluster) and compare client-side join patterns against the overhead of SQL parsing, optimizer runs, and then fetching from different parts of the cluster.
I've never seen such a benchmark. Most NoSQL use patterns eschew the normalize-and-join SQL patterns, so it is not well tested.
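To make the client-side "join" pattern concrete, here is a small sketch against a hypothetical key-value interface (standing in for whatever NoSQL driver you actually use). It shows the two costs described above: computing which node owns a key, and paying an extra network round trip per referenced record instead of letting a SQL engine resolve the join on the server.

```java
import java.util.List;
import java.util.Map;

public class ClientSideJoin {

    /** Hypothetical client interface; stands in for any NoSQL driver's point read. */
    interface KeyValueStore {
        Map<String, String> get(String key);   // record as field -> value, or null if absent
    }

    /** The "where does this key live" calculation: hash the key onto one of the cluster's nodes. */
    static int nodeFor(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    /**
     * A client-side "join": read the order, pull out its customer foreign key,
     * then issue a second read (possibly against a different server) for the customer.
     * Each extra hop is a network round trip, which is where the join cost shows up.
     */
    static List<Map<String, String>> joinOrderWithCustomer(KeyValueStore store, String orderKey) {
        Map<String, String> order = store.get(orderKey);
        Map<String, String> customer = store.get(order.get("customer_id"));
        return List.of(order, customer);
    }
}
```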
I have a database whose size could grow to 1 TB in a month. If I run a query directly, it takes a long time, so I was thinking of using Hadoop on top of the database; most of the time my queries would involve searching the entire database. I would have only one or two database instances, not more than that. After a while we purge the database.
So can we use the Hadoop framework, since it helps with processing large amounts of data?
Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
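As a rough sketch of that "simple query in Hive over your HDFS store" idea, again via JDBC with an invented host, path, and schema: define an external table over the raw files already in HDFS, then run a full-scan aggregate, which Hive executes as a batch job across the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExternalTableScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint, user, HDFS location, and table layout.
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
        try (Connection con = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = con.createStatement()) {

            // Point an external table at files already sitting in HDFS (nothing is copied).
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS raw_events " +
                "(ts STRING, level STRING, msg STRING) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                "LOCATION '/data/raw_events'");

            // A full-scan "query" then runs as a batch job over the whole dataset.
            ResultSet rs = stmt.executeQuery(
                "SELECT level, COUNT(*) FROM raw_events GROUP BY level");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```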
Hadoop works better on a distributed system. Moreover, 1 TB is not big data; for that, your relational database will do the job.
The real power of Hadoop comes when you have to process 100 TB or more of data, where relational databases fail.
If you look into HBase, it is fast, but it is not a substitute for your MySQL or Oracle.
Just for end-of-day data there will be billions of rows. What is the best way to store all that data? Is SQL Server 2008 good enough for that, or should I look towards a NoSQL solution like MongoDB? Any suggestions?
It would be nice to have one master DB with read/write permissions and one or more read-only replicas of it. Only the master database would be used for adding new prices to the storage. It would also be nice to be able to replicate OHLC prices for the most popular securities individually, in order to optimize read access.
This data will then be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB, which is in production doing this within the infrastructure of a few well-known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs; in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single-master, multi-replica replication (called "HA", for High Availability) to support exactly what you've outlined: read scalability. Berkeley DB HA will automatically fail over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB, you'll be able to meet your scalability and data-volume targets (billions of rows, tens of thousands of transactions per second, depending on your hardware, OS, and BDB configuration; see the 3n+1 benchmark with BDB for guidance) without issue.
When you start working on accessing that OHLC data, consider Berkeley DB's support for bulk-get, and make sure you use the B-tree access method (because your data has order, locality will give you much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps by symbol or even by time). Finally, because you'll be replicating the data, you can relax the durability constraints to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy requires a quorum of replicas to ACK a write before considering it durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But first, read the replication manager getting-started guide and review the rep quote example, which already implements some of what you're trying to do (handy, eh?).
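As a rough sketch of the key/value layout described above: the snippet below uses Berkeley DB Java Edition (com.sleepycat.je) rather than the core C library, purely to keep the example in Java; the key design, B-tree ordering, and cursor range scan carry over, while the HA/replication setup, bulk-get, and partitioning APIs are omitted. Symbols, prices, and paths are invented.

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

import java.io.File;
import java.nio.ByteBuffer;

public class OhlcStore {
    public static void main(String[] args) {
        File dir = new File("/tmp/ohlc-env");
        dir.mkdirs();                                    // JE requires the environment directory to exist

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);                   // B-tree is the default access method
        Database db = env.openDatabase(null, "ohlc", dbConfig);

        // Key: symbol + timestamp so one security's bars sort together; value: packed OHLC doubles.
        String key = "AAPL:" + 1700000000000L;
        ByteBuffer ohlc = ByteBuffer.allocate(32)
                .putDouble(189.1).putDouble(190.4).putDouble(188.7).putDouble(190.0);
        db.put(null, new DatabaseEntry(key.getBytes()), new DatabaseEntry(ohlc.array()));

        // Range scan: position the cursor at the first "AAPL:" key and walk forward in key order.
        DatabaseEntry k = new DatabaseEntry("AAPL:".getBytes());
        DatabaseEntry v = new DatabaseEntry();
        Cursor cursor = db.openCursor(null, null);
        OperationStatus status = cursor.getSearchKeyRange(k, v, LockMode.DEFAULT);
        while (status == OperationStatus.SUCCESS && new String(k.getData()).startsWith("AAPL:")) {
            double open = ByteBuffer.wrap(v.getData()).getDouble();
            System.out.println(new String(k.getData()) + " open=" + open);
            status = cursor.getNext(k, v, LockMode.DEFAULT);
        }
        cursor.close();

        db.close();
        env.close();
    }
}
```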
Just for the record and full disclosure: I work as a product manager at Oracle on the Berkeley DB products, and have for the past nine years, so I'm a tad biased. I'd guess that the other solutions, SQL-based or not, might eventually give you a working system, but I'm confident that Berkeley DB can do it without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be heavy-duty system-managed storage, like IBM's DFSMS.