Cassandra (Data Replication From Database For BI) - database

We have multiple databases which we query to generate reports. Since we have to create complex queries and do lots of joins etc., is it a good idea to use Cassandra, Hadoop, or Elasticsearch to load the data (daily jobs or incremental updates) and query that database for all these tasks?
Which would be the preferred choice: Cassandra, Hadoop, Elasticsearch, or MongoDB?
We also want to build a Web UI for reporting and analytics on the consolidated database.

I cannot recommend MongoDB. It's subpar for big data analysis: its Map-Reduce implementation is poor, slow, and single-threaded. Cassandra + Hadoop or HDFS + Hadoop is your choice. In the case of Hadoop you are not limited to one storage type; you can flush (or store initially) your data in HDFS and iterate over it with MapReduce.
If you need durability, look at Cassandra. Cassandra is very easy to maintain and very reliable; I believe it is the most reliable NoSQL DB in the world. It's fully horizontally scalable: no name nodes, no master/slaves, all nodes are equal in rights.
With Elasticsearch you can only do search. If you have a lot of data and you need analytics, you should look towards Hadoop and MapReduce.
With Hadoop you can start using Hive or Pig, the most powerful map-reduce abstractions I've ever seen. With Hadoop you can even start thinking about migrating to Spark/Shark.

Cassandra would be the best if your choice is limited to those three, as writing joins in MapReduce involves a lot of effort: you need multiple, chained MapReduce programs to get one join right. If your options are open, Apache Hive can be leveraged for non-interactive or reporting applications, as it supports quite a number of SQL features such as joins, group by, order by, etc. Hive supports SQL-like queries, so it wouldn't be much different from traditional SQL.
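To give a feel for that join-chaining effort, here is a toy reduce-side join in plain Python (not real Hadoop code; the table and field names are made up for illustration). Each join key requires a full map, shuffle, and reduce pass, which is why multi-table joins turn into chains of jobs:

```python
from collections import defaultdict

def reduce_side_join(orders, customers):
    """Simulate a reduce-side join: tag each record with its source table,
    shuffle by the join key, then combine matching pairs in the reducer."""
    # Map phase: emit (join_key, tagged record) pairs.
    mapped = [(o["cust_id"], ("order", o)) for o in orders]
    mapped += [(c["cust_id"], ("customer", c)) for c in customers]

    # Shuffle phase: group all tagged records by join key.
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)

    # Reduce phase: emit the cross product of orders x customers per key.
    joined = []
    for key, records in groups.items():
        order_rows = [r for tag, r in records if tag == "order"]
        cust_rows = [r for tag, r in records if tag == "customer"]
        for o in order_rows:
            for c in cust_rows:
                joined.append({**o, **c})
    return joined

orders = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 40}]
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 3, "name": "Globex"}]
result = reduce_side_join(orders, customers)  # only cust_id 1 matches both sides
```

Joining a third table would mean feeding `result` through another full pass like this, which is the chaining Hive generates for you automatically.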
You could also consider Apache Drill, Hortonworks Stinger and Cloudera Impala for interactive reporting applications.

Related

Can we use snowflake as database for Data driven web application?

I am an ASP.NET MVC/SQL Server developer and I am very new to all this, so I may be on a completely wrong path.
I came to know by googling that Snowflake can put/get data from AWS S3, Google Storage, and Azure, and that Snowflake has its own database and tables as well.
I have the following questions:
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? And if yes, could you provide a link or something to get started?
Once again, I am very new to all this and am hoping to get ideas and the best way to work around this.
Thank you in advance.
Why should one use Snowflake when you can compute your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform; like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports SQL queries; as mentioned above, there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database, though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
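The "query files directly vs. load first" trade-off can be felt with a loose analogy using only the Python standard library. Here `csv` text stands in for a file in cloud storage and an in-memory SQLite table stands in for a loaded warehouse table; none of this is Snowflake's actual API, but the shape of the trade-off is the same: an external-table-style query re-parses the raw file on every read, while a COPY-style load pays the parse cost once.

```python
import csv
import io
import sqlite3

# Raw "file in cloud storage": here just a CSV held in memory.
raw = "region,sales\neu,100\nus,250\neu,50\n"

# Option 1: query the file directly (external-table style) --
# every query re-parses the raw text.
def total_sales_from_file(text, region):
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(r["sales"]) for r in reader if r["region"] == region)

# Option 2: COPY-style load into a database table first, then query
# with SQL; parsing happens once, at load time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
rows = [(r["region"], int(r["sales"]))
        for r in csv.DictReader(io.StringIO(raw))]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

loaded = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'eu'").fetchone()[0]

assert total_sales_from_file(raw, "eu") == loaded == 150
```

Both paths return the same answer; the loaded table simply amortizes the parsing and can be indexed and optimized, which is why dedicated analytical storage usually queries faster.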
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example Snowflake doesn't support features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that: in my experience, every time I've created a database for an application, it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you, you don't want to use them.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads, but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing referential integrity is a double-edged sword, the downside being that as the data volume grows, the referential-violation check significantly slows down inserts and deletes. This results in the developer putting the RI check in the application (with a dirty read) and turning off RI enforcement in the database, finally ending up in a Snowflake-like situation.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
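To make concrete what database-enforced RI means and where its write cost comes from, here is a minimal sketch using Python's built-in sqlite3 (table names are made up; SQLite is just a convenient stand-in for any RI-enforcing database). Every child-row insert triggers a lookup against the parent table, and an orphan row is rejected outright:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customers(id))""")

conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # parent exists: accepted

try:
    # Each insert pays for a parent-table lookup; a row with no matching
    # customer violates the foreign key and is rejected.
    conn.execute("INSERT INTO orders VALUES (11, 99)")
    violated = False
except sqlite3.IntegrityError:
    violated = True

assert violated
```

That per-write lookup is exactly the overhead that grows with data volume, and it is the check Snowflake and other MPP databases decline to perform.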

Database selection for performance and scalability for C# APIs

I've been working on a project for a dating-like app, kind of like Tinder/Bumble. I've been debating which database to use: Cassandra or MongoDB. So far I have experience only with MS SQL, MySQL, and UniData. I've been looking into Cassandra and MongoDB because of scalability, but I've heard Tinder had issues with their MongoDB, to the point they had to call in for help. Even if it is not either of those two, what else would you suggest? Learning a new DB would not be an issue for me, but I am looking for performance and scalability. The main programming language will be C# (if it helps) and preferably I am looking at building this in the cloud (Azure Cosmos DB, AWS DynamoDB, or similar). My thoughts are a NoSQL DB because of scalability, but I wouldn't be opposed to an RDBMS if there is a strong reason.
Suggestions, comments, thoughts?
Cassandra has some advantages over MongoDB:
There is no master/slave in Cassandra; any node can receive any query. If the master goes down in MongoDB, you'll face a little downtime.
It is easy to scale Cassandra; adding a node is not a challenge.
Writes are very fast.
Reads by primary key are fast.
Also
There is no aggregation in Cassandra.
Performance is bad under very heavy update/delete load (accumulating tombstones causes a bad performance impact: http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html).
It is not efficient for full-text search applications.
No transactions.
No joins.
Secondary indexes are not equivalent to RDBMS indexes and should not be used very often.
So you cannot use Cassandra for every use case. If your data model does not fit Cassandra, consider another DB that fits your requirements.
Also take a look at : https://blog.pythian.com/cassandra-use-cases/
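The "no master, any node can receive any query" and "fast primary-key reads" points both follow from Cassandra's token ring: every node can hash a partition key to a token and locally work out which node owns it, with no master to ask. A minimal consistent-hash sketch in plain Python (real Cassandra uses Murmur3 tokens and virtual nodes; the node names here are made up):

```python
import bisect
import hashlib

class TokenRing:
    """Minimal consistent-hash ring: each node owns one token; a partition
    key hashes to a token and routes to the next node clockwise."""

    def __init__(self, nodes):
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key):
        # Real Cassandra uses Murmur3; MD5 is fine for a sketch.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, partition_key):
        t = self._token(partition_key)
        tokens = [tok for tok, _ in self.ring]
        i = bisect.bisect(tokens, t) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["node-a", "node-b", "node-c"])
# Any coordinator computes the owner locally -- no master lookup needed,
# and a primary-key read goes straight to the owning replica.
owner = ring.node_for("user:42")
assert owner == ring.node_for("user:42")  # deterministic: all nodes agree
```

This is also why queries that don't include the partition key are expensive in Cassandra: without a key to hash, there is no single node to route to.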

What are the practical industry applications of Apache Hive?

I've been researching Apache Hive for the past month, and all I've managed to find are articles stating what Hive actually is (by Apache), how to install it, and how to create tables in it.
I've never once found anything stating its actual practical use in the industry, even in a small brand company. Is Hive really not that popular in the industry compared to other data warehouses/databases?
Apache Hive is the first "SQL on Hadoop" framework; it translates your SQL queries into Map-Reduce jobs.
It's meant more for batch processing than for interactive response times. (I would leave Hive on Spark, Hive on Tez, etc. outside of this discussion.)
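As a rough illustration of that translation, a query like `SELECT word, COUNT(*) ... GROUP BY word` becomes a map, shuffle, and reduce pipeline. A toy version in plain Python (this is a sketch of the idea, not Hive's actual generated code):

```python
from collections import defaultdict

# Map phase: each input line becomes (key, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: all values for the same key are brought together
# (in a real cluster this is the expensive network step).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: the GROUP BY aggregate runs per key.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hive on hadoop", "hive translates sql"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
assert counts["hive"] == 2 and counts["hadoop"] == 1
```

The shuffle between map and reduce is why plain Hive-on-MapReduce is batch-oriented: each stage spills to disk, which is robust at scale but slow for interactive queries.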
We use Hive (along with Spark) in ELT pipelines to ingest and transform our raw datasets into "Data Vaults" and then further into Data Marts in our Hadoop environments. We've pretty much standardized on Parquet for those tables.
For BI dashboards, those Data Marts are queried by Impala. Some other production jobs use Spark SQL. Both Impala and Spark SQL are other "SQL on Hadoop" dialects (just like Hive) that can be used to access "big data"/Hadoop datasets.
That being said, we still use more traditional data warehouses (Oracle in our case) in the same projects, but we can only push a subset of the data there (because of the size/performance limitations of these traditional approaches).
To your question "even in a small brand company": I think if the key word here is "small", then you don't necessarily need Hive (or maybe any other "big data" technology). If your datasets are small and don't warrant the more scalable Big Data technologies, you should be fine, and maybe even more productive in your development efforts, with more traditional databases.
We use Hive on Tez along with other tools like Spark, Sqoop, etc. for ETL to build data marts in a 15 PB warehouse.
I have never been able to join 50 billion rows of data in a single query on any database but Hive. Hive's scalability is virtually unlimited.

Advantages of Hadoop in combination to any database

There are so many different databases.
relational databases
nosql databases
key/value
document store
wide columns store
graph databases
And database technologies
in-memory
column oriented
All have their advantages and disadvantages.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
I am thinking about Hadoop, which has many functions to save data in HDFS or access different databases for analytics.
Is it right to say that Hadoop can make it easier to choose the right database, because it can be used at first as data storage? So if I have Hadoop HDFS as my main data storage, can I still change my database for my application afterwards, or use multiple databases?
First and foremost, Hadoop is not a database. It is a distributed filesystem.
For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.
The choice of database for a project depends on these factors,
Nature of the data storage and retrieval
If it is meant for transactions, it is highly recommended that you stick to an ACID database.
If it is to be used for web applications or random access, then you have a wide variety of choices, from the traditional SQL databases to the latest technologies that support HDFS as a storage layer, like HBase. Traditional databases are well suited for random access, as they fully support constraints and indexes.
If analytical batch processing is the concern, the choice can be made among all the available options based on structural complexity and volume.
Data Format or Structure
Most SQL databases support structured data (data that can be formatted into tables); some extend their support beyond that, for storing JSON and the like.
If the data is unstructured, especially flat files, storing and processing it can easily be done with any big-data technology like Hadoop, Spark, or Storm. Again, these technologies come into the picture only if the volume is high.
Different database technologies play well for different data formats. For example, Graph databases are well suited for storing structures representing relationships or graphs.
Size
This is the next big concern: the more data you have, the greater the need for scalability. So it is better to choose a technology that supports a scale-out architecture (Hadoop, NoSQL) rather than scale-up; otherwise this could become a bottleneck in the future when you plan to store more.
I am thinking about Hadoop, which has many functions to save data in HDFS or access different databases for analytics.
Yes, you can use HDFS as your storage layer and use any of the HDFS-supported databases to do the processing (the choice of processing framework, from batch to near-real-time to real-time, is another decision to make). Note that relational databases do not support HDFS storage. Some NoSQL databases, like MongoDB, also support HDFS storage.
If I have Hadoop HDFS as my main data storage, can I still change my database for my application afterwards, or use multiple databases?
This could be tricky depending upon which database you want to pair with afterwards.
HDFS is not a POSIX-compatible filesystem, so you can't just use it as general-purpose storage and then deploy any DB on top of it. The database you deploy must have explicit support for HDFS. There are a few options: HBase, Hive, Impala, Solr.

Need Suggestions: Utilizing columnar database

I am working on a project for a high-performance dashboard whose results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries fetching mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, count, group by, and many where conditions).
The issue is that as the data grows, the DB queries are taking more time than defined/agreed. I am thinking of moving the aggregation functionality (all the counts) to a columnar database, say HBase, while the rest of the linear data will be fetched from Oracle. The two sets of data will be merged on a key at the app layer. I need expert opinions on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only tables? On a continuous basis or one time?
2. If a record is modified (e.g. its status changes), how will HBase get to know?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near-real-time queries, you can go with an MPP database. Commercial options are Vertica or maybe Redshift from Amazon. For an open-source solution, you can use Infobright.
These columnar options are going to give you great aggregate query performance.
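The reason columnar layouts win on aggregates can be shown with a toy contrast in plain Python (this illustrates the storage idea only; it is not Hive, Vertica, or Infobright code, and the field names are made up). A row store must touch every whole record to sum one field, while a column store reads just one contiguous array:

```python
# Row store: each record keeps all of its fields together.
rows = [
    {"id": 1, "status": "open",   "amount": 100},
    {"id": 2, "status": "closed", "amount": 40},
    {"id": 3, "status": "open",   "amount": 60},
]

# Column store: one array per column. An aggregate touches only the
# columns it needs and skips every other field entirely.
columns = {
    "id":     [1, 2, 3],
    "status": ["open", "closed", "open"],
    "amount": [100, 40, 60],
}

row_total = sum(r["amount"] for r in rows)  # scans every whole record
col_total = sum(columns["amount"])          # scans one contiguous array
assert row_total == col_total == 200
```

With wide tables and billions of rows, skipping the unneeded columns (plus the compression that similar adjacent values allow) is where the aggregate speed-up comes from.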
