Can anyone tell me what the difference is between the Apache HBase database and Bigtable? Or are they the same?
Which one supports relations, if any?
If they are both big, distributed data stores, what is the difference?
They are similar, but not the same!
Bigtable was initially released in 2005, but it wasn't available to the general public until 2015. Apache HBase was created based on Google's publication "Bigtable: A Distributed Storage System for Structured Data", with an initial release in 2008.
Some similarities:
Both are NoSQL. That means neither supports joins, transactions, typed columns, etc.
Both can handle significant amounts of data - petabyte-scale! This is achieved through support for linear horizontal scaling.
Both emphasize high availability - through replication and versioning.
Both are schema-free: you can create a table and add column families or columns later (there is a small sketch at the end of this answer).
Both have APIs for the most popular languages - Java, Python, C#, C++. The complete lists of supported languages differ a bit.
Both support the Apache HBase Java API: after Apache HBase's success, Google added support for an HBase-like API to Bigtable, but with some limitations - see the API differences.
Some differences:
Apache HBase is an open source project, while Bigtable is not.
Apache HBase can be installed in any environment; it uses Apache Hadoop's HDFS as its underlying storage. Bigtable is available only as a cloud service from Google.
Apache HBase is free, while Bigtable is not.
While some APIs are common, others are not - Bigtable supports a gRPC (protobuf-based) API, while Apache HBase has Thrift and REST APIs.
Apache HBase supports server-side scripting (e.g. triggers) and in general is more open to extension due to its open-source nature.
Bigtable supports multi-cluster replication.
Apache HBase always provides immediate consistency, while Bigtable may offer only eventual consistency in worst-case scenarios.
Different security models - Apache HBase uses Access Control Lists, while Bigtable relies on Google's Cloud Identity and Access Management.
See more on their websites - Bigtable and Apache HBase.
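To make the schema-free, column-family model mentioned above concrete, here is a minimal sketch using the Python happybase client against HBase's Thrift gateway (the host, table, and column-family names are hypothetical; Bigtable's HBase-compatible client is Java-only):

    import happybase

    # Connect through HBase's Thrift gateway (default port 9090); the host is a placeholder.
    connection = happybase.Connection('hbase-thrift-host')

    # Create a table with a single column family. Individual columns are not
    # declared up front - they appear as soon as you write to them.
    connection.create_table('users', {'profile': dict()})

    table = connection.table('users')
    table.put(b'user:1001', {b'profile:name': b'Alice',
                             b'profile:email': b'alice@example.com'})
    print(table.row(b'user:1001'))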
They are similar enough that Google now offers access to Bigtable via HBase 1.0 APIs: cloud.google.com/bigtable
Older versions of Redis supported multiple DBs, but since the latest version does not support multiple DBs, I would like to know whether Cassandra or Mongo can be used instead?
I require multiple DBs because I use the same Redis instance to support different application databases.
Redis and Cassandra are very different beasts - and used for different goals. Where Redis is mostly in-memory storage (like caches), Cassandra is built to store your data on disk.
You could define multiple keyspaces and multiple tables (within those keyspaces) to emulate the 'multiple DBs' that Redis offers (see the sketch below), but again, I think you'd probably be using the wrong tool for the job.
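A rough sketch of that keyspace-per-application idea with the DataStax Python driver (the contact point, keyspace, and table names are hypothetical):

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()

    # One keyspace per application plays the role a numbered Redis DB used to play.
    for ks in ('app_orders', 'app_sessions'):
        session.execute(
            f"CREATE KEYSPACE IF NOT EXISTS {ks} "
            "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
        )

    session.set_keyspace('app_sessions')
    session.execute("CREATE TABLE IF NOT EXISTS kv (key text PRIMARY KEY, value text)")
    session.execute("INSERT INTO kv (key, value) VALUES (%s, %s)", ('greeting', 'hello'))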
I have been reading about the low-latency capabilities that the HBase database system offers on top of Hadoop. While most Hadoop data stores are geared toward batch map/reduce workloads, HBase appears to have low-latency update/delete features as well.
Is HBase a good candidate to be used to replace existing live application databases?
I do use HBase as a back end for a client-facing web application. It all depends on how the data is structured in HBase for fast retrieval (it all ties back to RowKey design) and how updates/CRUD operations are handled (adding versions).
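Here is a minimal sketch of those two points with the Python happybase client (the host, table, and key layout are hypothetical): a composite row key such as <customer_id>#<reversed_timestamp> keeps a customer's newest rows adjacent, so client-facing reads become short prefix scans, and updates are just new cell versions.

    import happybase

    connection = happybase.Connection('hbase-thrift-host')
    table = connection.table('orders')

    # Reversed epoch millis in the key gives newest-first ordering within a customer.
    row_key = b'cust42#9999998760000000'
    table.put(row_key, {b'd:status': b'SHIPPED', b'd:total': b'19.99'})

    # An update writes a new cell version; older versions remain readable,
    # up to the column family's max_versions setting.
    table.put(row_key, {b'd:status': b'DELIVERED'})
    print(table.cells(row_key, b'd:status', versions=3, include_timestamp=True))

    # A prefix scan over one customer's keys serves the web UI with low latency.
    for key, data in table.scan(row_prefix=b'cust42#', limit=20):
        print(key, data)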
Additional references:
HBase as web app backend
hbase as database in web application
The answer is YES: one can replace an existing database, provided the primary objectives of the application (especially performance) are evaluated carefully.
I'm trying to find a database solution and I came across Infobright and Amazon Redshift as potential solutions. Both are columnar databases. Infobright has been around for quite some time, whereas Amazon Redshift is newer.
What is the DBA effort between Infobright and Amazon Redshift?
How accessible is Infobright (API, query interface, etc.) vs AWS?
Where do both sit in your system architecture? Do they operate as a layer on top of your traditional RDBMS?
What is the DevOps effort to set up Infobright and Redshift?
I'm leaning a bit more towards Redshift because my application is hosted on AWS, and I thought this would create tangible benefits in the long run since everything is in AWS. Thank you in advance!
Firstly, I'll admit that I work for Infobright. I've done significant research into Redshift, and I feel I can give an honest opinion. I just wrote up a comparison between the two technologies; it can be found here: https://www.infobright.com/wp-content/plugins/download-monitor/download.php?id=37
DBA Effort - Infobright requires very little administration. You cannot index; you don't need to partition, etc. It's an SMP architecture and scales well, so you won't be dealing with multiple nodes. Redshift is also fairly simple. You will need to maintain sorts as well as ensure ANALYZE is run often enough.
Infobright uses a MySQL shell. Thus, any tool that can utilize MySQL can utilize Infobright, so you have the same set of tools/interfaces/APIs for Infobright as you do with MySQL. Redshift also has a SQL interface and some API capabilities, but it requires that you load data directly from S3 (see the sketch below). Infobright loads from flat files and named pipes on local or remote servers.
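For context, an S3 load into Redshift typically looks like the following sketch, run over Redshift's PostgreSQL-compatible connection with psycopg2 (the cluster endpoint, credentials, table, S3 path, and IAM role are all placeholders):

    import psycopg2

    conn = psycopg2.connect(
        host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439,
        dbname='analytics',
        user='admin',
        password='...',
    )
    with conn, conn.cursor() as cur:
        # COPY pulls the files straight from S3 into the target table.
        cur.execute("""
            COPY events
            FROM 's3://my-bucket/events/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV;
        """)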
Both databases are analytic databases. You would not want to use either as a transactional database. Instead, you typically push data from your transactional system to your analytic database.
The DevOps effort to set up Infobright will be lower than for Redshift. However, Redshift is not overly complicated either. Maintenance of the environment is more of an ongoing requirement for Redshift, though.
Infobright does have many installations running on AWS. In fact, we have implementations that approach nearly 100TB of raw storage on one server. That said, Redshift with many nodes can reach petabyte scale in a single implementation.
There are other factors that can impact your choice. For example, Redshift has very nice failover/HA options already built in. On the flip side, Infobright can support many concurrent queries and users; Redshift limits concurrent queries to 15 regardless of cluster size.
Take a look at the document, and feel free to contact me if you have any specific questions about either technology.
We have multiple databases that we query to generate reports. Since we have to create complex queries and do a lot of joins, etc., is it a good idea to use Cassandra, Hadoop, or Elasticsearch to load the data (via daily load jobs or incremental updates) and query that database for all of these tasks?
Which would be the preferred choice: Cassandra, Hadoop, Elasticsearch, or MongoDB?
We also want to build a Web UI for reporting and analytics on the consolidated database.
I cannot recommend MongoDB. It is subpar for big-data analysis: its MapReduce implementation is poor - MapReduce there is slow and single-threaded. Cassandra + Hadoop or HDFS + Hadoop is your choice. With Hadoop you are not limited to one storage type; you can flush (or store initially) your data in HDFS and iterate over it with MapReduce.
If you need durability, look at Cassandra. First, Cassandra is very easy to maintain and very reliable. I believe Cassandra is the most reliable NoSQL DB in the world. It is fully horizontally scalable: no name nodes, no master/slaves - all nodes have equal roles.
With Elasticsearch you can only do search. If you have a lot of data and you need analytics, you should look towards Hadoop and MapReduce.
With Hadoop you can start using Hive or Pig - the most powerful MapReduce abstractions I've ever seen. With Hadoop you can even start thinking about migrating to Spark/Shark.
Cassandra would be best if your choice is limited to those three, since writing joins in MapReduce programs involves a lot of effort: you end up writing and chaining multiple MapReduce programs to get a single join right. If your options are open, Apache Hive can be leveraged for non-interactive or reporting applications, as it supports quite a few SQL features such as joins, group by, order by, etc. Apache Hive supports SQL-like queries, so it wouldn't feel much different from traditional SQL (a small example follows below).
You could also consider Apache Drill, Hortonworks Stinger and Cloudera Impala for interactive reporting applications.
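To illustrate the Hive suggestion, here is a rough sketch of a reporting join run through the PyHive client (the host, database, and table names are hypothetical):

    from pyhive import hive

    conn = hive.connect(host='hive-server', port=10000, database='reporting')
    cur = conn.cursor()

    # A join plus aggregation expressed as plain SQL instead of hand-written,
    # chained MapReduce jobs; Hive compiles this into MapReduce behind the scenes.
    cur.execute("""
        SELECT c.name, SUM(o.amount) AS total_spent
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total_spent DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)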
I am currently examining different NoSQL databases and RDBMSes with regard to their replication abilities, in order to build distributed systems.
Reading through several papers and books, I get the feeling that some vendors or authors use their own definitions for the terms
Master-Master Replication (Replication between two servers)
Master-Slave Replication (replication between multiple servers in order to increase read speed; writes are only possible on the master server)
Multi-Master Replication (= Peer-To-Peer?)
Peer-To-Peer Replication (replication between n nodes, each can read/write)
Merge Replication (?)
E.g.: some treat the terms Master-Master and Peer-to-Peer as the same, while in the MySQL docs, for instance, I found that a distinction is made between Master-Master and Multi-Master (= Peer-to-Peer???) replication.
Where is the difference in Multi-Master and Peer-to-Peer replication?
Is Multi-Master replication's use case more oriented towards clustering, while Peer-to-Peer targets distributing content to distributed applications?
I would like to sort things out and be sure that I have the right understanding of these terms, so maybe a discussion here would help merge some knowledge.
Regards, Chris
Edit: added merge replication to the list and some explanations as I understand them...
Regarding CouchDB, the story is simple. Here it is:
There is only one replication mode for CouchDB. The source copies all its data to the target, subject to an optional yes/no filter. I described CouchDB replication in another question. The key point is that "replication" is simply a DB client. It connects to both couches, reads from the source, and writes to the target.
Any other big-picture architecture (peer-to-peer, multi-master, master-slave) is just an implementation choice of the developers or the system administrators. For example, if GETs are distributed to many couches, but POSTs go to one central couch which replicates to the others, that is effectively master-slave. If you put a CouchDB in every major city for performance, and they replicate directly with each other, that is multi-master replication.
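Kicking off such a topology yourself is just a matter of telling a couch what to pull from or push to; here is a minimal sketch against CouchDB's /_replicate endpoint (the hosts, credentials, and database names are hypothetical):

    import requests

    # Continuously push changes from the central couch's 'appdata' DB
    # to a remote couch; this is the "DB client" described above.
    resp = requests.post(
        'http://admin:secret@central.example.com:5984/_replicate',
        json={
            'source': 'appdata',
            'target': 'http://admin:secret@edge.example.com:5984/appdata',
            'continuous': True,
            # 'filter': 'ddoc/by_region',  # the optional yes/no filter mentioned above
        },
    )
    print(resp.status_code, resp.json())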
Within the CouchDB community, and especially from Chris Anderson's projects and presentations, "peer-to-peer" replication is a concept where CouchDB is everywhere: mobile phones, data centers, telephone poles. And replication happens directly between couches in a decentralized way, without a central authority or architecture, like the web itself.