Google Cloud Bigtable vs Google Cloud Datastore

What is the difference between Google Cloud Bigtable and Google Cloud Datastore / App Engine datastore, and what are the main practical advantages/disadvantages? AFAIK Cloud Datastore is built on top of Bigtable.

Based on experience with Datastore and reading the Bigtable docs, the main differences are:
- Bigtable was originally designed for HBase compatibility, but now has client libraries in multiple languages; Datastore was originally geared more towards Python/Java/Go web app developers (originally App Engine).
- Bigtable is 'a bit more IaaS' than Datastore in that it isn't 'just there': it requires a cluster to be configured.
- Bigtable supports only one index: the 'row key' (the equivalent of the entity key in Datastore). This means queries run against the key, unlike Datastore's indexed properties.
- Bigtable supports atomicity only within a single row; there are no transactions, so mutations and deletions spanning multiple rows are not atomic. Datastore offers transactions, and provides eventual or strong consistency depending on the read/query method.
- The billing model is very different: Datastore charges for read/write operations, storage, and bandwidth, while Bigtable charges for 'nodes', storage, and bandwidth.
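The single-index difference above can be illustrated with a small in-memory sketch (plain Python, no Cloud client libraries, all names hypothetical): Bigtable's only access path is a scan over sorted row keys, while Datastore's secondary indexes let you filter on any indexed property.

```python
# Hedged sketch: in-memory stand-ins for the two data models.

# Bigtable-style: rows sorted by a single row key; the only "query" is a key range scan.
bigtable = {
    "user#alice#2023-01-01": {"event": "login"},
    "user#alice#2023-01-02": {"event": "purchase"},
    "user#bob#2023-01-01": {"event": "login"},
}

def scan_prefix(table, prefix):
    """Range scan over the sorted row keys -- the only access path available."""
    return {k: v for k, v in sorted(table.items()) if k.startswith(prefix)}

# Datastore-style: entities with indexed properties; any property can be filtered on.
datastore = [
    {"user": "alice", "date": "2023-01-01", "event": "login"},
    {"user": "alice", "date": "2023-01-02", "event": "purchase"},
    {"user": "bob", "date": "2023-01-01", "event": "login"},
]

def query(entities, prop, value):
    """Property filter, as enabled by Datastore's secondary indexes."""
    return [e for e in entities if e[prop] == value]

print(scan_prefix(bigtable, "user#alice#"))  # both of alice's rows, via the key
print(query(datastore, "event", "login"))    # filter on a non-key property
```

The practical consequence: in Bigtable, anything you want to query by must be encoded into the row key up front.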

Bigtable is optimized for high volumes of data and for analytics:
- Cloud Bigtable doesn't replicate data across zones or regions (data within a single cluster is replicated and durable). This makes Bigtable faster and more efficient, and costs much lower, though it is less durable and available in the default configuration.
- It uses the HBase API, so there's no risk of lock-in or new paradigms to learn.
- It is integrated with the open-source big data tools, meaning you can analyze data stored in Bigtable with most of the analytics tools customers use (Hadoop, Spark, etc.).
- Bigtable is indexed by a single row key.
- Bigtable lives in a single zone.
- Cloud Bigtable is designed for larger companies and enterprises, which often have larger data needs with complex backend workloads.
Datastore is optimized to serve high-value transactional data to applications:
- Cloud Datastore has extremely high availability thanks to replication and data synchronization.
- Datastore, because of its versatility and high availability, is more expensive.
- Datastore is slower at writing data due to synchronous replication.
- Datastore has much better functionality around transactions and queries (since secondary indexes exist).

Bigtable and Datastore are extremely different. Yes, Datastore is built on top of Bigtable, but that does not make it anything like it. That is a bit like saying a car is built on top of wheels, and concluding that a car is not much different from wheels.
Bigtable and Datastore provide very different data models and very different semantics in how the data is changed.
The main difference is that the Datastore provides SQL-database-like ACID transactions on subsets of the data known as entity groups (though the query language GQL is much more restrictive than SQL). Bigtable is strictly NoSQL and comes with much weaker guarantees.
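The entity-group transaction semantics can be sketched with a toy model (hypothetical names, not real client-library code): in Datastore, all mutations within one entity group commit atomically or not at all, whereas Bigtable only guarantees atomicity within one row.

```python
# Hedged sketch of Datastore-style entity-group ACID semantics.
# Bigtable offers atomicity only within a single row; a multi-row change
# there could be observed half-applied. This toy entity group commits
# a batch of mutations all-or-nothing.

class EntityGroup:
    """Toy model: all entities under one root commit together or not at all."""
    def __init__(self):
        self.entities = {}

    def transact(self, mutations):
        # Apply to a staged copy first; install it only if every mutation succeeds.
        staged = dict(self.entities)
        for key, value in mutations.items():
            if value is None:
                raise ValueError(f"invalid mutation for {key}")
            staged[key] = value
        self.entities = staged  # atomic swap: all-or-nothing

group = EntityGroup()
group.transact({"account/1": 100, "account/2": 50})
try:
    # One bad mutation aborts the whole transaction...
    group.transact({"account/1": 0, "account/2": None})
except ValueError:
    pass
# ...so neither change is visible.
print(group.entities)  # {'account/1': 100, 'account/2': 50}
```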

I am going to try to summarize all the answers above, plus what is given in the Coursera course Google Cloud Platform Big Data and Machine Learning Fundamentals:
+---------------------+---------------------------------------------------------+--------------------------------------------+
| Category            | Bigtable                                                | Datastore                                  |
+---------------------+---------------------------------------------------------+--------------------------------------------+
| Technology          | Based on HBase (uses the HBase API)                     | Built on Bigtable itself                   |
| Access metaphor     | Key/value (column families), like HBase                 | Persistent hashmap                         |
| Read                | Scan rows                                               | Filter objects on property                 |
| Write               | Put row                                                 | Put object                                 |
| Update granularity  | Can't update a row in place (write a new row instead)   | Can update individual attributes           |
| Capacity            | Petabytes                                               | Terabytes                                  |
| Index               | Row key only (so design the key carefully)              | Any property of the object can be indexed  |
| Usage and use cases | High-throughput, scalable, flat data                    | Structured data for Google App Engine      |
+---------------------+---------------------------------------------------------+--------------------------------------------+

In terms of the published papers, Bigtable is described in the Bigtable paper and Datastore corresponds to Megastore. Datastore is Bigtable plus replication, transactions, and indexes (and is much more expensive).


A relatively minor point to consider: as of November 2016, the Bigtable Python client library is still in alpha, which means future changes might not be backward compatible. Also, the Bigtable Python library is not compatible with App Engine's standard environment; you have to use the flexible one.

Cloud Datastore is a highly scalable NoSQL database for your applications. Like Cloud Bigtable, there is no need for you to provision database instances. Cloud Datastore uses a distributed architecture to automatically manage scaling: your queries scale with the size of your result set, not the size of your data set. Cloud Datastore runs in Google data centers, which use redundancy to minimize impact from points of failure. Your application can still use Cloud Datastore while the service receives a planned upgrade.
Choose Bigtable if the data is:
Big
● Large quantities (>1 TB) of semi-structured or structured data
Fast
● Data is high throughput or rapidly changing
NoSQL
● Transactions, strong relational semantics not required
And especially if it is:
Time series
● Data is time-series or has natural semantic ordering
Big data
● You run asynchronous batch or real-time processing on the data
Machine learning
● You run machine learning algorithms on the data
Bigtable is designed to handle massive workloads at consistent low latency
and high throughput, so it's a great choice for both operational and analytical
applications, including IoT, user analytics, and financial data analysis.
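Since time-series data is a headline use case and the row key is the only index, a common pattern is to encode the series id and a zero-padded reversed timestamp into the key so that one series is contiguous and its newest readings sort first. A minimal sketch of that pattern (all names hypothetical):

```python
# Hedged sketch of a common time-series row-key pattern for a single-index store:
# prefix with the series id (one sensor's data stays contiguous) and append a
# zero-padded *reversed* timestamp (the newest readings sort first in a scan).
MAX_TS = 10**10

def row_key(sensor_id, ts):
    return f"{sensor_id}#{MAX_TS - ts:010d}"

rows = {row_key("sensor-7", ts): {"ts": ts} for ts in (1000, 2000, 3000)}

def latest(rows, sensor_id, n):
    """A prefix scan naturally yields the n most recent readings."""
    prefix = sensor_id + "#"
    keys = sorted(k for k in rows if k.startswith(prefix))
    return [rows[k]["ts"] for k in keys[:n]]

print(latest(rows, "sensor-7", 2))  # [3000, 2000]
```

The same idea also helps avoid hotspotting: keys that begin with a raw timestamp concentrate all current writes on one tablet.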

Datastore is more application-ready and suitable for a wide range of services, especially microservices.
The underlying technology of Datastore is Bigtable, so you can imagine Bigtable is more powerful.
Datastore comes with 20K free operations per day, so you can expect to host a service with a reliable DB at zero cost.
You can also check out this Datastore ORM library, which comes with a lot of great features:
https://www.npmjs.com/package/ts-datastore-orm

Related

Elasticsearch vs RDBMSs for Aggregations/Reporting Data

Has anyone had experience switching between Elasticsearch and a relational DB like MySQL/Postgres? What are the pros/cons of both?
Background: we are looking to build a dashboard UI to show store/item-related metrics and need the right tool on the backend, one that provides flexibility in queries. (Imagine that the UI has selectors for date ranges and then shows top items sold, total sales, etc. in different time-based charts.) One other note: we would only be using aggregations/nested aggregations (we wouldn't be taking advantage of text search) around stores or items.
I know you could use both but which one is preferable in terms of
performance? I imagine that they would be largely similar
durability? I imagine elasticsearch wins, since it automatically replicates data
maintenance? I imagine elasticsearch would be worse (maintaining a cluster vs maintaining a single node)
cost? I imagine an elasticsearch cluster storing the same amount of data would cost more because of replication
development work? I imagine elasticsearch would cause development to take longer using elasticsearch's custom queries vs writing APIs around sql queries
Are these assumptions correct?
Are there other dbs/data stores that I should consider over these 2 options?
Based on my experience, Elasticsearch is a superb tool for:
Search
Real-time data aggregation
Real-time reporting with extensive filtering support
We are also using Elasticsearch to power our real-time reports, which have extensive filter options (like date range, status, etc.).
We compared the aggregation performance of ES and MongoDB on a similar set of machines: for aggregating 5 million records, MongoDB took around 12 seconds while ES took under 1 second.
performance? I imagine that they would be largely similar
If you have a pure aggregation use case on loads of data requiring extensive filtering, searching, etc., then the performance of ES is hard to match.
durability? I imagine elastic search and it automatically replicates
data
Yes, ES does have inherent replication support, as it is a distributed system.
maintenance? I imagine elasticsearch would be worse (maintaining a
cluster vs maintaining a single node)
Distributed systems definitely demand more maintenance, but you can use a hosted version of ES (e.g. Amazon Elasticsearch Service) as well.
cost? I imagine an elasticsearch cluster storing the same amount of
data would cost more because of replication
Since a cluster with replication support is required, the infrastructure cost will be larger.
development work? I imagine elasticsearch would cause development to
take longer using elasticsearch's custom queries vs writing APIs
around sql queries
It depends on experience with ES. Since MySQL has been around for a long time, most dev folks are skilled with it; any new technology has its learning curve.
Keep in mind:
ES is not an ACID-compliant datastore, and there is no transaction support. If your system is purely transactional, then you may need a relational DB as the read/write store and ES for powering the aggregation use cases.
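For the dashboard use case described above, a typical Elasticsearch request body would filter by date range and then bucket by item with a sum sub-aggregation. A hedged sketch follows; the field names (`sold_at`, `item_id`, `amount`) are assumptions, not from the original post:

```python
# Hedged sketch of an Elasticsearch query body (standard query DSL):
# restrict to a date range, then compute the top items by total sales.
query_body = {
    "size": 0,  # aggregations only, no document hits
    "query": {
        "range": {"sold_at": {"gte": "2023-01-01", "lt": "2023-02-01"}}
    },
    "aggs": {
        "top_items": {
            "terms": {"field": "item_id", "size": 10},           # top 10 item buckets
            "aggs": {"total_sales": {"sum": {"field": "amount"}}}  # sum per bucket
        }
    },
}
```

This would be sent via a client such as the official `elasticsearch` package, e.g. `es.search(index="sales", body=query_body)`.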

User database ownership in webapps

I am developing a webapp where each user will access his own data.
Think about Pivotal Tracker and the like as an example, and assume each user will store 2 different data types like so:
table project
id | name
0 | foo
1 | bar
table story
id | name | effort
1 | baz | 5
2 | ex | 2
I can think of 2 solutions.
1) Provide each table with an additional user_id column so that each row is bound to its owner
2) Setup a new database schema for each new user
Personally, I lean towards 2) because it would provide stronger isolation (not bound to the application level).
What would be the recommended way, and why?
Solution 2 appears rather exotic to me. It means creating a new database schema each time a user is added. Now, if you have very few users, and new users are only added very rarely, this may be feasible. But if you have lots of users to accommodate, you will need an automatism to create those schemas on the fly. Sure, this is possible, but you will leave the grounds of the existing tools and frameworks that support your development. E.g., the Java Persistence API links a Java class to a table and won't support dynamic database definition.
Also, I have doubts concerning the security level. In a web app, the database "user" is not the actual human user behind the browser, but the application server, which owns a database connection for its entire runtime. Therefore, individual human users' access rights aren't handled by the database, but by the application.
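Option 1 (a shared schema with a `user_id` column) can be sketched in a few lines; here with SQLite for illustration, with tenancy enforced by the application on every query:

```python
import sqlite3

# Hedged sketch of option 1: one shared schema, a user_id column on each table,
# and application-level enforcement that every query is scoped to its owner.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE project (id INTEGER, user_id INTEGER, name TEXT)")
db.executemany("INSERT INTO project VALUES (?, ?, ?)",
               [(0, 1, "foo"), (1, 1, "bar"), (2, 2, "baz")])

def projects_for(user_id):
    # All data access goes through helpers that filter by owner.
    return db.execute("SELECT id, name FROM project WHERE user_id = ?",
                      (user_id,)).fetchall()

print(projects_for(1))  # [(0, 'foo'), (1, 'bar')]
```

The design choice here is that isolation lives in the data-access layer rather than in per-user schemas, which keeps ORMs and migrations simple.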

Google Bigtable vs BigQuery for storing large number of events

Background
We'd like to store our immutable events in a (preferably) managed service. Average size of one event is less than 1 Kb and we have between 1-5 events per second. The main reason for storing these events is to be able to replay them (perhaps using table scanning) once we create future services that might be interested in these events. Since we're in the Google Cloud we're obviously looking at Google's services as first choice.
I suspect that Bigtable would be a good fit for this, but according to the price calculator it'll cost us more than 1400 USD per month (which to us is a big deal).
Looking at something like BigQuery renders a price of 3 USD per month (if I'm not missing something essential).
Even though a schema-less database would be better suited for us we would be fine with essentially storing our events as a blob with some metadata.
Questions
Could we use BigQuery for this instead of Bigtable to reduce costs? For example BigQuery has something called streaming inserts which to me seems like something we could use. Is there anything that'll bite us in the short or long term that I might not be aware of if going down this route?
Bigtable is great for large (>= 1TB) mutable data sets. It has low latency under load and is managed by Google. In your case, I think you're on the right track with BigQuery.
FYI
Cloud Bigtable is not a relational database; it does not support SQL queries or joins, nor does it support multi-row transactions.
Also, it is not a good solution for small amounts of data (< 1 TB).
Consider these cases:
- If you need full SQL support for an online transaction processing (OLTP) system, consider Google Cloud SQL.
- If you need interactive querying in an online analytical processing (OLAP) system, consider Google BigQuery.
- If you need to store immutable blobs larger than 10 MB, such as large images or movies, consider Google Cloud Storage.
- If you need to store highly structured objects, or if you require support for ACID transactions and SQL-like queries, consider Cloud Datastore.
The overall cost boils down to how often you will 'query' the data. If it's a backup and you don't replay events too often, it'll be dirt cheap. However, if you need to replay it once daily, you start triggering the $5/TB-scanned charge too easily. We too were surprised how cheap inserts and storage were, but this is of course because Google expects you to run expensive queries on them at some point. You'll have to design around a few things, though. E.g., AFAIK streaming inserts have no guarantees of being written to the table, and you have to poll the tail of the list frequently to see whether a row was really written. Tailing can be done efficiently with a time-range table decorator, though (so you don't pay for scanning the whole dataset).
If you don't care about order, you can even list a table for free. No need to run a 'query' then.
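The trade-off above is easy to check with back-of-envelope arithmetic, using the question's numbers (up to 5 events/sec, ~1 KB each) and the assumed ~$5/TB-scanned query price:

```python
# Back-of-envelope cost check for the BigQuery route.
# Assumptions: ~$5 per TB scanned; storage and insert costs ignored here.
events_per_sec = 5            # upper bound from the question
event_size_kb = 1
seconds_per_month = 30 * 24 * 3600

gb_per_month = events_per_sec * event_size_kb * seconds_per_month / 1e6
print(f"data added per month: ~{gb_per_month:.0f} GB")

# Replaying (full-scanning) one year of accumulated data once a day, for a month:
tb_scanned = 12 * gb_per_month / 1e3 * 30
print(f"replay scan cost: ~${tb_scanned * 5:.0f}/month")
```

At this event rate the storage grows slowly, so even daily full replays stay far below Bigtable's fixed node cost; the picture changes only once the dataset or replay frequency grows substantially.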
This flowchart may help in deciding between the different Google Cloud storage offerings (disclaimer: this image is copied from Google Cloud's page).
If your use case is a live database (say, the backend of a website), Bigtable is what you need (though it's not really an OLTP system). If it is more of a data analytics / data warehouse kind of purpose, then BigQuery is what you need.
Think OLTP vs OLAP. Or, if you are familiar with Cassandra and Hadoop: Bigtable roughly equates to Cassandra, and BigQuery roughly equates to Hadoop (agreed, not a fair comparison, but you get the idea).
https://cloud.google.com/images/storage-options/flowchart.svg
Please keep in mind that Bigtable is not a relational database; it's a NoSQL solution without SQL features like JOINs. If you want an RDBMS for OLTP, you might need to look at Cloud SQL (MySQL/Postgres) or Spanner.
Cloud Spanner is relatively young, but it is powerful and promising. At least, Google's marketing claims its features are the best of both worlds (traditional RDBMS and NoSQL).
Cost Aspect
The cost aspect is already covered nicely here: https://stackoverflow.com/a/34845073/6785908
I know this is a very late answer, but I'm adding it anyway in case it helps somebody else in the future.
Hard to summarize better than it is already done by Google.
I think you need to figure out how you are going to use (replay) your data (events); this can help you in making the final decision.
So far, BigQuery looks like the best choice for you.
Bigtable is a distributed database (it runs on clusters) for applications that manage massive data. It's designed for massive unstructured data, scales horizontally, and is organized into column families. It stores data as key-value pairs, as opposed to relational or structured databases.
BigQuery is a data warehouse application. That means it provides connections to several data sources or streams, so that they can be extracted, transformed, and loaded into a BigQuery table for further analysis. Unlike Bigtable, it does store data in structured tables and supports SQL queries.
Use cases: if you want to do analytics or business intelligence by deriving insights from data collected from different sources (applications, research, surveys, feedback, logs, etc.) across your organisation, you may want to pull all this information into one location. That location will most likely be a BigQuery data warehouse.
If you have an application that collects big data, in other words massive amounts of information per unit time (high volume) at high speed (high velocity) and in unstructured, inconsistent forms with different data types such as audio, text, video, and images (variety and veracity), then your probable choice of database for that app would be Bigtable.

GAE DataStore vs Google Cloud SQL for Enterprise Management Systems

I am building an application that is an enterprise management system using gae. I have built several applications using gae and the datastore, but never one that will require a high volume of users entering transactions along with the need for administrative and management reporting. My biggest fear is that when I need to create cross-tab and other detailed reports (or business intelligence reporting and data manipulation) I will be facing a mountain of problems with gae's datastore querying and data pull limits. Is it really just architectural preference or are there quantitative concerns here?
In the past I have built systems using C++/c#/Java against an Oracle/MySql/MSSql (with a caching layer sprinkled in for some added performance on complex or frequently accessed db results).
I keep reading that we are to throw away the old mentality of relational data and move to the new world of the big McHashTable in the sky... but new isn't always better. Any insight or experience on the above would be helpful.
From the Cloud SQL FAQ:
Should I use Google Cloud SQL or the App Engine Datastore?
This depends on the requirements of the application. Datastore provides NoSQL key-value storage that is highly scalable, but does not support the complex queries offered by a SQL database. Cloud SQL supports complex queries and ACID transactions, but this means the database acts as a 'fixed pipe' and performance is less scalable. Many applications use both types of storage.
If you need a lot of writes (~XXX per/s) to a db entity with distributed keys, that's where the Google App Engine datastore really shines.
If you need support for complex, arbitrary user-crafted queries, that's where Google Cloud SQL is more convenient.
What scares me more in the GAE datastore is the index number limitation. For example, if you need to search by some field or sort by one, you need an additional index, and in total you can have 200 indexes. If you have an entity with 10 searchable fields and you can sort by any field, there will be about 100 combinations, so you need 100 indexes. I have developed a few small projects for GAE, and those were success stories. But when a big one came, it was not for GAE.
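The 100-combinations figure above follows from composite-index arithmetic: with Datastore-style indexes, filtering on one field while sorting by another generally needs one composite index per (filter field, sort field) pair. A rough sketch of that count:

```python
# Rough sketch of the composite-index arithmetic: one index per
# (filter field, sort field) pair, before counting multi-field filters.
searchable_fields = 10
sortable_fields = 10

composite_indexes = searchable_fields * sortable_fields
print(composite_indexes)  # 100 -- already half of a 200-index limit
```

Real schemas often need even more, since every extra filter field in a combined query adds further composite indexes.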
About caching: you can do it with GAE, but the distributed cache works very slowly. I prefer to create a single resident backend instance with a RESTful API that holds cached values in memory; frontend instances call this API to get/set values.
Maybe it is possible to build a complex system with GAE, but it will end up being a set of small applications/services.

What database does Google use?

Is it Oracle or MySQL or something they have built themselves?
Bigtable
A Distributed Storage System for Structured Data
Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.
Some features
fast and extremely large-scale DBMS
a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
designed to scale into the petabyte range
it works across hundreds or thousands of machines
it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration
each table has multiple dimensions (one of which is a field for time, allowing versioning)
tables are optimized for GFS (Google File System) by being split into multiple tablets: segments of the table, split along a row boundary chosen such that each tablet will be ~200 megabytes in size.
Architecture
BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a time stamp. There can be multiple versions of a cell with different time stamps. The time stamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."
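The multidimensional sparse map with timestamped cells described above can be sketched in a few lines (hypothetical helper names); it supports exactly the two operations mentioned, reading the latest n versions of a cell and deleting cells older than a cutoff:

```python
# Hedged sketch of Bigtable's data model: a sparse map keyed by
# (row, column, timestamp), with versioned reads and time-based deletes.
cells = {}  # (row_key, column) -> {timestamp: value}

def put(row, col, ts, value):
    cells.setdefault((row, col), {})[ts] = value

def latest_versions(row, col, n):
    """Return the n most recent values, like "select n versions of this page"."""
    versions = cells.get((row, col), {})
    return [versions[t] for t in sorted(versions, reverse=True)[:n]]

def delete_older_than(row, col, cutoff):
    """Like "delete cells that are older than a specific date/time"."""
    versions = cells.get((row, col), {})
    for t in [t for t in versions if t < cutoff]:
        del versions[t]

put("com.cnn.www", "contents:", 3, "<html>v1")
put("com.cnn.www", "contents:", 5, "<html>v2")
put("com.cnn.www", "contents:", 6, "<html>v3")
print(latest_versions("com.cnn.www", "contents:", 2))  # ['<html>v3', '<html>v2']
delete_older_than("com.cnn.www", "contents:", 5)
```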
In order to manage the huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. A tablet is around 200 MB, and each machine saves about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing: if one table is receiving many queries, the serving machine can shed other tablets or move the busy tablets to another machine that is not so busy. Also, if a machine goes down, its tablets can be spread across many other servers so that the performance impact on any given machine is minimal.
Tables are stored as immutable SSTables and a tail of logs (one log per machine). When a machine runs out of system memory, it compresses some tablets using Google proprietary compression techniques (BMDiff and Zippy). Minor compactions involve only a few tablets, while major compactions involve the whole table system and recover hard-disk space.
The locations of Bigtable tablets are stored in cells. The lookup of any particular tablet is handled by a three-tiered system. Clients get a pointer to the META0 table, of which there is only one. The META0 table keeps track of many META1 tablets that contain the locations of the tablets being looked up. Both META0 and META1 make heavy use of prefetching and caching to minimize bottlenecks in the system.
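The splitting-at-row-boundaries behavior described above can be sketched as a simple greedy partition (a toy model, ignoring real split/merge heuristics): walk the sorted rows and cut a new tablet whenever the ~200 MB limit would be exceeded.

```python
# Hedged sketch of tablet splitting: rows stay sorted by key and the table
# is cut into tablets at row boundaries once a tablet would pass ~200 MB.
TABLET_LIMIT_MB = 200

def split_into_tablets(row_sizes_mb):
    """row_sizes_mb: list of (row_key, size_mb) pairs, assumed sorted by key."""
    tablets, current, current_size = [], [], 0
    for key, size in row_sizes_mb:
        if current and current_size + size > TABLET_LIMIT_MB:
            tablets.append(current)       # split at a row boundary
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets

rows = [(f"row{i:03d}", 60) for i in range(10)]  # 10 rows of 60 MB each
print([len(t) for t in split_into_tablets(rows)])  # [3, 3, 3, 1]
```

Because splits always fall between rows, a single row is never divided across tablets, which is also why atomicity is only guaranteed within a row.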
Implementation
BigTable is built on Google File System (GFS), which is used as a backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.
Another service that BigTable makes heavy use of is Chubby, a highly available, reliable distributed lock service. Chubby allows clients to take a lock, possibly associating it with some metadata, which they can renew by sending keep-alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.
There are three primary server types of interest in the Bigtable system:
Master servers: assign tablets to tablet servers, keep track of where tablets are located, and redistribute tasks as needed.
Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100 MB - 200 MB). If a tablet server fails, then 100 tablet servers each pick up 1 new tablet and the system recovers.
Lock servers: instances of the Chubby distributed lock service. Lots of actions within BigTable require acquisition of locks, including opening tablets for writing, ensuring that there is no more than one active master at a time, and access-control checking.
Example from Google's research paper:
A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
API
Typical operations in BigTable are creation and deletion of tables and column families, writing data, and deleting columns from a row. BigTable provides these functions to application developers in an API. Transactions are supported at the row level, but not across several row keys.
Here is the link to the PDF of the research paper.
And here you can find a video showing Google's Jeff Dean in a lecture at the University of Washington, discussing the Bigtable content storage system used in Google's backend.
It's something they've built themselves - it's called Bigtable.
http://en.wikipedia.org/wiki/BigTable
There is a paper by Google on the database:
http://research.google.com/archive/bigtable.html
Spanner is Google's globally distributed relational database management system (RDBMS), the successor to BigTable. Google claims it is not a pure relational system because each table must have a primary key.
Here is the link of the paper.
Spanner is Google's scalable, multi-version, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
Another database invented by Google is Megastore. Here is the abstract:
Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore's semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
As others have mentioned, Google uses a homegrown solution called BigTable and they've released a few papers describing it out into the real world.
The Apache folks have an implementation of the ideas presented in these papers called HBase. HBase is part of the larger Hadoop project which according to their site "is a software platform that lets one easily write and run applications that process vast amounts of data." Some of the benchmarks are quite impressive. Their site is at http://hadoop.apache.org.
Although Google uses BigTable for all their main applications, they also use MySQL for other (perhaps minor) apps.
And it's maybe also handy to know that BigTable is not a relational database (like MySQL) but a huge (distributed) hash table, which has very different characteristics. You can play around with (a limited version of) BigTable yourself on the Google AppEngine platform.
Next to Hadoop mentioned above there are many other implementations that try to solve the same problems as BigTable (scalability, availability). I saw a nice blog post yesterday listing most of them here.
Google primarily uses Bigtable.
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size.
For more information, download the document from here.
Google also uses Oracle and MySQL databases for some of their applications.
Any more information you can add is highly appreciated.
Google services have a polyglot persistence architecture. Bigtable is leveraged by most of its services, such as YouTube, Google Search, and Google Analytics. The search service initially used MapReduce for its indexing infrastructure but later transitioned to Bigtable during the Caffeine release.
Google Cloud Datastore has over 100 applications in production at Google, serving both internal and external users. Applications like Gmail, Picasa, Google Calendar, Android Market, and App Engine use Cloud Datastore and Megastore.
Google Trends uses MillWheel for stream processing. Google Ads initially used MySQL and later migrated to F1 DB, a custom-written distributed relational database. YouTube uses MySQL with Vitess. Google stores exabytes of data across commodity servers with the help of the Google File System.
Source: Google Databases: How Do Google Services Store Petabyte-Exabyte Scale Data?
YouTube Database – How Does It Store So Many Videos Without Running Out Of Storage Space?