Why does Google App Engine restrict GQL queries? - google-app-engine

I was reading about App Engine on Wikipedia and came across some GQL restrictions:
- JOIN is not supported
- can SELECT from at most one table at a time
- can put at most 1 column in the WHERE clause
What are the advantages of these restrictions?
Are these restrictions common in other places where scalability is a priority?

The datastore that GQL talks to:
- is not a relational database like MySQL or PostgreSQL
- is a column-oriented DBMS called BigTable
One reason to use a database like this is to get very high performance that scales across hundreds of servers.
GQL is not SQL; it is SQL-like.
Here are some references:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
http://en.wikipedia.org/wiki/BigTable
http://code.google.com/appengine/docs/datastore/overview.html
http://code.google.com/appengine/docs/datastore/gqlreference.html

I believe the answer has to do with the underlying technology of the datastore rather than any deliberate restriction on what is available. Google isn't using a relational database under the hood but BigTable; they have just added a nice API that uses SQL-like queries to limit the learning curve for those who are used to a relational database. Those who are more used to ORMs will take to it like a duck to water.

The existing answers do a good job with the high-level question.
One additional note: the third restriction you mention isn't actually true. GQL queries can include as many columns in the WHERE clause as you like. There are a few caveats, but the number of columns is not explicitly limited. More:
http://code.google.com/appengine/docs/python/datastore/queries.html

Related

BigTable aggregation data

I have been trying to use BigTable with the connector to BigQuery. When I test query performance on 1 million rows, the query takes ~50 sec.
My SQL:
SELECT
  DATE(geo_table_cell.timestamp) AS day,
  geo_table_cell.value,
  COUNT(*) AS countNumber
FROM
  `project-dev.project_dev_bt_eu.dev-project`,
  UNNEST(geo.COLUMN) AS geo_table,
  UNNEST(geo_table.cell) AS geo_table_cell
WHERE
  geo_table.name = 'cc'
  AND rowkey LIKE 'profile%'
GROUP BY
  geo_table_cell.value,
  DATE(geo_table_cell.timestamp)
My questions are:
What is the best solution for aggregating data from BigTable? (The same aggregation in ElasticSearch takes less than ~2 sec.)
Why does BigQuery work so slowly with the BigTable connector?
If I understand correctly, BigTable is not a good choice for presenting data in dashboards (filters work very slowly).
1.- If query speed is a must, loading the data into BigQuery instead of setting up an external data source would be the most efficient way.
Nevertheless, there are some things you can do to improve BigQuery or BigTable performance.
2.- This connector is still in the Beta stage and has some performance considerations. We should also keep in mind that BigTable is a NoSQL (non-relational) database and is not intended for SQL queries.
If you are still exploring the data model you want to use in your application, I recommend you consider all these options and choose the one that best fits your needs.
3.- I would say it is not a good choice if you want to query your data using SQL. Given the non-relational architecture of BigTable, the most effective way to read your data is to send read requests. You can find code samples for this, in different languages, in the official documentation.

Solr relational database

Can Solr be used as a relational database?
I am building a product database using Solr, but I am also trying to add competitor products to the same database, so that when one item comes up, the equivalent competitor entries also show up.
No, Solr should not be used as a relational database.
That does not mean that what you want isn't a good fit for Solr, just that its main usefulness lies outside of what relational databases are good at.
You can use regular search, "MoreLikeThis" or similar functionality (such as graphs or analytics from the streaming expressions support) to find similar or identical products.

HBase or Hive - web requests

Is either HBase or Hive a suitable replacement for your traditional (non)relational database? Will they be able to serve web requests from web clients and respond in a timely manner? Are HBase/Hive only suitable for large dataset analysis? Sorry, I'm a noob at this subject. Thanks in advance!
Hive is not at all suitable for any real-time need such as timely web responses. You can use HBase, though. But don't think of either HBase or Hive as a replacement for traditional RDBMSs. Both were meant to serve different needs. If your data is not huge, you are better off with an RDBMS. RDBMSs are still the best choice (if they fit your requirements). Technically speaking, HBase is really more a datastore than a database because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
The most important thing that might strike a newbie is HBase's lack of SQL support, since it belongs to the NoSQL family of stores.
HBase and Hive are also not the only options for handling large datasets. You have several others, like Cassandra, Hypertable, MongoDB, Accumulo, etc. But each one is meant for solving some specific problem. For example, MongoDB is used for handling document data. So you need to analyze your use case first and then choose the datastore that suits your requirements.
You might find this list useful which compares different NoSQL datastores.
HTH
Hive is a data warehouse tool, and it is mainly used for batch processing.
HBase is a NoSQL database which allows random access based on rowkey (primary key). It is used for transactional access. It doesn't have indexing support, which could be a limitation for your needs.
Thanks,
Dino

How to create database table in Google App Engine

How to create database table in Google App Engine
You don't. You create Entities of different kinds. Datastore is not a relational database[*].
If you want to imagine that GAE creates one "table" for each kind, the "columns" of that "table" being the properties of the entities, then you're welcome to do so. But I don't think it helps.
[*] I don't know whether it meets some technical definition, but it certainly doesn't drive like SQL-based databases.
According to http://code.google.com/appengine/docs/python/datastore/
App Engine Datastore is a schemaless object datastore providing
robust, scalable storage for your web application, with the following
features:
No planned downtime
Atomic transactions
High availability of reads and writes
Strong consistency for reads and ancestor queries
Eventual consistency for all other queries
The Python Datastore interface includes a rich data modeling API and a SQL-like query language called GQL.
In simple words: just create your model class and create an object of this class. After the first call of the put() method on this object, the "table" (I think the term here is kind) will be created on the fly. But you definitely have to read the documentation and check some examples. They will help you understand the specifics of Google Datastore and how it differs from a common RDBMS.
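To illustrate the "created on the fly" idea, here is a minimal in-memory sketch in plain Python. It is not the App Engine API: `ToyDatastore` and its methods are hypothetical names made up for this example, and the real datastore also handles persistence, indexes, and transactions, none of which are modeled here.

```python
from collections import defaultdict
from itertools import count


class ToyDatastore:
    """In-memory stand-in for a schemaless store (hypothetical, not App Engine)."""

    def __init__(self):
        # kind name -> {entity id: property dict}; no schema is declared anywhere
        self._kinds = defaultdict(dict)
        self._ids = count(1)

    def put(self, kind, **properties):
        """Store an entity; the kind comes into existence on first use."""
        entity_id = next(self._ids)
        self._kinds[kind][entity_id] = dict(properties)
        return entity_id

    def get(self, kind, entity_id):
        """Return the entity's properties, or None if it does not exist."""
        return self._kinds.get(kind, {}).get(entity_id)

    def kinds(self):
        """List the kinds that have at least been written to once."""
        return sorted(self._kinds)


store = ToyDatastore()
first = store.put("Greeting", author="alice", content="hello")
# A second entity of the same kind may carry different properties.
second = store.put("Greeting", content="no author here", rating=5)
```

Note that nothing resembling CREATE TABLE appears: the "Greeting" kind exists only because entities of that kind were stored, and the two entities do not share a set of "columns".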
In simple words, I would say that with Google BigTable you don't need to create your tables because there are already six Big Tables ready to store whatever you want.

Google's Bigtable vs. A Relational Database [duplicate]

This question already has answers here:
Closed 13 years ago.
Duplicates
Why should I use document based database instead of relational database?
Pros/Cons of document based database vs relational database
I don't know much about Google's Bigtable but am wondering what the difference between Google's Bigtable and relational databases like MySQL is. What are the limitations of both?
Bigtable is Google's invention to deal with the massive amounts of information that the company regularly deals in. A Bigtable dataset can grow to immense size (many petabytes) with storage distributed across a large number of servers. The systems using Bigtable include projects like Google's web index and Google Earth.
According to Google's whitepaper on the subject:
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
The internal mechanics of Bigtable versus, say, MySQL are so dissimilar as to make comparison difficult, and the intended goals don't overlap much either. But you can think of Bigtable a bit like a single-table database. Imagine, for example, the difficulties you would run into if you tried to implement Google's entire web search system with a MySQL database -- Bigtable was built around solving those problems.
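The "sorted map indexed by row key, column key, and timestamp" definition quoted above can be made concrete with a small sketch. This is a toy model in plain Python, not Bigtable's implementation or client API; the class name `ToySortedMap`, its methods, and the sample row keys are all assumptions made for illustration.

```python
import bisect


class ToySortedMap:
    """Toy (row, column, timestamp) -> bytes map kept in sorted key order."""

    def __init__(self):
        # Keys are stored as (row, column, -timestamp) so that, within one
        # cell, the newest version sorts first.
        self._keys = []
        self._values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def latest(self, row, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_prefix(self, prefix):
        """Yield every cell version whose row key starts with `prefix`."""
        i = bisect.bisect_left(self._keys, (prefix,))
        while i < len(self._keys) and self._keys[i][0].startswith(prefix):
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = ToySortedMap()
table.put("profile#1", "geo:cc", 1, b"DE")
table.put("profile#1", "geo:cc", 2, b"FR")
table.put("session#9", "geo:cc", 1, b"US")
newest = table.latest("profile#1", "geo:cc")      # b"FR"
profile_cells = list(table.scan_prefix("profile"))
```

Because the keys are sorted, a lookup by row key or a scan over a row-key prefix only touches a contiguous range. Anything that cannot be expressed as such a range (for example, filtering on a value) has to read and discard rows, which is the kind of access pattern the sketch does not make cheap.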
Bigtable datasets can be queried from services like AppEngine using a language called GQL ("gee-kwal"), which is based on a subset of SQL. Conspicuously missing from GQL is any sort of JOIN command. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient. Instead, the programmer has to implement such logic in the application, or design the application so as not to need it.
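As a rough idea of what "implementing such logic in the application" can look like, here is a sketch of an application-side join in plain Python. The in-memory `AUTHORS` dict and `POSTS` list stand in for two separate kinds, and `fetch_authors` and `posts_with_authors` are hypothetical helpers, not datastore calls.

```python
# Two separate "kinds", each only reachable by its own key or query.
AUTHORS = {
    1: {"name": "alice"},
    2: {"name": "bob"},
}
POSTS = [
    {"title": "Intro", "author_id": 1},
    {"title": "Follow-up", "author_id": 2},
    {"title": "Orphan", "author_id": 99},  # refers to a missing author
]


def fetch_authors(author_ids):
    """Stand-in for one batched get-by-key against the author kind."""
    return {i: AUTHORS[i] for i in author_ids if i in AUTHORS}


def posts_with_authors(posts):
    """Join posts to authors in application code instead of with JOIN."""
    wanted = {post["author_id"] for post in posts}
    found = fetch_authors(wanted)  # one lookup round, not one per post
    return [
        {**post, "author_name": found[post["author_id"]]["name"]}
        for post in posts
        if post["author_id"] in found  # inner-join semantics: drop orphans
    ]


joined = posts_with_authors(POSTS)
```

The alternative mentioned in the answer, designing the application so that no join is needed, would typically mean storing the author name on each post at write time, at the cost of updating every copy if the name changes.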
Google's BigTable and other similar projects (e.g. CouchDB, HBase) are database systems designed so that data is mostly denormalized (i.e., duplicated and grouped).
The main advantages are:
- Join operations are less costly because of the denormalization
- Replication/distribution of data is less costly because of data independence (i.e., if you want to distribute data across two nodes, you probably won't have the problem of having an entity in one node and a related entity in another node, because similar data is grouped)
These kinds of systems are suited to applications that need to scale close to linearly (i.e., you add more nodes to the system and performance increases proportionally). In an RDBMS like MySQL or Oracle, when you start adding more nodes, joining two tables that are not on the same node costs more. This becomes important when you are dealing with high volumes.
RDBMSs are nice because of the richness of the storage model (tables, joins, foreign keys). Distributed databases are nice because of the ease of scaling.

Resources