I am learning how to use the Google App Engine. My question is on the meaning of words used to describe a BigTable database.
I have read google's paper on big table This paper describes the BigTable data model in terms of rows, cells, column families, columns and column keys. I felt the examples given in the paper gave me a good idea of how to start designing a BigTable database.
I then looked at the Google App Engine datastore documentation and API . This uses the terms entity, properties, key, entity groups and index to describe the data model.
My question is:
What is the relationship between the terms used in the paper and the terms used in the API?
The Datastore is built upon Megastore which is built upon an implementation of BigTable.
When developing for appengine you don't really need to concern yourself with the Megastore and BigTable concepts, as it's all out of your control under the hood and the datastore builds much more on top of them anyway. It would be like trying to model your MySQL data after learning how BerkleyDB is implemented.... interesting but ultimately not that useful for your application.
Read through the megastore paper, which will probably give you mostly what you are looking for, and also checkout some of the Google IO talks that bring up the High Replication datastore as these touch a little on what goes on inside.
Related
I'm thinking of porting an application from RoR to Python App Engine that is heavily geo search centric. I've been using one of the open source GeoModel (i.e. geohashing) libraries to allow the application to handle queries that answer questions like "what restaurants are near this point (lat/lng pair)" and things of that nature.
GeoModel uses a ListProperty which creates a heavy index which had me concerned about pricing as i have about 10 million entities that would need to be loaded into production.
This article that I found this morning seems pretty scary in terms of costs:
https://groups.google.com/forum/?fromgroups#!topic/google-appengine/-FqljlTruK4
So my question is - is geohashing a moot concept now that Google has released their full text search which has support for geo searching? It's not clear what's going on behind the scenes with this new API though and I'm concerned the index sizes might be just as big as if I used the GeoModel approach.
The other problem with the search API is that it appears I'd have to create not only my models in the datastore but then replicate some of that data (GeoPtProperty and entity_key for the model it represents at a minimum) into Documents which greatly increases my data set.
Any thoughts on this? At the moment I'm contemplating scraping this port as being too expensive although I've really enjoyed working in the App Engine environment so far and would love to get away from EC2 for some of my applications.
You're asking many questions here:
is geohashing a moot concept: Probably not, I suspect the Search API uses geohashing, or something similar for its location search.
can you use the Search API vs implementing it yourself: yes, but I don't know the cost one way or the other.
is geohashing expensive on app engine: in the message thread the cost is bad due to high index write costs. you'll have to engineer your geohashing data to minimize the indexing. If GeoModel puts a lot of indexed values in the list, you may be in trouble - I wouldn't use it directly without knowing how the indexing works. My guess is that if you reduce the location accuracy you can reduce the number of indexed entries, and that could save you a lot of cost.
As mentioned in the thread, you could have the geohashing run in CloudSQL.
I want to use a graph database for a web application (involving a web of Users, Posts, Comments, Votes, Answers, Documents and Document-Merges and some other transitive relationships on Users and Documents). So I start asking myself if there is something like a design methodology for Graph Databases, i.e. a kind of analogon to the design principles recommended for Relational Databases (like those normal forms)?
Example questions (of many questions arising):
Is it a good idea, to create a Top-Node Users, having relationships ("exist") on any User-Node in the Database?
Is it a good idea to build in version management (i.e. create relationships (something like "follows")) pointing to updated versions of a Document / Post in a way that going back this relationship means watching the changes the document went through.
etc...
So, do we need a Graph Database Design Cookbook?
The Gremlin User Group (http://tinkerpop.com/) and Neo4j User Group (https://groups.google.com/forum/?fromgroups#!forum/neo4j) are good places to discuss graph-database modeling.
You can create supernodes such as "Users," but it may be better and more performant to use indexes and create an index entry for each user with a key=element_type, value="user", id=user_node_id.
A "follows" relation is often used for people/friends like on Facebook and Twitter so I wouldn't use that for versioning. You can build a versioning system into to Neo4j that timestamps each entry and use a last-write wins algorithm, and there are other database systems like Datomic that have this built in.
See Lightbulb's model (https://github.com/espeed/lightbulb/blob/master/lightbulb/model.py) for an example blog model in Bulbs/Python (http://bulbflow.com).
First off, I come from a RDBMS/SQL/C++/Java/Python background and I'm a newbie
to Gaelyk, the Google API and the Google datastore.
I like to model (using flowcharts for code and DB modeling tools for the database)
before I code.
I've used Erwin heavily in the past to do DB modeling.
In Erwin, I've designed a logical / physical data model of a database I'd like to
implement using the Google datastore and Gaelyk with the Google AppEngine SDK.
I wanted to design the data layout before coding anything.
My design tool of choice has been Erwin Data Modeler.
When I looked at the Google datastore, I saw that there
are no relational constraints, and joins are done via
WHERE clause :bind variables.
How can I map my existing model (with PKs/FKs, dependent entities, heavy relational links) to the Google datastore?
Is there a modeling tool that will allow me to design for the Google datastore?
Is the DB design supposed to flow from the Gaelyk MVC pattern and direct coding?
I'm not used to this as I come from an RDBMS background where you model heavily
and all good things come from good relational design.
Also, before coding a database client app in an imperative language (C++, C, Java, Python),
I like to write pseudocode, BUT first and foremost comes the DB design (if the app
has a DB back-end)
Am I doing this all wrong? It looks like there's a set of tools available to me
to start coding, but the design tool set is not there.
Addendum:
Here is the logical model I'm trying to map
How would I map a circular relationship
account --(1:m)-- following --(m:1)-- following_account_id --(1:1)-- account_id?
In general, the guiding principle of the App Engine datastore - and all nonrelational databases - is "optimize for reads". In short that means denormalize, denormalize, denormalize. In some cases, that will make updates harder - for example, if you make your username the primary key of your accounts table, and a user wants to change usernames - and in some cases that will require duplicating data, such as storing persistent counts. All of this is worthwhile, though, since it gives much better read performance and scalability, and in a typical webapp, reads outnumber writes by factors of hundreds to one.
Looking at your model in particular, it's very normalized - more so than most RDBMS models I've seen, even. Some suggestions:
Roll up things like 'user_name_id' into your main accounts table.
For things like 'following', use a list property if the number of people someone follows is typically small (<1000), or the fan-out pattern otherwise.
Pick a reasonable primary key for each table where practical, such as username or email, and use that as a key name. This allows looking up records with get operations instead of queries, which are substantially faster.
When a lookup table such as 'account type' is necessary, make sure the foreign key is sufficiently descriptive you only have to look up the corresponding record for administrative actions. Better, store small, infrequently changing details like this outside the datastore, so they can be accessed instantly.
For things like tags, use list properties to reduce the number of times you have to lookup related entities, and to make indexing easier.
This only scratches the surface, of course, and there's a lot of collected wisdom here on SO, in the groups, and on blogs like mine. Feel free to come back and ask specific questions about data modelling!
To answer your other questions, no, there are no GAE-specific data modelling tools I'm aware of, but you can use a standard diagramming tool as you already are. Models are indeed defined in code, since the datastore is schemaless, but that doesn't have to be a barrier to the order in which you implement things.
I know that app engine is implemented on big table, can anyone describe the difference between actual implementation of big table and google's implementation of big table .i.e (App engine)
Bigtable provides a basic key/value store, described in the paper here. Values are stored in rows and columns. Row and column keys are arbitrary byte strings. For more details see the paper. The basic operations Bigtable provides are lookups on individual row and column keys, and ranges of rows.
On top of Bigtable, there's an abstraction layer called Megastore. Megastore uses the bigtable primitives to construct a more versatile database platform. It adds indexing - using separate bigtables as indexes - and queries using those indexes. It also adds replication support. It's Megastore that provides most of what we think of as the App Engine datastore, such as composite indexes and the variety of queries the datastore provides.
Finally, App Engine implements a few things of its own on top of Megastore, such as the format of App Engine entity keys, giving each app its own datastore, and implementing certain operations like 'IN' and '!=' in an abstraction layer in each language's SDK.
How to create database table in Google App Engine
You don't. You create Entities of different kinds. Datastore is not a relational database[*].
If you want to imagine that GAE creates one "table" for each kind, the "columns" of that "table" being the properties of the entities, then you're welcome to do so. But I don't think it helps.
[*] I don't know whether it meets some technical definition, but it certainly doesn't drive like SQL-based databases.
According to http://code.google.com/appengine/docs/python/datastore/
App Engine Datastore is a schemaless object datastore providing
robust, scalable storage for your web application, with the following
features:
No planned downtime
Atomic transactions
High availability of reads and writes
Strong consistency for reads and ancestor queries
Eventual consistency for all other queries
The Python Datastore interface includes a rich data modeling API and a SQL-like query language called GQL.
In simple words just create you model class, create an object of this class and after first call of put() method for this object the "table"(I think the term here is kind) will be created on the fly. But you definitely have to read the documentation and check some examples. The will help you to understand the specifics of Google Datastore and how it differs from the common RDBMS
In simple words, i would say that with Google BigTable you don't need to create your tables because there are already six Big Tables ready to store whatever you want.