Everything in one "table" on app engine? - google-app-engine

This question refers to database design using app engine and objectify. I want to discuss pros and cons of the approach of placing all (or let's say multiple) entities into a single "table".
Let's say I have a (very simplified) data model of two entities:
class User {
    @Index Long userId;
    String name;
}
class Message {
    @Index Long messageId;
    String message;
    private Ref<User> recipient;
}
At first glance, it makes no sense to put these into the same "table" as they are completely different.
But let's look at what happens when I want to search across all entities. Let's say I want to find and return users and messages, which satisfy some search criteria. In traditional database design I would either do two separate search requests, or else create a separate index "table" during writes where I repeat fields redundantly so that I can later retrieve items in a single search request.
Now let's look at the following design. Assume I would use a single entity, which stores everything. The datastore would then look like this:
Type | userId | messageId | Name | Message
USER | 123456 | empty | Jeff | empty
MESSAGE | empty | 789012 | Mark | This is text.
See where I want to go? I could now search for a Name and would find all Users AND Messages in a single request. I would even be able to add an index field, something like
@Index List index;
to the "common" entity and would not need to write data twice.
Given the behavior of the datastore that it never returns a record when searching for an indexed field which is empty, and combining this with partial indexes, I could also get the User OR Message by querying fields unique to a given Type.
The cost for storing long (non-normalized) records is not higher than storing individual records, as long as many fields are empty.
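To make the idea concrete, here is a minimal in-memory sketch of the single-"table" design described above. It is not the datastore or Objectify API: the class `CommonStore`, the record `Row`, and both finder methods are hypothetical names, and the "index" is just a map that skips rows whose name is empty, imitating the partial-index behaviour the question relies on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the single-"table" design; not the datastore API.
class CommonStore {
    // One record shape for every type; fields a type does not use stay null.
    record Row(String type, Long userId, Long messageId, String name, String message) {}

    private final List<Row> rows = new ArrayList<>();
    // Index on "name": rows with a null name are not indexed at all,
    // imitating the partial-index behaviour assumed in the question.
    private final Map<String, List<Row>> byName = new HashMap<>();

    void put(Row row) {
        rows.add(row);
        if (row.name() != null) {
            byName.computeIfAbsent(row.name(), k -> new ArrayList<>()).add(row);
        }
    }

    // A single lookup returns users and messages alike.
    List<Row> findByName(String name) {
        return byName.getOrDefault(name, List.of());
    }

    // Filtering on a field that only one type fills returns only that type.
    List<Row> findWithUserId() {
        return rows.stream().filter(r -> r.userId() != null).toList();
    }
}
```

With one USER row and one MESSAGE row that share a name, `findByName` returns both, while `findWithUserId` returns only the user.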
I see further advantages:
I could use the same "table" for auditing as well, as every record stored would form a "history" entry (as long as I don't allow updates, in which case I would need to handle this manually).
I can easily add new Types without extending the db schema.
When search results are returned over REST, I can return them in a single List, and the client looks at the Type.
There might be disadvantages as well, for example with caching, but maybe not. I can't see this at this point.
Anybody there, who has tried going down this route or who can see serious drawbacks to this approach?

This is actually how the google datastore works under the covers. All of your entities (and everyone else's entities) are stored in a single BigTable that looks roughly like this:
{yourappid}/{key}/{serialized blob of your entity data}
Indexes are stored in three BigTables shared across all applications. I try to explain this in a fair amount of detail in my answer to this question: efficient searching using appengine datastore ancestor paths
So to rephrase your question, is it better to have Google maintain the Kind or to maintain it yourself in your own property?
The short answer is that having Google maintain the Kind makes it harder to query across all Kinds but makes it easier to query within one Kind. Maintaining the pseudo-kind yourself makes it easier to query across all Kinds but makes it harder to query within one Kind.
When Google maintains the Kind as per normal use, you already understand the limitation - there is no way to filter on a property across all different kinds. On the other hand, using a single Kind with your own discriminator means you must add an extra filter() clause every time you query:
ofy().load().type(Anything.class).filter("discriminator", "User").filter("name >", "j")
Sometimes these multiple-filter queries can be satisfied with zigzag merges, but some can't. And even the ones that can be satisfied with zigzag aren't as efficient. In fact, this tickles the specific degenerative case of zigzags - low-cardinality properties like the discriminator.
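As a rough illustration of what a zigzag merge does, here is a toy sketch that intersects two sorted sets of entity ids, one per equality filter, by letting each side jump forward to the other's position. This is only a model of the idea, not the datastore's actual implementation; `ZigzagSketch` and `intersect` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Toy model of a zigzag merge over two single-property index scans.
class ZigzagSketch {
    // Each set holds the ids matching one equality filter, in index order.
    static List<Long> intersect(NavigableSet<Long> a, NavigableSet<Long> b) {
        List<Long> out = new ArrayList<>();
        if (a.isEmpty() || b.isEmpty()) {
            return out;
        }
        Long x = a.first();
        Long y = b.first();
        while (x != null && y != null) {
            int c = x.compareTo(y);
            if (c == 0) {
                out.add(x);        // both filters match this id
                x = a.higher(x);
                y = b.higher(y);
            } else if (c < 0) {
                x = a.ceiling(y);  // jump scan a forward to b's position
            } else {
                y = b.ceiling(x);  // jump scan b forward to a's position
            }
        }
        return out;
    }
}
```

The jumps pay off when one side is sparse. A discriminator that matches a large share of all entities gives the merge a long scan on that side that can skip very little, which is the low-cardinality problem mentioned above.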
Your best bet is to pick and choose your shared Kinds carefully. Objectify makes this easy for you with polymorphism: https://code.google.com/p/objectify-appengine/wiki/Entities#Polymorphism
A polymorphic type hierarchy shares a single Kind (the kind of the base @Entity); Objectify manages the discriminator property for you and ensures queries like ofy().load().type(Subclass.class) are converted to the correct filter operation under the covers.
I recommend using this feature sparingly.

One SERIOUS drawback to that will be indexes:
Every query you run needs its own index in order to be servable, so ALL writes you do will have to update ALL of those indexes (for NO reason, in a good number of cases).
I can't think of other drawbacks at the moment, except the limit of a meg per entity (if you have a LOT of types, with a LOT of values, you might run into this as you end up having a gazillion columns).
Not to mention how big your ONE entity model would be, and how convoluted your code to "triage" your entity types could end up being.

Related

Database performance: Using one entity/table with the max. possible properties or split to different entities/tables?

I need to design some database tables, but I'm not sure about the performance impact. In my case, read performance matters more than the performance of saving the data.
The situation
With the help of pattern recognition, I determine how many values of a certain object need to be saved in my PostgreSQL database.
Apart from the other, let's say fixed, properties, the only difference is whether 1, 2 or 3 values of the same type need to be saved.
Currently I have 3 entities/tables, which differ only in having 1, 2 or 3 non-nullable properties of the same type.
For example:
EntityTestOne/TableOne {
... other (same) properties
String optionOne;
}
EntityTestTwo/TableTwo {
... other (same) properties
String optionOne;
String optionTwo;
}
EntityTestThree/TableThree {
... other (same) properties
String optionOne;
String optionTwo;
String optionThree;
}
I expect to have several million records in production, and I'm wondering what the performance impact of this variant could be and what the alternatives are.
Alternatives
Other options which come into my mind:
Use only one entity class or table with 3 options (optionTwo and optionThree will then be nullable). With millions of expected records plus caching, I wonder whether it isn't a kind of 'waste' to save millions of null values in at least two (caching) layers (the database itself and Hibernate). In another answer I read yesterday, saving a null value in PostgreSQL needs only 1 bit, which I think isn't much even for several million records that contain some nullable properties (link).
Create another entity/table and use a collection (list or set) relationship instead
For example:
EntityOption {
String value;
}
EntityTest {
... other (same) properties
List<EntityOption> options;
}
If I use this relationship, what would give better performance when creating new records: creating new EntityOptions for every new EntityTest, or doing a lookup first and referencing an existing EntityOption if one exists? What about read performance when fetching them later, and the joins that will then be needed?
Compared to the variant with one plain entity with three options, I can imagine it could be slightly slower...
As I'm not that strong in database design or in working with Hibernate, I'm interested in the pros and cons of these approaches and in whether there are even more alternatives.
I would even like to ask whether PostgreSQL is the right choice for this, or whether I should think about using another (free) database.
Thanks!
The case is pretty clear in my opinion: If you have an upper limit of three properties per object, use a single table with nullable attributes.
A NULL value does not take up any space in the row's data area. Instead, PostgreSQL records which attributes are NULL in a per-row bitmap in the tuple header, at one bit per attribute. This bitmap is only stored for rows that contain at least one NULL value. See the documentation for details.
So don't worry about storage space in this case.
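The cost of that bitmap can be worked out with a little arithmetic. The sketch below assumes the 23-byte fixed tuple header described in the PostgreSQL documentation and 8-byte alignment (typical on 64-bit builds, but an assumption here); `NullBitmapMath` and `headerBytes` are illustrative names, not part of any PostgreSQL API.

```java
// Back-of-the-envelope estimate of the per-row header size in PostgreSQL.
class NullBitmapMath {
    static final int FIXED_HEADER_BYTES = 23; // fixed part of the heap tuple header
    static final int MAXALIGN = 8;            // assumption: typical 64-bit build

    // Header size in bytes for a row with the given column count.
    static int headerBytes(int columns, boolean rowHasNull) {
        // One bit per column, and only if the row actually contains a NULL.
        int bitmapBytes = rowHasNull ? (columns + 7) / 8 : 0;
        int raw = FIXED_HEADER_BYTES + bitmapBytes;
        // The header is padded up to the next multiple of MAXALIGN.
        return ((raw + MAXALIGN - 1) / MAXALIGN) * MAXALIGN;
    }
}
```

Under these assumptions, a table with up to 8 columns has the same header size whether or not a row contains NULLs, because the one-byte bitmap fits into padding that would be there anyway.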
Using three different tables or storing the attributes in a separate table will probably lead to UNIONs or JOINs in your queries, which will make the queries more complicated and slow.
There are several inheritance strategies for mapping entity classes. I think you should go with the single-table strategy, where there is a discriminator column (managed by Hibernate itself); the common fields are used by every entity, while the fields specific to one entity remain null for the others.
This should improve read performance.
For your ref. :
http://www.thejavageek.com/2014/05/14/jpa-single-table-inheritance-example/

Datastore efficiency, low level API

Every Cloud Datastore query computes its results using one or more indexes, which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated incrementally to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
Generally, I would like to know if
datastore.get(List<Key> listOfKeys);
is faster or slower than a query with the index file prepared (with the same results).
Query q = new Query("Kind").setFilter(someFilter); // the filter is optional
My current problem:
My data consists of Layers and Points. Points belong to only one unique layer and have unique ids within a layer. I could load the points in several ways:
1) Have points with a "layer name" property and query with a filter.
- Here I am not sure whether the datastore would have the results prepared, because the layer name changes dynamically.
2) Use only keys. The layer would have to store point ids.
KeyFactory.createKey("Layer", "layer name");
KeyFactory.createKey("Point", "layer name"+"x"+"point id");
3) Use queries without filters: I don't actually need the general kind "Point" and could be more specific: kind would be ("layer name"+"point id")
- What are the costs to creating more kinds? Could this be the fastest way?
Can you actually find out how the datastore works in detail?
"faster or slower than a query with the index file prepared (with the same results)."
Fundamentally a query and a get by key are not guaranteed to have the same results.
Queries are eventually consistent, while getting data by key is strongly consistent.
Your first challenge, before optimizing for speed, is probably ensuring that you're showing the correct data.
The docs explain eventual vs. strong consistency well. It sounds like you have the option of using an ancestor query, which can be strongly consistent. I would also strongly recommend against using the 'name', which is dynamic, as the entity's key name; this will cause you an excessive amount of grief.
Edit:
In the interests of being specifically helpful, one option for a working solution based on your description would be:
Give a unique id (a uuid probably) to each layer, store the name as a property
Include the layer key as the parent key for each point entity
Use an ancestor query when fetching points for a layer (which is strongly consistent)
An alternative option is to store points as embedded entities and only have one entity for the whole layer - depends on what you're trying to achieve.
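The parent-key layout suggested above can be pictured with a toy model: if every point's key starts with its layer's key, fetching a layer's points becomes a range scan over one key prefix. The sketch uses a sorted map in place of the datastore; `AncestorSketch`, `layerKey`, `pointKey` and `pointsOf` are made-up names, and the real low-level API builds keys with KeyFactory rather than strings.

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of parent/child keys: a child's key is its parent's key plus its own part.
class AncestorSketch {
    private final NavigableMap<String, String> entities = new TreeMap<>();

    // layerId stands for a stable id (for example a UUID), not the mutable layer name.
    static String layerKey(String layerId) {
        return "Layer:" + layerId;
    }

    static String pointKey(String layerId, long pointId) {
        return layerKey(layerId) + "/Point:" + pointId;
    }

    void put(String key, String value) {
        entities.put(key, value);
    }

    // "Ancestor query": every entity whose key path starts with the layer's key.
    Collection<String> pointsOf(String layerId) {
        String prefix = layerKey(layerId) + "/";
        return entities.subMap(prefix, true, prefix + Character.MAX_VALUE, false).values();
    }
}
```

Because the layer id never changes, renaming a layer only updates a property on the layer entity and leaves every point key untouched.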

Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often, there are hundreds of different kinds of items, all with often very varying data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database ?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items, the database just holds an ID and a blob.
Pro's: takes like 10 seconds to implement.
Con's: Basically sacrifices every database feature, hard to maintain, near impossible to refactor.
A table per item type.
Pro's: Clean, flexible.
Con's: With a wide variety come hundreds of tables, and every search for an item has to query them all since SQL doesn't have the concept of table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pro's: takes like 10 seconds to implement, still searchable.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
A few tables with a few 'base profiles' for storage where similar items get thrown together and use the same fields for different data.
Pro's: I've got nothing.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT
    id pk
    type
    att1
PRODUCT_X
    id pk fk PRODUCT
    att2
    att3
PRODUCT_Y
    id pk fk PRODUCT
    att4
    att5
For attributes that you don't need to search/sort/analyze, then use a blob or xml
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogenous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
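The "roll your own validation" part of the second alternative might look like the following sketch, where each kind declares the attribute keys it expects in the free-form field. The kinds and keys (`BOOK`, `title`, `pages`, and so on) are invented for illustration, and a plain `Map` stands in for the JSONB column.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: the "kind" field tells us which keys the free-form attributes must contain.
class KindValidation {
    // Hypothetical item kinds with the attribute keys each one requires.
    enum Kind {
        ROCK(Set.of()),
        BOOK(Set.of("title", "pages")),
        CONTAINER(Set.of("capacity"));

        final Set<String> requiredKeys;

        Kind(Set<String> requiredKeys) {
            this.requiredKeys = requiredKeys;
        }
    }

    // Returns the required keys that are missing; an empty set means the item is valid.
    static Set<String> missingKeys(Kind kind, Map<String, Object> attrs) {
        Set<String> missing = new TreeSet<>(kind.requiredKeys);
        missing.removeAll(attrs.keySet());
        return missing;
    }
}
```

For example, `missingKeys(Kind.BOOK, Map.of("title", "Dune"))` reports that `pages` is missing, so the application can reject the row before it is inserted.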
Yes, it is a pain to design database formats like this. I'm designing a notification system and reached the same problem. My notification system is however less complex than yours - the data it holds is at most ids and usernames. My current solution is a mix of 1 and 3 - I serialize data that is different from every notification, and use a column for the 2 usernames (some may have 2 or 1). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMS - it sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if the items differ from each other a lot.
I'm sure this has been asked here a million times before, but in addition to the options which you have discussed in your question, you can look at EAV schema which is very flexible, but which has its own sets of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to query things accurately which are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
Here's an interesting read about component-based entity systems.
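A minimal sketch of the "bag of components" idea described in this answer, kept in memory only: the component names `Position` and `Collectable` and the fields `category` and `numItems` come from the text above, while `ComponentSketch`, `Entity`, `with`, `has` and `having` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: an entity is just a named bag of components, keyed by component type.
class ComponentSketch {
    interface Component {}

    record Position(int x, int y) implements Component {}

    record Collectable(String category, int numItems) implements Component {}

    static class Entity {
        final String name;
        private final Map<Class<? extends Component>, Component> components = new HashMap<>();

        Entity(String name) {
            this.name = name;
        }

        // Adds (or replaces) the component of this type and returns the entity for chaining.
        Entity with(Component c) {
            components.put(c.getClass(), c);
            return this;
        }

        boolean has(Class<? extends Component> type) {
            return components.containsKey(type);
        }
    }

    // "Query the entity system": all entities that carry the given component type.
    static List<Entity> having(Collection<Entity> all, Class<? extends Component> type) {
        List<Entity> out = new ArrayList<>();
        for (Entity e : all) {
            if (e.has(type)) {
                out.add(e);
            }
        }
        return out;
    }
}
```

Rendering the inventory then amounts to calling `having(entities, Collectable.class)`. How the bags are persisted (one table per component type, referenced by id) is a separate decision.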

Choosing the right model for storing and querying data?

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models, and I need help choosing the right one.
All the data is represented in two classes, User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1) Search in 10,000,000 entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2) In User.class I add a List<Word> property and a method getWords() that returns the list of words. So I query 1,000 users for the one I need and then call getWords(), which returns a List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in app engine - CPU, datastore reads, datastore writes etc etc etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you only need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000 I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.

should I make two separate tables for two similar objects

I want to store "Tweets" and "Facebook Status" in my app as part of "Status collection" so every status collection will have a bunch of Tweets or a bunch of Facebook Statuses. For Facebook I'm only interested in text so I won't store videos/photos for now.
I was wondering about best practice for DB design. Is it better to have one table (with the maximum status length set to 420 to cover both the Facebook and Twitter limits) with a "Type" column that determines what kind of status it is, or is it better to have two separate tables? And why?
Strictly speaking, a tweet is not the same thing as a FB update. You may be ignoring non-text for now, but you may change your mind later and be stuck with a model that doesn't work. As a general rule, objects should not be treated as interchangeable unless they really are. If they are merely similar, you should either use 2 separate tables or use additional columns as necessary.
All that said, if it's really just text, you can probably get away with a single table. But this is a matter of opinion and you'll probably get lots of answers.
I would put the messages into one table and have another that defines the type:
SocialMediaMessage
------------------
id
SocialMediaTypeId
Message
SocialMediaType
---------------
Id
Name
They seem similar enough that there is no point to separate them. It will also make your life easier if you want to query across both Social Networking sites.
It's probably easier to use one table and use the type to identify them. You will only need one query/stored procedure to access the data, instead of one query for each type when you have multiple tables.

Resources