One decision that I have run into a few times is how to handle passing around either the key or embedded IDs of the entities. Each seems equally feasible given the encoders and marshalling methods built in with the datastore keys, but I was wondering if there is any sort of best practice on this choice. An example might be for a URL accessing a user’s files, where users have the default auto-generated numerical IDs, of the form: website.com/users/{userIdentifier}/files
I am trying to determine whether the number embedded in the datastore keys is preferable to the actual key strings themselves. Is it safe to have datastore keys out in the wild? I would like to standardize the way we handle those identifiers across our system and was wondering if there are any best practices on this.
The only reason to use a full Key as opposed to an identifier is to get the ancestor information embedded in the key itself without passing an additional data. While this may be convenient in some cases, I don't think it's a big enough of an advantage to use keys as a standard method of reference within an app.
The advantages of using an identifier are more substantial: (a) they are much smaller, and (b) they do not reveal any information about their ancestors (which may or may not be an issue).
The smaller size comes into play quite often: you may want to use an id in a URL, hold a list of ids in a memcache (which has a 1MB limit), etc.
Datastore keys contain (at least) next information:
Kind
Reference to ancestor
String or Int ID
Do you really need/want to pass in URL or keep in your DB AppID & Kind?
Compare this 2 urls (logically, in case of key it would be probably encoded with urlsafe()):
/list-of-orders?user=123
/list-of-orders?user=User/123
Or this 2 fields:
Table: Orders
---------------------
| UserKey | UserID |
---------------------
| User/123 | 123 |
---------------------
Why would you want to keep & pass around repetitive information about app & kind? Usually your app reference its own entities and kind is known by column or parameter name.
Unless you build some orchestration/integration between few apps it's more effective to use just IDs.
Related
I have a database table with an autoincrement ID as primary key.
For each record of this table, I can have up to 3 files, which can be publicly available so random filename generation is not mandatory, and these files are optional.
I think I have 2 possible solutions:
Store a random generated filename in 3 nullable varchar column and store all the files in the same place:
columns: a | b | c
uploads/f6se54fse654.jpg
Don't store the filenames, but place them in specific folders and name them the same than the primary key value:
uploads/a/1.jpg
uploads/b/1.jpg
uploads/c/1.jpg
With this last solution, I know that uploads/a/1.jpg belongs to record with ID 1, and is a file of type a. But I have to check if the file exists because the files are optional.
Do you think there is a good practice in all that? Or maybe there is a better approach?
If the files you are talking about are intended to be displayed or downloaded by users (whether for visitors or for authenticated users, filtered by roles (ACL) or not), it is important to ensure (IMHO) that the user will not be able to guess other information other than the content of the concerned resource which has been sent to him. There is no perfect solution that can be applied to all cases without exception, so let's take an example to give you more explanations.
In order to enhance the security and total opacity of sensitive data, for example for the specific case of uploads/users/7/invoices/3.pdf, I think it would be wise to ensure that absolutely no one can guess the number of files that are potentially associated with the user or any other entity (because otherwise, in this example, we could imagine that there potentially are other accessible files - 1.pdf and 2.pdf). By design, we generally want to give access to files in a well defined and specific cases and context. However, this may not be the case for an image file which is intended to be seen by everyone (a profile photo, for example). That's why the context matters in some way.
If you choose to keep the auto-incremented identifiers as names to refer to your files, this can also give information about the size of the data stored in your database (/uploads/invoices/128.pdf informs that you may already have 127 invoices on your server) and potentially motivate unscrupulous people to try to reach resources that should never be fetched out of the defined context. This case may be less obvious if you choose to use some kind of unique generated identifiers (GUID).
I recommend that you read this article concerning the generation of (G)/(U)UIDs (a 128-bit hexadecimal numbers) to be stored in your database for each uploaded or created file. If you use MySQL in its latest version it is even possible to host this identifier in a binary (16) type which offers an automatic conversion to UUID, I let you read this interesting topic associated with what I refer about. It will probably output this as /uploads/invoices/b0016303-8e4f-487a-8c30-5dddf1ebf7e9.pdf which is a lot better as long as you ensure that the generated identifier is unique hash.
It does not seem useful to me here to talk about performance issues because today there are many methods for caching files or path and urls, which avoid having to make requests each time in a lot of cases where a resource is called (often ordered by their popularity rank in bigdata cases).
Last, but not least, many web and mobile platform applications (I think of Slack, Discord, Facebook, Twitter...) which store a lot of media files every day which are often associated with accounts users, both public and confidential files and information, generate a unique hash for each of them.
Twitter is using its own unique identifier string (64-bits BIGINT) generator called Twitter Snowflake which you might be interesting to read too. It is based on the UNIX epoch value which is, by definition, unique at each millisecond tick.
There isn't a global and perfect solution which can be applied for everything but I hope that this will help you as you may want to take a deeper look in this and find the "best solution" for each context and entity you'll store and link files.
This question refers to database design using app engine and objectify. I want to discuss pros and cons of the approach of placing all (or let's say multiple) entities into a single "table".
Let's say I have a (very simplified) data model of two entities:
class User {
#Index Long userId;
String name;
}
class Message {
#Index Long messageId;
String message;
private Ref<User> recipient;
}
At first glance, it makes no sense to put these into the same "table" as they are completely different.
But let's look at what happens when I want to search across all entities. Let's say I want to find and return users and messages, which satisfy some search criteria. In traditional database design I would either do two separate search requests, or else create a separate index "table" during writes where I repeat fields redundantly so that I can later retrieve items in a single search request.
Now let's look at the following design. Assume I would use a single entity, which stores everything. The datastore would then look like this:
Type | userId | messageId | Name | Message
USER | 123456 | empty | Jeff | empty
MESSAGE | empty | 789012 | Mark | This is text.
See where I want to go? I could now search for a Name and would find all Users AND Messages in a single request. I would even be able to add an index field, something like
#Index List index;
to the "common" entity and would not need to write data twice.
Given the behavior of the datastore that it never returns a record when searching for an indexed field which is empty, and combining this with partial indexes, I could also get the User OR Message by querying fields unique to a given Type.
The cost for storing long (non-normalized) records is not higher than storing individual records, as long as many fields are empty.
I see further advantages:
I could use the same "table" for auditing as well, as every record
stored would form a "history" entry (as long as I don't allow
updates, in which case I would need to handle this manually).
I can easily add new Types without extending the db schema.
When search results are returned over REST, I can return them in a single List, and the client looks at the Type.
There might be disadvantages as well, for example with caching, but maybe not. I can't see this at this point.
Anybody there, who has tried going down this route or who can see serious drawbacks to this approach?
This is actually how the google datastore works under the covers. All of your entities (and everyone else's entities) are stored in a single BigTable that looks roughly like this:
{yourappid}/{key}/{serialized blob of your entity data}
Indexes are stored in three BigTables shared across all applications. I try to explain this in a fair amount of detail in my answer to this question: efficient searching using appengine datastore ancestor paths
So to rephrase your question, is it better to have Google maintain the Kind or to maintain it yourself in your own property?
The short answer is that having Google maintain the Kind makes it harder to query across all Kinds but makes it easier to query within one Kind. Maintaining the pseudo-kind yourself makes it easier to query across all Kinds but makes it harder to query within one Kind.
When Google maintains the Kind as per normal use, you already understand the limitation - there is no way to filter on a property across all different kinds. On the other hand, using a single Kind with your own descriminator means you must add an extra filter() clause every time you query:
ofy().load().type(Anything.class).filter("discriminator", "User").filter("name >", "j")
Sometimes these multiple-filter queries can be satisfied with zigzag merges, but some can't. And even the ones that can be satisfied with zigzag aren't as efficient. In fact, this tickles the specific degenerative case of zigzags - low-cardinality properties like the discriminator.
Your best bet is to pick and choose your shared Kinds carefully. Objectify makes this easy for you with polymorphism: https://code.google.com/p/objectify-appengine/wiki/Entities#Polymorphism
A polymorphic type hierarchy shares a single Kind (the kind of the base #Entity); Objectify manages the discriminator property for you and ensures queries like ofy().load().type(Subclass.class) are converted to the correct filter operation under the covers.
I recommend using this feature sparingly.
One SERIOUS drawback to that will be indexes:
every query you do will write a separate index to be servable, then ALL writes you do will need to write to ALL these tables (for NO reason, in a good amount of cases).
I can't think of other drawbacks at the moment, except the limit of a meg per entity (if you have a LOT of types, with a LOT of values, you might run into this as you end up having a gazillion columns)
Not mentioning how big your ONE entity model would be, and how possibly convoluted your code to "triage" your entity types could end up being
Trying to define some policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (try to use as little as possible space in the DB for keys, to allow for as much data as possible)
As collision-less as possible (avoid keys for two different objects to be equal)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>).The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if id has the character : in it?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like Ordered Set or Hash) to store namespaces. The main drawback to this would be loss of "shardability", since the structure to store the namespaces would need to be in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1 - i.e. namespaces separated by a character such as colon. That said, the namespaces are almost always one level deep. For example : person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can get into a different shard or same shard depending on how you set it up.
Namespaced Namespace as a way to avoid collisions works with this approach. However, namespaces as a way to group keys doesn't work out. In general, using keys as a way to group data is a bad idea. For example, what if the person moves from department to another? If you change the key, you will have to update all references - and that gets tricky.
Its best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, lets say you want to group people by department, by salary range, by location. Here's how you'd do it -
Individual people go in separate hash with keys persons:12321
Create a set for each group by - For example : persons_by:department - and only store the numeric identifiers for each person in this set. For example [12321, 43432]. This way, you get the advantages of Redis' Integer Set
Efficient The method explained above is pretty efficient memory wise. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision Free This depends on your application. Each User or Person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
You mentioned two problems with this approach, and I will try to address them
What if the id has a colon?
It is of course possible, but your application's design should prevent it. Its best not to allow special characters in identifiers - because they will be used across multiple systems. For example, the identifier will very likely be a part of the URL, and colon is a reserved character even for urls.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
Size Efficiency
There is a cost to long keys, however it isn't too much. In general, you should worry about the data size of your values rather than the keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.
I am making a mobile iOS app. A user can create an account, and upload strings. It will be like twitter, you can follow people, have profile pictures etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3, and the keys on a DataBase, listing Amazon S3 keys is slow. So which would be better for storing keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points are correct to my understanding, DynamoDB is more about killer. speed and scalability, SimpleDB is more about querying and price (still delivering good performance). But if you look at it this way, which will be faster, downloading ALL keys from DynamoDB, or doing a select query with SimpleDB... hard right? One is using a blazing fast database to download a lot (and then we have to match them), and the other is using a reasonably good-performance database to query and download the few correct objects. So, which is faster:
DynamoDB downloading everything and matching OR SimpleDB querying and downloading that
(NOTE: Matching just means using -rangeOfString and string comparison, nothing power consuming or non-time efficient or anything server side)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. If you are referencing to an account object
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
I have the randomly generated key for uniqueness, it does not refer to anything, it is simply there so that keys are not repeated therefore overlapping two objects.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB, store keys with the format but not use the format for any reference, instead use attributes stored with SimpleDB. So, I store the key with attributes like username, type and maybe others I would also have to include in the key format. So if I want to get the account object from user 'Rohan'. I just use SimpleDB Select to query the attribute 'username' and the attribute 'type'. (where I match for 'account')
DynamoDB, store keys and each key will have the illustrated format. I scan the whole database returning every single key. Then get the key and take advantage of the key format, I can use -rangeOfString to match the ones I want and then download from S3.
Also, SimpleDB is apparently geographically-distributed, how can I enable that though?
So which is quicker and more reliable? Using SimpleDB to query keys with attributes. Or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question, and the amount of objects in the database will vary on the decided answer, should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There would be different advantages and disadvantages points between these two options, obviously. For example, it would be quicker to retrieve if it is all separate, but it is also more organized and less large of a dataset for storing it in one users account.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
Ok, lets discuss some aspects:
S3
S3 Performance is low most likely as you're not adding a Prefix for Listing Keys.
If you sharding by storing the objects like: type/owner/id, listing all the ids for a given owner (prefixed as type/owner/) will be fast. Or at least, faster than listing everything at once.
Dynamo Versus SimpleDB
In general, thats my advice:
Use SimpleDB when:
Your entity storage isn't going to pass over 10GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can leverage from Multi-Valued Data Types
Use DynamoDB when:
Your entity storage will pass 10GB
You want to scale demand / throughput as it goes
Your queries and model is well-defined, and unlikely to change.
Your model is dynamic, involving a loose schema
You can cache on your client-side your queries (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
- Your model isn't completely defined
- You can defer some decision aspects, since it takes a while to hit the (10GiB) limits
Geographical SimpleDB
It doesn't support. It works only from us-east-1 afaik.
Key Naming
This applies most to Dynamo: Whenever you can, use Hash + Range Key. But you could also create keys using Hash, and apply some queries, like:
List all my records on table T which starts with accountid:
List all my records on table T which starts with accountid:image
However, those are Scans at all. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you
I want to send unique references to the client so that they client can refer back to specific objects. The encoded keys appengine provides are sometimes 50 bytes long, and I probably only need two or three bytes (I could hope to need four or five, but that won't be for a while!).
Sending the larger keys is actually prohibitively expensive, since I might be sending 400 references at a time.
So, I want to map these long keys to much shorter keys. An obvious solution is to store a mapping in the datastore, but then when I'm sending 400 objects I'm doing 400 additional queries, right? Maybe I mitigate the expense by keeping copies of the mappings in memcache as well. Is there a better way?
Can I just yank the number out of the unencoded keys that appengine creates and use that? I only need whatever id I use to be unique per entity kind, not across the whole app.
Thanks,
Riley
Datastore keys include extra information you don't need - like the app ID. So you definitely do not need to send the entire keys.
If these references are to a particular Kind in your datastore, then you can do even better and just send the key_name or numeric ID (whichever your keys use). If the latter is the case, then you could transmit each key with just a few bytes (you could opt for either a variable-length or fixed-length integer encoding depending on which would be more compact for your specific case [probably the former until most of the IDs you're sending get quite large]).
When you receive these partial keys back from the user, it should be easy to reconstruct the full key which you need to retrieve the entities from the datastore. If you are using the Python runtime, you could use db.Key.from_path(kind_name, numeric_id_or_key_name).
A scheme like this should be both simpler and (a lot) faster than trying to use the datastore/memcache to store a custom mapping.
You don't need a custom mapping mechanism. Just use entity key names to store your short identifier :
entity = MyKind(key_name=your_short_id)
entity.put()
Then you can fetch these short identitiers in one query :
keys = MyKind.all(keys_only=True).filter(...).fetch(400)
short_ids = [key.name() for key in keys]
Finally, use MyKind.get_by_key_name(short_id) in order to retrieve entities from identifiers sent back by your users.