I want to send unique references to the client so that the client can refer back to specific objects. The encoded keys App Engine provides are sometimes 50 bytes long, and I probably only need two or three bytes (I could hope to need four or five, but that won't be for a while!).
Sending the larger keys is actually prohibitively expensive, since I might be sending 400 references at a time.
So, I want to map these long keys to much shorter keys. An obvious solution is to store a mapping in the datastore, but then when I'm sending 400 objects I'm doing 400 additional queries, right? Maybe I mitigate the expense by keeping copies of the mappings in memcache as well. Is there a better way?
Can I just yank the number out of the unencoded keys that App Engine creates and use that? I only need whatever ID I use to be unique per entity kind, not across the whole app.
Thanks,
Riley
Datastore keys include extra information you don't need - like the app ID. So you definitely do not need to send the entire keys.
If these references are to a particular Kind in your datastore, then you can do even better and just send the key_name or numeric ID (whichever your keys use). If the latter is the case, then you could transmit each key with just a few bytes (you could opt for either a variable-length or fixed-length integer encoding depending on which would be more compact for your specific case [probably the former until most of the IDs you're sending get quite large]).
When you receive these partial keys back from the user, it should be easy to reconstruct the full key, which you need to retrieve the entities from the datastore. If you are using the Python runtime, you could use db.Key.from_path(kind_name, numeric_id_or_key_name).
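For example, a minimal sketch with the db API (MyKind, short_id_from_client and ids_from_client are just placeholder names):

from google.appengine.ext import db

# Rebuild the full key from the kind name and the short numeric ID sent by the client
key = db.Key.from_path('MyKind', int(short_id_from_client))
entity = db.get(key)

# Batch version: reconstruct many keys and fetch them in one call
keys = [db.Key.from_path('MyKind', int(i)) for i in ids_from_client]
entities = db.get(keys)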
A scheme like this should be both simpler and (a lot) faster than trying to use the datastore/memcache to store a custom mapping.
You don't need a custom mapping mechanism. Just use entity key names to store your short identifier:
entity = MyKind(key_name=your_short_id)
entity.put()
Then you can fetch these short identifiers in one query:
keys = MyKind.all(keys_only=True).filter(...).fetch(400)
short_ids = [key.name() for key in keys]
Finally, use MyKind.get_by_key_name(short_id) in order to retrieve entities from identifiers sent back by your users.
I want to move an SQL database to Cloud Datastore. The SQL database uses integer IDs, while Datastore uses string key names. I know there is a way to allocate IDs and such, but there is no need for this with string key names. So I could simply convert the integer ID to a string and use that as the key name:
key('Entity', '342353425')
Is there some problem with this approach? I guess it still provides good lookup performance when App Engine looks entities up by string key name.
Auto-allocated IDs are generated so as to be random and evenly distributed. If you use your own custom IDs, you must make sure they are not monotonically increasing values; as pointed out before, that can lead directly to Datastore latency.
This document describes the best practices for Datastore, and specifically the best practices regarding keys: Best Practices on Datastore.
"If an application generates large traffic, such sequential numbering could lead to hotspots that impact Datastore latency. To avoid the issue of sequential numeric IDs, obtain numeric IDs from the allocateIds() method. The allocateIds() method generates well-distributed sequences of numeric IDs."
By default, datastore gives each entity an integer ID. You have the option to specify a string name instead of this integer ID or to specify your own integer ID.
If you specify your own string name or integer ID, then it can harm the ability of your app to scale to a large number of entities.
Essentially, Datastore scaling requires that names or IDs be distributed across a range rather than sequential or clustered (and it probably has other requirements I'm not aware of).
If your SQL IDs are sequential then your app won't scale well. If the SQL IDs look random, then it should be ok.
I am new to key-value stores. My objective is to use an embedded key-value store to keep the persistent data model. The data model comprises a few related tables if designed with a conventional RDBMS. I was checking a Medium article on modeling a table for a key-value store. Although the article uses LevelDB with Java, I am planning to use RocksDB or FASTER with C++ for my work.
It uses a scheme where one key is used for every attribute of each row, like the following example.
$table_name:$primary_key_value:$attribute_name = $value
The above is fine for point lookups, when user code knows exactly which key to get. But there are scenarios like searching for users with the same email address, searching for users above a certain age, or searching for users of one specific gender. For search scenarios the article performs a linear scan through all keys. In each iteration it checks the pattern of the key and applies the business logic (checking the value for a match) once a key with a matching pattern is found.
Such searching seems inefficient, and in the worst case it needs to traverse the entire store. To solve that, a reverse lookup table is required. My questions are:
How to model the reverse lookup table? Is it some sort of reinvention of the wheel? Is there any alternative way?
One solution that readily comes to mind is to have a separate store for each indexable property, like the following.
$table_name:$attribute_name:$value_1 = $primary_key_value
With this approach the immediate question is
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
As an immediate solution, instead of storing a single value, an array of multiple primary keys can be stored, as shown below.
$table_name:$attribute_name:$value_1 = [$primary_key_value_1, ... , $primary_key_value_N]
But this type of modeling requires user code to parse the array from a string and serialize it back to a string after each manipulation (assuming the underlying key-value store is not aware of array values).
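For example, a minimal sketch of that read-modify-write cycle in Python, with a plain dict standing in for the key-value store (all names are illustrative):

import json

def add_to_index(kv, table, attr, value, primary_key):
    # Read the current JSON-encoded list of primary keys (or start empty)
    index_key = '%s:%s:%s' % (table, attr, value)
    raw = kv.get(index_key)
    pks = json.loads(raw) if raw else []

    # Append the new pk and serialize the whole list back
    if primary_key not in pks:
        pks.append(primary_key)
    kv[index_key] = json.dumps(pks)

# Usage, with a dict standing in for the store
store = {}
add_to_index(store, 'user', 'email', 'a@b.com', 'user_42')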
Is it efficient to store multiple keys as an array value, or is there some vendor-provided, more efficient way?
Assuming that the stringified-array design works, there has to be such an index for each indexable property. So this gives fine-grained control over what to index and what not to index. The next design decision that comes to mind is where these indexes will be stored:
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
For this question I don't have a clue, because both approaches require more or less the same amount of I/O. However, a single large data file keeps more on disk and less in memory (so more I/O), whereas with multiple files more can stay in memory, so fewer page faults. This assumption could be totally wrong depending on the architecture of the specific key-value store. At the same time, having too many files turns into the problem of managing a complicated file structure. Also, maintaining indexes requires transactions for insert, update and delete operations. With multiple files a single logical update touches multiple trees, whereas with a single file multiple updates happen within one tree.
Are transactions, more specifically transactions involving multiple stores/files, supported?
Besides the indexes, there is some metadata of the table that also needs to be kept along with the table data. To generate a new primary key (auto-incremented), it is necessary to know the last row number or last primary key generated, because something like COUNT(*) won't work. Additionally, as not all properties are indexed, the metadata may include which properties are indexed and which are not.
How to store the meta information of each table?
Again, the same set of questions appears for the meta table: e.g., should the meta be a separate store/file? Additionally, as we have noticed that not all properties are indexed, we may even decide to store each row as a JSON-encoded value in the data store and keep that alongside the index stores. The underlying key-value store will treat that JSON as an opaque string value, like the following.
$table_name:data:$primary_key_value = {$attr_1_name: $attr_1_value, ..., $attr_N_name: $attr_N_value}
...
$table_name:index:$attribute_name = [$primary1, ..., $primaryN]
However, reverse lookups are still possible through the indexes pointing to the primary key.
Are there any drawbacks of using JSON-encoded values instead of storing all properties as separate keys?
So far I could not find any drawbacks of this method, other than forcing the user to use JSON encoding, and some heap allocation for JSON encoding/decoding.
The problems mentioned above are not specific to any particular application. They are generic enough to apply to any development using a key-value store. So it is essential to know whether there is any reinvention of the wheel.
Is there any de facto standard solution to all the problems mentioned in the question? Do the solutions differ from the ones stated in the question?
How to model the reverse lookup table? Is it some sort of reinvention of the wheel? Is there any alternative way?
All the ways you describe are valid ways to create an index.
It is not reinventing the wheel with RocksDB, because RocksDB does not support indexes.
It really depends on the data; in general you will need to copy the indexed value and the primary key into another keyspace to create the index.
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
You can serialize the pks using JSON (or something else). The problem with that approach appears when the list of pks grows very large (which might or might not be a concern for you).
Is it efficient to store multiple keys as an array value, or is there some vendor-provided, more efficient way?
With RocksDB, you have nothing that will make it "easier".
You did not mention the following approach:
$table_name:$attribute_name:$value_1:$primary_key_value_1 = ""
$table_name:$attribute_name:$value_1:$primary_key_value_2 = ""
...
$table_name:$attribute_name:$value_1:$primary_key_value_n = ""
Here the value is empty and the indexed pk is part of the key; you then retrieve all pks for a given value with a prefix scan over $table_name:$attribute_name:$value_1:.
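A minimal sketch of that pattern in Python, with a sorted in-memory list standing in for the store's ordered key space (a real implementation would seek a RocksDB iterator to the prefix; names are illustrative):

import bisect

# Ordered key space, as an LSM/B-tree-backed store would keep it
keys = sorted([
    'user:email:a@b.com:user_17',
    'user:email:a@b.com:user_42',
    'user:email:c@d.com:user_99',
])

def pks_for(table, attr, value):
    # Seek to the first key with the wanted prefix, then walk forward
    prefix = '%s:%s:%s:' % (table, attr, value)
    i = bisect.bisect_left(keys, prefix)
    result = []
    while i < len(keys) and keys[i].startswith(prefix):
        result.append(keys[i][len(prefix):])  # the pk is the key suffix
        i += 1
    return result

print(pks_for('user', 'email', 'a@b.com'))  # ['user_17', 'user_42']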
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
It depends on the key-value store. With RocksDB, if you need transactions, you must stick to one DB file.
Are transactions, more specifically transactions involving multiple stores/files, supported?
Only Oracle Berkeley DB and WiredTiger support that feature.
How to store the meta information of each table?
Metadata can be kept in the database or in the code.
Are there any drawbacks of using JSON-encoded values instead of storing all properties as separate keys?
Yeah, as I said above, if you encode all pks into a single value, it might lead to problems downstream when the number of pks is large. For instance, you need to read the whole list to do pagination.
Is there any de facto standard solution to all the problems mentioned in the question? Do the solutions differ from the ones stated in the question?
To summarize:
With RocksDB, use a single database file.
In the index, encode the primary key inside the key and leave the value empty, so you can paginate.
One decision that I have run into a few times is how to handle passing around either the keys or the embedded IDs of entities. Each seems equally feasible given the encoders and marshalling methods built into datastore keys, but I was wondering if there is any sort of best practice on this choice. An example might be a URL for accessing a user's files, where users have the default auto-generated numerical IDs, of the form: website.com/users/{userIdentifier}/files
I am trying to determine whether the number embedded in the datastore keys is preferable to the actual key strings themselves. Is it safe to have datastore keys out in the wild? I would like to standardize the way we handle those identifiers across our system and was wondering if there are any best practices on this.
The only reason to use a full Key as opposed to an identifier is to get the ancestor information embedded in the key itself without passing additional data. While this may be convenient in some cases, I don't think it's a big enough advantage to use keys as a standard method of reference within an app.
The advantages of using an identifier are more substantial: (a) they are much smaller, and (b) they do not reveal any information about their ancestors (which may or may not be an issue).
The smaller size comes into play quite often: you may want to use an id in a URL, hold a list of ids in a memcache (which has a 1MB limit), etc.
Datastore keys contain (at least) next information:
Kind
Reference to ancestor
String or Int ID
Do you really need/want to pass the app ID & kind in a URL or keep them in your DB?
Compare these 2 URLs (logically; in the case of a key it would probably be encoded with urlsafe()):
/list-of-orders?user=123
/list-of-orders?user=User/123
Or these 2 fields:
Table: Orders
---------------------
| UserKey  | UserID |
---------------------
| User/123 | 123    |
---------------------
Why would you want to keep & pass around repetitive information about the app & kind? Usually your app references its own entities, and the kind is known from the column or parameter name.
Unless you are building some orchestration/integration between a few apps, it's more effective to use just the IDs.
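As a small sketch (using the legacy App Engine ndb library; the User kind and the value 123 are just placeholders):

from google.appengine.ext import ndb

user_key = ndb.Key('User', 123)

# Full key, URL-encoded: long, and it embeds the app ID and kind
url_with_key = '/list-of-orders?user=%s' % user_key.urlsafe()

# Bare numeric ID: short, and the kind is implied by the parameter name
url_with_id = '/list-of-orders?user=%d' % user_key.id()

# Rebuilding the full key from the bare ID on an incoming request
rebuilt_key = ndb.Key('User', int('123'))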
We are trying to define a policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (try to use as little space as possible in the DB for keys, to allow for as much data as possible)
As collision-less as possible (avoid keys for two different objects ever being equal)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>). The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if an id has the character : in it?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like Ordered Set or Hash) to store namespaces. The main drawback to this would be loss of "shardability", since the structure to store the namespaces would need to be in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1 - i.e. namespaces separated by a character such as a colon. That said, the namespaces are almost always one level deep. For example: person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can go to a different shard or the same shard, depending on how you set it up.
Namespaced - Namespaces as a way to avoid collisions work with this approach. However, namespaces as a way to group keys don't work out. In general, using keys as a way to group data is a bad idea. For example, what if the person moves from one department to another? If you change the key, you will have to update all references - and that gets tricky.
Its best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, let's say you want to group people by department, by salary range, or by location. Here's how you'd do it -
Individual people go into separate hashes, with keys like persons:12321
Create a set for each group-by - for example: persons_by:department - and store only the numeric identifiers for each person in this set, for example [12321, 43432]. This way, you get the advantages of Redis' integer set encoding (see the sketch after this list).
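A minimal sketch of that layout with redis-py (the hash fields and the department set name are illustrative):

import redis

r = redis.Redis()

# Each person lives in its own hash, keyed by a primary key that never changes
r.hset('persons:12321', mapping={'name': 'Alice', 'department': 'engineering'})

# Group membership is a separate set holding only numeric identifiers
r.sadd('persons_by:department:engineering', 12321)

# "Everyone in engineering" is then a set read, not a key-pattern scan
member_ids = r.smembers('persons_by:department:engineering')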
Efficient - The method explained above is pretty memory-efficient. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision Free - This depends on your application. Each User or Person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
You mentioned two problems with this approach, and I will try to address them
What if the id has a colon?
It is of course possible, but your application's design should prevent it. It's best not to allow special characters in identifiers, because they will be used across multiple systems. For example, the identifier will very likely be part of a URL, and the colon is a reserved character even in URLs.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
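For instance, a tiny wrapper along these lines, using the standard library's URL encoding (the key layout is just an example):

from urllib.parse import quote, unquote

def make_key(namespace, raw_id):
    # Percent-encode the identifier so an embedded ':' cannot break the key format
    return '%s:%s' % (namespace, quote(raw_id, safe=''))

def split_key(key):
    namespace, _, encoded_id = key.partition(':')
    return namespace, unquote(encoded_id)

print(make_key('person', 'dept:42'))   # person:dept%3A42
print(split_key('person:dept%3A42'))   # ('person', 'dept:42')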
Size Efficiency
There is a cost to long keys; however, it isn't too much. In general, you should worry about the data size of your values rather than the keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.
I am making a mobile iOS app. A user can create an account and upload strings. It will be like Twitter: you can follow people, have profile pictures, etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database, since listing Amazon S3 keys is slow. So which would be better for storing the keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points are correct to my understanding: DynamoDB is more about killer speed and scalability; SimpleDB is more about querying and price (while still delivering good performance). But if you look at it this way, which will be faster: downloading ALL keys from DynamoDB, or doing a select query with SimpleDB... hard, right? One uses a blazing-fast database to download a lot (and then we have to match them), and the other uses a reasonably good-performance database to query and download only the few correct objects. So, which is faster:
DynamoDB downloading everything and matching OR SimpleDB querying and downloading that
(NOTE: matching just means using -rangeOfString and string comparison; nothing power-consuming, time-inefficient, or server-side)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. if you are referring to an account object:
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
The randomly generated key is there for uniqueness; it does not refer to anything, it simply ensures that keys are not repeated and therefore that two objects never overlap.
In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB: store keys in the format above but don't use the format for any reference; instead use attributes stored with SimpleDB. So I store the key with attributes like username, type and maybe others I would otherwise have to include in the key format. Then, if I want to get the account object for user 'Rohan', I just use a SimpleDB Select to query the attribute 'username' and the attribute 'type' (where I match for 'account').
Use DynamoDB: store keys, each in the illustrated format. I scan the whole database, returning every single key. Then, taking advantage of the key format, I can use -rangeOfString to match the ones I want and then download them from S3.
Also, SimpleDB is apparently geographically distributed; how can I enable that, though?
So which is quicker and more reliable: using SimpleDB to query keys by attributes, or using DynamoDB to store all keys, scan (download all keys), and match using e.g. -rangeOfString? Mind the fact that these are just short keys that are pointers to S3 objects.
Here is my last question (the number of objects in the database will depend on the decided answer): should I:
Create a separate key/object for every single object a user has
Create an account key/object and store all information inside there
There would obviously be different advantages and disadvantages between these two options. For example, it would be quicker to retrieve if everything is separate, but storing it all in one account object is more organized and makes for a smaller dataset.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a Question :)
OK, let's discuss some aspects:
S3
S3 performance is most likely low because you're not adding a prefix when listing keys.
If you shard by storing the objects like type/owner/id, listing all the IDs for a given owner (prefixed as type/owner/) will be fast. Or at least faster than listing everything at once.
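A minimal sketch of such a prefixed listing (using boto3 and a made-up bucket name; the original setup likely used an older SDK, so treat this as illustrative):

import boto3

s3 = boto3.client('s3')

# Only keys under 'ProfilePic/Rohan/' are enumerated, not the whole bucket
resp = s3.list_objects_v2(Bucket='my-app-bucket', Prefix='ProfilePic/Rohan/')
keys = [obj['Key'] for obj in resp.get('Contents', [])]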
Dynamo Versus SimpleDB
In general, that's my advice:
Use SimpleDB when:
Your entity storage isn't going to pass over 10GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can leverage Multi-Valued Data Types
Use DynamoDB when:
Your entity storage will pass 10GB
You want to scale demand / throughput as it goes
Your queries and model are well-defined and unlikely to change.
Your model is dynamic, involving a loose schema
You can cache on your client-side your queries (so you can save on throughput by querying the cache prior to Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
- Your model isn't completely defined
- You can defer some decision aspects, since it takes a while to hit the (10GiB) limits
Geographical SimpleDB
It isn't supported. It works only from us-east-1, AFAIK.
Key Naming
This applies most to Dynamo: Whenever you can, use Hash + Range Key. But you could also create keys using Hash, and apply some queries, like:
List all my records on table T that start with accountid:
List all my records on table T that start with accountid:image
However, those are still Scans. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
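To illustrate the Hash + Range Key advice, here is a hedged sketch using boto3 (which postdates this question); the table and attribute names are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Objects')  # hypothetical table

# With accountUsername as the hash key and objectKey as the range key,
# a prefix listing becomes a Query on one partition instead of a full-table Scan
resp = table.query(
    KeyConditionExpression=Key('accountUsername').eq('Rohan') &
                           Key('objectKey').begins_with('ProfilePic:')
)
items = resp['Items']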
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you