I have an entity with two properties: UserId (String) and RSSSubscriptions (String). Instances of this class will be stored in the App Engine Datastore.
RSSSubscriptions should hold key-value pairs like "Site1: Feed1", "Site2: Feed2".
Since datatypes like HashMap are not persistable, I am forced to keep this data as a String. Currently I store it as a string in JSON array format, e.g. [{"Site1": "Feed1"}, {"Site2": "Feed2"}].
My client will be an Android app, so I am supposed to parse this string as a JSON array on the client side. But I think it is a bad idea to build a JSON-formatted string and append it to the existing string each time the user adds a new subscription. Any better ideas?
You can use JsonProperty, which ndb supports for exactly this reason. In my opinion it is a "hairy" solution to store JSON as a string and parse it back and forth; you have to be very careful to guarantee validity.
The correct answer depends on several factors, with the expected number of pairs being the most important. Remember that there are significant costs associated with storing each pair in its own entity accessed by query: a query incurs numerous operation costs and significant CPU time. Compare this to using a single record keyed by user id and storing the JSON inside a TextProperty: that is one small operation, with CPU time likely 10x less than a query.
Please consider these factors when deciding whether to go with the technically cleaner approach of querying entities. Myself, I would always use a serialized string inside a TextProperty for anything in the "thousands of pairs" range, unless there was a very high rate of deletions (and even then the string approach could well be better). Using a query is generally the last design choice for GAE, given its high resource costs.
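For concreteness, here is a minimal sketch of that single-record approach, assuming the Java low-level datastore API and org.json are available; the kind name UserFeeds and the property name subscriptions are made up for illustration. The Android client can parse the same blob back with new JSONObject(json).

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Text;
import org.json.JSONException;
import org.json.JSONObject;

public class SubscriptionStore {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    /** Adds one site/feed pair and rewrites the whole JSON blob with a single put(). */
    public void addSubscription(String userId, String site, String feed) throws JSONException {
        Entity entity;
        try {
            // One entity per user, keyed by the user id (the kind name is illustrative).
            entity = datastore.get(KeyFactory.createKey("UserFeeds", userId));
        } catch (EntityNotFoundException e) {
            entity = new Entity("UserFeeds", userId);
        }
        Text stored = (Text) entity.getProperty("subscriptions");
        JSONObject subs = (stored == null) ? new JSONObject() : new JSONObject(stored.getValue());
        subs.put(site, feed);                            // {"Site1":"Feed1","Site2":"Feed2",...}
        entity.setProperty("subscriptions", new Text(subs.toString()));
        datastore.put(entity);                           // one small write, no query needed
    }
}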
I want to move an SQL database to Cloud Datastore. The SQL database uses integer ids, while Datastore uses string key names. I know there is a way to allocate ids and so on, but there is no need for that with string key names. So I could simply convert the integer id to a string and use that as the key name:
key('Entity', '342353425')
Is there some problem with this approach? I guess it still provides good lookup performance when App Engine looks entities up by string key names.
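For what it's worth, the Java low-level API equivalent of that snippet would be a one-liner along these lines (the kind name "Entity" simply mirrors the question):

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class KeyNameExample {
    public static Key keyForSqlId(long sqlId) {
        // Use the stringified SQL id as the key name; a lookup by full key stays a
        // straight get(), just as with a numeric id.
        return KeyFactory.createKey("Entity", Long.toString(sqlId));
    }
}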
Key names are allocated so as to be random and evenly distributed. If you are using your own custom IDs, you must be sure that they are not monotonically increasing values; as pointed out before, that can lead directly to Datastore latency.
This document describes the best practices for Datastore, and specifically the best practices regarding keys: Best Practices on Datastore.
"If an application generates large traffic, such sequential numbering could lead to hotspots that impact Datastore latency. To avoid the issue of sequential numeric IDs, obtain numeric IDs from the allocateIds() method. The allocateIds() method generates well-distributed sequences of numeric IDs."
By default, Datastore gives each entity an integer ID. You have the option to specify a string name instead of this integer ID, or to specify your own integer ID.
If you specify your own string name or integer ID, it can harm the ability of your app to scale to a large number of entities.
Essentially, the scaling of Datastore requires that names or IDs be distributed across a range and not be sequential or clustered (and probably other things I'm not aware of).
If your SQL IDs are sequential, your app won't scale well. If the SQL IDs look random, it should be OK.
When you insert an entity into the datastore with an @Id Long id; property, the datastore automatically creates a random (or seemingly random) Long value as the id, something like 5490350115034675.
I would like to set the Long id myself but have it be randomly generated from datastore.
I found this piece of code that seems to do just that:
Key<MyEntity> entityKey = factory().allocateId(MyEntity.class);
Long commentId = entityKey.getId();
Then I can pass in the commentId into the constructor of MyEntity and subsequently save it to the datastore.
When I do that, however, I do not seem to get a randomly generated id; it seems to follow some weird pattern where the first allocated id is 1, the next one is 10002, then 20001, and so on.
Not sure what all that means and if it is safe to continue using... Is this the only way to do this?
When you use autogenerated ids (i.e. Long), GAE uses the 'scattered' id generator, which gives you ids from a broad range of the keyspace. This is because high-volume writing (thousands per second) of more-or-less contiguous values in an index results in a lot of table splitting, hurting performance.
When you use allocateId(), you get an id from the older allocator that was used before scattered ids. They aren't necessarily contiguous or monotonic but they tend to start small and grow.
You can mix and match; allocations will never conflict.
I presume, however, that you want random-looking ids because you want them to be hard to guess. Despite their appearance at first glance, the scattered id allocator does not produce unguessable ids. If you want sparse ids that will prevent someone from scanning your keyspace, you need to explicitly add a random element. Or just use UUID.randomUUID() in the first place.
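If unguessability is the actual requirement, a UUID-based key is the simplest sketch (this is my suggestion, not something the scattered allocator gives you):

import java.util.UUID;

public class UnguessableIds {
    // Returns an effectively unguessable id, e.g. "3f1c1d6e-9b2a-4c8e-...".
    public static String newId() {
        return UUID.randomUUID().toString();
    }
}

With Objectify you could store such a value in an @Id String field instead of an @Id Long, so it becomes the entity's key name.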
App Engine allocates IDs using its own internal algorithm, designed to improve datastore performance. I would trust the App Engine team to do their magic.
Introducing your own scheme for allocating IDs is not as simple as it sounds: you have to account for eventual consistency, etc. And it's unlikely that you will gain anything, performance-wise, from all this effort.
I am making a mobile iOS app. A user can create an account and upload strings. It will be like Twitter: you can follow people, have profile pictures, etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.
I am storing the actual objects on Amazon S3 and the keys in a database, since listing Amazon S3 keys is slow. So which would be better for storing keys?
This is my knowledge of SimpleDB and DynamoDB:
SimpleDB:
Cheap
Performs well
Designed for small/medium datasets
Can query using select expressions
DynamoDB:
Costly
Extremely scalable
Performs great; millisecond response
Cannot query
These points match my understanding: DynamoDB is more about killer speed and scalability, SimpleDB is more about querying and price (while still delivering good performance). But if you look at it this way, which will be faster: downloading ALL keys from DynamoDB, or doing a select query with SimpleDB? Hard, right? One uses a blazing-fast database to download a lot (and then we have to match them); the other uses a reasonably good-performance database to query and download only the few correct objects. So, which is faster:
DynamoDB downloading everything and matching, OR SimpleDB querying and downloading just the matches?
(NOTE: matching just means using -rangeOfString and string comparison; nothing power-consuming, time-inefficient, or server-side.)
My S3 keys will use this format for every type of object
accountUsername:typeOfObject:randomGeneratedKey
E.g. if you are referencing an account object:
Rohan:Account:shd83SHD93028rF
Or a profile picture:
Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh
The randomly generated key is there for uniqueness; it does not refer to anything, it is simply there so that keys are never repeated and two objects never overlap.
In my app I can choose either SimpleDB or DynamoDB, so here are the two options:
Use SimpleDB: store keys in the format above, but don't use the format for any reference; instead use attributes stored with SimpleDB. So I store the key with attributes like username, type, and whatever else I would otherwise have to include in the key format. If I want to get the account object for user 'Rohan', I just use a SimpleDB Select to query the 'username' attribute and the 'type' attribute (where I match for 'account').
Use DynamoDB: store keys, each with the format illustrated above. I scan the whole database, returning every single key, then take advantage of the key format: I can use -rangeOfString to match the ones I want and then download them from S3.
Also, SimpleDB is apparently geographically distributed; how can I enable that, though?
So which is quicker and more reliable: using SimpleDB to query keys by attributes, or using DynamoDB to store all keys, scan (download all keys), and match using e.g. -rangeOfString? Keep in mind that these are just short keys that are pointers to S3 objects.
Here is my last question (the number of objects in the database will depend on the answer chosen): should I
Create a separate key/object for every single object a user has, or
Create one account key/object and store all the information inside it?
There are obviously advantages and disadvantages to each option. For example, retrieval would be quicker if everything is separate, but storing it all in one user's account object is more organized and keeps the dataset smaller.
So what do you think?
Thanks for the help! I have put a bounty on this, really need an answer ASAP.
Wow! What a question :)
OK, let's discuss some aspects:
S3
S3 performance is most likely low because you're not adding a prefix when listing keys.
If you shard by storing the objects like type/owner/id, listing all the ids for a given owner (with the prefix type/owner/) will be fast, or at least faster than listing everything at once.
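A quick sketch with the AWS SDK for Java (the bucket and the Account/Rohan/ prefix are just placeholders):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class PrefixedListing {
    public static void listOwnerObjects(AmazonS3 s3, String bucket) {
        // Only keys under "Account/Rohan/" are returned, not the whole bucket.
        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName(bucket)
                .withPrefix("Account/Rohan/");
        ObjectListing listing = s3.listObjects(request);
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            System.out.println(summary.getKey());
        }
    }
}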
Dynamo Versus SimpleDB
In general, here's my advice:
Use SimpleDB when:
Your entity storage isn't going to exceed 10 GB
You need to apply complex queries involving multiple fields
Your queries aren't well defined
You can leverage from Multi-Valued Data Types
Use DynamoDB when:
Your entity storage will exceed 10 GB
You want to scale demand / throughput as it goes
Your queries and model are well-defined and unlikely to change
Your model is dynamic, involving a loose schema
You can cache your queries on the client side (so you can save on throughput by checking the cache before hitting Dynamo)
You want to do aggregate/rollup summaries, by using Atomic Updates
Given your current description, it seems SimpleDB is actually better, since:
Your model isn't completely defined
You can defer some decisions, since it takes a while to hit the 10 GB limit
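As a rough sketch of the SimpleDB option with the AWS SDK for Java (the domain name Keys and the attribute names are assumptions, and real code should escape the user input):

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;
import java.util.List;

public class AccountKeyLookup {
    // Finds the S3 key item(s) for a user's account object via a SimpleDB select.
    public static List<Item> findAccountItems(AmazonSimpleDB simpleDb, String username) {
        String expression = "select * from `Keys` where username = '" + username
                + "' and type = 'Account'";
        return simpleDb.select(new SelectRequest(expression)).getItems();
    }
}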
Geographical SimpleDB
It isn't supported; SimpleDB works only from us-east-1, AFAIK.
Key Naming
This applies mostly to Dynamo: whenever you can, use a Hash + Range key. But you could also create keys using just a hash and apply some queries, like:
List all my records on table T whose key starts with accountid:
List all my records on table T whose key starts with accountid:image
However, those are Scans after all. Bear that in mind.
(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)
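If you do use a Hash + Range key, the same lookup becomes a Query instead of a Scan. For illustration, in the AWS SDK for Java it might look like this; the table name T, the key names accountId/objectKey, and the image: prefix are assumptions:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import java.util.HashMap;
import java.util.Map;

public class HashRangeQuery {
    // Lists every record of one account whose range key starts with "image:".
    public static QueryResult listAccountImages(AmazonDynamoDB dynamo, String accountId) {
        Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
        values.put(":a", new AttributeValue(accountId));
        values.put(":p", new AttributeValue("image:"));
        QueryRequest request = new QueryRequest()
                .withTableName("T")
                .withKeyConditionExpression("accountId = :a and begins_with(objectKey, :p)")
                .withExpressionAttributeValues(values);
        return dynamo.query(request);   // a key-condition Query, not a full-table Scan
    }
}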
Bonus Track
If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to map blob fields to S3. So give it a look:
http://bitbucket.org/ingenieux/cloudy
Thank you
Recently we decided it would benefit us if the IDs of our Datastore entities weren't so big. The biggest reason is that we use these IDs in URLs that we'd like to keep nice and short.
Currently, as an example, the IDs of our entities grow like this:
id=2
id=2003
id=2004
id=2027
id=2028
id=5002
id=5204
id=6001
id=7534
id=8001
id=10192
id=11306
id=14306
id=16330
id=18306
id=20321
id=41312
id=79306
id=113308
id=113311
etc.
As you can see, sometimes the increase is in the tens of thousands.
Now, we could cope with all this hassle by creating a sharded counter big enough to count our entities and then assigning the IDs ourselves, but I would still prefer it if the Datastore assigned the keys for us.
Is there any way of telling the Datastore to re-calculate the available IDs, so that next time I'd store an entity, it would get the lowest available ID? They don't need to be sequential in our case.
UPDATE:
As @Amber suggested, we could encode the IDs in base62 to make them shorter (at most 11 characters for 64-bit unsigned ints).
While this approach is not too bad, it has a few disadvantages. First, I'm not sure how good the resulting UX is. Second, some encoded IDs would clash with other strings that we currently use in URLs.
As an example:
/books/(\d+)(/book-name)?
/books/selection
The book with id 26086738530 would have the URLs '/books/selection/book-name' and '/books/selection', clashing with our other page.
I'm afraid there isn't a mechanism in the Datastore that lets you control the automatic id creation.
How many objects do you estimate you will have over the project's lifetime? Long ids seem like a hassle now, but they might be necessary anyway once you have tens of thousands of objects in the store.
As for base62, you can route base62 ids through a different URL.
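A minimal base62 encoder sketch in Java; the alphabet order is an arbitrary choice, and it is exactly what determines which numeric ids collide with word-like strings such as "selection":

public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes a non-negative id, e.g. 113311 -> "ttB" with this alphabet.
    public static String encode(long id) {
        if (id == 0) {
            return "0";
        }
        StringBuilder out = new StringBuilder();
        while (id > 0) {
            out.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return out.reverse().toString();
    }
}

Routing the encoded ids through a distinct path, e.g. something like /books/id/<encoded>, keeps them from ever clashing with reserved words like 'selection'.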
I've been getting more into App Engine and am a little concerned about the space used per object in the datastore (I'm using Java). It looks like, for each record, the names of the object's fields are encoded as part of the object. Therefore, if I have lots of tiny records, the additional space used for encoding field names could grow into a significant portion of my datastore usage. Is this true?
If true, I can't think of any good strategies around this. For example, in a game application I want to store a record each time a user makes a move, which will result in a lot of little objects being stored. If I were instead to use a single object that stored all the moves a user makes, then serialize/deserialize time would increase as the move list grows:
So:
// Lots of these?
class Move {
    String username;
    String moveaction;
    long timestamp;
}

// Or:
class Moves {
    String username;
    String[] moveactions;
    long[] timestamps;
}
Am I interpreting the tradeoffs correctly?
Thank you
Your assessment is entirely correct. You can reduce overhead somewhat by choosing shorter field names; if your mapping framework supports it, you can then alias the shorter names so that you can use more user-friendly ones in your app.
Your idea of aggregating moves into a single entity is probably a good one; it depends on your access pattern. If you regularly need to access information on only a single move, you're correct that the time spent will grow with the number of moves, but if you regularly access lists of sequential moves, this is a non-issue. One possible compromise is separating the moves into groups - one entity per hundred moves, for example.
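A tiny sketch of that compromise, with made-up names, combining the grouping with the short-field-name trick:

// One entity holds a block of (up to) 100 moves for one user, keyed by
// "<username>:<blockNumber>" (e.g. "alice:7" covers moves 700-799).
class MoveBlock {
    String u;     // username, shortened to cut per-record property-name overhead
    String[] m;   // move actions in this block
    long[] t;     // timestamps in this block
}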