Avoiding unnecessarily large IDs in App Engine - google-app-engine

Recently we decided it would benefit us if the IDs of our Datastore entities weren't soo big. Biggest reason being, we use these IDs in URLs that we'd like to keep nice and short.
Currently, as an example, the IDs of our entities grow like this:
id=2
id=2003
id=2004
id=2027
id=2028
id=5002
id=5204
id=6001
id=7534
id=8001
id=10192
id=11306
id=14306
id=16330
id=18306
id=20321
id=41312
id=79306
id=113308
id=113311
etc.
As you can see, sometimes the increase is in the tens of thousands.
Now, we could cope with all this hassle by creating a sharded counter big enough to count the number of entities for us and then assign the IDs ourselves, but I would still like it better if the Datastore would assign the keys for us.
Is there any way of telling the Datastore to re-calculate the available IDs, so that next time I'd store an entity, it would get the lowest available ID? They don't need to be sequential in our case.
UPDATE:
As #Amber suggested, we could encode the digits to base62 to have them shorter (at most 11 digits for 64-bit unsigned ints).
While this approach is not too bad, it has a few disadvantages. First I'm not sure how good UX it is. Second, some digits would clash with other strings that we currently use in URLs.
As an example:
/books/(\d+)(/book-name)?
/books/selection
The book with id 26086738530 would have the URLs '/books/selection/book-name' and '/books/selection', clashing with our other page.

I'm afraid there isn't a mechanism in the datastore that allows you to control the automatic id creation.
How many objects do you estimate you will have in the project life time? because long ids seems like an hassle now but might be a necessary anyway when you will have tens of thousands objects in the store.
As goes for the base62, you can route base62 ids thru a different url.

Related

Allocate a random Id to use in Entity constructor Google App Engine

When you insert an Entity into datastore with a #id Long id; property, the datastore automatically creates a random (or what seems like a random) Long value as the id that looks like: 5490350115034675.
I would like to set the Long id myself but have it be randomly generated from datastore.
I found this piece of code that seems to do just that:
Key<MyEntity> entityKey = factory().allocateId(MyEntity.class);
Long commentId = entityKey.getId();
Then I can pass in the commentId into the constructor of MyEntity and subsequently save it to the datastore.
When I do that however, I do not seem to get a randomly generated id, it seems to follow some weird pattern where the first allocated id is 1 and the next one is 10002, then 20001 and so on.
Not sure what all that means and if it is safe to continue using... Is this the only way to do this?
When you use the autogenerated ids (ie Long), GAE uses the 'scattered' id generator which gives you ids from a broad range of the keyspace. This is because high volume writing (thousands per second) of more-or-less contiguous values in an index results in a lot of table splitting, hurting performance.
When you use allocateId(), you get an id from the older allocator that was used before scattered ids. They aren't necessarily contiguous or monotonic but they tend to start small and grow.
You can mix and match; allocations will never conflict.
I presume, however, that you want random-looking ids because you want them to be hard to guess. Despite their appearance at first glance, the scattered id allocator does not produce unguessable ids. If you want sparse ids that will prevent someone from scanning your keyspace, you need to explicitly add a random element. Or just use UUID.randomUUID() in the first place.
App Engine allocates IDs using its own internal algorithm designed to improve datastore performance. I would trust App Engine team to do their magic.
Introducing your own scheme for allocating IDs is not as simple - you have to account for eventual consistency, etc. And it's unlikely that you will gain anything, performance-wise, from all this effort.

How many records / rows / nodes is a lot in Firebase?

I'm creating an app where I will store users under all postalcodes/zipcodes they want to deliver to. The structure looks like this:
postalcodes/{{postalcode}}/{{userId}}=true
The reason for the structure is to easily fetch all users who deliver to a certain postal code.
ex. postalcodes/21121/
If all user applies like 500 postalcodes and the app has about 1000 users it can become a lot of records:
500x1000 = 500000
Will Firebase easily handle that many records in data storage, or should I consider a different approach/solution? What are your thoughts?
Kind regards,
Elias
I'm quite sure Firebase can return 500k nodes without a problem.
The bigger concerns are how long that retrieval will take (especially in this mobile-first era) and what your application will show your user based on that many nodes.
A list with 500k rows is hardly useful, so most likely you'll show a subset of the data.
Say you just show the first screenful of nodes. How many nodes will that be? 20? So why would you already retrieve the other nodes already in that case? I'd simply retrieve the nodes needed to build the first screen and load the rest on demand - when/if needed.
Alternatively I could imagine you show a digest of the nodes (like a total number of nodes and maybe some averages per zip code area). You'd need all nodes to determine that digest. But I'd hardly consider it to task of a client application to determine the digest values. That's more something of a server-side task. That server could use the same technology as client-apps (i.e. the JavaScript API), but it wouldn't be bothered (as much) by bandwidth and time constraints.
Just some ideas of how I would approach this, so ymmv.

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

been trying to solve this problem for a week and couldn't come up with any solutions in all my research so I thought I'd ask you all.
I have a "Product" table and a "productSent" table, here's a quick scheme to help explain:
class Product(ndb.Model):
name = ndb.StringProperty();
rating = ndb.IntegerProperty
class productSent(ndb.Model): <--- the key name here is md5(Product Key+UUID)
pId = ndb.KeyProperty(kind=Product)
uuId = ndb.KeyProperty(kind=userData)
action = ndb.StringProperty()
date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest rated product that they've never seen before--fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database--I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that, once the user has blown past the first 1,000 products, it really starts slowing down the query performance. Because I'm literally pulling 1,000+ results, checking if they've been sent by querying against the productSent table (doing a keyName lookup to speed things up) and going through the loop until 15 new ones have been detected.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase size costs
Has anyone dealt with this problem before (I'm sure someone has!) Any tips on the best way to structure it?
update
Okay, so had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative. And sort by that? Ofcourse for items that have a rating of 0 and get a 'like' (making them a positive) would still have a chance of being out of order or skipped by the cursor--but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent you could speed the query up even more, since you will only have to query against one Model.
Another idea would be to limit the number of productNotSent entities per user. So each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You also may want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and well require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (only did a quick perusal of your question), but you may consider using Json TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dump()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer rec updates, etc.
One downside is higher bandwidth going out of you app, but I think this cost will be minimal given potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures.This approach will be, though, a big, big quantum measure lower than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your Json objects are fairly simple+lightweight to avoid this (e.g. do no include product description, sub-objects, etc. in the Json item, just the basics such as product number). HTH, -stevep

Choosing the right model for storing and querying data?

I am working on my first GAE project using java and the datastore. And this is my first try with noSQL database. Like a lot of people i have problems understanding the right model to use. So far I've figured out two models and I need help to choose the right one.
All the data is represented in two classes User.class and Word.class.
User: couple of string with user data (username, email.....)
Word: two strings
Which is better :
Search in 10 000 000 entities for the 100 i need. For instance every entity Word have a string property owner and i query (owner = ‘John’).
In User.class i add property List<Word> and method getWords() that returns the list of words. So i query in 1000 users for the one i need and then call method like getWords() that returns List<Word> with that 100 i need.
Which one uses less resources ? Or am i going the wrong way with this ?
The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know:
Is your application making unnecessay RPC calls? Should it be caching
data instead of making repeated RPC calls to get the same data? Will
your application perform better if multiple requests are executed in
parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in app engine - CPU, datastore reads, datastore writes etc etc etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word class as a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000 I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimted words per user.

List of keys or separate model?

I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
activities = db.ListProperty(db.Key)
...
class Activity(db.Model):
...
activities = db.get(user.activities)
or
B
class User(db.Model):
...
class Activity(db.Model):
owner = db.ReferenceProperty(reference_class=User)
...
activities = Activity.filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
It's what ReferencePropertys are designed for
It'll automatically set up back-references for you, which can be handy since it gives you a bi-directional link (unlike the ListProperty which is a uni-directional link)
It enforces that the thing being linked to is the proper type/class
It enforces that only a single user is linked to a given activity
It lets you automatically fetch the linked objects without having to write an explicit query, if you so desire
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference, I suspect it'll be similar. When it comes to perf, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (bigtable server), that could limit your perf more than the query itself.
The big difference is that A would be cheaper than B. Since you have a list of activities you want, you don't need to write an index for every activity object you write. If activities are written a lot, your savings add up.
Since you have the activity key, you also have the ability to do a highly-consistent get() rather than an eventually consistent filter()
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards reference if you index your ListProperty, but then that way, writing your User object would get expensive, and the limit on the number of indexed properties would limit the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it is working purely with keys. Looking up objects with just keys goes straight to the data node in BigTable, whereas B requires a lookup on the indices first which is slower (and costs will go up with the number of Activity entities).
If you never need to test for ownership, you can modify A to not index the key list. This is definitely the cheapest and most efficient route. However, as I understand it, if you later need to index them app engine cannot retroactively update indices on the key list. So only disable the index if you're certain you'll never need it.
How about C: setting Activity's parent to user key? So that you can fetch user's activities with a Activity.query(ancestor=user.key).
That way you don't need additional keys/properties + good way to group your entities for HR datastore.

Resources