How are Long ids used in Google Datastore insert/update queries? - google-app-engine

Our product uses Google Datastore as the application database. Most of the entities use IDs of type Long and some of type String. I noticed that the Long IDs are not assigned in consecutive order.
Now we are exporting some big tables, with around 30-40 million entries, to JSON files for business purposes. Initially we expected that a simple query like "ofy().load().type(ENTITY.class).startAt(cursor).limit(BATCH_LIMIT).iterator()" would let us iterate through the entire content of a given table, starting from the first entry and ending with the most recently created one. We work in batches and store the cursor after every batch, so that the next task can pick up the cursor and resume.
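For reference, the export loop looks roughly like this (simplified; loadSavedCursor, writeToJson and saveCursor stand in for our own persistence helpers; Cursor and QueryResultIterator come from the App Engine datastore API):

    Cursor cursor = loadSavedCursor();
    Query<Entity> query = ofy().load().type(Entity.class).limit(BATCH_LIMIT);
    if (cursor != null) {
        query = query.startAt(cursor);
    }
    QueryResultIterator<Entity> it = query.iterator();
    while (it.hasNext()) {
        writeToJson(it.next());      // append this entity to the export file
    }
    saveCursor(it.getCursor());      // the next task resumes from here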
But after noticing that an entity created a few minutes ago can have a smaller ID than an entity created a week ago, we are wondering whether we should impose a content freeze during the export period. On the one hand, it is critical to produce a complete export and not miss older data up to a specific date; on the other hand, a content freeze longer than one day is a problem for our customers.
What do you advise us to do?
Thanks,
Cristian.

I do not think you need to worry about the uniqueness of your IDs. Datastore is built on top of Bigtable using six tables:
the first table stores entities
the second stores entities by kind
the third stores indexes for property values in ascending order
the fourth stores indexes for property values in descending order
the fifth stores indexes for multiple properties together
the sixth keeps track of the next unique ID for each kind
The key format is something like this:
[application ID]-[namespace]-[Kind]-[ID]
This guarantees the uniqueness of each entity.
Yes, the format in that last table is [Application ID]-[Kind Name], and the value is the next ID. Say you have a kind products; that table will contain a row like |key(yourapp-products), Next ID(3)|. When you create a new entity of kind products, it is assigned ID 3 and the row gets the new value |key(yourapp-products), Next ID(4)|. Note that the table holds one row per kind, so with only the products kind there is just one row.
Do you specify the IDs yourself, or do you let Datastore generate them? It sounds like you have a "pre-allocating IDs" issue. Just speculating, but for every batch you could call Kind.allocate_ids(size=blah); that way you can keep the sequence.
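In Java, the equivalent is the low-level allocateIds call (kind name and batch size below are illustrative; Objectify exposes the same thing via its factory):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.KeyRange;

    // Reserve a contiguous block of IDs for the kind before writing a batch.
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    KeyRange range = ds.allocateIds("Product", 100);
    // The range is contiguous, and per the docs these IDs won't be reused
    // by the automatic ID allocator.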

Related

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help designing the Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key and PostID as the sort key, with Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query for operations 3 and 4 (the newsfeed). I know I cannot query without a partition key, and a scan is not an effective solution. Any workaround for operations 3 and 4?
Any idea how I should design my DB?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it the value POSTS, and use the upload_time field as the sort key.
Viewed through that secondary index (I've named it GSI1), all your posts land in a single POSTS partition, sorted by upload_time.
This would allow you to query for posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing a constant POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date; for example, GSI1PK = POSTS#2021-01-00 would store posts by the month they were created.
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
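A query for one month bucket might look like this with the AWS SDK for Java v2 (table, index, and attribute names are assumptions carried over from the example above):

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
    import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

    public class RecentPosts {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();
            // Newest first within the current month's bucket; GSI1SK holds upload_time.
            QueryResponse resp = ddb.query(QueryRequest.builder()
                    .tableName("Posts")
                    .indexName("GSI1")
                    .keyConditionExpression("GSI1PK = :pk")
                    .expressionAttributeValues(Map.of(
                            ":pk", AttributeValue.builder().s("POSTS#2021-01-00").build()))
                    .scanIndexForward(false)
                    .limit(25)
                    .build());
            // If fewer than 25 items come back, repeat with POSTS#2020-12-00, and so on.
            resp.items().forEach(System.out::println);
        }
    }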
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a date range on the number of likes (e.g. most popular posts this week/month/year/etc.), you could utilize the truncated-timestamp approach I outlined for the previous access pattern.
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern, where you fetch the most recent posts for a user.
If you were to replace the Post IDs with KSUIDs, the sort key itself would order each user's posts by creation time. Because KSUIDs are unique and sortable by the time they were created, you can support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
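If you just want to see the idea, here is a hypothetical illustration (not a real KSUID implementation): an ID whose string ordering matches creation order because it starts with a zero-padded timestamp.

    import java.security.SecureRandom;

    public class SortableId {
        private static final SecureRandom RNG = new SecureRandom();

        // Zero-padded epoch seconds make the string sort by creation time;
        // the random hex suffix keeps IDs unique.
        public static String next() {
            long seconds = System.currentTimeMillis() / 1000;
            StringBuilder sb = new StringBuilder(String.format("%012d", seconds));
            byte[] rand = new byte[8];
            RNG.nextBytes(rand);
            for (byte b : rand) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }

A real KSUID library additionally base62-encodes the value and uses a custom epoch, but the sorting property is the same.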
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take time until your like counters are updated.
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10 GB of post data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a single-table design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK, and duplicating the attribute values into them. This will make it less confusing if you store different entities in the table. Adding a type attribute that holds the entity type is also common.
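For illustration (values assumed), a single post item would then duplicate its natural attributes into the generic index attributes, with GSI1 serving access pattern 3 (GSI1SK duplicates UploadTime) and GSI2 serving pattern 4 (GSI2SK duplicates Likes):

PK        SK         GSI1PK  GSI1SK                 GSI2PK  GSI2SK  type
USER#42   POST#1001  post    2021-01-15T10:00:00Z   post    17      post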

NetSuite - UNION ALL equivalent in saved search?

I'm in the process of writing a SuiteTalk integration, and I've hit an interesting data transformation issue. In the target system, we have a sort of notes table which has a category column and then the notes column. Data going into that table from NetSuite could be several different fields on a single entity in NetSuite terms, but several records of different categories in our terms.
If you take the example of a Sales Order, you might have two text fields that we need to bring across as notes. For each of those fields I need to create a separate row, with the text of both fields going into the same notes column. This would let me add a dynamic column that gives the category for each of those fields.
So instead of
SO number notes 1 notes 2
SO1234567 some text1 some text2
You’d get
SO Number Category Text
SO1234567 category 1 some text1
SO1234567 category 2 some text2
The two problems I’m really trying to solve here are:
Where can I store the category name? It can’t be the field name in NetSuite. It needs to be configurable per customer as the number of notes fields in each record type might vary across implementations. This is currently my main blocker.
Performance – I could create a saved search for each type of note, and bring one row across each time, but that’s not really an acceptable performance hit if I can do it all in one call.
I use Saved Searches in NetSuite to provide a configurable way of filtering the data to import into the target system.
If I were writing a SQL query, I would use the UNION clause, with the first column being a dynamic column denoting the category and the second column being the actual data field from NetSuite. Ideally I could do something similar either as a single saved search, or as one saved search per entity, without having to create any additional fields within NetSuite itself, so that from the SuiteTalk side I can just query the search and pull in the data.
As a temporary kludge, I now have multiple saved searches in NetSuite, one per category, and I encode the category name and an indicator of the record type in the ID of each saved search. I then have a parent search which gives me the searches for that record type. It's very clunky, and ultimately results in far too many round trips for me to be satisfied.
Any idea if something like this is at all possible? Or if not, is there a way of solving this without hard-coding the category values in the front end? Even being able to bring back multiple recordsets in one call would be a performance enhancement.
I've asked the same question on the NetSuite forums but to no avail.
Thanks
At first read it sounds like you are trying to query a set of fields from entities. The fields may be custom fields or built-in fields. Can you not just query the entities with a saved search that has all the potential category columns, and then transform the received data into categories?
Otherwise, please provide more specifics, in NetSuite terms, about what you are trying to do.
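To make that suggestion concrete, here is a sketch of the transform (field IDs, category names, and types are illustrative, not actual SuiteTalk API types; the values map is whatever you read from the saved-search row):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class NoteRowPivot {
        record NoteRow(String soNumber, String category, String text) {}

        // fieldToCategory is per-customer configuration, e.g.
        // "custbody_notes1" -> "category 1", "custbody_notes2" -> "category 2".
        static List<NoteRow> pivot(String soNumber,
                                   Map<String, String> fieldToCategory,
                                   Map<String, String> fieldValues) {
            List<NoteRow> rows = new ArrayList<>();
            for (Map.Entry<String, String> e : fieldToCategory.entrySet()) {
                String text = fieldValues.get(e.getKey());
                if (text != null && !text.isBlank()) {
                    rows.add(new NoteRow(soNumber, e.getValue(), text));
                }
            }
            return rows;
        }
    }

Keeping the field-to-category map in configuration, per customer, avoids hard-coding category values in the front end.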

Database design: ordered set

task_set is a table with two columns (id, task):
id task
1 shout
2 bark
3 walk
4 run
assume there is another table with two columns (employee, task_order)
task_order is an ordered set of tasks, for example (2,4,3,1)
generally the task_order is unchanged, but sometimes tasks may be inserted or deleted, e.g. (2,4,9,3,1) or (2,4,1)
how do I design such a database? I mean, how do I realize the ordered set?
If, and ONLY if, you don't need to search inside the task_order column or update one of its values (i.e. change 4,2,3 to 4,2,1), keeping that column as a delimited string might be an easy solution.
However, if you ever plan on searching or updating specific values inside the task_order, then you had better normalize that structure into a table that holds employee id, task id, and task order, as sketched below.
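For example (column names illustrative), the order (2,4,3,1) for one employee becomes:

employee task position
7        2    10
7        4    20
7        3    30
7        1    40

Leaving gaps in the position values (10, 20, 30, ...) makes inserting a task between two others a single INSERT, deleting is a single DELETE, and sorting by position recovers the ordered set.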

Google App Engine - What's the recommended way to keep entity amount within limit?

I have some entities of a kind, and I need to keep them within a limited amount by discarding old ones, just like log entry maintenance. Any good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property to each of these entities. Create a cron job that checks the datastore statistics daily. If the size exceeds the limit, query entities of that kind sorted by date, oldest first, and delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional indexed property. I observed that the entity key IDs appear to be increasing, so I'd like to query keys only, sorted in ascending order, and delete the ones with smaller IDs. That requires no additional property (date) and no index. But I'm seriously worried about whether the key IDs are guaranteed to increase.
I think this is a common data maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID generation only guarantee that:
IDs allocated in this manner will not be used by the Datastore's
automatic ID sequence generator and can be used in entity keys without
conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order
they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element with an ID as the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it so you're safe) and create entities with IDs as its value, then delete IDs when the limit is reached.
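A minimal sketch of that rotating-counter write path, using Objectify since it already appears in this thread (entity names and the limit are illustrative):

    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;
    import static com.googlecode.objectify.ObjectifyService.ofy;

    @Entity
    class SlotCounter {
        @Id Long id;   // a single well-known row, id = 1
        long next;     // slot that the next write will overwrite
    }

    @Entity
    class LogEntry {
        @Id Long id;   // slot number in [1, LIMIT]
        String message;
    }

    class LogWriter {
        static final long LIMIT = 1000;

        static void append(String message) {
            // The transaction guarantees no other request bumps the counter
            // between the load and the save.
            ofy().transact(() -> {
                SlotCounter c = ofy().load().type(SlotCounter.class).id(1L).now();
                if (c == null) { c = new SlotCounter(); c.id = 1L; }
                c.next = (c.next + 1) % LIMIT;
                LogEntry e = new LogEntry();
                e.id = c.next + 1;   // datastore IDs must be positive; reuses the slot's key
                e.message = message;
                ofy().save().entities(c, e);
            });
        }
    }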
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (sketched with Objectify annotations; LimitedEntity is a stand-in for your kind):
@Entity class A {
    @Id Long id;
    List<Key<LimitedEntity>> limitedEntities;   // keys of the capped entities, oldest first
}
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and remove the corresponding entity.
Note that when you add or delete an entity, you should modify the list of entity A in a transaction. Since these entities belong to different entity groups, you should consider using cross-group transactions.
Hope this helps!

How can I simplify my database?

I am working on a project in which I generate a unique customer ID based on the first letter of the customer's last name, and I store customers in different tables accordingly: if a customer's name starts with A, the customer's whole information goes into the Registration_A table. In this way I have created Registration tables up to Z. But retrieving data with such a structure is quite difficult. Can you suggest another method of storing the data so that retrieval becomes more flexible?
Put all of your registration data into one table. There's absolutely no need for you to break it into alphabetical pieces like that unless you have some serious performance issues.
When querying for registration data, use SQL's WHERE clause to narrow down your results.
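For example (JDBC, with the table name, column name, and connection URL assumed), one merged table plus a WHERE clause replaces all 26 per-letter tables:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RegistrationQuery {
        public static void main(String[] args) throws Exception {
            // Connection URL is a placeholder for your own database.
            try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT * FROM Registration WHERE last_name LIKE ?")) {
                ps.setString(1, "A%");   // everything that used to live in Registration_A
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("last_name"));
                    }
                }
            }
        }
    }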
You should merge this into one table, Registration, and let the database take care of unique IDs. This depends on your database, but searching for PRIMARY KEY or AUTO_INCREMENT should give you lots of results.
If you did the splitting for performance reasons, you can add an index on the user's last name.
