Implementing a license key management system on GAE: Datastore or Cloud SQL? - google-app-engine

I am implementing a license key system on Google App Engine. Keys are generated ahead of time and emailed to users. Then they log into the system and enter the key to activate a product.
I could have potentially several hundred people submitting their keys for validation at the same time. I need the transactions to be strongly consistent so that the same license key cannot be used more than once.
Option 1: Use the datastore
To use the datastore, I need it to be strongly consistent, so I will use an EntityGroup for the license keys. However, there is a limit of 1 write / second to an entity group. App Engine requests must complete within 60 seconds, so this would mean either notifying users offline when their key was activated, or having them poll in a loop until their key was accepted.
Option 2: Use Google Cloud SQL
Even the smallest tier of Google Cloud SQL can handle 250 concurrent connections. I don't expect these queries to take very long. This seems like it would be a lot faster and would handle hundreds or thousands of simultaneous license key requests without any issues.
The downside to Google Cloud SQL is that it is limited in size to 500GB per instance. If I run out of space, I'll have to create a new database instance and then query both for the submitted license key. I think it will be a long time before I use up that 500GB and it looks like you can even increase the size by contacting Google.
Seems like Option 2 is the way to go, but I'm wondering what others think. Do you find Entity Group performance for transactions acceptable?

Option 2 seems more feasible and cleaner in your case, but you have to manage database connections yourself, and that becomes a hassle under increasing load if connection pooling is not set up properly.
The datastore can also be used for a license key system by defining multiple EntityGroups with dummy ancestors, based on a few leading or trailing digits of the key, to work around the limit of 1 write / second per entity group. This way you can also easily determine the EntityGroup of a generated or submitted license key.
For example, if 4321 G42T 531P 8922 is a license key, then 4321 can be used as the EntityGroup, and all keys starting with 4321 will be part of that EntityGroup. This is a sharding-like mechanism that reduces the chance of simultaneous writes to a single entity group.
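The prefix-to-group mapping described above can be sketched in a few lines. This is only an illustration of how the shard name is derived: the function name `entity_group_for` and the default prefix length of 4 are assumptions, and no Datastore API is called here.

```python
def entity_group_for(license_key: str, prefix_len: int = 4) -> str:
    """Return the dummy-ancestor (shard) name for a license key.

    Hypothetical helper: the shard is simply the first `prefix_len`
    characters of the key once whitespace is removed.
    """
    normalized = "".join(license_key.split()).upper()
    if len(normalized) < prefix_len:
        raise ValueError("license key is shorter than the shard prefix")
    return normalized[:prefix_len]


# Keys sharing a prefix land in the same entity group; others do not.
group_a = entity_group_for("4321 G42T 531P 8922")
group_b = entity_group_for("4321 AAAA BBBB CCCC")
group_c = entity_group_for("9876 G42T 531P 8922")
print(group_a, group_b, group_c)
```

Writes for keys in different groups would then not compete for the same per-group write budget, which is the point of the scheme. How evenly the load spreads depends on how the key prefixes are distributed.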
If you need to perform queries on some columns other than license key then a separate mapping table can be maintained without an EntityGroup.

You can mix them: Google Cloud SQL only has to hold the keys and emails, and with 500 GB I believe you could store a key for every person on the planet.
On the other hand, you can ask Google to increase the data size limit.

I would go with Option 1, the datastore. It's much faster and more scalable.
And I don't know why you need to create an EntityGroup. You could make the "license key" itself the Key, so that each Entity is in its own EntityGroup. Only this will make things scalable.

Related

Problems and solutions when using a secondary datastore alongside the main database?

I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer for me, please?
The question:
We use a secondary datastore (we use elasticsearch alongside our main database) for real time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover: Your main database is usually transactional, so you either commit or you don't. After a record is inserted into your main database, there is no guarantee that it will also be written to ES. In fact, if you commit several records to your primary DB, you may end up in a situation where some of them are written to ES and others are not. This is a MAJOR issue.
Refresh interval: Elasticsearch refreshes every second by default. That means "real-time" search is generally about 1 second behind. If you commit a record to your primary DB and immediately search for it via ES, it may not be found. One way around this is to GET the record using its ID.
Data duplication: Elasticsearch cannot do joins. You need to denormalize all data that comes from an RDBMS. If one user has many posts, you cannot "join" at search time. You have to add the user ID and any other user-specific details to every post object.
Hardware: Elasticsearch needs RAM (a bare minimum of 1 GB) to work properly. This is assuming you don't use anything else from the ELK stack. This is an important cost consideration.
One problem might be synchronization issues, where the Elasticsearch store gets out of sync and starts serving stale data. To avoid this, you will have to implement monitoring on your data pipeline, Elasticsearch and the primary database, so that you can detect problems by checking update times, delay, number of records (within some level of error) in each of them, and overall system status (up / down).
Another is disconnection and recovery: what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and to start synchronising data again.
You also have to take into account a sudden influx of data: how do you scale Elasticsearch ingestion or your data processor (data pipeline) when there is a large number of updates and inserts in peak hours, or after reconnection following network issues?
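The monitoring checks mentioned above (record counts within some level of error, and update delay) could look roughly like the sketch below. Everything here is hypothetical: `StoreStats`, `sync_problems`, and the 1% / 60 second thresholds are made up for illustration, and in practice the numbers would come from queries against the primary database and Elasticsearch.

```python
from dataclasses import dataclass


@dataclass
class StoreStats:
    """Hypothetical snapshot of one store, gathered by a monitoring job."""
    record_count: int
    last_update_epoch: float  # time of the most recent write, in seconds


def sync_problems(primary, secondary, max_count_drift=0.01, max_lag_seconds=60.0):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if primary.record_count > 0:
        drift = abs(primary.record_count - secondary.record_count) / primary.record_count
    else:
        drift = 0.0 if secondary.record_count == 0 else 1.0
    if drift > max_count_drift:
        problems.append(f"record counts differ by {drift:.1%}")
    lag = primary.last_update_epoch - secondary.last_update_epoch
    if lag > max_lag_seconds:
        problems.append(f"secondary is {lag:.0f}s behind")
    return problems


healthy = sync_problems(StoreStats(1000, 5000.0), StoreStats(998, 4990.0))
stale = sync_problems(StoreStats(1000, 5000.0), StoreStats(900, 4700.0))
print(healthy)
print(stale)
```

A check like this only detects drift. Repairing it (replaying missed records after a reconnect) is a separate job.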

Concurrent writes to a shared network resource

Here is the context for the problem I am trying to solve.
There are computers A and B, as well as a server S. Server S implements some backend which handles incoming requests in a RESTful manner.
The backend S has a shelf. The goal of users A and B is to make S create and place numbered boxes on that shelf. A unique constraint is that no two boxes can have the same number. Once a box is created, S should return that box (JSON, or xml...) back to A and B with its allocated number.
The problem boils down to concurrency, as A and B's POST ("create-numbered-box") transactions may arrive at the database at the exact same time and hence get cancelled (?). As a reminder, there is a unique constraint: no two boxes are allowed to have the same number.
What are possible ways to solve this problem? I wouldn't like to lock the database, so I am looking for alternatives to that. You are allowed to imagine that between the database and the backend layer calling the database we may have an extra layer of abstraction, e.g. a microservice, a messaging queue... whatever, or nothing at all (a direct query call from the backend to the db). If you think a postgres database is not a good choice compared to, say, a graph, document or key-value one, feel free to substitute it.
The goal in the end is that, given concurrent writes, users A and B get responses to their create (POST) requests and each of them has a box on that shared shelf with a unique number, with no "Oops, something went wrong. Please retry" type of server response.
I described a simple world with users A and B, but in theory that can go up to 10,000 users writing, not just 2.
As a secondary question, I'd like to ask, is there a way to test conflicting concurrent transactions in postgres?
I will go first.
My idea is, let A and B send requests and fail. Once they fail, have retries with random timeouts in some interval. Let's say up to 3 retries. This way for A and B I will try to separate the requested writes to the db and this would allow for some degree of successful resolution of the scenario. However, I don't think this is a clean solution and I am looking for alternatives you can think of. Just, please keep in mind the constraints and freedoms I mentioned above.
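A minimal sketch of the retry idea above, assuming the backend surfaces a lost uniqueness race as an exception. `ConflictError`, `with_retries` and `flaky_create_box` are hypothetical names, and the fake operation simply fails twice before succeeding so the loop can be exercised without a database.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical error raised when a write loses a uniqueness race."""


def with_retries(operation, max_retries=3, max_delay=0.05):
    """Run `operation`, retrying on ConflictError after a random pause."""
    attempt = 0
    while True:
        try:
            return operation()
        except ConflictError:
            if attempt >= max_retries:
                raise  # give up: the caller sees the conflict
            attempt += 1
            # A random delay makes it less likely that A and B collide again.
            time.sleep(random.uniform(0, max_delay))


calls = {"n": 0}


def flaky_create_box():
    """Stand-in for the POST handler: conflicts twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConflictError("box number already taken")
    return {"box_number": 7}


result = with_retries(flaky_create_box)
print(result, "after", calls["n"], "attempts")
```

As the question itself notes, this only lowers the chance of a failed response. Once the retries are exhausted the conflict still reaches the user.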
Databases such as Postgres can generate a unique number for you (see PostgreSQL - SERIAL - Generate IDs (Identity, Auto-increment)). So the logic for your backend service S could be:
lookup if user has a record in the database already
return the id if it does
otherwise, create a record and return the newly allocated id
To avoid creating multiple boxes for the same user you need to serialize the lookup/create logic based on user id. Approaches to that vary from merely handling one request at a time in your service S to, for example, having Kafka topics that partition requests to different instances of service S based on user ids -- all depends on the scale.
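The lookup-or-create steps above can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for Postgres, so `AUTOINCREMENT` plays the role of `SERIAL`; the `boxes` table, its columns and `get_or_create_box` are assumptions for illustration. It is also a slight variation on the steps as written: it inserts first and lets a unique constraint on the owner absorb duplicates, then reads the number back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE boxes ("
    " number INTEGER PRIMARY KEY AUTOINCREMENT,"  # database-allocated box number
    " owner TEXT NOT NULL UNIQUE)"                # at most one box per user
)


def get_or_create_box(conn, owner):
    """Return the box number for `owner`, creating the box if needed."""
    with conn:  # commits on success, rolls back on an exception
        # If the owner already has a box, the UNIQUE constraint turns
        # this insert into a no-op instead of an error.
        conn.execute("INSERT OR IGNORE INTO boxes (owner) VALUES (?)", (owner,))
        row = conn.execute(
            "SELECT number FROM boxes WHERE owner = ?", (owner,)
        ).fetchone()
    return row[0]


a = get_or_create_box(conn, "A")
b = get_or_create_box(conn, "B")
a_again = get_or_create_box(conn, "A")
print(a, b, a_again)
```

The numbers are allocated by the database, so two users never receive the same one, and a repeated request from the same user returns the box that already exists.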

Is google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication, all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry should be saved as a single document. Each client uploads its entries on a daily basis, and an upload can consist of 100K log entries per day.
Plus, there are some limitations and questions that can break the requirements:
60 seconds timeout on bulk transaction - How many log entries per second will I be able to insert? If 100K won't fit into the 60 seconds frame - this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - Is a transaction considered a single insert?
Post analysis - text search, searching for similar log entries cross clients. How flexible and efficient is Datastore with these queries?
Real time data fetch - getting all the recent log entries.
The other option is to deploy an elasticsearch cluster on Google Compute Engine and write our own server which fetches data from ES.
Thanks!
It is a bad idea to use the datastore, and even worse if you use entity groups with parent/child, as a comment mentions when comparing performance.
Those numbers do not apply, but the datastore is not at all designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
I do not agree. Datastore is a fully managed NoSQL document store; you can store the logs you want in it and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB log analysis use case in Google Cloud.

Centralized data access or variables

I'm trying to find a way to access a centralized database for both retrieval and update.
The following is what I'm looking for.
Server 1 has this variable for example
int counter;
Server 2 will be interacting with the user, and will increase the counter whenever the user uses the service, until a certain threshold is reached. When this threshold is reached, server 2 will start rejecting the user's access.
Also, the user will be able to use multiple servers (like server 2) from multiple locations, and each time the user accesses any server the counter will be increased.
I tried google but it's hard to search for something without a name.
One approach to designing this is to do sharding by user, i.e. split the users between your servers depending on the ID of the user. That is, if you have 10 servers, then users with IDs ending with 2 would have all of their data stored on server 2, and so on. This assumes that user IDs are distributed uniformly.
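A small sketch of sharding by user ID, combined with the counter and threshold from the question. The names `server_for_user`, `UsageCounter` and `handle_request` are hypothetical, and the ten "servers" are just in-memory objects standing in for real machines.

```python
def server_for_user(user_id: int, num_servers: int = 10) -> int:
    """Pick the server that owns this user's data.

    With 10 servers this is the last digit of the ID, so user 1042
    lives on server 2.
    """
    if num_servers <= 0:
        raise ValueError("num_servers must be positive")
    return user_id % num_servers


class UsageCounter:
    """Per-server counter store (hypothetical stand-in for a real server)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.counts = {}

    def try_use(self, user_id: int) -> bool:
        """Count one use; return False once the threshold is reached."""
        count = self.counts.get(user_id, 0)
        if count >= self.threshold:
            return False
        self.counts[user_id] = count + 1
        return True


servers = [UsageCounter(threshold=3) for _ in range(10)]


def handle_request(user_id: int) -> bool:
    # Every request for a given user is routed to the same owning server,
    # so there is exactly one counter to update.
    return servers[server_for_user(user_id)].try_use(user_id)


results = [handle_request(1042) for _ in range(4)]
print(server_for_user(1042), results)
```

Because each user has a single owning server, the counter never has to be reconciled between machines.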
One other approach is to shard the users by location - if you have servers in Asia vs Europe, for example. You'd need a property in the User record that tells you where the user is located; based on that, you'll know which server to route them to.
Ultimately, all of these design options have a concept of "where does the master record for a user reside?" Each of these approaches attempts to definitively answer this question.
A different category of approaches has to do with multi-master replication, which is supported by some database vendors; this approach does not scale as well (i.e. it's hard to get it to scale to 20 servers), but you might want to look into it, too.

How to set up a new SQL Server database to allow for possible replication in the future?

I'm building a system which has the potential to require support for 500+ concurrent users, each making dozens of queries (selects, inserts AND updates) each minute. Based on these requirements and tables with many millions of rows I suspect that there will be the need to use database replication in the future to reduce some of the query load.
Having not used replication in the past, I am wondering if there is anything I need to consider in the schema design?
For instance, I was once told that it is necessary to use GUIDs for primary keys to enable replication. Is this true?
What special considerations or best practices for database design are there for a database that will be replicated?
Due to time constraints on the project I don't want to waste any time by implementing replication when it may not be needed. (I have enough definite problems to overcome at the moment without worrying about having to solve possible ones.) However, I don't want to have to make potentially avoidable schema changes when/if replication is required in the future.
Any other advice on this subject, including good places to learn about implementing replication, would also be appreciated.
While every row must have a rowguid column, you are not required to use a Guid for your primary key. In reality, you aren't even required to have a primary key (though you will be stoned to death for failing to create one). Even if you define your primary key as a guid, not making it the rowguid column will result in Replication Services creating an additional column for you. You definitely can do this, and it's not a bad idea, but it is by no means necessary nor particularly advantageous.
Here are some tips:
Keep table (or, rather, row) sizes small; unless you use column-level replication, you'll be downloading/uploading the entire contents of a row, even if only one column changes. Additionally, smaller tables make conflict resolution both easier and less frequent.
Don't use sequential or deterministic algorithm-driven primary keys. This includes identity columns. Yes, Replication Services will handle identity columns and allocating key allotments by itself, but it's a headache that you don't want to deal with. This alone is a great argument for using a Guid for your primary key.
Don't let your applications perform needless updates. This is obviously a bad idea to begin with, but this issue is made exponentially worse in replication scenarios, both from a bandwidth usage and a conflict resolution perspective.
You may want to use GUIDs for primary keys: in a replicated system rows must be unique throughout your entire topology, and GUID PKs are one way of achieving this.
Here's a short article about use of GUIDs in SQL Server
I'd say your real question is not how to handle replication, but how to handle scale out, or at least scale out for queryability. And while there are various answers to this conundrum, one answer will stand out: not using replication.
The problem with replication, especially with merge replication, is that writes get multiplied. Say you have a system which handles a load of 100 queries (90 reads and 10 writes) per second. You want to scale out and you choose replication. Now you have 2 systems, each handling 50 queries: 45 reads and 5 writes. Those writes have to be replicated, so the actual number of writes is not 5+5, but 5+5 (the original writes) plus another 5+5 (the replica writes), so you have 90 reads and 20 writes. So while the load on each system was reduced, the ratio of writes to reads has increased. This not only changes the IO patterns but, most importantly, it changes the concurrency pattern of the load. Add a third system and you'll have 90 reads and 30 writes, and so on and so forth. Soon you'll have more writes than reads, and the replication update latency combined with the concurrency issues and merge conflicts will derail your project. The gist of it is that the 'soon' is much sooner than you expect. It is soon enough to justify looking into scale-up instead, since you're talking about a scale-out of 6-8 peers at best anyway, and a 6-8 times capacity increase using scale-up will be faster, much simpler and possibly even cheaper to start with.
And keep in mind that all these are purely theoretical numbers. In practice, the replication infrastructure is not free; it adds its own load on the system. Writes need to be tracked, changes have to be read, a distributor has to exist to store changes until they are distributed to subscribers, and then changes have to be written and mediated for possible conflicts. That's why I've seen very few deployments that could claim success with a replication-based scale-out strategy.
One alternative is to scale out only reads, and here replication does work, usually using transactional replication, but so does log-shipping or mirroring with a database snapshot.
The real alternative is partitioning (i.e. sharding). Requests are routed in the application to the proper partition and land on the server containing the appropriate data. Changes on one partition that need to be reflected on another partition are shipped via asynchronous (usually messaging-based) means. Data can only be joined within a partition. For a more detailed discussion of what I'm talking about, read how MySpace does it. Needless to say, such a strategy has a major impact on the application design and cannot simply be glued on after v1.
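The arithmetic above can be reproduced with a few lines. This sketch assumes, as the example does, that every write is applied once on its originating peer and once more on every other peer; the function name `replicated_load` is made up for illustration.

```python
def replicated_load(reads, writes, peers):
    """Return (total reads, total writes) per second across all peers.

    Reads are split between the peers, so their total stays constant.
    Each original write is applied on every peer, so the total number
    of writes grows linearly with the number of peers.
    """
    if peers < 1:
        raise ValueError("need at least one peer")
    return reads, writes * peers


for peers in range(1, 4):
    total_reads, total_writes = replicated_load(90, 10, peers)
    per_peer_reads = total_reads / peers
    per_peer_writes = total_writes / peers  # every peer applies all 10 writes
    print(peers, total_reads, total_writes, per_peer_reads, per_peer_writes)
```

Under that assumption each peer still has to apply all 10 writes per second no matter how many peers are added, while its share of the reads keeps shrinking.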

Resources