The picture below shows a sequence diagram of two clients storing values into a key-value datastore:
The problem I'm trying to solve is how to prevent keys from being overwritten. The way the applications (Client_A and Client_B) prevent this is by checking whether the key exists before storing. The issue is that if both clients get the same "does not exist" result, either of them can overwrite the other's value.
What strategy can be used in a database client design to prevent this from happening?
A "key-value store", as it's usually defined, doesn't store duplicate keys at all. If two clients write to the same key, then only one value is stored -- the one from which ever client wrote "latest".
In order to reliably update values in a consistent way (where the new value depends on the old value associated with a key, or even whether or not there was an old value), your key-value store needs to support some kinds of atomic operations other than simple get and set.
Memcache, for example, supports atomic compare-and-set operations that will only set a value if it hasn't been set by someone else since you read it. Amazon's DynamoDB supports atomic transactions, atomic counters, etc.
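DynamoDB also offers conditional writes, which can express "only create this key if it doesn't exist yet" directly, so there is no separate check step to race against. Below is a minimal sketch using boto3; the table name kv and its partition key k are made-up names for illustration, not something from your diagram.

import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("kv")  # hypothetical table whose partition key is "k"

def put_if_absent(key, value):
    """Store value under key atomically; refuse to overwrite an existing item."""
    try:
        table.put_item(
            Item={"k": key, "v": value},
            # The write only succeeds if no item with this partition key exists yet.
            ConditionExpression=Attr("k").not_exists(),
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # the other client got there first
        raise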
START TRANSACTION;
SELECT ... FOR UPDATE;
-- take action depending on the result
UPDATE ...;
COMMIT;
The "transaction" makes the pair. SELECT and UPDATE, "atomic".
Write this sort of code for any situation where another connection can sneak in and mess up what you are doing.
Note: The code written here uses MySQL's InnoDB syntax and semantics; adjust accordingly for other brands.
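Applied to the original "store only if the key doesn't already exist" problem, the pattern could look roughly like this from Python with mysql-connector; the kv table and its columns are assumptions made for the sketch.

import mysql.connector  # any DB-API driver works the same way

def put_if_absent(conn, key, value):
    """Insert (key, value) only if no other connection has stored the key.
    Assumes an InnoDB table kv(k VARCHAR PRIMARY KEY, v TEXT)."""
    cur = conn.cursor()
    try:
        conn.start_transaction()
        # FOR UPDATE locks the row (or, at the default isolation level,
        # the gap where the row would go) until COMMIT.
        cur.execute("SELECT v FROM kv WHERE k = %s FOR UPDATE", (key,))
        if cur.fetchone() is not None:
            conn.rollback()
            return False  # somebody else already stored this key
        cur.execute("INSERT INTO kv (k, v) VALUES (%s, %s)", (key, value))
        conn.commit()
        return True
    except mysql.connector.Error:
        # A concurrent attempt can surface as a deadlock/lock-wait error;
        # roll back and let the caller decide whether to retry.
        conn.rollback()
        raise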
I know the concept of SCD-2 and I'm trying to improve my skills with it by doing some practice.
I have the following scenario/experiment:
I call a REST API daily to extract information about companies.
In my initial load to the DB everything is new, so everything is very easy.
The next day I call the same REST API, which might return the same companies, but some of them may (or may not) have changes (e.g., they changed their size, profits, location, ...).
I know SCD-2 might be really simple if the REST API returned only records with changes, but in this case it might return records without changes as well.
In this scenario, how do people detect whether a company's data has changed in order to apply SCD-2? Do they compare all the fields?
Is there any example out there that I can see?
There is no standard SCD-2, nor even a unique concept of it. It is a general term for a large number of possible approaches. The only option is to practice and see what is suitable for your use case.
In any case you must identify the natural key of the dimension and the set of attributes for which you want to keep history.
You may of course make it more complex by deciding to use your own surrogate key.
You mentioned that there are two main types of interface for the process:
• You periodically get a full set of the dimension data
• You get the "changes only" (aka a delta interface)
Paradoxically, the former is much simpler to handle than the latter.
First of all, in a full dimensional snapshot the natural key is unique, contrary to the delta interface (where you may get several changes for one entity).
Additionally, with a delta interface you have to handle late delivery of changes, or even changes delivered in the wrong order.
The next important decision is whether you expect deletes to occur. This is again trivial with the full interface; for the delta interface you must define some convention for how this information is passed.
Connected to this is the question of whether a previously deleted entity can be reused (i.e., reappear in the data).
If you support delete/reuse, you'll have to think about how to represent them in your dimension table.
In any case you will need some additional columns in the dimension to cover the historical information.
Some implementations use a change_timestamp; others use a validity interval of valid_from and valid_to.
Still other implementations claim that an additional sequence number is required, so you avoid the trap of several changes with an identical timestamp.
So you see that before you look for a particular implementation you need to carefully decide on the options above. For example, the full and the delta interface lead to completely different implementations.
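As for the detection question: with a full daily snapshot, the usual approach is to compare the tracked attributes of each incoming record (often via a hash of them) against the current version in the dimension, close that version if something changed, and insert a new one. A rough sketch in Python follows; the dim_company table, its columns, and the ?-style placeholders are assumptions made for illustration.

import hashlib
from datetime import date

TRACKED = ("size", "profits", "location")  # the attributes we keep history for (assumed)

def row_hash(record):
    """Hash only the tracked attributes, so unchanged rows can be skipped cheaply."""
    payload = "|".join(str(record.get(col, "")) for col in TRACKED)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def apply_scd2(cur, record, load_date=None):
    """Apply one company record from the daily snapshot to a hypothetical
    dim_company(company_id, size, profits, location, attr_hash, valid_from, valid_to)."""
    load_date = load_date or date.today().isoformat()
    new_hash = row_hash(record)
    cur.execute(
        "SELECT attr_hash FROM dim_company WHERE company_id = ? AND valid_to IS NULL",
        (record["company_id"],),
    )
    current = cur.fetchone()
    if current and current[0] == new_hash:
        return  # the snapshot resent an unchanged row: nothing to do
    if current:
        # something changed: close the current version
        cur.execute(
            "UPDATE dim_company SET valid_to = ? WHERE company_id = ? AND valid_to IS NULL",
            (load_date, record["company_id"]),
        )
    # open a new version (this also covers brand-new companies)
    cur.execute(
        "INSERT INTO dim_company "
        "(company_id, size, profits, location, attr_hash, valid_from, valid_to) "
        "VALUES (?, ?, ?, ?, ?, ?, NULL)",
        (record["company_id"], record.get("size"), record.get("profits"),
         record.get("location"), new_hash, load_date),
    )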
I have one table where I store all of the sensor data.
Id is the partition key and TimeEpoch is the sort key.
An example table looks like this:
| Id                                   | TimeEpoch  | AirQuality | Temperature | WaterTemperature | LightLevel |
| b8a76d85-f1b1-4bec-abcf-c2bed2859285 | 1608208992 | 95         |             |                  |            |
| 3a6930c2-752a-4103-b6c7-d15e9e66a522 | 1608208993 |            | 23.4        |                  |            |
| cb44087d-77da-47ec-8264-faccc2a50b17 | 1608287992 |            |             | 5.6              |            |
| latest                               | 1608287992 | 95         | 23.4        | 5.6              | 1000       |
I need to get the latest values of all the attributes from the table.
For now I have used an additional item with Id = latest where I store all the latest values, but I know this is a hacky approach that requires the sensor to write its data twice: once under a new GUID as the Id and once under Id = latest.
The attributes are all known, and it is possible that one sensor, under one Id, stores AirQuality and Temperature at the same time.
NoSQL databases like DynamoDB are a tricky thing, because they don't offer the same query "patterns" as traditional relational databases do.
Therefore, you often need non-traditional solutions to valid challenges like the one you present.
My proposal for one such solution would be to use a DynamoDB feature called DynamoDB Streams.
In short, DynamoDB Streams will be triggered every time an item in your table is created, modified or removed. Streams will then send the new (and old) version of that item to a "receiver" you specify. Typically, that would be a Lambda function.
The solution I would propose is to use streams to send new items to a Lambda. This Lambda could then read the attributes of the item that are not empty and write them to whatever datastore you like. Could be another DynamoDB table, could be S3 or whatever else you like. Obviously, the Lambda would need to make sure to overwrite previous values etc, but the detailed business logic is then up to you.
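As a rough illustration (not a drop-in implementation), a stream-triggered Lambda in Python could look something like this; the separate SensorLatest table and the attribute handling are assumptions of the sketch.

import boto3

dynamodb = boto3.resource("dynamodb")
latest_table = dynamodb.Table("SensorLatest")  # hypothetical table keyed only by Id

SENSOR_ATTRIBUTES = {"AirQuality", "Temperature", "WaterTemperature", "LightLevel"}

def handler(event, context):
    """Triggered by the DynamoDB stream; copies the non-empty sensor attributes
    of every new item onto a single 'latest' item."""
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        # Stream images use the low-level format, e.g. {"Temperature": {"N": "23.4"}};
        # for brevity the raw string value is written through unchanged.
        updates = {
            name: list(attr_value.values())[0]
            for name, attr_value in new_image.items()
            if name in SENSOR_ATTRIBUTES
        }
        if not updates:
            continue
        latest_table.update_item(
            Key={"Id": "latest"},
            UpdateExpression="SET " + ", ".join(
                f"#a{i} = :v{i}" for i in range(len(updates))
            ),
            ExpressionAttributeNames={f"#a{i}": name for i, name in enumerate(updates)},
            ExpressionAttributeValues={f":v{i}": v for i, v in enumerate(updates.values())},
        )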
The upside of this approach is that you would have some form of up-to-date version of all of those values that you can always read, without any complicated logic to find the latest value of each attribute. So reading would be simplified.
The downside is that writing becomes a bit more complex, not least because you introduce more parts to your solution (DynamoDB Streams, Lambda, etc.). This will also increase your cost a bit, depending on how often your data changes; since you seem to be storing sensor data, that might be quite often, so keep an eye on the cost. This solution will also introduce more delay, so if latency is an issue, it might not be for you.
Lastly, I want to mention that it is recommended to have at most two "receivers" of a table's stream. That means that for production I would recommend having only a single receiver Lambda and then letting that Lambda publish an AWS EventBridge event (e.g. "item created", "item modified", "item removed"). This allows many more Lambdas etc. to "listen" to such events and process them, mitigating the stream limitation. This then becomes an event-driven solution. As before, it will add delay.
I'm attempting to find design patterns/strategies for working with accumulated bucket values in a database where concurrency can be a problem. I don't know the proper search terms to use to find information on the topic.
Here's my use case (I'm using code-first Entity Framework, so EF-specific advice is welcome):
I have a database table that contains a quantity value. This quantity value can be incremented or decremented by multiple clients at the same time. Because of this, I call it a "bucket" value: it is a bucket for a bunch of accumulated activity (as opposed to the other strategy, where you keep all the activity and calculate the value from that activity). I am looking for strategies for ensuring the accuracy of this "bucket" value (within the context of EF), taking into consideration that multiple clients may attempt to change it simultaneously (concurrency).
The answer "you must track activity and derive your value from that activity" is acceptable, but I want to consider all bucket-centric solutions as well.
I am looking for advice on search terms to use to find good information on this topic as well as specific links.
Edit: You may assume that all activity is relative to the "bucket" value (no clients will be making an absolute change to the value; they will only increment or decrement).
Without directly coding the SQL queries that update the buckets, you would have to use client-side optimistic concurrency. See Entity Framework Optimistic Concurrency Patterns. Clients whose update would overwrite a change will get an exception, after which you can reload the current value and retry. This pattern requires a ROWVERSION column on the target table.
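The retry loop itself is the same whether EF or raw SQL does the write; here is the idea sketched in Python with plain SQL rather than EF, assuming a hypothetical buckets(id, quantity, version) table whose version column plays the ROWVERSION role.

def increment_bucket(conn, bucket_id, delta, max_retries=5):
    """Optimistic concurrency: only apply the change if nobody updated the row
    between our read and our write; otherwise reload and retry."""
    for _ in range(max_retries):
        cur = conn.cursor()
        cur.execute("SELECT quantity, version FROM buckets WHERE id = ?", (bucket_id,))
        quantity, version = cur.fetchone()
        cur.execute(
            "UPDATE buckets SET quantity = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (quantity + delta, bucket_id, version),
        )
        if cur.rowcount == 1:
            conn.commit()      # our snapshot was still current: change applied
            return
        conn.rollback()        # a concurrent writer got in first: retry
    raise RuntimeError("bucket update kept conflicting; giving up after retries")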
If you code the updates in T-SQL you can write an atomic update, something like
update foo with (updlock)
set bucket_a = bucket_a + 1
output inserted.*
where id = @id
(The updlock hint isn't strictly necessary in this query, but it is good form any time you want to ensure this kind of isolation.)
I need to design and implement something similar to what Martin Fowler calls the "Unit of Work" pattern. I have heard others refer to it as a "Shopping Cart" pattern, but I'm not convinced the needs are the same.
The specific problem is that users (and our UI team) want to be able to create and assign child objects (with referential integrity constraints in the database) before the parent object is created. I met with another of our designers today and we came up with two alternative approaches.
a) First, create a dummy parent object in the database, and then create dummy children and dummy assignments. We could use negative keys (our normal keys are all positive) to distinguish between the sheep and the goats in the database. Then when the user submits the entire transaction we have to update data and get the real keys added and aligned.
I see several drawbacks to this one.
• It causes perturbations to the indexes.
• We still need to come up with something to satisfy unique constraints on columns that have them.
• We have to modify a lot of existing SQL, and code that generates SQL, to add yet another predicate to a lot of WHERE clauses.
• Altering the primary keys in Oracle can be done, but it's a challenge.
b) Create transient tables for the objects and assignments that need to be able to participate in these reverse transactions. When the user hits Submit, we generate the real entries and purge the old ones.
I think this is cleaner than the first alternative, but still involves increased levels of database activity.
Both methods require that I have some way to expire transient data if the session is lost before the user executes submit or cancel requests.
Has anyone solved this problem in a different way?
Thanks in advance for your help.
I don't understand why these objects need to be created in the database before the transaction is committed, so you might want to clarify with your UI team before proceeding with a solution. You may find that all they want to do is read information previously saved by the user on another page.
So, assuming that the objects don't need to be stored in the database before the commit, I give you plan C:
Store initialized business objects in the session. You can then create all the children you want, and only touch the database (and set up references) when the transaction needs to be committed. If the session data is going to be large (either individually or collectively), store the session information in the database (you may already be doing this).
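A minimal sketch of plan C, assuming a dict-like web session and hypothetical parent/child tables (the table and column names are invented for illustration):

import uuid

def stage_child(session, child):
    """Keep not-yet-committed children in the session under temporary ids;
    nothing touches the database until the user hits Submit."""
    pending = session.setdefault("pending_children", {})
    temp_id = str(uuid.uuid4())
    pending[temp_id] = child
    return temp_id

def submit(session, conn, parent):
    """On Submit, write the parent and all staged children in one transaction,
    so real keys are only generated for data the user actually committed."""
    cur = conn.cursor()
    cur.execute("INSERT INTO parent (name) VALUES (?)", (parent["name"],))
    parent_id = cur.lastrowid
    for child in session.get("pending_children", {}).values():
        cur.execute(
            "INSERT INTO child (parent_id, name) VALUES (?, ?)",
            (parent_id, child["name"]),
        )
    conn.commit()
    session.pop("pending_children", None)
    return parent_id

If the user cancels or the session expires, there is nothing to purge, because nothing was ever written to the database.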
Google App Engine supports a fetch operation based on a list of keys: google.appengine.ext.db.get(keys).
I'd be interested to figure out whether there is any guarantee that the result list preserves the order of the keys (i.e., for keys = [k_1, k_2, k_3] and result [r_1, r_2, r_3], is it always true that r_i.key() == k_i?).
As far as I know, the API performs IN selects by internally issuing N sub-selects, one for each value in the IN clause. I would expect the same to happen for db.get(keys), in which case the call would preserve the key order.
Anyway, I am not sure, and I cannot find any reference stating that db.get(keys) is equivalent to an IN select, or whether there are any optimizations for its execution in place. Otherwise, the workaround would be quite simple (I would iterate and query for each key myself, so I would have the guarantee and wouldn't depend on the db.get(keys) implementation).
I have run some basic tests and the results show that:
• db.get() performs best
• db.get() preserves the key order
• the alternative Model.get_by_id (for which the order of results is always guaranteed) performs slower
While the results seem to confirm my assumptions, I am wondering if others have investigated this and have reached similar or different conclusions.
Doing some more research, I found the following in the documentation for both db.get() and Model.get():
If ids is a list, the method returns a list of model instances, with a None value when no entity exists for a corresponding Key.
Even though it doesn't spell it out, I think it is clear that the order is guaranteed.
You're correct: db.get returns entities in the same order as the keys you provided. The performance difference you observe is because it only has to make one round-trip to the database instead of many, and because it can simultaneously fetch all the entities, rather than acting serially. It's not equivalent to 'SELECT ... IN ...', however, because it's based on Bigtable, and you're selecting on the primary key, so it can do lookups directly on the table.
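For illustration, a small sketch of relying on that positional pairing with the db API (assuming db.Model entities):

from google.appengine.ext import db

def fetch_in_key_order(keys):
    """db.get(keys) returns results positionally: results[i] belongs to keys[i],
    with None where no entity exists, so zipping the two lists is safe."""
    results = db.get(keys)
    for key, entity in zip(keys, results):
        if entity is not None:
            assert entity.key() == key
    return results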
One thing to bear in mind when doing performance comparisons: Always do these on the production server, never on dev_appserver. The two have totally different performance characteristics.
The quote from the documentation clarifies my question.