Locking mechanism for data stored in Redis - C

Dears, I have a problem where multiple Redis clients access a common structure stored on the Redis server.
The requirements are as follows:
While one Redis client is accessing the structure (it performs both read and write operations on it), no other Redis client should be able to access it; they should wait until it is released.
Whenever another Redis client accesses the structure, it should see the updated structure.
How can I implement a locking mechanism in C to fulfill these requirements?
Thanks in advance.

Redis provides the following:
1) Redis transactions with optimistic locking. See Redis Transactions.
2) Lua scripting, which Redis executes atomically. See EVAL.

Use the WATCH command (https://redis.io/commands/watch) to detect modifications by other clients. WATCH applies only to the keys you specify, and only within a Redis transaction.
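For illustration, here is a minimal sketch of that optimistic-locking pattern using the Jedis Java client (the question asks for C, but the WATCH/MULTI/EXEC command sequence is exactly the same from hiredis; the key name and the transform step are placeholders):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;
import java.util.List;

public class OptimisticLockExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            boolean updated = false;
            while (!updated) {
                jedis.watch("shared:structure");                // watch the key for concurrent changes
                String current = jedis.get("shared:structure"); // read
                String modified = transform(current);           // local read-modify step
                Transaction tx = jedis.multi();                 // queue commands atomically
                tx.set("shared:structure", modified);
                List<Object> result = tx.exec();                // aborted if another client touched the key
                updated = (result != null && !result.isEmpty());
            }
        }
    }

    // Placeholder for whatever update you need to perform on the structure.
    private static String transform(String value) {
        return (value == null ? "" : value) + "|updated";
    }
}

If another client modifies the watched key between WATCH and EXEC, the transaction is discarded and the loop retries with the fresh value.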

Dears, thanks all for your responses. It was helpful to learn about the various features and general approaches available in Redis.
However, I ended up with the following approach, since it met my requirement.
I used the second-granularity timestamp (say t_sec) as the key and a counter as the hash value. If further requests arrive within that same second, the counter corresponding to the t_sec key is incremented atomically (HINCRBY command). The rest of the parameters are stored locally in a structure. If the counter reaches a configured limit, further requests are dropped.
When the next second starts, a new t_sec key is used and its counter is incremented from zero again.
The t_sec key corresponding to the previous second is deleted (HDEL command).
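For reference, a rough sketch of that counter scheme (shown with the Jedis Java client purely for illustration, since the original code is in C; the hash name "requests" and the limit are assumptions):

import redis.clients.jedis.Jedis;

public class PerSecondCounter {
    private static final long LIMIT = 100;   // assumed per-second request limit

    // Returns true if the request may proceed, false if it should be dropped.
    public static boolean allowRequest(Jedis jedis) {
        long tSec = System.currentTimeMillis() / 1000;                    // current second
        long count = jedis.hincrBy("requests", String.valueOf(tSec), 1);  // atomic increment for this second
        jedis.hdel("requests", String.valueOf(tSec - 1));                 // drop the previous second's counter
        return count <= LIMIT;
    }
}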

Related

Is it possible for a DynamoDB read to return state that is older than the state returned by a previous read?

Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatedly reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it possible for the read process to read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually the write will be fully propagated and the read process will only read 1 values, but I'm unsure whether it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually-consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes when two of the three copies (including the one designated as the "leader") have been updated. It is possible that the third copy will take some time (usually a very short time) to get updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short window where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might randomly go to the not-yet-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guarantees monotonic reads, and discovered that it does not by actually observing monotonic-read violations.
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not...) use the same HTTP connection to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica - despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice - but they may still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic read property to hold.
Yes it is possible. Requests are stateless so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) not ever get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
But more importantly, every item in DDB is stored in three storage nodes. A write to DDB doesn't return a 200 - Success until the data has been written to 2 of the 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest data.
See Amazon DynamoDB Under the Hood
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections by default to the DDB request router, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure, and they haven't said.
Assuming that belief is correct:
curl in a shell script loop - more likely to see the old data again
loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
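For reference, a strongly consistent GetItem with the AWS SDK for Java v2 might look like the sketch below (table name and key attribute are placeholders):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import java.util.Map;

public class ConsistentReadExample {
    public static void main(String[] args) {
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {
            GetItemRequest request = GetItemRequest.builder()
                    .tableName("my-table")                            // placeholder table name
                    .key(Map.of("id", AttributeValue.builder().s("some-key").build()))
                    .consistentRead(true)                             // strongly consistent read
                    .build();
            Map<String, AttributeValue> item = dynamo.getItem(request).item();
            System.out.println(item);
        }
    }
}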
Another option might be to stick DynamoDB Accelerator (DAX) in front of your DDB, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it doesn't come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.

How Does Flink Clean Up Keyed State?

When thinking about the act of keying by something, I traditionally think of the analogy of throwing all the events that match the key into the same bucket. As you can imagine, when the Flink application starts handling lots of data, what you opt to key by becomes important, because you want to make sure you clean up state well. This leads me to my question: how exactly does Flink clean up these "buckets"? If a bucket is empty (all the MapStates and ValueStates are empty), does Flink close that area of the key space and delete the bucket?
Example:
Incoming Data Format: {userId, computerId, amountOfTimeLoggedOn}
Key: UserId/ComputerId
Current Key Space:
Alice, Computer 10: Has 2 events in it. Both events are stored in state.
Bob, Computer 11: Has no events in it. Nothing is stored in state.
Will Flink come and remove Bob, Computer 11 from the Key Space eventually or does it just live on forever because at one point it had an event in it?
Flink does not store any data for state keys which do not have any user value associated with them, at least in the existing state backends: Heap (in memory) or RocksDB.
The key space is virtual in Flink; Flink does not make any assumptions about which concrete keys can potentially exist. There are no pre-allocated buckets per key or subset of keys. Only once the user application writes some value for some key does it occupy storage.
The general idea is that all records with the same key are processed on the same machine (somewhat like being in the same bucket as you say). The local state for a certain key is also always kept on the same machine (if stored at all). This is not related to checkpoints though.
For your example, if some value was written for [Bob, Computer 11] at some point in time and then subsequently removed, Flink will remove it completely, along with the key.
Short Answer
It cleans up with the help of the Time To Live (TTL) feature of Flink state and the Java garbage collector (GC). The TTL feature removes any reference to the state entry, and the GC then reclaims the allocated memory.
Long Answer
Your question can be divided into 3 sub-questions:
I will try to be as brief as possible.
How does Flink partition the data based on Key?
For an operator over a keyed stream, Flink partitions the data on the key with the help of a consistent hashing algorithm. It creates max_parallelism buckets (key groups). Each operator instance is assigned one or more of these buckets. Whenever a datum is to be sent downstream, its key is assigned to one of those buckets and it is consequently sent to the corresponding operator instance. No key is stored here because the ranges are calculated mathematically, so no area is ever cleared and no bucket is ever deleted. You can create any type of key you want; it won't affect memory in terms of key space or ranges.
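As a rough illustration of the idea (this is a simplification, not Flink's actual hashing code, which runs the key's hashCode through a murmur hash):

public class KeyGroupIllustration {
    // Map a key to one of maxParallelism "key groups" (buckets).
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Each operator instance owns a contiguous range of key groups.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128, parallelism = 4;
        int keyGroup = keyGroupFor("Alice/Computer10", maxParallelism);
        System.out.println("key group " + keyGroup + " -> operator instance "
                + operatorIndexFor(keyGroup, maxParallelism, parallelism));
    }
}

Nothing here is stored per key; the mapping is pure arithmetic.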
How does Flink store state with a Key?
All operator instances have an instance-level state store. This store defines the state context of that operator instance and it can store multiple named-state-storages e.g. "count", "sum", "some-name" etc. These named-state-storages are Key-Value stores that can store values based on the key of the data.
These KV stores are created when we initialize the state with a state descriptor in the open() method of an operator, i.e. getRuntimeContext().getState(descriptor).
These KV stores store data only when something actually needs to be stored in the state (like HashMap.put(k,v)). Thus no key or value is stored unless a state update method (like update, add, put) is called.
So,
If Flink hasn't seen a key, nothing is stored for that key.
If Flink has seen the key but didn't call the state update methods, nothing is stored for that key.
If a state update method is called for a key, the key-value pair will be stored in the KV store.
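A simplified sketch of those three cases (the function, state name, and Long-typed input are assumptions; a real job would key by userId/computerId and use the full event):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Sums amountOfTimeLoggedOn per key; state is only written when update() is called.
public class LoggedOnTimeAggregator extends RichFlatMapFunction<Long, Long> {

    private transient ValueState<Long> totalTime;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("totalLoggedOnTime", Long.class);
        totalTime = getRuntimeContext().getState(descriptor);  // creates the named KV store
    }

    @Override
    public void flatMap(Long amountOfTimeLoggedOn, Collector<Long> out) throws Exception {
        Long current = totalTime.value();                       // null until something was stored for this key
        long updated = (current == null ? 0L : current) + amountOfTimeLoggedOn;
        totalTime.update(updated);                              // only now is anything stored for the key
        out.collect(updated);
    }
}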
How does Flink clean up the state for a Key?
Flink does not delete state unless it is required to by the user or done by the user manually. As mentioned earlier, Flink has the TTL feature for state. The TTL marks the state entry as expired and removes it when a cleanup strategy is invoked. These cleanup strategies vary with the backend type and the time of cleanup. For the heap state backend, it removes the entry from the state table, i.e. it drops any reference to the entry; the memory occupied by this unreferenced entry is then cleaned up by the Java GC. For the RocksDB state backend, it simply calls the native delete method of RocksDB.
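A minimal sketch of enabling that TTL on a state descriptor (the one-hour TTL, state name, and cleanup choice are assumptions):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class TtlStateDescriptorFactory {
    // Builds a ValueState descriptor whose entries expire one hour after the last write.
    public static ValueStateDescriptor<Long> loggedOnTimeDescriptor() {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .cleanupFullSnapshot()   // also prune expired entries when a full snapshot is taken
                .build();

        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("totalLoggedOnTime", Long.class);
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}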

Flink when to split stream to jobs, using uid, rebalance

I am pretty new to Flink and about to deploy our first production version. We have a stream of data. A stateful filter checks whether the data is new.
Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
Following the documentation's recommendation, should I set a uid per operator, e.g.:
dataStream
  .uid("firstid")
  .keyBy(0)
  .flatMap(flatMapFunction)
  .uid("mappedId");
Should I add a rebalance after each uid, if at all?
What is the difference between calling setMaxParallelism as described here and setting the parallelism from the Flink UI/CLI?
You only need to define .uid("someName") for your stateful operators. Not much need for operators which do not hold state as there is nothing in the savepoints that needs to be mapped back to them (more on this here). Won't hurt if you do though.
rebalance will only help you in the presence of data skew and that only if you aren't using keyed streams. If you process data based on a key, and your load isn't uniformly distributed across your keys (ie you have loads of "hot" keys) then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless processes are very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother splitting it up at this stage.
There isn't a right and a wrong here though; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. This is used by Flink internally, but it doesn't set the parallelism of your job. You usually set it to some multiple of the actual parallelism you set for your job (via -p <n> in the CLI/UI when you deploy it).
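A small sketch of where each setting lives (the numbers are arbitrary):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSettings {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setMaxParallelism(128);  // number of key groups; upper bound for later rescaling
        env.setParallelism(4);       // actual parallelism; can also be set with -p 4 when deploying

        // ... build the pipeline, giving stateful operators stable uids ...
        // env.execute("my-job");
    }
}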

multi-thread applications in Berkeley DB

I have a simple multi-threaded application. All the threads only do put operations into the same database. But before a thread performs a put, it first acquires a mutex lock to increment the key number, then releases the lock, and then does the put; i.e., the threads may insert items with different key numbers at the same time. That's what I did in my application.
What I am still confused about is whether this simple app needs to specify the DB_INIT_LOCK flag or the DB_INIT_CDB flag. I have read the documentation about these flags. DB_INIT_CDB means multiple readers/single writer; however, in my simple app the threads can write concurrently, so there is no single writer and I do not need it. For DB_INIT_LOCK, since the threads never insert items with the same key, I do not need it either, am I right?
Please correct me if I am wrong. Many thanks.
You correctly state that DB_INIT_CDB gives you a multi-reader, single-writer environment. This puts Berkeley DB in a completely different mode of operation. But, since you've got more than one writer, you can't use it.
You'll need at least these two flags:
DB_INIT_LOCK: You're doing your own locking around your database key generation. But when you insert records into the database, Berkeley DB is going to touch some of the same pieces of memory. For example, the very first two records you insert will be right next to each other in the database. Unless they are large, they'll be on the same database "page" of memory. You need this flag to tell BDB to do its own locking.
It's the same as if you implemented your own in-memory binary tree that multiple threads were changing. You'd have to use some kind of locking to prevent the threads from completely destroying the tree with incompatible updates.
DB_THREAD: This flag lets BDB know that multiple threads will be using the same database environment.
You may find that you need to use transactions, or at the very least to allow BDB to use them internally. That's DB_INIT_TXN. And, I've always needed DB_INIT_MPOOL and DB_PRIVATE to allow BDB to use malloc() to manage some of its own memory.
(Just as an aside, if you have a simple increment for your key, consider using an atomic increment operation instead of a mutex. If you're using C with gcc, the builtin __sync_fetch_and_add (or __sync_add_and_fetch) can do this for you. From C++, you can use std::atomic's post-increment.)

How can we synchronize reads and writes to mem cache on app engine?

How can we make sure only one instance of a JVM is modifying a mem cache key instance at any one time, and also that a read isn't happening while in the middle of a write? I can use either the low level store, or the JSR wrapper.
List<Foo> foos = cache.get("foos");
foos.add(new Foo());
cache.put("foos", foos);
Is there any way for us to protect against concurrent writes, etc.?
Thanks
Fh. is almost on the right track, but not completely. He is right that memcache operations are atomic, but that does not relieve you from taking care of 'synchronization'. Since we have many JVM instances running, we cannot speak of real synchronization in the sense we usually mean in multi-threaded environments. Here's how to 'synchronize' memcache access across one or many instances:
With the method MemcacheService.getIdentifiable(Object key) you get an IdentifiableValue instance. You can later put it back into memcache using MemcacheService.putIfUntouched(...).
Check the API at MemCache API : getIdentifiable().
An IdentifiableValue is a wrapper containing the object you fetched via its key.
Here's how it works:
Instance A fetches Identifiable X.
Instance B fetches Identifiable X at the 'same' time
Instances A and B each update X's wrapped object (your own object, actually).
The first instance doing a putIfUntouched(...) to store back the object will succeed, putIfUntouched will return true.
The second instance trying this will fail to do so, with putIfUntouched returning false.
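A sketch of that retry loop with the low-level API (class name and key are placeholders; the Foo objects must be serializable for memcache):

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheService.IdentifiableValue;
import com.google.appengine.api.memcache.MemcacheService.SetPolicy;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.ArrayList;
import java.util.List;

public class FooCache {
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Retries the read-modify-write until no other instance has touched "foos" in between.
    @SuppressWarnings("unchecked")
    public void addFoo(Foo foo) {
        while (true) {
            IdentifiableValue current = cache.getIdentifiable("foos");
            if (current == null) {
                // Entry does not exist yet; only the first writer gets to create it.
                List<Foo> initial = new ArrayList<>();
                initial.add(foo);
                if (cache.put("foos", initial, null, SetPolicy.ADD_ONLY_IF_NOT_PRESENT)) {
                    return;
                }
                continue;   // another instance created it first; retry
            }
            List<Foo> foos = new ArrayList<>((List<Foo>) current.getValue());
            foos.add(foo);
            if (cache.putIfUntouched("foos", current, foos)) {
                return;     // nobody modified "foos" since getIdentifiable()
            }
            // Another instance won the race; loop and re-read the fresh value.
        }
    }
}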
Are you worried about corrupting memcache by reading a key while a parallel operation is writing to the same key? Don't worry about this; on the memcache server both operations will run atomically.
Memcache will never output a 'corrupt' result, nor have corruption issues while writing.
