Apache Flink Set Operator Uid vs UidHash

I'm using Apache Flink 1.2.0. According to the Production Readiness Checklist (https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/production_ready.html), it is recommended to set UIDs for operators to ensure savepoint compatibility.
I couldn't find a setUid() method for a flatMap, but I found uid() and setUidHash(), which the documentation describes as follows:
uid
"Sets an ID for this operator.
The specified ID is used to assign the same operator ID across job submissions (for example when starting a job from a savepoint)."
uidHash
"Sets an user provided hash for this operator. This will be used AS IS the create the JobVertexID.
The user provided hash is an alternative to the generated hashes, that is considered when identifying an operator through the default hash mechanics fails (e.g. because of changes between Flink versions)."
Which one should actually be set on a flatMap, for example: uid() or setUidHash()? Or both?

The uid() method is the one recommended in this case.
setUidHash() should be used only as a workaround to fix up jobs that were created with default (generated) IDs instead of user-defined ones. The Javadoc states:
this should be used as a workaround or for trouble shooting. The provided hash needs to be unique per transformation and job. Otherwise, job submission will fail. Furthermore, you cannot assign user-specified hash to intermediate nodes in an operator chain and trying so will let your job fail.
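Why a user-defined ID helps across job submissions can be illustrated with a toy model. This is only a sketch in plain Python, not Flink code: the helper names (generated_id, operator_ids, restorable_operators) and the example pipelines are hypothetical, and the "generated" ID here depends only on an operator's position, which is an assumed simplification of how Flink actually derives its default hashes.

```python
def generated_id(position, name):
    # ASSUMPTION: stand-in for an auto-generated ID. It depends only on the
    # operator's position in the job, a simplification of Flink's real hashing.
    return f"auto-{position}-{name}"


def operator_ids(pipeline, use_uids):
    """Map operator name -> ID for a list of (name, uid) pairs."""
    return {
        name: (uid if use_uids else generated_id(position, name))
        for position, (name, uid) in enumerate(pipeline)
    }


def restorable_operators(old_pipeline, new_pipeline, use_uids):
    """Names of old operators whose saved ID still exists in the new job."""
    saved = operator_ids(old_pipeline, use_uids)
    current = set(operator_ids(new_pipeline, use_uids).values())
    return sorted(name for name, op_id in saved.items() if op_id in current)


# Hypothetical job, version 1: (operator name, user-defined uid)
v1 = [("source", "src"), ("flatMap", "tokenizer"), ("sink", "out")]
# Version 2 inserts a filter before the flatMap, shifting every later operator.
v2 = [("source", "src"), ("filter", "drop-empty"),
      ("flatMap", "tokenizer"), ("sink", "out")]

print(restorable_operators(v1, v2, use_uids=True))
print(restorable_operators(v1, v2, use_uids=False))
```

In this model, every operator from v1 can still be matched in v2 when explicit uids are used, while with position-based IDs only the unmoved source matches.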

Related

Are there ways to perform membership tests in pact? (performing membership test for pact tables)

Hello to the kadena pact developer community
I was looking at some basic code examples, and as I wanted to play around with the functionality to develop a better grasp of it, I got curious about the following:
We see that some capabilities as defined in example code test for values within a row inside a table.
Is there a way one could simply test for a key and fail the predicate if the key is not present in the table?
Thank you for your insight.

While this may not be the most efficient way, I have found a solution to the question.
Pact tests membership with the built-in function 'contains'.
To find out whether a key exists in a table, we can use the built-in function 'keys'.
It returns a list of strings (the table's keys), so 'contains' can check whether the key in question is among them, i.e. whether X is a member of the table.
Since this requires us to get a complete list of keys just to see whether the particular key is within our table, this is where my concern regarding performance comes in.
I wanted to share this with everyone, regardless, but in certain circumstances it may be better to just let the transaction fail instead of enforcing membership explicitly like this.
Edit: I used some code previously to show how to achieve this, but it was faulty code.
If you need a membership test, you can do it within the context of an if statement, but not with (enforce), because enforce only allows "pure" expressions (i.e. expressions that can be evaluated on the spot and do not involve database lookups such as the 'keys' function).
Enforcing a test outcome that requires database access returns an error like:
Error from (api.testnet.chainweb.com): : Failure: Database exception:
: Failure: Illegal database access attempt (keys)

Flink job production readiness - validate UUIDs assigned to all operators

The Flink production readiness checklist (https://ci.apache.org/projects/flink/flink-docs-stable/ops/production_ready.html) suggests assigning UUIDs to all operators. I'm looking for a way to validate that all operators in a given job graph have been assigned UUIDs, ideally to be used as a pre-deployment check in our CI flow.
We already have a process in place that uses the PackagedProgram class to get a JSON-formatted 'preview plan'. Unfortunately, that does not include any information about the assigned UUIDs (or lack thereof).
Digging into the code behind generating the JSON preview plan (PlanJSONDumpGenerator), I can trace how it visits each of the nodes as a DumpableNode<?>, but from there, I can't find anything that leads me to the definition of the operator with its UUID.
When defining the job (using the DataStream API), the UUID is assigned on a StreamTransformation<T>. Is there any way to connect the data in the PackagedProgram back to the original StreamTransformation<T>s to get the UUID?
Or is there a better approach to doing this type of validation?

What is CAS in NoSQL and how to use it?

Write operations on Couchbase accept a parameter cas (create and set). Also, the result object returned by any non-data-fetching query has a cas property in it. I Googled a bit and couldn't find a good conceptual article about it.
Could anyone tell me when to use CAS and how to do it? What should be the common work-flow of using CAS?
My guess is we need to fetch CAS for the first write operation and then pass it along with next write. Also we need to update it using result's CAS. Correct me if I am wrong.

CAS actually stands for check-and-set, and is a method of optimistic locking. A CAS value is associated with each document and is updated whenever the document changes, a bit like a revision ID. The intent is that instead of pessimistically locking a document (with the associated lock overhead), you just read its CAS value and then only perform the write if the CAS still matches.
The general use-case is:
1. Read an existing document and obtain its current CAS (get_with_cas).
2. Prepare a new value for that document, assuming no-one else has modified the document (and hence caused the CAS to change).
3. Write the document using the check_and_set operation, providing the CAS value from (1).
Step 3 will only succeed (perform the write) if the document is unchanged between (1) and (3), i.e. no other user has modified it in the meantime. Typically, if (3) does fail, you would retry the whole sequence (get_with_cas, modify, check_and_set).
There's a much more detailed description of check-and-set in the Couchbase Developer Guide under Concurrent Document Mutations.
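The read, modify, check-and-set, retry sequence described above can be sketched with a toy in-memory store. This is not the Couchbase SDK: InMemoryStore, CasMismatch, and update_with_retry are hypothetical names, and only the method names get_with_cas and check_and_set are borrowed from the answer.

```python
import threading


class CasMismatch(Exception):
    """Raised when the supplied CAS no longer matches the stored document."""


class InMemoryStore:
    """Toy document store (an assumption for illustration, not Couchbase)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # key -> (value, cas)

    def insert(self, key, value):
        with self._lock:
            self._docs[key] = (value, 1)

    def get_with_cas(self, key):
        """Return the current (value, cas) pair for the key."""
        with self._lock:
            return self._docs[key]

    def check_and_set(self, key, new_value, cas):
        """Write only if the document still carries the CAS we read earlier."""
        with self._lock:
            _, current_cas = self._docs[key]
            if current_cas != cas:
                raise CasMismatch(key)
            self._docs[key] = (new_value, current_cas + 1)
            return current_cas + 1


def update_with_retry(store, key, modify, max_attempts=10):
    """Optimistic update: retry the whole sequence when the CAS has changed."""
    for _ in range(max_attempts):
        value, cas = store.get_with_cas(key)           # step 1: read value + CAS
        new_value = modify(value)                      # step 2: prepare new value
        try:
            store.check_and_set(key, new_value, cas)   # step 3: conditional write
            return new_value
        except CasMismatch:
            continue                                   # someone else wrote first
    raise RuntimeError("gave up after repeated CAS mismatches")
```

A call such as update_with_retry(store, "counter", lambda v: v + 1) never takes a lock that outlives a single operation; a concurrent writer only costs the loser one extra round trip.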

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? For me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read the MongoDB: the definitive guide and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?

If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.

findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document, i.e. {"multi": false}, is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.

From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.

One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented
value in one step. Compare: if you (A) perform this operation in two steps and
somebody else (B) does the same operation between your steps then A and B may
get the same last counter value instead of two different (just one example of possible issues).
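The counter scenario above can be replayed deterministically with a toy in-memory class. This is a sketch in plain Python rather than a MongoDB driver: Collection, update_inc, find_and_modify_inc, and the two helper functions are hypothetical names chosen for illustration.

```python
import threading


class Collection:
    """Toy single-document 'collection' (an assumption, not a MongoDB API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._doc = {"_id": "counter", "value": 0}

    def update_inc(self, amount):
        """Atomic increment that, like update(), returns no document."""
        with self._lock:
            self._doc["value"] += amount

    def find(self):
        """Separate read of the current document."""
        with self._lock:
            return dict(self._doc)

    def find_and_modify_inc(self, amount):
        """Atomic increment that also returns the updated document."""
        with self._lock:
            self._doc["value"] += amount
            return dict(self._doc)


def two_step_interleaved(coll):
    """Clients A and B each update, then each read: the steps interleave."""
    coll.update_inc(1)            # A increments
    coll.update_inc(1)            # B increments before A has read
    a = coll.find()["value"]      # A reads
    b = coll.find()["value"]      # B reads
    return a, b


def one_step(coll):
    """Clients A and B each increment and read in a single atomic call."""
    a = coll.find_and_modify_inc(1)["value"]
    b = coll.find_and_modify_inc(1)["value"]
    return a, b


print(two_step_interleaved(Collection()))  # both clients see the same value
print(one_step(Collection()))              # each client sees its own value
```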

This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact, the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. there is an internal write lock on that document for the fraction of a millisecond in which the server confirms the query matches and applies the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
From what I can tell there are two main ways the methods differ:
If you want a copy of the document when your update was made: only findAndModify allows this, returning either the original (default) or the new record after the update, as mentioned. With update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record in between your read and update.
If there are potentially multiple matching documents: findAndModify only changes one, and allows you to customize the sort to indicate which one should be changed. update can change all of them with multi, although it defaults to just one, but it does not let you say which one.
Thus what HungryCoder says makes sense: update is more efficient where you can live with its restrictions (e.g. you don't need to read the document, or of course if you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
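The two differences listed above (getting the document back, and choosing which match to change) can be sketched over a plain Python list of dicts. This is a single-threaded illustration of the semantics only: update_one and find_and_modify are hypothetical helpers, not driver calls, and the parameter names sort_key and new are assumptions modeled loosely on the sort and new options.

```python
import copy


def update_one(docs, matches, change):
    """Like update() without multi: change the first match, return only a count."""
    for doc in docs:
        if matches(doc):
            change(doc)
            return 1
    return 0


def find_and_modify(docs, matches, change, sort_key=None, new=False):
    """Change exactly one match, chosen by sort order, and return a copy of it."""
    candidates = [doc for doc in docs if matches(doc)]
    if not candidates:
        return None
    if sort_key is not None:
        candidates.sort(key=sort_key)      # the caller decides which match wins
    target = candidates[0]
    before = copy.deepcopy(target)         # snapshot of the pre-modified document
    change(target)
    return copy.deepcopy(target) if new else before


def is_ready(doc):
    return doc["status"] == "ready"


def mark_running(doc):
    doc["status"] = "running"
```

For example, with two "ready" jobs of priority 5 and 1, find_and_modify(jobs, is_ready, mark_running, sort_key=lambda d: d["priority"], new=True) claims the priority-1 job and returns it in its updated form, whereas update_one simply changes whichever match comes first and reports a count.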

We used findAndModify() for counter operations (inc or dec) and other single-field mutation cases. While migrating our application from Couchbase to MongoDB, I found that this API could replace the code that calls GetAndlock(), modifies the content locally, calls replace() to save, and then calls Get() again to fetch the updated document. With MongoDB, I just use this single API, which returns the updated document.

Database Supporting Update by Function rather than by Value

Desired behavior:
In Clojure's implementation of Agents, to update an Agent, one does NOT send a new value. One sends a function, which is called on the old value, and the return value is set as the new value of the Agent.
This makes certain things easy: for example, if I have a queue, and I have two concurrent threads that both want to append to the queue (and I don't care in which order they append), each thread can just fire off a (fn [x] (conj x new_value)) ... and it just works. Whereas, if updates were by value, I'd have to do a compare-and-swap of some sort.
Question:
Is there any database that supports this type of updating? For example, I was recently looking at MongoDB. However, MongoDB supports only $inc/$dec, and not arbitrary functions for updating the documents.
Thanks!
PS -- I don't need transactions / ACID / BASE / ... all I really want is a simple document store that supports updating via functions rather than values.
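The update-by-function behavior described in this question can be sketched in a few lines of Python. This is a toy, not a database and not Clojure's agent (which applies updates asynchronously): Cell, send, and append are hypothetical names, and the update here is applied synchronously under a lock.

```python
import threading


class Cell:
    """Toy value holder that is updated by function rather than by value."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value

    def send(self, fn, *args):
        """Apply fn to the current value and store the result atomically."""
        with self._lock:
            self._value = fn(self._value, *args)
            return self._value

    def value(self):
        with self._lock:
            return self._value


def append(queue, item):
    """Pure update function: returns a new tuple with item at the end."""
    return queue + (item,)


queue = Cell(())
workers = [threading.Thread(target=queue.send, args=(append, i)) for i in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(sorted(queue.value()))  # all four items arrive; their order is not fixed
```

Because each caller sends a function instead of a precomputed value, no caller can overwrite another's append with a stale copy of the queue.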

MongoDB actually supports multiple different modifier operations (full list here), and while I can see that $inc/$dec alone might be limiting, a simple variant of the specific example you give of adding to a queue could be handled using $push or $addToSet (depending on whether you want duplicates to appear in your queue or not, as $addToSet will only add a value if it doesn't already exist).
For more complex cases (supporting compare-and-swap type semantics), you can look at the findAndModify command in MongoDB if you don't find something simple that fits your needs exactly.
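The difference between the two modifiers mentioned above can be shown on a plain Python dict. This only mimics the described semantics: push and add_to_set are hypothetical helpers, not MongoDB calls.

```python
def push(doc, field, value):
    """Mimics the described $push behavior: always append the value."""
    doc.setdefault(field, []).append(value)
    return doc


def add_to_set(doc, field, value):
    """Mimics the described $addToSet behavior: append only if absent."""
    items = doc.setdefault(field, [])
    if value not in items:
        items.append(value)
    return doc


with_duplicates = {"_id": "q1"}
push(with_duplicates, "queue", "job-a")
push(with_duplicates, "queue", "job-a")      # the duplicate is kept

without_duplicates = {"_id": "q2"}
add_to_set(without_duplicates, "queue", "job-a")
add_to_set(without_duplicates, "queue", "job-a")  # the duplicate is ignored
```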
