I'm not sure how to draw precedence graph of this schedule.
Not sure whether to take into consideration the begnning and commit of a transaction or not.
But what i concluded that this schedule can not be serilizabled (there doesn't exist a schedule whos serial equivlent to the schedule above.)
Correct me.
Precedence Graph:
Yes this schedule is conflict-serializable as by drawing the precedence graph, there is no cycle.
An equivalent serial schedule would be a path where you can visit every node once, which would be T2 -> T3 -> T1 -> T4 in this case.
Related
Suppose I have a flink pipeline as such:
kafka_source -> maps/filters/keyBy/timewindow(1 minute) -> sinkCassandra
By the time the grouped messages hit the sinkCassandra operation, am I guaranteed that no other slots won't also concurrently run the maps/filters/keyBy/timewindow(1 minute) part of the pipeline?
Or is it possible to have some other slot run the middle pipeline while another set is running the sinkCassandra operation?
EDIT ( Added more requirements based on comment conversation ):
What I'm trying to do is effectively do a lookup based on flink data key from the datastore, and do an update and flush the updated data back.
The reason why I'm dodging using kafka_source -> maps/filters -> keyBy/TimeWindow/statefulReduce -> sinkCassandra is because the state can potentially get huge ( 1 day to 7 days where I can place 7 days as the max time bounding ) and I don't necessarily know the time window for each key. This would mean a HUGE state even with rocksdb.
Another potential option that I'm looking at is kafka_source -> maps/filters -> keyBy/sinkCass where within the custom sink operation, I would first check in some sort of in-memory buffer if I have the key that I want to update. If not, I go ahead and fetch from Cassandra. Every 5 seconds ( or every N seconds ), I would grab whatever's in the buffer and flush into Cassandra. To limit memory, I can do an in-memory least recently used hashmap ( I don't necessarily want to flush b/c multiple keys will show up again! )
Unless you have explicitly configured something unusual, each slot will contain one parallel slice of the complete pipeline -- each slot will have a kafka source instance connected to a disjoint subset of the kafka partitions, as well as the maps/filters/keyBy/window, and the cassandra sink.
All of those parallel sub-pipelines (slots) will be running concurrently. Furthermore, within each slot, each of the operators will also be running concurrently. The sink and the middle part of your pipeline are already running concurrently, but they are competing for the resources of the slot that contains them both. You can configure your task managers to have more cores per slot if you are concerned about starvation.
EDiT (responding to add'l info about requirements):
You can safely assume that for any given flink data key, after a keyBy, only one instance of each operator will process events for that key. That principle is fundamental to Flink's design. If I understand correctly what you are contemplating, that's the only guarantee you need.
I basically understand what Serial and Seralizable Schedules are but the question I want to answer is
Briefly explain the terms, Serializable Schedule, Serial Schedule and
Equivalent Schedule
What is the difference between a Serealizable Schedule and an Equivalent Schedule?
Anything to do with the order of Transactions? or the values writing in to them?
Serial Schedule
One process at a time
Does not interleave
Equivalent Schedule
When the effect of executing different schedules are the same
Serializable Schedule
A schedule that is equivalent to some Serial Values
I am reading in google about the conflict serializability and serializable.
But I am not getting the correct definition and the difference between serializable and conflict serializability.
I am getting only one thing.That is Conflict serializability implies serializable.
In many things they told most over the serializable and conflict serializable are same.
Can anyone please explain what is conflict serializable and difference between serializability and serializable with examples.
Thanks for advance !
I was found a answer for my question.
Serializable means the transaction is done in serial manner. That means if the scheduling is done, but the transactions are not use the same variable for read and write.
Example:-
T1 T2
Read(X)
Read(y)
Write(X)
Write(Y)
In this example, the two transactions are does not use the shared variable.
So, in here there is no conflict.
Conflict serializability means the transactions in done in concurrently. The two transactions are used the same variable, the output of the transaction is conflict.
Example:-
T1 T2
Read(X)
Read(X)
Write(X)
Write(X)
Read(Y)
Write(Y)
Read(Y)
Write(Y)
In this example the two transaction T1, and T2 uses the same variable.
So, the transaction T2 writes the X before T1 write. After the T1 writes the X. In here there is no use of transaction T2 writes. This is conflict serializability.
To answer your question, I'll first explain few terminologies, quoting a line from the Operating Systems book by Galvin.
Serial Schedule : A schedule in which each transaction is executed atomically is called a serial schedule just like it's shown in Premraj's answer.
Likewise, when we allow the transactions to overlap their execution,
i.e they aren't atomic any longer, they're called Non Serial
Schedule.
Conflicting Operations : This helps to see whether a Non Serial schedule is(not) equivalent to the one represented by Serial Schedule. Two consecutive operations are said to be in conflict if they access the same data item and at least one of them is write operation.
We try to find out and swap all the Non Conflicting Operations, and if the given Non Serial Schedule can be transformed to a Serial Schedule, we say the given schedule was Conflict Serializable.
Schedule : is a bundle of transactions.
Serializability is a property of a transaction schedule. It relates to the isolation property of a database transaction. Serializability of a schedule means equivalence to a serial schedule.
Conflict Serializable can occur on Non-Serializable Schedule on following 3 conditions:
They must belong to different transactions.
They must operate on same value
at least one of the should have write operation.
I am newbie in Apache Giraph. My question is related to Giraph graph partitioning. As far as I know, Giraph partition the large graph randomly.... possibly #partitions>#workers in order to load balance. But, my question is, is #partitions/worker always an integer? Saying in the ther way, Can it happen, that a partition (say p1) resides partially in worker w1 and worker w2? Or, should p1 be either in w1 or w2 at entirety?
Partition in Giraph refers to vertex partition not graph-partitions. For example, if a graph has 10 vertices numbered from 1 to 10 then a possible partition would be {1,2. 3}, {4,5,6}, {7,8,9,10}. Each partition knows where its outgoing edges are pointing. Each worker creates threads for each partition which is assigned to it. The thread iterates over each vertex in the partition and executes compute function.
So with this information I would say a partition has to reside on a single worker entirely.
Hello #zahorak,
If Giraph implemented Pregel as it is, then as per the Pregel paper it is not necessary to have #partitions == #workers. It says,
The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. Having more than one partition per worker allows parallelism among the partitions and better load balancing, and will usually improve performance.
UPDATE: I found the similar question on Giraph user mailing list. The answers given in replies might be helpful. Here is the link to the thread - https://www.mail-archive.com/user#giraph.apache.org/msg01869.html
AFAIK no, actually I would have said, #partitions == #workers
The reason for partitioning is to handle parts of the graph on one server. After the superstep is executed messages sent to other partitions are exchanged between the servers within a cluster.
Maybe you understand something else under the term partitioning as me, but for me partitioning means:
Giraph is on a cluster with multiple servers, in order to laverage all servers, it needs to partition the graph. And than in simply assings randomly a node to one of the n servers. Out of this you get n partitions and nodes within each partition are executed by the one server they were assigned to, no other.
In database theory, what is the difference between "conflict serializable" and "conflict equivalent"?
My textbook has a section on conflict serializable but glosses over conflict equivalence. These are probably both concepts I am familiar with, but I am not familiar with the terminology, so I am looking for an explanation.
Conflict in DBMS can be defined as two or more different transactions accessing the same variable and atleast one of them is a write operation.
For example:
T1: Read(X)
T2: Read (X)
In this case there's no conflict because both transactions are performing just read operations.
But in the following case:
T1: Read(X)
T2: Write(X)
there's a conflict.
Lets say we have a schedule S, and we can reorder the instructions in them. and create 2 more schedules S1 and S2.
Conflict equivalent: Refers to the schedules S1 and S2 where they maintain the ordering of the conflicting instructions in both of the schedules. For example, if T1 has to read X before T2 writes X in S1, then it should be the same in S2 also. (Ordering should be maintained only for the conflicting operations).
Conflict Serializability: S is said to be conflict serializable if it is conflict equivalent to a serial schedule (i.e., where the transactions are executed one after the other).
From Wikipedia.
Conflict-equivalence
The schedules S1 and S2 are said to be conflict-equivalent if the following conditions are satisfied:
Both schedules S1 and S2 involve the same set of transactions (including ordering of actions within each transaction).
The order of each pair of conflicting actions in S1 and S2 are the same.
Conflict-serializable
A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.
Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if its precedence graph/serializability graph, when only committed transactions are considered, is acyclic (if the graph is defined to include also uncommitted transactions, then cycles involving uncommitted transactions may occur without conflict serializability violation).
Just two terms to describe one thing in different ways.
Conflict equivalent: you need to say Schedule A is conflict equivalent to Schedule B. it must involve two schedules
Conflict serializable: Still use Schedule A and B. we can say Schedule A is conflict serializable. Schedule B is conflict serializable.
We didn't say Schedule A/B is conflict equivalent
We didn't say Schedule A is conflict serializable to Schedule B
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
Conflict serializable means conflict equuivalent to any serial schedule.
Conflict Equivalent Schedules: if a Schedule S can be transformed into a schedule S' by a series of swaps of non conflicting instructions, we say that schedule S & S' are conflict equivalent.
Conflict Serializable Schedule: Schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
Definitions have already been explained perfectly, but I feel this will be very useful to some.
I've developed a small console program (on github) which can test any schedule for conflict serializability and will also draw a precedence graph.
If there is at least one conflict equivalent schedule for considered transaction schedule, it is conflict serializable.