Are IF statements and loops such as while or do while atomic instructions in concurrent programming?
If not, is there a way to implement them atomically?
In Java, the only operations that are atomic without any extra work are assignments (to references and to primitives other than long and double). Anything else requires synchronisation, either by declaring a method synchronized or by using a synchronized block. You can also use the classes from java.util.concurrent - some of them use cleverer mechanisms to ensure synchronisation than simply declaring a method synchronized, which tends to be slow.
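To illustrate (a minimal sketch; the class and method names are just for demonstration), even a simple increment like plain++ is a read-modify-write sequence and therefore not atomic; the classes in java.util.concurrent.atomic give you a lock-free alternative to a synchronized block:
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    private int plain = 0;                           // plain++ is NOT atomic: read, add, write
    private final AtomicInteger atomic = new AtomicInteger();

    public synchronized void incrementPlain() {      // safe only because of synchronized
        plain++;
    }

    public void incrementAtomic() {                  // atomic without a lock (CAS loop inside)
        atomic.incrementAndGet();
    }
}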
Regarding the if statement, and the question you asked in the comment about the comparison n == m:
Comparison is not atomic. First the value of n is loaded (at this point m can still change), then the value of m is loaded, and only then is the comparison evaluated (by which time the actual values of both n and m may already differ from the ones being compared).
If you want it synchronised you would have to do something like this:
public class Test {
    private static final Object lock = new Object();

    public static void main(String[] args) {
        if (equals(1, 2)) {
            // do something (not synchronised)
        }
    }

    public static boolean equals(int n, int m) {
        synchronized (lock) {
            return n == m;
        }
    }
}
However, this raises the question: why do you want to do this, and what should the lock be (and with which threads is it shared)? I would like to see some more context about your problem, because currently I cannot see any reason for doing something like this.
You should also remember that:
you cannot lock on primitives (you would have to declare both values as Integer)
locking on null will result in NullPointerException
a lock is acquired on the object, not on the variable. Because Integers in Java are immutable, assigning a new value to a field makes it refer to a different object, and thus a different lock - see the code below. Thread t1 acquires a lock on the Integer 1 while t2 acquires a lock on the Integer 2. So even though both threads synchronize on n, they can still run in parallel.
public class Test {
    private static Integer n = 1;

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            synchronized (n) {
                System.out.println("thread 1 started");
                sleep(2000);
                System.out.println("thread 1 finished");
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (n) {
                System.out.println("thread 2 started");
                sleep(2000);
                System.out.println("thread 2 finished");
            }
        });
        t1.start();
        sleep(1000);
        n = 2; // n now refers to a different Integer object, i.e. a different lock
        t2.start();
        t1.join();
        t2.join();
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
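A common fix, sketched below on top of the example above, is to lock on a dedicated final object that is never reassigned, so every thread contends on the same monitor no matter how n changes:
public class Test {
    // Dedicated lock object: final and never reassigned, so every thread
    // synchronizes on the same monitor regardless of n's value.
    private static final Object lock = new Object();
    private static Integer n = 1;

    public static void main(String[] args) {
        Runnable task = () -> {
            synchronized (lock) {
                System.out.println(Thread.currentThread().getName() + " holds the lock, n = " + n);
            }
        };
        new Thread(task, "t1").start();
        n = 2; // reassigning n no longer affects which lock is used
        new Thread(task, "t2").start();
    }
}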
Have you considered using mutable AtomicInteger?
If statements and loop conditions can have arbitrarily large and complex (that is, non-atomic) boolean expressions that need to be evaluated. One way to prevent race conditions involving them (if that is what you mean by "appease") is to use some kind of locking mechanism.
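For simple flag-style conditions, an AtomicInteger can make such a check-then-act atomic without a lock. Here is a small sketch (the names are illustrative; arbitrarily complex conditions would still need a lock):
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    private final AtomicInteger state = new AtomicInteger(0);

    // Atomically performs "if (state == 0) state = 1" as a single step,
    // so no other thread can interleave between the check and the update.
    public boolean tryActivate() {
        return state.compareAndSet(0, 1);
    }
}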
We are migrating a Spark job to Flink. We used pre-shuffle aggregation in Spark; is there a way to perform a similar operation in Flink? We are consuming data from Apache Kafka and using a keyed tumbling window to aggregate the data. We want to aggregate the data in Flink before performing the shuffle.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Yes, it is possible, and I will describe three ways: first, the one already built into the Flink Table API; second, building your own pre-aggregate operator; and third, a dynamic pre-aggregate operator that adjusts the number of events to pre-aggregate before the shuffle phase.
Flink Table API
As shown here, you can do MiniBatch Aggregation or Local-Global Aggregation; the second option is better. You basically tell Flink to create mini-batches of up to 5000 events (flushed at least every 5 seconds) and pre-aggregate them before the shuffle phase.
// instantiate table environment
TableEnvironment tEnv = ...
// access flink configuration
Configuration configuration = tEnv.getConfig().getConfiguration();
// set low-level key-value options
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "5 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE");
Flink Stream API
This way is more cumbersome, because you have to create your own operator using OneInputStreamOperator and call it using doTransform(). Here is an example based on the BundleOperator.
public abstract class AbstractMapStreamBundleOperator<K, V, IN, OUT>
        extends AbstractUdfStreamOperator<OUT, MapBundleFunction<K, V, IN, OUT>>
        implements OneInputStreamOperator<IN, OUT>, BundleTriggerCallback {

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // get the key and value for the map bundle
        final IN input = element.getValue();
        final K bundleKey = getKey(input);
        final V bundleValue = this.bundle.get(bundleKey);

        // get a new value after adding this element to bundle
        final V newBundleValue = userFunction.addInput(bundleValue, input);

        // update to map bundle
        bundle.put(bundleKey, newBundleValue);

        numOfElements++;
        bundleTrigger.onElement(input);
    }

    @Override
    public void finishBundle() throws Exception {
        if (!bundle.isEmpty()) {
            numOfElements = 0;
            userFunction.finishBundle(bundle, collector);
            bundle.clear();
        }
        bundleTrigger.reset();
    }
}
The callback interface defines when the pre-aggregation is triggered. Every time the stream reaches the bundle limit at if (count >= maxCount), your pre-aggregate operator will emit events to the shuffle phase.
public class CountBundleTrigger<T> implements BundleTrigger<T> {
    private final long maxCount;
    private transient BundleTriggerCallback callback;
    private transient long count = 0;

    public CountBundleTrigger(long maxCount) {
        Preconditions.checkArgument(maxCount > 0, "maxCount must be greater than 0");
        this.maxCount = maxCount;
    }

    @Override
    public void registerCallback(BundleTriggerCallback callback) {
        this.callback = Preconditions.checkNotNull(callback, "callback is null");
    }

    @Override
    public void onElement(T element) throws Exception {
        count++;
        if (count >= maxCount) {
            callback.finishBundle();
            reset();
        }
    }

    @Override
    public void reset() {
        count = 0;
    }
}
Then you call your operator using doTransform():
myStream.map(....)
        .doTransform(metricCombiner, info, new RichMapStreamBundleOperator<>(myMapBundleFunction, bundleTrigger, keyBundleSelector))
        .map(...)
        .keyBy(...)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
A dynamic pre-aggregation
In case you wish to have a dynamic pre-aggregate operator, check the AdCom - Adaptive Combiner for stream aggregation. It basically adjusts the pre-aggregation based on backpressure signals, making the best possible use of the phase before the shuffle.
I have written a small test case in Flink to sort a DataStream. The code is as follows:
public enum StreamSortTest {
    ;

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Long, Long, Integer, TimeWindow> {
        @Override
        public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
            List<Long> sortedList = new ArrayList<>();
            for (Long i : input) {
                sortedList.add(i);
            }
            Collections.sort(sortedList);
            sortedList.forEach(l -> out.collect(l));
        }
    }

    public static void main(final String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);

        // range partition the stream into two parts based on data value
        DataStream<Long> sortOutput =
                probeSource
                        .keyBy(x -> {
                            if (x < 250) {
                                return 1;
                            } else {
                                return 2;
                            }
                        })
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
                        .process(new MyProcessWindowFunction());

        sortOutput.print();
        System.out.println(env.getExecutionPlan());
        env.executeAsync();
    }
}
However, the code just outputs the execution plan and a few other lines; it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with very short input data, which will surely be processed in less than 20 seconds. Flink is able to detect the end of input (in the case of a stream from a file, or a sequence as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers. It doesn't do the same thing for processing-time computations, so in your case you need to ensure that Flink actually runs long enough for the window to close, or switch to a custom trigger or a different time characteristic.
One other thing I am not sure about, since I have never used it that much, is whether you should use executeAsync() for local execution, since according to the docs here it is basically meant for situations where you don't want to wait for the result of the job.
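For example, one way to make the windows fire on this bounded input is to switch to event time. The sketch below is a variation of the code above; it uses the element value itself as a synthetic timestamp (purely for illustration) and relies on the final Long.MAX_VALUE watermark that Flink emits when the sequence is exhausted:
// Sketch: event-time windows close when the bounded source ends, because
// Flink emits a final Long.MAX_VALUE watermark. The timestamps are synthetic.
DataStream<Long> sortOutput =
        probeSource
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Long>forMonotonousTimestamps()
                                .withTimestampAssigner((value, ts) -> value))
                .keyBy(x -> x < 250 ? 1 : 2)
                .window(TumblingEventTimeWindows.of(Time.seconds(20)))
                .process(new MyProcessWindowFunction());
sortOutput.print();
env.execute(); // unlike executeAsync(), execute() blocks until the job finishes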
I need to use the same HashMap from different threads (the EDT, a Timer, and maybe the network thread). Since Codename One doesn't have a ConcurrentHashMap implementation in its API, I tried to circumvent this issue like so:
/**
 * ConcurrentHashMap emulation for Codename One.
 */
public class ConcurrentHashMap<K, V> {
    private static final EasyThread thread = EasyThread.start("concurrentHashMap_Thread");
    private final Map<K, V> map = new HashMap<>();

    public V put(K key, V value) {
        return thread.run(() -> map.put(key, value));
    }

    public V get(K key) {
        return thread.run(() -> map.get(key));
    }

    public V remove(K key) {
        return thread.run(() -> map.remove(key));
    }
}
I have two questions:
Is code like the above actually necessary in Codename One when accessing a Map from two or more threads, or is it better to use a HashMap directly?
What do you think about that code? I tried to keep it as simple as possible, but I'm not sure of its correctness.
That code is correct, but it's also very inefficient: you're creating locks and hopping to a different thread just to add a new object to a map. You're effectively bringing a tank to a problem that's far simpler.
The TL;DR is this:
map = Collections.synchronizedMap(map);
Your code could have been written like this and would have been more efficient:
public class ConcurrentHashMap<K, V> {
    private final Map<K, V> map = new HashMap<>();

    public synchronized V put(K key, V value) {
        return map.put(key, value);
    }

    public synchronized V get(K key) {
        return map.get(key);
    }

    public synchronized V remove(K key) {
        return map.remove(key);
    }
}
But you obviously don't need it since we have the synchronizedMap method...
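For completeness, here is a minimal usage sketch of the synchronizedMap approach; note the standard JDK caveat that iteration still requires manual synchronization on the wrapper:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SyncMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = Collections.synchronizedMap(new HashMap<>());
        map.put("a", 1); // individual calls are synchronized by the wrapper

        // Iteration must hold the wrapper's lock, per the JDK documentation.
        synchronized (map) {
            for (Map.Entry<String, Integer> e : map.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}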
So why don't we have ConcurrentHashMap?
Because we don't run on huge servers. See this article: https://crunchify.com/hashmap-vs-concurrenthashmap-vs-synchronizedmap-how-a-hashmap-can-be-synchronized-in-java/
The TL;DR of that article is that ConcurrentHashMap makes sense when you have a lot of threads and a lot of server cores; it goes the extra mile to limit synchronization for maximum scale. On mobile, that's huge overkill.
I'm new to Flink and I work with the DataSet API. After a whole bunch of processing, as the last stage I need to normalize one of the values by dividing it by its maximum value. So I have used the .max() operator to take the max, and later I pass the result as a constructor argument to the MapFunction.
This works; however, all the processing is performed twice: one job is executed to find the max values, and then another job is executed to create the final result (starting execution from the beginning). Is there any workaround to execute the whole dataflow only once?
final List<Tuple6<...>> maxValues = result.max(2).collect();
assert maxValues.size() == 1;
result.map(new NormalizeAttributes(maxValues.get(0))).writeAsCsv(...);

@FunctionAnnotation.ForwardedFields("f0; f1; f3; f4; f5")
@FunctionAnnotation.ReadFields("f2")
private static class NormalizeAttributes implements MapFunction<Tuple6<...>, Tuple6<...>> {

    private final Tuple6<...> maxValues;

    public NormalizeAttributes(Tuple6<...> maxValues) {
        this.maxValues = maxValues;
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
collect() immediately triggers an execution of the program up to the dataset requested by collect(). If you later call env.execute() or collect() again, the program is executed a second time.
Besides that side effect, using collect() to distribute values to a subsequent transformation has the additional drawback that data is transferred to the client and later back into the cluster. Flink offers so-called broadcast variables to ship a DataSet as a side input into another transformation.
Using broadcast variables, your program would look as follows:
DataSet<Tuple6<...>> maxValues = result.max(2);

result
    .map(new NormAttrs()).withBroadcastSet(maxValues, "maxValues")
    .writeAsCsv(...);
The NormAttrs function would look like this:
private static class NormAttrs extends RichMapFunction<Tuple6<...>, Tuple6<...>> {

    private Tuple6<...> maxValues;

    @Override
    public void open(Configuration config) {
        // the broadcast data set contains exactly one element: the max tuple
        maxValues = (Tuple6<...>) getRuntimeContext().getBroadcastVariable("maxValues").get(0);
    }

    @Override
    public Tuple6<...> map(Tuple6<...> value) throws Exception {
        value.f2 /= maxValues.f2;
        return value;
    }
}
You can find more information about Broadcast variables in the documentation.
I have a class that returns an IEnumerable<Task>, and I then execute these tasks in order. Let's say the class is TaskProvider.
public class TaskProvider {
    public IEnumerable<Task> SomeThingsToDo() { return work; }
}
I am executing these with the following:
public void ExecuteTasks(IEnumerable<Task> tasks)
{
    var enumerator = tasks.GetEnumerator();
    ExecuteNextTask(enumerator);
}

static void ExecuteNextTask(IEnumerator<Task> enumerator)
{
    bool moveNextSucceeded = enumerator.MoveNext();
    if (!moveNextSucceeded) return;

    enumerator
        .Current
        .ContinueWith(x => ExecuteNextTask(enumerator));
}
Now I have a situation where I might have multiple instances of TaskProvider, each generating a list of tasks. I want each list of tasks to be executed in order, meaning that all the tasks from one provider finish before the next one starts.
Then, most importantly, I need to know when all the tasks are completed.
What's the TPL way of accomplishing this?
(FWIW, I'm using the Async CTP for Silverlight.)
Here's the approach I took, and so far all my tests are passing.
First, I created a unioned enumerable of all the tasks from the various providers:
var tasks = from provider in providers
            from task in provider.SomeThingsToDo()
            select task;
I believe part of my original problem was that I did a ToList (more or less), which began executing the tasks prematurely.
Next, I added a callback to ExecuteTasks and ExecuteNextTask. Admittedly, not as clean as I'd hoped. Here's the revised implementation:
public void ExecuteTasks(IEnumerable<Task> tasks, Action callback)
{
    var enumerator = tasks.GetEnumerator();
    ExecuteNextTask(enumerator, callback);
}

static void ExecuteNextTask(IEnumerator<Task> enumerator, Action callback)
{
    bool moveNextSucceeded = enumerator.MoveNext();
    if (!moveNextSucceeded)
    {
        if (callback != null) callback();
        return;
    }

    enumerator
        .Current
        .ContinueWith(x => ExecuteNextTask(enumerator, callback));
}
I didn't need a thread-safe structure for storing the list of tasks, because the list is generated only once.
At worst, you could have a static ConcurrentQueue of IEnumerables that your ExecuteNextTask method works its way through...
something like:
public static class ExecuteController
{
    private static readonly ConcurrentQueue<IEnumerable<Task>> TaskLists = new ConcurrentQueue<IEnumerable<Task>>();

    public static void ExecuteTaskList(IEnumerable<Task> tasks)
    {
        TaskLists.Enqueue(tasks);
        TryStartExec();
    }

    public static void TryStartExec()
    {
        // TryDequeue is atomic on ConcurrentQueue, so no extra lock is needed here.
        IEnumerable<Task> next;
        if (TaskLists.TryDequeue(out next))
        {
            // Execute the dequeued list with the ExecuteTasks code above,
            // chaining back into TryStartExec once the list completes.
            ExecuteTasks(next, TryStartExec);
        }
    }
}