Apache Flink and Apache Pulsar - apache-flink

I am using Flink to read data from Apache Pulsar.
I have a partitioned topic in Pulsar with 8 partitions.
I produced 1000 messages in this topic, distributed across the 8 partitions.
I have 8 cores in my laptop, so I have 8 sub-tasks (by default parallelism = # of cores).
After running the code from Eclipse, I opened the Flink UI and found that some sub-tasks are not receiving any records (they are idle).
I expected all 8 sub-tasks to be utilized (that is, each sub-task mapped to one partition of my topic).
After restarting the job, I found that sometimes 3 sub-tasks are utilized and sometimes 4, while the remaining sub-tasks stay idle.
Could you please help clarify this behavior?
Also, how can I know whether there is a shuffle between sub-tasks or not?
My Code:
ConsumerConfigurationData<String> consumerConfigurationData = new ConsumerConfigurationData<>();
Set<String> topicsSet = new HashSet<>();
topicsSet.add("flink-08");
consumerConfigurationData.setTopicNames(topicsSet);
consumerConfigurationData.setSubscriptionName("my-sub0111");
consumerConfigurationData.setSubscriptionType(SubscriptionType.Key_Shared);
consumerConfigurationData.setConsumerName("consumer-01");
consumerConfigurationData.setSubscriptionInitialPosition(SubscriptionInitialPosition.Earliest);
PulsarSourceBuilder<String> builder = PulsarSourceBuilder.builder(new SimpleStringSchema()).pulsarAllConsumerConf(consumerConfigurationData).serviceUrl("pulsar://localhost:6650");
SourceFunction<String> src = builder.build();
DataStream<String> stream = env.addSource(src);
stream.print(" >>> ");

For the Pulsar question, I don't know enough to help. I recommend setting up a larger test and seeing how that turns out. Usually, you'd have more partitions than slots, and some slots would consume several partitions in a somewhat random fashion.
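To illustrate why a larger test with more partitions than slots is likely to look more balanced, here is a minimal simulation. It assumes each partition is handed to a uniformly random sub-task, which is only a stand-in for "somewhat random" and not the Pulsar connector's actual assignment logic; the class and method names are hypothetical.

```java
import java.util.Random;

final class PartitionSpreadSketch {
    // Assign each partition to a uniformly random sub-task (an assumption,
    // not the connector's real algorithm) and count sub-tasks left empty.
    static int idleSubtasks(int partitions, int subtasks, Random rng) {
        int[] load = new int[subtasks];
        for (int p = 0; p < partitions; p++) {
            load[rng.nextInt(subtasks)]++;
        }
        int idle = 0;
        for (int n : load) {
            if (n == 0) {
                idle++;
            }
        }
        return idle;
    }

    // Average number of idle sub-tasks over many simulated job starts.
    static double averageIdle(int partitions, int subtasks, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++) {
            total += idleSubtasks(partitions, subtasks, rng);
        }
        return (double) total / trials;
    }

    public static void main(String[] args) {
        System.out.printf("8 partitions, 8 sub-tasks:  %.2f idle on average%n",
                averageIdle(8, 8, 10_000, 42L));
        System.out.printf("64 partitions, 8 sub-tasks: %.2f idle on average%n",
                averageIdle(64, 8, 10_000, 42L));
    }
}
```

Under this assumption, idle sub-tasks are common when the number of partitions equals the number of sub-tasks and become rare once there are several times more partitions than sub-tasks.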
Also how can I know that there is a shuffle between sub-tasks or not?
The easiest way is to look at the topology in the Flink Web UI. There you should see the number of tasks and the channel types. You could post a screenshot if you want more details, but in this case nothing will be shuffled, since you only have a source and a sink.

Related

How tasks are exactly distributed among threads/task-slots in Apache-Flink

I am new to Flink. As part of my research, I am trying to figure out:
1- How exactly does Flink (I am using the DataSet API on a single machine) distribute tasks among the available threads/slots? Which algorithms or techniques are used?
2- Does Flink decide that task A will be assigned to thread 1 or thread 2, or does whichever thread is available execute the task?
I have already run some examples and used the Web UI to get some information, but I still don't know the answers for sure.
If someone could help, or knows any references that would give me more insight, I would appreciate it. Thanks a lot.
Update:
To offer more details and explain myself better:
first, the program is very simple:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(16);
DataSet<String> text = env.readTextFile(filePath);
DataSet<Tuple2<String, Integer>> wordTuples = text
.flatMap(new Tokenizer()).name("FlatMap Operation");
wordTuples.writeAsText("Path");
env.execute();
The first image shows information about the first task of my job. Each subtask gets 4 records, except the subtask with ID 0, which gets nothing, and the subtask with ID 13, which gets 8 records. Why is that happening? Who decides which subtask or slot does which job?
The second image shows the second task, which receives the data sent from the first task. The same subtasks are working, with the same number of records. Why is that?
So, my question again:
Why, in the first task, was only one slot used to read the whole 5 records? Who decides which slot does which job?
The next image shows the output. Why is subtask 14 the one with the doubled data, and not 13 as shown in the first and second images?
In case the structure of the data is important: the data I am testing on consists of 16 lines, each as follows:
My Name Is[choose a name]
Sorry for the long explanation

Apache Flink & Iceberg: Not able to process hundreds of RowData types

I have a Flink application that reads arbitrary Avro data, maps it to RowData, and uses several FlinkSink instances to write the data into Iceberg tables. By arbitrary data I mean that I have 100 types of Avro messages, all of them with a common property "tableName" but containing different columns. I would like to write each of these message types into a separate Iceberg table.
To do this I'm using side outputs: once my data is mapped to RowData, I use a ProcessFunction to write each message to a specific OutputTag.
Later on, with the data stream already processed, I loop over the different output tags, get the records using getSideOutput, and create a specific Iceberg sink for each of them. Something like:
final List<OutputTag<RowData>> tags = ... // list of all possible output tags
final SingleOutputStreamOperator<RowData> rowdata = stream
    .map(new ToRowDataMap()) // map the custom Avro POJO to RowData
    .uid("map-row-data")
    .name("Map to RowData")
    .process(new ProcessRecordFunction(tags)) // route each element to its specific OutputTag
    .uid("id-process-record")
    .name("Process Input records");
CatalogLoader catalogLoader = ...
Catalog catalog = ... // catalog used to load the target tables
String upsertField = ...
tags.forEach(tag -> {
    // side outputs are read from the operator that emitted them
    DataStream<RowData> outputStream = rowdata.getSideOutput(tag);
    String tableName = tag.getId(); // the tag ID doubles as the table name
    TableIdentifier identifier = TableIdentifier.of("myDBName", tableName);
    FlinkSink
        .forRowData(outputStream)
        .table(catalog.loadTable(identifier))
        .tableLoader(TableLoader.fromCatalog(catalogLoader, identifier))
        .set("upsert-enabled", "true")
        .uidPrefix("commiter-sink-" + tableName)
        .equalityFieldColumns(Collections.singletonList(upsertField))
        .append();
});
It works very well when I'm dealing with a few tables. But when the number of tables grows, Flink cannot acquire enough task resources, since each sink requires two different operators (because of the internals of https://iceberg.apache.org/javadoc/0.10.0/org/apache/iceberg/flink/sink/FlinkSink.html).
Is there a more efficient way of doing this, or maybe a way of optimizing it?
Thanks in advance ! :)
Given your question, I assume that about half of your operators are IcebergStreamWriter instances, which are fully utilised, and the other half are IcebergFilesCommitter instances, which are rarely used.
You can optimise the resource usage of the servers by:
Increasing the number of slots on the TaskManagers (taskmanager.numberOfTaskSlots) [1], so that the CPU not used by the idle IcebergFilesCommitter operators is used by the other operators on the TaskManager.
Increasing the resources provided to the TaskManagers (taskmanager.memory.process.size) [2]. This helps by distributing the JVM memory overhead across the operators running on the TaskManager (do not forget to increase the number of slots along with this change, so that the extra resources are actually used :) ).
A possible downside of adding more slots to the TaskManagers is that operators may compete for CPU, and the memory is still reserved for the "idle" tasks. [3]
Maybe this Flink architecture could be useful too [4]
I hope this helps,
Peter

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of each record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
    env.addSource(kafkaSource);
DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
    .process(new EnvelopeMapper(parameters))
    .addSink(kafkaSink);
With a parallelism of 1, the code runs fine. With a parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 @ 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record was sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource) // Line x
.map(statelessMapper1)
.flatMap(statelessMapper2);
messageStream.keyBy(Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") vs. right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the key groups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
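To make the key-group mismatch behind the exception concrete, here is a minimal sketch of how key groups map to parallel subtasks. The two formulas reflect my understanding of Flink's KeyGroupRangeAssignment; the class and method names below are made up for illustration, and the max parallelism of 128 is an assumption (it is consistent with the range 96..127 reported in the error message).

```java
final class KeyGroupRangeSketch {
    // Index (0-based) of the subtask that owns a given key group.
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Inclusive [start, end] range of key groups owned by one subtask.
    static int[] keyGroupRange(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumption, see the note above
        int parallelism = 4;      // parallelism used in the question
        for (int i = 0; i < parallelism; i++) {
            int[] range = keyGroupRange(maxParallelism, parallelism, i);
            System.out.println("subtask " + (i + 1) + "/" + parallelism
                    + " owns key groups " + range[0] + ".." + range[1]);
        }
        int owner = operatorIndexForKeyGroup(maxParallelism, parallelism, 85);
        System.out.println("key group 85 belongs to subtask " + (owner + 1) + "/" + parallelism);
    }
}
```

Under these assumptions, the subtask reported as (4/4) owns key groups 96 to 127, while key group 85 belongs to subtask 3/4. A record whose key hashes to key group 85 arrived at subtask 4/4 because of its Kafka partition, which is exactly what the exception reports.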

Which open-source CEP should I choose for distributed and pipelined processing: Siddhi, Flink, or Esper?

I am a little biased towards Siddhi CEP, as it has the Siddhi query language, but it uses Storm for distributed processing, and WSO2 provides a web interface/dashboard to create and deploy applications. I think it will give me less independence to enhance or use some features.
Flink, on the other hand, seems to be a good choice, but it requires a lot of code to implement even simple logic.
Is there a better option than these? I am confused.
What do you mean by less independence? You can use Siddhi 4.x [1] without depending on Storm by using its source and sink features to receive and send messages from one instance to another using TCP, Kafka, HTTP, etc.
WSO2 Stream Processor also uses the new version of Siddhi, and with its editor you can simulate events and also debug.
Update: From 4.1 [WSO2 Stream Processor][2] can run on top of Kafka in fully distributed mode. See https://docs.wso2.com/display/SP4xx/Fully+Distributed+Deployment.
[1] https://wso2.github.io/siddhi/
[2] https://wso2.com/analytics
I would do a test: create 10 queries in each system, something like:
select * from SomeEvent where value = 1
select * from SomeEvent where value = 2
...
select * from SomeEvent where value = 9
select * from SomeEvent where value = 10
The idea is to see how easy it is to create the queries, how the API or deploy steps work and how performance changes with the number of queries.
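As a minimal sketch of how such a batch of test queries could be generated, assuming the generic SQL-like syntax shown above (each engine has its own dialect, so the strings would need adapting); the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

final class QueryBatchSketch {
    // Build `count` filter queries over the given stream, one per value 1..count.
    static List<String> buildQueries(String stream, int count) {
        List<String> queries = new ArrayList<>();
        for (int value = 1; value <= count; value++) {
            queries.add("select * from " + stream + " where value = " + value);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Print the 10 queries; deploying them is engine-specific and not shown.
        for (String query : buildQueries("SomeEvent", 10)) {
            System.out.println(query);
        }
    }
}
```

Raising the count (for example to 100 or 1000) and repeating the measurement gives a rough picture of how each engine's performance changes with the number of queries.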

Solr 3.5 indexing taking very long

We recently migrated from Solr 3.1 to Solr 3.5. We have one master and one slave configured. The master has two cores:
1) Core1 – 44555972 documents
2) Core2 – 29419244 documents
We commit every 5000 documents, but lately the commit has been taking very long, more than 15 minutes in some cases. What could have caused this? I have checked the logs, and the only warning I can see is:
“WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.”
Memory details:
export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I also noticed that the top command shows almost 350 GB of virtual memory usage.
What could be causing this, as everything was running fine a few days back?
Do you have a large search warming query? Our commits take up to 2 minutes because of the search warming we have in place. I am wondering if that is the case here.
The large virtual memory usage would explain this.
