I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of each record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() rather than keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used for partitioning in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
    env.addSource(kafkaSource);
DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
    .process(new EnvelopeMapper(parameters))
    .addSink(kafkaSink);
With a parallelism of 1, the code runs fine. With a parallelism greater than 1 (e.g. 4), I run into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record was sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource) // Line x
.map(statelessMapper1)
.flatMap(statelessMapper2);
messageStream.keyBy(Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") versus right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
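The error in the question can be reproduced with plain arithmetic. The following is a minimal sketch, assuming the default max parallelism of 128 (which is what the KeyGroupRange in the error message suggests) and the integer formulas that, as far as I know, Flink's KeyGroupRangeAssignment uses to map key groups to parallel subtasks. The class and method names here are made up for illustration and are not Flink API; the hashing step that turns a key into a key group is deliberately left out.

```java
// Hypothetical helper, not part of Flink: it mirrors (as an assumption) the
// integer arithmetic Flink uses to assign key groups to parallel subtasks.
final class KeyGroupSketch {

    // Index (0-based) of the subtask that owns the given key group.
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Inclusive [start, end] range of key groups owned by one subtask.
    static int[] keyGroupRangeForOperatorIndex(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed default for small parallelism
        int parallelism = 4;      // parallelism from the question

        for (int i = 0; i < parallelism; i++) {
            int[] range = keyGroupRangeForOperatorIndex(maxParallelism, parallelism, i);
            System.out.println("subtask " + (i + 1) + "/" + parallelism
                    + " owns key groups " + range[0] + ".." + range[1]);
        }

        // Key group 85 from the error message belongs to subtask index 2 (3/4),
        // yet the record arrived at subtask 4/4, which owns 96..127.
        int owner = operatorIndexForKeyGroup(maxParallelism, parallelism, 85);
        System.out.println("key group 85 belongs to subtask " + (owner + 1) + "/" + parallelism);
    }
}
```

Under these assumptions, subtask 4/4 owns key groups 96 to 127, so a record whose key hashes to key group 85 fails the check there. Which Kafka partition a key lands in is decided by a different hash and a different partition count (24 here), so there is no reason for the two assignments to agree.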
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the key groups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
The rowtime column is the event time extracted directly from the original eventtime column, with a periodic bounded watermark of 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I get no output when running Flink 1.11.1 (and none with 1.10.1 either).
What's more, if I change the pattern to only password_change, it still outputs nothing, but if I change the pattern to transfer, it outputs several rows, though not all transfer rows. If I swap the event times of the two tables so that the password_change events happen first, then the pattern password_change outputs several rows while transfer outputs none.
On the other hand, if I extract those columns from the two tables, merge them into one table manually, and then emit them into Flink, the result is correct.
I searched and tried a lot to get it right, including changing the SQL statement, watermark, buffer timeout and so on, but nothing helped. I hope someone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thanks to @Dawid Wysakowicz for the answer. To confirm it, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of the transfer table, and then the output becomes correct, which means it really is a problem with watermarks.
So now the question is how to fix it. Since users do not change their passwords frequently, a time gap between these two tables is unavoidable. I just need the UNION ALL table to behave the same way as the table I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables, including how you define the time attributes and which connectors you use? That would let me confirm my suspicions.
I think the problem is that one of the sources stops emitting watermarks. If the transfer table (or whichever table has the lower timestamps) does not finish but produces no more records, it emits no further watermarks. After emitting the fourth row it will emit Watermark = 3 (4 - 1 second). The watermark of a union of inputs is the smallest of the inputs' watermarks. Therefore the first table holds the combined watermark at Watermark = 3, and thus you see no progress for the original query, while you do see some records emitted for the table with smaller timestamps.
If you manually merge the two tables, you have just a single input with a single source of watermarks, and thus it progresses further and you see some results.
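The reasoning above can be checked with a small simulation. This is a rough sketch in plain Java, not Flink code: the class and method names are hypothetical, timestamps are reduced to the seconds part of the sample data, and each input's watermark is taken as its largest timestamp minus the 1 second bound, following the simplified arithmetic above. The idle flag models, as an assumption, what marking a source as idle is meant to achieve: an idle input is ignored when the minimum is taken.

```java
// Hypothetical simulation, not Flink API: shows how the slowest input of a
// union holds back the combined watermark, and what ignoring an idle input changes.
final class WatermarkUnionSketch {

    // Watermark of one input: largest timestamp seen minus the allowed lateness.
    static long watermarkOf(long[] timestamps, long boundSeconds) {
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            max = Math.max(max, ts);
        }
        return max - boundSeconds;
    }

    // Combined watermark of a union: minimum over all inputs not marked idle.
    // Returns Long.MIN_VALUE if every input is idle (a simplification).
    static long combinedWatermark(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (!idle[i]) {
                min = Math.min(min, watermarks[i]);
                anyActive = true;
            }
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] transfer = {1, 2, 3, 4};        // seconds from the transfer table
        long[] passwordChange = {5, 6, 8, 9};  // seconds from the password_change table

        long wmTransfer = watermarkOf(transfer, 1);       // 3
        long wmPassword = watermarkOf(passwordChange, 1); // 8

        // Both inputs active: the union is stuck at the transfer watermark.
        System.out.println(combinedWatermark(
                new long[] {wmTransfer, wmPassword}, new boolean[] {false, false}));

        // Transfer input treated as idle: the union follows password_change.
        System.out.println(combinedWatermark(
                new long[] {wmTransfer, wmPassword}, new boolean[] {true, false}));
    }
}
```

With both inputs active the combined watermark stays at 3, which is below the password_change timestamps (5 and later) that a match needs. Once the transfer input is ignored as idle, or once it receives a later row such as the one at second 10, the combined watermark moves to 8.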
I was trying to run the Benerator to populate database (shop demo to fill database schemas based on a setup file). While running the following,
I am getting the below error.
15:25:50,232 INFO (main) [DefaultDBSystem] Fetching table details and ordering tables by dependency
15:25:50,554 ERROR (main) [DescriptorRunner] Error in Benerator execution
org.databene.commons.ConfigurationError: Catalog 'myDB' not found in database 'db'
at org.databene.platform.db.DBSystem.findTableInConfiguredCatalogAndSchema(DBSystem.java:819)
at org.databene.platform.db.DBSystem.getTable(DBSystem.java:791)
at org.databene.platform.db.DBSystem.getWriteColumnInfos(DBSystem.java:744)
at org.databene.platform.db.DBSystem.persistOrUpdate(DBSystem.java:831)
at org.databene.platform.db.DBSystem.store(DBSystem.java:360)
at org.databene.benerator.storage.StorageSystemInserter.startProductConsumption(StorageSystemInserter.java:53)
at org.databene.benerator.consumer.AbstractConsumer.startConsuming(AbstractConsumer.java:47)
at org.databene.benerator.consumer.ConsumerProxy.startConsuming(ConsumerProxy.java:62)
at org.databene.benerator.engine.statement.ConsumptionStatement.execute(ConsumptionStatement.java:53)
at org.databene.benerator.engine.statement.GenerateAndConsumeTask.execute(GenerateAndConsumeTask.java:159)
at org.databene.task.TaskProxy.execute(TaskProxy.java:59)
at org.databene.task.StateTrackingTaskProxy.execute(StateTrackingTaskProxy.java:52)
at org.databene.task.TaskExecutor.runWithoutPage(TaskExecutor.java:136)
at org.databene.task.TaskExecutor.runPage(TaskExecutor.java:126)
at org.databene.task.TaskExecutor.run(TaskExecutor.java:101)
at org.databene.task.TaskExecutor.run(TaskExecutor.java:77)
at org.databene.task.TaskExecutor.execute(TaskExecutor.java:71)
at org.databene.benerator.engine.statement.GenerateOrIterateStatement.executeTask(GenerateOrIterateStatement.java:156)
at org.databene.benerator.engine.statement.GenerateOrIterateStatement.execute(GenerateOrIterateStatement.java:99)
at org.databene.benerator.engine.statement.LazyStatement.execute(LazyStatement.java:58)
at org.databene.benerator.engine.statement.StatementProxy.execute(StatementProxy.java:46)
at org.databene.benerator.engine.statement.TimedGeneratorStatement.execute(TimedGeneratorStatement.java:70)
at org.databene.benerator.engine.statement.SequentialStatement.executeSubStatements(SequentialStatement.java:52)
at org.databene.benerator.engine.statement.SequentialStatement.execute(SequentialStatement.java:47)
at org.databene.benerator.engine.BeneratorRootStatement.execute(BeneratorRootStatement.java:63)
at org.databene.benerator.engine.DescriptorRunner.execute(DescriptorRunner.java:127)
at org.databene.benerator.engine.DescriptorRunner.runWithoutShutdownHook(DescriptorRunner.java:109)
at org.databene.benerator.engine.DescriptorRunner.run(DescriptorRunner.java:102)
at org.databene.benerator.main.Benerator.runFile(Benerator.java:94)
at org.databene.benerator.main.Benerator.runFromCommandLine(Benerator.java:75)
at org.databene.benerator.main.Benerator.main(Benerator.java:68)
15:25:50,611 INFO (main) [CachingDBImporter] Exporting Database meta data of ___temp to cache file
15:25:50,635 INFO (main) [CONFIG] Max. committed heap size: 15 MB
Inside my 'db' folder, I have the file user.ben.xml, which starts with:
<database id="db" url="jdbc:oracle:thin:@localhost:1521:mirev" driver="oracle.jdbc.driver.OracleDriver" user="myDB" tableFilter="DB_.*" />
I am new to Benerator. Could anyone please tell me why this error is thrown?
By default, Oracle DB does not support 'Catalog'. Make sure your DB has a catalog enabled and defined. If not, remove the catalog from your configuration.
I tried the same today...
It seems the Oracle user/schema (= catalog in JDBC terms) needs to be alphabetically first to make the example work. I created a user 'A1000' to make the example work.
I am using Hive 0.14 and HBase 0.98.8.
I would like to use HiveQL to access an HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that do not use the rowkey are fine, and using the HBase shell with filters also works.
I can't find anything in the logs, but I assume there could be a performance issue with complex composite keys.
Has anybody faced the same issue? Any hints on how to solve it, or other ideas I could try?
Thank you
Update 16.07.15:
I changed the log4j properties to 'DEBUG' and found some interesting information:
It says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is that HiveQL over HBase does not push the predicate down to HBase but rather starts a MapReduce job.
Could there be a bug in the predicate pushdown?
I tried a similar situation using Hive 0.13 and it works fine; I got the result. Which version of Hive are you working with?
In SQL Server 2012 Data Quality Services, I need to clean the data in Term Based Relation as follows:
String     Replace to
Wal        walmart
Wlr        walmart
Wlt        walmart
Walmart    (empty)
That is, the words "wal", "wlr", and "wlt" have to be replaced with "walmart", and finally "walmart" is replaced with an empty space.
It shows the following error:
SQL Server Data Quality Services
--------------------------------------------------------------------------------
2/1/2013 2:48:37 PM
Message Id: DataValueServiceTermBasedRelationCorrectedValueAlreadyCorrectingValue
Term Based Relation (walmart, ) cannot be added for domain 'keywordphrase' because 'walmart' value already exists as a correcting value.
--------------------------------------------------------------------------------
Microsoft.Ssdqs.DataValueService.Service.DataValueServiceException: Term Based Relation (walmart, ) cannot be added for domain 'keywordphrase' because 'walmart' value already exists as a correcting value.
at Microsoft.Ssdqs.DataValueService.Managers.DomainTermBasedRelationManager.PreapareAndValidateRelation(DomainTermBasedRelation relation, IMasterContext context)
at Microsoft.Ssdqs.DataValueService.Managers.DomainTermBasedRelationManager.Add(IMasterContext context, ServiceDefinitionBase data)
at Microsoft.Ssdqs.DataValueService.Service.DataValueServiceConcrete.Add(IMasterContext context, ReadOnlyCollection`1 data)
Any suggestions for a solution?
Thanks,
It is my understanding that DQS does not support multi-level replacements (i.e., a->b then b->c). Why not go straight to blanks for the first three terms?