I have a Flink DataStream job that flows like this:
Before the DataStream definition:
- define the list of Kafka topics
- create a HashMap of Kafka topic (String) -> OutputTag
- define the Kafka source

During the DataStream definition:
- add the Kafka source to the environment using fromSource
- define a process function that pulls the Kafka topic from the message metadata, looks up the OutputTag in the HashMap of (Kafka topic -> OutputTag), and emits the message to that side output (a sketch of this routing function follows the list)

After the DataStream definition:
- iterate over the HashMap of Kafka topic -> OutputTag
- create a JDBC sink with a unique insert statement for each side output
- attach a sink to each OutputTag
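For reference, the routing process function I have today looks roughly like this (a sketch, assuming the Kafka deserializer emits (topic, payload) pairs as Tuple2<String, String>; the class name is illustrative):

import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Routes each (topic, payload) record to the side output registered for its topic.
public class TopicRouter extends ProcessFunction<Tuple2<String, String>, Void> {

    // topic name -> OutputTag, built before the job graph is defined; must be serializable
    private final Map<String, OutputTag<Tuple2<String, String>>> sideOutputs;

    public TopicRouter(Map<String, OutputTag<Tuple2<String, String>>> sideOutputs) {
        this.sideOutputs = sideOutputs;
    }

    @Override
    public void processElement(Tuple2<String, String> record, Context ctx, Collector<Void> out) {
        OutputTag<Tuple2<String, String>> tag = sideOutputs.get(record.f0);
        if (tag != null) {
            ctx.output(tag, record); // emit to the side output for this topic
        }
        // topics that were not registered up front are silently dropped here
    }
}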
I'd like to avoid building the list of Kafka topics before defining the stream. Instead, I want to use the topic-pattern (regex) support of the KafkaSource to consume from all topics that match the pattern.
Is it possible to create new side output tags and sinks at runtime inside the process function? That is, if I encounter a new Kafka topic, can I create a side output, add it to the stream, and then attach a new sink to that side output?
The more I think about it, I assume this is not possible.
My alternative plan is to use a Kafka connector and build the list of Kafka topics in the 'before DataStream definition' step from above. In that case I would have to restart the job to consume from new topics.
I could have thousands of topics, which is why I want to dynamically define them.
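The pattern-based subscription I have in mind would look roughly like this (a sketch; the broker address, group id, and topic regex are placeholders, and the real job would use a KafkaRecordDeserializationSchema that keeps the topic name for routing rather than a value-only deserializer):

import java.util.regex.Pattern;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PatternSourceSketch {

    // Builds a KafkaSource that subscribes to every topic matching the regex.
    public static DataStream<String> patternSource(StreamExecutionEnvironment env) {
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopicPattern(Pattern.compile("events-.*"))
                .setGroupId("my-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        return env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
    }
}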
In PyFlink, is it possible to use the DataStream API to create a DataStream via StreamExecutionEnvironment's addSource(...), perform transformations on that stream with the DataStream API, and then convert it into a form where SQL statements can be executed on it using the Table API?
I have a single stream of many different types of events, so I'd like to create many different data streams from a single source, each with a different type of data in it. I was thinking perhaps I could use a side output based on the data in the initial stream and then perform different SQL operations against each of those streams, safe in the knowledge of what the data in each of those separate streams actually is. I don't want to have a different Flink job for each data type in the stream.
Yes, you can convert a DataStream into a Table, which then allows you to execute SQL statements against it: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/data_stream_api/
Splitting the stream by type the way you describe seems reasonable to me.
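As a rough illustration (shown in Java for brevity; the PyFlink API mirrors these calls, a recent Flink version with toChangelogStream is assumed, and the view name and tuple layout are mine):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SideOutputSql {

    // Registers one typed side-output stream as a view and runs a SQL aggregation over it.
    public static DataStream<Row> countPerKey(StreamTableEnvironment tEnv,
                                              DataStream<Tuple2<String, Integer>> events) {
        // Tuple fields appear as columns f0 / f1 unless an explicit schema is supplied.
        tEnv.createTemporaryView("events", events);
        Table result = tEnv.sqlQuery(
                "SELECT f0 AS eventKey, COUNT(*) AS cnt FROM events GROUP BY f0");
        // The aggregation produces updates, so convert to a changelog stream rather than an append stream.
        return tEnv.toChangelogStream(result);
    }
}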
I have employee master data coming in as a data stream from a NiFi connector, and a main Kafka data stream that contains employee details. Within processElement, I have to use this master data stream for some calculations. Is there any way to do this?
My current design has a main data stream and a broadcast stream (broadCastStream); all the data processing is done inside processElement of mainDataProcessor, which is derived from KeyedBroadcastProcessFunction.
I am connecting my broadcast stream to the main stream as shown below:
mainStream.connect(broadCastStream).process(new mainDataProcessor())
Now we have an additional need to introduce one more data stream, containing master data from a Cassandra table, with the help of the NiFi connector. I need this master-table data stream inside processElement to do some calculations with the main-stream data and the broadcast data. Is there any way to do that?
What you usually want to do is to join the streams on a particular column. For example, using a temporal join.
If you are on DataStream, you can also use a join but you need to be careful with the state size (when can you discard data?).
If you don't have an employee id over which to join, you could also try to use broadcasts, but that's less recommended.
If you need more specific pointers, please update your question and also mention which API you are using.
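For illustration, a minimal DataStream-level sketch of such a join, keyed on employee id and keeping only the latest master record per key in state (the tuple layout is illustrative, and this is a keyed connect rather than a temporal join):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Stream 1 carries main events (employeeId, payload); stream 2 carries master data (employeeId, masterInfo).
public class MasterDataEnricher extends KeyedCoProcessFunction<
        String, Tuple2<String, String>, Tuple2<String, String>, Tuple3<String, String, String>> {

    private transient ValueState<String> masterInfo;

    @Override
    public void open(Configuration parameters) {
        masterInfo = getRuntimeContext().getState(
                new ValueStateDescriptor<>("employee-master", Types.STRING));
    }

    @Override
    public void processElement1(Tuple2<String, String> event, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        // Master data may not have arrived yet for this employee id, so the third field can be null.
        out.collect(Tuple3.of(event.f0, event.f1, masterInfo.value()));
    }

    @Override
    public void processElement2(Tuple2<String, String> master, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        masterInfo.update(master.f1); // latest master record wins; consider state TTL to bound state size
    }
}

It would be wired in as mainStream.keyBy(e -> e.f0).connect(masterStream.keyBy(m -> m.f0)).process(new MasterDataEnricher()), and the enriched stream can then still be connected with the existing broadcast stream.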
I have a DB2 table with about 150K records. I have another SQL Server table with the same columns. One of the table columns - let's call it code - is a unique value and is indexed. I am using Spring Batch. Periodically, I get a file with a list of codes, for example a file with 5K codes. For each code in the file, I need to read the records from the DB2 table whose code column matches the code in the file and insert a few columns from those records into the SQL Server table. I want to use SQL and not JPA, and I believe there is a limit (let's say 1000) on how many values can be in a SQL IN clause. Should this be my chunk size?
How should the Spring Batch application be designed to do this? I have considered the strategies below but need help deciding which one (or any other) is better.
1) Single-step job with a reader reading codes from the file, a processor using a JdbcTemplate to get rows for a chunk of codes, and a writer writing the rows using JdbcBatchItemWriter - it seems like the JdbcTemplate would hold an open DB connection throughout job execution.
2) JdbcPagingItemReader - the Spring Batch documentation cautions that databases like DB2 have pessimistic locking strategies and suggests using a driving query instead.
3) Driving query - is there an example? How does the processor convert the key into a full object here? How long does the connection stay open?
4) Chaining readers - is this possible? The first reader would read from the file, the second from DB2, and then the processor and writer would follow.
I would go with your option #1. Your file containing unique codes effectively becomes your driving query.
Your ItemReader will read the file and emit each code to your ItemProcessor.
The ItemProcessor can either use a JdbcTemplate directly, or you can delegate to a separate data access object (DAO) in your project; either way, each invocation of the process method pulls one record from your DB2 table. You can do whatever other processing is necessary here before emitting the appropriate object for your ItemWriter, which then inserts or updates the necessary record(s) in your SQL Server table.
Here's an example from a project where I used an ItemReader<Integer> as my driving query to collect the IDs of devices whose configuration data I needed to process. I then passed those IDs on to my ItemProcessor, which dealt with one configuration file at a time:
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Autowired;

public class DeviceConfigDataProcessor implements ItemProcessor<Integer, DeviceConfig> {

    @Autowired
    MyJdbcDao myJdbcDao;

    @Override
    public DeviceConfig process(Integer deviceId) throws Exception {
        // One DB2 lookup per driving-query item.
        DeviceConfig deviceConfig = myJdbcDao.getDeviceConfig(deviceId);
        // process deviceConfig as needed
        return deviceConfig;
    }
}
You would swap out deviceId for code, and DeviceConfig for whatever domain object is appropriate to your project.
If you're using Spring Boot, you get a connection pool automatically, and your DAO will pull a single record at a time for processing, so you don't need to worry about long-lived connections to the database, pessimistic locks, etc.
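If it helps, here is a rough sketch of how that single step could be wired up, assuming Spring Batch 4.x with Spring Boot; the file name, table and column names, and the TargetRow type are illustrative, and the injected processor is the code-to-row lookup described above:

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class CopyJobConfig {

    // Hypothetical row object holding the columns to be inserted into SQL Server.
    public static class TargetRow {
        private String code;
        private String colA;
        public String getCode() { return code; }
        public void setCode(String code) { this.code = code; }
        public String getColA() { return colA; }
        public void setColA(String colA) { this.colA = colA; }
    }

    @Bean
    public FlatFileItemReader<String> codeReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("codeReader")
                .resource(new FileSystemResource("codes.txt")) // the periodic file of codes
                .lineMapper((line, lineNumber) -> line.trim()) // one code per line
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<TargetRow> targetWriter(DataSource sqlServerDataSource) {
        return new JdbcBatchItemWriterBuilder<TargetRow>()
                .dataSource(sqlServerDataSource)
                .sql("INSERT INTO target_table (code, col_a) VALUES (:code, :colA)")
                .beanMapped() // bind the named parameters from TargetRow properties
                .build();
    }

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         FlatFileItemReader<String> codeReader,
                         ItemProcessor<String, TargetRow> codeProcessor,
                         JdbcBatchItemWriter<TargetRow> targetWriter) {
        return steps.get("copyStep")
                .<String, TargetRow>chunk(1000) // chunk size is independent of any IN-clause limit here, since each code is looked up individually
                .reader(codeReader)
                .processor(codeProcessor)
                .writer(targetWriter)
                .build();
    }
}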
I have seen several mentions of an "upsert mode" for dynamic tables based on a unique key in the Flink documentation and on the official Flink blog. However, I do not see any examples / documentation regarding how to enable this mode on a dynamic table.
Examples:
Blog post:
When defining a dynamic table on a stream via update mode, we can specify a unique key attribute on the table. In that case, update and delete operations are performed with respect to the key attribute. The update mode is visualized in the following figure.
Documentation:
A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key.
So my questions are:
How do I specify a unique key attribute on a dynamic table in Flink?
How do I place a dynamic table in update/upsert/"replace" mode, as opposed to append mode?
The linked resources describe two different scenarios.
The blog post discusses an upsert DataStream -> Table conversion.
The documentation describes the inverse upsert Table -> DataStream conversion.
The following discussion is based on Flink 1.4.0 (Jan. 2018).
Upsert DataStream -> Table Conversion
Converting a DataStream into a Table by upsert on keys is not natively supported but on the roadmap. Meanwhile, you can emulate this behavior using an append Table and a query with a user-defined aggregation function.
If you have an append Table Logins with the schema (user, loginTime, ip) that tracks logins of users, you can convert that into an upsert Table keyed on user with the following query:
SELECT user, LAST_VAL(loginTime), LAST_VAL(ip) FROM Logins GROUP BY user
The LAST_VAL aggregation function is a user-defined aggregation function that always returns the latest added value.
Native support for upsert DataStream -> Table conversion would work basically the same way, although providing a more concise API.
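For illustration, a minimal sketch of what such a LAST_VAL-style user-defined aggregation function could look like (the class and accumulator names are mine, the value type is assumed to be String, and the accumulate method is picked up by reflection):

import org.apache.flink.table.functions.AggregateFunction;

// Keeps only the most recently accumulated value, so GROUP BY user + LAST_VAL(col)
// yields the latest col per user in arrival order.
public class LastVal extends AggregateFunction<String, LastVal.LastValAcc> {

    // Accumulator holding the latest value seen so far.
    public static class LastValAcc {
        public String last;
    }

    @Override
    public LastValAcc createAccumulator() {
        return new LastValAcc();
    }

    @Override
    public String getValue(LastValAcc acc) {
        return acc.last;
    }

    // Called for every input row; simply overwrites the previous value.
    public void accumulate(LastValAcc acc, String value) {
        acc.last = value;
    }
}

It would be registered under the name LAST_VAL on the table environment (registerFunction in pre-1.9 versions, createTemporarySystemFunction in newer ones) before running the query above.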
Upsert Table -> DataStream Conversion
Converting a Table into an upsert DataStream is not supported. This is also properly reflected in the documentation:
Please note that only append and retract streams are supported when converting a dynamic table into a DataStream.
We deliberately chose not to support upsert Table -> DataStream conversions, because an upsert DataStream can only be processed if the key attributes are known. These depend on the query and are not always straightforward to identify. It would be the responsibility of the developer to make sure that the key attributes are correctly interpreted. Failing to do so would result in faulty programs. To avoid problems, we decided not to offer the upsert Table -> DataStream conversion.
Instead users can convert a Table into a retraction DataStream. Moreover, we support UpsertTableSink that writes an upsert DataStream to an external system, such as a database or key-value store.
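For completeness, a minimal sketch of that retraction conversion (shown against the bridge StreamTableEnvironment of more recent Flink releases; the exact package has moved since 1.4):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class RetractionExample {

    // Converts a (possibly updating) Table into a retraction stream: each element is a
    // (flag, row) pair where true marks an insertion and false marks a retraction.
    public static DataStream<Tuple2<Boolean, Row>> toRetractionStream(StreamTableEnvironment tEnv, Table table) {
        return tEnv.toRetractStream(table, Row.class);
    }
}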
Update: since Flink 1.9, LAST_VALUE is one of the built-in aggregate functions when using the Blink planner (which has been the default since Flink 1.11).
Assuming the existence of the Logins table mentioned in Fabian Hueske's answer above, we can now convert it to an upsert table as simply as:
SELECT
user,
LAST_VALUE(loginTime),
LAST_VALUE(ip)
FROM Logins
GROUP BY user
Flink 1.8 still lacks such support. I expect these features to be added in the future: 1) LAST_VAL, 2) upsert stream <-> dynamic table conversion.
P.S. LAST_VAL() does not seem implementable as a user-defined aggregation function, because aggregation functions are not given the attached event-/processing-time context. Alibaba's Blink provides an alternative implementation of LAST_VAL, but it requires an extra field to supply the ordering information rather than using event/processing time directly, which makes the SQL ugly. (https://help.aliyun.com/knowledge_detail/62791.html)
My workaround for LAST_VAL (e.g. to get the latest ip) looks something like this:
concat(ts, ip)      AS ordered_ip   -- prefix the value with its timestamp
MAX(ordered_ip)     AS ordered_ip   -- the max is the value with the latest timestamp
extract(ordered_ip) AS ip           -- strip the timestamp prefix again
A certain job I'm running needs to collect some metadata from a DB (MySQL, although that's not as relevant) before processing some large HDFS files. This metadata will be added to data in the files and passed on to the later map/combine/reduce stages.
I was wondering where the "correct" place to put this query might be. I need the metadata to be available when the mapper begins, but placing it there seems redundant, as every Mapper will execute the same query. How can I (if at all) perform this query once and share its results across all the mappers? Is there a common way to share data between all the nodes performing a task (other than writing it to HDFS)? thanks.
You can run your MySQL query in your main function and store the result of the query in a string. Then you can set that value on the Hadoop job Configuration object. Variables set in the Configuration object can be accessed by all mappers.
Your main class would look like this:
JobConf conf = new JobConf(Driver.class);
String metainfo = <your metadata goes here>;
conf.set("metadata", metainfo);
So in your Map class you can access the metadata value as follows:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    String sMetaInfo = "";

    public void configure(JobConf job) {
        sMetaInfo = job.get("metadata"); // read the metadata value from the job Configuration object
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // map logic; sMetaInfo is available here
    }
}
I would use Sqoop for ease if you have the Cloudera distribution. I usually program with Cascading in Java and, for DB sources, use DBMigrate as a source "tap", making databases first-class citizens. When using primary keys with DBMigrate, the performance has been adequate.