I want to write a generic streaming job that, based on the deployed config file (Kafka topic and SQL query), reads the configured Kafka source topic as a table and then executes the configured query. The code therefore needs to be generic enough to support any of our Kafka topics.
For example, deploy the Flink job with:
config.yaml
source.topic: user
sql.query: select name, address from user where state = 'TX'
vs another config
config.yaml
source.topic: booking
sql.query: select booking_amount from booking where product_type = 'book'
We use Confluent Schema Registry, so the schema is available at runtime.
From the docs I did not see how to generate the table dynamically on startup, other than writing some utility method to generate the CREATE TABLE DDL from the Confluent schema. Is such a utility already available somewhere?
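For illustration, here is a rough sketch of what such a utility method could look like, assuming Flink's flink-avro module is on the classpath (AvroSchemaConverter is a real Flink class, but the class below is made up and the connector option keys in the WITH clause differ between Flink versions, e.g. 'avro-confluent.url' vs 'avro-confluent.schema-registry.url'):

import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.logical.RowType;

import java.util.stream.Collectors;

public class AvroDdlGenerator {

    // Builds a CREATE TABLE statement for the configured topic from the Avro schema JSON
    // fetched from the Confluent registry. Hypothetical helper, not part of Flink.
    public static String createTableDdl(String tableName,
                                        String avroSchemaJson,
                                        String topic,
                                        String bootstrapServers,
                                        String schemaRegistryUrl) {
        // Convert the Avro record schema into a Flink ROW type.
        DataType dataType = AvroSchemaConverter.convertToDataType(avroSchemaJson);
        RowType rowType = (RowType) dataType.getLogicalType();

        // Render each field as "`name` TYPE".
        String columns = rowType.getFields().stream()
                .map(f -> "`" + f.getName() + "` " + f.getType().asSerializableString())
                .collect(Collectors.joining(",\n  "));

        // Connector option keys vary by Flink version; adjust to the docs of your release.
        return "CREATE TABLE `" + tableName + "` (\n  " + columns + "\n) WITH (\n"
                + "  'connector' = 'kafka',\n"
                + "  'topic' = '" + topic + "',\n"
                + "  'properties.bootstrap.servers' = '" + bootstrapServers + "',\n"
                + "  'scan.startup.mode' = 'earliest-offset',\n"
                + "  'format' = 'avro-confluent',\n"
                + "  'avro-confluent.url' = '" + schemaRegistryUrl + "'\n"
                + ")";
    }
}

The generated statement could then be run with tableEnv.executeSql(...) at startup, before executing the configured sql.query.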
Another option I was looking at was to first create a DataStream from the Kafka topic and then convert it to a Flink Table. The DataStream does work with Avro GenericRecord, e.g.
// Kafka source producing Avro GenericRecord values, deserialized via the Confluent registry
KafkaSource<GenericRecord> source = KafkaSource.<GenericRecord>builder()
        .setTopics(egspConfigAdapter.getSourceTopic())
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(ConfluentRegistryAvroDeserializationSchema.forGeneric(
                egspConfigAdapter.getSourceSchema(),
                egspConfigAdapter.getSourceSchemaRegistryUrl()))
        .setProperties(egspConfigAdapter.consumerProperties())
        .build();
But when converting the DataStream to a Table, the schema needs to be specified for the query to work properly. Is there a utility helper to convert an Avro Schema to a Flink Schema?
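On the Avro-to-Flink conversion: Flink's flink-avro module ships org.apache.flink.formats.avro.typeutils.AvroSchemaConverter, which can turn an Avro schema string into Flink type information. Below is a rough, untested sketch of wiring it up, continuing from the KafkaSource above ('env' is the StreamExecutionEnvironment; the GenericRecord-to-Row mapping is simplified and would need extra handling for nested records, enums, bytes, and logical types):

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

DataStream<GenericRecord> records =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

// Derive a Row TypeInformation (field names + types) from the configured Avro schema.
TypeInformation<Row> rowTypeInfo =
        AvroSchemaConverter.convertToTypeInfo(egspConfigAdapter.getSourceSchema().toString());

// Map GenericRecord to Row positionally; Avro strings arrive as Utf8 and are converted,
// other primitive values are passed through as-is.
DataStream<Row> rows = records
        .map(rec -> {
            int arity = rec.getSchema().getFields().size();
            Row row = new Row(arity);
            for (int i = 0; i < arity; i++) {
                Object value = rec.get(i);
                row.setField(i, value instanceof Utf8 ? value.toString() : value);
            }
            return row;
        })
        .returns(rowTypeInfo);

// Register the stream under the configured topic name so the configured SQL can query it.
Table table = tableEnv.fromDataStream(rows);
tableEnv.createTemporaryView(egspConfigAdapter.getSourceTopic(), table);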
Related
So I have a Kafka topic which contains Avro records with different schemas. I want to consume from that Kafka topic in Flink and create a DataStream of Avro GenericRecord (this part is done).
Now I want to write that data to Hudi using the schema extracted from the DataStream. But since the Hudi pipeline/writer takes a config with a predefined Avro schema up front, I can't do that.
A probable solution is to create a keyed stream, where the key identifies one type of schema, then extract the schema from it and create a dynamic Hudi pipeline based on that.
I'm not sure about the last part, whether that is possible.
A-->B-->C
Where A is a stream of generic Avro records with different schemas, B is the stream partitioned by schema, and C uses the schema carried in stream B to create a config and pass it to the Hudi pipeline writer function.
I think I just answered this same question (but for Parquet) yesterday, but I can't find it now :) Anyway, if you know the different schemas in advance (presumably so, otherwise how would you deserialize the incoming Kafka records?), then you can use a ProcessFunction with multiple side outputs, where you split the stream by schema and then connect each schema-specific stream to its own Hudi sink.
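If it helps, here is a rough Java sketch of that idea, assuming the set of schemas is known up front; the class, method, and tag names ('user', 'booking') are made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SchemaRouter {

    // Splits a mixed GenericRecord stream into one side output per known schema.
    public static void routeBySchema(DataStream<GenericRecord> genericStream,
                                     Schema userSchema, Schema bookingSchema) {
        // One side-output tag per schema, carrying proper Avro type information.
        OutputTag<GenericRecord> userTag =
                new OutputTag<>("user", new GenericRecordAvroTypeInfo(userSchema));
        OutputTag<GenericRecord> bookingTag =
                new OutputTag<>("booking", new GenericRecordAvroTypeInfo(bookingSchema));

        // Capture plain strings (serializable) for use inside the function.
        String userType = userSchema.getFullName();
        String bookingType = bookingSchema.getFullName();

        SingleOutputStreamOperator<GenericRecord> routed = genericStream
                .process(new ProcessFunction<GenericRecord, GenericRecord>() {
                    @Override
                    public void processElement(GenericRecord rec, Context ctx,
                                               Collector<GenericRecord> out) {
                        // Route each record by the schema it carries.
                        String name = rec.getSchema().getFullName();
                        if (name.equals(userType)) {
                            ctx.output(userTag, rec);
                        } else if (name.equals(bookingType)) {
                            ctx.output(bookingTag, rec);
                        } else {
                            out.collect(rec); // unknown schemas stay on the main output
                        }
                    }
                });

        // Each schema-specific stream can then be connected to its own Hudi sink
        // configured with that fixed schema.
        DataStream<GenericRecord> users = routed.getSideOutput(userTag);
        DataStream<GenericRecord> bookings = routed.getSideOutput(bookingTag);
    }
}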
Is there any way to auto-infer a Kafka topic's DDL in Flink without manually writing a CREATE TABLE statement, as is possible in Spark?
You could use Flink Catalogs to connect to metadata repositories, see https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
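As a hedged illustration of that suggestion: if the table definitions live in a persistent catalog such as a Hive Metastore (requires the flink-connector-hive dependency), jobs can query them without repeating any DDL. The catalog name, database, conf path, and table name below are placeholders:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

TableEnvironment tableEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

// Placeholder names/paths; the catalog stores the Kafka table definitions
// (schema + connector options) so jobs only reference them instead of re-declaring DDL.
HiveCatalog catalog = new HiveCatalog("metastore", "default", "/etc/hive/conf");
tableEnv.registerCatalog("metastore", catalog);
tableEnv.useCatalog("metastore");

// The table was registered once (e.g. via the SQL client); no CREATE TABLE needed here.
tableEnv.executeSql("SELECT * FROM some_kafka_table").print();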
Please let me know whether the Snowflake Spark connector has the ability to create a Snowflake external table.
The Spark connector does support DDL statements, and CREATE EXTERNAL TABLE is a DDL statement:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
Not sure how you can create external tables with the Spark connector, but what I usually do is create a stage in Snowflake backed by Blob Storage or an S3 bucket, which you can then work with like a local file system. It can be used in your queries through the Spark connector, and any new file landing in the stage (Blob Storage/S3 bucket) becomes available through a query such as:
SELECT $1 FROM @STAGE_NAME/example.json;
I'm not sure if this is of any help, since I don't know how you are trying to apply it. But if it does help, I'll be glad to put an example here.
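For what it's worth, a rough, untested sketch of running such a stage query through the Spark connector from Java; the connection option values and stage name are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("snowflake-stage-read").getOrCreate();

// Placeholder connection options.
Dataset<Row> staged = spark.read()
        .format("net.snowflake.spark.snowflake")
        .option("sfURL", "account.snowflakecomputing.com")
        .option("sfUser", "USER")
        .option("sfPassword", "PASSWORD")
        .option("sfDatabase", "DB")
        .option("sfSchema", "PUBLIC")
        .option("sfWarehouse", "WH")
        // Push the stage query down to Snowflake; $1 is the raw column of the staged JSON file.
        .option("query", "SELECT $1 FROM @STAGE_NAME/example.json")
        .load();

staged.show();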
I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've set up a database containing one of our project databases. So far so good.
As a next step I configured Kafka Connect with the following properties:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "updated_at,created_at",
  "incrementing.column.name": "id",
  "dialect.name": "SqlServerDatabaseDialect",
  "validate.non.null": "false",
  "tasks.max": "1",
  "mode": "timestamp+incrementing",
  "topic.prefix": "mssql-jdbc-",
  "poll.interval.ms": "10000"
}
While this works for the majority of my tables, where I have an ID and a created_at/updated_at field, it won't work for the tables where I modeled my many-to-many relationships with a junction table and a composite key. Note that I'm using the generic JDBC configuration with a JDBC driver from Microsoft.
Is there a way to configure Kafka Connect for these special cases?
Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns.
As @cricket_007 says, you can use the query option to pull back the results of a query, which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself just issues a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.
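For example, a query-based connector config might look roughly like this (the SELECT, topic name, and column names are placeholders; note that when query is set you drop the table-based settings, the query must not contain its own WHERE clause, and topic.prefix becomes the full topic name):

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "dialect.name": "SqlServerDatabaseDialect",
  "mode": "timestamp",
  "query": "SELECT pl.product_id, pl.location_id, pl.updated_at FROM product_location pl",
  "timestamp.column.name": "updated_at",
  "validate.non.null": "false",
  "topic.prefix": "mssql-jdbc-product-location",
  "tasks.max": "1",
  "poll.interval.ms": "10000"
}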
The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.
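For the SQL Server case, log-based CDC would typically mean Debezium's SQL Server connector. A rough sketch of such a config follows; the connection details and table names are placeholders, the property names follow recent Debezium releases (check them against the version you deploy), and CDC must already be enabled on the SQL Server database and tables:

{
  "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
  "database.hostname": "mssql-host",
  "database.port": "1433",
  "database.user": "debezium",
  "database.password": "********",
  "database.names": "project_db",
  "topic.prefix": "mssql-cdc",
  "table.include.list": "dbo.product_location",
  "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
  "schema.history.internal.kafka.topic": "schema-changes.project_db"
}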
Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself. An example of this is Kafka Streams or KSQL. I've written about the latter a lot here.
You might also find this article useful; it describes your options for integrating databases with Kafka in detail.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
To automatically warehouse documents from Cloudant to dashDB, there is a schema discovery process (SDP) that automates the data migration for you. When using the SDP, there is a 'Rescan' option.
I have used 'Rescan' a number of times, but am unclear on the steps it actually performs. What steps are performed by a 'Rescan'? E.g.
Drop tables in the dashDB target schema? Which tables?
Scan Cloudant source database?
Recreate the target schema?
...
The steps are pretty much as you suggested. Rescan will:
1. Inspect the previously discovered JSON schema and remove all tables from the dashDB instance that were created for that load (leaving any user-defined tables untouched)
2. Re-discover the JSON schema using the current settings (including sample size, type of discovery algorithm, etc.)
3. Create the new tables in the same dashDB target
4. Populate the newly created tables with data from Cloudant
5. Subscribe to the _changes feed from Cloudant to continuously synchronize document changes with dashDB
All steps (except for the first) are identical for the initial load as well as the rescan function.
The main motivation for a rescan is to support schema evolution. Whenever the document structure in a Cloudant source database changes, a user can make a conscious decision to drop and re-create the dashDB tables using this rescan function. SDP won't automate that process to avoid potential conflicts with applications depending on the existing dashDB tables.