Is there any way to auto-infer the Kafka topic DDL in Flink without manually writing a CREATE TABLE query, just like in Spark?
You could use Flink Catalogs to connect to metadata repositories, see https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
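For example, once a catalog is registered, every table it contains is queryable without a per-job CREATE TABLE. A minimal sketch, assuming a Hive Metastore as the metadata repository; the catalog name, database, conf path, and table name are placeholders:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class CatalogExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a HiveCatalog so table metadata stored in the metastore
        // is available without per-job CREATE TABLE DDL.
        HiveCatalog hive = new HiveCatalog("my_hive", "default", "/path/to/hive-conf");
        tEnv.registerCatalog("my_hive", hive);
        tEnv.useCatalog("my_hive");

        // Tables already defined in the catalog are directly queryable.
        tEnv.executeSql("SELECT * FROM some_existing_table").print();
    }
}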
So I have a Kafka topic which contains Avro records with different schemas. I want to consume from that topic in Flink and create a DataStream of Avro GenericRecord (this part is done).
Now I want to write that data to Hudi using the schema extracted from the DataStream. But since the Hudi pipeline/writer takes a config with a predefined Avro schema up front, I can't do that.
A probable solution is to key the stream by something that identifies one type of schema, then extract the schema from each partitioned stream and create a dynamic Hudi pipeline based on it.
I'm not sure about the last part, whether that is possible.
A-->B-->C
Here A is the stream of generic Avro records with different schemas, B is the stream partitioned by schema, and C uses the schema carried in stream B to build a config and pass it to the Hudi pipeline writer function.
I think I just answered this same question (but for Parquet) yesterday, but I can't find it now :) Anyway, if you know the different schemas in advance (assumed so, otherwise how would you deserialize the incoming Kafka records?), then you can use a ProcessFunction with multiple side outputs, where you split the stream by schema and then connect each schema-specific stream to its own Hudi sink, as sketched below.
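A minimal sketch of that split, assuming two known schemas; the tag names, schema full names, and the incoming records stream are placeholders:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// One side-output tag per known schema.
final OutputTag<GenericRecord> schemaATag = new OutputTag<GenericRecord>("schema-a") {};
final OutputTag<GenericRecord> schemaBTag = new OutputTag<GenericRecord>("schema-b") {};

SingleOutputStreamOperator<GenericRecord> split =
        records.process(new ProcessFunction<GenericRecord, GenericRecord>() {
            @Override
            public void processElement(GenericRecord value, Context ctx, Collector<GenericRecord> out) {
                // Route each record to the side output matching its Avro schema name.
                String schemaName = value.getSchema().getFullName();
                if ("com.example.SchemaA".equals(schemaName)) {
                    ctx.output(schemaATag, value);
                } else if ("com.example.SchemaB".equals(schemaName)) {
                    ctx.output(schemaBTag, value);
                } else {
                    out.collect(value); // unknown schemas stay on the main stream
                }
            }
        });

DataStream<GenericRecord> schemaAStream = split.getSideOutput(schemaATag);
DataStream<GenericRecord> schemaBStream = split.getSideOutput(schemaBTag);
// Each schema-specific stream can then be wired to its own Hudi sink config.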
I want to write a generic streaming job where, based on the deployed config file (Kafka topic and SQL query), it reads from the configured Kafka source topic as a table and then executes the configured query. This way the code stays generic and supports any of our Kafka topics.
For example, deploy the Flink job with
config.yaml
source.topic: user
sql.query: select name, address from user where state = 'TX'
vs another config
config.yaml
source.topic: booking
sql.query: select booking_amount from booking where product_type = 'book'
We use the Confluent Schema Registry, so the schema is available at runtime.
From the docs I did not see how to generate the table dynamically on startup, other than writing some utility method to generate the CREATE TABLE DDL from the Confluent schema. Is such a utility already available somewhere?
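The utility I had in mind would look roughly like the following; a minimal sketch, assuming flink-avro's AvroSchemaConverter and Confluent's CachedSchemaRegistryClient, where the topic, query, URLs, and the 'avro-confluent' option keys are illustrative and vary by Flink version:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.logical.RowType;

import java.util.stream.Collectors;

public class GenericKafkaSqlJob {
    public static void main(String[] args) throws Exception {
        // These values would come from the deployed config.yaml (placeholders here).
        String topic = "user";
        String query = "SELECT name, address FROM source_table WHERE state = 'TX'";
        String registryUrl = "http://schema-registry:8081";
        String bootstrapServers = "kafka:9092";

        // 1. Fetch the latest Avro schema for the topic's value subject from the registry.
        CachedSchemaRegistryClient registry = new CachedSchemaRegistryClient(registryUrl, 100);
        SchemaMetadata metadata = registry.getLatestSchemaMetadata(topic + "-value");

        // 2. Convert the Avro schema into Flink column definitions.
        DataType rowType = AvroSchemaConverter.convertToDataType(metadata.getSchema());
        String columns = ((RowType) rowType.getLogicalType()).getFields().stream()
                .map(f -> "`" + f.getName() + "` " + f.getType().asSerializableString())
                .collect(Collectors.joining(", "));

        // 3. Generate and run the CREATE TABLE DDL, then execute the configured query.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql("CREATE TABLE source_table (" + columns + ") WITH ("
                + "'connector' = 'kafka',"
                + "'topic' = '" + topic + "',"
                + "'properties.bootstrap.servers' = '" + bootstrapServers + "',"
                + "'scan.startup.mode' = 'earliest-offset',"
                + "'format' = 'avro-confluent',"
                + "'avro-confluent.url' = '" + registryUrl + "')");
        tEnv.executeSql(query).print();
    }
}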
Another option I was looking at was first creating a DataStream from the Kafka topic and then converting it to a Flink table. The DataStream does work with Avro GenericRecord, e.g.
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;

KafkaSource<GenericRecord> source = KafkaSource.<GenericRecord>builder()
        .setTopics(egspConfigAdapter.getSourceTopic())
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(ConfluentRegistryAvroDeserializationSchema.forGeneric(egspConfigAdapter.getSourceSchema(), egspConfigAdapter.getSourceSchemaRegistryUrl()))
        .setProperties(egspConfigAdapter.consumerProperties())
        .build();
But then when converting the DataStream to a table, the schema needs to be specified for the query to work properly. Is there a utility helper to convert an Avro Schema to a Flink Schema?
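For the Avro-to-Flink conversion specifically, flink-avro ships AvroSchemaConverter; a minimal sketch reusing the adapter above (whether it covers every type in your schemas is an assumption worth checking):

import org.apache.avro.Schema;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.types.DataType;

// The same Avro schema that was passed to forGeneric(...) above.
Schema avroSchema = egspConfigAdapter.getSourceSchema();

// AvroSchemaConverter maps an Avro schema to a Flink row type.
DataType rowDataType = AvroSchemaConverter.convertToDataType(avroSchema.toString());

// That row type can seed the table schema used when converting the stream to a table.
org.apache.flink.table.api.Schema tableSchema =
        org.apache.flink.table.api.Schema.newBuilder()
                .fromRowDataType(rowDataType)
                .build();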
Please let me know whether the Snowflake Spark connector has the ability to create a Snowflake external table.
The Spark connector does support DDL statements, and CREATE EXTERNAL TABLE is a DDL statement:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
I'm not sure how you can create external tables with the Spark connector, but what I usually do is create a stage in Snowflake on top of Blob Storage or an S3 bucket, which you can then work with like a local file. The stage can be used in your queries through the Spark connector, and any new file on the stage (Blob Storage/S3 bucket) will be available through a query such as:
SELECT * FROM @"STAGE_NAME"/example.json;
I'm not sure if this is of any help since I don't know how you are trying to apply it, but if it is, I'll be glad to put an example here.
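For what it's worth, here is a minimal sketch of that stage pattern using the plain Snowflake JDBC driver rather than the Spark connector; the account URL, credentials, bucket, and named file format are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class SnowflakeStageExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // placeholder credentials
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:snowflake://my_account.snowflakecomputing.com/", props);
             Statement stmt = conn.createStatement()) {

            // Create an external stage over an S3 bucket (URL/credentials are placeholders);
            // any new file landing in the bucket becomes queryable through the stage.
            stmt.execute("CREATE STAGE IF NOT EXISTS my_stage "
                + "URL = 's3://my-bucket/landing/' "
                + "CREDENTIALS = (AWS_KEY_ID = '***' AWS_SECRET_KEY = '***')");

            // Query a staged JSON file directly; $1 is the raw VARIANT column.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT $1 FROM @my_stage/example.json "
                   + "(FILE_FORMAT => 'my_json_format')")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}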
I need to update a table in Snowflake by taking data from an Oracle database.
Is there a way to connect to an Oracle database from Snowflake?
If the answer is no, how can I update the table in Snowflake using data from Oracle?
Not sure exactly what you are looking for here. The best way to get data into Snowflake is via the COPY INTO command, which would then allow you to update the Snowflake table with that data. If you are looking for ways to keep the two systems in sync, then you might want to look into the various data replication tools in the marketplace. If this is a transactional update, then you can use a connector (ODBC, JDBC, Python, etc.) to update the data from one system to the other. I wouldn't recommend that for bulk updates, though.
There are several ways you can integrate your data from Oracle into Snowflake. If you are familiar with an ETL tool you can use any one of them, or you can use any programming language to extract and load the data, as sketched below.
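A minimal sketch of the connector route, pulling recently changed rows from Oracle over JDBC and merging them into Snowflake; the connection strings, credentials, and table/column names are placeholders, and for bulk volumes you would stage the extract and use COPY INTO instead:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class OracleToSnowflake {
    public static void main(String[] args) throws Exception {
        // Both the Oracle and Snowflake JDBC drivers must be on the classpath.
        try (Connection oracle = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//oracle-host:1521/ORCL", "ora_user", "ora_pass");
             Connection snowflake = DriverManager.getConnection(
                 "jdbc:snowflake://my_account.snowflakecomputing.com/?db=MY_DB&schema=PUBLIC",
                 snowflakeProps())) {

            try (Statement read = oracle.createStatement();
                 ResultSet rs = read.executeQuery(
                     "SELECT id, amount FROM orders WHERE updated_at > SYSDATE - 1");
                 PreparedStatement write = snowflake.prepareStatement(
                     "MERGE INTO orders t USING (SELECT ? AS id, ? AS amount) s "
                   + "ON t.id = s.id "
                   + "WHEN MATCHED THEN UPDATE SET t.amount = s.amount "
                   + "WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)")) {

                // Row-by-row MERGE is fine for small transactional updates;
                // for bulk loads, stage the extract and use COPY INTO instead.
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setBigDecimal(2, rs.getBigDecimal("amount"));
                    write.executeUpdate();
                }
            }
        }
    }

    private static Properties snowflakeProps() {
        Properties p = new Properties();
        p.put("user", "SF_USER");          // placeholder credentials
        p.put("password", "SF_PASSWORD");
        return p;
    }
}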
I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've set up a database server with one of our project databases in it. So far so good.
As a next step, I configured Kafka Connect with the following properties:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "updated_at,created_at",
  "incrementing.column.name": "id",
  "dialect.name": "SqlServerDatabaseDialect",
  "validate.non.null": "false",
  "tasks.max": "1",
  "mode": "timestamp+incrementing",
  "topic.prefix": "mssql-jdbc-",
  "poll.interval.ms": "10000"
}
While this works for the majority of my tables, where I have an ID and a created_at/updated_at field, it won't work for the tables where I resolved my many-to-many relationships with a join table and a composite key. Note that I'm using the generic JDBC configuration with a JDBC driver from Microsoft.
Is there a way to configure Kafka Connect for these special cases?
Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns.
As @cricket_007 says, you can use the query option to pull back the results of a query, which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself is just issuing a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.
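For example, a hedged sketch of registering such a connector through the Connect REST API, using query instead of a table whitelist; the connection URL, join SQL, and topic name are placeholders, and with query the connector appends its own WHERE clause for the timestamp filter, so the query itself should not contain one:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJoinConnector {
    public static void main(String[] args) throws Exception {
        // With "query", topic.prefix becomes the full topic name rather than a prefix.
        String body = """
                {
                  "name": "mssql-jdbc-join-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:sqlserver://host:1433;databaseName=mydb",
                    "connection.user": "user",
                    "connection.password": "password",
                    "mode": "timestamp",
                    "timestamp.column.name": "updated_at",
                    "query": "SELECT l.a_id, l.b_id, a.updated_at, b.name FROM link_table l JOIN table_a a ON a.id = l.a_id JOIN table_b b ON b.id = l.b_id",
                    "topic.prefix": "mssql-jdbc-joined",
                    "poll.interval.ms": "10000"
                  }
                }
                """;

        // Register the connector with the Connect worker's REST API.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}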
The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.
Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself. An example of this is Kafka Streams or KSQL. I've written about the latter a lot here.
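As an illustration, a minimal Kafka Streams sketch of such a join, assuming the two JDBC-sourced topics are keyed and co-partitioned on the join key and treated as strings here for brevity; topic names and serdes are placeholders:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class JoinInKafkaStreams {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mssql-join-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Two changelog topics produced by the JDBC source connector.
        KTable<String, String> tableA = builder.table("mssql-jdbc-table_a");
        KTable<String, String> tableB = builder.table("mssql-jdbc-table_b");

        // Resolve the join inside Kafka and write the result to a new topic.
        tableA.join(tableB, (a, b) -> a + "|" + b)
              .toStream()
              .to("mssql-joined");

        new KafkaStreams(builder.build(), props).start();
    }
}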
You might also find this article useful describing in detail your options for integrating databases with Kafka.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.