Flink write to Hudi with different schemas extracted from Kafka datastream - apache-flink

So I have a Kafka topic which contains Avro records with different schemas. I want to consume from that topic in Flink and create a DataStream of Avro GenericRecord (this part is done).
Now I want to write that data to Hudi using the schema extracted from the datastream. But since the Hudi pipeline/writer takes a config with a predefined Avro schema up front, I can't do that.
A probable solution is to create a keyed stream, where the key identifies one type of schema, then extract the schema from that stream and create a dynamic Hudi pipeline based on it.
I'm not sure about the last part, whether that is possible.
A-->B-->C
Where A is a stream of generic Avro records with different schemas, B is the stream partitioned by schema, and C uses the schema carried in stream B to create a config and pass it to the Hudi pipeline writer function.

I think I just answered this same question (but for Parquet) yesterday, but I can't find it now :) Anyway, if you know the different schemas in advance (presumably you do, otherwise how would you deserialize the incoming Kafka records?), you can use a ProcessFunction with multiple side outputs: split the stream by schema, then connect each schema-specific stream to its own Hudi sink.
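In case it helps, here is a minimal sketch of that side-output pattern (the record names Order/User, the tag names, and the class name SchemaRouter are made up, and the Hudi sink wiring is omitted):

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SchemaRouter extends ProcessFunction<GenericRecord, GenericRecord> {

    // One tag per known schema; the names here are made up.
    // Note: GenericRecord needs suitable serialization in practice (e.g. Avro type info or
    // registered Avro serializers); generic type info is used here only for brevity.
    public static final OutputTag<GenericRecord> ORDERS =
            new OutputTag<>("orders", TypeInformation.of(GenericRecord.class));
    public static final OutputTag<GenericRecord> USERS =
            new OutputTag<>("users", TypeInformation.of(GenericRecord.class));

    @Override
    public void processElement(GenericRecord record, Context ctx, Collector<GenericRecord> out) {
        // Route on the full name of the record's own schema.
        String schemaName = record.getSchema().getFullName();
        if (schemaName.endsWith("Order")) {
            ctx.output(ORDERS, record);
        } else if (schemaName.endsWith("User")) {
            ctx.output(USERS, record);
        } else {
            out.collect(record); // records with unexpected schemas stay on the main output
        }
    }
}

Then routed = stream.process(new SchemaRouter()) gives a SingleOutputStreamOperator whose side outputs (routed.getSideOutput(SchemaRouter.ORDERS), etc.) are ordinary DataStreams, each of which can be connected to a Hudi sink configured with the corresponding schema.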

Related

Flink Table from Kafka source dynamically

I want to write a generic streaming job that, based on the deployed config file (Kafka topic and SQL query), reads from the configured Kafka source topic as a table and then executes the configured query. The code needs to be generic enough to support any of our Kafka topics.
For example, deploy the flink job with
config.yaml
source.topic: user
sql.query: select name,address from user where state = 'TX'
vs another config
config.yaml
source.topic: booking
sql.query: select booking_amount from booking where product_type = 'book'
We use Confluent Schema Registry, so the schema is available at runtime.
From the docs I did not see how to generate the table dynamically on startup, other than writing a utility method to generate the CREATE TABLE DDL from the Confluent schema. Is such a utility already available somewhere?
Another option I was looking at was first creating a DataStream from the Kafka topic and then converting it to a Flink table. The DataStream approach does work with Avro GenericRecord, e.g.
KafkaSource<GenericRecord> source = KafkaSource.<GenericRecord>builder()
        .setTopics(egspConfigAdapter.getSourceTopic())
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(ConfluentRegistryAvroDeserializationSchema.forGeneric(
                egspConfigAdapter.getSourceSchema(), egspConfigAdapter.getSourceSchemaRegistryUrl()))
        .setProperties(egspConfigAdapter.consumerProperties())
        .build();
But then when converting the DataStream to a table, the schema needs to be specified for the query to work properly. Is there a utility helper to convert an Avro Schema to a Flink Schema?
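For what it's worth, one possible (unofficial) bridge is flink-avro's AvroSchemaConverter, which can derive a Row type from an Avro schema. A sketch, assuming a flat record schema (the class name AvroToTable is made up; unions, nested records, and logical types would need more careful mapping):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public final class AvroToTable {

    public static Table toTable(StreamTableEnvironment tEnv,
                                DataStream<GenericRecord> stream,
                                Schema avroSchema) {
        // flink-avro derives a Row type (field names + types) from the Avro schema string.
        TypeInformation<Row> converted = AvroSchemaConverter.convertToTypeInfo(avroSchema.toString());
        RowTypeInfo rowType = (RowTypeInfo) converted;
        String[] fieldNames = rowType.getFieldNames();

        // Map each GenericRecord to a Row in the derived field order.
        DataStream<Row> rows = stream.map(record -> {
            Row row = new Row(fieldNames.length);
            for (int i = 0; i < fieldNames.length; i++) {
                Object value = record.get(fieldNames[i]);
                // Avro strings arrive as Utf8; convert them to java.lang.String.
                row.setField(i, value instanceof CharSequence ? value.toString() : value);
            }
            return row;
        }).returns(rowType);

        // fromDataStream picks up column names and types from the Row type information.
        return tEnv.fromDataStream(rows);
    }
}

AvroSchemaConverter.convertToDataType(...) is another option if you would rather work with the newer DataType-based schema instead of TypeInformation.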

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help: I want to transfer data in Snowflake from staging tables to fact tables automatically, whenever data is available in the staging table. While moving data from the staging tables to the fact tables, I have a couple of custom validations on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, could you please suggest an approach?
Thanks in advance!
There are many ways to do this, and how you go about it depends on what tools you have available. The simplest way to do it without using tools outside of the Snowflake ecosystem would be:
On each of your staging tables, set up a stream (here is the Snowflake documentation on streams)
Create a task that runs on a schedule (here is the Snowflake documentation on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. Here is some more documentation on building SCD type 2 dimensions, also written by someone at Snowflake.
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.

Kafka connect many to many tables in MSSQL

I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've set up a database with one of our project databases in it. So far so good.
As the next step, I configured Kafka Connect with the following properties:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "updated_at,created_at",
  "incrementing.column.name": "id",
  "dialect.name": "SqlServerDatabaseDialect",
  "validate.non.null": "false",
  "tasks.max": "1",
  "mode": "timestamp+incrementing",
  "topic.prefix": "mssql-jdbc-",
  "poll.interval.ms": "10000"
}
While this works for the majority of my tables, where I have an ID and a created_at/updated_at field, it won't work for the tables where I solved my many-to-many relationships with a junction table and a composite key. Note that I'm using the generic JDBC configuration with a JDBC driver from Microsoft.
Is there a way to configure Kafka Connect for these special cases?
Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns.
As @cricket_007 says, you can use the query option to pull back the results of a query, which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself just issues a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.
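For illustration only, a sketch of what a query-based config could look like for one of the join tables (table, column, and topic names are made up; note that when query is set, topic.prefix is used as the complete topic name, the timestamp column must appear in the query's result set, and the connector docs should be checked for how query interacts with the incremental modes):
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "dialect.name": "SqlServerDatabaseDialect",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "query": "SELECT ab.a_id, ab.b_id, ab.updated_at, a.name AS a_name, b.name AS b_name FROM a_b ab JOIN a ON a.id = ab.a_id JOIN b ON b.id = ab.b_id",
  "validate.non.null": "false",
  "tasks.max": "1",
  "topic.prefix": "mssql-jdbc-a_b",
  "poll.interval.ms": "10000"
}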
The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.
Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself, for example with Kafka Streams or KSQL. I've written about the latter a lot here.
You might also find this article useful describing in detail your options for integrating databases with Kafka.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

What steps are performed by a 'Rescan'?

To automatically warehouse documents from Cloudant to dashDB, there is a schema discovery process (SDP) that automates the data migration for you. When using the SDP to warehouse documents from Cloudant to dashDB, there is an option 'Rescan'.
I have used 'Rescan' a number of times, but am unclear on the steps it actually performs. What steps are performed by a 'Rescan'? E.g.
Drop tables in the dashDB target schema? Which tables?
Scan Cloudant source database?
Recreate the target schema?
...
The steps are pretty much as you suggested. Rescan will
Inspect the previously discovered JSON schema and remove all tables from the dashDB instance created for that load (leaving any user defined tables untouched)
Re-discover the JSON schema again using the current settings (including sample size, type of discovery algorithm etc.)
Create the new tables into the same dashDB target
Populate the newly created tables with data from Cloudant
Subscribe to the _changes feed from Cloudant to continuously synchronize document changes with dashDB
All steps (except for the first) are identical for the initial load as well as the rescan function.
The main motivation for a rescan is to support schema evolution. Whenever the document structure in a Cloudant source database changes, a user can make a conscious decision to drop and re-create the dashDB tables using this rescan function. SDP won't automate that process to avoid potential conflicts with applications depending on the existing dashDB tables.

Using Google Cloud Dataflow for merging flat files and importing into Cloud SQL

We have to read data from CSV files, merge two files on one column, and push the data to Cloud SQL using Google Cloud Dataflow.
We are able to read data from the CSV files but are stuck on the next steps. Please provide information or links regarding the following:
Merging/joining two flat files based on one column, or on a condition involving multiple columns
Copying the merged PCollection into a Cloud SQL database
Here are some pointers that may be helpful:
https://cloud.google.com/dataflow/model/joins describes the ways to join PCollections in Dataflow
There is currently no built-in sink for writing to Cloud SQL. You can either process the results of your join with a ParDo that writes each record individually or in batches (flushing periodically or in finishBundle()), or, if your needs are more complex than that, consider writing a Cloud SQL sink - see https://cloud.google.com/dataflow/model/sources-and-sinks
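To make the second pointer more concrete, here is a rough sketch written against the Apache Beam Java SDK (which differs slightly from the older Dataflow 1.x SDK this answer refers to). The file paths, column layout, JDBC URL, credentials, and table names are placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CsvJoinToCloudSql {

  static final TupleTag<String> LEFT = new TupleTag<String>() {};
  static final TupleTag<String> RIGHT = new TupleTag<String>() {};

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Key each CSV line by its join column (first column here, as an example).
    PCollection<KV<String, String>> left =
        p.apply("ReadLeft", TextIO.read().from("gs://my-bucket/left.csv"))
         .apply(MapElements.into(
                 TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
             .via((String line) -> KV.of(line.split(",")[0], line)));
    PCollection<KV<String, String>> right =
        p.apply("ReadRight", TextIO.read().from("gs://my-bucket/right.csv"))
         .apply(MapElements.into(
                 TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
             .via((String line) -> KV.of(line.split(",")[0], line)));

    // Join the two collections on the key, then write each matched pair over JDBC.
    KeyedPCollectionTuple.of(LEFT, left).and(RIGHT, right)
        .apply(CoGroupByKey.<String>create())
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, Void>() {
          private transient Connection conn;

          @Setup
          public void setup() throws Exception {
            // Placeholder connection details; requires the MySQL JDBC driver on the classpath.
            conn = DriverManager.getConnection(
                "jdbc:mysql://CLOUD_SQL_IP:3306/mydb", "user", "password");
          }

          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            KV<String, CoGbkResult> e = c.element();
            for (String l : e.getValue().getAll(LEFT)) {
              for (String r : e.getValue().getAll(RIGHT)) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO joined (k, left_row, right_row) VALUES (?, ?, ?)")) {
                  ps.setString(1, e.getKey());
                  ps.setString(2, l);
                  ps.setString(3, r);
                  ps.executeUpdate();
                }
              }
            }
          }

          @Teardown
          public void teardown() throws Exception {
            if (conn != null) conn.close();
          }
        }));

    p.run();
  }
}

For higher throughput you would batch the inserts and flush them periodically or in a FinishBundle method, as the answer suggests.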
