In Spark we can use inferSchema to read the schema dynamically from a file, e.g.:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(delimiter='|', header='true', inferschema='true') \
    .load('cars.csv')
Is there a way to do the same in Flink?
Flink has no built-in support for automatic schema inference from CSV files.
You could implement such functionality on top of it by analyzing the first rows of a CSV file and generating a corresponding CsvTableSource, for example along the lines of the sketch below.
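A rough sketch of that idea, assuming the legacy CsvTableSource API (depending on the Flink version, Builder.field() takes a TypeInformation or a DataType) and a deliberately naive type-guessing heuristic; the class and method names are made up:

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.sources.CsvTableSource;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class CsvSchemaGuesser {

    // Reads the header and the first data row, guesses a type per column,
    // and builds a CsvTableSource with the inferred schema.
    public static CsvTableSource inferCsvSource(String path, String delimiter) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(path));
        String[] names = lines.get(0).split(Pattern.quote(delimiter));
        String[] sample = lines.get(1).split(Pattern.quote(delimiter));

        CsvTableSource.Builder builder = CsvTableSource.builder()
                .path(path)
                .fieldDelimiter(delimiter)
                .ignoreFirstLine(); // the header row only provides column names

        for (int i = 0; i < names.length; i++) {
            builder.field(names[i].trim(), guessType(sample[i].trim()));
        }
        return builder.build();
    }

    // Very naive inference: integer, floating point, otherwise string.
    private static TypeInformation<?> guessType(String value) {
        if (value.matches("-?\\d+")) {
            return Types.LONG;
        }
        if (value.matches("-?\\d*\\.\\d+")) {
            return Types.DOUBLE;
        }
        return Types.STRING;
    }
}

Inspecting more than one data row (and handling empty fields, quoting, dates, etc.) would make the guess more robust, but the basic shape stays the same.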
Related
So I have a Kafka topic which contains Avro records with different schemas. I want to consume from that Kafka topic in Flink and create a DataStream of Avro GenericRecord (this part is done).
Now I want to write that data to Hudi using the schema extracted from the DataStream. But since the Hudi pipeline/writer takes a config with a predefined Avro schema up front, I can't do that.
A probable solution is to create a keyed stream based on a key that identifies one type of schema, then extract the schema from it and create a dynamic Hudi pipeline based on that.
I'm not sure about the last part, whether that is possible.
A-->B-->C
where A is a stream of generic Avro records with different schemas, B is the stream partitioned by schema, and C uses the schema carried inside stream B to create a config and pass it to the Hudi pipeline writer function.
I think I just answered this same question (but for Parquet) yesterday, but I can't find it now :) Anyway, if you know the different schemas in advance (assumed so, otherwise how do you deserialize the incoming Kafka records?) then you can use a ProcessFunction with multiple side outputs, where you split the stream by schema, and then connect each schema-specific stream to its own Hudi sink.
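A minimal sketch of that pattern, assuming input is the DataStream<GenericRecord> coming from the Kafka source; the schema names and tags are made up:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// one side output per known schema (hypothetical schema names)
final OutputTag<GenericRecord> schemaATag = new OutputTag<GenericRecord>("schema-a") {};
final OutputTag<GenericRecord> schemaBTag = new OutputTag<GenericRecord>("schema-b") {};

SingleOutputStreamOperator<GenericRecord> routed = input.process(
        new ProcessFunction<GenericRecord, GenericRecord>() {
            @Override
            public void processElement(GenericRecord record, Context ctx, Collector<GenericRecord> out) {
                String schemaName = record.getSchema().getFullName();
                if ("com.example.SchemaA".equals(schemaName)) {
                    ctx.output(schemaATag, record);
                } else if ("com.example.SchemaB".equals(schemaName)) {
                    ctx.output(schemaBTag, record);
                } else {
                    out.collect(record); // unknown schemas stay on the main output
                }
            }
        });

// each schema-specific stream can then be wired to its own Hudi sink,
// configured up front with that schema
DataStream<GenericRecord> schemaAStream = routed.getSideOutput(schemaATag);
DataStream<GenericRecord> schemaBStream = routed.getSideOutput(schemaBTag);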
I want to write a generic streaming job where, based on the deployed config file (Kafka topic and SQL query), it reads the configured Kafka source topic as a table and then executes the configured query. That way the code stays generic enough to support any of our Kafka topics.
For example, deploy the Flink job with
config.yaml
source.topic: user
sql.query: select name,address from user where state = 'TX'
vs another config
config.yaml
source.topic: booking
sql.query: select booking_amount from booking where product_type = 'book'
We use Confluent schema registry so the schema is available at runtime.
From the docs I did not see how to generate the table dynamically on startup, other than writing some utility method that generates the CREATE TABLE DDL from the Confluent schema. Is such a utility already available somewhere?
Another option I was looking at was first creating a DataStream from the Kafka topic and then converting it to a Flink table. The DataStream does work with Avro GenericRecord, e.g.:
KafkaSource<GenericRecord> source = KafkaSource.<GenericRecord>builder()
.setTopics(egspConfigAdapter.getSourceTopic())
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(ConfluentRegistryAvroDeserializationSchema.forGeneric(egspConfigAdapter.getSourceSchema(), egspConfigAdapter.getSourceSchemaRegistryUrl()))
.setProperties(egspConfigAdapter.consumerProperties())
.build();
But then, when converting the DataStream to a table, the schema needs to be specified for the query to work properly. Is there a utility helper to convert an Avro Schema to a Flink Schema?
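One building block that does exist: the flink-avro module ships AvroSchemaConverter, which can translate an Avro schema (as a JSON string) into Flink's type system. A hedged sketch, reusing the egspConfigAdapter names from the snippet above; whether the result can be handed straight to fromDataStream for a GenericRecord stream depends on the Flink version, so you may still have to map the records to Row first:

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.table.types.DataType;
import org.apache.flink.types.Row;

// the Avro schema fetched from the registry (same one used by the deserializer above)
org.apache.avro.Schema avroSchema = egspConfigAdapter.getSourceSchema();

// row-oriented type descriptions derived from the Avro schema
DataType rowDataType = AvroSchemaConverter.convertToDataType(avroSchema.toString());
TypeInformation<Row> rowTypeInfo = AvroSchemaConverter.convertToTypeInfo(avroSchema.toString());

The row DataType can then be used to declare the table's columns when registering the stream; the exact wiring into fromDataStream is version-dependent, so treat this as a starting point rather than a confirmed recipe.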
Please let me know whether the Snowflake Spark connector has the ability to create a Snowflake external table.
The Spark connector does support DDL statements, and CREATE EXTERNAL TABLE is a DDL statement:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
I'm not sure how you can create external tables with the Spark connector, but what I usually do is create a stage in Snowflake backed by Blob Storage or an S3 bucket, which you can then work with like a local file. It can be used in your queries through the Spark connector, and any new file on the stage (Blob Storage/S3 bucket) becomes available through a query such as:
SELECT * FROM @"STAGE_NAME"/example.json;
I'm not sure if this is of any help, since I don't know how you are trying to apply it, but if it is, I'll be glad to put an example here.
I'm new to Spark. From an input stream I got a DataFrame, but I don't understand whether a DataFrame is like a relational table. How can I save the input stream to my distributed file system?
Is a DataFrame enough to do this?
Thanks
Spark is volatile storage, i.e. it keeps all the data in memory. While the data is in memory you can query it using the Spark APIs or SQL, but all of it has to be reloaded by the next Spark job.
For persistence you can save your Spark DataFrames as Parquet files on persistent storage and query them with Spark or Hive.
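A minimal sketch of that, using the Java Spark API; the paths are placeholders, and the small range() DataFrame simply stands in for whatever DataFrame the stream produced (a streaming DataFrame would use writeStream with a checkpoint location instead):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("persist-example").getOrCreate();

// stand-in for the DataFrame obtained from the input stream
Dataset<Row> df = spark.range(10).toDF("id");

// persist it as Parquet on the distributed file system
df.write().mode(SaveMode.Append).parquet("hdfs:///data/events_parquet");

// later (in the same job or another one) the persisted data can be read back and queried
Dataset<Row> persisted = spark.read().parquet("hdfs:///data/events_parquet");
persisted.createOrReplaceTempView("events");
spark.sql("SELECT count(*) FROM events").show();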
No, you can't use Spark as a database. Spark is a distributed processing engine. You can use HDFS for storing a DataFrame, and you can also use Hive, HBase, etc.
We have to read data from CSV files, join the two files on one column, and push the data to Cloud SQL using Google Cloud Dataflow.
We are able to read data from the CSV files but are stuck on the next steps. Please provide information or links regarding the following:
Merging/joining the two flat files based on one column, or on a condition over multiple columns
Copying the merged PCollection into a Cloud SQL database
Here are some pointers that may be helpful:
https://cloud.google.com/dataflow/model/joins describes the ways to join PCollections in Dataflow.
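As a concrete example of the join described there, a CoGroupByKey over the two inputs (each keyed by the join column) looks roughly like this; it is written against the Apache Beam Java SDK class names, and the element types and the leftKeyed/rightKeyed collections are assumptions:

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// both CSV files parsed and keyed by the join column: PCollection<KV<String, String>>
final TupleTag<String> leftTag = new TupleTag<>();
final TupleTag<String> rightTag = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(leftTag, leftKeyed)
                .and(rightTag, rightKeyed)
                .apply(CoGroupByKey.create());

// per key, the result exposes the matching records from each file:
//   result.getAll(leftTag) and result.getAll(rightTag)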
There is currently no built-in sink for writing to Cloud SQL. However, you can process the results of your join with a ParDo that writes each record individually or in batches (flushing periodically or in finishBundle()) - or, if your needs are more complex than that, consider writing a Cloud SQL sink - see https://cloud.google.com/dataflow/model/sources-and-sinks
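A sketch of the ParDo-with-batching idea in Beam-style Java (the JDBC URL, credentials, table, and element type are placeholders, and error handling is omitted):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Writes joined records to Cloud SQL over JDBC, flushing once per bundle.
class WriteToCloudSqlFn extends DoFn<KV<String, String>, Void> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Setup
    public void setup() throws Exception {
        // hypothetical Cloud SQL connection details
        connection = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:3306/mydb", "user", "password");
        statement = connection.prepareStatement(
                "INSERT INTO merged_table (join_key, payload) VALUES (?, ?)");
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        statement.setString(1, c.element().getKey());
        statement.setString(2, c.element().getValue());
        statement.addBatch(); // buffer rather than issuing one INSERT per element
    }

    @FinishBundle
    public void finishBundle() throws Exception {
        statement.executeBatch(); // flush the buffered rows at the end of the bundle
    }

    @Teardown
    public void teardown() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}

(In the older Dataflow 1.x SDK the same idea is expressed by overriding startBundle/finishBundle instead of using the annotations.)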