How to convert DataStream[Row] into Table? - apache-flink

In the Java environment, DataStream supports the returns() method for specifying type information. But in the Scala environment, org.apache.flink.streaming.api.scala.DataStream doesn't have returns().
How do I convert a Scala DataStream[Row] into a Table?

According to the Table API documentation, you can use:
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
// DataStream of Row with two fields "name" and "age" specified in `RowTypeInfo`
val stream: DataStream[Row] = ...
// convert DataStream into Table with default field names "name", "age"
val table: Table = tableEnv.fromDataStream(stream)
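The part the Scala API handles differently is the type information itself. The Scala DataStream API has no returns(); instead, its sources and operators take an implicit TypeInformation parameter, so a RowTypeInfo placed in implicit scope plays the same role as Java's returns(rowTypeInfo). Below is a minimal sketch, assuming a recent Flink version with the Scala bridge and the same "name"/"age" fields as above; the sample rows are made up.
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// Field names and types of the Row stream; this implicit TypeInformation is
// what the Java API's returns(...) call would otherwise supply.
implicit val rowTypeInfo: TypeInformation[Row] = new RowTypeInfo(
  Array[TypeInformation[_]](Types.STRING, Types.INT),
  Array("name", "age"))

// Hypothetical source; any operator producing Row picks up rowTypeInfo implicitly.
val stream: DataStream[Row] = env.fromElements(
  Row.of("alice", Int.box(30)),
  Row.of("bob", Int.box(25)))

// Because the stream carries a RowTypeInfo, fromDataStream sees "name" and "age".
val table: Table = tableEnv.fromDataStream(stream)
table.printSchema()
If the Row stream comes out of a transformation such as map, the same implicit is picked up there; alternatively, the RowTypeInfo can be passed explicitly as the operator's second argument list, e.g. .map(f)(rowTypeInfo).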

Related

Error converting Flink DataStream to Table after a ProcessFunction call

I'm implementing a data analysis pipeline in Flink and I have a problem converting a DataStream to a Table. I have this table defined from a join between two Kafka sources:
Table legalFileEventsTable = legalFilesTable.join(eventsTable)
.where($("id").isEqual($("id_fascicolo")))
.select(
$("id").as("id_fascicolo"),
$("id_evento"),
$("giudice"),
$("nrg"),
$("codice_oggetto"),
$("ufficio"),
$("sezione"),
$("data_evento"),
$("evento"),
$("data_registrazione_evento")
);
Then I convert the joined table to a DataStream to apply some computation on the data. Here's the code I'm using:
DataStream<Row> phasesDurationsDataStream = tEnv.toChangelogStream(legalFileEventsTable)
.keyBy(r -> r.<Long>getFieldAs("id_fascicolo"))
.process(new PhaseDurationCounterProcessFunction());
phasesDurationsDataStream.print();
The PhaseDurationCounterProcessFunction emits a Row like this:
Row outputRow = Row.withNames(RowKind.INSERT);
outputRow.setField("id_fascicolo", currentState.getId_fascicolo());
outputRow.setField("nrg", currentState.getNrg());
outputRow.setField("giudice", currentState.getGiudice());
outputRow.setField("codice_oggetto", currentState.getCodice_oggetto());
outputRow.setField("ufficio", currentState.getUfficio());
outputRow.setField("sezione", currentState.getSezione());
outputRow.setField("fase", currentState.getPhase());
outputRow.setField("fase_completata", false);
outputRow.setField("durata", currentState.getDurationCounter());
out.collect(outputRow);
After collecting the results from the process function I reconvert the DataStream to a Table and execute the pipeline:
Table phasesDurationsTable = tEnv.fromChangelogStream(
phasesDurationsDataStream,
Schema.newBuilder()
.column("id_fascicolo", DataTypes.BIGINT())
.column("nrg", DataTypes.STRING())
.column("giudice", DataTypes.STRING())
.column("codice_oggetto", DataTypes.STRING())
.column("ufficio", DataTypes.STRING())
.column("sezione", DataTypes.STRING())
.column("fase", DataTypes.STRING())
.column("fase_completata", DataTypes.BOOLEAN())
.column("durata", DataTypes.BIGINT())
.primaryKey("id_fascicolo", "fase")
.build(),
ChangelogMode.upsert()
);
env.execute();
But during startup I receive this exception:
Unable to find a field named 'id_fascicolo' in the physical data type derived from the given type information for schema declaration. Make sure that the type information is not a generic raw type. Currently available fields are: [f0]
It seems that the row information (names and types) isn't available yet, so the exception is thrown. I tried invoking env.execute() before the DataStream-to-Table conversion; in that case the job starts, but I get no output when I print phasesDurationsTable. Any suggestions on how to make this work?
You need to specify the correct type information for Flink because it cannot figure out the schema from the generic Row type.
Here is an example. Suppose our data stream produces records like this:
Row exampleRow = Row.ofKind(RowKind.INSERT, "sensor-1", 32L);
We need to define the correct type information in the following way:
TypeInformation<?>[] types = {
BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.LONG_TYPE_INFO
};
Using this we can define the row type information:
RowTypeInfo rowTypeInfo = new RowTypeInfo(
types,
new String[]{"sensor_name", "temperature"}
);
The last step is to specify the return type of this DataStream:
DataStream<Row> stream = env.fromSource(...).returns(rowTypeInfo)
Note that when using fromChangelogStream you only need to provide a schema if you want to use a different one from the type info returned by the DataStream, so tEnv.fromChangelogStream(stream) works just fine and will use sensor_name and temperature as the schema by default.

Is there a way to upload the byte type created with zlib to Google BigQuery?

I want to load string data, compressed with Python's zlib library, into BigQuery.
Here is an example code that uses zlib to generate data:
import zlib
import pandas as pd
string = 'abs'
df = pd.DataFrame()
data = zlib.compress(bytearray(string, encoding='utf-8'), -1)
df = df.append({'id': 1, 'data': data}, ignore_index=True)
I've also tried both the BigQuery API and pandas_gbq, but both of them give me an error.
The schema is:
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("id", bigquery.enums.SqlTypeNames.NUMERIC),
bigquery.SchemaField("data", bigquery.enums.SqlTypeNames.BYTES),
],
write_disposition = "WRITE_APPEND"
)
Examples of methods I have tried are:
1. bigquery API
job = bigquery_client.load_table_from_dataframe(
df, table, job_config=job_config
)
job.result()
2. pandas_gbq
df.to_gbq(destination_table, project_id, if_exists='append')
However, both give similar errors.
1. error
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)
2. error
pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Is there any way to solve this?
I want to store a Python bytestring as BigQuery BYTES data.
Thank you
The problem isn't coming from the insertion of your zlib-compressed data. The error occurs when inserting the value of your id key (the DataFrame value 1) into the NUMERIC data type in BigQuery.
The easiest solution is to change the data type of that column in your BigQuery schema from NUMERIC to INTEGER.
However, if you really need the schema to use the NUMERIC data type, you can convert the DataFrame value 1 in your Python code using the decimal library (as derived from this SO post) before loading it to BigQuery.
You may refer to the sample code below.
from google.cloud import bigquery
import pandas
import zlib
import decimal
# Construct a BigQuery client object.
client = bigquery.Client()
# Set table_id to the ID of the table to create.
table_id = "my-project.my-dataset.my-table"
string = 'abs'
df = pandas.DataFrame()
data = zlib.compress(bytearray(string, encoding='utf-8'), -1)
record = df.append({'id' : 1, 'data' : data}, ignore_index=True)
df_2 = pandas.DataFrame(record)
df_2['id'] = df_2['id'].astype(str).map(decimal.Decimal)
dataframe = pandas.DataFrame(
df_2,
# In the loaded table, the column order reflects the order of the
# columns in the DataFrame.
columns=[
"id",
"data",
],
)
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("id", bigquery.enums.SqlTypeNames.NUMERIC),
bigquery.SchemaField("data", bigquery.enums.SqlTypeNames.BYTES),
],
write_disposition = "WRITE_APPEND"
)
job = client.load_table_from_dataframe(
dataframe, table_id, job_config=job_config
) # Make an API request.
job.result() # Wait for the job to complete.
table = client.get_table(table_id) # Make an API request.
print(
"Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
)
)

Load JSON Data into Snowflake table

My data is as follows:
[ {
"InvestorID": "10014-49",
"InvestorName": "Blackstone",
"LastUpdated": "11/23/2021"
},
{
"InvestorID": "15713-74",
"InvestorName": "Bay Grove Capital",
"LastUpdated": "11/19/2021"
}]
So far I have tried:
CREATE OR REPLACE TABLE STG_PB_INVESTOR (
Investor_ID string, Investor_Name string, Last_Updated DATETIME
);
created table
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
created file format
create or replace stage investor_stage
file_format = investorformat;
created stage
copy into STG_PB_INVESTOR from @investor_stage
I am getting an error:
SQL compilation error: JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
You should be loading your JSON data into a table with a single column that is a VARIANT. Once in Snowflake you can either flatten that data out with a view or a subsequent table load. You could also flatten it on the way in using a SELECT on your COPY statement, but that tends to be a little slower.
Try something like this:
CREATE OR REPLACE TABLE STG_PB_INVESTOR_JSON (
var variant
);
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR_JSON from @investor_stage;
create or replace table STG_PB_INVESTOR as
SELECT
var:InvestorID::string as Investor_id,
var:InvestorName::string as Investor_Name,
TO_DATE(var:LastUpdated::string,'MM/DD/YYYY') as last_updated
FROM STG_PB_INVESTOR_JSON;

Flink SQL: UDTF passes Row type parameters

CREATE TABLE user_log (
data ROW(id String,user_id String,class_id String)
) WITH (
'connector.type' = 'kafka',
...
);
INSERT INTO sink
SELECT * FROM user_log as tab,
LATERAL TABLE(splitUdtf(tab.data)) AS T(a,b,c);
UDTF Code:
public void eval(Row data) {...}
Can the eval method only take Row type parameters? I want to access the fields of the Row by key in SQL, such as id, user_id, class_id, but in Java the fields of a Row are accessed by index (such as 0, 1, 2). How do I do it? Thank you!
Is your SQL able to convert the Kafka data directly into a table Row? Maybe not.
Row is a type at the DataStream level, not a type in the Table API & SQL.
If the data you receive from Kafka is in JSON format, you can use a DDL statement in Flink SQL, or the connector API, to extract the JSON fields directly, as long as your JSON is in key-value format (see the sketch below).
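To make that concrete, here is a minimal sketch of such a DDL, with each JSON key declared as a top-level column so SQL can reference id, user_id and class_id by name instead of unpacking a nested ROW. The snippet is wrapped in executeSql only to keep it self-contained; the topic and broker address are placeholders, and it assumes a recent Flink version with the current Kafka SQL connector and the built-in JSON format rather than the legacy 'connector.type' options.
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

// Declare the JSON keys as named columns; the JSON format maps them by key,
// so there is no nested ROW to unpack inside the UDTF.
tableEnv.executeSql(
  """CREATE TABLE user_log (
    |  id STRING,
    |  user_id STRING,
    |  class_id STRING
    |) WITH (
    |  'connector' = 'kafka',
    |  'topic' = 'user_log',
    |  'properties.bootstrap.servers' = 'localhost:9092',
    |  'format' = 'json'
    |)""".stripMargin)
With a flat schema like this, one option is to call the UDTF with the individual columns, e.g. LATERAL TABLE(splitUdtf(id, user_id, class_id)), and declare eval(String id, String userId, String classId) on the Java side.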

Apache Spark: Type conversion problem on write using JDBC driver to SQL Server / Azure DWH for column of BINARY type

My initial goal is to save UUID values to SQL Server / Azure DWH in a column of type BINARY(16).
For example, I have demo table:
CREATE TABLE [Events] ([EventId] [binary](16) NOT NULL)
I want to write data to it using Spark like this:
import java.nio.{ByteBuffer, ByteOrder}
import java.util.UUID
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}
import spark.implicits._
val uuid = UUID.randomUUID()
val uuidBytes = Array.ofDim[Byte](16)
ByteBuffer.wrap(uuidBytes)
.order(ByteOrder.BIG_ENDIAN)
.putLong(uuid.getMostSignificantBits())
.putLong(uuid.getLeastSignificantBits())
val schema = StructType(
List(
StructField("EventId", BinaryType, false)
)
)
val data = Seq((uuidBytes)).toDF("EventId").rdd;
val df = spark.createDataFrame(data, schema);
df.write
.format("jdbc")
.option("url", "<DATABASE_CONNECTION_URL>")
.option("dbTable", "Events")
.mode(org.apache.spark.sql.SaveMode.Append)
.save()
This code returns an error:
java.sql.BatchUpdateException: Conversion from variable or parameter type VARBINARY to target column type BINARY is not supported.
My question is how to cope with this situation and insert UUId value to BINARY(16) column?
My investigation:
Spark uses the concept of JdbcDialects and has a mapping from each Catalyst type to a database type and vice versa. For example, there is MsSqlServerDialect, which is used when we work against SQL Server or Azure DWH. In its getJDBCType method you can see the mapping:
case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY))
and this, I think, is the root of my problem.
So I decided to implement my own JdbcDialect to override this behavior:
class SqlServerDialect extends JdbcDialect {
override def canHandle(url: String) : Boolean = url.startsWith("jdbc:sqlserver")
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case BinaryType => Option(JdbcType("BINARY(16)", java.sql.Types.BINARY))
case _ => None
}
}
val dialect = new SqlServerDialect
JdbcDialects.registerDialect(dialect)
With this modification I still get exactly the same error. It looks like Spark does not use the mapping from my custom dialect, even though I checked that the dialect is registered. So it is a strange situation.
