I have a Kafka topic that receives an array with multiple objects in it, as shown below.
[{"Id":2318805,"Booster Station":"Comanche County #1","TimeStamp":"2021-09-30T23:53:43.019","Total Throughput":2167.52856445125},{"Id":2318805,"Booster Station":"Comanche County #2","TimeStamp":"2020-09-30T23:53:43.019","Total Throughput":217.52856445125}]
When I load this into Snowflake, it becomes one huge row containing all the objects. I would like to store each object as an individual row in Snowflake. How can I achieve this? I am open to tweaking things at the Kafka level or in the connector.
My Kafka is AWS MSK, and I am using the Snowflake connector plugin to load the data into Snowflake.
You can use Snowflake's FLATTEN table function to expand the array into individual rows:
create or replace temp table T1 as
select parse_json($$[{"Id":2318805,"Booster Station":"Comanche County #1","TimeStamp":"2021-09-30T23:53:43.019","Total Throughput":2167.52856445125},
{"Id":2318805,"Booster Station":"Comanche County #2","TimeStamp":"2020-09-30T23:53:43.019","Total Throughput":217.52856445125}]$$) as JSON;

select VALUE from T1, table(flatten(JSON));
This assumes that the Kafka messages are stored as the VARIANT type. If they are strings, you can use the PARSE_JSON function to convert them to VARIANT.
From there, you can convert the individual objects to columns if you want:
select VALUE:"Booster Station"::string as BOOSTER_STATION
,VALUE:Id::int as ID
,VALUE:TimeStamp::timestamp as TIME_STAMP
,VALUE:"Total Throughput"::float as TOTAL_THROUGHPUT
from T1, table(flatten(JSON));
BOOSTER_STATION      ID       TIME_STAMP                     TOTAL_THROUGHPUT
Comanche County #1   2318805  2021-09-30 23:53:43.019000000  2167.528564451
Comanche County #2   2318805  2020-09-30 23:53:43.019000000  217.528564451
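If you keep loading through the connector, the same FLATTEN logic can be wrapped in a view over the landing table, so each array element is exposed as a row. A minimal sketch, assuming the connector wrote each message's array into the default RECORD_CONTENT variant column of a table here called KAFKA_LANDING (the table and view names are illustrative):
create or replace view booster_rows as
select
     f.value:"Id"::int                  as ID
    ,f.value:"Booster Station"::string  as BOOSTER_STATION
    ,f.value:"TimeStamp"::timestamp     as TIME_STAMP
    ,f.value:"Total Throughput"::float  as TOTAL_THROUGHPUT
from KAFKA_LANDING  -- assumed landing table created by the connector
    ,lateral flatten(input => RECORD_CONTENT) f;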
I have set up the Snowflake Kafka connector. I set up a sample table (kafka_connector_test) in Snowflake with two fields, both of VARCHAR type.
The fields are CUSTOMER_ID and PURCHASE_ID.
Here is the configuration that I created for the connector:
curl -X POST \
  -H "Content-Type: application/json" \
  --data '{
    "name": "kafka_connector_test",
    "config": {
      "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
      "tasks.max": "2",
      "topics": "kafka-connector-test",
      "snowflake.topic2table.map": "kafka-connector-test:kafka_connector_test",
      "buffer.count.records": "10000",
      "buffer.flush.time": "60",
      "buffer.size.bytes": "5000000",
      "snowflake.url.name": "XXXXXXXX.snowflakecomputing.com:443",
      "snowflake.user.name": "XXXXXXXX",
      "snowflake.private.key": "XXXXXXXX",
      "snowflake.database.name": "XXXXXXXX",
      "snowflake.schema.name": "XXXXXXXX",
      "key.converter": "org.apache.kafka.connect.storage.StringConverter",
      "value.converter": "com.snowflake.kafka.connector.records.SnowflakeJsonConverter"
    }
  }'
I send data to the topic that I have configured in the connector configuration.
{"CUSTOMER_ID" : "test_id", "PURCHASE_ID" : "purchase_id_test"}
Then, when I check the Kafka Connect server, I get the error below:
[SF KAFKA CONNECTOR] Detail: Table doesn't have a compatible schema
Is there something I need to set up in either Kafka Connect or Snowflake that says which parts of the JSON go into which columns of the table? I'm not sure how to specify how it parses the JSON.
I set up a different topic as well and didn't create a table in Snowflake. With that one I was able to populate the table, but the connector creates it with two columns, RECORD_METADATA and RECORD_CONTENT. I don't want to write a scheduled job to parse this; I want to insert directly into a queryable table.
The Snowflake Kafka connector writes data as JSON by design. The default columns RECORD_METADATA and RECORD_CONTENT are VARIANT. If you want to query them, you can create a view on top of the table to achieve your goal, and you don't need a scheduled job.
So the table created by the connector would look something like:
RECORD_METADATA              RECORD_CONTENT
{metadata fields in json}    {"CUSTOMER_ID" : "test_id", "PURCHASE_ID" : "purchase_id_test"}
You can create a view to expose your data:
create view v1 as
select RECORD_CONTENT:CUSTOMER_ID::text CUSTOMER_ID,
       RECORD_CONTENT:PURCHASE_ID::text PURCHASE_ID
from kafka_connector_test;
Your query would then be:
select CUSTOMER_ID, PURCHASE_ID from v1;
PS: If you want to create your own tables, you need to use VARIANT as the data type instead of VARCHAR.
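A minimal sketch of such a user-created table (the connector expects exactly these two VARIANT columns):
create table kafka_connector_test (
    RECORD_METADATA variant,  -- Kafka metadata (topic, partition, offset, ...) as JSON
    RECORD_CONTENT  variant   -- the message body as JSON
);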
Also, it looks like this is not supported at this time, per this GitHub issue.
CREATE TABLE user_log (
data ROW(id String,user_id String,class_id String)
) WITH (
'connector.type' = 'kafka',
...
);
INSERT INTO sink
SELECT * FROM user_log as tab,
LATERAL TABLE(splitUdtf(tab.data)) AS T(a,b,c);
UDTF Code:
public void eval(Row data) {...}
Can the eval method only accept Row-type parameters? I want to access the fields of the Row by key in SQL, such as id, user_id, class_id, but the fields of a Row in Java are accessed by index (such as 0, 1, 2). How do I do it? Thank you!
Is your SQL able to directly convert Kafka data to a table Row? Maybe not.
Row is the type at the DataStream level, not the type in the Table API & SQL.
If the data you receive from Kafka is in JSON format, you can use a DDL statement in Flink SQL, or use the Connector API, to extract the fields from the JSON directly, as long as your JSON is in key-value format.
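For example, with the legacy DDL style used above, the JSON keys can be mapped straight to columns by name, so no UDTF is needed just to split the fields. A sketch (the topic name and bootstrap servers are placeholder assumptions):
CREATE TABLE user_log (
    id STRING,
    user_id STRING,
    class_id STRING
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'user_log',                               -- assumed topic name
    'connector.properties.bootstrap.servers' = 'localhost:9092',  -- assumed brokers
    'format.type' = 'json',
    'format.derive-schema' = 'true'   -- match JSON keys to the column names above
);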
I am trying to build a data structure in BigQuery using SQL which exactly reflects the data structure which I obtain when uploading JSON. This will enable me to query the view using SQL with dot notation instead of having to UNNEST, which I do understand but many of my clients find extremely confusing and unintuitive.
If I build a really simple dummy dataset with a couple of rows and then nest using the ARRAY_AGG(STRUCT([field list])) pattern:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
)
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description)) AS attributes
FROM flat_table
GROUP BY name, user_count
Then saving and viewing the schema shows that the attributes field is Type = RECORD and Mode = REPEATED. Schema field names are:
name
user_count
attributes
attributes.data_thing
attributes.ease_of_use
attributes.description
If I look at the COLUMN information in the INFORMATION_SCHEMA.COLUMNS query I can see that the attributes field is_nullable = NO and data_type = ARRAY<STRUCT<data_thing STRING, ease_of_use INT64, description STRING>>
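For reference, that column metadata can be read with a query along these lines (mydataset is a placeholder for your dataset name):
SELECT column_name, is_nullable, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'nested_table';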
If I want to query this structure I need to use the UNNEST pattern as below:
SELECT
name,
user_count
FROM
nested_table,
UNNEST(attributes)
WHERE
ease_of_use > 3
However when I upload the following JSON representation of the same data to BigQuery with automatic schema detection:
{"attributes":{"description":"Awesome","ease_of_use":5,"data_thing":"Data Warehouse"},"user_count":23,"name":"BigQuery"}
{"attributes":{"description":"Solid","ease_of_use":3,"data_thing":"Database"},"user_count":12,"name":"MySQL"}
The schema looks nearly identical once loaded, except for the attributes field is Mode = NULLABLE (it is still Type = RECORD). The INFORMATION_SCHEMA.COLUMNS shows me that the attributes field is now is_nullable = YES and data_type = STRUCT<data_thing STRING, ease_of_use INT64, description STRING>, i.e. now nullable and not in an array.
However the most interesting thing for me is that I can now query this table using dot notation instead of the UNNEST pattern, so the query above becomes:
SELECT
name,
user_count
FROM
nested_table_json
WHERE
attributes.ease_of_use > 3
Which is arguably easier to read, even in this trivial case. However once we get to more complex data structures with multiple nested fields and multi-level nesting, the UNNEST pattern becomes extremely difficult to write, QA and debug. The dot notation pattern appears to be much more intuitive and scalable.
So my question is: is it possible to build a data structure equivalent to the loaded JSON by writing queries in SQL, enabling us to build Standard SQL queries using dot notation and not requiring complex UNNEST patterns?
If you know that your ARRAY_AGG will produce exactly one element per group, you can drop the array by taking that element, like this:
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[offset(0)] AS attributes
Notice the use of OFFSET(0); this way the returned output will be:
[
{
"name": "BigQuery",
"user_count": "23",
"attributes": {
"data_thing": "Data Warehouse",
"ease_of_use": "5",
"description": "Awesome"
}
}
]
which can be queried using dot notation.
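For example, if that result is saved as a table (here called nested_table_struct, an assumed name), the earlier UNNEST query becomes a plain dot-notation filter:
SELECT
  name,
  user_count
FROM
  nested_table_struct
WHERE
  attributes.ease_of_use > 3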
If you just want to group the result into a STRUCT, you don't need ARRAY_AGG:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, struct("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) as attributes UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, struct("Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description)
)
SELECT
*
FROM flat_table
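And a quick self-contained check, adapted from the example above, that dot notation works directly on the STRUCT column:
WITH
flat_table AS (
  SELECT "BigQuery" AS name, 23 AS user_count, STRUCT("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) AS attributes
)
SELECT name, user_count
FROM flat_table
WHERE attributes.ease_of_use > 3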
My use case:
Collect events for a particular duration and then group them based on the key.
Objective:
After processing, the user can save the data of a particular duration based on the key.
How I am planning to do it:
1) Receive events from Kafka
2) Create a data stream of events
3) Associate a table with it and collect data for a particular duration by running a SQL query
4) Associate a new table with the step-3 output and group the collected data according to the key
5) Save the data in a DB
Solution I tried:
I am able to:
1) receive events from Kafka,
2) set up a data stream (let's say sensorDataStream):
DataStream<SensorEvent> sensorDataStream
    = source.flatMap(new FlatMapFunction<String, SensorEvent>() {
        @Override
        public void flatMap(String catalog, Collector<SensorEvent> out) {
            // parse the incoming string and emit a SensorEvent(id, sensor notification value, notification time)
        }
    });
3) associate a table (let's say table1) with the data stream and run a SQL query like:
SELECT id, sensorNotif, notifTime FROM SENSORTABLE WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp
Here t1_Timestamp and t2_Timestamp are predefined epoch times and will change based on some predefined conditions.
4) I am able to print this SQL query result on the console using the following call:
tableEnv.toAppendStream(table1, Row.class).print();
5) Created a new table (let's say table2) from table1 using the following type of SQL query:
Table table2 = tableEnv.sqlQuery("SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount FROM table1 GROUP BY id");
6) Collecting and printing data by using:
tableEnv.toRetractStream(table2, Row.class).print();
Problem
1) I am not able to see the output of step 6 on the console.
I did some experimenting and found that if I skip the table1 setup step (meaning no clubbing of sensor data for a duration) and directly associate my sensorDataStream with table2, then I can see the output of step 6. But as this is a retract stream, the data arrives as add/retract pairs: when a new event comes in, the retract stream invalidates the previous result and prints the newly calculated data.
Suggestions I would like to have:
1) How can I merge steps 5 and 6 (that is, table1 and table2)? I already merged these tables, but as the data is not visible on the console I have doubts: am I doing something wrong, or is the data merged but just not visible? (A combined query is sketched after this list.)
2) My plan is to:
2.a) filter the data in 2 passes: in the first pass, filter data for a particular interval, and in the second pass, group this data
2.b) save the 2.a output in a DB
Will this approach work (I have doubts because I am using a data stream, and table1's output is an append stream but table2's output is a retract stream)?
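For reference, here are the step-3 filter and the step-5 aggregation combined into a single statement, built only from the two queries above (because of the GROUP BY it still produces a retract stream, so this on its own does not explain the missing console output):
SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount
FROM SENSORTABLE
WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp
GROUP BY id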
HANA Version: SP12
All,
I've successfully created calc views with INPUT_PARAMETERS as described by Lars in many blogs and forums. While these views work without issue when queried directly with single and multiple inputs, I'm encountering an issue when performing joins on the calc view itself within a stored procedure or table function.
Example:
"BASE_SCHEMA"."BASE_TABLE_EXAMPLE" - record count(*) ~ 2million records
Keys: Material (20k distinct), Plant (200 distinct)
"_SYS_BIC"."CA_EXAMPLE_PRODUCTIVITY"
Input Parameters: IP_MATNR (nvarchar (5000)), IP_PLANT (nvarchar(5000))
Issue #1: The maximum length for NVARCHAR is 5000. I am unable to pass multiple inputs within the parameter if the concatenated values exceed 5000 characters.
Issue #2: How to use PLACEHOLDER logic in the same way one would perform an INNER JOIN in SQL.
base_data =
select
PLANT
,MATERIAL
from "BASE_SCHEMA"."BASE_TABLE_EXAMPLE"
group by PLANT,MATERIAL;
I would think to perform the below, but the output causes issues when the concatenated strings exceed the nvarchar(5000) limit of the input parameter.
select
string_agg(PLANT,''',''') as PLANT
,string_agg(MATERIAL,''',''') as MATERIAL
into var_PLANT, var_MATERIAL
from
(
select
PLANT
,MATERIAL
from :base_data
);
While I'm successful up to this point, once I add the variables into the PLACEHOLDER of the calc view, it fails, stating that I'm passing too many characters to the input parameter. Any suggestions? Thanks in advance.
base_calc =
select
PLANT
,MATERIAL
,MATERIAL_BU
,etc....
from "_SYS_BIC"."CA_EXAMPLE_PRODUCTIVITY"
(PLACEHOLDER."IP_MATNR"=> :var_MATERIAL, --<---Fails here. :(
PLACEHOLDER."IP_PLANT"=> :var_PLANT);
Question raised on SAP SCN. Located here!
Did you try using a WHERE clause instead of PLACEHOLDER?
base_calc =
select
PLANT,
MATERIAL,
MATERIAL_BU,
etc....
from "_SYS_BIC"."CA_EXAMPLE_PRODUCTIVITY"
WHERE MATERIAL = :var_MATERIAL AND PLANT = :var_PLANT;
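If the goal is to restrict the calc view to exactly the PLANT/MATERIAL pairs collected in :base_data, joining the table variable avoids the string concatenation and the nvarchar(5000) limit altogether. A sketch reusing the names from above, under the same assumption as the WHERE variant (that the view's input parameters can be left at their defaults):
base_calc =
select
    cv.PLANT,
    cv.MATERIAL,
    cv.MATERIAL_BU
from "_SYS_BIC"."CA_EXAMPLE_PRODUCTIVITY" cv
inner join :base_data bd
    on bd.PLANT = cv.PLANT
    and bd.MATERIAL = cv.MATERIAL;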