CREATE TABLE in AWS Athena with an Anonymous JSON Array

I am working in AWS Athena. I am trying to CREATE EXTERNAL TABLE over data that contains an anonymous (top-level) JSON array. For example:
[{"key1":"value1a","key2":"value2a",...},{"key1":"value1b","key2":"value2b",...}]
I cannot get the table definition to correctly parse the array of objects into a set of rows.
I have tried this:
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85 (
data struct<`key1`:string,
`key2`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'strip.outer.array'='true'
) LOCATION 's3://mys3/new-json-structure/'
TBLPROPERTIES ('has_encrypted_data'='false');
and
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85 (
data array<struct<`key1`:string,
`key2`:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'strip.outer.array'='false'
) LOCATION 's3://mys3/new-json-structure/'
TBLPROPERTIES ('has_encrypted_data'='false');
In both cases, when I query, only the first object in the array is found.
SELECT stuff.key1
FROM "testdata"."trynum85 ", UNNEST(data) AS t(stuff)
returns
key1
----
value1a
value1b never appears.
If I modify the data and make the array named, i.e.
{"iamme":[{"key1":"value1a","key2":"value2a",...},{"key1":"value1b","key2":"value2b",...}]}
Then a query for key1 yields the full list of results (a sketch of this workaround is below).
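For reference, a minimal sketch of what the named-array variant looks like; the table name trynum85_named and the simplified DDL are illustrative, not an exact definition:
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85_named (
iamme array<struct<`key1`:string, `key2`:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mys3/new-json-structure/';
-- the named array can then be exploded row by row
SELECT stuff.key1
FROM "testdata"."trynum85_named", UNNEST(iamme) AS t(stuff);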
How do I handle the anonymous array?

Related

Load JSON Data into Snowflake table

My data is as follows:
[ {
"InvestorID": "10014-49",
"InvestorName": "Blackstone",
"LastUpdated": "11/23/2021"
},
{
"InvestorID": "15713-74",
"InvestorName": "Bay Grove Capital",
"LastUpdated": "11/19/2021"
}]
So far I have tried:
CREATE OR REPLACE TABLE STG_PB_INVESTOR (
Investor_ID string, Investor_Name string, Last_Updated DATETIME
); -- created the table
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
-- created the file format
create or replace stage investor_stage
file_format = investorformat;
-- created the stage
copy into STG_PB_INVESTOR from @investor_stage;
I am getting an error:
SQL compilation error: JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
You should be loading your JSON data into a table with a single column that is a VARIANT. Once it is in Snowflake, you can flatten that data out with either a view or a subsequent table load. You could also flatten it on the way in using a SELECT in your COPY statement, but that tends to be a little slower (a sketch of that variant follows the example below).
Try something like this:
CREATE OR REPLACE TABLE STG_PB_INVESTOR_JSON (
var variant
);
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR_JSON from @investor_stage;
create or replace table STG_PB_INVESTOR as
SELECT
var:InvestorID::string as Investor_id,
var:InvestorName::string as Investor_Name,
TO_DATE(var:LastUpdated::string,'MM/DD/YYYY') as last_updated
FROM STG_PB_INVESTOR_JSON;
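For completeness, a sketch of the flatten-on-the-way-in variant mentioned above, reusing the same stage and the original STG_PB_INVESTOR table; the column mapping and casts are assumptions based on the sample data:
copy into STG_PB_INVESTOR (Investor_ID, Investor_Name, Last_Updated)
from (
select
$1:InvestorID::string,
$1:InvestorName::string,
TO_DATE($1:LastUpdated::string, 'MM/DD/YYYY')
from @investor_stage
);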

Flink SQL: passing Row type parameters to a UDTF

CREATE TABLE user_log (
data ROW(id String,user_id String,class_id String)
) WITH (
'connector.type' = 'kafka',
...
);
INSERT INTO sink
SELECT * FROM user_log as tab,
LATERAL TABLE(splitUdtf(tab.data)) AS T(a,b,c);
UDTF Code:
public void eval(Row data) {...}
Can the eval method only accept Row type parameters? I want to get the keys of the Row in SQL, such as id, user_id, and class_id, but the keys of the Row in Java are indexes (such as 0, 1, 2). How do I do it? Thank you!
Is your SQL able to directly convert Kafka data to a table Row? Probably not.
Row is a type at the DataStream level, not a type in the Table API & SQL.
If the data you receive from Kafka is in JSON format, you can use a DDL statement in Flink SQL, or the connector API, to extract the fields from the JSON directly, as long as your JSON is in key-value format (see the sketch below).
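A minimal sketch of that DDL approach, assuming the legacy Kafka connector properties from the question and a JSON payload whose top-level keys are id, user_id, and class_id (the remaining connector properties are elided as in the original snippet):
CREATE TABLE user_log (
id STRING,
user_id STRING,
class_id STRING
) WITH (
'connector.type' = 'kafka',
'format.type' = 'json',
...
);
-- the fields can then be selected directly, without a Row-typed UDTF
INSERT INTO sink
SELECT id, user_id, class_id FROM user_log;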

Presto: How to read an entire bucket from S3 that is partitioned into sub-folders?

I need to read, using Presto, an entire dataset from S3 that sits in "bucket-a". Inside the bucket, the data is saved in sub-folders by year, so the bucket looks like this:
Bucket-a>2017>data
Bucket-a>2018>more data
Bucket-a>2019>more data
All of the above data belongs to the same table, but it is saved this way in S3. Note that there is no data directly under bucket-a itself, only inside each folder.
What I have to do is read all the data in the bucket as a single table, adding year as a column or partition.
I tried it this way, but it didn't work:
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
partitioned_by = ARRAY['year'],
external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
and also
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
bucketed_by = ARRAY['year'],
bucket_count = 3,
external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
None of the above worked.
I have seen people writing partitioned data to S3 using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders, as a single table.
Thanks.
If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
Since your folders do not follow that convention, you need to register each one individually as a partition using the system.register_partition procedure (available in Presto 330, about to be released). The alternative to register_partition is to run an appropriate ADD PARTITION in the Hive CLI. A sketch of both is below.
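A rough sketch under those assumptions, reusing the schema, table, and bucket names from the question and a table declared with partitioned_by = ARRAY['year'] as in the first attempt; the exact procedure signature and catalog prefix can vary between Presto versions:
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');
-- repeat for 2018 and 2019
-- Hive CLI alternative, roughly:
-- ALTER TABLE mytable ADD PARTITION (year=2017) LOCATION 's3://bucket-a/2017/';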

BigQuery or SQL Server SPLIT query

I have searched around and cannot find much on this topic. I have a table that receives logging information. As a result, the column I am interested in contains multiple values that I need to search against. The column is formatted in a PHP URL query-string style, i.e.:
/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32
This means all searches end up needing really long regexes to get the data, and then join statements to combine it.
Is there a way in BigQuery, or SQL Server that I can pull the information from that column and put it into new columns?
Example:
The information I would like extracted begins after the ? and ends at the &. The string can sometimes be longer, and contains additional headers.
Thanks,
Below is for BigQuery Standard SQL and addresses the following aspect of your question:
Is there a way in BigQuery, ... that I can pull the information from that column and put it into new columns?
#standardSQL
CREATE TEMP FUNCTION parseColumn(kv STRING, column_name STRING) AS (
IF(SPLIT(kv, '=')[OFFSET(0)]= column_name, SPLIT(kv, '=')[OFFSET(1)], NULL)
);
WITH `project.dataset.table` AS (
SELECT '/test/test.aspx?extra=abc&DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url UNION ALL
SELECT '/test/test.aspx?DS_Vendor=55192&DS_ProdVer=4.30.100.0&more=123&DS_ProdLang=DE&DS_Product=MTE&DS_OfficeBits=64'
)
SELECT
MIN(parseColumn(kv, 'DS_Vendor')) AS DS_Vendor,
MIN(parseColumn(kv, 'DS_ProdVer')) AS DS_ProdVer,
MIN(parseColumn(kv, 'DS_ProdLang')) AS DS_ProdLang,
MIN(parseColumn(kv, 'DS_Product')) AS DS_Product,
MIN(parseColumn(kv, 'DS_OfficeBits')) AS DS_OfficeBits
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS kv
GROUP BY url
with the result as below
Row DS_Vendor DS_ProdVer DS_ProdLang DS_Product DS_OfficeBits
1 55039 7.90.100.0 EN MTT 32
2 55192 4.30.100.0 DE MTE 64
The following is also addressed:
The string can sometimes be longer, and contains additional headers.
One example using BigQuery (with standard SQL):
SELECT REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
This returns the parts of the URL as an ARRAY<STRING>. To go one step further, you can get back an ARRAY<STRUCT<key STRING, value STRING>> with a query of this form:
SELECT
ARRAY(
SELECT AS STRUCT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part
) AS keys_and_values
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
...or with the keys and values as top-level columns:
SELECT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part

Delete an object from nested array in openjson SQL Server 2016

I want to delete the "AttributeName": "Manufacturer" entry from the JSON below in SQL Server 2016:
declare #json nvarchar(max) = '[{"Type":"G","GroupBy":[],
"Attributes":[{"AttributeName":"Class Designation / Compressive Strength"},{"AttributeName":"Size"},{"AttributeName":"Manufacturer"}]}]'
This is the query I tried, which is not working:
select JSON_MODIFY((
select JSON_Query(@json, '$[0].Attributes') as res),'$.AttributeName.Manufacturer', null)
Here is a working solution using FOR JSON and OPENJSON. The point is to:
1. Identify the item you wish to delete and replace it with NULL. This is done by JSON_MODIFY(@json,'$[0].Attributes[2]', null). We're simply saying: take the element at index 2 of Attributes (the Manufacturer entry) and replace it with null.
2. Convert the array to a row set. We need to get rid of that null element somehow, and that is something we can easily filter in SQL with where [value] is not null.
3. Assemble it all back into the original JSON. That is done by FOR JSON AUTO.
Please bear in mind one important aspect of such JSON data transformations: JSON is designed for information exchange or, at most, for storing information; you should avoid more complicated data manipulation at the SQL level.
Anyway, here is the solution:
declare @json nvarchar(max) = '[{"Type": "G","GroupBy": [],"Attributes": [{"AttributeName": "Class Designation / Compressive Strength"}, {"AttributeName": "Size"}, {"AttributeName": "Manufacturer"}]}]';
with src as
(
SELECT * FROM OPENJSON(
JSON_Query(
JSON_MODIFY(@json,'$[0].Attributes[2]', null) , '$[0].Attributes'))
)
select JSON_MODIFY(@json,'$[0].Attributes', (
select JSON_VALUE([value], '$.AttributeName') as [AttributeName] from src
where [value] is not null
FOR JSON AUTO
))
