export a relational Snowflake table as valid JSON or XML

I have a table in Snowflake and want to export it as JSON or XML but am struggling to get all of the records from my table into one valid output. I'm not very familiar with JSON formats but have made some progress.
Here is a sample of my table:
+------------------------------------+---------------------------------+---------------+
| USER_ID                            | PRODUCT_EXTERNAL_ID             | PRODUCT_PRICE |
|------------------------------------+---------------------------------+---------------|
| "1770607c104641caaf5fde0e433b76a6" | ["817720022911","817720022782"] | [39.99,39.99] |
| "27d4c6a6559e48629dc558c94080c5ca" | ["882709630449"]                | [43.65]       |
+------------------------------------+---------------------------------+---------------+
Here is the query I am using to get the results below:
create or replace temp table json_output as
with cte as
(
    select
        to_json(
            object_construct('order_details',
                array_agg(
                    object_construct('user_id', user_id,
                                     'PRODUCT_EXTERNAL_ID', PRODUCT_EXTERNAL_ID,
                                     'PRODUCT_PRICE', PRODUCT_PRICE
                    )
                )
            )
        ) as output
    from one_row_per_order
    group by user_id
)
select *
from cte;
When querying the above table, this is the format of the JSON (both records combined):
{
"order_details": [{
"PRODUCT_EXTERNAL_ID": ["817720022911", "817720022782"],
"PRODUCT_PRICE": [ 39.99, 39.99 ],
"USER_ID": "1770607c104641caaf5fde0e433b76a6",
}]
} {
"PRODUCT_EXTERNAL_ID": [ "882709630449" ],
"PRODUCT_PRICE": [ 43.65 ],
"USER_ID": "27d4c6a6559e48629dc558c94080c5ca"
}]
}
However, from my understanding the above isn't valid JSON - see where the first object ends and the next begins (the first closing square bracket is in the wrong place). Instead, this is my desired output format (valid JSON syntax), but I can't achieve it:
{
  "order_details": [{
    "PRODUCT_EXTERNAL_ID": ["12345", "26810"],
    "PRODUCT_PRICE": [17.99, 10.99],
    "USER_ID": "1770607c104641caaf5fde0e433b76a6"
  },
  {
    "PRODUCT_EXTERNAL_ID": ["4578", "9876", "8888"],
    "PRODUCT_PRICE": [4.99, 12.50],
    "USER_ID": "27d4c6a6559e48629dc558c94080c5ca"
  }]
}
I have tried array_agg but hit the maximum size limit imposed on LOB objects in Snowflake (I actually have 70k records in my table, with many more columns than shown in the example). I also tried TO_XML but ran into the same issue where each row is treated as a separate object, whereas I want all rows combined into one object.
Have you any suggestions as to how to format the JSON please? As mentioned, I would accept valid XML too.

Original Table -
select * from test_json;
+---------------------+---------------------+---------------+
| USER_ID | PRODUCT_EXTERNAL_ID | PRODUCT_PRICE |
|---------------------+---------------------+---------------|
| 1770607c90283414343 | [ | [ |
| | "817823487432", | 39.99, |
| | "817982433987" | 39.99 |
| | ] | ] |
| 27d4c90283414343 | [ | [ |
| | "882487432" | 43.65 |
| | ] | ] |
+---------------------+---------------------+---------------+
Query to return valid JSON, converted from the values in the table; note, however, that we still hit the size limits Snowflake imposes on functions such as ARRAY_AGG -
select object_construct('order_details', array_agg(json_1)) as json_val
from
(
    select '1' as id,
           OBJECT_CONSTRUCT('PRODUCT_EXTERNAL_ID', PRODUCT_EXTERNAL_ID,
                            'PRODUCT_PRICE', PRODUCT_PRICE,
                            'USER_ID', user_id) as json_1
    from test_json
)
group by id;
+----------------------------------------+
| JSON_VAL |
|----------------------------------------|
| { |
| "order_details": [ |
| { |
| "PRODUCT_EXTERNAL_ID": [ |
| "817823487432", |
| "817982433987" |
| ], |
| "PRODUCT_PRICE": [ |
| 39.99, |
| 39.99 |
| ], |
| "USER_ID": "1770607c90283414343" |
| }, |
| { |
| "PRODUCT_EXTERNAL_ID": [ |
| "882487432" |
| ], |
| "PRODUCT_PRICE": [ |
| 43.65 |
| ], |
| "USER_ID": "27d4c90283414343" |
| } |
| ] |
| } |
+----------------------------------------+
Another approach (which does not meet the 'combine all' requirement, though) is to unload all records as JSON into a JSON-format stage file and then use that file as needed.
Per the test case below, for 70K records the input size would be around 14 MB.
create stage test_int_stage file_format=(type=JSON);
+-------------------------------------------------+
| status |
|-------------------------------------------------|
| Stage area TEST_INT_STAGE successfully created. |
+-------------------------------------------------+
copy into @TEST_INT_STAGE from (
    select OBJECT_CONSTRUCT('PRODUCT_EXTERNAL_ID', PRODUCT_EXTERNAL_ID,
                            'PRODUCT_PRICE', PRODUCT_PRICE,
                            'USER_ID', user_id) as json_col
    from test_json
);
+---------------+-------------+--------------+
| rows_unloaded | input_bytes | output_bytes |
|---------------+-------------+--------------|
| 2 | 209 | 143 |
+---------------+-------------+--------------+
list @test_int_stage;
+-----------------------------------+------+
| name | size |
|-----------------------------------+------+
| test_int_stage/data_0_0_0.json.gz | 144 |
+-----------------------------------+------+
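If the goal is a single output file covering the whole table, a hedged variation on the same unload is to add SINGLE = TRUE (and a larger MAX_FILE_SIZE if needed) and then download the file with GET. Note that the result is newline-delimited JSON (one object per line), not one enclosing JSON document; the file and local path names below are illustrative only.
copy into @test_int_stage/orders.json from (
    select OBJECT_CONSTRUCT('PRODUCT_EXTERNAL_ID', PRODUCT_EXTERNAL_ID,
                            'PRODUCT_PRICE', PRODUCT_PRICE,
                            'USER_ID', user_id) as json_col
    from test_json
)
file_format = (type = JSON compression = NONE)
single = TRUE
max_file_size = 104857600;  -- raise the cap if 70K wide rows exceed the default

-- GET runs from SnowSQL or a driver, not from a worksheet
get @test_int_stage/orders.json file:///tmp/;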

Related

Insert filename and line number into stage for Json Format

I have a COPY command.
Table structure:
CREATE OR REPLACE transient TABLE PB_INVEOR_JSON_2 (
    var variant,
    file_name text,
    line_number number
);
My Copy command:
copy into PB_INVEOR_JSON_2(VAR,FILE_NAME, LINE_NUMBER)
from (select $1, metadata$filename, metadata$file_row_number @investor_stage_s3/EMBRO_20220111/ )
pattern='.*Investor_.*.json'
FILE_FORMAT=(TYPE= 'JSON' strip_outer_array=true)
on_error=continue FORCE = TRUE
I am unable to get this to insert. Can someone guide me on how to get the filename?
Ideally I would like the PB_INVEOR_JSON_2 table in this format:
+-------------------+--------------------------+----------------+
| METADATA$FILENAME | METADATA$FILE_ROW_NUMBER | PARSE_JSON($1) |
|-------------------+--------------------------+----------------|
| s3://em/a/a.json | 1 | { |
| | | "a": { |
| | | "b": "x1", |
| | | "c": "y1" |
| | | } |
| | | } |
| s3://em/a/a.json. | 2 | { |
| | | "a": { |
| | | "b": "x2", |
| | | "c": "y2" |
| | | } |
| | | } |
+-------------------+--------------------------+----------------+
Error:
SQL compilation error: syntax error line 2 at position 62 unexpected '@investor_stage_s3/EMBRO_20220111/'.
Missing the FROM in the SELECT portion of the COPY statement:
copy into PB_INVEOR_JSON_2(VAR,FILE_NAME, LINE_NUMBER)
from (
select $1,metadata$filename, metadata$file_row_number
from @investor_stage_s3/EMBRO_20220111/
)
pattern='.*Investor_.*.json'
FILE_FORMAT=(TYPE= 'JSON' strip_outer_array=true)
on_error=continue FORCE = TRUE
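As a side note, the metadata columns and the parsed JSON can also be previewed straight from the stage before loading, which matches the desired table layout above. A minimal sketch, assuming a named JSON file format (my_json_format is a name made up for the example):
create or replace file format my_json_format type = JSON strip_outer_array = TRUE;

select metadata$filename,
       metadata$file_row_number,
       $1 as var   -- already a VARIANT when the file format is JSON
from @investor_stage_s3/EMBRO_20220111/
     (file_format => 'my_json_format', pattern => '.*Investor_.*.json');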

Database design for One to Many or Many to many for use case?

I have a requirement involving four consecutive one-to-many table layers, as follows:
School
SchoolID | SName
-----------------
1 | HSS1
2 | HSS2
3 | HSS3
4 | HSS4
Class
ClassID | CName | SchoolID
-------------------------------
1 | Class1 | 1
2 | Class2 | 1
3 | Class3 | 3
4 | Class4 | 3
Student
StudentID | StName | ClassID
-------------------------------
1 | StudC1 | 1
2 | StudC2 | 1
3 | StudC3 | 2
4 | StudC4 | 2
Bench
BenchID | BName | StudentID
-------------------------------
1 | Bench1 | 1
2 | Bench2 | 1
3 | Bench3 | 1
4 | Bench4 | 2
1 | Bench1 | 3
2 | Bench2 | 3
3 | Bench3 | 4
4 | Bench4 | 4
The idea is to get the response from the database as JSON in this format:
{
  "resources": [
    {
      "SName": "HSS1",
      "ClassList": [
        {
          "CName": [
            {
              "CName": "Class1",
              "StudentList": [
                {
                  "StName": "StudC1",
                  "BenchList": [
                    { "BName": "Bench1" },
                    { "BName": "Bench2" },
                    { "BName": "Bench3" }
                  ]
                },
                {
                  "StName": "StudC2",
                  "BenchList": [
                    { "BName": "Bench4" }
                  ]
                }
              ]
            },
            {
              "CName": "Class2",
              "StudentList": [
                { "StName": "StudC3" },
                { "StName": "StudC4" }
              ]
            }
          ]
        }
      ]
    }
  ]
}
My question is,
Is it good practice to create nested one-to-many relationship tables in the database when the expected result is JSON (built with Java) as above?
Is there a simpler approach to the table design that still yields the same nested response?
The source tables above will be merged into target tables of the same structure using MERGE/UPSERT. Will that be straightforward?
Does changing the one-to-many-to-many-to-many relationship help?
Is it okay to get such a JSON response from the database if I design the tables this way?
Note: assume the table names are sample names, and consider all four tables to be connected one-to-many.
I need community suggestions on this approach, and a better design for it if there is one.
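On the JSON side specifically, the nested document can also be assembled in SQL rather than in Java by aggregating bottom-up, one level at a time. A minimal sketch in Snowflake syntax (ARRAY_AGG/OBJECT_CONSTRUCT, with column names taken from the sample tables above; the dialect is an assumption, and other databases have equivalents such as JSON_ARRAYAGG/JSON_OBJECT):
with benches as (
    select studentid,
           array_agg(object_construct('BName', bname)) as benchlist
    from bench
    group by studentid
),
students as (
    select s.classid,
           array_agg(object_construct('StName', s.stname,
                                       'BenchList', b.benchlist)) as studentlist
    from student s
    left join benches b on b.studentid = s.studentid
    group by s.classid
),
classes as (
    select c.schoolid,
           array_agg(object_construct('CName', c.cname,
                                       'StudentList', st.studentlist)) as classlist
    from class c
    left join students st on st.classid = c.classid
    group by c.schoolid
)
select object_construct('resources',
           array_agg(object_construct('SName', sc.sname,
                                       'ClassList', cl.classlist))) as json_val
from school sc
left join classes cl on cl.schoolid = sc.schoolid;
-- OBJECT_CONSTRUCT skips NULL values, so students without benches simply
-- have no BenchList key, as in the desired output above.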

Database structure for metaquery

I have a database that has the following structure:
+------+------+--------+-------+--------+-------+
| item | type | color | speed | length | width |
+------+------+--------+-------+--------+-------+
| 1 | 1 | 3 | 1 | 2 | 2 |
| 2 | 1 | 5 | 3 | 1 | 1 |
| 3 | 1 | 6 | 3 | 1 | 1 |
| 4 | 2 | 2 | 1 | 3 | 1 |
| 5 | 2 | 2 | 2 | 2 | 1 |
| 6 | 2 | 4 | 2 | 3 | 1 |
| 7 | 2 | 5 | 1 | 1 | 2 |
| 8 | 3 | 1 | 2 | 2 | 2 |
| 9 | 4 | 4 | 3 | 1 | 2 |
| 10 | 4 | 6 | 3 | 3 | 2 |
+------+------+--------+-------+--------+-------+
I would like to efficiently query what combination of fields are valid. So, for example, I'd like to query the database for the following:
What values of color are valid if type is 1?
ans: [3, 5, 6]
What values of speed are valid if type is 2 and color is 2?
ans: [1, 2]
What values of type are valid if length is 2 and width is 2?
ans: [1, 2]
The SQL equivalents are:
SELECT DISTINCT `color` FROM `cars` WHERE `type` =2
SELECT DISTINCT `speed` FROM `cars` WHERE `type` =2 AND `width` =2
SELECT DISTINCT `type` FROM `cars` WHERE `length` =2 AND `width` =2
I'm planning on using a cloud based database (Cloudant DBAAS - based on CouchDB). How would this best be implemented, keeping in mind that there may be thousands of items with tens of fields?
I haven't put too much thought into this question, so there may be errors in the approach, but one option is to represent each row with a document:
{
"_id": "1db91338150bfcfe5fcadbd98fe77d56",
"_rev": "1-83daafc1596c2dabd4698742c2d8b0cf",
"item": 1,
"type": 1,
"color": 3,
"speed": 1,
"length": 2,
"width": 2
}
Note the _id and _rev fields have been automatically generated by Cloudant for this example.
You could then create a secondary index on the type field:
function(doc) {
if(doc.type)
emit(doc.type);
}
To search using the type field:
https://accountname.cloudant.com/dashboard.html#database/so/_design/ddoc/_view/col_for_type?key=1&include_docs=true
A secondary index on the type and width fields:
function(doc) {
if( doc.type && doc.width)
emit([doc.type, doc.width]);
}
To search using the type and width fields:
https://accountname.cloudant.com/dashboard.html#database/so/_design/ddoc/_view/speed_for_type_and_width?key=[1,2]&include_docs=true
A secondary index on the length and width fields:
function(doc) {
if (doc.length && doc.width)
emit([doc.length, doc.width]);
}
To search using the length and width fields:
https://accountname.cloudant.com/dashboard.html#/database/so/_design/ddoc/_view/type_for_length_and_width?key=[2,2]&include_docs=true
The complete design document is here:
{
"_id": "_design\/ddoc",
"_rev": "3-c87d7c3cd44dcef35a030e23c1c91711",
"views": {
"col_for_type": {
"map": "function(doc) {\n if(doc.type)\n emit(doc.type);\n}"
},
"speed_for_type_and_width": {
"map": "function(doc) {\n if( doc.type && doc.width)\n emit([doc.type, doc.width]);\n}"
},
"type_for_length_and_width": {
"map": "function(doc) {\n if (doc.length && doc.width)\n emit([doc.length, doc.width]);\n}"
}
},
"language": "javascript"
}

Solr Grouping For Categories / Sub-Categories

I'm still very much a newbie in the realms of Solr.
I'm attempting to create a query which groups by category, returning a unique list of sub_categories. My schema looks something like the following:
+==============================================+
| id | category | sub_category | Type |
+-----+----------+--------------+--------------+
| 1 | Apparel | Pants | Suede |
| 2 | Apparel | Pants | Leather |
| 3 | Apparel | Pants | Wind Pants |
| 4 | Apparel | Shirts | Short-Sleeve |
| 5 | Apparel | Shirts | Long-Sleeve |
| 6 | Sports | Balls | Soccer Ball |
| 7 | Sports | Balls | Football |
+-----+----------+--------------+--------------+
I'm interested in getting a return similar to the following, but am unsure how to accomplish it. I can almost get there, but the problem is I am unable to get the sub_category column to return unique values. The example below does account for distinct sub_categories:
{
  "responseHeader": {
    "status": 0,
    "QTime": 12
  },
  "grouped": {
    "category": {
      "matches": 1,
      "groups": [
        {
          "groupValue": "Apparel",
          "docList": {
            "numFound": 2,
            "start": 0,
            "docs": [
              {"sub_category": "Pants"},
              {"sub_category": "Shirts"}
            ]
          }
        },
        {
          "groupValue": "Sports",
          "docList": {
            "numFound": 2,
            "start": 0,
            "docs": [
              {"sub_category": "Balls"}
            ]
          }
        }
      ]
    }
  }
}
Assuming you are using Solr 4.0+, I believe facet pivoting is a better way to do this.
Try:
http://localhost:8983/solr/select?q=*:*&facet.pivot=category,sub_category&facet=true&facet.field=category&rows=0
Update: Hmm, but that's not going to give you the unique counts though :-? That will give you something like this, if that's OK with you:
+ Apparel [5]
|--- Pants [3]
|--- Shirts [2]
|
+ Sports [2]
|--- Balls [2]
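For reference, the pivot data comes back in the response under facet_counts.facet_pivot, roughly in this shape (these are document counts per value, not unique counts):
{
  "facet_counts": {
    "facet_pivot": {
      "category,sub_category": [
        {
          "field": "category",
          "value": "Apparel",
          "count": 5,
          "pivot": [
            { "field": "sub_category", "value": "Pants", "count": 3 },
            { "field": "sub_category", "value": "Shirts", "count": 2 }
          ]
        },
        {
          "field": "category",
          "value": "Sports",
          "count": 2,
          "pivot": [
            { "field": "sub_category", "value": "Balls", "count": 2 }
          ]
        }
      ]
    }
  }
}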

MySQL query does not return any data

I need to retrieve data from a specific time period.
The query works fine until I specify the time period. Is there something wrong with the way I specify time period? I know there are many entries within that time-frame.
This query returns empty:
SELECT stop_times.stop_id, STR_TO_DATE(stop_times.arrival_time, '%H:%i:%s') as stopTime, routes.route_short_name, routes.route_long_name, trips.trip_headsign FROM trips
JOIN stop_times ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
WHERE stop_times.stop_id = 5508
HAVING stopTime BETWEEN DATE_SUB(stopTime,INTERVAL 1 MINUTE) AND DATE_ADD(stopTime,INTERVAL 20 MINUTE);
Here is its EXPLAIN:
+----+-------------+------------+--------+------------------+---------+---------+-------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+------------------+---------+---------+-------------------------------+------+-------------+
| 1 | SIMPLE | stop_times | ref | trip_id,stop_id | stop_id | 5 | const | 605 | Using where |
| 1 | SIMPLE | trips | eq_ref | PRIMARY,route_id | PRIMARY | 4 | wmata_gtfs.stop_times.trip_id | 1 | |
| 1 | SIMPLE | routes | eq_ref | PRIMARY | PRIMARY | 4 | wmata_gtfs.trips.route_id | 1 | |
+----+-------------+------------+--------+------------------+---------+---------+-------------------------------+------+-------------+
3 rows in set (0.00 sec)
The query works if I remove the HAVING clause (don't specify time range). Returns:
+---------+----------+------------------+-----------------+---------------+
| stop_id | stopTime | route_short_name | route_long_name | trip_headsign |
+---------+----------+------------------+-----------------+---------------+
| 5508 | 06:31:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 06:57:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 07:23:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 07:49:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 08:15:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 08:41:00 | "80" | "" | "FORT TOTTEN" |
| 5508 | 09:08:00 | "80" | "" | "FORT TOTTEN" |
I am using Google Transit format Data loaded into MySQL.
The query is supposed to provide stop times and bus routes for a given bus stop.
For a bus stop, I am trying to get:
Route Name
Bus Name
Bus Direction (headsign)
Stop time
The results should be limited only to buses times from 1 min ago to 20 min from now.
Please let me know if you could help.
UPDATE
The problem was that I was comparing DATE to DATETIME as one answer said.
I could not use DATE because my values had times but not dates.
So my solution was to use Unix time:
SELECT stop_times.stop_id, stop_times.trip_id, UNIX_TIMESTAMP(CONCAT(DATE_FORMAT(NOW(),'%Y-%m-%d '), stop_times.arrival_time)) as stopTime, routes.route_short_name, routes.route_long_name, trips.trip_headsign FROM trips
JOIN stop_times ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
WHERE stop_times.stop_id = 5508
HAVING stopTime > (UNIX_TIMESTAMP(NOW()) - 60) AND stopTime < (UNIX_TIMESTAMP(NOW()) + (60*20));
stopTime is a TIME value, and DATE_ADD/DATE_SUB work with DATETIME values. Ensure they are both the same type.
Try this instead:
SELECT * FROM
(SELECT stop_times.stop_id, STR_TO_DATE(stop_times.arrival_time, '%H:%i:%s') as stopTime, routes.route_short_name, routes.route_long_name, trips.trip_headsign FROM trips
JOIN stop_times ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
WHERE stop_times.stop_id = 5508) AS qu_1
WHERE qu_1.stopTime BETWEEN DATE_SUB(qu_1.stopTime,INTERVAL 1 MINUTE) AND DATE_ADD(qu_1.stopTime,INTERVAL 20 MINUTE);
Have to warn you I haven't tested this but it does remove the need for the HAVING clause.
Don't work with the synthetic column stopTime other than as the output.
I think your query should be something like:
SELECT stop_times.stop_id, STR_TO_DATE(stop_times.arrival_time, '%H:%i:%s') as stopTime, routes.route_short_name, routes.route_long_name, trips.trip_headsign FROM trips
JOIN stop_times ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
WHERE stop_times.stop_id = 5508
AND arrival_time BETWEEN <something> AND <something else>
The HAVING clause you wrote should always return true, so I'm guessing that's not what you really had in mind.
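To make the time window concrete, one hedged possibility (the exact predicate is an assumption; it presumes arrival_time holds 'HH:MM:SS' strings for the current day, and GTFS times past midnight such as '25:10:00' would need extra handling):
SELECT stop_times.stop_id,
       STR_TO_DATE(stop_times.arrival_time, '%H:%i:%s') AS stopTime,
       routes.route_short_name,
       routes.route_long_name,
       trips.trip_headsign
FROM trips
JOIN stop_times ON trips.trip_id = stop_times.trip_id
JOIN routes ON routes.route_id = trips.route_id
WHERE stop_times.stop_id = 5508
  AND STR_TO_DATE(stop_times.arrival_time, '%H:%i:%s')
      BETWEEN SUBTIME(CURTIME(), '00:01:00')   -- one minute ago
          AND ADDTIME(CURTIME(), '00:20:00');  -- twenty minutes from now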
