Parse JSON arrays using Hive

I have many JSON arrays stored in a table (jt) that look like this:
[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]
Each array is a record.
I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log.
I tried to use the get_json_object function, but it seems it is not compatible with JSON arrays, because I only get NULL values.
This is the code I have tested:
CREATE TABLE logs AS
SELECT get_json_object(jt.value, '$.ts') AS ts,
       get_json_object(jt.value, '$.id') AS id,
       get_json_object(jt.value, '$.log') AS log
FROM jt;
I tried to use other functions but they seem really complicated.
Thank you! :)
Update!
I solved my issue by performing a regexp replacement:
CREATE TABLE jt_reg AS
SELECT regexp_replace(regexp_replace(value, '\\}\\,\\{', '\\}\\\n\\{'), '\\[|\\]', '') AS valuereg
FROM jt;

CREATE TABLE logs AS
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts,
       get_json_object(jt_reg.valuereg, '$.id') AS id,
       get_json_object(jt_reg.valuereg, '$.log') AS log
FROM jt_reg;

I just ran into this problem, with the JSON array stored as a string in the Hive table.
The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs:
SELECT
  get_json_object(single_json_table.single_json, '$.ts') AS ts,
  get_json_object(single_json_table.single_json, '$.id') AS id,
  get_json_object(single_json_table.single_json, '$.log') AS log
FROM (
  SELECT explode(
    split(regexp_replace(substr(json_array_col, 2, length(json_array_col) - 2),
                         '"}","', '"}",,,,"'), ',,,,')
  ) AS single_json
  FROM src_table
) single_json_table;
I broke the lines up so that it would be a little easier to read.
I'm using substr() to strip the first and last characters, removing [ and ]. I'm then using regexp_replace() to match the separator between records in the JSON array and change it to something unique that can then be used easily with split() to turn the string into a Hive array of JSON objects, which can then be used with explode() as described in the previous solution.
Note that the separator regex used here ( "}"," ) wouldn't work with the original data set; there the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{", e.g.:
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"},\\{"', '"},,,,{"'), ',,,,')

Use the explode() function
hive (default)> CREATE TABLE logs AS
> SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
> get_json_object(single_json_table.single_json, '$.id') AS id,
> get_json_object(single_json_table.single_json, '$.log') AS log
> FROM
> (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;
Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
hive (default)> select * from logs;
OK
ts id log
1403781896 14 show
1403781896 14 start
1403781911 14 press
1403781911 14 press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>
where json_array_col is the column in jt that holds your array of JSONs.
hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]

Because get_json_object doesn't support a JSON array string at the top level, you can concatenate the array into a JSON object, like this:
SELECT
get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;
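Once the array is wrapped under a root key, individual elements can also be addressed by index (a sketch based on the question's data; get_json_object supports [n] array subscripts in its path):

SELECT
  get_json_object(concat('{"root":', jt.value, '}'), '$.root[0].ts')  AS first_ts,
  get_json_object(concat('{"root":', jt.value, '}'), '$.root[0].log') AS first_log
FROM jt;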

Related

Snowflake: Copy from S3 into a table with nested JSON

Requirement: load a nested JSON file from S3 into Snowflake.
Error: SQL compilation error: COPY statement only supports simple SELECT from stage statements for import.
I know I can create a temporary table from the SQL. Is there a better way to load directly from S3 into Snowflake?
COPY INTO schema.table_A FROM (
  WITH s3 AS (
    SELECT $1 AS json_array
    FROM @public.stage
      (file_format => 'public.json',
       pattern => 'abc/xyz/.*')
  )
  SELECT DISTINCT
    CURRENT_TIMESTAMP() AS exec_t,
    json_array AS data,
    json_array:id AS id,
    json_array:code::text AS code
  FROM s3, TABLE(FLATTEN(s3.json_array)) f
);
Transformations during loading come with certain limitations; see here: https://docs.snowflake.com/en/user-guide/data-load-transform.html#transforming-data-during-a-load
If you still want to keep your code and not apply the transformations later, you can create a view on top of the stage and then INSERT into another table based on a SELECT * from that view.
Avoiding the CTE may already help.
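A rough sketch of that suggestion, reusing the stage, file format, and columns from the question (the view name is hypothetical, and support for querying a stage from a view can depend on your Snowflake setup):

-- View that applies the transformation on top of the stage
CREATE OR REPLACE VIEW schema.v_staged AS
SELECT DISTINCT
    CURRENT_TIMESTAMP() AS exec_t,
    s.$1 AS data,
    s.$1:id AS id,
    s.$1:code::text AS code
FROM @public.stage (file_format => 'public.json', pattern => 'abc/xyz/.*') s,
     TABLE(FLATTEN(s.$1)) f;

-- A plain INSERT ... SELECT is not subject to the COPY transformation limits
INSERT INTO schema.table_A
SELECT * FROM schema.v_staged;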

XPath 'contains()' requires a singleton (or empty sequence)

Given the XML:
<Dial>
<DialID>
24521
</DialID>
<DialName>
Base Price
</DialName>
</Dial>
<Dial>
<DialID>
24528
</DialID>
<DialName>
Rush Options
</DialName>
<DialValue>
1.5
</DialValue>
</Dial>
<Dial>
<DialID>
24530
</DialID>
<DialName>
Bill Rush Charges
</DialName>
<DialValue>
School
</DialValue>
</Dial>
I can use the contains() function in my xpath:
//Dial[DialName[contains(text(), 'Bill')]]/DialValue
To retrieve the values I'm after:
School
The above XML is stored in a field in my SQL database, so I'm using the .value() method to select from that field.
SELECT Dials.DialDetail.value('(//Dial[DialName[contains(text(), "Bill")]]/DialValue)[1]','VARCHAR(64)') AS BillTo
FROM CampaignDials Dials
I can't seem to get the syntax right, though. The XPath works as expected (tested in Oxygen and elsewhere), but when I use it in the XQuery argument of the .value() method, I get an error:
Started executing query at Line 1
Msg 2389, Level 16, State 1, Line 36
XQuery [Dials.DialDetail.value()]: 'contains()' requires a singleton (or empty sequence), found operand of type 'xdt:untypedAtomic *'
Total execution time: 00:00:00.004
I've tried different variations of single and double quotes with no effect. The error refers to an XPath data type for attributes, but I'm not retrieving an attribute; I'm getting the text value. I receive the same error if I end the path with //Dial[DialName[contains(text(), 'Bill')]]/DialValue/text() instead.
What is the correct way to use contains() in an XQuery when it's used in the XML.value() method? Or is this the wrong approach to begin with?
You nearly have it right; you just need [1] on the text() function to guarantee a single value.
You should also use text() on the actual node you are pulling out, for performance reasons.
Also, // can be inefficient, so only use it if you really need recursive descent. You can instead use /*/ to get the first node of any name.
SELECT
Dials.DialDetail.value(
'(//Dial[DialName[contains(text()[1], "Bill")]]/DialValue/text())[1]',
'VARCHAR(64)') AS BillTo
FROM CampaignDials Dials
As Yitzhak Kabinsky notes, this only gets you one value per row of the table; you need .nodes() if you want to shred the XML itself into rows.
The difference between your actual database case that fails and your reduced sample case that works is likely one of different data.
The error,
contains() requires a singleton (or empty sequence)
indicates that one of your DialName elements has multiple text node children rather than a single text node child as you're expecting.
You can abstract away such variations by testing the string-value of DialName rather than its text node children:
//Dial[contains(DialName, 'Bill')]/DialValue
See also
Testing text() nodes vs string values in XPath
Here is how to do XML shredding in MS SQL Server correctly.
You need to apply the filter in the XQuery .nodes() method.
The .value() method is just for the actual value retrieval.
It is possible to pass a SQL Server variable as a parameter instead of hard-coding the "Bill" value.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, DialDetail XML);
INSERT INTO @tbl (DialDetail) VALUES
(N'<Dial>
<DialID>24521</DialID>
<DialName>Base Price</DialName>
</Dial>
<Dial>
<DialID>24528</DialID>
<DialName>Rush Options</DialName>
<DialValue>1.5</DialValue>
</Dial>
<Dial>
<DialID>24530</DialID>
<DialName>Bill Rush Charges</DialName>
<DialValue>School</DialValue>
</Dial>');
-- DDL and sample data population, end
SELECT ID
    , c.value('(DialID/text())[1]', 'INT') AS DialID
    , c.value('(DialName/text())[1]', 'VARCHAR(30)') AS DialName
    , c.value('(DialValue/text())[1]', 'VARCHAR(30)') AS DialValue
FROM @tbl AS tbl
    CROSS APPLY tbl.DialDetail.nodes('/Dial[contains((DialName/text())[1], "Bill")]') AS t(c);
Output
+----+--------+-------------------+-----------+
| ID | DialID | DialName | DialValue |
+----+--------+-------------------+-----------+
| 1 | 24530 | Bill Rush Charges | School |
+----+--------+-------------------+-----------+
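And a sketch of the variable-based filter mentioned above, replacing the hard-coded "Bill" with sql:variable() (reusing the same @tbl):

DECLARE @filter VARCHAR(30) = 'Bill';

SELECT ID
    , c.value('(DialValue/text())[1]', 'VARCHAR(30)') AS DialValue
FROM @tbl AS tbl
    CROSS APPLY tbl.DialDetail.nodes('/Dial[contains((DialName/text())[1], sql:variable("@filter"))]') AS t(c);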

Create a Nested/Repeating field using SQL in BigQuery which can be queried with dot notation (without UNNEST)

I am trying to build a data structure in BigQuery using SQL which exactly reflects the data structure which I obtain when uploading JSON. This will enable me to query the view using SQL with dot notation instead of having to UNNEST, which I do understand but many of my clients find extremely confusing and unintuitive.
If I build a really simple dummy dataset with a couple of rows and then nest using the ARRAY_AGG(STRUCT([field list])) pattern:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
)
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description)) AS attributes
FROM flat_table
GROUP BY name, user_count
Then saving and viewing the schema shows that the attributes field is Type = RECORD and Mode = REPEATED. Schema field names are:
name
user_count
attributes
attributes.data_thing
attributes.ease_of_use
attributes.description
If I look at the COLUMN information in the INFORMATION_SCHEMA.COLUMNS query I can see that the attributes field is_nullable = NO and data_type = ARRAY<STRUCT<data_thing STRING, ease_of_use INT64, description STRING>>
If I want to query this structure I need to use the UNNEST pattern as below:
SELECT
name,
user_count
FROM
nested_table,
UNNEST(attributes)
WHERE
ease_of_use > 3
However when I upload the following JSON representation of the same data to BigQuery with automatic schema detection:
{"attributes":{"description":"Awesome","ease_of_use":5,"data_thing":"Data Warehouse"},"user_count":23,"name":"BigQuery"}
{"attributes":{"description":"Solid","ease_of_use":3,"data_thing":"Database"},"user_count":12,"name":"MySQL"}
The schema looks nearly identical once loaded, except for the attributes field is Mode = NULLABLE (it is still Type = RECORD). The INFORMATION_SCHEMA.COLUMNS shows me that the attributes field is now is_nullable = YES and data_type = STRUCT<data_thing STRING, ease_of_use INT64, description STRING>, i.e. now nullable and not in an array.
However the most interesting thing for me is that I can now query this table using dot notation instead of the UNNEST pattern, so the query above becomes:
SELECT
name,
user_count
FROM
nested_table_json
WHERE
attributes.ease_of_use > 3
Which is arguably easier to read, even in this trivial case. However once we get to more complex data structures with multiple nested fields and multi-level nesting, the UNNEST pattern becomes extremely difficult to write, QA and debug. The dot notation pattern appears to be much more intuitive and scalable.
So my question is: is it possible to build a data structure equivalent to the loaded JSON by writing queries in SQL, enabling us to build Standard SQL queries using dot notation and not requiring complex UNNEST patterns?
If you know that your ARRAY_AGG will produce a single element per group, you can drop the ARRAY notation like this:
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[offset(0)] AS attributes
Notice the use of OFFSET(0); this way the returned output will be:
[
{
"name": "BigQuery",
"user_count": "23",
"attributes": {
"data_thing": "Data Warehouse",
"ease_of_use": "5",
"description": "Awesome"
}
}
]
which can be queried using dot notation.
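For reference, here is the fragment above completed with the question's dummy data (a sketch; the rows mirror the original flat_table):

WITH flat_table AS (
  SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
  SELECT "MySQL", 12, "Database", 3, "Solid"
)
SELECT
  name, user_count,
  ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[OFFSET(0)] AS attributes
FROM flat_table
GROUP BY name, user_count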
If you just want to group the result in a STRUCT, you don't need ARRAY_AGG.
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, struct("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) as attributes UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, struct("Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description)
)
SELECT
*
FROM flat_table
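With attributes built as a plain STRUCT like this, dot notation works directly; for example, replacing the final SELECT in the snippet above:

SELECT
  name,
  user_count
FROM flat_table
WHERE attributes.ease_of_use > 3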

BigQuery or SQL Server SPLIT query

I have searched around and cannot find much on this topic. I have a table that gets logging information. As a result, the column I am interested in contains multiple values that I need to search against. The column is formatted in a PHP URL style, i.e.
/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32
This makes all searches end up with really long regexes to get data, and then JOIN statements to combine it.
Is there a way in BigQuery, or SQL Server that I can pull the information from that column and put it into new columns?
Example:
The information I would like extracted begins after the ? and ends at &. The string can sometimes be longer, and contains additional headers.
Thanks,
Below is for BigQuery Standard SQL and addresses this aspect of your question:
Is there a way in BigQuery, ... that I can pull the information from that column and put it into new columns?
#standardSQL
CREATE TEMP FUNCTION parseColumn(kv STRING, column_name STRING) AS (
IF(SPLIT(kv, '=')[OFFSET(0)]= column_name, SPLIT(kv, '=')[OFFSET(1)], NULL)
);
WITH `project.dataset.table` AS (
SELECT '/test/test.aspx?extra=abc&DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url UNION ALL
SELECT '/test/test.aspx?DS_Vendor=55192&DS_ProdVer=4.30.100.0&more=123&DS_ProdLang=DE&DS_Product=MTE&DS_OfficeBits=64'
)
SELECT
MIN(parseColumn(kv, 'DS_Vendor')) AS DS_Vendor,
MIN(parseColumn(kv, 'DS_ProdVer')) AS DS_ProdVer,
MIN(parseColumn(kv, 'DS_ProdLang')) AS DS_ProdLang,
MIN(parseColumn(kv, 'DS_Product')) AS DS_Product,
MIN(parseColumn(kv, 'DS_OfficeBits')) AS DS_OfficeBits
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS kv
GROUP BY url
with the result as below:
Row  DS_Vendor  DS_ProdVer  DS_ProdLang  DS_Product  DS_OfficeBits
1    55039      7.90.100.0  EN           MTT         32
2    55192      4.30.100.0  DE           MTE         64
The below aspect of your question is also addressed:
The string can sometimes be longer, and contains additional headers.
One example using BigQuery (with standard SQL):
SELECT REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
This returns the parts of the URL as an ARRAY<STRING>. To go one step further, you can get back an ARRAY<STRUCT<key STRING, value STRING>> with a query of this form:
SELECT
ARRAY(
SELECT AS STRUCT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part
) AS keys_and_values
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
...or with the keys and values as top-level columns:
SELECT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part

Array_agg in postgres selectively quotes

I have a complex database with keys and values stored in different tables. It is useful for me to aggregate them when pulling out the values for the application:
SELECT array_agg(key_name), array_agg(vals)
FROM (
SELECT
id,
key_name,
array_agg(value)::VARCHAR(255) AS vals
FROM factor_key_values
WHERE id=20
GROUP BY key_name, id
) f;
This particular query, in my case, gives the following invalid JSON:
-[ RECORD 1 ]-----------------------------------------------------------------------
array_agg | {"comparison method","field score","field value"}
array_agg | {"{\"text category\"}","{100,70,50,0,30}","{A,B,C,F,\"No Experience\"}"}
Notice that a string in the varchar array is only quoted if it contains a space. I have narrowed this down to the behaviour of ARRAY_AGG. For completeness, here is an example:
BEGIN;
CREATE TABLE test (txt VARCHAR(255));
INSERT INTO test(txt) VALUES ('one'),('two'),('three'), ('four five');
SELECT array_agg(txt) FROM test;
The result will be:
{one,two,three,"four five"}
This is why my JSON is breaking. I can handle unquoted or quoted strings in the application code, but having a mix is nuts.
Is there any solution to this?
Can't you use json_agg?
select json_agg(txt) from test;
json_agg
--------------------------------------
["one", "two", "three", "four five"]
Unfortunately, this is the inconsistent standard that PostgreSQL uses for formatting arrays. See "Array Input and Output Syntax" for more information.
Clodoaldo's answer is probably what you want, but as an alternative, you could also build your own result:
SELECT '{'||array_to_string(array_agg(txt::text), ',')||'}' FROM test;
