Hive explode with array of struct - arrays

I am trying to work out how to explode a complex type in Hive. I have the following Avro file that I want to use for my test and have built a Hive external table over it.
Here is my test data.
{"order_id":123456,"customer_id":987654,"total":305,"order_details":[{"quantity":5,"total":55,"product_detail":{"product_id":1000,"product_name":"Hugo Boss XY","product_description": {"string": "Hugo Xy Men 100 ml"}, "product_status": "AVAILABLE", "product_category":["fragrance","perfume"],"price":10.35,"product_hash":"XY123"}},{"quantity":5,"total":250,"product_detail":{"product_id":2000,"product_name":"Cherokee Polo T Shirt","product_description": {"string": "Cherokee Medium Blue Polo T Shirt"}, "product_status": "AVAILABLE", "product_category":["T-shirts","V-Neck","Cotton", "Medium"],"price":50.00,"product_hash":"XY789"}}]}
{"order_id":789012,"customer_id":4567324,"total":220,"order_details":[{"quantity":10,"total":120,"product_detail":{"product_id":1001,"product_name":"Hugo Men Red","product_description": {"string": "Hugo Men Red 150 ml"}, "product_status": "ONLY_FEW_LEFT", "product_category":["fragrance","perfume"],"price":12.99,"product_hash":"XY456"}},{"quantity":10,"total":100,"product_detail":{"product_id":2001,"product_name":"Ruggers Smart","product_description": {"string": "Ruggers Smart White Small Polo T Shirt"}, "product_status": "ONLY_FEW_LEFT", "product_category":["T-shirts","Round-Neck","Woolen", "Small"],"price":9.99,"product_hash":"XY987"}}]}
Avro schema
{
"namespace":"com.treselle.db.model",
"type":"record",
"doc":"This Schema describes about Order",
"name":"Order",
"fields":[
{"name":"order_id","type": "long"},
{"name":"customer_id","type": "long"},
{"name":"total","type": "float"},
{"name":"order_details","type":{
"type":"array",
"items": {
"namespace":"com.treselle.db.model",
"name":"OrderDetail",
"type":"record",
"fields": [
{"name":"quantity","type": "int"},
{"name":"total","type": "float"},
{"name":"product_detail","type":{
"namespace":"com.treselle.db.model",
"type":"record",
"name":"Product",
"fields":[
{"name":"product_id","type": "long"},
{"name":"product_name","type": "string","doc":"This is the name of the product"},
{"name":"product_description","type": ["string", "null"], "default": ""},
{"name":"product_status","type": {"name":"product_status", "type": "enum", "symbols": ["AVAILABLE", "OUT_OF_STOCK", "ONLY_FEW_LEFT"]}, "default":"AVAILABLE"},
{"name":"product_category","type":{"type": "array", "items": "string"}, "doc": "This contains array of categories"},
{"name":"price","type": "float"},
{"name": "product_hash", "type": {"type": "fixed", "name": "product_hash", "size": 5}}
]
}
}
]
}
}
}
]
}
My Hive DDL
CREATE EXTERNAL TABLE orders (
order_id bigint,
customer_id bigint,
total float,
order_items array<
struct<
quantity:int,
total:float,
product_detail:struct<
product_id:bigint,
product_name:string,
product_description:string,
product_status:string,
product_caretogy:array<string>,
price:float,
product_hash:binary
>
>
>
)
STORED AS AVRO
LOCATION '/user/hive/test/orders';
Queries
SELECT order_id, customer_id FROM orders;
This works fine and returns the results from the 2 rows as expected.
But when I try to use explode with lateral view I hit problems.
SELECT
order_id,
customer_id,
ord_dets.quantity as line_qty,
ord_dets.total as line_total
FROM
orders
LATERAL VIEW explode(order_items) exploded_table as ord_dets;
This query runs okay, but does not produce any results.
Any pointers as to what is wrong here?

The reason is that your Hive DDL defines the column as order_items, but in the data and the Avro schema the field is called order_details. Hive looks for order_items, finds no such field in the file, and defaults the column to null, so explode has nothing to emit.
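To make the failure mode concrete, here is a small Python model of name-based column resolution. It is not Hive code, and the helper names (resolve_column, explode, lateral_view) are hypothetical; it only assumes that a column missing from the file reads as null and that exploding a null array produces no rows.

```python
# Toy model of name-based column resolution; this is NOT Hive code.
record = {
    "order_id": 123456,
    "customer_id": 987654,
    "order_details": [
        {"quantity": 5, "total": 55},
        {"quantity": 5, "total": 250},
    ],
}

def resolve_column(rec, column_name):
    """Return the field value, or None when the file has no such field."""
    return rec.get(column_name)

def explode(values):
    """One row per array element; a null (None) array yields no rows."""
    return list(values) if values is not None else []

def lateral_view(rec, array_column):
    """Pair the parent's order_id with each exploded element."""
    items = explode(resolve_column(rec, array_column))
    return [(rec["order_id"], item["quantity"], item["total"]) for item in items]

wrong = lateral_view(record, "order_items")    # name used in the DDL
right = lateral_view(record, "order_details")  # name used in the Avro schema
print(wrong)
print(right)
```

With the DDL's name the result is empty, which matches a query that runs without error but returns nothing.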

Thanks for the pointer.
When I corrected that error I got errors at query time...
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.treselle.db.model.order_details, expecting union
After further analysis I found that both the enum type and the fixed type in the Avro file caused the "expecting union" error.
After removing those columns I was able to query the Hive table successfully.

Related

Loading JSON into BigQuery: Field is sometimes an array and sometimes a string

I'm trying to load JSON data to BigQuery. The excerpt of my data causing problems looks like this:
[{"Value":"123","Code":"A"},{"Value":"000","Code":"B"}]
{"Value":"456","Code":"A"}
[{"Value":"123","Code":"A"},{"Value":"789","Code":"C"},{"Value":"000","Code":"B"}]
{"Value":"Z","Code":"A"}
I have defined the schema for this field to be:
{
"fields": [
{
"mode": "NULLABLE",
"name": "Code",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Value",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "Properties",
"type": "RECORD"
}
But I'm having trouble successfully extracting the string and array values into one repeated field. This SQL will successfully extract the string values:
JSON_EXTRACT_SCALAR(json_string,'$.Properties.Code') as Code,
JSON_EXTRACT_SCALAR(json_string,'$.Properties.Value') as Value
And this SQL will successfully extract the array values:
ARRAY(
SELECT
STRUCT(
JSON_EXTRACT_SCALAR(Properties_Array,'$.Code') AS Code,
JSON_EXTRACT_SCALAR(Properties_Array,'$.Value') AS Value
)
FROM UNNEST(JSON_EXTRACT_ARRAY(json_string,'$.Properties')) Properties_Array)
AS Properties
I am trying to find a way to have BigQuery read this string as a one-element array instead of preprocessing the data. Is this possible in Standard SQL?
The example below is for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` as (
SELECT '{"Properties":[{"Value":"123","Code":"A"},{"Value":"000","Code":"B"}]}' json_string UNION ALL
SELECT '{"Properties":{"Value":"456","Code":"A"}}' UNION ALL
SELECT '{"Properties":[{"Value":"123","Code":"A"},{"Value":"789","Code":"C"},{"Value":"000","Code":"B"}]}' UNION ALL
SELECT '{"Properties": {"Value":"Z","Code":"A"}}'
)
SELECT json_string,
ARRAY(
SELECT STRUCT(
JSON_EXTRACT_SCALAR(Properties,'$.Code') AS Code,
JSON_EXTRACT_SCALAR(Properties,'$.Value') AS Value
)
FROM UNNEST(IFNULL(
JSON_EXTRACT_ARRAY(json_string,'$.Properties'),
[JSON_EXTRACT(json_string,'$.Properties')])) Properties
) AS Properties
FROM `project.dataset.table`
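The same normalization idea (treat a lone object as a one-element array) can be sketched outside BigQuery. The Python below is an illustration only; the function name properties_as_list is hypothetical, and returning an empty list when Properties is absent is my own choice rather than something the SQL above does.

```python
import json

def properties_as_list(json_string):
    """Return Properties as a list, wrapping a single object in a list."""
    props = json.loads(json_string).get("Properties")
    if props is None:
        return []          # assumption: a missing field becomes an empty list
    if isinstance(props, list):
        return props       # already an array
    return [props]         # single object -> one-element array

rows = [
    '{"Properties":[{"Value":"123","Code":"A"},{"Value":"000","Code":"B"}]}',
    '{"Properties":{"Value":"456","Code":"A"}}',
    '{"Properties": {"Value":"Z","Code":"A"}}',
]
normalized = [properties_as_list(r) for r in rows]
for item in normalized:
    print(item)
```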

OPENJSON - How to extract value from JSON object saved as NVARCHAR in SQL Server

There is a column RawData of type NVARCHAR that contains JSON objects as strings:
RawData
-------------------------------------------------------
{"ID":1,--other key/value(s)--,"object":{--object1--}}
{"ID":2,--other key/value(s)--,"object":{--object2--}}
{"ID":3,--other key/value(s)--,"object":{--object3--}}
{"ID":4,--other key/value(s)--,"object":{--object4--}}
{"ID":5,--other key/value(s)--,"object":{--object5--}}
Each JSON string is big (about 1 KB), and currently the most-used part of it is object (about 200 bytes).
I want to extract the object part of these JSON strings using OPENJSON. I have not found a solution yet, but I think one exists.
The result that i want is:
RawData
----------------
{--object1--}
{--object2--}
{--object3--}
{--object4--}
{--object5--}
My attempts so far
SELECT *
FROM OPENJSON((SELECT RawData From DATA_TB FOR JSON PATH))
Looks like this should work for you.
Sample data
create table data_tb
(
RawData nvarchar(max)
);
insert into data_tb (RawData) values
('{"ID":1, "key": "value1", "object":{ "name": "alfred" }}'),
('{"ID":2, "key": "value2", "object":{ "name": "bert" }}'),
('{"ID":3, "key": "value3", "object":{ "name": "cecil" }}'),
('{"ID":4, "key": "value4", "object":{ "name": "dominique" }}'),
('{"ID":5, "key": "value5", "object":{ "name": "elise" }}');
Solution
select d.RawData, json_query(d.RawData, '$.object') as Object
from data_tb d;
See it in action: fiddle.
Something like this
SELECT object
FROM DATA_TB as dt
CROSS APPLY
OPENJSON(dt.RawData) with (object nvarchar(max) as json);

Query JSON Key:Value Pairs in AWS Athena

I have received a data set from a client that is loaded in AWS S3. The data contains unnamed JSON key:value pairs. This isn't my area of expertise, so I was looking for a little help.
The structure of JSON data that I've typically worked with in the past looks similar to this:
{ "name":"John", "age":30, "car":null }
The data that I have received from my client is formatted as such:
{
"answer_id": "cc006",
"answer": {
"101086": 1,
"101087": 2,
"101089": 2,
"101090": 7,
"101091": 5,
"101092": 3,
"101125": 2
}
}
This is survey data, where the key on the left is a numeric customer identifier, and the value on the right is their response to a survey question, i.e. customer "101125" answered the survey with a value of "2". I need to be able to query the JSON data using Athena such that my result set looks similar to:
Cross joining the unnested children against the parent node isn't an issue. What I can't figure out is how to select all of the keys from the "answer" object without specifying the actual key names. Similarly, I want to be able to select all of the values as well.
Is it possible to create a virtual table in Athena that would allow for these results, or do I need to convert the JSON to a format that looks more similar to the following:
{
"answer_id": "cc006",
"answer": [
{ "key": "101086", "value": 1 },
{ "key": "101087", "value": 2 },
{ "key": "101089", "value": 2 },
{ "key": "101090", "value": 7 },
{ "key": "101091", "value": 5 },
{ "key": "101092", "value": 3 },
{ "key": "101125", "value": 2 }
]
}
EDIT 6/4/2020
I was able to use the code that Theon provided below along with the following table structure:
CREATE EXTERNAL TABLE answer_example (
answer_id string,
answer string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
That allowed me to use the following query to generate the results that I needed.
WITH Data AS(
SELECT
answer_id,
CAST(json_extract(answer, '$') AS MAP(VARCHAR, VARCHAR)) as answer
FROM
answer_example
)
SELECT
answer_id,
key,
element_at(answer, key) AS value
FROM
Data
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
EDIT 6/5/2020
Taking additional advice from Theon's response below, the following DDL and Query simplify this quite a bit.
DDL:
CREATE EXTERNAL TABLE answer_example (
answer_id string,
answer map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
Query:
SELECT
answer_id,
key,
element_at(answer, key) AS value
FROM
answer_example
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
You can cross join with the keys of the answer property and then pick the corresponding value for each key. Something like this:
WITH data AS (
SELECT
'cc006' AS answer_id,
MAP(
ARRAY['101086', '101087', '101089', '101090', '101091', '101092', '101125'],
ARRAY[1, 2, 2, 7, 5, 3, 2]
) AS answers
)
SELECT
answer_id,
key,
element_at(answers, key) AS value
FROM data
CROSS JOIN UNNEST (map_keys(answers)) AS answer (key)
You could probably do something with transform_keys to create rows of the key value pairs, but the SQL above does the trick.
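For readers who want to see the shape of the result without running Athena, here is a rough Python equivalent of the key/value unnesting. It is only a sketch: the function name unnest_answers is hypothetical, and the sample line is a shortened copy of the client data from the question.

```python
import json

# Shortened sample in the client's format (three of the original keys).
raw = '{"answer_id": "cc006", "answer": {"101086": 1, "101087": 2, "101125": 2}}'

def unnest_answers(json_line):
    """Emit one (answer_id, key, value) row per entry of the answer object."""
    doc = json.loads(json_line)
    return [(doc["answer_id"], key, value) for key, value in doc["answer"].items()]

rows = unnest_answers(raw)
for row in rows:
    print(row)
```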

Remove element from jsonb array

I have the following jsonb. From the array pages I would like to remove the element called 'pageb'. The solutions offered in similar questions are not working for me.
'{
"data": {
"id": "a1aldjfg3f",
"pages": [
{
"type": "pagea"
},
{
"type": "pageb"
}
],
"activity": "test"
}
}'
My script right now looks like this. It doesn't return any error, but the element is not removed.
UPDATE database
SET reports = jsonb_set(reports, '{data,pages}', (reports->'data'->'pages') - ('{"type":"pageb"}'), true)
WHERE reports->'data'->'pages' @> '[{"type":"pageb"}]';
The - operator cannot be applied here because the right-hand operand is a string defining a key, per the documentation:
Delete key/value pair or string element from left operand. Key/value pairs are matched based on their key value.
Removing a json object from a json array can be done by unpacking the array and finding the index of the object. A query using this method may be too complicated, so defining a custom function is very handy in this case.
create or replace function jsonb_remove_array_element(arr jsonb, element jsonb)
returns jsonb language sql immutable as $$
select arr- (
select ordinality- 1
from jsonb_array_elements(arr) with ordinality
where value = element)::int
$$;
And the update:
update my_table
set reports =
jsonb_set(
reports,
'{data,pages}',
jsonb_remove_array_element(reports->'data'->'pages', '{"type":"pageb"}')
)
where reports->'data'->'pages' @> '[{"type":"pageb"}]';
Working example in rextester.
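The find-the-index-then-delete approach can be illustrated in Python as follows. This is a sketch, not a translation of the SQL function: remove_array_element is a hypothetical name, it removes only the first match, and it returns the array unchanged when nothing matches.

```python
import copy

def remove_array_element(arr, element):
    """Return a copy of arr without the first item equal to element."""
    for index, value in enumerate(arr):
        if value == element:
            return arr[:index] + arr[index + 1:]
    return list(arr)  # no match: return an unchanged copy

reports = {
    "data": {
        "id": "a1aldjfg3f",
        "pages": [{"type": "pagea"}, {"type": "pageb"}],
        "activity": "test",
    }
}

# Work on a copy so the original document stays intact.
updated = copy.deepcopy(reports)
updated["data"]["pages"] = remove_array_element(
    updated["data"]["pages"], {"type": "pageb"}
)
print(updated)
```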
The following combines the answer above for reliably deleting an element inside an array with PostgreSQL's ability to use data-modifying WITH statements. It needs an identity column (id in my test table) because the update has to be correlated with the rows it came from:
WITH new_reports AS (
SELECT
id,
reports #- array['data','pages',(position - 1)::text] AS new_value
FROM
test,
jsonb_array_elements(reports->'data'->'pages') WITH ORDINALITY arr(item, position)
WHERE
test.reports->'data'->'pages' @> '[{"type":"pageb"}]'
AND
item->>'type' = 'pageb'
)
UPDATE test SET reports = new_reports.new_value FROM new_reports WHERE test.id = new_reports.id;
The test data I used:
SELECT reports FROM test;
reports
-----------------------------------------------------------------------------------------------------
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pagea"}, {"type": "pagec"}], "activity": "test"}}
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pagea"}, {"type": "pageb"}], "activity": "test"}}
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pageb"}, {"type": "pagec"}], "activity": "test"}}
(3 rows)
...and after executing the query:
SELECT reports FROM test;
reports
-----------------------------------------------------------------------------------------------------
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pagea"}, {"type": "pagec"}], "activity": "test"}}
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pagea"}], "activity": "test"}}
{"data": {"id": "a1aldjfg3f", "pages": [{"type": "pagec"}], "activity": "test"}}
(3 rows)
I hope that works for you.
There you go
do $$
declare newvar jsonb;
begin
newvar := jsonb '{ "customer": "John Doe", "buy": [{"product": "Beer","qty": 6},{"product": "coca","qty": 5}]}';
newvar := jsonb_set(newvar,'{buy}', jsonb_remove((newvar->>'buy')::jsonb,'{"product": "Beer"}'));
newvar := jsonb_set(newvar,'{buy}', jsonb_add((newvar->>'buy')::jsonb,'{"product": "cofe","qty": 6}'));
RAISE NOTICE '%', newvar;
end $$;
create or replace function jsonb_remove(arr jsonb, element jsonb)
returns jsonb language sql immutable as $$
select ('['||coalesce(string_agg(r::text,','),'')||']')::jsonb from jsonb_array_elements(arr) r where not (r @> element)
$$;
create or replace function jsonb_add(arr jsonb, element jsonb)
returns jsonb language sql immutable as $$
select arr||element
$$;

postgresql json array query

I tried to query my json array using the example here: How do I query using fields inside the new PostgreSQL JSON datatype?
They use the example:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
But my Json array looks more like this (if using the example):
{
"people": [{
"name": "Toby",
"occupation": "Software Engineer"
},
{
"name": "Zaphod",
"occupation": "Galactic President"
}
]
}
But I get an error: ERROR: cannot call json_array_elements on a non-array
Is my Json "array" not really an array? I have to use this Json string because it's contained in a database, so I would have to tell them to fix it if it's not an array.
Or, is there another way to query it?
I read the documentation but nothing worked; I kept getting errors.
The JSON array is stored under the key people, so pass my_json->'people' to the function:
with my_table(my_json) as (
values(
'{
"people": [
{
"name": "Toby",
"occupation": "Software Engineer"
},
{
"name": "Zaphod",
"occupation": "Galactic President"
}
]
}'::json)
)
select t.*
from my_table t
cross join json_array_elements(my_json->'people') elem
where elem->>'name' = 'Toby';
The function json_array_elements() unnests the json array and generates all its elements as rows:
select elem->>'name' as name, elem->>'occupation' as occupation
from my_table
cross join json_array_elements(my_json->'people') elem
name | occupation
--------+--------------------
Toby | Software Engineer
Zaphod | Galactic President
(2 rows)
If you are interested in Toby's occupation:
select elem->>'occupation' as occupation
from my_table
cross join json_array_elements(my_json->'people') elem
where elem->>'name' = 'Toby'
occupation
-------------------
Software Engineer
(1 row)
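The distinction between the outer object and the array stored under people can also be shown in Python. This is an illustrative sketch only; array_elements is a hypothetical stand-in that refuses non-array input, loosely mirroring the error in the question.

```python
import json

doc = json.loads("""
{
  "people": [
    {"name": "Toby", "occupation": "Software Engineer"},
    {"name": "Zaphod", "occupation": "Galactic President"}
  ]
}
""")

def array_elements(value):
    """Return the elements of a JSON array; reject anything else."""
    if not isinstance(value, list):
        raise TypeError("cannot iterate elements of a non-array")
    return value

# The top-level value is an object, so unnesting it directly fails.
try:
    array_elements(doc)
    top_level_failed = False
except TypeError:
    top_level_failed = True

# Stepping into the "people" key first gives the actual array.
occupations = [
    person["occupation"]
    for person in array_elements(doc["people"])
    if person["name"] == "Toby"
]
print(top_level_failed, occupations)
```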
