I'm currently using Postgres (version 9.4.4) to save entire JSON documents of projects in one table, like this (simplified):
CREATE TABLE projects (
id numeric(19,0) NOT NULL,
project jsonb NOT NULL )
The project JSON is something like this (overly simplified):
{
  "projectData": {
    "guid": "project_guid",
    "name": "project_name"
  },
  "types": [
    {
      "class": "window",
      "provider": "glassland",
      "elements": [
        { "name": "example_name", "location": "2nd floor", "guid": "1a" },
        { "name": "example_name", "location": "3rd floor", "guid": "2a" }
      ]
    },
    {
      "class": "door",
      "provider": "woodland",
      "elements": [
        { "name": "example_name", "location": "1st floor", "guid": "3a" },
        { "name": "example_name", "location": "2nd floor", "guid": "4a" }
      ]
    }
  ]
}
I've been reading the documentation on the operators ->, ->>, #>, #>> and so on. I did some tests and ran successful selects, but I can't manage to index properly, especially the nested arrays (types and elements).
Those are some example selects I would like to learn how to optimize (there are plenty like this):
select distinct types->'class' as class, types->'provider' as type
from projects, jsonb_array_elements(project#>'{types}') types;
select types->>'class' as class,
       types->>'provider' as provider,
       elems->>'name' as name,
       elems->>'location' as location,
       elems->>'guid' as guid
from projects,
     jsonb_array_elements(project#>'{types}') types,
     jsonb_array_elements(types#>'{elements}') elems
where types->>'class' like 'some_text%' and elems->'guid' <> '""';
Also I have this index:
CREATE INDEX idx_gin ON projects USING GIN (project jsonb_ops);
Both of those selects work, but they don't use the #> operator or any operator that can use the GIN index. I can't create a btree index ( create index idx_btree on projects using btree ((project->>'types')); ) because the size of the value exceeds the limit (for the real json). Also I can't (or don't know how to) create an index for, let's say, the guids of elements ( create index idx_btree2 on projects using btree ((project->>'types'->>'elements'->>'guid')); ); this produces a syntax error.
I've been trying to translate the queries to something using the containment operator @>, but things like this:
select count(*)
from projects, jsonb_array_elements(project#>'{types}') types
where types->>'class' = 'window';
select count(*)
from projects
where project @> '{"types":[{"class":"window"}]}';
produce a different output.
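For reference, the two counts can legitimately differ: the lateral join counts matching array elements, while a containment test counts matching rows (one per project). A containment query that the GIN index can actually serve looks like this (a sketch against the simplified schema above; jsonb_path_ops is a smaller variant of the index that supports only containment):

```sql
-- Containment: true if the project document contains a "types" entry
-- whose class is "window"; can use the GIN index on the whole column.
SELECT count(*)
FROM projects
WHERE project @> '{"types": [{"class": "window"}]}';

-- A smaller index that supports only @> (containment) queries:
CREATE INDEX idx_gin_path ON projects USING GIN (project jsonb_path_ops);
```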
Is there a way to properly index the nested arrays of that json? or to properly select taking advantage of the GIN index?
Related
Consider the below JSON object
[
{
"startdt": "10/13/2021",
"enddt": "10/13/2022",
"customerName1": "John",
"customerName2": "CA"
},
{
"startdt": "10/14/2021",
"enddt": "10/14/2022",
"customerName1": "Jacob",
"customerName2": "NJ"
}
]
This is the value present in the column "custjson" of a table "CustInfo" in a Postgres DB. I want to search the data by the field customerName1. I created the query below, but it searches the whole object, so that if I give customerName1 as "Jacob" it returns the whole array. I want to search only the matching array element and return just that.
SELECT DISTINCT ON (e.id) e.*,
       (jsonb_array_elements(e.custjson) ->> 'customerName1') AS name1
FROM CustInfo e
CROSS JOIN jsonb_array_elements(e.custjson) ej
WHERE ej ->> 'customerName1' LIKE '%Jacob%'
Is there a way in which we can only search the "Jacob" customerName1's array instead of whole json?
For example, if I search for Jacob I should get the following, instead of searching the whole JSON:
{
"startdt": "10/14/2021",
"enddt": "10/14/2022",
"customerName1": "Jacob",
"customerName2": "NJ"
}
Any help would be greatly appreciated.
You can use a JSON path expression to find the array element with a matching customer name:
select e.id,
jsonb_path_query_array(e.custjson, '$[*] ? (@.customerName1 like_regex "Jacob")')
from custinfo e
Based on your sample data, this returns:
id | jsonb_path_query_array
---+----------------------------------------------------------------------------------------------------
1 | [{"enddt": "10/14/2022", "startdt": "10/14/2021", "customerName1": "Jacob", "customerName2": "NJ"}]
If you are using an older Postgres version that doesn't support JSON path queries (they were added in Postgres 12), you need to unnest and aggregate manually:
select e.id,
(select jsonb_agg(element)
from jsonb_array_elements(e.custjson) as x(element)
where x.element ->> 'customerName1' like '%Jacob%')
from custinfo e
This assumes that custjson is defined with the data type jsonb (which it should be). If not, you need to cast it: custjson::jsonb
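If you only want rows that actually contain a match, the same per-row aggregation can be phrased as a LATERAL join plus a filter (a sketch, using the same assumed table and column names):

```sql
SELECT e.id, m.matches
FROM custinfo e
CROSS JOIN LATERAL (
    SELECT jsonb_agg(element) AS matches        -- NULL when nothing matches
    FROM jsonb_array_elements(e.custjson) AS x(element)
    WHERE x.element ->> 'customerName1' LIKE '%Jacob%'
) m
WHERE m.matches IS NOT NULL;  -- drop rows with no matching element
```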
I am trying to build a data structure in BigQuery using SQL which exactly reflects the data structure which I obtain when uploading JSON. This will enable me to query the view using SQL with dot notation instead of having to UNNEST, which I do understand but many of my clients find extremely confusing and unintuitive.
If I build a really simple dummy dataset with a couple of rows and then nest using the ARRAY_AGG(STRUCT([field list])) pattern:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
)
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description)) AS attributes
FROM flat_table
GROUP BY name, user_count
Then saving and viewing the schema shows that the attributes field is Type = RECORD and Mode = REPEATED. Schema field names are:
name
user_count
attributes
attributes.data_thing
attributes.ease_of_use
attributes.description
If I look at the COLUMN information in the INFORMATION_SCHEMA.COLUMNS query I can see that the attributes field is_nullable = NO and data_type = ARRAY<STRUCT<data_thing STRING, ease_of_use INT64, description STRING>>
If I want to query this structure I need to use the UNNEST pattern as below:
SELECT
name,
user_count
FROM
nested_table,
UNNEST(attributes)
WHERE
ease_of_use > 3
However when I upload the following JSON representation of the same data to BigQuery with automatic schema detection:
{"attributes":{"description":"Awesome","ease_of_use":5,"data_thing":"Data Warehouse"},"user_count":23,"name":"BigQuery"}
{"attributes":{"description":"Solid","ease_of_use":3,"data_thing":"Database"},"user_count":12,"name":"MySQL"}
The schema looks nearly identical once loaded, except for the attributes field is Mode = NULLABLE (it is still Type = RECORD). The INFORMATION_SCHEMA.COLUMNS shows me that the attributes field is now is_nullable = YES and data_type = STRUCT<data_thing STRING, ease_of_use INT64, description STRING>, i.e. now nullable and not in an array.
However the most interesting thing for me is that I can now query this table using dot notation instead of the UNNEST pattern, so the query above becomes:
SELECT
name,
user_count
FROM
nested_table_json
WHERE
attributes.ease_of_use > 3
Which is arguably easier to read, even in this trivial case. However once we get to more complex data structures with multiple nested fields and multi-level nesting, the UNNEST pattern becomes extremely difficult to write, QA and debug. The dot notation pattern appears to be much more intuitive and scalable.
So my question is: is it possible to build a data structure equivalent to the loaded JSON by writing queries in SQL, enabling us to build Standard SQL queries using dot notation and not requiring complex UNNEST patterns?
If you know that your ARRAY_AGG will produce exactly one element, you can take its first element with OFFSET(0), turning the ARRAY<STRUCT> into a plain STRUCT:
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[offset(0)] AS attributes
Notice the use of OFFSET(0); this way the returned output will be:
[
{
"name": "BigQuery",
"user_count": "23",
"attributes": {
"data_thing": "Data Warehouse",
"ease_of_use": "5",
"description": "Awesome"
}
}
]
which can be queried using dot notation.
If you just want to group the result into a STRUCT, you don't need ARRAY_AGG at all:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, struct("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) as attributes UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, struct("Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description)
)
SELECT
*
FROM flat_table
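Either way, the attributes column is now a plain STRUCT, so it can be addressed with dot notation; for example, replacing the final SELECT * in the query above:

```sql
SELECT name, user_count
FROM flat_table
WHERE attributes.ease_of_use > 3
```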
I've got two JSON rows from a column in a PostgreSQL database, which look like this:
{
"details":[{"to":"0:00:00","from":"00:00:12"}]
}
{
"details":[
{"to":"13:01:11","from":"13:00:12"},
{"to":"00:00:12","from":"13:02:11"}
]
}
I want to iterate over details and get only the "from" key values using a PostgreSQL query. The output I want is:
from
00:00:12
13:00:12
13:02:11
Use jsonb_array_elements:
select j ->> 'from' as "from"
from t
cross join jsonb_array_elements(s -> 'details') as j;
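If the order of the elements matters, WITH ORDINALITY (available since Postgres 9.4) exposes each element's position; a sketch using the same placeholder table t and column s:

```sql
select j.elem ->> 'from' as "from"
from t
cross join jsonb_array_elements(s -> 'details')
     with ordinality as j(elem, idx)
order by j.idx;  -- preserves the original array order
```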
I am trying to insert data into a table in Hive I created. I’ve been struggling, so I’m trying to simplify it as much as possible to get to the root of the issue.
Here is my simplified code for creating a basic table. I basically have an array of structure with a single element.
DROP TABLE IF EXISTS foo.S_FILE_PA_JOB_DATA_T;
CREATE TABLE foo.S_FILE_PA_JOB_DATA_T
PARTITIONED BY (customer_id string)
STORED AS AVRO
TBLPROPERTIES (
'avro.schema.literal'=
'{
"namespace": "com.foo.oozie.foo",
"name": "S_FILE_PA_JOB_DATA_T",
"type": "record",
"fields":
[
{"name":"pa_hwm" ,"type":{
"type":"array",
"items":{
"type":"record",
"name":"pa_hwm_record",
"fields":
[
{"name":"pa_axis" ,"type":["int","null"]}
]
}
}}
]
}');
My problem is I can’t figure out the syntax to insert into the table.
insert into table foo.s_FILE_PA_JOB_DATA_T partition (customer_id) values (0,'a390c1cf-4ee5-4ab9-b7a3-73f5f268b669')
The 0 needs to somehow be an array<struct<int>> but I can't get the syntax right. Can anyone help? Thanks!
Unfortunately, you can't directly do that. See also Hive inserting values to an array complex type column.
In theory, you should be able to do it using something like
insert into table s_file_pa_job_data_t partition(customer_id)
values (array(named_struct('pa_axis',0)) );
that is, using the array() and named_struct() UDFs, which construct an array and a struct, respectively, from scalar values according to your specs (see the UDF documentation here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ComplexTypeConstructors)
but unfortunately if you do that you'll get
FAILED: SemanticException [Error 10293]: Unable to create temp file
for insert values Expression of type TOK_FUNCTION not supported in insert/values
because Hive does not yet support the use of UDFs in the VALUES clause. As the other posts suggest, you could do it using a dummy table, which is kind of ugly, but works.
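For completeness, a sketch of that dummy-table workaround; the inline (SELECT 1) subquery stands in for any single-row table, so adjust it to whatever your Hive version accepts:

```sql
-- Build the complex value in a SELECT instead of a VALUES clause:
INSERT INTO TABLE foo.s_file_pa_job_data_t PARTITION (customer_id)
SELECT array(named_struct('pa_axis', 0)) AS pa_hwm,
       'a390c1cf-4ee5-4ab9-b7a3-73f5f268b669' AS customer_id
FROM (SELECT 1) dummy;
```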
I have a simple data cube with organization structure hierarchy defined. In my calculations inside the cube I would like to have different calculations depending on which level of organization items is currently used in WHERE clause in MDX query.
So let's say that I have 5 levels of organization structure, and for the last level (store level) I would like to change the way that calculation is being made using expression for instance:
IIF([Organization Structure].[Parent Id].LEVEL IS
[Organization Structure].[Parent Id].[Level 05], 'THIS IS STORE', 'THIS IS NOT')
In the Visual Studio browser this results in something that we actually want, and the same happens when using an MDX query like:
SELECT { [Measures].[TEST] } ON COLUMNS
FROM [DataCubeName]
WHERE
{
[Organization Structure].[Parent Id].&[123]
}
Problems start when we want to use more than one organization structure item in the WHERE clause. Only items from the same level are allowed in this clause, and I still would like to know which level it is, but of course when we add a second item to WHERE, like so:
SELECT { [Measures].[TEST] } ON COLUMNS
FROM [DataCubeName]
WHERE
{
[Organization Structure].[Parent Id].&[123],
[Organization Structure].[Parent Id].&[124]
}
I get an error that "CurrentMember failed because the coordinate for the attribute contains a set".
That's why in my expression I have tried to use the ITEM(0) function in many different configurations, but I just couldn't find a way to apply it to the set of items currently used in the WHERE clause... So the big question is:
How do I get the set of items listed in the WHERE clause currently being executed, so that I can use Item(0) on it? Or is there any other way of retrieving the level of the currently selected items, knowing that they must all be on the same level?
Using CurrentMember combined with a set in the WHERE clause is potentially problematic.
See this post from Chris Webb: http://blog.crossjoin.co.uk/2009/08/08/sets-in-the-where-clause-and-autoexists/
Here is a possible workaround for your situation; you can try adapting it to your circumstances.
WITH
MEMBER [Measures].[x] AS
IIF
(
(existing [Geography].[Geography].[State-Province].members).item(0).Level
IS
[Geography].[Geography].[State-Province]
,'THIS IS state'
,'THIS IS NOT'
)
SELECT
{[Measures].[x]} ON COLUMNS
FROM [Adventure Works]
WHERE
(
{[Geography].[Geography].[State-Province].&[77]&[FR],
[Geography].[Geography].[State-Province].&[59]&[FR]}
);
Expanding the above to prove it works:
WITH
MEMBER [Measures].[x] AS
IIF
(
(EXISTING
[Geography].[Geography].[State-Province].MEMBERS).Item(0).Level
IS
[Geography].[Geography].[State-Province]
,'THIS IS state'
,'THIS IS NOT'
)
MEMBER [Measures].[proof] AS
(EXISTING
[Geography].[Geography].[State-Province].MEMBERS).Item(0).Member_Caption
MEMBER [Measures].[proof2] AS
(EXISTING
[Geography].[Geography].[State-Province].MEMBERS).Count
SELECT
{
[Measures].[x]
,[Measures].[proof]
,[Measures].[proof2]
} ON COLUMNS
FROM [Adventure Works]
WHERE
{
[Geography].[Geography].[State-Province].&[77]&[FR]
,[Geography].[Geography].[State-Province].&[59]&[FR]
};
Results in the following:
So your expression could become something like the following:
IIF
(
(EXISTING
[Organization Structure].[Parent Id].MEMBERS).Item(0).Level
IS
[Organization Structure].[Parent Id].[Level 05]
,'THIS IS STORE'
,'THIS IS NOT'
)