Most efficient way to query data nested deep in JSON arrays?

Currently I'm writing queries against a JSONB table with 8 million+ rows. How can I query from the parent and the friends objects in the most efficient manner possible?
Query (Postgres 9.6):
select distinct id, data->>'_id' jsonID, data->>'email' email, friends->>'name' friend_name, parent->>'name' parent
from temp t
CROSS JOIN jsonb_array_elements(t.data->'friends') friends
CROSS JOIN jsonb_array_elements(friends->'parent') parent
where friends ->> 'name' = 'Chan Franco'
and parent->>'name' = 'Hannah Golden'
Example DDL (with data): https://pastebin.com/byN7uyKx

Your regularly structured data would be cleaner, smaller and faster as a normalized relational design.
That said, to make the setup you have much faster (if not as fast as a normalized design with matching indexes), add a GIN index on the expression data->'friends':
CREATE INDEX tbl_data_friends_gin_idx ON tbl USING gin ((data->'friends'));
Then add a matching WHERE clause to your query with the contains operator @>:
SELECT DISTINCT -- why DISTINCT ?
id, data->>'_id' AS json_id, data->>'email' AS email
, friends->>'name' AS friend_name, parent->>'name' AS parent
FROM tbl t
CROSS JOIN jsonb_array_elements(t.data->'friends') friends
CROSS JOIN jsonb_array_elements(friends->'parent') parent
WHERE t.data->'friends' @> '[{"name": "Chan Franco", "parent": [{"name": "Hannah Golden"}]}]'
AND friends->>'name' = 'Chan Franco'
AND parent ->>'name' = 'Hannah Golden';
The huge difference: With the help of the index, Postgres can now identify matching rows before unnesting each and every nested "friends" array in the whole table. Only after matching rows have been identified in the underlying table is jsonb_array_elements() called, and only resulting rows with qualifying array elements are kept.
Note that the search expression has to be valid JSON, matching the structure of the JSON array data->'friends' - including the outer brackets []. But omit all key/value pairs that are not supposed to serve as filter.
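To make the containment rule more concrete, here is a rough pure-Python model of how a jsonb containment check treats nested arrays and objects. It is only a sketch: the function name contains and the sample data are made up for illustration, and it ignores Postgres details such as the special case where an array may contain a bare primitive, or numeric equality between 1 and 1.0.

```python
def contains(doc, pattern):
    """Simplified model of jsonb containment (doc @> pattern). Illustrative only."""
    if isinstance(pattern, dict):
        # Every key/value pair of the pattern must be contained in the document object.
        return isinstance(doc, dict) and all(
            key in doc and contains(doc[key], value)
            for key, value in pattern.items()
        )
    if isinstance(pattern, list):
        # Every pattern element must be contained in at least one document element;
        # order and extra elements in the document do not matter.
        return isinstance(doc, list) and all(
            any(contains(item, wanted) for item in doc) for wanted in pattern
        )
    # Scalars have to match exactly (simplification, see the note above).
    return type(doc) is type(pattern) and doc == pattern


# Invented sample value standing in for data->'friends' of one row.
friends = [
    {"name": "Chan Franco", "age": 31,
     "parent": [{"name": "Hannah Golden"}, {"name": "Al"}]},
    {"name": "Someone Else", "parent": []},
]
pattern = [{"name": "Chan Franco", "parent": [{"name": "Hannah Golden"}]}]

matches = contains(friends, pattern)         # True: extra keys and elements are ignored
no_brackets = contains(friends, pattern[0])  # False: an object pattern cannot match an array
```

The second call shows why the outer brackets matter: the pattern has to mirror the array-of-objects shape of the indexed expression.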
Related:
Index for finding an element in a JSON array
I avoided the table name temp because it is an SQL key word, which might lead to confusing errors, and used the name tbl instead.

Related

SQL query based on list from another query

I am trying to build a query that will generate a list of records based on the results of a very similar query.
Here are the details and examples
Query 1: Generate a list of part #'s in a specific location of the warehouse.
Query 2: Use the list of part #'s generated in #1 to show all locations for the list of part #'s, assuming they will be in both the location specified in #1 and other locations.
Query 1 looks like this:
Select
ItemMaster.ItemNo, BinInfo.BIN, ItemDetail.Qty, ItemDetail.Whouse_ID
From
((ItemDetail
Left Join
ItemMaster on ItemMaster.ID=ItemDetail.Item_ID)
Left Join
BinInfo on BinInfo.ID = ItemDetail.Bin_ID)
Where
ItemDetail.Whouse_ID = '1'
And BinInfo.Bin = 'VLM';
Query 2 needs to be almost identical except the ItemMaster.ItemNo list will come from query #1.
Any help here would be great. I don't know if I need to learn Unions, Nested Queries, or what.
Make sure that your first query returns the list of ids that you need.
Then write the second query with the WHERE id IN (...) syntax:
SELECT * FROM table1 WHERE id IN
(SELECT id FROM table2 WHERE...) -- first query
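Applied to the warehouse tables from the question, the IN (subquery) pattern could look roughly like the sketch below. It runs against an in-memory SQLite database so it can be tried anywhere. The column types and sample rows are invented, and it assumes the second query should list every location of the matching parts, so the outer query has no warehouse filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ItemMaster (ID INTEGER PRIMARY KEY, ItemNo TEXT);
    CREATE TABLE BinInfo    (ID INTEGER PRIMARY KEY, Bin TEXT);
    CREATE TABLE ItemDetail (Item_ID INTEGER, Bin_ID INTEGER, Qty INTEGER, Whouse_ID TEXT);

    -- Invented sample data
    INSERT INTO ItemMaster VALUES (1, 'P-100'), (2, 'P-200'), (3, 'P-300');
    INSERT INTO BinInfo    VALUES (1, 'VLM'), (2, 'A-01'), (3, 'B-07');
    INSERT INTO ItemDetail VALUES
        (1, 1, 5, '1'),   -- P-100 in VLM
        (1, 2, 9, '1'),   -- P-100 also in A-01
        (2, 3, 4, '1'),   -- P-200 only in B-07
        (3, 1, 2, '1');   -- P-300 only in VLM
""")

rows = conn.execute("""
    SELECT im.ItemNo, b.Bin, d.Qty, d.Whouse_ID
    FROM ItemDetail d
    JOIN ItemMaster im ON im.ID = d.Item_ID
    JOIN BinInfo    b  ON b.ID  = d.Bin_ID
    WHERE d.Item_ID IN (              -- query 1, reduced to the item ids
          SELECT d1.Item_ID
          FROM ItemDetail d1
          JOIN BinInfo b1 ON b1.ID = d1.Bin_ID
          WHERE d1.Whouse_ID = '1' AND b1.Bin = 'VLM'
      )
    ORDER BY im.ItemNo, b.Bin
""").fetchall()

for row in rows:
    print(row)
```

P-200 never sits in the VLM bin, so it drops out, while P-100 is listed once for each bin it occupies.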

SOQL Query for Left Join for custom objects

I have a requirement to fetch data from Salesforce. I need to get the data from two custom objects. I have written the query in SQL; can anyone help me convert it into SOQL?
SELECT ID, Name, Crop_Year__c, Targeted_Enrollment_Segments__c, Description__c, Start_Date__c, End_Date__c
FROM Enrollment_Program__c EP
LEFT JOIN Account_Enrollment__c AE ON EP.Crop_Year__c = AE.Crop_Year__c AND EP.ID = AE.Enrollment_Program__c
WHERE AE.Account__c = 'xyz'
As you probably know, Salesforce SOQL doesn't have explicit JOIN clauses. It does that for you implicitly based on related object fields. That means you'll have to query Account_Enrollment__c and traverse the fields to get the related Enrollment_Program__c Lookup relationship.
Another problem is Salesforce only performs joins based on primary and foreign keys, so the EP.Crop_Year__c = AE.Crop_Year__c in your query won't work.
So, with that said, you can try this:
SELECT Enrollment_Program__c, Enrollment_Program__r.Name,
Enrollment_Program__r.Crop_Year__c, Enrollment_Program__r.Targeted_Enrollment_Segments__c,
Enrollment_Program__r.Description__c, Enrollment_Program__r.Start_Date__c,
Enrollment_Program__r.End_Date__c
FROM Account_Enrollment__c WHERE Account__c = 'xyz'
If you know beforehand what the Crop_Year__c value is, you can just add this to your query:
AND Crop_Year__c = :year AND Enrollment_Program__r.Crop_Year__c = :year
Some details on the queries:
The __r suffix is how you get the lookup object addressed in the query. If you are interested only in the id, you can use __c.
The :year is how you pass the parameter year to the query. If you want to append it as text you can just use ... Crop_Year='+ year + '.

Redshift query array<varchar(128)> returning encoded values

We have an external table created in Redshift like this:
CREATE EXTERNAL TABLE spectrum.my_table(
insert_id varchar(128),
attribution_ids array<varchar(100)>)
PARTITIONED BY (
event_date varchar(128))
STORED AS PARQUET
LOCATION
's3://my_bucket/my_path'
The setup itself works, but when we query the array<varchar> field as the documentation describes:
SELECT c.insert_id, a FROM
spectrum.my_table c, c.attribution_ids a LIMIT 10
Redshift returns the insert_id correctly, but the array values come back encoded. See below:
"insert_id", "o"
"0baed794-df11-4032-b13c-aac5d0deced7" "0b8ad4fd9af12804ffaea83f4886672b"
The source data should be like:
"0baed794-df11-4032-b13c-aac5d0deced7", [0baed794-df11-4032-b13c-aac5d0deced7, 0baed794-df11-4032-b13c-aac5d0deced7]
When we run SELECT * FROM my_table in Athena, it returns the array with the correct data.
What should I do here?
Redshift does not support nested data types.
Redshift Spectrum has simple support for nested data types: collection types like array or map have to be unnested (exploded) before selecting.
Unnesting is effectively a CROSS JOIN of every collection item with the row that the collection belongs to.
Notice the syntax: ... FROM table a, a.collection_column b ... - in classical queries the comma is a synonym for CROSS JOIN.
So what you are seeing in the "o" column is one of the items from the attribution_ids array.
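As a rough illustration of what that unnesting does, the sketch below pairs each row with every item of its array, which is the shape of result the comma syntax produces. The function name unnest and the sample rows are invented for the example. Dropping rows with empty arrays is how I understand the plain comma form to behave, so treat that detail as an assumption.

```python
# Invented rows standing in for spectrum.my_table.
rows = [
    {"insert_id": "row-1", "attribution_ids": ["a1", "a2"]},
    {"insert_id": "row-2", "attribution_ids": []},
    {"insert_id": "row-3", "attribution_ids": ["a3"]},
]


def unnest(rows, array_column):
    """Yield one (insert_id, item) pair per array item, like FROM t c, c.array_column a."""
    for row in rows:
        for item in row[array_column]:
            yield row["insert_id"], item


flattened = list(unnest(rows, "attribution_ids"))
for pair in flattened:
    print(pair)
```

Each output row carries a single array item, never the whole array, which is why the query shows one value per row instead of the bracketed list Athena displays.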

Distinct with long columns

I have a database schema with tables that have long fields (of type "text" in both MS-SQL-Server and Sybase), and I need to retrieve distinct rows.
The tables look like
create table node (id int primary key, … a few more fields … data text);
create table ref (id int primary key, node_id int, … a few more fields);
For one row in "node", there may be zero or more rows in "ref".
Now I have a query like
SELECT node.* FROM node, ref WHERE node.id = ref.node_id AND ... some more restrictions.
This query returns duplicates and triplicates when there is more than a single row in "ref" for some "node_id".
But I need unique rows!
Using SELECT DISTINCT node.* does not work because of the columns of type "text" :-(
In Sybase there is a trick: just add "GROUP BY node.id" to the query and voila, you get unique rows returned.
Is there a similarly simple trick for MS-SQL-Server?
I already have a solution with temporary tables, but it seems to be a lot slower. Maybe the reason is just the larger number of statements transferred to the database?
It looks like you are approaching this problem from the wrong direction. Joins are typically used to expand on keys where relevant data is stored in different tables. So it's no surprise you are getting more than one row per node_id.
In your query, you join the two tables together, but then you ignore everything from ref. It looks like you're just trying to filter out ids from node that are not referenced in ref. If that is the case, then you don't want to use a join. The following will work much better
select *
from node
where id in (
select node_id
from ref
where [any restrictions placed on the ref table go here]
)
and [any restrictions placed on the node table go here]
Furthermore, at the risk of teaching you bad join practices, the same thing can be accomplished the way you were originally trying to do it, but it's more painful to write and it's not good practice:
select node.col1, node.col2, ... , node.last_col
FROM node
inner join ref on node.id = ref.node_id
where [some restrictions.]
group by node.col1, node.col2, ... , node.last_col
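The difference between the join and the IN filter is easy to see on a tiny data set. The sketch below uses an in-memory SQLite database with invented rows and a cut-down version of the two tables, so it only illustrates the row counts, not SQL Server's handling of "text" columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE ref  (id INTEGER PRIMARY KEY, node_id INTEGER);

    -- Invented sample data: node 1 is referenced three times, node 3 never.
    INSERT INTO node VALUES (1, 'long text A'), (2, 'long text B'), (3, 'long text C');
    INSERT INTO ref  VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# The original join: one output row per matching ref row.
joined = conn.execute(
    "SELECT node.id FROM node, ref WHERE node.id = ref.node_id ORDER BY node.id"
).fetchall()

# The IN filter: each node appears at most once, no DISTINCT needed.
filtered = conn.execute(
    "SELECT id FROM node WHERE id IN (SELECT node_id FROM ref) ORDER BY id"
).fetchall()

print(joined)    # node 1 shows up three times
print(filtered)  # nodes 1 and 2, once each
```

Because the subquery only decides whether a node qualifies, the long columns never have to be compared or grouped.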

Hierarchical SQL select-query

I'm using MS SQL Server 2008 and have a table 'Users'. It has a key field ID of type bigint and a varchar field Parents, which encodes the whole chain of the user's parent IDs.
For example:
User table:
ID | Parents
1 | null
2 | ..
3 | ..
4 | 3,2,1
Here user 1 has no parents and user 4 has a chain of parents 3->2->1. I created a function which parses the user's Parents field and returns a result table of bigint user IDs.
Now I need a query which will select and join the IDs of some requested users and their parents (the order of users and their parents is not important). I'm not an SQL expert, so all I could come up with is the following:
WITH CTE AS(
SELECT
ID,
Parents
FROM
[Users]
WHERE
(
[Users].Name = 'John'
)
UNION ALL
SELECT
[Users].Id,
[Users].Parents
FROM [Users], CTE
WHERE
(
[Users].ID in (SELECT * FROM GetUserParents(CTE.ID, CTE.Parents) )
))
SELECT * FROM CTE
And basically it works, but the performance of this query is very poor. I believe the WHERE .. IN .. expression is the bottleneck. As I understand it, instead of just joining the first subquery of the CTE (IDs of the found users) with the results of GetUserParents (IDs of the users' parents), it has to enumerate all users in the Users table and check whether each of them is part of the function's result. Judging by the execution plan, SQL Server applies a distinct sort to the result to speed up the WHERE .. IN .. check, which is logical in itself but not required for my goal, and this distinct sort takes 70% of the query's execution time. So I wonder how this query could be improved, or whether somebody could suggest another approach to this problem altogether?
Thanks for any help!
The recursive query in the question looks redundant since you already form the list of IDs needed in GetUserParents. Maybe change this into SELECT from Users and GetUserParents() with WHERE/JOIN.
select Users.*
from Users join
    (select ParentId
     from (SELECT * FROM Users where Users.Name='John') as U
          cross apply GetUserParents(U.ID, U.Parents))
    as gup
on Users.ID = gup.ParentId
Since GetUserParents expects scalars and select ... where produces a table, we need to apply the function to each row of the table (even if we "know" there's only one). That's what apply does.
I used indents to emphasize the conceptual parts of the query. (select...) as gup is the entity Users is join'd with; (select...) as U cross apply fn() is the argument to FROM.
The key to understanding this query is knowing how cross apply works:
it's a part of a FROM clause (quite unexpectedly; so the syntax is at FROM (Transact-SQL))
it transforms the table expression left of it, and the result becomes the argument for the FROM (I emphasized this with indentation)
The transformation is: for each row, it
runs a table expression right of it (in this case, a call of a table-valued function), using this row
adds to the result set the columns from the row, followed by the columns from the call. (In our case, the table returned from the function has a single column named ParentId)
So, if the call returns multiple rows, the added records will be the same row from the table appended with each row from the function.
This is a cross apply so rows will only be added if the function returns anything. If this was the other flavor, outer apply, a single row would be added anyway, followed by a NULL in the function's column if it returned nothing.
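If the apply semantics are easier to follow as plain code, the two flavors can be modelled roughly as below. This is a Python sketch of the behaviour described above, not T-SQL. The helper get_user_parents is a hypothetical stand-in for the table-valued function, and the sample users are invented.

```python
def cross_apply(rows, table_function):
    """Emit a left row once per row the function returns; drop it if nothing comes back."""
    for row in rows:
        for extra in table_function(row):
            yield {**row, **extra}


def outer_apply(rows, table_function, null_row):
    """Like cross_apply, but a left row without function output is kept once, padded with NULLs."""
    for row in rows:
        results = list(table_function(row))
        for extra in results or [null_row]:
            yield {**row, **extra}


def get_user_parents(row):
    """Hypothetical stand-in for the table-valued function: splits the Parents string."""
    parents = row["Parents"]
    return [{"ParentId": int(p)} for p in parents.split(",")] if parents else []


# Invented sample rows matching the shape of the Users table.
users = [{"ID": 1, "Parents": None}, {"ID": 4, "Parents": "3,2,1"}]

crossed = list(cross_apply(users, get_user_parents))
outer = list(outer_apply(users, get_user_parents, {"ParentId": None}))

print(len(crossed))  # 3: user 1 has no parents and disappears
print(len(outer))    # 4: user 1 is kept with ParentId None
```

User 4 is repeated once per parent in both results. Only the treatment of user 1, whose function call returns nothing, differs.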
This "parsing" thing violates even the 1NF. Make Parents field contain only the immediate parent (preferably, a foreign key), then an entire subtree can be retrieved with a recursive query.
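A minimal sketch of that normalized layout, run against in-memory SQLite: it assumes a hypothetical ParentID column that holds only the immediate parent, with invented sample rows. It walks upward from the requested user, since the question asks for users and their parents. SQLite spells the CTE WITH RECURSIVE, whereas SQL Server uses plain WITH for recursive CTEs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized table: ParentID stores only the immediate parent.
    CREATE TABLE Users (
        ID       INTEGER PRIMARY KEY,
        Name     TEXT,
        ParentID INTEGER REFERENCES Users(ID)
    );
    INSERT INTO Users VALUES (1, 'Root', NULL), (2, 'Mid', 1), (3, 'Low', 2), (4, 'John', 3);
""")

chain = conn.execute("""
    WITH RECURSIVE ancestry(ID, ParentID) AS (
        -- anchor: the requested user
        SELECT ID, ParentID FROM Users WHERE Name = 'John'
        UNION ALL
        -- recursive step: the immediate parent of each row found so far
        SELECT u.ID, u.ParentID
        FROM Users u
        JOIN ancestry a ON u.ID = a.ParentID
    )
    SELECT ID FROM ancestry
""").fetchall()

print(sorted(chain))  # John (4) plus his parents 3, 2 and 1
```

Every step is an ordinary join on an integer key, so no string parsing function is needed.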
