We wish to perform a GROUP BY operation on a table that contains an ARRAY column. Within each group, the contents of these arrays should be merged into a single array with unique elements. No ordering of the elements is required.
Newest PostgreSQL versions are available.
Example original table:
id | fruit | flavors
---: | :----- | :---------------------
| apple | {sweet,sour,delicious}
| apple | {sweet,tasty}
| banana | {sweet,delicious}
Example desired result:
count_total | aggregated_flavors
----------: | :---------------------------
1 | {delicious,sweet}
2 | {sour,tasty,delicious,sweet}
SQL toy code to create the original table:
CREATE TABLE example(id int, fruit text, flavors text ARRAY);
INSERT INTO example (fruit, flavors)
VALUES ('apple', ARRAY [ 'sweet','sour', 'delicious']),
('apple', ARRAY [ 'sweet','tasty' ]),
('banana', ARRAY [ 'sweet', 'delicious']);
We have come up with a solution that requires transforming the arrays to strings:
SELECT COUNT(*) AS count_total,
       array(SELECT DISTINCT unnest(string_to_array(replace(replace(string_agg(flavors::text, ','), '{', ''), '}', ''), ','))) AS aggregated_flavors
FROM example
GROUP BY fruit
However, we think this is not optimal, and it may be problematic because we assume the strings contain neither "{", "}", nor ",". It feels like there must be functions to combine arrays the way we need, but we weren't able to find them.
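For instance (a contrived illustration), an array element that itself contains a comma does not round-trip correctly through the string representation:

```sql
-- An element containing a comma is quoted in the array's text form ...
SELECT ARRAY['sweet, sour']::text;   -- {"sweet, sour"}

-- ... so stripping the braces and splitting on ',' breaks it apart:
SELECT string_to_array(replace(replace('{"sweet, sour"}', '{', ''), '}', ''), ',');
-- yields two mangled elements: '"sweet' and ' sour"'
```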
Thanks a lot everyone!
demo:db<>fiddle
Assuming each record contains a unique id value:
SELECT
fruit,
array_agg(DISTINCT flavor), -- 2
COUNT(DISTINCT id) -- 3
FROM
example,
unnest(flavors) AS flavor -- 1
GROUP BY fruit
1. unnest() the array elements
2. Group by fruit value: array_agg() for the distinct flavors
3. Group by fruit value: COUNT() for the distinct ids within each fruit group
If the id column is really empty, you can generate the id values, for example with the row_number() window function:
demo:db<>fiddle
SELECT
*
FROM (
SELECT
*, row_number() OVER () as id
FROM example
) s,
unnest(flavors) AS flavor
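Putting the two pieces together (a sketch combining the queries above), the generated id feeds the same aggregation:

```sql
-- Generate a surrogate id per row, then aggregate as before
SELECT fruit,
       array_agg(DISTINCT flavor) AS aggregated_flavors,
       COUNT(DISTINCT id)         AS count_total
FROM (
    SELECT *, row_number() OVER () AS id
    FROM example
) s,
unnest(flavors) AS flavor
GROUP BY fruit;
```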
Related
I have a table in a database with a column which has values like XX-xx-cccc-ff-gg. Let's assume this is table ABC and column is called ABC_FORMAT_STR. In another table, ABC_FORMAT_ELEMENTS I have a column called CHARS with values like, A, B, C, D... X, Y, Z, a, d, f, g, x, y, z etc. (please don't assume I have all ASCII values there, it's mainly some letters and numbers plus some special characters like *, ;, -, & etc.).
I need to add a constraint in [ABC].[ABC_FORMAT_STR] column, in such a way so, each and every character of every value of that column, should exist in [ABC_FORMAT_ELEMENTS].[CHARS]
Is this possible? Can someone help me with this?
Thank you very much in advance.
This is an example with simple names, keeping the names of the object above for clarity:
Example
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Adam
SELECT [CHARS] FROM [ABC_FORMAT_ELEMENTS]
A
G
N
a
c
e
g
i
k
o
r
After the constraint:
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Note on the result:
"Adam" cannot be included because "d" and "m" character are not in [ABC_FORMAT_ELEMENTS] table.
Here is a simple and natural solution based on the TRANSLATE() function.
It works in SQL Server 2017 onwards.
SQL
-- DDL and sample data population, start
DECLARE @ABC TABLE (ABC_FORMAT_STR VARCHAR(50));
INSERT INTO @ABC VALUES
('Nick'),
('George'),
('Adam');
DECLARE @ABC_FORMAT_ELEMENTS TABLE (CHARS CHAR(1));
INSERT INTO @ABC_FORMAT_ELEMENTS VALUES
('A'), ('G'), ('N'), ('a'), ('c'), ('e'), ('g'),
('i'), ('k'), ('o'), ('r');
-- DDL and sample data population, end
SELECT a.*
 , t1.legitChars
 , t2.badChars
FROM @ABC AS a
CROSS APPLY (SELECT STRING_AGG(CHARS, '') FROM @ABC_FORMAT_ELEMENTS) AS t1(legitChars)
CROSS APPLY (SELECT TRANSLATE(a.ABC_FORMAT_STR, t1.legitChars, SPACE(LEN(t1.legitChars)))) AS t2(badChars)
WHERE TRIM(t2.badChars) = '';
Output
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
+----------------+-------------+----------+
Output with WHERE clause commented out
Just to see why the row with the 'Adam' value was filtered out.
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
| Adam | AGNacegikor | d m |
+----------------+-------------+----------+
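Since the question asks for a constraint, the same TRANSLATE() check could be wrapped in a scalar function referenced by a CHECK constraint. This is a hypothetical sketch (the function name, lengths, and schema are assumptions, not part of the answer above):

```sql
-- Hypothetical sketch: wrap the TRANSLATE() check in a scalar UDF,
-- then reference it from a CHECK constraint on the real tables.
CREATE FUNCTION dbo.OnlyLegitChars (@s VARCHAR(50))
RETURNS BIT
AS
BEGIN
    DECLARE @legit VARCHAR(4000);
    SELECT @legit = STRING_AGG(CHARS, '') FROM ABC_FORMAT_ELEMENTS;
    -- Replace every legit character with a space; anything left is illegal
    RETURN IIF(TRIM(TRANSLATE(@s, @legit, SPACE(LEN(@legit)))) = '', 1, 0);
END;
GO

ALTER TABLE ABC ADD CONSTRAINT CK_ABC_FORMAT_STR
    CHECK (dbo.OnlyLegitChars(ABC_FORMAT_STR) = 1);
```

Note that a UDF-based CHECK constraint is only evaluated on INSERT/UPDATE of ABC; later changes to ABC_FORMAT_ELEMENTS will not re-validate existing rows.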
Based on your sample data, here's one method to identify valid/invalid rows in ABC. You could easily adapt this to be part of a trigger that can check inserted or updated rows in inserted and rollback if any rows violate the criteria.
This uses a tally/numbers table (very often used for splitting strings). Here one is defined in a CTE, but a permanent solution would keep a reusable numbers table.
The logic is to split the strings into rows and then count the rows that exist in the lookup table and reject any with a count of rows that is less than the length of the string.
with
numbers (n) as (select top 100 Row_Number() over (order by (select null)) from sys.messages ),
strings as (
select a.ABC_FORMAT_STR, Count(*) over(partition by a.ABC_FORMAT_STR) n
from abc a cross join numbers n
where n.n<=Len(a.ABC_FORMAT_STR)
and exists (select * from ABC_FORMAT_ELEMENTS e where e.chars=Substring(a.ABC_FORMAT_STR, n.n, 1))
)
select ABC_FORMAT_STR
from strings
where Len(ABC_FORMAT_STR)=n
group by ABC_FORMAT_STR
/* change to where Len(ABC_FORMAT_STR) <> n to find rows that aren't allowed */
See this DB Fiddle
This is my (extremely simplified) product table and some test data.
drop table if exists product cascade;
create table product (
product_id integer not null,
reference varchar,
price decimal(13,4),
primary key (product_id)
);
insert into product (product_id, reference, price) values
(1001, 'MX-232', 100.00),
(1011, 'AX-232', 20.00),
(1003, 'KKK 11', 11.00),
(1004, 'OXS SUPER', 0.35),
(1005, 'ROR-MOT', 200.00),
(1006, '234PPP', 30.50),
(1007, 'T555-NS', 110.25),
(1008, 'LM234-XS', 101.20),
(1009, 'MOTOR-22', 12.50),
(1010, 'MOTOR-11', 30.00),
(1002, 'XUL-XUL1', 40.00);
In real life, listing product columns is a tough task, full of joins, case-when-end clauses, etc. On the other hand, there are a large number of queries to fulfill: products by brand, featured products, products by title, by tags, by range of price, etc.
I don't want to repeat and maintain the complex product column listings every time I perform a query so, my current approach is breaking query processes in two tasks:
1. Encapsulate the query in functions of type select_products_by_xxx() that return product_id arrays, properly selected and ordered.
2. Encapsulate all the product column complexity in a unique function list_products() that takes a product_id array as a parameter.
3. Execute select * from list_products(select_products_by_xxx()) to obtain the desired result for every xxx function.
For example, to select product_id in reverse order (in case this was a meaningful selection for the application), a function like this would do the trick.
create or replace function select_products_by_inverse ()
returns int[]
as $$
select
array_agg(product_id order by product_id desc)
from
product;
$$ language sql;
It can be tested to work as
select * from select_products_by_inverse();
select_products_by_inverse |
--------------------------------------------------------|
{1011,1010,1009,1008,1007,1006,1005,1004,1003,1002,1001}|
To encapsulate the "listing" part of the query I use this function (again, extremely simplified and without any join or case for the benefit of the example).
create or replace function list_products (
tid int[]
)
returns table (
id integer,
reference varchar,
price decimal(13,4)
)
as $$
select
product_id,
reference,
price
from
product
where
product_id = any (tid);
$$ language sql;
It works, but does not respect the order of products in the passed array.
select * from list_products(select_products_by_inverse());
id |reference|price |
----|---------|--------|
1001|MX-232 |100.0000|
1011|AX-232 | 20.0000|
1003|KKK 11 | 11.0000|
1004|OXS SUPER| 0.3500|
1005|ROR-MOT |200.0000|
1006|234PPP | 30.5000|
1007|T555-NS |110.2500|
1008|LM234-XS |101.2000|
1009|MOTOR-22 | 12.5000|
1010|MOTOR-11 | 30.0000|
1002|XUL-XUL1 | 40.0000|
So, the problem is I am passing a custom ordered array of product_id but the list_products() function does not respect the order inside the array.
Obviously, I could include an order by clause in list_products(), but remember that the ordering must be determined by the select_products_by_xxx() functions to keep the list_products() unique.
Any idea?
EDIT
@adamkg's solution is simple and works: add a universal order by clause like this:
order by array_position(tid, product_id);
However, this means ordering products twice: first inside select_products_by_xxx() and then inside list_products().
An explain exploration renders the following result:
QUERY PLAN |
----------------------------------------------------------------------|
Sort (cost=290.64..290.67 rows=10 width=56) |
Sort Key: (array_position(select_products_by_inverse(), product_id))|
-> Seq Scan on product (cost=0.00..290.48 rows=10 width=56) |
Filter: (product_id = ANY (select_products_by_inverse())) |
Now I am wondering if there is any other better approach to reduce cost, keeping separability between functions.
I see two promising strategies:
As for the EXPLAIN output and the issue itself, it seems that a complete scan of table product is being done inside list_products(). As there may be thousands of products, a better approach would be to scan the passed array instead.
The xxx functions can be refactored to return setof int instead of int[]. However, a set cannot be passed as a function parameter.
For long arrays you typically get (much!) more efficient query plans by unnesting the array and joining to the main table. In simple cases, this even preserves the original order of the array without adding ORDER BY. Rows are processed in order. But there are no guarantees, and the order may be broken with more joins or with parallel execution etc. To be sure, add WITH ORDINALITY:
CREATE OR REPLACE FUNCTION list_products (tid int[]) -- VARIADIC?
RETURNS TABLE (
id integer,
reference varchar,
price decimal(13,4)
)
LANGUAGE sql STABLE AS
$func$
SELECT product_id, p.reference, p.price
FROM unnest(tid) WITH ORDINALITY AS t(product_id, ord)
JOIN product p USING (product_id) -- LEFT JOIN ?
ORDER BY t.ord
$func$;
Fast, simple, safe. See:
PostgreSQL unnest() with element number
Join against the output of an array unnest without creating a temp table
You might want to throw in the modifier VARIADIC, so you can call the function with an array or a list of IDs (max 100 items by default). See:
Return rows matching elements of input array in plpgsql function
Call a function with composite type as argument from native query in jpa
Function to select column values by comparing against comma separated list
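For example (a sketch based on the function above; only the parameter declaration changes):

```sql
-- Declaring the parameter VARIADIC allows both call styles:
CREATE OR REPLACE FUNCTION list_products(VARIADIC tid int[])
RETURNS TABLE (id integer, reference varchar, price decimal(13,4))
LANGUAGE sql STABLE AS
$func$
SELECT product_id, p.reference, p.price
FROM unnest(tid) WITH ORDINALITY AS t(product_id, ord)
JOIN product p USING (product_id)
ORDER BY t.ord
$func$;

SELECT * FROM list_products(1004, 1001);                     -- a list of IDs
SELECT * FROM list_products(VARIADIC '{1004,1001}'::int[]);  -- an array
```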
I would declare STABLE function volatility.
You might use LEFT JOIN instead of JOIN to make sure that all given IDs are returned - with NULL values if a row with given ID has gone missing.
db<>fiddle here
Note a subtle logic difference with duplicates in the array. While product_id is UNIQUE ...
unnest + left join returns exactly one row for every given ID - preserving duplicates in the given IDs if any.
product_id = any (tid) folds duplicates. (One of the reasons it typically results in more expensive query plans.)
If there are no dupes in the given array, there is no difference. If there can be duplicates and you want to fold them, your task is ambiguous, as it's undefined which position to keep.
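To illustrate the difference (a contrived sketch against the sample product table):

```sql
-- unnest + join: one output row per array element, duplicates preserved
SELECT p.product_id
FROM unnest('{1001,1001}'::int[]) WITH ORDINALITY AS t(product_id, ord)
JOIN product p USING (product_id);   -- two rows

-- = ANY: duplicates in the array are folded
SELECT product_id
FROM product
WHERE product_id = ANY ('{1001,1001}'::int[]);   -- one row
```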
You're very close, all you need to add is ORDER BY array_position(tid, product_id).
testdb=# create or replace function list_products (
tid int[]
)
returns table (
id integer,
reference varchar,
price decimal(13,4)
)
as $$
select
product_id,
reference,
price
from
product
where
product_id = any (tid)
-- add this:
order by array_position(tid, product_id);
$$ language sql;
CREATE FUNCTION
testdb=# select * from list_products(select_products_by_inverse());
id | reference | price
------+-----------+----------
1011 | AX-232 | 20.0000
1010 | MOTOR-11 | 30.0000
1009 | MOTOR-22 | 12.5000
1008 | LM234-XS | 101.2000
1007 | T555-NS | 110.2500
1006 | 234PPP | 30.5000
1005 | ROR-MOT | 200.0000
1004 | OXS SUPER | 0.3500
1003 | KKK 11 | 11.0000
1002 | XUL-XUL1 | 40.0000
1001 | MX-232 | 100.0000
(11 rows)
Table structure is
+------------+---------+
| Animals | Herbs |
+------------+---------+
| [Cat, Dog] | [Basil] |
| [Dog, Lion]| [] |
+------------+---------+
Desired output (don't care about sorting of this list):
unique_things
+------------+
[Cat, Dog, Lion, Basil]
First attempt was something like
SELECT ARRAY_CAT(ARRAY_AGG(DISTINCT(animals)), ARRAY_AGG(herbs))
But this produces
[[Cat, Dog], [Dog, Lion], [Basil], []]
since the DISTINCT operates on each array as a whole, not on the distinct elements within all the arrays.
If I understand your requirements right, and assuming the source table is populated as
insert into tabarray select array_construct('cat', 'dog'), array_construct('basil');
insert into tabarray select array_construct('lion', 'dog'), null;
I would say the result would look like this:
select array_agg(distinct value)
from
(
  select value
  from tabarray
  , lateral flatten( input => col1 )
  union all
  select value
  from tabarray
  , lateral flatten( input => col2 )
);
UPDATE
It is possible without using FLATTEN, by using ARRAY_UNION_AGG:
Returns an ARRAY that contains the union of the distinct values from the input ARRAYs in a column.
For sample data:
CREATE OR REPLACE TABLE t AS
SELECT ['Cat', 'Dog'] AS Animals, ['Basil'] AS Herbs
UNION SELECT ['Dog', 'Lion'], [];
Query:
SELECT ARRAY_UNION_AGG(ARRAY_CAT(Animals, Herbs)) AS Result
FROM t
or:
SELECT ARRAY_UNION_AGG(Animals) AS Result
FROM (SELECT Animals FROM t
UNION ALL
SELECT Herbs FROM t);
You could flatten the combined array and then aggregate back:
SELECT ARRAY_AGG(DISTINCT F."VALUE") AS unique_things
FROM tab, TABLE(FLATTEN(ARRAY_CAT(tab.Animals, tab.Herbs))) f
Here is another variation to handle NULLs in case they appear in data set.
SELECT ARRAY_AGG(DISTINCT a.VALUE) AS unique_things
FROM tab, TABLE(FLATTEN(array_compact(array_append(tab.Animals, tab.Herbs)))) a
Sorry for the bad title. I couldn't think of a better way to describe my issue.
I have the following table:
Category | A | B
A | 1 | 2
A | 2 | 1
B | 3 | 4
B | 4 | 3
I would like to group the data by Category, return only 1 line per category, but provide both values of columns A and B.
So the result should look like this:
category | resultA | resultB
A | 1 | 2
B | 4 | 3
How can this be achieved?
I tried this statement:
SELECT category, a, b
FROM table
GROUP BY category
but obviously, I get the following errors:
Column 'a' is invalid in the select list because it is not contained
in either an aggregate function or the GROUP BY clause.
Column 'b' is invalid in the select list because it is not contained in either an
aggregate function or the GROUP BY clause.
How can I achieve the desired result?
Try this:
SELECT category, MIN(a) AS resultA, MAX(a) AS resultB
FROM table
GROUP BY category
If the values are mirrored, then you can get both values by applying MIN and MAX to a single column such as a.
Seems you don't really want to aggregate per category, but rather remove duplicate rows from your result (or rather, rows that you consider duplicates).
You consider a pair (x,y) equal to the pair (y,x). To find duplicates, you can put the lower value in the first place and the greater in the second, and then apply DISTINCT on the rows:
select distinct
category,
case when a < b then a else b end as attr1,
case when a < b then b else a end as attr2
from mytable;
Assuming you want an arbitrary record from the duplicates for each category, here is one trick using a table value constructor and the Row_Number window function:
;with cte as
(
SELECT *,
(SELECT Min(min_val) FROM (VALUES (a),(b))tc(min_val)) min_val,
(SELECT Max(max_val) FROM (VALUES (a),(b))tc(max_val)) max_val
FROM (VALUES ('A',1,2),
('A',2,1),
('B',3,4),
('B',4,3)) tc(Category, A, B)
)
select Category,A,B from
(
Select Row_Number() Over(Partition by category, min_val, max_val order by (select NULL)) as Rn, *
From cte
) A
Where Rn = 1
I have a problem with GROUP BY in SQL Server.
I have this simple SQL statement:
select *
from Factors
group by moshtari_ID
and I get this error :
Msg 8120, Level 16, State 1, Line 1
Column 'Factors.ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
This is my result without GROUP BY:
and this is the error with the GROUP BY command:
Where is my problem ?
In general, once you start GROUPing, every column listed in your SELECT must be either a column in your GROUP or some aggregate thereof. Let's say you have a table like this:
| ID | Name | City |
| 1 | Foo bar | San Jose |
| 2 | Bar foo | San Jose |
| 3 | Baz Foo | Santa Clara |
If you wanted to get a list of all the cities in your database, and tried:
SELECT * FROM table GROUP BY City
...that would fail, because you're asking for columns (ID and Name) that aren't in the GROUP BY clause. You could instead:
SELECT City, count(City) as Cnt FROM table GROUP BY City
...and that would get you:
| City | Cnt |
| San Jose | 2 |
| Santa Clara | 1 |
...but would NOT get you ID or Name. You can do more complicated things with e.g. subselects or self-joins, but basically what you're trying to do isn't possible as-stated. Break down your problem further (what do you want the data to look like?), and go from there.
Good luck!
When you group, you can select only the columns you group by. Other columns need to be aggregated. This can be done with functions like min(), avg(), count(), ...
Why is this? Because GROUP BY collapses multiple records into unique ones. But what about the columns that are not unique within a group? The DB needs a rule for how to display them - aggregation.
You need to apply an aggregate function such as max(), avg(), or count() to the non-grouped columns.
For example, this query will sum totalAmount for each moshtari_ID:
select moshtari_ID,sum (totalAmount) from Factors group by moshtari_ID;
output will be
moshtari_ID SUM
2 120000
1 200000
Try this:
select *
from Factors
Group by ID, date, time, factorNo, trackingNo, totalAmount, createAt, updateAt, bark_ID, moshtari_ID
If you are applying a GROUP BY clause, then you can use only the grouped columns and aggregate functions in the select list.
syntax:
SELECT expression1, expression2, ... expression_n,
aggregate_function (aggregate_expression)
FROM tables
[WHERE conditions]
GROUP BY expression1, expression2, ... expression_n
[ORDER BY expression [ ASC | DESC ]];
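For instance, the syntax instantiated against the Factors table from the question (a sketch; the filter value is made up):

```sql
-- Group rows per customer, filter, aggregate, and sort by the aggregate
SELECT moshtari_ID,
       SUM(totalAmount) AS sum_total
FROM Factors
WHERE totalAmount > 0
GROUP BY moshtari_ID
ORDER BY sum_total DESC;
```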