PostgreSQL performance on joins using integer[] columns vs. link table - arrays

We have a table user_info with high read-write-update activity containing millions of rows, and another table combinations with low write, high read activity containing thousands of rows. We need to link these two tables in a one-to-many relationship via the unique ID of combinations, each row of which is composed of 3 integers. Each row in user_info will be linked to ~5 rows in combinations.
Example combinations:
id, some_id1, some_id2, some_id3
1, 10, 10, 100
2, 10, 21, 201
3, 20, 21, 201
Example user_info (with array column approach):
id, some_column_1, some_column_2, combination_ids
1, 'Bla bla', 'Smth', {1,3}
2, 'Bla bla', 'Smth Smth', {2,3}
We know that this could be done either with an integer[] column on user_info or with a link table. We are interested in optimizing for performance, specifically for joins between the two tables involving many rows. All tables would be indexed on the relevant columns. Would it be faster to join using the array column or using a link table?
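For reference, here is a minimal sketch of the two layouts being compared. Everything beyond the column names already given above (index names, the link table name, the join style) is illustrative, not prescribed.

-- Option 1: array column on user_info, indexed with GIN
create table combinations (
    id       int primary key,
    some_id1 int not null,
    some_id2 int not null,
    some_id3 int not null
);
create table user_info (
    id              int primary key,
    some_column_1   text,
    some_column_2   text,
    combination_ids int[] not null
);
create index user_info_combination_ids_gin on user_info using gin (combination_ids);

-- Join by unnesting the array
select u.id, c.*
from user_info u
cross join lateral unnest(u.combination_ids) as cid
join combinations c on c.id = cid;

-- Option 2: classic link table with btree indexes
create table user_combination (
    user_id        int not null references user_info (id),
    combination_id int not null references combinations (id),
    primary key (user_id, combination_id)
);
create index user_combination_combination_id_idx on user_combination (combination_id);

select u.id, c.*
from user_info u
join user_combination uc on uc.user_id = u.id
join combinations c on c.id = uc.combination_id;

With the array layout a filter such as combination_ids @> '{3}' can use the GIN index; with the link table the btree index on combination_id serves the same purpose.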

Related

Conditionally Sum a googlesheet column based on entries in related tables

Say I have two related sheets/tabs within a google sheet. One sheet/tab is titled "Categories", the other is "Measures".
Categories:
userid, catcode
1, a
1, b
2, a
3, c
Measures:
userid, catcode, points
1, a, 5
1, b, 5
1, c, 3
2, a, 4
3, c, 3
For each user I'd like to be able to sum the points from the Measures table where the catcode is present for the user in the categories table. Ideally using an auto-extending/filling formula (like an arrayformula or query).
I have some idea how I'd approach this with SQL statements (joining the related tables, or doing a select where exists), but I'm new to googlesheets and would appreciate some direction here. I've experimented with this a bit and assuming a third table named "Users" with userids in column A, I can add this formula:
=sum(filter(measure!C2:C4, measure!A2:A4=users!A2, not(iserror(vlookup(measure!B2:B4, unique(filter(categories!B2:B5, categories!A2:A5=users!A2)), 1, false)))))
However this approach doesn't seem to be compatible with arrayformula and won't allow me to autofill down the Users tab for newly added userids. Sum itself is apparently incompatible with arrayformula. Additionally, if I enclose the above in arrayformula and replace sum with sumproduct or some other approach to the summation, I'm unable to get the users!A2 references to extend down as I'd expect via something like users!A2:A.
Any help/direction would be appreciated. Thanks!
Try:
=ARRAYFORMULA(QUERY({A1:A, VLOOKUP(A1:A&B1:B, {D1:D&E1:E, F1:F}, 2, 0)},
"select Col1,sum(Col2) where Col1 is not null group by Col1 label sum(Col2)''"))

Snowflake query pruning by Column

In the Snowflake docs it says:
First, prune micro-partitions that are not needed for the query.
Then, prune by column within the remaining micro-partitions.
What is meant by the second step?
Let's take the example table t1 shown in the link. On this example table I use the following query:
SELECT * FROM t1
WHERE
Date = '11/3' AND
Name = 'C'
Because of Date = '11/3' it would only scan micro-partitions 2, 3 and 4. Because of Name = 'C' it can prune even more and only scan micro-partitions 2 and 4.
So in the end only micro-partitions 2 and 4 would be scanned.
But where does the second step come into play? What is meant by "prune by column" within the remaining micro-partitions?
Does it mean that only rows 4, 5 and 6 on micro-partition 2 and row 1 on micro-partition 4 are scanned, because Date is my clustering key and is sorted, so you can prune even further with the date?
So in the end only 4 rows would be scanned?
But where does the second step come into play? What is meant by "prune by column" within the remaining micro-partitions?
Benefits of Micro-partitioning:
Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
It is recommended to avoid SELECT * and specify required columns explicitly.
It simply means to only select the columns that are required for the query. So in your example it would be:
SELECT col_1, col_2 FROM t1
WHERE
Date = '11/3' AND
Name = 'C'

SQL query takes 20 minutes to finish and only contains 1300 rows

I have a table that has a unique string column and a department description. The length of the unique string column represents the department hierarchy, so a 4-character value is the lowest level while a 2-character value is the highest.
My goal is to create new columns so I can show the hierarchy levels and the corresponding department descriptions for each row, and use these new columns as filters.
My SQL code is working; however, it takes more than 20 minutes to generate results for a 1300 row table.
Is there a better way to optimize this query? Note that I’m only using one table and creating multiple copies to create the final version that I’d like to achieve.
SELECT
m.UniqueDescription as "Department Code",
m.DepartmentDescription as "Department",
Left(m.UniqueDescription,2) as "Level 2 Hierarchy",
Left(m.UniqueDescription,3) as "Level 3 Hierarchy",
Left(m.UniqueDescription,4) as "Level 4 Hierarchy",
l2.DepartmentDescription as "L2 Department",
l3.DepartmentDescription as "L3 Department",
l4.DepartmentDescription as "L4 Department"
FROM department_table m
LEFT JOIN department_table l2
ON Left(m.UniqueDescription,2) = l2.UniqueDescription
LEFT JOIN department_table l3
ON Left(m.UniqueDescription,3) = l3.UniqueDescription
LEFT JOIN department_table l4
ON Left(m.UniqueDescription,4) = l4.UniqueDescription
Below is the output that I would like to achieve:
Table Format
First thing: the structure and the absence of numeric IDs is not good practice.
Check that the relevant indexes exist.
Do not use functions on the left-hand side of your ON or WHERE clauses; it prevents the execution planner from using the indexes on those columns.
Instead of FUNCTION(LeftTable.Column) = value, use LeftTable.Column = INVERSE_FUNCTION(value).
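As a rough illustration of that last point (the column name follows the question, but the literal value and the standalone filter are made up for the example), a filter on Left() can usually be rewritten as a prefix match so an index on the column can be used:

-- Non-sargable: the function wrapped around the column blocks index usage
SELECT * FROM department_table
WHERE Left(UniqueDescription, 2) = 'AB';

-- Sargable equivalent: a prefix match lets an index on UniqueDescription be used
SELECT * FROM department_table
WHERE UniqueDescription LIKE 'AB%';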

How to sort comma delimited column in SQL

I have a database table with a Value column that contains comma delimited strings and I need to order on the Value column with a SQL query:
Column ID | Value
1 | ff, yy, bb, ii
2 | kk, aa, ee
3 | dd
4 | cc, zz
If I use a simple query and order by Value ASC, it would result in an order of Column ID: 4, 3, 1, 2, but the desired result is Column ID: 2, 1, 4, 3, since Column ID '2' contains aa, '1' contains bb, and so on.
By the same token, order by Value DESC would result in an order of Column ID: 2, 1, 3, 4, but the desired result is Column ID: 4, 1, 2, 3.
My initial thought is to have two additional columns, 'Lowest Value' and 'Highest Value', and within the query I can order on either 'Lowest Value' or 'Highest Value' depending on the sort order. But I am not quite sure how to find the highest and lowest value in each row and insert them into the appropriate columns. Or is there another solution, without those two additional columns, within the SQL statement? I'm not that proficient with SQL queries, so thanks for your assistance.
The best solution is not to store a single comma-separated value at all. Instead, have a detail table Values which can have multiple rows per ID, with one value each.
If you have the possibility to alter the data structure (and judging by your own suggestion of adding columns, it seems you do), I would choose that solution.
To sort by lowest value, you could then write a query similar to the one below.
select
    t.ID
from
    YourTable t
    left join ValueTable v on v.ID = t.ID
group by
    t.ID
order by
    min(v.Value)
But this structure also allows you to write other, more advanced queries. For instance, this structure makes it easier and more efficient to check if a row matches a specific value, because you don't have to parse the list of values every time, and separate values can be indexed better.
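As a small sketch of that last point (the DDL and index name are illustrative, following the YourTable/ValueTable naming used above), checking whether a row contains a specific value becomes an indexed lookup instead of string parsing:

create table ValueTable (
    ID    int not null,
    Value varchar(10) not null
);
create index IX_ValueTable_Value on ValueTable (Value, ID);

-- "Which rows contain the value 'aa'?" without parsing any lists
select t.ID
from YourTable t
where exists (
    select 1
    from ValueTable v
    where v.ID = t.ID
      and v.Value = 'aa'
);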
String/array splitting (and creation, for that matter) is covered quite extensively. You might want to have a read of this, one of the best articles out there covering a comparison of the popular methods. Once you have the values, the rest is easy.
http://sqlperformance.com/2012/07/t-sql-queries/split-strings
Funnily enough, I did something like this just the other week in a cross-applied table function to do some data cleansing, improving performance 8-fold over the looped version that was in place.
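As a rough sketch of that split-and-sort approach, assuming SQL Server 2016+ where STRING_SPLIT is available (on older versions, substitute one of the split functions from the linked article; the table and column names follow the question's example and are otherwise illustrative):

-- Order rows by the smallest item in the comma-delimited Value column
SELECT t.[Column ID], t.Value
FROM YourTable AS t
CROSS APPLY (
    SELECT MIN(LTRIM(s.value)) AS MinValue
    FROM STRING_SPLIT(t.Value, ',') AS s
) AS m
ORDER BY m.MinValue;  -- use MAX(...) and ORDER BY ... DESC for the descending case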

Postgresql - performance of using array in big database

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index.
Every record is around 50-60 bytes.
The table name is "Item".
The server is: 12 GB RAM, 1.5 TB SATA, 4 cores. The whole server is for Postgres.
There are many more tables in this database, so RAM does not cover the whole database.
I want to add to table "Item" a column "a_elements" (array of big integers).
Every record would have no more than 50-60 elements in this column.
After that I would create a GIN index on this column, and a typical query should look like this:
select * from item where ...... and '{5}' <@ a_elements;
I also have a second, more classical, option.
Do not add the column a_elements to table item, but create a table elements with two columns:
id_item
id_element
This table would have around 200 million records.
I am able to do partitioning on these tables, so the number of records would be reduced to 20 million in table elements and 500 K in table item.
The second option query looks like this:
select item.*
from item
left join elements on (item.id_item=elements.id_item)
where ....
and 5 = elements.id_element
I wonder which option would be better from a performance point of view.
Is Postgres able to use many different indexes together with the GIN index (option 1) in a single query?
I need to make a good decision because importing this data will take me 20 days.
I think you should use an elements table:
Postgres would be able to use statistics to predict how many rows will match before executing the query, so it would be able to use the best query plan (this is even more important if your data is not evenly distributed);
you'll be able to localize query data using CLUSTER elements USING elements_id_element_idx;
when Postgres 9.2 is released you will be able to take advantage of index-only scans;
However, I've made some tests with 10M elements:
create table elements (id_item bigint, id_element bigint);
insert into elements
select (random()*524288)::int, (random()*32768)::int
from generate_series(1,10000000);
\timing
create index elements_id_item on elements(id_item);
Time: 15470,685 ms
create index elements_id_element on elements(id_element);
Time: 15121,090 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['elements','elements_id_item', 'elements_id_element'])
as relation
) as _;
relation | pg_size_pretty
---------------------+----------------
elements | 422 MB
elements_id_item | 214 MB
elements_id_element | 214 MB
create table arrays (id_item bigint, a_elements bigint[]);
insert into arrays select id_item, array_agg(id_element) from elements group by id_item;
create index arrays_a_elements_idx on arrays using gin (a_elements);
Time: 22102,700 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['arrays','arrays_a_elements_idx']) as relation
) as _;
relation | pg_size_pretty
-----------------------+----------------
arrays | 108 MB
arrays_a_elements_idx | 73 MB
On the other hand, arrays are smaller and have a smaller index. I'd do some tests with 200M elements before making a decision.
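As for the side question about combining indexes: a generic way to check (not part of the benchmark above, and it assumes an additional btree index on id_item that the test schema does not create) is to look at the EXPLAIN output and see whether the planner ANDs bitmap scans from the GIN index and the btree index together:

-- Illustrative only: add a btree index next to the GIN index, then check the plan
create index arrays_id_item_idx on arrays (id_item);

explain analyze
select *
from arrays
where a_elements @> array[5::bigint]
  and id_item < 1000;
-- A "BitmapAnd" node over both indexes indicates they are being combined.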
