PostgreSQL - performance of using arrays in a big database

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index.
Every record is around 50-60 bytes.
The table name is "Item"
The server is: 12 GB RAM, 1.5 TB SATA, 4 cores. The whole server is dedicated to Postgres.
There are many more tables in this database, so the RAM does not cover the whole database.
I want to add a column "a_elements" (an array of big integers) to the "Item" table.
Every record would have no more than 50-60 elements in this column.
After that I would create a GIN index on this column, and a typical query should look like this:
select * from item where ...... and '{5}' <@ a_elements;
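For concreteness, the setup for option 1 would look roughly like this (just a sketch; the index name is illustrative):
-- option 1: array column plus GIN index
alter table item add column a_elements bigint[];
create index item_a_elements_gin on item using gin (a_elements);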
I also have a second, more classical option: do not add the column a_elements to table item, but create a table elements with two columns:
id_item
id_element
This table would have around 200 million records.
I am able to partition these tables, so the number of records would drop to 20 million in table elements and 500 K in table item.
The second option query looks like this:
select item.*
from item
left join elements on (item.id_item=elements.id_item)
where ....
and 5 = elements.id_element
I wonder which option would be better from a performance point of view.
Is Postgres able to combine many different indexes with the GIN index (option 1) in a single query?
I need to make a good decision because importing this data will take me around 20 days.

I think you should use an elements table:
Postgres would be able to use statistics to predict how many rows will match before executing the query, so it would be able to use the best query plan (this is more important if your data is not evenly distributed);
you'll be able to localize query data using CLUSTER elements USING elements_id_element_idx;
when Postgres 9.2 is released you will be able to take advantage of index-only scans;
That said, I've made some tests with 10M elements:
create table elements (id_item bigint, id_element bigint);
insert into elements
select (random()*524288)::int, (random()*32768)::int
from generate_series(1,10000000);
\timing
create index elements_id_item on elements(id_item);
Time: 15470,685 ms
create index elements_id_element on elements(id_element);
Time: 15121,090 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['elements','elements_id_item', 'elements_id_element'])
as relation
) as _;
relation | pg_size_pretty
---------------------+----------------
elements | 422 MB
elements_id_item | 214 MB
elements_id_element | 214 MB
create table arrays (id_item bigint, a_elements bigint[]);
insert into arrays select id_item, array_agg(id_element) from elements group by id_item;
create index arrays_a_elements_idx on arrays using gin (a_elements);
Time: 22102,700 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['arrays','arrays_a_elements_idx']) as relation
) as _;
relation | pg_size_pretty
-----------------------+----------------
arrays | 108 MB
arrays_a_elements_idx | 73 MB
On the other hand, arrays are smaller and have a smaller index. I'd run some tests with 200M elements before making a decision.
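Once a bigger sample is loaded, a quick way to compare the two layouts is to time the candidate queries directly (a sketch against the test tables above; EXPLAIN ANALYZE on your real data should drive the decision):
-- array variant: GIN-indexed containment check
explain analyze select * from arrays where a_elements @> array[5::bigint];
-- elements-table variant: plain btree lookup
explain analyze select id_item from elements where id_element = 5;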

Related

Rails ActiveRecord - Performance - How to destroy invalid belongs_to records

I have two models:
Asset has_many Assethistories
Assethistory belongs_to Asset
Unfortunately, when the migration was created, no foreign_key was added to Assethistories. Some Assethistories records exist where Assethistories.asset_id has a value that does not exist in Asset.id (probably someone used delete on the Asset table instead of destroy).
There are approximately 10 million Asset records and 25 million Assethistories records.
Using this query takes a VERY LONG time:
Assethistory.where("asset_id NOT IN (select id from assets)").delete_all
or in rails syntax:
Assethistory.where.not(asset_id: Asset.select(:id)).delete_all
NOTE: we can use delete_all since there are no callbacks or nested models.
In fact, just doing a COUNT of the invalid records takes VERY LONG.
Assethistory.where.not(asset_id: Asset.select(:id)).count
Is there any way to destroy the invalid records that would be more performant?
I've got a data set about 20x smaller than yours (5M assethistory and 500K asset records) that reproduces your issue with where.not. Following the anti-join pattern, which joins to find records that don't exist, gives you the result you want with better performance.
Assethistory.left_outer_joins(:asset).where(assets: { id: nil })
# => 3500 rows (415.3ms)
or
Assethistory.where('NOT EXISTS (:assets)', assets: Asset.select('1').where('assets.id = assethistories.asset_id'))
# => 3500 rows (701.6ms)
By using a LEFT OUTER JOIN instead of WHERE NOT IN, the DB scans the table in a very different way. WHERE NOT IN loops over all rows in assethistories, checking each assethistories.asset_id against all existing assets.id values.
The LEFT OUTER JOIN instead queries the joined result of assethistories and assets (as one big table) and keeps the rows where the assets.id column is null, which is much more efficient.
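For reference, the anti-join translates into SQL roughly like the following (a sketch; the exact SQL depends on your database and Rails version, and since delete_all on a joined relation is restricted in some versions, the subquery form is shown):
-- delete assethistories rows that have no matching asset
DELETE FROM assethistories
WHERE id IN (
  SELECT ah.id
  FROM assethistories ah
  LEFT OUTER JOIN assets a ON a.id = ah.asset_id
  WHERE a.id IS NULL
);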

Snowflake query pruning by Column

In the Snowflake docs it says:
First, prune micro-partitions that are not needed for the query.
Then, prune by column within the remaining micro-partitions.
What is meant by the second step?
Let's take the example table t1 shown in the link. In this example table I use the following query:
SELECT * FROM t1
WHERE
Date = '11/3' AND
Name = 'C'
Because of Date = '11/3' it would only scan micro-partitions 2, 3 and 4. Because of Name = 'C' it can prune even more and scan only micro-partitions 2 and 4.
So in the end only micro-partitions 2 and 4 would be scanned.
But where does the second step come into play? What is meant by "prune by column" within the remaining micro-partitions?
Does it mean that only rows 4, 5 and 6 on micro-partition 2 and row 1 on micro-partition 4 are scanned, because Date is my clustering key and is sorted, so you can prune even further with the date?
So in the end only 4 rows would be scanned?
But where does the second step come into play? What is meant by "prune by column" within the remaining micro-partitions?
Benefits of Micro-partitioning:
Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
It is recommended to avoid SELECT * and specify required columns explicitly.
It simply means that only the columns required for the query are read: within the remaining micro-partitions 2 and 4, Snowflake scans just the referenced columns, thanks to the columnar layout. So in your example it would be:
SELECT col_1, col_2 FROM t1
WHERE
Date = '11/3' AND
Name = 'C'

What is the most efficient way to store 2-D timeseries in a database (sqlite3)

I am performing large-scale wind simulations to produce hourly wind patterns over a city. The result is a time series of 2-dimensional contours. Currently I am storing the results in SQLite3 database tables with the following structure:
Table: CFD
id, timestamp, velocity, cell_id
1 , 2010-01-01 08:00:00, 3.345, 1
2 , 2010-01-01 08:00:00, 2.355, 2
3 , 2010-01-01 08:00:00, 2.111, 3
4 , 2010-01-01 08:00:00, 6.432, 4
.., ..................., ....., .
1000 , 2010-01-01 09:00:00, 3.345, 1
1001 , 2010-01-01 10:00:00, 2.355, 2
1002 , 2010-01-01 11:00:00, 2.111, 3
1003 , 2010-01-01 12:00:00, 6.432, 4
.., ..................., ....., .
Actual create statement:
CREATE TABLE cfd(id INTEGER PRIMARY KEY, time DATETIME, u, cell_id integer)
CREATE INDEX idx_cell_id_cfd on cfd(cell_id)
CREATE INDEX idx_time_cfd on cfd(time)
(There are three of these tables, each for a different result variable)
where cell_id is a reference to the cell in the domain representing a location in the city. See this picture to have an idea of what it looks like at a specific timestep.
The typical query performs some kind of aggregation on the time dimension and group by on cell_id. For example, if I want to know the average local wind speed in each cell during a specific time interval, I would execute
select sum(time in ('2010-01-01 08:00:00','2010-01-01 13:00:00','2010-01-01 14:00:00', ...................., '2010-12-30 18:00:00','2010-12-30 19:00:00','2010-12-30 20:00:00','2010-12-30 21:00:00') and u > 5.0) from cfd group by cell_id
The number of timestamps can vary from 100 to 8,000.
This is fine for small databases, but it gets much slower for larger ones. For example, my last database was 60GB, 3 tables and each table had 222,000,000 rows.
Is there a better way to store the data? For example:
would it make sense to create a different table for each day?
would it be better to use a separate table for the timesteps and then use a join?
is there a better way of indexing?
I have already adopted all the recommendations in this question to maximise the performance.
This particular query is hard to optimize because the sum() must be computed over all table rows. It is a better idea to filter rows with WHERE:
SELECT count(*)
FROM cfd
WHERE time IN (...)
AND u > 5
GROUP BY cell_id;
If possible, use a simpler expression to filter times, such as time BETWEEN a AND b.
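For example, if the timestamps of interest happen to form a contiguous range, the long IN list collapses to something like this (a sketch with assumed boundaries):
SELECT cell_id, count(*)
FROM cfd
WHERE time BETWEEN '2010-01-01 08:00:00' AND '2010-12-30 21:00:00'
  AND u > 5.0
GROUP BY cell_id;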
It might be worthwhile to use a covering index, or in this case, when all queries filter on the time, a clustered index (without additional indexes):
CREATE TABLE cfd (
cell_id INTEGER,
time DATETIME,
u,
PRIMARY KEY (cell_id, time)
) WITHOUT ROWID;
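With WITHOUT ROWID, the table itself is stored as a B-tree keyed on (cell_id, time), so all rows for one cell are physically contiguous and a GROUP BY cell_id aggregation can read each group with a single range scan instead of per-row index lookups.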

Is "offset-fetch and order by" ordering the whole table or partial table?

In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server execute this? What strategy does it use?
1. Sort the whole table first and then select the 1 million rows we need, or
2. Sort only part of the table and then return the 1 million rows we need.
I assume it is the 2nd option. If so, how does SQL Server decide which range of the table to sort?
Edit 1:
I am asking this question to understand what could make the query slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found that the second query takes about 30 minutes to finish while the first takes almost no time.
So I am curious what causes this difference. Do the two queries spend the same time on the ORDER BY? (Does it even really sort the whole table? id is the clustered index column, and I cannot imagine it takes 0 seconds to sort a terabyte table.)
If the sorting takes the same time, the only difference would be the clustered-index scan: the first query only needs to scan the first few rows, while the second needs to scan a much bigger number of rows (> 1,000,000,000). But I am not quite sure if this is correct.
Thank you for your help!
Let me take a simple example:
order by id
offset 50 rows fetch next 25 rows only
For the above query, the steps would be:
1. The table must be sorted by id (if it is not, you pay the penalty of a sort; there is no partial sort, always a full sort).
2. Then scan 50 + 25 rows (paying the cost of 75 rows) and return only 25 rows.
Here is an example from an orders table I have (orderid is the PK, so the data is already sorted): even though we get only 20 rows back, we pay the cost of 120 rows.
Coming to your question: there is no partial sort (which implies the first option as far as sorting is concerned), even if you try to return just one row like below:
select top 1 * from table
order by orderid
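A workaround worth knowing (not part of the answer above, and only usable when you can carry a cursor from one page to the next) is keyset pagination, which seeks directly on the clustered index instead of scanning and discarding the offset rows. A sketch, where @last_id is a hypothetical variable holding the last id of the previous page:
-- seek past the previous page instead of offsetting into it
select top (1000000) id
from [table]
where id > @last_id
order by id;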

Advice on sql server index performance

I have "UserLog" table with 15 millions rows.
On this table I have a cluster index on User_ID field which is of type bigint identity, and a non clustered index on User_uid field which is of type varchar(35) (a fake uniqueidentifier).
On my application we can have 2 categories of users connection. 1 of them concerns only 0.007% of rows (about 1150 raws over 15 millions) and the 2nd concerns the remaining rows (99%).
The objective is to improve performance of the 0.007% users connection.
That's why I create a 'split' field "Userlog_ID" with type bit with default value of 0. So for each user connection we insert a new row in Userlog (with 0 as a value for User_log).
This field (User_Log) will then be update and it will take either 0 (for more then 99% of rows) or 1 (for 0.007% of rows) depending on the user category.
I create then a non clustered index on this field (User_log).
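The DDL for that index looked roughly like this (index names are hypothetical; the filtered variant corresponds to the 'normal or filtered' alternative mentioned at the end):
-- plain non-clustered index on the split flag
CREATE NONCLUSTERED INDEX IX_UserLog_UserLogFlag ON dbo.UserLog (User_Log);
-- filtered variant, covering only the rare category
CREATE NONCLUSTERED INDEX IX_UserLog_Category1 ON dbo.UserLog (User_UID) WHERE User_Log = 1;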
The SELECT statement I want to optimize is:
SELECT User_UID, User_LastAuthentificationDate,
Language_ID,User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
So the idea is now to add a filter on the User_Log field to improve performance (specifically the index seek operator) when the user belongs to category 1 (0.007%):
SELECT User_UID, User_LastAuthentificationDate, Language_ID,User_SecurityString
FROM dbo.UserLog
WHERE User_Active = 1
AND User_UID = '00F5AA38-C288-48C6-8ED1922601819486'
and User_Log = 1
In my mind, since we add this filter, the index seek should perform better because it now works on a smaller set of rows.
Unfortunately, when I compare the 2 queries with the estimated execution plan, I get 50% for each query. For both queries the optimiser uses an index seek on the User_UID non-clustered index and then a key lookup on the clustered index (User_ID).
So, in conclusion, adding the split field and a non-clustered index (either normal or filtered) on it does not improve performance.
Can anyone explain why? Maybe my reasoning and my interpretation are totally wrong.
Thank you
