I'm trying to determine the best indexes for a table in PostgreSQL. I expect on the order of ~10 billion rows and ~10 TB of data.
The table has 5 main columns used for filtering and/or sorting:
Filtering: 3 columns of binary data stored as bytea
Filtering / sorting: 2 columns of type integer
CREATE TABLE tab (          -- "table" is a reserved word, so the table is named tab here
    filter_key_1 bytea,     -- filtering
    filter_key_2 bytea,     -- filtering
    filter_key_3 bytea,     -- filtering
    sort_key_1   integer,   -- filtering & sorting
    sort_key_2   integer    -- filtering & sorting
);
Queries will be:
SELECT * FROM tab WHERE filter_key_1 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_2 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_3 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_1 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_2 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_3 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
What are the ideal indexes for the table? How large will they get with ~10 billion rows? How much will they limit write throughput?
Edit
What if I want to add additional queries such as the ones below? Would the indexes from above hold up?
SELECT * FROM tab WHERE filter_key_1 = $1 AND filter_key_2 = $2 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_1 = $1 AND filter_key_2 = $2 AND filter_key_3 = $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
-- ...
IO requirements
The workload is heavy read, low write.
Read speed is important. Write speed is less important (can live with up to 3 seconds per insert)
Read:
expecting on average 150 read queries/sec
most queries pulling in 100 to 100,000 rows after WHERE and before LIMIT
Write:
expecting 1 write query per 12 seconds (~0.08 queries/sec)
writing 500-1,000 rows per query (~42-84 rows/sec)
Since you need to run these queries all the time, you will have to optimize them as much as possible. That would mean:
CREATE INDEX ON tab (filter_key_1, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_2, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_3, sort_key_1, sort_key_2);
Together, these indexes should be substantially larger than your table.
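For the additional multi-key queries from the edit, each of these indexes can only be used for its one leading filter column, with the remaining equality conditions applied as filters after fetching the rows. If those combinations turn out to be frequent, dedicated composite indexes may be worth testing (a sketch, not a definitive recommendation):

CREATE INDEX ON tab (filter_key_1, filter_key_2, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_1, filter_key_2, filter_key_3, sort_key_1, sort_key_2);

Whether the planner actually uses them is easy to verify with EXPLAIN (ANALYZE, BUFFERS) on representative parameter values.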
Related
I have a table (t) which contains a column (c) of JSONB objects, each containing an array (a) with a varying length between 1 and 10 (inclusive). I need to build a query that selects 1000 rows from t: 100 random rows for each possible length of a. What would be the most concise way to write this query? My current query structure looks like this:
WITH length_1 AS (
SELECT *
FROM t
WHERE JSONB_ARRAY_LENGTH(c -> 'a') = 1
ORDER BY RANDOM()
LIMIT 100
),
length_2 AS (
SELECT *
FROM t
WHERE JSONB_ARRAY_LENGTH(c -> 'a') = 2
ORDER BY RANDOM()
LIMIT 100
)
...
SELECT *
FROM length_1
UNION
SELECT *
FROM length_2
...
You can use a window function to label each row with a row number within its partition, then use an outer select to limit to 100 for each partition.
select *
from (
  select t.*,
         row_number() over (partition by jsonb_array_length(c -> 'a') order by random()) as rn
  from t
) foo
where rn <= 100;
The two levels are needed because you can't use a window function in a WHERE or a HAVING.
I need to select random rows from my SQL table. When I searched for this in Google, the suggestion was to ORDER BY NEWID(), but it reduces performance. Since my table has more than 2,000,000 rows of data, this solution does not suit me.
I tried this code to get random data :
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also drops performance sometimes.
Could you please suggest a good solution for getting random data from my table? I need a minimum number of rows from the table, around 30 rows for each request. I tried TABLESAMPLE to get the data, but it returns nothing once I add my WHERE condition, because it returns data on the basis of pages, not rows.
Try to calculate the random IDs before filtering your big table.
Since your key is not an identity column, you need to number the records, and this will affect performance.
Note that I used DISTINCT to be sure to get different numbers.
EDIT: I have modified the query to use an arbitrary filter on your big table.
declare @n int = 30
;with
t as (
    -- EXTRACT DATA AND NUMBER ROWS
    select *, ROW_NUMBER() over (order by YourPrimaryKey) n
    from YourBigTable t
    -- SOME FILTER
    where 1=1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
    -- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
    select distinct top (@n) abs(CHECKSUM(NEWID()) % n) + 1 rnd
    from sysobjects s
    cross join (select MAX(n) n from t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;
If I have a multi-column index with column keys col_1, col_2, and col_3,
would the query use this index or not if the WHERE clause has these conditions:
col_1 = any_value AND col_3 = any_value
(the second column of the index key was not added to the WHERE clause)
And here is another example:
if the index has 10 columns and the column keys are in this order:
col_1, col_2, ...., col_10
and then I run this query:
Select col_1, col_2, ..., col_10 from X
WHERE col_1 = any_value AND col_5 = any_value AND col_10 = any_value
My question: would the index be used in this case or not?
New answer, as your question is now clearer to me.
No, the index will not be used. Only when querying on col_1, or col_1/col_2, or col_1/col_2/col_3 will the index be used (or may be used). Check this with the execution plan of your query. The order of the columns in a multi-column index does matter: check this question for some discussion around the topic: Multiple Indexes vs Multi-Column Indexes
If you consider it more likely that you will query on col_1 and col_3, why not create a multi-column index on just those 2 columns?
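For example, a minimal sketch (X and the column names are taken from the question; test against your own data):

CREATE INDEX ix_col1_col3 ON X (col_1, col_3);

With both filter columns in the index key, the engine can seek on col_1 and col_3 together, instead of seeking on col_1 alone and discarding the non-matching rows afterwards.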
It might be used. It depends on many factors, mostly your data (and the statistics about your data) and your queries.
TL/DR; you need to test this on your own data and your own queries. The index might be used.
You should try it out on the data that you have or expect to have. It is very easy to create some test data against which you can test your queries and try different indexes. You might also need to reconsider the order of the columns in the index: is col_1 really the best column to put first?
Below is a very specific scenario, from which we can only conclude that the index can sometimes be used in scenarios similar to yours.
Consider the scenario below: the table contains 1M rows and has only a single nonclustered index on (a, b, c). Note that the values in column d are very large.
The first 4 queries below used the index; only the fifth query did not.
Why?
SQL Server needs to figure out how to complete the query while reading the least amount of data. Sometimes it is easier for SQL Server to read the index instead of the table, even when the query filter does not completely match the index.
In queries 1 and 2, the query will actually do a Seek on the index, which is quite good. SQL Server knows that column a is a good candidate to perform the Seek on.
In queries 3 and 4, it needs to scan the entire index to find the matching rows. It still uses the index.
In query 5, SQL Server realizes that it is easier to scan the actual table instead of the index.
IF OBJECT_ID('tempdb..#peter') IS NOT NULL DROP TABLE #peter;
CREATE TABLE #peter(a INT, b INT, c VARCHAR(100), d VARCHAR(MAX));
WITH baserows AS (
SELECT * FROM master..spt_values WHERE type = 'P'
),
numbered AS (
SELECT TOP 1000000
a.*, rn = ROW_NUMBER() OVER(ORDER BY (SELECT null))
FROM baserows a, baserows b, baserows c
)
INSERT #peter
( a, b, c, d )
SELECT
rn % 100, rn % 10, CHAR(65 + (rn % 60)), REPLICATE(CHAR(65 + (rn % 60)), rn)
FROM numbered
CREATE INDEX ix_peter ON #peter(a, b, c);
-- First query does Seek on the index + RID Lookup.
SELECT * FROM #peter WHERE a = 55 AND c = 'P'
-- Second Query does Seek on the index.
SELECT a, b, c FROM #peter WHERE a = 55 AND c = 'P'
-- Third query does Scan on the index because the index is smaller to scan than the full table.
SELECT a, b, c FROM #peter WHERE c = 'P'
-- Fourth query does a scan on the index
SELECT a, b, c FROM #peter WHERE b = 22
-- Fifth query scans the table and not the index
SELECT MAX(d) FROM #peter
Tested on SQL Server 2014.
The index will definitely be used, but not effectively.
I ran an experiment in SQL Server (IX_AB is an index on a, b), and I can correlate your problem with it.
These are the conclusions:
If you create an index on col1, col2, and col3 and filter on col1 and col3 only, the index will seek on the col1 values alone; the rows retrieved from that seek are then filtered one by one, which is O(N), where N is the number of rows matched by the seek.
Passing the middle column as "not null" or "null" does not help.
The requirement is to load 50 records at a time (paging), with all 65 columns of table "empl" and minimum IO. There are 280,000+ records in the table. There is only one clustered index, over the PK.
The pagination query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.*
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set
WHERE
[row_number] BETWEEN 0 AND 50
Following are the stats after running the above query from profiler:
CPU: 1500, Reads: 25576, Duration: 25704
Then I put the following index on the table empl:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([del],[md],[pub],[tid],[coid],[sid],[ptid],[cid],[uon])
GO
After adding the index, CPU and reads are still high. I don't know whether something is wrong with the index or with the query.
Edit:
The following query is also producing high reads after adding the index, even though it selects only 3 columns and 1 count.
SELECT TOP (2147483647)
ame.aid ID, ame.name name,
COUNT(empl.pid) [Count], ps.uff uff FROM ame with (NOLOCK)
JOIN pam AS pa WITH (NOLOCK) ON pa.aid = ame.aid
JOIN empl WITH (NOLOCK) ON empl.pid = pa.pid
LEFT JOIN psam AS ps
ON ps.psid = 1001
AND ps.aid = ame.aid
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = empl.ptid
WHERE
empl.del = 0 AND empl.pub = 1 AND empl.sid = 2
AND empl.md = 0
AND (empl.tid = 3)
AND (empl.coid = 2)
AND (empl.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1)
AND ame.pub = 1 AND ame.del = 0
GROUP BY ame.aid, ame.name, ps.uff
ORDER BY ame.name ASC
Second Edit:
I have now put the following index on the "uon" column:
CREATE NONCLUSTERED INDEX [ci_empl_uon]
ON [dbo].[empl] (uon)
GO
But CPU and reads are still high.
Third Edit:
The DTA suggested an index with all columns included for the first query, so I altered the suggested index and converted it into a filtered index on the basic filters to make it more effective.
I added the line below after the INCLUDE while creating the index:
WHERE del = 0 AND pub = 1 AND sid = 2 AND md = 0 AND coid = 2
But the reads are still high on both the development and production machines.
Fourth Edit:
I have now arrived at a solution that improves the performance, but still not up to the goal. The key is that it no longer goes for ALL THE DATA.
The query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.pID pID
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set join empl on result_set.pID = empl.pID
WHERE
[row_number] BETWEEN @start AND @end
And I recreated the index with altered key columns, an INCLUDE, and a filter:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([ptid],[cid],[tid],[uon])
INCLUDE ([pID])
Where
[coID] = 2 and
[sID] = 2 and
[pub] = 1 and
[del] = 0 and
[md] = 0
GO
It improves the performance, but not up to the goal.
You are selecting the top 50 rows ordered by e.uon desc. An index that starts with uon will speed up the query:
create index IX_Empl_Uon on dbo.empl (uon)
The index will allow SQL Server to scan the top N rows of this index. N is the highest number in your pagination: for the 3rd page of 50 elements, N equals 150. SQL Server then does 50 key lookups to retrieve the full rows from the clustered index. As far as I know, this is a textbook example of where an index can make a big difference.
Not all query optimizers are smart enough to notice that row_number() over ... as rn with where rn between 1 and 50 means the top 50 rows. But SQL Server 2012 does. It uses the index both for the first and for consecutive pages, like row_number() between 50 and 99.
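For illustration, on SQL Server 2012 and later the same top-N pagination can also be expressed with OFFSET/FETCH, which the optimizer can satisfy with the same ordered index scan (a sketch for page 3 of 50 rows; the WHERE clause from the original query would be added as well):

SELECT e.*
FROM dbo.empl AS e
ORDER BY e.uon DESC
OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY;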
You are trying to find the X through X+Nth rows from a dataset, based on an order specified by column uon.
I'm assuming here that uon is the mentioned primary key. If not, without an index where uon is the first (if not the only) column, a table scan is inevitable.
Next wrinkle: you don't want that direct span of rows, you want that span of rows as filtered by a fairly extensive assortment of filters. The clustered index might pull the first 50 rows, but the WHERE may filter none, some, or all of those out. More rows will almost certainly have to be read in order to "fill your span".
More fun: you perform a left outer join on table empl_add (i.e. retaining the empl row even if there is no empl_add found), and then require that empl_add.ptgid be found in the subquery, filtering out all the rows where it is not. You might as well make this an inner join; it may speed things up and certainly will not make things slower (a sketch of the rewrite follows below). It is also a "filtering factor" that cannot be addressed with an index on table empl.
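A minimal sketch of that rewrite (untested; tables and filter values taken from the question):

WITH result_set AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY e.[uon] DESC) AS [row_number], e.*
    FROM
        empl e
        JOIN empl_add ea ON ea.ptid = e.ptid
    WHERE
        e.del = 0 AND e.pub = 1 AND e.sid = 2
        AND e.md = 0 AND e.tid = 3 AND e.coid = 2
        AND e.cid = 102
        AND ea.ptgid IN (SELECT ptgid FROM empl_dep WHERE psid = 1001 AND ib = 1))
SELECT * FROM result_set WHERE [row_number] BETWEEN 0 AND 50

The inner join is safe here because a row with no matching empl_add produces a NULL ptgid, which can never satisfy the IN condition anyway.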
So, as I see it (i.e. without testing it all out locally), SQL has to first assemble the data, filter out the invalid rows (which involves the table joins), order what remains, and return the span of rows you are interested in. I believe that, with or without the index on uon, SQL is identifying a need to read all the data and filter/sort it before it can pick out the desired range.
(Your new index would appear to be insufficient. The sixth column is sid, but sid is not referenced in the query, so it might only be able to help "so far". This raises lots of questions about data cardinality and whatnot, at which point I defer to @Aaron's point that we have insufficient information on the overall problem set for a full analysis.)
UPDATE CattleProds
SET SheepTherapy=(ROUND((RAND()* 10000),0))
WHERE SheepTherapy IS NULL
If I then do a SELECT I see that my random number is identical in every row. Any ideas how to generate unique random numbers?
Instead of rand(), use newid(), which is recalculated for each row in the result. The usual way is to take the modulo of the checksum. Note that checksum(newid()) can produce -2,147,483,648, which would cause an integer overflow in abs(), so we need to apply the modulo to the checksum's return value before taking the absolute value.
UPDATE CattleProds
SET SheepTherapy = abs(checksum(NewId()) % 10000)
WHERE SheepTherapy IS NULL
This generates a random number between 0 and 9999.
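The same pattern generalizes to an arbitrary range. For example, for values between 100 and 200 inclusive (a sketch; substitute your own bounds):

UPDATE CattleProds
SET SheepTherapy = 100 + abs(checksum(NewId()) % 101)
WHERE SheepTherapy IS NULL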
If you are on SQL Server 2008 you can also use
CRYPT_GEN_RANDOM(2) % 10000
which seems somewhat simpler (it is also evaluated once per row, as newid() is; shown below)
DECLARE @foo TABLE (col1 FLOAT)
INSERT INTO @foo SELECT 1 UNION SELECT 2
UPDATE @foo
SET col1 = CRYPT_GEN_RANDOM(2) % 10000
SELECT * FROM @foo
Returns (2 random, probably different, numbers)
col1
----------------------
9693
8573
Mulling over the unexplained downvote, the only legitimate reason I can think of is that, because the generated random number is between 0 and 65,535, which is not evenly divisible by 10,000, some numbers will be slightly over-represented. A way around this is to wrap it in a scalar UDF that throws away any number of 60,000 or above and calls itself recursively to get a replacement number.
CREATE FUNCTION dbo.RandomNumber()
RETURNS INT
AS
BEGIN
    DECLARE @Result INT
    SET @Result = CRYPT_GEN_RANDOM(2)
    RETURN CASE
             WHEN @Result < 60000
                  OR @@NESTLEVEL = 32 THEN @Result % 10000
             ELSE dbo.RandomNumber()
           END
END
While I do love using CHECKSUM, I feel that a better way to go is using NEWID(), just because you don't have to go through complicated math to generate simple numbers.
ROUND( 1000 *RAND(convert(varbinary, newid())), 0)
You can replace the 1000 with whichever number you want to set as the limit, and you can always use a plus sign to create a range. Say you want a random number between 100 and 200; you can do something like:
100 + ROUND( 100 *RAND(convert(varbinary, newid())), 0)
Putting it together in your query :
UPDATE CattleProds
SET SheepTherapy= ROUND( 1000 *RAND(convert(varbinary, newid())), 0)
WHERE SheepTherapy IS NULL
I tested 2 set-based randomization methods against RAND() by generating 100,000,000 rows with each. To level the playing field, the output is a float between 0 and 1 to mimic RAND(). Most of the code is testing infrastructure, so I summarize the algorithms here:
-- Try #1 used
(CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)%500000000000000000+500000000000000000.0)/1000000000000000000 AS Val
-- Try #2 used
RAND(Checksum(NewId()))
-- and to have a baseline to compare output with I used
RAND() -- this required executing 100000000 separate insert statements
Using CRYPT_GEN_RANDOM was clearly the most random, since there is only a .000000001% chance of seeing even 1 duplicate when plucking 10^8 numbers from a set of 10^18 numbers. In other words, we should not have seen any duplicates, and this run had none! This set took 44 seconds to generate on my laptop.
Cnt Pct
----- ----
1 100.000000 --No duplicates
SQL Server Execution Times:
CPU time = 134795 ms, elapsed time = 39274 ms.
IF OBJECT_ID('tempdb..#T0') IS NOT NULL DROP TABLE #T0;
GO
WITH L0 AS (SELECT c FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS D(c)) -- 2^4
,L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B) -- 2^8
,L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B) -- 2^16
,L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B) -- 2^32
SELECT TOP 100000000 (CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)%500000000000000000+500000000000000000.0)/1000000000000000000 AS Val
INTO #T0
FROM L3;
WITH x AS (
SELECT Val,COUNT(*) Cnt
FROM #T0
GROUP BY Val
)
SELECT x.Cnt,COUNT(*)*1.0/(SELECT COUNT(*)/100 FROM #T0) Pct
FROM X
GROUP BY x.Cnt;
At almost 15 orders of magnitude less random, this method was not quite twice as fast, taking only 23 seconds to generate 100M numbers.
Cnt Pct
---- ----
1 95.450254 -- only 95% unique is absolutely horrible
2 02.222167 -- If this line were the only problem I'd say DON'T USE THIS!
3 00.034582
4 00.000409 -- 409 numbers appeared 4 times
5 00.000006 -- 6 numbers actually appeared 5 times
SQL Server Execution Times:
CPU time = 77156 ms, elapsed time = 24613 ms.
IF OBJECT_ID('tempdb..#T1') IS NOT NULL DROP TABLE #T1;
GO
WITH L0 AS (SELECT c FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS D(c)) -- 2^4
,L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B) -- 2^8
,L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B) -- 2^16
,L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B) -- 2^32
SELECT TOP 100000000 RAND(Checksum(NewId())) AS Val
INTO #T1
FROM L3;
WITH x AS (
SELECT Val,COUNT(*) Cnt
FROM #T1
GROUP BY Val
)
SELECT x.Cnt,COUNT(*)*1.0/(SELECT COUNT(*)/100 FROM #T1) Pct
FROM X
GROUP BY x.Cnt;
RAND() alone is useless for set-based generation, so producing the baseline for comparing randomness took over 6 hours and had to be restarted several times to finally get the right number of output rows. The randomness also seems to leave a lot to be desired, although it's better than using checksum(newid()) to reseed each row.
Cnt Pct
---- ----
1 99.768020
2 00.115840
3 00.000100 -- at least there were comparatively few values returned 3 times
Because of the restarts, execution time could not be captured.
IF OBJECT_ID('tempdb..#T2') IS NOT NULL DROP TABLE #T2;
GO
CREATE TABLE #T2 (Val FLOAT);
GO
SET NOCOUNT ON;
GO
INSERT INTO #T2(Val) VALUES(RAND());
GO 100000000
WITH x AS (
SELECT Val,COUNT(*) Cnt
FROM #T2
GROUP BY Val
)
SELECT x.Cnt,COUNT(*)*1.0/(SELECT COUNT(*)/100 FROM #T2) Pct
FROM X
GROUP BY x.Cnt;
require_once('db/connect.php');

// Collect all product ids.
$ids_array = [];
$products_query = "SELECT id FROM products";
$products_result = mysqli_query($conn, $products_query);
while ($products_row = mysqli_fetch_array($products_result)) {
    $ids_array[] = $products_row['id'];
}

// Assign a random 7-digit code to each product, using a prepared
// statement instead of interpolating values into the SQL string.
$stmt = mysqli_prepare($conn, "UPDATE products SET code = ? WHERE id = ?");
foreach ($ids_array as $current_row) {
    $rand = rand(1000000, 9999999);
    mysqli_stmt_bind_param($stmt, 'ii', $rand, $current_row);
    mysqli_stmt_execute($stmt);
}
mysqli_stmt_close($stmt);
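As an aside, MySQL re-evaluates RAND() for every row an UPDATE touches, so the read-then-loop above can likely be collapsed into a single statement (a sketch, assuming code is numeric and the codes need not be unique):

UPDATE products SET code = FLOOR(1000000 + RAND() * 9000000);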