I have a table which can contain up to billions of rows
CREATE TABLE "Log4DataUsb" (
"Time" integer primary key not null ,
"Microseconds" integer ,
"Current" integer ,
"Voltage" integer )
Usually a user will want to query the data within a specific range, for example Time <= 123456789 AND Time >= 0. Because this may return billions of rows, I want to segment the rows and only return a batch each time, like LIMIT 10000, then LIMIT 10000 OFFSET x, and so on until it reaches the end of this time-range query.
I notice that when the number of rows goes up, this query can be quite slow; executing the queries below takes seconds even though I just want to move to the next batch.
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000 OFFSET 10000
If the database is supposed to have 2 billion rows in total, is there any way to largely increase the query performance?
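As a sketch of one common workaround: because Time is the INTEGER PRIMARY KEY (and therefore SQLite's rowid), keyset pagination can replace OFFSET, which otherwise has to scan and discard every skipped row. This is only an illustration of the idea, with :last_time standing in for a value the application remembers between batches:

-- First batch: start from the lower bound of the range.
SELECT * FROM Log4DataUsb
WHERE Time >= 0 AND Time <= 123456789
ORDER BY Time
LIMIT 10000;

-- Subsequent batches: seek past the last Time value already seen.
-- The primary-key seek makes each batch cheap, instead of paying
-- for all the rows OFFSET would scan and throw away.
SELECT * FROM Log4DataUsb
WHERE Time > :last_time AND Time <= 123456789
ORDER BY Time
LIMIT 10000;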
I want to calculate a sumproduct in a Netezza table, where one column is fixed. In the first column (A) I have some numbers, and the values in DISCOUNT are discount factors. As a result I want to get the sumproduct between A and DISCOUNT, where DISCOUNT always starts from the first row.
Numbers in RESULT:
14.54535 = 5/(1+2%) + 3/(1+3%) + 7/(1+4%)
9.737293 = 3/(1+2%) + 7/(1+3%)
6.862745 = 7/(1+2%)
When computing the next number in column RESULT, we always ignore the previous values from A, but always use DISCOUNT starting from MATURITY = 1 forward.
MATURITY | A | R  | DISCOUNT    | RESULT
----------------------------------------------
1        | 5 | 2% | 98.0392...% | 14.54535...
2        | 3 | 3% | 97.0874...% | 9.737293...
3        | 7 | 4% | 96.1538...% | 6.862745...
Is there any way to do that in Netezza, without using multiple joins for rates/discounts, since the dimension of the data can vary?
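One hedged sketch of the shape of a solution, assuming the table is named RATES with the columns above: a single self-join that re-bases DISCOUNT at each maturity. The number of joins stays fixed no matter how many maturities there are:

-- RESULT for maturity m = SUM over k >= m of A(k) * DISCOUNT(k - m + 1)
SELECT t.MATURITY,
       SUM(f.A * d.DISCOUNT) AS RESULT
FROM RATES t
JOIN RATES f ON f.MATURITY >= t.MATURITY
JOIN RATES d ON d.MATURITY = f.MATURITY - t.MATURITY + 1
GROUP BY t.MATURITY
ORDER BY t.MATURITY;

For MATURITY = 2, for example, f ranges over rows 2 and 3 while d picks up DISCOUNT rows 1 and 2, giving 3/(1+2%) + 7/(1+3%) as required.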
I have tried regexp, regexp_like and like, but they didn't work:
select * from b where regexp_like(col1, '\d');
select * from b where regexp_like(col1, '[0-9]');
-- ...etc
We have this table:
Col1
avr100000
adfdsgwr
20170910020359.761
Enterprise
adf56ds76gwr
0+093000
080000
adfdsgwr
The output should be these 5 rows:
col1
avr100000
20170910020359.761
adf56ds76gwr
0+093000
080000
Thanks
You can use regexp_instr in the where clause to see if it finds a digit anywhere in the string:
create temp table b(col1 string);
insert into b (col1) values ('avr100000'), ('adfdsgwr'),
('20170910020359.761'),
('Enterprise'),
('adf56ds76gwr'),
('0+093000'),
('080000'),
('adfdsgwr')
;
select col1 from b where regexp_instr(col1, '\\d') > 0;
I'm updating my answer to note that regexp_instr is going to perform about 3.8 times faster than using regexp_count for this requirement.
The reason is that regexp_instr will stop and report the location of the first digit it encounters. In contrast, regexp_count will continue examining the string until it reaches its end. If we only want to know if a digit exists in a string, we can stop as soon as we encounter the first one.
If it is a small data set, this won't matter much. For large data sets, that 3.8x difference matters a lot. Here is a mini test harness that shows the performance difference:
create or replace transient table RANDOM_STRINGS as
select RANDSTR(50, random()) as RANDSTR from table(generator (rowcount => 10000000));
alter session set use_cached_result = false;
-- Run these statements multiple times on an X-Small warehouse to test performance.
-- Run both once to warm the cache, then note the times of the subsequent runs.
-- Average over 10 times with warm cache: 3.315s
select count(*) as ROWS_WITH_NUMBERS
from RANDOM_STRINGS where regexp_count(randstr, '\\d') > 0;
-- Average over 10 times with warm cache: 0.8686s
select count(*) as ROWS_WITH_NUMBERS
from RANDOM_STRINGS where regexp_instr(randstr, '\\d') > 0;
One method is to count how many alphabetic characters there are:
select column1 as input
,regexp_count(column1, '[A-Za-z]') as alpha_count
from values
('0-100000'),
('adfdsgwr'),
('20170910020359.761'),
('Enterprise'),
('adfdsgwr'),
('0+093000'),
('1-080000'),
('adfdsgwr')
INPUT              | ALPHA_COUNT
--------------------------------
0-100000           | 0
adfdsgwr           | 8
20170910020359.761 | 0
Enterprise         | 10
adfdsgwr           | 8
0+093000           | 0
1-080000           | 0
adfdsgwr           | 8
and thus exclude those where the count is not zero:
select column1 as input
from values
('0-100000'),
('adfdsgwr'),
('20170910020359.761'),
('Enterprise'),
('adfdsgwr'),
('0+093000'),
('1-080000'),
('adfdsgwr')
where regexp_count(column1, '[A-Za-z]') = 0
gives:
INPUT
0-100000
20170910020359.761
0+093000
1-080000
All you need to do is find 1 or more instances of a numeric value:
select
column1 as input
from
values
('avr100000'),
('adfdsgwr'),
('20170910020359.761'),
('Enterprise'),
('adf56ds76gwr'),
('0+093000'),
('1-080000'),
('adfdsgwr')
where
regexp_count (column1, '\\d') > 0;
Results:
avr100000
20170910020359.761
adf56ds76gwr
0+093000
1-080000
The Sort operator says it's only got 100 rows to sort. How could that possibly be more expensive than reading 1.9 million rows? I must be reading this wrong or misunderstanding something.
Also, how is the Estimated Number of Rows Per Execution in the Sort operator only 100? If the Index Seek operator estimates the Number of Rows Per Execution to be 1.9 million, how do only 100 rows get piped over to the Sort operator?
Here is the query:
DECLARE @PageIndex INT = 1000;
DECLARE @PageCount INT = 1000;

SELECT ID
FROM dbo.Table1
WHERE DateCreated >= '2021-10-27'
  AND DateCreated < '2021-10-28'
ORDER BY ID
OFFSET @PageIndex * @PageCount ROWS FETCH NEXT @PageCount ROWS ONLY
The Sort operator (as opposed to "Top N Sort") will sort the entirety of its input in its open method before returning any rows.
SQL Server estimates that the seek will output 1.9 million rows that go into the sort.
The costing is therefore for sorting 1.9 million rows.
You are doing
OFFSET 1000000 ROWS FETCH NEXT 1000 ROWS ONLY
The actual output rows from the sort will be at least 1,001,000 (maybe more in a parallel plan) and the TOP operator discards the first million for the offset and then stops requesting rows after it has received the 1000 to be returned.
The estimate of 100 is just a guess as SQL Server has no idea what the value of the variables will be at runtime when the plan is compiled.
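If that blind estimate is the concern, one hedged option is OPTION (RECOMPILE), which lets the optimizer embed the actual variable values at execution time, at the cost of a compile on every run. A minimal sketch of the same query:

DECLARE @PageIndex INT = 1000;
DECLARE @PageCount INT = 1000;

SELECT ID
FROM dbo.Table1
WHERE DateCreated >= '2021-10-27'
  AND DateCreated < '2021-10-28'
ORDER BY ID
OFFSET @PageIndex * @PageCount ROWS FETCH NEXT @PageCount ROWS ONLY
OPTION (RECOMPILE); -- variables are treated as known constants at compile time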
I am using an Oracle 12c database in my project, and I have a column "Name" of type VARCHAR2(128 CHAR) NOT NULL. I have approximately 25328687 rows in my table.
Now I don't need the "Name" column, so I want to delete it. When I calculated the total size of the data in this column (using lengthb and vsize) for all the rows, it was approximately 1.07 GB.
Since the max size of the data in this column is specified, shouldn't every row be allocated 128 bytes for this column (ignoring Unicode for simplicity), making the total space consumed by this column 128 * number of rows = 3242071936 bytes, or about 3.24 GB?
Oracle VARCHAR2 allocates storage dynamically (the definition says variable-length string data type).
The CHAR datatype is a fixed-length string data type.
create table x (a char(5), b varchar2(5));

insert into x values ('RAM', 'RAM');
insert into x values ('RAMA', 'RAMA');
insert into x values ('RAMAN', 'RAMAN');

SELECT * FROM x WHERE length(a) = 3; -- returns 0 rows
SELECT * FROM x WHERE length(b) = 3; -- returns 1 row (RAM)

SELECT length(a) AS len_a, length(b) AS len_b FROM x;
The output will be like below:
len_a | len_b
-------------
5 | 3
5 | 4
5 | 5
Oracle does dynamic allocation for VARCHAR2.
So a string of 4 characters will take 5 bytes: one for the length and 4 bytes for the 4 characters, assuming a single-byte character set.
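To see the variable allocation for yourself, here is a small sketch reusing the table x from above; the exact byte values shown in the comments assume an ASCII-based single-byte character set:

-- Len is the number of data bytes actually stored for each value;
-- the separate length byte is row-format overhead, not included here.
SELECT b, DUMP(b) AS stored, VSIZE(b) AS data_bytes FROM x;
-- 'RAM'   -> Typ=1 Len=3: 82,65,77        data_bytes = 3
-- 'RAMA'  -> Typ=1 Len=4: 82,65,77,65     data_bytes = 4
-- 'RAMAN' -> Typ=1 Len=5: 82,65,77,65,78  data_bytes = 5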
As the other answers say, the storage that a VARCHAR2 column uses is VARying. To get an estimate of the actual amount, you can use
1) The data dictionary
SELECT column_name, avg_col_len, last_analyzed
FROM ALL_TAB_COL_STATISTICS
WHERE owner = 'MY_SCHEMA'
AND table_name = 'MY_TABLE'
AND column_name = 'MY_COLUMN';
The result avg_col_len is the average column length. Multiply it by your number of rows (25328687) and you get an estimate of roughly how many bytes this column uses. (If last_analyzed is NULL or very old compared to the last big data change, you'll have to refresh the optimizer stats with DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA','MY_TABLE') first.)
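As a sketch, the multiplication can also be done directly in the dictionary by joining to ALL_TABLES for the row count, using the same placeholder names as above:

SELECT c.avg_col_len * t.num_rows AS estimated_bytes
FROM all_tab_col_statistics c
JOIN all_tables t
  ON t.owner = c.owner AND t.table_name = c.table_name
WHERE c.owner = 'MY_SCHEMA'
  AND c.table_name = 'MY_TABLE'
  AND c.column_name = 'MY_COLUMN';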
2) Count yourself in a sample
SELECT sum(s), count(*), avg(s), stddev(s)
FROM (
SELECT vsize(my_column) as s
FROM my_schema.my_table SAMPLE (0.1)
);
This calculates the storage size of a 0.1 percent sample of your table.
3) To know for sure, I'd do a test with a subset of the data
CREATE TABLE my_test TABLESPACE my_scratch_tablespace NOLOGGING AS
SELECT * FROM my_schema.my_table SAMPLE (0.1);

-- get the size of the test table in megabytes
SELECT round(bytes/1024/1024) AS mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';

-- now drop the column
ALTER TABLE my_test DROP (my_column);

-- and measure again
SELECT round(bytes/1024/1024) AS mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';

-- reorganize the table to actually release the space, then check how much was freed
ALTER TABLE my_test MOVE;

SELECT round(bytes/1024/1024) AS mb
FROM dba_segments WHERE owner='MY_SCHEMA' AND segment_name='MY_TEST';
You could improve the test by using the same PCTFREE and COMPRESSION levels on your test table.
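For instance, a hypothetical variant of the test-table creation; the PCTFREE and compression settings below are placeholders, so copy the real ones from DBA_TABLES for your table:

CREATE TABLE my_test
  PCTFREE 10                        -- placeholder: use the source table's value
  TABLESPACE my_scratch_tablespace
  NOLOGGING
  COMPRESS FOR OLTP                 -- placeholder: match the source table's compression
AS SELECT * FROM my_schema.my_table SAMPLE (0.1);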
I have a table with call logs. I need to assign time slots for next call based on which time slot the phone number was reachable in.
The relevant columns of the table are:
Phone Number | CallTimeStamp
CallTimeStamp is a datetime object.
I need to calculate the following:
Time Slot: From the TimeStamp, I need to calculate the count for each time slot (e.g. 0800-1000, 1001-1200, etc.) for each phone number. Now, if the count is greater than 'n' for a particular time slot, then I need to assign that time slot to that number. Otherwise, I select a default time slot.
Weekday Slot: Same as above, but with weekdays.
Priority: Basically a count of how many times a number was reached
Here's how I have gone about solving these issues:
Priority
Calculating the number of times a phone number was called is straightforward. If a number exists in the call log, I know that it was called. In that case, the following query will give me the call count for each number.
SELECT DISTINCT(PhoneNumber), COUNT(PhoneNumber) FROM tblCallLog
GROUP BY PhoneNumber
However, my problem is that I need to change the values in the field Count(PhoneNumber) based on the value in that column itself. How do I go about achieving this? (eg. If Count(PhoneNumber) gives me a value > 20, I need to change it to 5).
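A sketch of the capping logic with a CASE expression over the aggregate (the answers below use the same idea; the thresholds 20 and 5 are taken from the example in the question):

-- Cap the per-number call count at 5 once it exceeds 20.
SELECT PhoneNumber,
       CASE WHEN COUNT(PhoneNumber) > 20 THEN 5
            ELSE COUNT(PhoneNumber)
       END AS Priority
FROM tblCallLog
GROUP BY PhoneNumber;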
Time Slot / Weekday
This is where I'm completely stumped and am looking for the "database" way of doing things.
Unfortunately, I can't get out of my iterative process of thinking. For example, if I was aggregating for a certain phone number (say '123456') and in a certain time slot (say between 0800-1000 hrs), I can write a query like this:
DECLARE @T1Start time = '08:00:00.0000';
DECLARE @T2End time = '10:00:00.0000';

SELECT COUNT(CallTimeStamp) FROM tblCallLog
WHERE PhoneNumber = '123456'
  AND CAST(CallTimeStamp AS time) >= @T1Start
  AND CAST(CallTimeStamp AS time) < @T2End
Now, I could go through each and every distinct phone number in the table, count the values for each time slot and then assign a slot value for the phone number. However, there has to be a way that does not involve me iterating through the database.
So, I am looking for suggestions on how to solve this.
Thanks
You can use the DATEPART function to get the weekday slot.
To calculate the time slot, you can take the number of minutes since the beginning of the day and divide it by the size of the slot; that gives you a slot number. You can then use either a CASE statement to translate it into a proper string, or a lookup table where you store slot descriptions.
SELECT
PhoneNumber
, DATEPART(WEEKDAY, l.CallTimeStamp) AS DayOfWeekSlot
, DATEDIFF(MINUTE, CONVERT(DATE, l.CallTimeStamp), l.CallTimeStamp) / 120 AS TwoHourSlot /*You can change number of minutes to get different slot size*/
, COUNT(*) AS Count
FROM tblCallLog l
GROUP BY PhoneNumber
, DATEPART(WEEKDAY, l.CallTimeStamp)
, DATEDIFF(MINUTE, CONVERT(DATE, l.CallTimeStamp), l.CallTimeStamp) / 120
You could try this to return the phone number, the day of the week and a two-hour slot. If the volume of calls is greater than 20, the value is set to 5 (not sure why 5?). The code for the two-hour section is adapted from this question: How to Round a Time in T-SQL, where the value 2 in (24/2) is the number of hours in your time period.
SELECT
PhoneNumber
, DATENAME(weekday,CallTimeStamp) as [day]
, CONVERT(smalldatetime,ROUND(CAST(CallTimeStamp as float) * (24/2),0)/(24/2)) AS RoundedTime
, CASE WHEN COUNT(*) > 20 THEN 5 ELSE COUNT(*) END
FROM
tblCallLog
GROUP BY
PhoneNumber
, DATENAME(weekday,CallTimeStamp)
, CONVERT(smalldatetime,ROUND(CAST(CallTimeStamp as float) * (24/2),0)/(24/2))