Currently we have a column that holds only integer values but is declared as NUMBER. It is also our (only) indexed column. Would it make a difference in performance if we declared it as INTEGER instead? Or is Oracle smart enough to see that it only contains integers? Thank you very much.
No, it won't.
Taking Florin's test tables, you can set up a small test harness that runs each query hundreds of times and averages the elapsed time. In my case, I ran both queries 500 times each.
Sometimes, the NUMBER version will run slightly faster (1.232 hundredths of a second vs 1.284 hundredths of a second).
SQL> ed
Wrote file afiedt.buf
1 declare
2 l_start_time number;
3 l_end_time number;
4 l_cnt number;
5 l_iterations number := 500;
6 begin
7 l_start_time := dbms_utility.get_time();
8 for i in 1 .. l_iterations
9 loop
10 select count(*)
11 into l_cnt
12 from fg_test;
13 end loop;
14 l_end_time := dbms_utility.get_time();
15 dbms_output.put_line( 'Average elapsed (number) = ' ||
16 (l_end_time - l_start_time)/l_iterations ||
17 ' hundredths of a second.' );
18 l_start_time := dbms_utility.get_time();
19 for i in 1 .. l_iterations
20 loop
21 select count(*)
22 into l_cnt
23 from fg_test1;
24 end loop;
25 l_end_time := dbms_utility.get_time();
26 dbms_output.put_line( 'Average elapsed (integer) = ' ||
27 (l_end_time - l_start_time)/l_iterations ||
28 ' hundredths of a second.' );
29* end;
SQL> /
Average elapsed (number) = 1.232 hundredths of a second.
Average elapsed (integer) = 1.284 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:12.60
If you immediately run the same code block again, however, you're just as likely to see the reverse where the integer version runs slightly faster.
SQL> /
Average elapsed (number) = 1.256 hundredths of a second.
Average elapsed (integer) = 1.22 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:12.38
Realistically, where you're trying to measure differences in milliseconds or fractions of milliseconds, you're well into the realm where system noise is going to come into play. Even though my machine is "idle" other than the test I'm running, there are thousands of reasons why the system might add an extra millisecond or two to an elapsed time to deal with some interrupt or to run some background thread that does something for the operating system.
This result makes sense when you consider that INTEGER is just a synonym for NUMBER(38):
SQL> desc fg_test1;
Name Null? Type
----------------------------------------- -------- ----------------------------
A NUMBER(38)
SQL> desc fg_test;
Name Null? Type
----------------------------------------- -------- ----------------------------
A NUMBER
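If you'd rather confirm this from the data dictionary than from DESCRIBE, a quick query along these lines (a minimal sketch; the table names are the two test tables above) should show the INTEGER column stored as NUMBER with precision 38, while the plain NUMBER column has no precision at all:
select table_name, column_name, data_type, data_precision, data_scale
  from user_tab_columns
 where table_name in ('FG_TEST', 'FG_TEST1');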
Update:
Even using a NUMBER(6) (note that the INSERT has to be changed to load only 999,999 rows rather than 1 million, since 1,000,000 needs seven digits), there is no change.
Create the table:
SQL> create table fg_test2(a number(6));
Table created.
Elapsed: 00:00:00.01
SQL> ed
Wrote file afiedt.buf
1 insert into fg_test2
2* select level from dual connect by level <= 1000000-1
SQL> /
999999 rows created.
Elapsed: 00:00:07.61
SQL> create index fg_ix2 on fg_test2(a);
Index created.
Elapsed: 00:00:00.01
Run the script. Note that there are no significant differences across any of the four runs and (by chance) in none of the four cases is the NUMBER(6) table the most efficient.
SQL> ed
Wrote file afiedt.buf
1 declare
2 l_start_time number;
3 l_end_time number;
4 l_cnt number;
5 l_iterations number := 500;
6 begin
7 l_start_time := dbms_utility.get_time();
8 for i in 1 .. l_iterations
9 loop
10 select count(*)
11 into l_cnt
12 from fg_test;
13 end loop;
14 l_end_time := dbms_utility.get_time();
15 dbms_output.put_line( 'Average elapsed (number) = ' ||
16 (l_end_time - l_start_time)/l_iterations ||
17 ' hundredths of a second.' );
18 l_start_time := dbms_utility.get_time();
19 for i in 1 .. l_iterations
20 loop
21 select count(*)
22 into l_cnt
23 from fg_test1;
24 end loop;
25 l_end_time := dbms_utility.get_time();
26 dbms_output.put_line( 'Average elapsed (integer) = ' ||
27 (l_end_time - l_start_time)/l_iterations ||
28 ' hundredths of a second.' );
29 l_start_time := dbms_utility.get_time();
30 for i in 1 .. l_iterations
31 loop
32 select count(*)
33 into l_cnt
34 from fg_test2;
35 end loop;
36 l_end_time := dbms_utility.get_time();
37 dbms_output.put_line( 'Average elapsed (number(6)) = ' ||
38 (l_end_time - l_start_time)/l_iterations ||
39 ' hundredths of a second.' );
40* end;
SQL> /
Average elapsed (number) = 1.236 hundredths of a second.
Average elapsed (integer) = 1.234 hundredths of a second.
Average elapsed (number(6)) = 1.306 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:18.89
SQL> /
Average elapsed (number) = 1.208 hundredths of a second.
Average elapsed (integer) = 1.228 hundredths of a second.
Average elapsed (number(6)) = 1.312 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:18.74
SQL> /
Average elapsed (number) = 1.208 hundredths of a second.
Average elapsed (integer) = 1.232 hundredths of a second.
Average elapsed (number(6)) = 1.288 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:18.66
SQL> /
Average elapsed (number) = 1.21 hundredths of a second.
Average elapsed (integer) = 1.22 hundredths of a second.
Average elapsed (number(6)) = 1.292 hundredths of a second.
PL/SQL procedure successfully completed.
Elapsed: 00:00:18.62
UPDATE: My test had a small problem.
(I tried to insert 10M rows into the first table, but CONNECT BY raised an insufficient-memory exception. However, probably 2-3M rows were inserted and then rolled back, so my first table had the same number of rows but more blocks.)
So, the assertions below are not verified.
The answer is yes.
(But how much you gain from this, you should test with your own critical operations.)
INTEGER is a subtype of NUMBER. But, surprisingly, subtypes of NUMBER are always faster (need link here).
My test case:
create table fg_test(a number);
insert into fg_test
select level from dual connect by level <= 1000000;
--1000000 rows inserted
create index fg_ix on fg_test(a);
select count(*) from fg_test;
-- >141 msecs
create table fg_test1(a INTEGER);
insert into fg_test1
select level from dual connect by level <= 1000000;
--1000000 rows inserted
create index fg_ix1 on fg_test1(a);
select count(*) from fg_test1;
-- > 116 msecs
Explanation: select count(*) will do a fast full scan on the index. I ran the select count(*) many times to see what the best speed was. In general, the INTEGER version is faster: the best time for INTEGER beats the best time for NUMBER.
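If you want to verify the access path rather than take this on faith, a sketch like the following (using the fg_test table created above; the exact plan output will depend on your version and statistics) should show the fast full scan of fg_ix:
explain plan for select count(*) from fg_test;
select * from table(dbms_xplan.display);
-- expect an INDEX FAST FULL SCAN of FG_IX in the plan output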
I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset, and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me anything close to what I need, and I'm not sure how to structure the code to make it correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days with gaps will have a missing value for claim. Now we can calculate a moving sum and ignore those missing days when performing it:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two proc expand steps is that, within a single step, the rolling sum is calculated before the days are expanded, and we need the rolling sum to occur after the expansion. You can see this by running it all as a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use a SQL self join with the dates being within 365 days of each other. This is time/resource intensive if you have a very large data set.
Assuming you have a real date variable, intnx is probably a better way to calculate the date interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
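As a follow-on sketch of my own (not part of the original answer): to keep only the windows that exceed the limit described in the question (75 over a seven-day window in the mock data), you could add a HAVING clause to the same self-join. Whether the window should be day to day+6 or day to day+7 depends on how you want to count the end points.
proc sql;
create table exceeded as
select a.ppt, a.day, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+6
group by a.ppt, a.day
having sum(b.claim) > 75;
quit;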
Assuming that you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
OBS ppt day claim running_sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claim at each date and slice out the 365 days to be summed. This avoids the bookkeeping that the modular approach requires.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233
I'm looking to improve my code efficiency by turning my code into arrays and loops. The data I'm working with starts off like this:
ID Mapping Asset Fixed Performing Payment2017 Payment2018 Payment2019 Payment2020
1 Loan1 1 1 1 90 30 30 30
2 Loan1 1 1 0 80 20 40 20
3 Loan1 1 0 1 60 40 10 10
4 Loan1 1 0 0 120 60 30 30
5 Loan2 ... ... ... ... ... ... ...
So for each ID (essentially the data sorted by Mapping, Asset, Fixed and then Performing) I'm looking to build a profile for the Payment Scheme.
The Payment Vector for the first ID looks like this:
PaymentVector1 PaymentVector2 PaymentVector3 PaymentVector4
1 0.33 0.33 0.33
It is represented by the formula
PaymentVector(I)=Payment(I)/Payment(1)
The above is straightforward to create in an array; example code can be given if you wish.
Next, work under the assumption that every payment made is replaced, i.e. when 30 is paid in 2018, it must be replaced, and so on.
I'm looking to make a profile that shows the outflows (and, for illustration but not required in the code, the inflows in brackets) for the movement of the payments, as follows - for ID=1:
Payment2017 Payment2018 Payment2019 Payment2020
17 (+90) -30 -30 -30
18 N/A (+30) -10 -10
19 N/A N/A (+40) -13.3
20 N/A N/A N/A (+53.3)
So, looking forwards, the rows can be thought of as the current year and the columns as the years that are coming up.
Hence, in year 2019, what was to be paid in 2017 and 2018 is N/A because those payments are in the past and cannot be paid now.
In year 2018, looking at what has to be paid in 2019, you have to pay one-third of the money you have now, so -10.
I've been working through this dataset row by row, but there surely has to be a quicker way using arrays.
The code I've used so far looks like this:
Data Want;
Set Have;
Array Vintage(2017:2020) Vintage2017-Vintage2020;
Array PaymentSchedule(2017:2020) PaymentSchedule2017-PaymentSchedule2020;
Array PaymentVector(2017:2020) PaymentVector2017-PaymentVector2020;
Array PaymentVolume(2017:2020) PaymentVolume2017-PaymentVolume2020;
do i=1 to 4;
PaymentVector(i)=PaymentSchedule(i)/PaymentSchedule(1);
end;
I'll add code tomorrow... but the code doesn't work regardless.
data have;
input
ID Mapping $ Asset Fixed Performing Payment2017 Payment2018 Payment2019 Payment2020; datalines;
1 Loan1 1 1 1 90 30 30 30
2 Loan1 1 1 0 80 20 40 20
3 Loan1 1 0 1 60 40 10 10
4 Loan1 1 0 0 120 60 30 30
;
data want(keep=id payment: fraction:);
set have;
array p payment:;
array fraction(4); * track constant fraction determined at start of profile;
array out(4); * track outlay for ith iteration;
* compute constant (over iterations) fraction for row;
do i = dim(p) to 1 by -1;
fraction(i) = p(i) / p(1);
end;
* reset to missing to allow for sum statement, which is <variable> + <expression>;
call missing(of out(*));
out(1) = p(1);
do iter = 1 to 4;
p(iter) = out(iter);
do i = iter+1 to dim(p);
p(i) = -fraction(i) * p(iter);
out(i) + (-p(i)); * <--- compute next iteration outlay with ye olde sum statement ;
end;
output;
p(iter) = .;
end;
format fract: best4. payment: 7.2;
run;
You've indexed your arrays with 2017:2020 but then try to use them with a 1 to 4 index. That won't work; you need to be consistent.
Array PaymentSchedule(2017:2020) PaymentSchedule2017-PaymentSchedule2020;
Array PaymentVector(2017:2020) PaymentVector2017-PaymentVector2020;
do i=2017 to 2020;
PaymentVector(i)=PaymentSchedule(i)/PaymentSchedule(2017);
end;
I am new to Matlab, but I am trying.
I have the following code:
modified = [];  % initialise the output (assumed; needed for the concatenation below)
for t = 1:size(data,2)
b = data(t)/avevalue;
if b >= 1
cat1 = [repmat((avevalue),floor(b),1)',mod(data(t),15)];
else
cat1 = data(t);
end
modified = [modified,cat1];
end
The answer for
data=[16 18 16 25 17 7 15];
avevalue=15;
is
15 1 15 3 15 1 15 10 15 2 7 15 0
But when my array has more than 10000 elements it works impossibly slowly (nearly 3 minutes for 100000 elements, for example). How can I increase its speed?
There are two main reasons for the slowness:
The fact that you are using a loop.
The output array is growing on each iteration.
You can improve the runtime by trying the following approach:
% auxiliary array
divSumArray = ceil((data+1)/avevalue);
%defines output array
newArr = ones(1,sum(divSumArray))*avevalue;
%calculates modulo
moduloDataIndices = cumsum(divSumArray);
%assigning modulo in proper location
newArr(moduloDataIndices) = mod(data,avevalue);
The final result:
15 1 15 3 15 1 15 10 15 2 7 15 0
Time measurement
I measured runtime for the following input:
n = 30000;
data = randi([0 99],n,1);
avevalue=15;
original algo:
Elapsed time is 11.783951 seconds.
optimized algo:
Elapsed time is 0.007728 seconds.
I'm monitoring more than 300 servers, and for that I'm using Ganglia, which uses RRD as the database to collect and store data about each server's resources.
I would like to have a history of about 2 years or more, so after reading this article, I think that my RRA configuration should be:
RRAs "RRA:AVERAGE:0.5:1:17520"
17520 = (365 days [year] x 2) * 24 [hour]
This is the Ganglia default configuration, which is running today:
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
# "RRA:AVERAGE:0.5:5760:374"
#
Is my way of thinking right, or am I missing something here?
After studying this subject for a while, I came up with an answer that may help someone in the future. I read these two articles many times, and I recommend them.
Read this one first, Creating an initial RRD, and then read How to create an RRDTool database.
I will try to explain it simply. The format is RRA:CF:xff:steps:rows:
RRA: Round Robin Archive
CF: Consolidation Function (e.g. AVERAGE)
XFF: Xfiles Factor
steps: how many primary data points are consolidated into one stored row
rows: how many of those consolidated rows the archive keeps
The biggest issue for me was discovering the right values for steps and rows.
After reading, I came up with this explanation:
1 day - 5-minute resolution
1 week - 15-minute resolution
1 month - 1-hour resolution
1 year - 6-hour resolution
RRA:AVERAGE:0.5:1:288 \
RRA:AVERAGE:0.5:3:672 \
RRA:AVERAGE:0.5:12:744 \
RRA:AVERAGE:0.5:72:1480
Keep in mind that our step is 300 seconds, so the idea is very simple:
If I want to resolve one day which has 86400 seconds, as shown in the first example, how many rows do I need? The answer is 288 rows. Why?
`86400 seconds [1 day] / 300 seconds [5 minutes]` = 288 rows
Another example, if I want to resolve:
1 week [ = 604800 seconds ] in 15 minutes [ = 900 seconds ] = 604800/900 = 672 rows
And so it goes on for the other values. This way you are going to find out how many rows you need.
Finding out how many steps you need is also simple: you just have to work out the multiplier of your base step.
Let me explain: Our steps are 300 seconds, right?
So if we want to resolve 5 minutes [ = 300 seconds ], we just need to multiply by 1, right?
So, 15 minutes means 300 seconds x 3, 1 hour means 300 x 12, 6 hours means 300 x 72, and so on.
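Putting the two calculations above into one rule of thumb (just my restatement of the arithmetic, nothing new):
steps = desired resolution (seconds) / base step (seconds)
rows = time span to keep (seconds) / (base step (seconds) x steps)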
In my specific case, I would like my step to be 30 seconds, so I came up with this structure:
1 every time 30 seconds 1 * 30s = 30s
2 every second time 1 minute 2 * 30s = 1m
4 every third time 2 minutes 4 * 30s = 2m
10 every 10th time 5 minutes 10 * 30s = 5m
20 every 20th time 10 minutes 20 * 30s = 10m
60 every 60th time 30 minutes 60 * 30s = 30m
80 every 80th time 40 minutes 80 * 30s = 40m
100 every 100th time 50 minutes 100 * 30s = 50m
120 every 120th time 1 hour 120 * 30s = 1h
240 every 240th time 2 hours 240 * 30s = 2h
360 every 360th time 3 hours 360 * 30s = 3h
RRA:AVERAGE:0.5:1:120 \
RRA:AVERAGE:0.5:2:120 \
RRA:AVERAGE:0.5:4:120 \
RRA:AVERAGE:0.5:10:288 \
RRA:AVERAGE:0.5:20:1008 \
RRA:AVERAGE:0.5:60:1440 \
RRA:AVERAGE:0.5:80:3240 \
RRA:AVERAGE:0.5:100:5184 \
RRA:AVERAGE:0.5:120:8760 \
RRA:AVERAGE:0.5:240:8760 \
RRA:AVERAGE:0.5:360:8760
Which means:
1 hour - 30 seconds resolution
2 hours - 1 minute resolution
4 hours - 2 minutes resolution
1 day - 5 minutes resolution
1 week - 10 minutes resolution
1 month - 30 minutes resolution
3 months - 40 minutes resolution
6 months - 50 minutes resolution
1 year - 1 hour resolution
2 years - 2 hour resolution
3 years - 3 hour resolution
Well, I hope this helps someone, that's all.
I have two queries that filter some userids depending on questions and their answers.
Scenario
Query A (the original version) is:
SELECT userid
FROM mem..ProfileResult
WHERE ( ( QuestionID = 4
AND QuestionLabelID = 0
AND AnswerGroupID = 4
AND ResultValue = 1
)
OR ( QuestionID = 14
AND QuestionLabelID = 0
AND AnswerGroupID = 19
AND ResultValue = 3
)
OR ( QuestionID = 23
AND QuestionLabelID = 0
AND AnswerGroupID = 28
AND ( ResultValue & 16384 > 0 )
)
OR ( QuestionID = 17
AND QuestionLabelID = 0
AND AnswerGroupID = 22
AND ( ResultValue = 6
OR ResultValue = 19
OR ResultValue = 21
)
)
OR ( QuestionID = 50
AND QuestionLabelID = 0
AND AnswerGroupID = 51
AND ( ResultValue = 10
OR ResultValue = 41
)
)
)
GROUP BY userid
HAVING COUNT(*) = 5
I use 'set statistics time on' and 'set statistics io on' to check the CPU time and I/O performance.
The result is:
CPU time = 47206 ms, elapsed time = 20655 ms.
I rewrote Query A using set operations; let me name it Query B:
SELECT userid
FROM ( SELECT userid
FROM mem..ProfileResult
WHERE QuestionID = 4
AND QuestionLabelID = 0
AND AnswerGroupID = 4
AND ResultValue = 1
INTERSECT
SELECT userid
FROM mem..ProfileResult
WHERE QuestionID = 14
AND QuestionLabelID = 0
AND AnswerGroupID = 19
AND ResultValue = 3
INTERSECT
SELECT userid
FROM mem..ProfileResult
WHERE QuestionID = 23
AND QuestionLabelID = 0
AND AnswerGroupID = 28
AND ( ResultValue & 16384 > 0 )
INTERSECT
SELECT userid
FROM mem..ProfileResult
WHERE QuestionID = 17
AND QuestionLabelID = 0
AND AnswerGroupID = 22
AND ( ResultValue = 6
OR ResultValue = 19
OR ResultValue = 21
)
INTERSECT
SELECT userid
FROM mem..ProfileResult
WHERE QuestionID = 50
AND QuestionLabelID = 0
AND AnswerGroupID = 51
AND ( ResultValue = 10
OR ResultValue = 41
)
) vv;
The CPU time and elapsed time are:
CPU time = 8480 ms, elapsed time = 18509 ms
My Simple Analysis
As you can see from the results above, Query A's CPU time is more than twice its elapsed time.
I searched for this case; most people say CPU time should be less than elapsed time, because CPU time is how long the CPU spends running the task, while elapsed time also includes I/O time and other costs. One special case is when the server has a multi-core CPU. However, I just checked the development DB server and it has a single-core CPU.
Question 1
How can the CPU time be greater than the elapsed time for Query A in a single-core CPU environment?
Question 2
After using set operations, is the performance really improved?
I ask because the logical reads of Query B (280627) are higher than Query A's (241885).
Brad McGehee said in his article that 'The fewer the logical reads performed by a query, the more efficient it is, and the faster it will perform, assuming all other things are held equal.'
So is it correct to say that even though Query B has higher logical reads than Query A, because its CPU time is significantly lower, Query B should have better performance?
If CPU time is greater than elapsed time, you do have a multi-core or hyper-threaded CPU.
The CPU time reflects the machine where the SQL Server engine is installed, not a local Management Studio install.
As for logical I/O vs CPU, I'd go with lower CPU. If this runs often and with overlap, you'll run out of CPU resource first. I'd try a WHERE EXISTS (UNION ALL) construct and make sure I have good indexes.
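For illustration only, here is a rough sketch of an EXISTS-based rewrite (my reading of the suggestion, not the answerer's exact construct; only the first two criteria are spelled out, and the remaining three would follow the same pattern as in Query A):
SELECT u.userid
FROM ( SELECT DISTINCT userid FROM mem..ProfileResult ) AS u
WHERE EXISTS ( SELECT 1
               FROM mem..ProfileResult AS p
               WHERE p.userid = u.userid
                 AND p.QuestionID = 4
                 AND p.QuestionLabelID = 0
                 AND p.AnswerGroupID = 4
                 AND p.ResultValue = 1 )
  AND EXISTS ( SELECT 1
               FROM mem..ProfileResult AS p
               WHERE p.userid = u.userid
                 AND p.QuestionID = 14
                 AND p.QuestionLabelID = 0
                 AND p.AnswerGroupID = 19
                 AND p.ResultValue = 3 )
  -- ...three more EXISTS blocks for QuestionID 23, 17 and 50, with the same predicates as in Query A
;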
Edit, after comments
There are parallelism operators in the plan, which means more than one logical processor is visible to the OS and SQL Server. So it's either multi-core or hyper-threaded.
Try EXEC xp_msver
In my case, the SQL Server execution times were:
CPU time = 671 ms, elapsed time = 255 ms.
CPU time was nearly three times larger than the elapsed time for the query. Because the query was processed in parallel, the CPU burden was very high, and the CPU could become a bottleneck in this scenario.
SQL Server 2012 brings a solution to the CPU burden problem: it introduces iterators that process batches of rows at a time, not just row by row.
For query optimization you can create a columnstore index on your table:
CREATE COLUMNSTORE INDEX idx_cs_colname
ON dbo.Tablename(field1, field2);