I am trying to understand the structure of SQL Server data pages. This is a screenshot from Pro SQL Server Internals by Dmitri Korotkevitch.
I've created 3 tables:
1 INT Column
2 INT Columns
4 INT Columns
All columns are NOT NULL
Then I run
dbcc traceon(3604);
dbcc page
(
'DbName'
,1 /*File ID*/
,368 /*Page ID*/
,3 /*Output mode: 3 - display page header and row details */
);
and I get the following values for Fdata length:
With 1 Column - 0800 = 0008 = 8 = 4 + 1x4
With 2 Columns - 0c00 = 000c = 12 = 4 + 2x4
With 4 Columns - 1400 = 0014 = 20 = 4 + 4x4
Here I've listed "value in the output" = "byte-swapped value" = "decimal value" = "4 + number of columns x 4".
EDITED:
As far as I understand it, the formula is Const_4 + Nbr_of_Columns * Size_Of_Columns. What is this Const_4?
"Fdata length" in the picture is the offset where the fixed-length data part of the row ends.
In your example the data occupie 4, 8 and 16 bytes, and the row starts with another 4 bytes (1+1+2: status bits A, status bits B, Flength), so the "end" (offset) of storing fixed length data is 8, 12 and 20 respectively.
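To make the arithmetic concrete, here is a minimal T-SQL sketch (table and column names are mine, not from the original test) that reproduces the three tables and the expected values, including the little-endian byte order seen in the output:
CREATE TABLE t1 (c1 INT NOT NULL);
CREATE TABLE t2 (c1 INT NOT NULL, c2 INT NOT NULL);
CREATE TABLE t4 (c1 INT NOT NULL, c2 INT NOT NULL, c3 INT NOT NULL, c4 INT NOT NULL);
-- expected Fdata length = 4 (Status Bits A + Status Bits B + the 2-byte
-- Fdata length field itself) + number_of_columns * 4 (size of INT):
--   t1: 4 + 1*4 =  8 = 0x0008, stored little-endian as 08 00
--   t2: 4 + 2*4 = 12 = 0x000C, stored little-endian as 0C 00
--   t4: 4 + 4*4 = 20 = 0x0014, stored little-endian as 14 00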
I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset; so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me even close to what I need - not sure how to structure code to make this correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days with gaps will have a missing value for claim. Now we can calculate a moving sum over the expanded data, ignoring those missing days:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two PROC EXPAND steps is that the rolling sum is calculated before the days are expanded, and we need the rolling sum to occur after the expansion. You can see this by running everything in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use a SQL self join where the dates are within 365 days of each other. This is time/resource intensive if you have a very large data set.
Assuming you have a date variable, INTNX() is probably a better way to calculate the date interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
Assuming that you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims and generate the rolling sum from it. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
OBS ppt day claim running_sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claim at each date and slice out the 365 days to be summed. This avoids the bookkeeping that the modular approach requires.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233
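A quick way to confirm that note in a SAS session is to format the raw date value with PUTN:
%* print SAS date value 100000 using the DATE9. format;
%put %sysfunc(putn(100000, date9.));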
I am trying to transpose a sequence of ranges from an excel file into SAS. The excel file looks something like this:
31 Dec 01Jan 02Jan 03Jan 04Jan
Book id1 23 24 35 43 98
Book id2 3 4 5 4 1
(few blank rows in between)
05Jan 06Jan 07Jan 08Jan 09Jan
Book id1 14 100 30 23 58
Book id2 2 7 3 8 6
(and it repeats..)
My final output should have a first column for the date and then two additional columns for the book Ids:
Date Book id1 Book id2
31 Dec 23 3
01Jan 24 4
02Jan 35 5
03Jan 43 4
04Jan 98 1
05Jan 14 2
06Jan 100 7
07Jan 30 3
08Jan 23 8
09Jan 58 6
In this particular case I am asking for a simpler method that would:
Either import and transpose each range of data, replacing the data range with macro variables so each individual range is imported and transposed separately,
Or import the whole data file first and then create a loop that transposes each range of data.
Code I used for a simple import and transpose of a specific data range:
proc import datafile="&input./have.xlsx"
out=want
dbms=xlsx replace;
range="Data$A3:F5" ;
run;
proc transpose data=want
out=want_transposed
name=date;
id A;
run;
Thanks!
A data row that is split over several segments or blocks of rows in an Excel file can be imported raw into SAS and then processed into a categorical form using a DATA Step.
In this example sample data is put into a text file and imported such that the column names are generic VAR-1 ... VAR-n. The generic import is then processed across each row, outputting one SAS data set row per import cell.
The column names in each segment are retained in a temporary array and updated whenever a blank book id is encountered.
* mock data;
filename demo "%sysfunc(pathname(WORK))\demo.txt";
data _null_;
input;
file demo;
put _infile_;
datalines;
., 31Dec, 01Jan, 02Jan, 03Jan, 04Jan
Book_id1, 23 , 24 , 35 , 43 , 98
Book_id2, 3 , 4 , 5 , 4 , 1
., 05Jan, 06Jan, 07Jan, 08Jan, 09Jan
Book_id1, 14 , 100 , 30 , 23 , 58
Book_id2, 2 , 7 , 3 , 8 , 6
run;
* mock import;
proc import replace out=work.haveraw file=demo dbms=csv;
getnames = no;
datarow = 1;
run;
ods listing;
proc print data=haveraw;
run;
The Excel import is thus made to look like this:
Obs VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
1 31Dec 01Jan 02Jan 03Jan 04Jan
2 Book_id1 23 24 35 43 98
3 Book_id2 3 4 5 4 1
4
5 05Jan 06Jan 07Jan 08Jan 09Jan
6 Book_id1 14 100 30 23 58
7 Book_id2 2 7 3 8 6
It can then be processed in a transposing way, outputting one name-value pair per original cell.
data have (keep=bookid date value);
set haveraw;
array dates(1000) $12 _temporary_ ;
array vars var:;
if missing(var1) then do;
do index = 2 by 1 while (index <= dim(vars));
if not missing(vars(index)) then
dates(index) = put(index-1,z3.) || '_' || vars(index); * adjust as you see fit;
else
dates(index) = '';
end;
end;
else do;
bookid = var1;
do index = 2 by 1 while (index <= dim(vars));
date = dates(index);
value = input(vars(index),??best12.);
output;
end;
end;
run;
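If you then want the wide layout from the question (one column per book id), a PROC TRANSPOSE sketch along these lines should finish the job, assuming the 'nnn_' prefix added above keeps the dates in calendar order:
proc sort data=have;
  by date;
run;
proc transpose data=have out=want(drop=_name_);
  by date;     * one output row per date;
  id bookid;   * one column per book id;
  var value;
run;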
I did not know how to title this question properly. I am trying to understand how data pages are stored. I've created a simple table:
CREATE TABLE testFix
(
id INT,
v CHAR(10)
);
INSERT INTO dbo.testFix
(
id,
v
)
VALUES
( 1, -- id - int
'asdasd' -- v - char(10)
)
GO 2
DBCC TRACEON(3604);
Then I got PageFID and PagePID with the following command:
DBCC IND(tempdb, testFix, -1)
GO
Then the actual data pages:
DBCC PAGE (tempdb, 1, 368, 3)
So now I see:
Slot 0 Offset 0x60 Length 21
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP
Record Size = 21
Memory Dump #0x000000287DD7A060
0000000000000000:   10001200 01000000 61736461 73642020 20200200  ........asdasd    ..
0000000000000014:   00                                            .
Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 1
Slot 0 Column 2 Offset 0x8 Length 10 Length (physical) 10
v = asdasd
Slot 1 Offset 0x75 Length 21
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP
Record Size = 21
Memory Dump #0x000000287DD7A075
0000000000000000:   10001200 01000000 61736461 73642020 20200200  ........asdasd    ..
0000000000000014:   00                                            .
Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4
id = 1
Slot 1 Column 2 Offset 0x8 Length 10 Length (physical) 10
v = asdasd
Slot 2 Offset 0x8a Length 21
Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP
Record Size = 21
Memory Dump #0x000000287DD7A08A
0000000000000000:   10001200 01000000 61736461 73642020 20200200  ........asdasd    ..
0000000000000014:   00                                            .
So the length of the record is 21 bytes. However, INT is 4 bytes and CHAR(10) is 10 bytes, and 4 + 10 = 14. What are the other 7 bytes used for?
Here is the "anatomy" of a data row:
In red are the 7 bytes you are missing: Status Bits A (1), Status Bits B (1), Fdata length (2), Ncols (2), NullBits (1).
From this book: Pro SQL Server Internals by Korotkevitch D.
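Mapped onto the memory dump above, the 21 bytes of your record break down as:
Offset 0x00: Status Bits A (1 byte)  = 10
Offset 0x01: Status Bits B (1 byte)  = 00
Offset 0x02: Fdata length  (2 bytes) = 1200 -> 0x0012 = 18, the offset where fixed-length data ends (4 + 4 + 10)
Offset 0x04: id INT        (4 bytes) = 01000000 -> 1
Offset 0x08: v CHAR(10)    (10 bytes)= 'asdasd    ' (space-padded)
Offset 0x12: Ncols         (2 bytes) = 0200 -> 2
Offset 0x14: NullBits      (1 byte)  = 00
Total: 1 + 1 + 2 + 4 + 10 + 2 + 1 = 21 bytes, matching Record Size = 21.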
I think this question might be related to Ms Excel -> 2 columns into a 2 dimensional array but I can't quite make the connection.
I have a VBA script for filling in missing data. I select two adjacent columns, and it finds any gaps in the second column and linearly interpolates based on the (possibly irregular) spacing in the first column. For instance, I could use it on this data:
1 7
2 14
3 21
5 35
5.1
6 42
7
8
9 45
to get this output
1 7
2 14
3 21
5 35
5.1 35.7 <-- 1/10th of the way between 35 & 42
6 42
7 43 <-- 1/3 of the way between 42 & 45
8 44 <-- 2/3 of the way between 42 & 45
9 45
This is very useful for me.
My trouble is that it only works on contiguous columns. I would like to be able to select two columns that are not adjacent to each other and have it work the same way. My code starts out like this:
Dim addr As String
addr = Selection.Address
Dim nR As Long
Dim nC As Long
'Reads Selected Cells' Row and Column Information
nR = Range(addr).Rows.Count
nC = Range(addr).Columns.Count
When I run this with contiguous columns selected, addr shows up in the Locals window with a value like "$A$2:$B$8" and nC = 2.
When I run this with non-contiguous columns selected, addr shows up in the Locals window with a value like "$A$2:$A$8,$C$2:$C$8" and nC = 1.
Later on in the script, I collect the values in each column into an array. Here's how I deal with the second column, for example:
'Creates a Column 2 (col1) array, determines cells needed to interpolate for, changes font to bold and red, and reads its values
Dim col2() As Double
ReDim col2(0 To nR + 1)
i = 1
Do Until i > nR
If IsEmpty(Selection(i, 2)) Or Selection(i, 2) = 0 Or Selection(i, 2) = -901 Then
Selection(i, 2).Font.Bold = True
Selection(i, 2).Font.Color = RGB(255, 69, 0)
col2(i) = 9999999
Else
col2(i) = Selection(i, 2)
End If
i = i + 1
Loop
This is also busted, because even if my selection is "$A$2:$A$8,$C$2:$C$8" VBA will treat Selection(1,2) as a reference to $B$2, not the desired $C$2.
Anyone have a suggestion for how I can get VBA to treat non-contiguous selection the way it treats contiguous?
You're dealing with "disjoint ranges." Use the Areas collection, e.g., as described here. The first column should be in Selection.Areas(1) and the second column should be in Selection.Areas(2).
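For example, a minimal sketch (variable names are mine) that walks the two selected columns through Areas instead of Selection(i, 2):
Dim colA As Range, colB As Range
Set colA = Selection.Areas(1)   'first selected column, e.g. $A$2:$A$8
Set colB = Selection.Areas(2)   'second selected column, e.g. $C$2:$C$8

Dim nR As Long
nR = colA.Rows.Count

Dim i As Long
For i = 1 To nR
    'colB.Cells(i, 1) refers to $C$2, $C$3, ... even though column B was skipped
    If IsEmpty(colB.Cells(i, 1)) Then
        colB.Cells(i, 1).Font.Bold = True
        colB.Cells(i, 1).Font.Color = RGB(255, 69, 0)
    End If
Next i
Note that a contiguous selection like $A$2:$B$8 has only one area, so check Selection.Areas.Count first and fall back to Selection.Columns(1) and Selection.Columns(2) in that case.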
There are multiple occurrences of the same combination of values in different rows of a matrix in MATLAB, for example 1 1 in the first and second rows. I want to remove those duplicates while adding up the values in the third column; in the case of 1 1 the total is 7. Finally I want to create a similarity matrix as shown in the answer below. I don't mind 2*values on the diagonal because I will not be considering diagonal elements in further work. The code below does this but it is not vectorized. Can this be vectorized somehow? An example is given below.
datain = [ 1 1 3;
1 1 4;
1 2 5;
1 2 4;
1 2 3;
1 3 8;
1 3 7;
1 3 12;
2 2 22;
2 2 77;
2 3 111;
2 3 113;
3 3 456;
3 3 568];
cmp1=unique(datain(:,1));
cmp1sz=size(cmp1,1);
cmp2=unique(datain(:,2));
cmp2sz=size(cmp2,1);
thetotal=zeros(cmp1sz,cmp2sz);
for i=1:size(datain,1)
for j=1:cmp1sz
for k=1:cmp2sz
if datain(i,1)==cmp1(j,1) && datain(i,2)== cmp2(k,1)
thetotal(j,k)=thetotal(j,k)+datain(i,3);
thetotal(k,j)=thetotal(k,j)+datain(i,3);
end
end
end
end
The answer is
14 12 27
12 198 224
27 224 2048
This is a textbook case for using ACCUMARRAY.
thetotal = accumarray(datain(:,1:2), datain(:,3), [], @sum, 0);
% to make the array symmetric, you simply add its transpose
thetotal = thetotal + thetotal'
thetotal =
14 12 27
12 198 224
27 224 2048
EDIT
So what if datain does not contain only integer values? In this case you can still construct a table, but e.g. thetotal(1,1) will no longer correspond to datain(1,1:2) == [1 1]; instead it corresponds to the smallest entry in the first two columns of datain.
[uniqueVals,~,tmp] = unique(reshape(datain(:,1:2),[],1));
correspondingIndices = reshape(tmp,size(datain(:,1:2)));
thetotal = accumarray(correspondingIndices, datain(:,3), [], @sum, 0);
The value at [1 1] now corresponds to the row [uniqueVals(1) uniqueVals(1)] in the first two cols of datain.