Calculating space savings for sparse columns in SQL Server

Calculating space savings for sparse columns in SQL Server - sql-server

According to the SQL Server documentation for sparse columns:
Consider using sparse columns when the space saved is at least 20 percent to 40 percent.
And there are tables that show the percentage of data that must be NULL to achieve a "net space savings of 40%" based on the size of the data type. Here's a summary:
Nonsparse bytes Sparse bytes NULL percentage
--------------- ------------ ---------------
1 5 86%
2 6 76%
3 7 69%
4 8 64%
5 9 60%
6 10 57%
8 (bigint) 12 52%
10 14 49%
Questions:
How were the above NULL percentages calculated?
I'm having trouble figuring out the math. For example, consider the row for bigint. If 52% of values in a sparse bigint column are NULL, then the 48% of non-NULL values should require 12 bytes × 48% = 5.76 bytes on average, which is 8 − 5.76 = 2.24 bytes less compared to a nonsparse column. But this is a savings of 2.24 bytes ÷ 8 bytes = 28%, not 40%.
What percentage of data must be NULL to achieve a space savings of 20%?
The post Why & When should I use SPARSE COLUMN? doesn't answer either question.

Related

Obtaining data from an array to a dataframe

so i have 2 datasets, the first one is a dataframe
df1 <- data.frame(user=c(1:10), h01=c(3,3,6,8,9,10,4,1,2,5), h12=c(5,5,3,4,1,2,8,8,9,10),a=numeric(10))
the first column represents the user id, and h01 represents the id of a cell phone antenna from which the user is connected for a period of time (00:00 - 1:00AM) and h12 represents the same but between 1:00AM and 2:00AM.
And then i have an array
array1 <- array(c(23,12,63,11,5,6,9,41,23,73,26,83,41,51,29,10,1,5,30,2), dim=c(10,2))
The rows represent the cell phone antenna id, the columns represent the periods of time and the values in array1 represent how many people is connected to the antenna at that period of time. So array1[1,1] will print how many people is connected between 00:00 and 1:00 to antenna 1, array1[2,2] will print how many people is connected between 1:00 and 2:00 to antenna 2 and so on.
What i want to do is for each user in df1 obtain from array1 how many people in total is connected to the same antennas in the same period of time and place the value in column a.
For example, the first user is connected to antenna 3 between 00:00 and 1:00AM, and antenna 5 between 1:00AM and 2:00AM, so the value in a should be array1[3,1] plus array1[5,2]
I used a for loop to do this
aux1 <- df1[,2]
aux2 <- df1[,3]
for(i in 1:length(df1$user)){
df1[i,4] <- sum(array1[aux1[i],1],array1[aux2[i],2])
}
which gives
user h01 h02 a
1 1 3 5 92
2 2 3 5 92
3 3 6 3 47
4 4 8 4 92
5 5 9 1 49
6 6 10 2 156
7 7 4 8 16
8 8 1 8 28
9 9 2 9 42
10 10 5 10 7
This loop works and gives the correct values, the problem is the 2 datasets (df1 and array1) are really big. df1 has over 20.000 users and 24 periods of time, and array1 has over 1300 antennas, not to mention that this data corresponds to users from one socioeconomic level, and i have 5 in total, so simplifying the code is mandatory.
I would love if someone could show me a different approach to this, specially if its withouth a for loop.

Try this approach:
df1$a <- array1[df1$h01,1] + array1[df1$h12,2]

saving hashtable using c so that random access is faster

I am writing a C code (call it database generation) processes an input file and generated a number in range [1,10^8] alongwith a sequence of float values whose length is fixed but unknown followed by 3 integers. All values are separated by space
Example:
19432 23.45 32.12 45.76 ...(156 such float values) 4 6 106
This will be one line of database where first number is hash index (one to 10^8) , and last 3 integers denote the x,y coordinated and document ID respectively.
Our database is saved in file xyz which has following content
2341 34.67 43.13 ... (234 such float values) 5 8 123
2352 46.92 41.89 ... (51 such float values) 1 9 145
2352 46.92 41.89 ... (98 such float values) 2 7 12
2359 12.71 72.90 ... (141 such float values) 8 12 13
The starting number (hash index value) will always be in non-decreasing order in database as we proceed from one line to next.
I have another C code (call it retrieval) which takes hash index value as input and should output all lines starting with that value.
I have 2 questions
How can I make sure that retrieval directly jumps to line containing asked hash index value skipping the starting lines of database so that its response is fast.
When I get another input file for database and its hash index value is 2352. How do i add another line starting with 2352 at its proper position in database?
I am considering following approach which is not ideal, as the database won't be organised in required non-decreasing order of hash index values. Also, database is split into 2 components. One contains byte offset entries for each hash index and another is the database file presented above.
It involves
(1)byte-offset.txt of the form
2341 byte-pos-1
2352 byte-pos-2
2359 byte-pos-3
2352 byte-pos-4
(2)database.txt of the form
2341 34.67 43.13 ... (234 such float values) 5 8 123
2352 46.92 41.89 ... (51 such float values) 1 9 145
2359 12.71 72.90 ... (141 such float values) 8 12 13
2352 46.92 41.89 ... (98 such float values) 2 7 12
the only good thing about it is that new entries can be appended to end in each file as database grows when we get more data.

How to make a table and access table parameters using array of structures in C?

Rows are Temperature and columns are Pressure.
Temp 750 755 760 765(pressure)
0 1.1 2 1 4
1 3 4 2 1 (factors)
2 4 5 5 9
I need a help in making this table in code with that i would like to access factor values for respective temp and pressure .
For example if temp 0 and pressure 750 the factor value is 1.1 ,if temp 1 and pressure 750 factor value is 3.
My Sample Output image

ORACLE: Calculate storage requirement of table in terms of blocks on paper

I have to calculate storage requirements in terms of blocks for following table:
Bill(billno number(7),billdate date,ccode varchar2(20),amount number(9,2))
The table storage attributes are :
PCTFREE=20 , INITRANS=4 , PCTUSED=60 , BLOCKSIZE=8K , NUMBER OF ROWS=100000
I searched a lot on internet, referred many books but didn't got anything.

First you need to figure out what is the typical value for varchar2 column. The total size will depend on that. I created 2 tables from your BILL table. BILLMAX where ccode takes always 20 Char ('12345678901234567890') and BILLMIN that has always NULL in ccode.
The results are:
TABLE_NAME NUM_ROWS AVG_ROW_LEN BLOCKS
BILLMAX 3938 37 28
BILLMIN 3938 16 13
select table_name, num_rows, avg_row_len, blocks from user_tables
where table_name in ( 'BILLMIN', 'BILLMAX')
As you can see, the number of blocks depends on that. Use exec dbms_stats.GATHER_TABLE_STATS('YourSchema','BILL') to refresh values inside user_tables.
The other thing that you need to take into consideration is how big will be your extents. For example :
STORAGE (
INITIAL 64K
NEXT 1M
MINEXTENTS 1
MAXEXTENTS UNLIMITED
PCTINCREASE 0
BUFFER_POOL DEFAULT
)
will generate first 16 extents with 8 blocks size. After that it will start to create extents with size of 1 MB (128 blocks).
So for BILLMAX it will generate 768 blocks and BILLMIN will take 384 blocks.
As you can see the difference is quite big.
For BILLMAX : 16 * 8 + 128 * 5 = 768
For BILLMIN : 16 * 8 + 128 * 2 = 384

Matlab Dataset Array Calculating delta t

I have a very large dataset array with over a million values that looks like this:
Month Day Year Hour Min Second Line1 Line2 Power Dt
7 8 2013 0 1 54 1.91 4.98 826.8 0
7 8 2013 0 0 9 1.93 3.71 676.8 0
7 8 2013 0 1 15 1.92 5.02 832.8 0
7 8 2013 0 1 21 1.91 5.01 830.4 0
and so on.
When the measurement of seconds got to 60 it would start over again at 0 hence why the first number is bigger. I need to fill the delta t column (Dt) by taking the current rows seconds column and subtracting the previous rows seconds column and correcting for negatyive values. This opperation cannot preform this operation in a loop as the it would take ages to complete and needs to be completed in a simple, one-shot, vector subtraction operation.

You can try diff command to generate such results. Its very fast and should work wihout any for loop.
HTH

Dt=diff(datenum(A(:,1:6)))*60*60*24;
This gives the delta in seconds, but I'm not sure what you want you correction for negative differences to be. Could you give an example of the expected output?
Note that Dt will be one entry shorter than A, so you may have to pad it.
You can remove the negative values (I think) with the command
Dt(Dt<0)=Dt(Dt<0)+60;
If you need to pad the Dt vector so that it is the same length as the data set, try
Dt=[Dt;0];

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Calculating space savings for sparse columns in SQL Server - sql-server

Related

Obtaining data from an array to a dataframe

saving hashtable using c so that random access is faster

How to make a table and access table parameters using array of structures in C?

ORACLE: Calculate storage requirement of table in terms of blocks on paper

Matlab Dataset Array Calculating delta t

Categories

Resources