How does the CACHE option of CREATE SEQUENCE work? - sql-server

CREATE SEQUENCE has a CACHE option. MSDN defines it as:
[ CACHE [<constant> ] | NO CACHE ]
Increases performance for applications that use sequence objects by
minimizing the number of disk IOs that are required to generate
sequence numbers. Defaults to CACHE. For example, if a cache size of
50 is chosen, SQL Server does not keep 50 individual values cached. It
only caches the current value and the number of values left in the
cache. This means that the amount of memory required to store the
cache is always two instances of the data type of the sequence object.
I understand that it improves performance by avoiding disk IO, keeping just enough information in memory to reliably generate the next number in the sequence, but I cannot picture what the simple memory representation of the cache that MSDN describes in the example would look like.
Can someone explain how the cache would work with this sequence
CREATE SEQUENCE s
AS INT
START WITH 0
INCREMENT BY 25
CACHE 5
describing what the cache memory would hold when each of the following statements is executed independently:
SELECT NEXT VALUE FOR s -- returns 0
SELECT NEXT VALUE FOR s -- returns 25
SELECT NEXT VALUE FOR s -- returns 50
SELECT NEXT VALUE FOR s -- returns 75
SELECT NEXT VALUE FOR s -- returns 100
SELECT NEXT VALUE FOR s -- returns 125

This paragraph in the doc is very helpful:
For an example, a new sequence is created with a starting value of 1 and a cache size of 15. When the first value is needed, values 1 through 15 are made available from memory. The last cached value (15) is written to the system tables on the disk. When all 15 numbers are used, the next request (for number 16) will cause the cache to be allocated again. The new last cached value (30) will be written to the system tables.
So, in your scenario
CREATE SEQUENCE s
AS INT
START WITH 0
INCREMENT BY 25
CACHE 5
You will have 0, 25, 50, 75 and 100 in memory, and only one value is written to disk: 100.
The problem you could have, explained in the doc, is that if the server goes down before you have used all 5 cached values, the unused values are lost; the next time you ask for a value you'll get 125.
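To make the MSDN description concrete, here is a minimal C sketch of what that two-value state could look like. The struct, the constants, and modelling the persisted value as a plain variable are all hypothetical illustration, not actual SQL Server internals:

#include <stdio.h>

/* Per MSDN, the cache is just two instances of the sequence's data
   type: the current value and the number of values left. */
typedef struct {
    int current;    /* next value to hand out */
    int remaining;  /* values left before another disk write is needed */
} seq_cache;

#define INCREMENT  25
#define CACHE_SIZE 5

int next_value(seq_cache *c, int *disk_high_water) {
    if (c->remaining == 0) {
        /* cache exhausted: reserve a new batch of CACHE_SIZE values;
           the only disk IO is writing the last value of the batch */
        *disk_high_water = c->current + (CACHE_SIZE - 1) * INCREMENT;
        c->remaining = CACHE_SIZE;
    }
    int v = c->current;
    c->current += INCREMENT;
    c->remaining--;
    return v;
}

int main(void) {
    seq_cache c = {0, 0};   /* START WITH 0, cache empty */
    int disk = -1;          /* persisted high-water mark */
    for (int i = 0; i < 6; i++) {
        int v = next_value(&c, &disk);
        printf("value=%d  remaining=%d  on_disk=%d\n", v, c.remaining, disk);
    }
    return 0;
}

Walking through the six calls: the first one writes 100 to "disk" and returns 0, leaving {current=25, remaining=4} in memory; calls two through five return 25, 50, 75 and 100 touching only memory; the sixth call finds the cache empty, writes the new high-water mark 225, and returns 125. If the server crashes after, say, the third call, the unused cached values are lost, and on restart the sequence resumes at 125, right after the persisted 100 -- exactly the gap described above.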

Related

Is there a space efficent way to store and retrieve the order of a dataset?

Here's my problem. I have a set of 20 objects stored in memory as an array. I want to store a second piece of data that defines an order for the objects to be displayed.
The simplest way to store the order is as an array of 20 unsigned integers, each of which is 5 bits (i.e. 0-31). The position of the object in the output list would be defined by the number stored in this array at the same index the object occupies in its own array.
But I know from statistics that there are only 20! (that's 20 factorial) ways to arrange these objects.
This could be stored in 62 bits, since 2^62 > 20!
I'm currently using 100 bits to store the same information.
So my question is this: Is there a space efficient way to store ORDER as a sequence of bits?
I have some additional constraints as well. This will run on an embedded device, so I can't use any huge arrays or high-level math functions. I would need a simple iterative method.
Edit: Some clarification on why this is necessary. Say for example the objects are pictures, and they're stored in ROM (i.e. they can't be moved around). Now let's say I want to keep track of what order to display the images in, and I'm going to update that order every second. My device has 1k of storage with wear leveling, but each bit in the storage can only be written 1000 times before it becomes unreliable. If I need 1kb to store the order, then my device will only work for 1000 seconds. If I need 0.1kb, it will work for 10k seconds, and so on. Thus the device's longevity will be inversely proportional to the number of bits I need to update every cycle.
You can store the order in a single 64-bit value x:
For the first choice, 20 possibilities, compute the index as x % 20 and update x as x /= 20,
For the next choice, only 19 possibilities, compute x % 19 and update x as x /= 19.
Continue this process 17 more times and you are done.
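A small C sketch of this mixed-radix scheme; the function names are mine, and encode is simply the mirror image of the decode steps described above:

#include <stdint.h>

#define N 20

/* Decode: recover the display order from one 64-bit value.  At step i
   there are (N - i) objects left; x % (N - i) picks one of them and
   then x /= (N - i) moves on, exactly as in the answer. */
void decode_order(uint64_t x, int order[N]) {
    int pool[N];                   /* objects not yet placed */
    for (int i = 0; i < N; i++) pool[i] = i;
    int count = N;
    for (int i = 0; i < N; i++) {
        int k = (int)(x % (uint64_t)count);
        x /= (uint64_t)count;
        order[i] = pool[k];
        pool[k] = pool[--count];   /* swap-remove the chosen object */
    }
}

/* Encode: take each object's index in the same shrinking pool and
   accumulate it with a growing mixed radix (20, then 19, ...). */
uint64_t encode_order(const int order[N]) {
    int pool[N], pos[N];           /* pos[obj] = obj's index in pool */
    for (int i = 0; i < N; i++) pool[i] = pos[i] = i;
    int count = N;
    uint64_t x = 0, radix = 1;
    for (int i = 0; i < N; i++) {
        int k = pos[order[i]];
        x += radix * (uint64_t)k;
        radix *= (uint64_t)count;
        pool[k] = pool[--count];   /* same swap-remove as the decoder */
        pos[pool[k]] = k;
    }
    return x;
}

Since 2^62 > 20!, x always fits in a uint64_t, and the whole thing needs only integer %, / and * plus a couple of 20-element arrays -- no large tables or high-level math, which suits the embedded constraint.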
I think I've found a partial solution to my own question. Assuming I start at the left side of the order array, for every move right there are fewer remaining possibilities for the position value. The number of possibilities is 20, 19, 18, etc. I can take advantage of this by populating the order array in a relative fashion. The first index places a value in the order array; there are 20 possibilities, so this takes 5 bits. Placing the next value, there are only 19 positions available (still 5 bits). Proceeding through the whole array, the bits required are 5,5,5,5,4,4,4,4,4,4,4,4,3,3,3,3,2,2,1,0. So that gets me down to 69 bits, much better.
There's still some "wasted" precision in each of the values, since for example the first position can store 32 possible values even though there are only 20. I'm not sure how to deal with this, but I think it will have something to do with carrying a remainder from one calculation to the next.

Is there an easy way to get the percentage of successful reads of last x minutes?

I have a setup with a Beaglebone Black which communicates over I²C with its slaves every second and reads data from them. Sometimes the I²C readout fails, though, and I want to gather statistics about these failures.
I would like to implement an algorithm which displays the percentage of successful communications of the last 5 minutes (up to 24 hours) and updates that value constantly. If I implemented that 'normally' with an array storing the success/failure of every second, that would mean a lot of wasted RAM/CPU load for a minor feature (especially if I wanted the statistics of the last 24 hours).
Does someone know a good way to do that, or can anyone point me in the right direction?
Why don't you just implement a low-pass filter? For every successful transfer, you push in a 1, for every failed one a 0; the result is a number between 0 and 1. Assuming that your transfers happen periodically, this works well -- and you just have to adjust the cutoff frequency of that filter to your desired "averaging duration".
However, I can't follow your RAM argument: assuming you store one byte representing success or failure per transfer, which you say happens every second, you end up with 86400 B per day -- about 84 KB/day, which is really negligible.
EDIT Cutoff frequency is something from signal theory and describes the highest or lowest frequency that passes a low or high pass filter.
Implementing a low-pass filter is trivial; something like (pseudocode):
new_val = 1      // init: assume no failed transfers yet
alpha = 0.001
while (true):
    old_val = new_val
    success = do_transfer_and_return_1_on_success_or_0_on_failure()
    new_val = alpha * success + (1 - alpha) * old_val
That's a single-tap IIR (infinite impulse response) filter; single tap because there's only one alpha and thus, only one number that is stored as state.
EDIT2: the value of alpha defines the behaviour of this filter; roughly speaking, it averages over on the order of 1/alpha samples, so alpha = 0.001 looks back over about the last 1000 transfers.
EDIT3: you can use a filter design tool to give you the right alpha; just set your low-pass filter's cutoff frequency to something like 0.5/integrationLengthInSamples, select an order of 0 for the IIR and use an elliptic design method (most tools default to Butterworth, but 0-order Butterworths don't do a thing).
I'd use scipy and convert the resulting (b,a) tuple (a will be 1, here) to the correct form for this feedback form.
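For a fixed averaging duration you don't strictly need a design tool; a minimal C version of the pseudocode above could look like this (the function names are placeholders):

/* Single-tap IIR filter over transfer outcomes.  As a rule of thumb
   the filter averages over on the order of 1/alpha samples, so with
   one transfer per second alpha = 1.0/300 gives roughly a
   five-minute view. */
static double success_rate = 1.0;         /* init: no failures yet */
static const double alpha = 1.0 / 300.0;

void record_transfer(int success) {       /* 1 = ok, 0 = failed */
    success_rate = alpha * success + (1.0 - alpha) * success_rate;
}

double success_percent(void) {
    return 100.0 * success_rate;
}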
UPDATE In light of the comment by the OP 'determine a trend of which devices are failing' I would recommend the geometric average that Marcus Müller put forward.
ACCURATE METHOD
The method below is aimed at obtaining 'well defined' statistics for performance over time that are also useful for 'after the fact' analysis.
Notice that the geometric average 'looks back' over a number of recent messages rather than a fixed time period.
Maintain a rolling array of 24*60/5 = 288 'prior success rates' (SR[i] with i = -1, -2, ..., -288), each representing a 5-minute interval in the preceding 24 hours.
That will consume about 2.3K if the elements are 64-bit doubles.
To 'effect' constant updating use an Estimated 'Current' Success Rate as follows:
ECSR = (t*(S/M) + (300-t)*SR[-1]) / 300
where S and M are the counts of successful and total messages in the current (partially complete) period, and SR[-1] is the previous (now complete) bucket.
t is the number of seconds elapsed in the current bucket.
NB: When you start up you need to use 300*S/M/t.
In essence the approximation assumes the error rate was steady over the preceding 5 - 10 minutes.
To 'effect' a 24-hour look back you can either 'shuffle' the data down (by copy or memcpy()) at the end of each 5-minute interval, or implement a 'circular array' by keeping track of the current bucket index.
NB: For many management/diagnostic purposes intervals of 15 minutes are often entirely adequate. You might want to make the 'grain' configurable.
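A C sketch of this bucketed scheme, assuming one message per second; the names and the circular-index variant are mine:

#define BUCKETS        288    /* 24 h of 5-minute intervals */
#define BUCKET_SECONDS 300

static double sr[BUCKETS];           /* prior success rates, circular */
static int    head = 0;              /* index of the bucket in progress */
static long   ok = 0, total = 0;     /* S and M for the current bucket */
static int    t = 0;                 /* seconds elapsed in the bucket */

void record_second(int success) {    /* call once per message */
    ok += success;
    total++;
    if (++t == BUCKET_SECONDS) {     /* close the bucket */
        sr[head] = (double)ok / (double)total;
        head = (head + 1) % BUCKETS; /* circular: no memcpy needed */
        ok = total = 0;
        t = 0;
    }
}

/* ECSR as defined above: blend the partial current bucket with the
   previous complete one.  NB: before the very first bucket completes,
   sr[prev] is still 0 and needs the startup handling mentioned above. */
double ecsr(void) {
    double partial = total ? (double)ok / (double)total : 1.0;
    int prev = (head + BUCKETS - 1) % BUCKETS;
    return (t * partial + (BUCKET_SECONDS - t) * sr[prev]) / BUCKET_SECONDS;
}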

Algorithm to find newest of n values

I am working on a problem where three memory pages are available and data is supposed to be written in one of the pages.
To keep history, the data is first written to the 1st page; when that is full, the next page is used. Eventually the last page is also full, so we erase the data in the first page and use the first page again. And so on...
How can I know which of the pages is the 'oldest'? How do I determine which to erase?
I think that a counter is needed, and this counter increments every time a new page is used. The counter values are read in the beginning to find which page is the newest; the next page is then the oldest (given the circular approach). However, eventually the counter will overflow and restart, and it will no longer be possible to be sure which value is the highest (since the new value is 0).
Example:
0 0 0 (from beginning)
1 0 0 (page0 was used)
1 2 0 (page1 was used)
1 2 3 (page2 was used)
4 2 3 (page0 was used)
4 5 3 (page1 was used)
...
255 0 254 (I don't know...)
Is the problem clear? Otherwise I can try to re-explain.
This is a technique used in EEPROM wear leveling. The concept is that since EEPROM usually has a limited life of write/erase cycles, we balance out the wear in the memory so that effectively the life increases. Since the data in EEPROM stays in the controller even on power off, we may have to store log values of some variables periodically on the EEPROM for further use.
One simple approach, as suggested in the comments, is to update the counter by computing (counter modulo 3).
Another (more general) approach is to have three registers for the counter. Whenever you have to write to a page, first scan these three registers and find the place where the sequence breaks, i.e. where (C[i] != C[i-1] + 1):
0 0 0
1 0 0 // 1 to 0
1 2 0 // 2 to 0
1 2 3 // 3 to 1
4 2 3 // 4 to 2
...
255 0 254 // 0 to 254.
This link has more information about this subject: Is there a general algorithm for microcontroller EEPROM wear leveling?
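A C sketch of that scan for the three-page case, assuming an 8-bit counter so the wraparound at 255 falls out of the arithmetic naturally:

#include <stdint.h>

#define PAGES 3

/* Pages are written in circular order, each stamped with
   (previous counter + 1).  At any time exactly one adjacent pair
   breaks the "+1" rule; the page just before the break is the
   newest, and the one after it is the oldest. */
int newest_page(const uint8_t c[PAGES]) {
    if (c[0] == c[1] && c[1] == c[2])
        return -1;                   /* fresh device, nothing written */
    for (int i = 0; i < PAGES; i++) {
        int next = (i + 1) % PAGES;
        if ((uint8_t)(c[i] + 1) != c[next])
            return i;                /* c[next] is stale: i is newest */
    }
    return -1;                       /* unreachable with valid stamps */
}

For the rows above this returns page0 for (4 2 3), page1 for (4 5 3), and page1 for (255 0 254): (uint8_t)(255 + 1) == 0, so the wrapped pair still looks consecutive and the overflow case resolves itself.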
Your idea of using a circular buffer is a good one. All you need in addition to that are two indices, one pointing at the oldest page and one at the newest. You need to update those indices whenever you add or replace a page.
The reason you need two is that in the beginning -- until the buffer is full -- only one of them will be advancing while the other remains stationary.
I do this kind of cycles like this:
// init
int page0 = address_of_page0; // oldest data
int page1 = address_of_page1; // old data
int page2 = address_of_page2; // actual data (page being written)
// after page2 is full, rotate:
int tmp = page0;
page0 = page1;
page1 = page2;
page2 = tmp;
This way you always know which page is which:
page0 always holds the oldest data
page1 always holds the old data
page2 always holds the actual data
It is easily extendable to any number of pages, as sketched below. Instead of an address you can store the page number ... use whatever is more suitable for your task.
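For instance, with N pages the same rotation becomes a one-element shift; a sketch, with the address type left as whatever suits your platform:

#define NPAGES 3   /* or any number of pages */

/* page[0] holds the oldest data, page[NPAGES-1] the page currently
   being written.  When the write page is full, recycle the oldest. */
void rotate_pages(int page[NPAGES]) {
    int oldest = page[0];
    for (int i = 0; i < NPAGES - 1; i++)
        page[i] = page[i + 1];
    page[NPAGES - 1] = oldest;   /* erase and reuse the oldest page */
}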

Worth a unique table for database values that repeat ~twice?

I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question in a more concrete way: will I gain a lot more efficiency by separating this field out into its own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"
If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it — and you could squeeze more if you're sure you won't need to go above 65535 distinct entries by using 2-byte integers instead of 4 byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice the half megabyte in the multi-gigabyte storage that's available, then it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means that you have to make sure that the pairs of names are handled correctly — or, more likely, you ignore the possibility and you end up without pairs where you should have pairs, or you end up with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4 byte integer, using two tables will save nothing; it will cost you space.
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.

Looking for precision on how ROW_OVERFLOW_DATA happens

I'm currently in the initial phases of planning a rewrite for a large module in our CRM application.
One area I am currently looking into is database optimization, I haven't made any decision yet but I just want to make sure I understand the concept of ROW_OVERFLOW_DATA properly - http://msdn.microsoft.com/en-us/library/ms186981.aspx
We are using SQL Server 2005; it's my understanding that the row size limit is 8,060 bytes, and that beyond that, overflow occurs.
I ran a query to get my max row size for a particular read intensive database
SELECT OBJECT_NAME (sc.[id]) tablename
, COUNT (1) nr_columns
, SUM (sc.length) maxrowlength
FROM syscolumns sc
join sysobjects so
on sc.[id] = so.[id]
WHERE so.xtype = 'U'
GROUP BY OBJECT_NAME (sc.[id])
ORDER BY SUM (sc.length) desc
This gave me a few tables with a maxrowlength slightly above 8,000 but under 10,000. Another query shows that the average row size is actually quite small, around 1,000 bytes.
My question is: is ROW_OVERFLOW_DATA based on each row or is it per column? Once the 8,060-byte limit is exceeded, is the column that caused the overflow moved to another page for the entire table, or only for the specific row?
So for example given the following simplified schema:
col1 (int) | col 2 (varchar (4000)) | col 3(varchar(5000))
1 | 4000 characters | 5000 characters ***This row is overflowing
2 | 4000 characters | 100 characters
3 | 150 characters | 150 characters
4 | 500 characters | 600 characters
Would col 3 get replaced by a 24-byte pointer in every row (1 to 4), or only in row 1?
I am wondering because if every row gets a pointer, it becomes important to fix; if it's only a few rows, maybe we can take the performance hit.
Also, I've seen many blogs suggesting moving nullable columns toward the end of the table so that if the values are in fact NULL they don't take any row space. Is this true? We tend to keep our timestamp and tracking columns at the end because it's easier to visualize. Now I am wondering whether we should move them further up, as they are never NULL.
If you have one row in, say, 100 million that overflows, would you move the whole column? No.
For reference, a TechNet article from Paul Randal, who is the God of this stuff (my bold):
The feature you are using, row-overflow, is great for allowing the occasional row to be longer than 8,060 bytes, but it is not well suited for the majority of rows being oversized and can lead to a drop in query performance, as you are experiencing.
The reason for this is that when a row is about to become oversized, one of the variable-length columns in the row is pushed "off-row." This means the column is taken from the row on the data or index page and moved to a text page. In place of the old column value, a pointer is substituted that points to the new location of the column value in the data file.
And MSDN (my bold)
ROW_OVERFLOW_DATA Allocation Unit
For every partition used by a table (heap or clustered table), index, or indexed view, there is one ROW_OVERFLOW_DATA allocation unit. This allocation unit contains zero (0) pages until a data row with variable length columns (varchar, nvarchar, varbinary, or sql_variant) in the IN_ROW_DATA allocation unit exceeds the 8 KB row size limit. When the size limitation is reached, SQL Server moves the column with the largest width from that row to a page in the ROW_OVERFLOW_DATA allocation unit. A 24-byte pointer to this off-row data is maintained on the original page.
As for your NULLable columns, this is false. NULLable columns are stored at the end of the disk structure anyway, regardless of column order in the table definition. A reference from Paul Randal again: Inside the Storage Engine: Anatomy of a record. And some previous answers from me here on SO.
Only if a particular row overflows will the offending data for that row be moved off into a separate overflow page - imagine the headache if the entire table needed rebuilding just because one value in one column overflowed!
I'd not heard of the idea of moving NULLables to the end of the table - I'll have to check into that!
