I have a large transaction table in SQL Server which stores about 400-500 new records each day. What data type should I use for my PK column? The PK column stores numeric values, for which integer seems suitable, but I'm afraid it will exceed the maximum value for integer since I add so many records every day.
I am currently using the integer data type for my PK column.

With an INT starting at 1, you get over 2 billion possible rows (2,147,483,647), which should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 9.2 quintillion (9,223,372,036,854,775,807). Enough for you?
If you use an INT IDENTITY starting at 1 and insert a row every second, you need about 68 years before you hit the 2 billion limit. With 400-500 rows per day, it will take well over ten thousand years before you run out of possible values. Even at 1,000 rows per day, you should be fine for 5,883 years. Good enough?
If you use a BIGINT IDENTITY starting at 1 and insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the 9.2 quintillion limit.
Read more about it (with all the options there are) in the MSDN Books Online.

I may be wrong here as maths has never been my strong point, but a bigint has a maximum value of 2^63-1 (9,223,372,036,854,775,807).
If you divide that by, say, 500 to get roughly the number of days' worth of records, you get about 18,446,744,073,709,551 days' worth of 500 new records.
Divide again by 365 and you get about 50,539,024,859,478 years' worth of 500 records a day,
so (((2^63-1) / 500) / 365).
If that's not me being stupid, then that's a lot of days :-)
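The estimates in both answers can be re-checked with a few lines of arithmetic. This is only a sketch in Python (not T-SQL); the function name `years_until_exhausted` is made up for illustration, and a year is assumed to be 365 days, as in the answers above.

```python
# Largest values of SQL Server's signed 32-bit INT and 64-bit BIGINT types.
INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1
SECONDS_PER_DAY = 24 * 60 * 60


def years_until_exhausted(max_value, rows_per_day):
    """Years before an identity starting at 1 reaches max_value (365-day years)."""
    return max_value / rows_per_day / 365


# INT at 500 rows per day, and at one row per second.
int_at_500_per_day = years_until_exhausted(INT_MAX, 500)
int_at_one_per_second = years_until_exhausted(INT_MAX, SECONDS_PER_DAY)

# BIGINT at 1,000 rows per second.
bigint_at_1000_per_second = years_until_exhausted(BIGINT_MAX, 1000 * SECONDS_PER_DAY)

print(f"INT, 500 rows/day:      {int_at_500_per_day:,.0f} years")
print(f"INT, 1 row/second:      {int_at_one_per_second:,.1f} years")
print(f"BIGINT, 1000 rows/sec:  {bigint_at_1000_per_second:,.0f} years")
```

At the 400-500 rows per day from the question, a plain INT lasts for thousands of years, so BIGINT only matters if the insert rate grows by several orders of magnitude.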


Tricky: SQL Server-side aggregation of time-series data for charting

I have a large time-series data set in a table that contains 5 years of data. The data is very structured; it is clustered/ordered on the time column and there is exactly one record for exactly every 10 minutes over this entire 5 year period.
In my user-side application I have a time-series chart that is 400 pixels wide, and users can set the time scale from 1 hour up to 5 years. Therefore any query to the database by this chart that returns more than 400 records provides data that cannot be physically displayed.
What I want to know is: can anyone suggest an approach such that, when the database is queried for a certain time range, SQL Server dynamically performs a suitable averaging aggregation that returns no more than 400 records?
Example 1): If the time range was 5 years, SQL Server would calculate ~1 value for every 4.5 days (5 yrs * 365 days / 400 records required), so it would average all the 10-minute samples in each 4.5-day bin and return a record for each bin. About 400 in total.
Example 2): If the time range was one month, SQL Server would calculate ~1 record for every 1.86 hours (31 days * 24 hours / 400 records), so it would average all the 10-minute samples in each 1.86-hour bin and return a record for each bin. About 400 in total.
Ideally I'd like a solution that, from the application's perspective, can be queried just like a static table.
I'd really appreciate any suggested approaches or code snippets.
Some examples, if you have a datetime column (which is not quite clear from your question, as there is no table schema):
Grouping into interval of 5 minutes within a time range
SELECT / GROUP BY - segments of time (10 seconds, 30 seconds, etc)
They should be quite easy to port to SQL Server: use DATEDIFF to convert your datetime values into a Unix timestamp, and use ROUND() with a non-zero function parameter (which truncates instead of rounding) for the integer division.
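The binning idea behind those examples (seconds since the start of the range, integer-divided by a bin width) can be sketched outside the database. The following is an illustration in Python rather than T-SQL; `downsample`, `max_points`, and the sample data are made-up names and values, and the bin width is rounded up to whole seconds so that a range never yields more than `max_points` bins.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def downsample(samples, start, end, max_points=400):
    """Average (timestamp, value) pairs into at most max_points equal-width bins."""
    span = (end - start).total_seconds()
    # Bin width in whole seconds, rounded up so the bin count stays <= max_points.
    width = max(1, math.ceil(span / max_points))
    sums = defaultdict(lambda: [0.0, 0])  # bucket index -> [running total, count]
    for ts, value in samples:
        if not (start <= ts < end):
            continue
        # Same idea as DATEDIFF(second, start, ts) / width with integer division.
        bucket = int((ts - start).total_seconds()) // width
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return [
        (start + timedelta(seconds=bucket * width), total / count)
        for bucket, (total, count) in sorted(sums.items())
    ]


# One month of 10-minute samples, as in Example 2 of the question.
start = datetime(2013, 1, 1)
end = start + timedelta(days=31)
samples = [(start + timedelta(minutes=10 * i), float(i)) for i in range(31 * 144)]

points = downsample(samples, start, end)
print(len(points), points[0])
```

In SQL the same bucket expression would go into both the SELECT list and the GROUP BY, with AVG() over the value column; wrapping that in a table-valued function taking the start and end of the range is one way to make it look like a table to the application.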

Can someone explain to me the meaning of field SELECTIVITY in relation to Cardinality?

I read this, but it still doesn't really sink in.
So let's say we have 993 records and a cardinality of 13, which means there are 13 unique/possible values among the 993 records. Its selectivity is 13 / 993 = 0.0131, or 1.3%, right?
Now, what does 1.3% mean? All I know is that lower is worse and higher selectivity is better, meaning more unique values and a happier SQL engine optimizer. BUT, how can I explain 1.3%?
1.3% of???
When I select a row, variability is only 1.3% of the 13 possible records?
Sorry, it has been like 20+ years since I had my stat classes.
The 1.3% is of all the rows in the table, but you are confusing yourself by treating it as a percentage.
When you query a table, you want to get to the relevant rows as quickly as possible. The database has to choose which index to search first, and you want this index to return as small a set of rows as possible, with the relevant rows inside.
Imagine that you are looking for John Smith the guitar repairer in the Yellow Pages. There are 10,000 names and you have 2 choices:
Browse through the Last Name index, where all last names are grouped by their first character. This gives you a cardinality of 26, selectivity = 0.26%.
Browse through the Guitar Repair category. There are 500 business categories in your city so cardinality = 500, selectivity = 5%.
If you choose the first index, you then have to search through S-group, which contains on average 10,000 / 26 = 384.6 names.
If you choose the second index, you will have to search through the Guitar Repairers, which contains on average 10,000 / 500 = 20 names.
Clearly, the Business Category is a better index than the Last Name because you can narrow down your search range a lot faster. That's all it means by selectivity: you can get to the rows you want as quickly as possible.
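The Yellow Pages numbers can be reproduced with two one-line formulas. This is a small Python sketch; the function names are made up for illustration, and it assumes values are spread evenly across the rows, as the example above does.

```python
def selectivity(cardinality, row_count):
    """Share of distinct values among all rows: cardinality / rows."""
    return cardinality / row_count


def avg_rows_per_value(cardinality, row_count):
    """Rows you still have to scan after picking one value, assuming an even spread."""
    return row_count / cardinality


ROWS = 10_000
for index_name, cardinality in [("last-name initial", 26), ("business category", 500)]:
    print(
        f"{index_name}: selectivity {selectivity(cardinality, ROWS):.2%}, "
        f"about {avg_rows_per_value(cardinality, ROWS):.1f} rows per value"
    )

# The table from the question: 13 distinct values in 993 rows.
print(f"{selectivity(13, 993):.2%}, about {avg_rows_per_value(13, 993):.1f} rows per value")
```

Read this way, the 1.3% from the question means that picking one value still leaves about 76 of the 993 rows to look through on average.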

Cassandra data model for time series

I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie).
I have two applications: intraday stock data and sensor data.
The stock data will be saved with a time resolution of one minute.
Seven datafields build one timeframe:
Symbol, Datetime, Open, High, Low, Close, Volume
I will query the data mostly by Symbol and Date. e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31 ordered by Datetime.
The recommendation for Cassandra queries is to query whole columns. So you could create five rows with the keys Open, High, Low, Close, Volume, and a separate column for each Symbol and minute, e.g. "AAPL:2013-01-04T130400Z".
This would result in a table of five rows and n*nT columns, where n = number of symbols and nT = number of minutes.
Most of the time I will query date ranges. I.e. all minutes of a day. So I could rearrange the data to have columns named "AAPL:2013-01-04" and rows: OpenT130400Z, HighT130400Z, LowT130400Z, CloseT130400Z, VolumeT130400Z.
This would result in a table with n*nD columns (n: number of Symbols, nD: number of Days) and 5*nM rows (nM: number of minutes/entries per day).
To sum up: I have columns, which hold the information for a whole day for one symbol.
I have found a description of how to deal with time series data in Cassandra here.
But I don't really get whether they use the hour (1332960000) as a column name or as a row key!?
As I understood it, they use the hour as the row key and have the small timesteps as columns, so they would have a fixed number of columns. But that would have disadvantages when reading, because I would have to do a range query on keys! Am I right?
Second question:
If I have sensor data, which is much more fine grained than 1 minute stock data (let's say I have to save timesteps with a resolution of microseconds) how would I deal with this?
If I use columns for saving a composite of sensor channel and hour, and rows for microseconds since the start of the hour, this would result in 3,600,000,000 rows and n*nH columns (n: number of sensors, nH: number of hours).
I could not use the microseconds since the start of the hour for columns, because I have 3.6 billion points, which is higher than the allowed number of 2 billion columns.
Did I get it?
What do you think about this problem? How to solve it?
Thank you!
So I have a suggestion for your first question about the stock data. A naive implementation might look like this:
Column Format:
Name: The current datetime granular to a minute
Value: a composite column of Open,High,Low,Close,Volume
So you would have something like
AAPL = [2013-05-02-15:38:00 | 441.78:448.59:440.63:445.52:15066146] ... [2013-05-02-15:39:00 | 441.78:448.59:440.63:445.52:15066146] ... [2013-05-02-15:40:00 | 441.78:448.59:440.63:445.52:15066146]
That would give you roughly half a million columns in one year, so it might be OK for maybe 4 years. I wouldn't attempt to hit the 2 billion limit. What you could do is define a splitting factor on the row key. It all depends on your usage pattern, but a simple one might be the year. With a composite row key of symbol and year, the column family entry might look like the following, and that would guarantee that you always have fewer than a million columns per row.
AAPL:2013 = [05-02-15:38:00 | 441.78:448.59:440.63:445.52:15066146] ... [05-02-15:39:00 | 441.78:448.59:440.63:445.52:15066146] ... [05-02-15:40:00 | 441.78:448.59:440.63:445.52:15066146]
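To make the key layout concrete, here is a small Python sketch that builds the row key, column name, and composite value in the format shown above and checks the columns-per-row bound. It does not talk to Cassandra; the function names are hypothetical, and the bound assumes one bar for every minute of the year, which overstates real trading hours.

```python
from datetime import datetime


def row_key(symbol, ts):
    """Row key split by year, e.g. 'AAPL:2013'."""
    return f"{symbol}:{ts.year}"


def column_name(ts):
    """Minute-granular column name within the year, e.g. '05-02-15:38:00'."""
    return ts.strftime("%m-%d-%H:%M:%S")


def column_value(open_, high, low, close, volume):
    """Composite value in the order Open:High:Low:Close:Volume."""
    return f"{open_}:{high}:{low}:{close}:{volume}"


bar_time = datetime(2013, 5, 2, 15, 38)
print(row_key("AAPL", bar_time), "=",
      [column_name(bar_time), column_value(441.78, 448.59, 440.63, 445.52, 15066146)])

# Upper bound on columns per row: every minute of a leap year.
max_columns_per_row = 366 * 24 * 60
print(max_columns_per_row)
```

Because column names sort chronologically within a row, a query such as "all AAPL data for January 2013" becomes a single column slice on the row AAPL:2013.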

Adding datalength condition makes query slow

I have a table mytable with some columns including the column datekey (which is a date and has an index), a column contents which is a varbinary(max), and a column stringhash which is a varchar(100). The stringhash and the datekey together form the primary key of the table. Everything is running on my local machine.
SELECT TOP 1 * FROM mytable where datekey='2012-12-05'
returns 0 rows and takes 0 seconds.
But if I add a datalength condition:
SELECT TOP 1 * FROM mytable where datekey='2012-12-05' and datalength(contents)=0
it runs for a very long time and does not return anything before I give up waiting.
My question:
Why? How do I find out why this takes such a long time?
Here is what I checked so far:
When I click "Display estimated execution plan" it also takes a very long time and does not return anything before I give up waiting.
If I do
SELECT TOP 1000 datalength(contents) FROM mytable order by datalength(contents) desc
it takes 7 seconds and returns a list 4228081, 4218689 etc.
exec sp_spaceused 'mytable'
rows    reserved     data         index_size  unused
564019  50755752 KB  50705672 KB  42928 KB    7152 KB
So the table is quite large at 50 GB.
SELECT TOP 1000 * FROM mytable
takes 26 seconds.
The sqlservr.exe process is around 6 GB which is the limit I have set for the database.
It takes a long time because your query needs DATALENGTH to be evaluated for every row and then the results sorted before it can return the 1st record.
If the DATALENGTH of the field (or whether it contains any value) is something you're likely to query repeatedly, I would suggest an additional indexed field (perhaps a persisted computed field) holding the result, and searching on that.
This old MSDN blog post seems to agree with @MartW's answer that DATALENGTH is evaluated for every row. But it's worth understanding what "evaluated" really means here and what the real root of the performance degradation is.
As mentioned in the question, the values in the contents column can be large. Any value bigger than ~8 KB is stored in separate LOB storage. So, taking into account the size of the other columns, it's clear that most of the space occupied by the table is taken by this LOB storage, i.e. around 50 GB.
Even if the length of the contents column has already been computed for every row, as the post linked above shows, it is still stored in the LOB storage. So the engine still needs to read some parts of the LOB storage to execute the query.
If the LOB storage isn't in RAM when the query executes, it has to be read from disk, which is of course much slower than reading from RAM. The LOB reads are also likely to be random rather than sequential, which is slower still because it tends to increase the total number of blocks that have to be read from disk.
At the moment it probably won't be using the primary key, because the stringhash column comes before the datekey column in it. Try adding an additional index that contains just the datekey column. Once that index is created, if it's still slow you could also try a table hint such as:
SELECT TOP 1 * FROM mytable WITH (INDEX(IX_datekey)) WHERE datekey='2012-12-05' AND datalength(contents)=0
You could also create a separate length column that's updated either in your application or in an insert/update trigger.

100k Rows Returned in a random order, without a SQL time out please

I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows. But when we are getting >10-20k rows we get SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing it increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4-digit alphanumeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table and issued to each customer as a bar-code, and the bar-code scanning app at the door will have the same list of 5000. The reason for using a 4-digit alphanumeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want a large number of characters. Customers love the last bit btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,615 (there are 36^4 = 1,679,616 unique four-digit alphanumeric codes, ignoring case, so 2.6 million rows must have some duplicates) and convert them to your four-digit codes.
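A minimal sketch of that idea in Python, assuming the codes use the digits 0-9 and the letters A-Z with case ignored (the question does not state the exact alphabet); `to_code` and `random_codes` are illustrative names.

```python
import random
import string

ALPHABET = string.digits + string.ascii_uppercase  # assumed alphabet: 36 symbols
CODE_LENGTH = 4
CODE_SPACE = len(ALPHABET) ** CODE_LENGTH  # 36**4 = 1,679,616 possible codes


def to_code(number):
    """Convert an integer in [0, CODE_SPACE) to a fixed-width base-36 code."""
    chars = []
    for _ in range(CODE_LENGTH):
        number, digit = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def random_codes(count):
    """Draw `count` distinct numbers without replacement and encode them."""
    return [to_code(n) for n in random.sample(range(CODE_SPACE), count)]


codes = random_codes(100_000)
print(len(codes), codes[:5])
```

Sampling without replacement guarantees uniqueness within one draw; codes already issued for another event would still need to be checked against the database.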
You don't have to sort.
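DECLARE @RandomNumber int
DECLARE @Threshold float
SELECT @RandomNumber = COUNT(*) FROM customers
-- Use 50000.0 so the division is not truncated to 0 by integer arithmetic
SELECT @Threshold = 50000.0 / @RandomNumber
-- RAND() without a seed is evaluated once per query, so seed it per row via NEWID();
-- this keeps each row with probability @Threshold, i.e. roughly 50000 rows, without sorting
SELECT TOP 50000 * FROM customers WHERE RAND(CHECKSUM(NEWID())) < @Threshold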
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
One thought is to break the process down into steps. Add a GUID column to the table, then run an update statement that fills it with GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column and receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,2,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,3,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,4,1))as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code converts each alphanumeric venue code to an integer, splits the table into 500,000 buckets, and takes the top 50,000 rows whose bucket falls between 0 and 50,000. You can play with the number after the % (500,000 here) and with the BETWEEN range. This should randomize it for you. Not sure if the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY there is no guarantee of the order (if you have multiple CPUs and threading).
