Partitioning in SQL Server Standard Edition with billions of rows - sql-server

Hi, I would like to ask how to partition the table below. The problem I'm having is not with retrieving history records; that was resolved by the clustered index. But as you can see, the index is on HistoryParameterID and then SourceTimeStamp, which is needed because rows are retrieved by those columns.
The problem is that once the table reaches ~1 billion records, inserts slow down. The scenario is that about 15k rows/second (this can be 30k - 100k) are inserted, and each row corresponds to a HistoryParameterID.
Basically, HistoryParameterID is not unique; it has a one-to-many relationship with the other columns of the table below.
My hunch is that the index slows down the inserts because new rows are not always appended at the end of the table, since the data is ordered by HistoryParameterID.
I did some testing with the timestamp as the leading index key, but to no avail, since query performance became unacceptable.
Is there any way to partition this by HistoryParameterID? I tried creating 15k tables for a partitioned view, but the CREATE VIEW statement never finished executing. Any tips, or any other way to partition? Please note that I'm using Standard Edition and using Enterprise Edition is not an option.
CREATE TABLE [dbo].[HistorySampleValues]
(
[HistoryParameterID] [int] NOT NULL,
[SourceTimeStamp] [datetime2](7) NOT NULL,
[ArchiveTimestamp] [datetime2](7) NOT NULL CONSTRAINT [DF__HistorySa__Archi__2A164134] DEFAULT (getutcdate()),
[ValueStatus] [int] NOT NULL,
[ArchiveStatus] [int] NOT NULL,
[IntegerValue] [bigint] SPARSE NULL,
[DoubleValue] [float] SPARSE NULL,
[StringValue] [varchar](100) SPARSE NULL,
[EnumNamedSetName] [varchar](100) SPARSE NULL,
[EnumNumericValue] [int] SPARSE NULL,
[EnumTextualValue] [varchar](256) SPARSE NULL
) ON [PRIMARY]
CREATE CLUSTERED INDEX [Source_HistParameterID_Index] ON [dbo].[HistorySampleValues]
(
[HistoryParameterID] ASC,
[SourceTimeStamp] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

If you go down the partitioned view path (http://technet.microsoft.com/en-us/library/ms190019.aspx), I suggest fewer tables (under one hundred). Without partitioned tables, the optimizer must go through a lot of work since each table of the view could be indexed differently.
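For illustration, here is a minimal sketch of a local partitioned view over range-based member tables. The member table names, the 1,000,000-ID range per table, and the reduced column list are assumptions for the example, not part of the original schema:
-- Member tables: each holds a contiguous range of HistoryParameterID, enforced by a
-- CHECK constraint so the optimizer can eliminate the other tables from the plan.
CREATE TABLE dbo.HistorySampleValues_Part01
(
HistoryParameterID int NOT NULL CHECK (HistoryParameterID BETWEEN 1 AND 1000000),
SourceTimeStamp datetime2(7) NOT NULL,
ValueStatus int NOT NULL
);
CREATE CLUSTERED INDEX CIX_HistorySampleValues_Part01 ON dbo.HistorySampleValues_Part01 (HistoryParameterID, SourceTimeStamp);

CREATE TABLE dbo.HistorySampleValues_Part02
(
HistoryParameterID int NOT NULL CHECK (HistoryParameterID BETWEEN 1000001 AND 2000000),
SourceTimeStamp datetime2(7) NOT NULL,
ValueStatus int NOT NULL
);
CREATE CLUSTERED INDEX CIX_HistorySampleValues_Part02 ON dbo.HistorySampleValues_Part02 (HistoryParameterID, SourceTimeStamp);
GO
-- The view unions the member tables; queries that filter on HistoryParameterID
-- only touch the member table whose CHECK constraint covers the requested range.
-- Note: inserting through the view requires a primary key containing the
-- partitioning column; otherwise insert directly into the member tables.
CREATE VIEW dbo.HistorySampleValuesView
AS
SELECT HistoryParameterID, SourceTimeStamp, ValueStatus FROM dbo.HistorySampleValues_Part01
UNION ALL
SELECT HistoryParameterID, SourceTimeStamp, ValueStatus FROM dbo.HistorySampleValues_Part02;
GO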
I would not expect inserts to slow down with table size if HistoryParameterID is incremental. However, in the case of a random value, inserts will become progressively slower as the table size grows due to lower buffer cache efficiency. That problem will exist with a single table, partitioned table, or partitioned view. See http://www.dbdelta.com/improving-uniqueidentifier-performance/ for an example using a guid but the issue applies to any random key value.
You might try a single table with SourceTimeStamp alone as the clustered index key and a non-clustered index on HistoryParameterID and SourceTimeStamp. That would give the best insert performance, and the non-clustered index (perhaps with included columns) might be good enough for your SELECT queries.
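As a rough sketch of that layout (assuming the existing clustered index is dropped first; the index names and the choice of included columns are illustrative):
-- Cluster on the ever-increasing timestamp so inserts append at the end of the table.
CREATE CLUSTERED INDEX CIX_HistorySampleValues_SourceTimeStamp
ON dbo.HistorySampleValues (SourceTimeStamp);

-- Non-clustered index to serve lookups by parameter and time range; INCLUDE the
-- value columns the SELECT queries actually read to avoid key lookups.
CREATE NONCLUSTERED INDEX IX_HistorySampleValues_HistParameterID_SourceTimeStamp
ON dbo.HistorySampleValues (HistoryParameterID, SourceTimeStamp)
INCLUDE (ValueStatus, IntegerValue, DoubleValue, StringValue);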

Everything you need is here; I hope you can figure it out:
http://msdn.microsoft.com/en-us/library/ms188730.aspx
For Standard Edition, alternative solutions exist, like this answer.
And this is an interesting article too.
We also implemented this in our enterprise automation application, with custom indexing around a table of users, and it worked well.
Here are the pros and cons of a custom implementation:
Pros:
Higher performance than a partitioned table, because the application is aware of its own access logic.
Cons:
You have to implement the routing method and maintain the indexes yourself.
The data is de-centralized across tables.

Related

Fix SQL Server identity and restore the proper numeration order

My SQL Server 2014 instance restarted unexpectedly, and that broke the straight auto-increment identity sequences on my entities. All new entities inserted into the tables have their identities incremented by 10,000.
Let's say, if there were entities with IDs "1, 2, 3", all newly inserted entities now get IDs like "10004, 10005".
Here is real data:
..., 12379, 12380, 12381, (after the restart) 22350, 22351, 22352, 22353, 22354, 22355
(An extra question here: why was the very first entity after the restart inserted with 22350? I thought it should have been 22382, since the latest ID at that moment was 12381, and 12381 + 10001 = 22382.)
I searched and found out the reasons for what happened. Now I want to prevent such situations in the future and fix the current jump. It's a production server and users continuously add new stuff to the DB.
QUESTION 1
What options do I have here?
My thoughts on how to prevent it are:
Use sequences instead of identity columns
Use trace flag 272 (which disables the identity cache), and reseed the identity so it continues from the latest correct value (I guess there is such an option)
What are the drawbacks of the two above? Please advise some new ways if there are.
QUESTION 2
I'm not an expert in SQL Server, and now I need to normalize and adjust the numbering of the entities, since it's a business requirement. I think I need to write a script that updates the wrong ID values and sets them to the correct ones. Is it dangerous to update identity values? Some tables have dependent records. What might this script look like?
OTHER INFO
Here is how my identity columns declared (got this using "Generate scripts" option in SSMS):
CREATE TABLE [dbo].[Tasks]
(
[Id] [uniqueidentifier] NOT NULL,
[Created] [datetime] NOT NULL,
...
[TaskNo] [bigint] IDENTITY(1,1) NOT NULL
CONSTRAINT [PK_dbo.Tasks]
PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
I also use Entity Framework 6 for database manipulating.
I will be happy to provide any other information by request if needed.
I did this once in a weekend of downtime and ended up having to renumber the whole table, by turning off the identity insert and then updating each row with a row number. The numbering was based on the table's correct sort order, to make sure the sequence was correct.
As it updated the whole table (500 million rows), it generated a huge amount of transaction log data. Make sure you have enough space for this, and pre-size the log if required.
As said above, though, if you must rely on the identity column, then change it to a sequence. Also, make sure your rollback mechanism is solid if there is an error during an insert and the sequence has already been incremented.
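As a rough illustration of those two suggestions, here is a hedged sketch; the table and column names come from the question, but the sequence name, the default constraint, and the hypothetical NewTasks table are assumptions to adapt:
-- Reseed the identity so the next TaskNo continues right after the current maximum.
DECLARE @maxTaskNo bigint = (SELECT MAX(TaskNo) FROM dbo.Tasks);
DBCC CHECKIDENT ('dbo.Tasks', RESEED, @maxTaskNo);

-- Alternative: use a sequence. With NO CACHE the server does not pre-allocate a block
-- of values, so an unexpected restart cannot produce a large jump (at the cost of
-- slightly slower number generation). In practice, START WITH the next free value.
CREATE SEQUENCE dbo.TaskNoSequence AS bigint START WITH 1 INCREMENT BY 1 NO CACHE;

-- A table whose column is not an identity can then default to the sequence
-- (an existing IDENTITY column would have to be rebuilt first):
-- ALTER TABLE dbo.NewTasks ADD CONSTRAINT DF_NewTasks_TaskNo
-- DEFAULT (NEXT VALUE FOR dbo.TaskNoSequence) FOR TaskNo;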

Speeding up SQL query with index on identity column or the searched column?

When I script out my table I get this. The table has an ID column, which is an identity (auto-increment) column.
Then there is the human-readable customer number, which is theoretically unique, but the table does not enforce that so far.
My customers search by the customer number, not the ID.
My question now: should I add an index (if yes, clustered or non-clustered?) on the CUSTOMERNUMBER column to increase the speed of the search?
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE Customer
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[CUSTOMERNUMBER] [nvarchar](50) NULL,
-- other columns
CONSTRAINT [PK_Customer]
PRIMARY KEY CLUSTERED ([ID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
If the ID field is referenced in other tables as a foreign key then leave that as the clustered index for sure, and create a non-clustered index on CUSTOMERNUMBER.
Consider not just creating an index on CUSTOMERNUMBER but going further and creating a Unique Constraint on it (which comes with an index). This will prevent the business rule requiring unique CUSTOMERNUMBER from being violated, and also gives additional information to the database that it can use to make operations more efficient.
As always, test first:
Alter Table Customer Add Constraint uCUSTOMERNUMBER Unique (CUSTOMERNUMBER);
(A downside of a unique constraint is that the unique index it creates can't include additional columns. If having includes was a requirement then a unique, non-clustered index is an option.)
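If included columns are needed, a unique non-clustered index could look like the sketch below; the index name is arbitrary and the included column is a hypothetical placeholder for whatever your lookups actually select:
-- Unique non-clustered index on the business key; INCLUDE lets lookups by customer
-- number avoid key lookups. Note that CUSTOMERNUMBER is nullable, and a unique
-- index in SQL Server allows only a single NULL.
CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_CUSTOMERNUMBER
ON dbo.Customer (CUSTOMERNUMBER)
INCLUDE (CustomerName); -- hypothetical column; list the columns your queries read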
If the majority of searches on your table are done by customer number, it is probably a good idea. You can also write several test queries before you create the index, and run them again after you create it, to see whether the index actually improves performance.
When deciding whether to make the index clustered or non-clustered, determine whether you already have a clustered index (one is created automatically on the primary key) and whether this new index would be better utilized as the primary key. If so, you may have to add a constraint so that the customer number is unique, to guarantee correct search results.
If you are interested in learning more about indexes, feel free to checkout this article I wrote:
https://dataschool.com/learn/how-indexing-works
Creating a unique, non-clustered index on the customer number is a good idea. But be careful not to create too many indexes, especially if the table is huge and has a lot of DML happening against it.

table needs indexes to improve performance

I was getting timeouts when specifying a long DateTime period in the query below (the query runs from a C# application). The table had 30 million rows and a non-clustered index on ID (not a primary key).
I found that there was no primary key, so I recently made ID the primary key, and it's not giving me timeouts now. Can anyone help me create an index on more than one column for the future? Should I also remove the non-clustered index from this table and create one on more than one column? The data is increasing rapidly and the query needs a performance improvement.
select
ID, ReferenceNo, MinNo, DateTime, DataNo from tbl1
where
DateTime BETWEEN '04/09/2013' AND '20/11/2013'
and ReferenceNo = 4 and MinNo = 3 and DataNo = 14 Order by ID
this is the create script
CREATE TABLE [dbo].[tbl1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[ReferenceNo] [int] NOT NULL,
[MinNo] [int] NOT NULL,
[DateTime] [datetime] NOT NULL,
[DataNo] [int] NOT NULL,
CONSTRAINT [tbl1_pk]
PRIMARY KEY CLUSTERED ([ID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
It's hard to tell which index you should use without knowing more about your database and how it's used.
You may want to make the ID column the clustered index key. If ID is an identity column you will get very few page splits while inserting new data. It will however require you to rebuild the table, and that may be a problem depending on how the database is used; you will be looking at some downtime.
If you want a covering index it should look something like this:
CREATE NONCLUSTERED INDEX [MyCoveringIndex] ON tbl1
(
[ReferenceNo] ASC,
[MinNo] ASC,
[DataNo] ASC,
[DateTime] ASC
)
There is no need to include ID as a column, since it is already in the clustered index (clustered index columns are included in all other indexes). This will however use up a lot of space (somewhere in the range of 1 GB if the columns above are of types int and datetime). It will also affect your insert, update and delete performance on the table, in most cases negatively.
You can create the index in ONLINE mode if you are using Enterprise Edition of SQL Server. In all other cases there will be a lock on the table while the index is created.
It's also hard to know what other queries are run against the table. You may want to tweak the order of the columns in the index to better match those queries.
Indexing all fields would be fastest, but would likely waste a ton of space. I would guess that a date index would provide the most benefit with the least storage cost, because the data is probably spread evenly over a large period of time. If the MIN() and MAX() dates are close together, this will not be as effective:
CREATE NONCLUSTERED INDEX [IDX_1] ON [dbo].[tbl1] (
[DateTime] ASC
)
GO
As a side note, you can use SSMSE's "Display Estimated Execution Plan" option, which will show you what the database needs to do to get your data. It will point out missing indexes and also provide CREATE INDEX statements. These suggestions can be quite wasteful, but they give you an idea of what is taking so long. The option is in the standard toolbar, four icons to the right of "Execute".
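Beyond the graphical plan, the missing-index DMVs expose the same suggestions in T-SQL; a rough sketch (treat the output as hints, not as indexes to create blindly):
-- Missing-index suggestions recorded by the optimizer since the last restart,
-- roughly ordered by the benefit SQL Server estimates they would bring.
SELECT
d.statement AS table_name,
d.equality_columns,
d.inequality_columns,
d.included_columns,
s.user_seeks,
s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact * s.user_seeks DESC;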

The best way to store the following temporal data in DB

Imagine we have a set of entities each of which has its state: free, busy or broken. The state is specified for a day, for example, today on 2011-05-17 an entity E1 is free and tomorrow on 2011-05-18 it is busy.
There is a need to store ~10^5 entities for 1000 days. Which is the best way to do so?
I am thinking about 2 options:
represent each day as a character "0", "1" or "2" and store for every entity a string of 1000 characters
store each day with entity's state in a row, i.e. 1000 rows for an entity
The most important query for such data is: given start date and end date identify which entities are free.
Performance is of higher priority than storage.
All suggestions and comments are welcome.
The best way is to try the simpler and more flexible option first (that is, store each day in its own row) and only devise a sophisticated alternative method if the performance is unsatisfactory. Avoid premature optimization.
10^8 rows isn't such a big deal for your average database on a commodity server nowadays. Put an index on the date, and I would bet that range queries ("given start date and end date...") will work just fine.
The reasons I claim that this is both simpler and more flexible than the idea of storing a string of 1000 characters are:
You'll have to process this in code, and that code would not be as straightforward to understand as code that queries DB records that contain date and status.
Depending on the database engine, 1000 character strings may be blobs that are stored outside of the record. That makes them less efficient.
What happens if you suddenly need 2,000 days instead of 1,000? Start updating all the rows and the code that processes them? That's much more work than just changing your query.
What happens when you're next asked to store some additional information per daily record, or need to change the granularity (move from days to hours for example)?
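A minimal sketch of the row-per-day approach with the range query; the table and column names (EntityDayState and so on) are made up for the example:
-- One row per entity per day; State could be 0 = free, 1 = busy, 2 = broken.
CREATE TABLE dbo.EntityDayState
(
EntityID int NOT NULL,
StateDate date NOT NULL,
State tinyint NOT NULL,
CONSTRAINT PK_EntityDayState PRIMARY KEY CLUSTERED (StateDate, EntityID)
);

-- Entities that are free on every day of the requested range
-- (assumes a row exists for each entity and day).
DECLARE @StartDate date = '2011-05-17', @EndDate date = '2011-05-31';
SELECT EntityID
FROM dbo.EntityDayState
WHERE StateDate BETWEEN @StartDate AND @EndDate
GROUP BY EntityID
HAVING SUM(CASE WHEN State <> 0 THEN 1 ELSE 0 END) = 0;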
Create a single table to hold your data. Create the table with an ID, Date, Entity name and eight boolean fields. SQL Server 2008 gave me the code below for the table:
CREATE TABLE [dbo].[EntityAvailability](
[EA_Id] [int] IDENTITY(1,1) NOT NULL,
[EA_Date] [date] NOT NULL,
[EA_Entity] [nchar](10) NOT NULL,
[EA_IsAvailable] [bit] NOT NULL,
[EA_IsUnAvailable] [bit] NOT NULL,
[EA_IsBroken] [bit] NOT NULL,
[EA_IsLost] [bit] NOT NULL,
[EA_IsSpare1] [bit] NOT NULL,
[EA_IsSpare2] [bit] NOT NULL,
[EA_IsSpare3] [bit] NOT NULL,
[EA_IsActive] [bit] NOT NULL,
CONSTRAINT [IX_EntityAvailability_Id] UNIQUE NONCLUSTERED
(
[EA_Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[EntityAvailability]') AND name = N'IXC_EntityAvailability_Date')
CREATE CLUSTERED INDEX [IXC_EntityAvailability_Date] ON [dbo].[EntityAvailability]
(
[EA_Date] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
The clustered index on date will perform best for your range searches. Never allow searches without a date range, and there will be no need for any index other than the clustered index. The boolean fields allow eight flags using only a single byte. The row size for this table is 35 bytes, so about 230 rows fit on an 8 KB page. You stated you need to store 10^5 entities for 1000 days, which is 100 million rows. One hundred million rows will occupy about 434,782 8 KB pages, or around 3 GB.
Install the table on an SSD and you are set to go.
Depending on whether entities are more often free or not, store only the dates when an entity is free, or only the dates when it is not.
Assuming you store the periods when an entity is not free, the search is WHERE start_date <= @date AND end_date >= @date; any row matching that means the entity is not free for that period.
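Sketched out, assuming a table EntityBusyPeriod(EntityID, StartDate, EndDate) that stores only the not-free intervals and an Entity table listing all entities (both names are invented for the example), "entities free for the whole requested range" becomes an anti-join on overlapping intervals:
-- An entity is free for [@From, @To] if none of its busy intervals overlaps that range.
DECLARE @From date = '2011-05-17', @To date = '2011-05-20';
SELECT e.EntityID
FROM dbo.Entity AS e
WHERE NOT EXISTS
(
SELECT 1
FROM dbo.EntityBusyPeriod AS b
WHERE b.EntityID = e.EntityID
AND b.StartDate <= @To -- the busy interval starts before the range ends
AND b.EndDate >= @From -- and ends after the range starts, i.e. it overlaps
);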
It sounds like you might be on the right track and I would suggest because of the sheer number of records and the emphasis on performance that you keep the schema as denormalized as possible. The fewer joins you need to make to determine the free or busy entities the better.
I would broadly go for a Kimball Star Schema (http://en.wikipedia.org/wiki/Star_schema) type structure with three tables (initially)
FactEntity (FK kStatus, kDate)
DimStatus (PK kStatus)
DimDate (PK kDate)
This can be loaded quite simply (Dims first followed by Fact(s)), and queried also very simply. Performance can be optimised by suitable indexing.
A big advantage of this design is that it is very extensible; if you want to increase the date range, or increase the number of valid states it is trivial to extend.
Other dimensions could sensibly be added, e.g. a DimEntity, which could hold richer categorical information that might be interesting for slicing and dicing your entities.
The DimDate is normally enriched by adding DayNo, MonthNo, YearNo, DayOfWeek, WeekendFlag, WeekdayFlag, PublicHolidayFlag. These allow some very interesting analyses to be performed.
As @Elad asks, what would happen if you added time-based information? That too can be incorporated, via a DimTime dimension having one record per hour or minute.
Apologies for my naming, as I don't have a good understanding of your data. Given more time I could come up with some better ones!
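A bare-bones sketch of that star schema in T-SQL; the data types, the entity key on the fact table, and any columns beyond those named above are assumptions:
CREATE TABLE dbo.DimStatus
(
kStatus int NOT NULL PRIMARY KEY,
StatusName varchar(20) NOT NULL -- 'free', 'busy', 'broken'
);

CREATE TABLE dbo.DimDate
(
kDate date NOT NULL PRIMARY KEY,
DayNo tinyint NOT NULL,
MonthNo tinyint NOT NULL,
YearNo smallint NOT NULL,
WeekendFlag bit NOT NULL
);

CREATE TABLE dbo.FactEntity
(
kEntity int NOT NULL, -- would reference a DimEntity if that dimension is added
kDate date NOT NULL FOREIGN KEY REFERENCES dbo.DimDate (kDate),
kStatus int NOT NULL FOREIGN KEY REFERENCES dbo.DimStatus (kStatus)
);

-- Free entities in a date range:
-- SELECT f.kEntity
-- FROM dbo.FactEntity AS f
-- JOIN dbo.DimStatus AS s ON s.kStatus = f.kStatus
-- WHERE s.StatusName = 'free' AND f.kDate BETWEEN @StartDate AND @EndDate;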
To get free entities on a date, you may try:
select
e.EntityName
, s.StateName
, x.ValidFrom
from EntityState as x
join Entity as e on e.EntityId = x.EntityId
join State as s on s.StateID = x.StateID
where StateName = 'free'
and x.ValidFrom = ( select max(z.ValidFrom)
from EntityState as z
where z.EntityID = x.EntityID
and z.ValidFrom <= your_date_here )
;
Note: Make sure you store only state changes in EntityState table.

SQL Server speed

I had previously posted a question about my query speed with an XML column. After some further investigation I have found that the problem is not with the XML as previously thought. The table schema and query are very simple. There are over 800K rows; everything was running smoothly, but now, with the increase in records, it is taking almost a minute to run.
The Table:
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[PID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The query:
SELECT *
FROM Audit
WHERE PID = 158
AND Page = 2
AND ObjectID = 93
The query only return 26 records and the interesting thing is that if I add "TOP 26" the query executes in less than a second, if I change it to "TOP 27" then it take a minute. Even if I change the query to SELECT ID, it does not matter.
Any help is appreciated
You have an index on ID, but your query is using other columns instead. Therefore, you're probably getting a full table scan. Changing to SELECT ID makes no difference because it's not anywhere in the WHERE clause.
It's quick when you ask for TOP 26 because it can quit once it finds 26 rows because you don't have any ORDER BY clause. Changing it to TOP 27 means that, once it finds the first 26 (which are all the matches, according to your post), it can't quit looking; it has to continue to search until it either finds a 27th matching row or reaches the end of the data.
A SHOW PLAN would have shown you the problem pretty quickly.
Add an index for the PID, Page and ObjectID fields.
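For example (the index name is arbitrary, and the column order assumes all three columns are always filtered on, as in the query above):
-- Composite index matching the WHERE clause, so the query becomes an index seek
-- instead of a full clustered index scan; with only 26 matching rows, the key
-- lookups back to the clustered index for SELECT * are cheap.
CREATE NONCLUSTERED INDEX IX_Audit_PID_Page_ObjectID
ON dbo.Audit ([PID], [Page], [ObjectID]);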
Why not add a covering index for the Page and Object ID columns and call it a day?
I think you should add non-unique indexes on the columns you search on. Indexing will certainly reduce the search time. Whether you request a single column or multiple columns in the SELECT list makes no difference; what indexing reduces is the time spent comparing rows one by one.
The 26 rows are probably near the start of the table: when you scan, you find them quickly and can abort the rest of the scan. When looking for a 27th row that doesn't exist, you scan the entire table, which is slow!
When looking into this type of problem, try this from Management Studio:
run: SET SHOWPLAN_ALL ON
then run your query and look for the word "SCAN"; that is most likely where your query is running slowly. Figure out why no index is being used.
In this case you need to add an index. I generally add an index based on how I query the data: if you always filter on one of the three columns PID, Page and ObjectID, add an index with that column first, and add another column to that index if you sometimes filter on that value too, etc.
