I have a complex problem with SQL Server.
I administer 40 databases with identical structure but different data. Their sizes vary from 2 MB to 10 GB of data. The main table in these databases is:
CREATE TABLE [dbo].[Eventos](
[ID_Evento] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[FechaGPS] [datetime] NOT NULL,
[FechaRecepcion] [datetime] NOT NULL,
[CodigoUnico] [varchar](30) COLLATE Modern_Spanish_CI_AS NULL,
[ID_Movil] [int] NULL,
[CodigoEvento] [char](5) COLLATE Modern_Spanish_CI_AS NULL,
[EventoData] [varchar](150) COLLATE Modern_Spanish_CI_AS NULL,
[EventoAlarma] [bit] NOT NULL CONSTRAINT [DF_Table_1_Alarma] DEFAULT ((0)),
[Ack] [bit] NOT NULL CONSTRAINT [DF_Eventos_Ack] DEFAULT ((0)),
[Procesado] [bit] NOT NULL CONSTRAINT [DF_Eventos_Procesado] DEFAULT ((0)),
[Latitud] [float] NULL,
[Longitud] [float] NULL,
[Velocidad] [float] NULL,
[Rumbo] [smallint] NULL,
[Satelites] [tinyint] NULL,
[EventoCerca] [bit] NOT NULL CONSTRAINT [DF_Eventos_FueraCerca] DEFAULT ((0)),
[ID_CercaElectronica] [int] NULL,
[Direccion] [varchar](250) COLLATE Modern_Spanish_CI_AS NULL,
[Localidad] [varchar](150) COLLATE Modern_Spanish_CI_AS NULL,
[Provincia] [varchar](100) COLLATE Modern_Spanish_CI_AS NULL,
[Pais] [varchar](50) COLLATE Modern_Spanish_CI_AS NULL,
[EstadoEntradas] [char](16) COLLATE Modern_Spanish_CI_AS NULL,
[DentroFuera] [char](1) COLLATE Modern_Spanish_CI_AS NULL,
[Enviado] [bit] NOT NULL CONSTRAINT [DF_Eventos_Enviado] DEFAULT ((0)),
[SeñalGSM] [int] NOT NULL DEFAULT ((0)),
[GeoCode] [bit] NOT NULL CONSTRAINT [DF_Eventos_GeoCode] DEFAULT ((0)),
[Contacto] [bit] NOT NULL CONSTRAINT [DF_Eventos_Contacto] DEFAULT ((0)),
CONSTRAINT [PK_Eventos] PRIMARY KEY CLUSTERED
(
[ID_Evento] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [ABS]
GO
ALTER TABLE [dbo].[Eventos] WITH CHECK ADD CONSTRAINT [FK_Eventos_Eventos] FOREIGN KEY([ID_Evento])
REFERENCES [dbo].[Eventos] ([ID_Evento])
I also have a job that runs every n seconds to process these records (only the new ones, marking them as processed). This process uses this query:
SELECT
Tbl.ID_Cliente, Ev.ID_Evento, Tbl.ID_Movil, Ev.EventoData, Tbl.Evento,
Tbl.ID_CercaElectronica, Ev.Latitud, Ev.Longitud, Tbl.EsAlarma, Ev.FechaGPS,
Tbl.AlarmaVelocidad, Ev.Velocidad, Ev.CodigoEvento
FROM
dbo.Eventos AS Ev
INNER JOIN
(SELECT
Det.CodigoEvento, Mov.CodigoUnico, Mov.ID_Cliente, Mov.ID_Movil, Det.Evento,
Mov.ID_CercaElectronica, Det.EsAlarma, Mov.AlarmaVelocidad
FROM
dbo.Moviles Mov
INNER JOIN
dbo.GruposEventos AS GE
INNER JOIN
dbo.GruposEventosDet AS Det ON Det.ID_GrupoEventos = GE.ID_GrupoEventos
ON GE.ID_GrupoEventos = Mov.ID_GrupoEventos) as Tbl ON EV.CodigoUnico = Tbl.CodigoUnico AND Ev.CodigoEvento = Tbl.CodigoEvento
WHERE
(Ev.Procesado = 0)
On some databases the table can have more than 1,000,000 records, so to optimize the process I created this index, tailored to this query, using the SQL optimization assistant:
CREATE NONCLUSTERED INDEX [OptimizadorProcesarEventos] ON [dbo].[Eventos]
(
[Procesado] ASC,
[CodigoEvento] ASC,
[CodigoUnico] ASC,
[FechaGPS] ASC
)
INCLUDE ( [ID_Evento],
[EventoData],
[Latitud],
[Longitud],
[Velocidad]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
This used to work perfectly. But now, occasionally and only on some databases, the query takes forever and times out. So I looked at the execution plan and realized that in some scenarios, depending on the data in the table, SQL Server decides not to use my index and uses the PK index instead. I verified this by checking the execution plan on another database that works fine, and there the index is being used.
So my question: why does SQL Server sometimes decide not to use my index?
Thank you for your interest!
UPDATE
I already tried UPDATE STATISTICS and it didn't help. I prefer to avoid index hints for now, so the question remains: why does SQL Server choose a less efficient way to execute my query when it has an index for it?
UPDATE II
After many tests, I finally resolved the problem, even though I don't quite understand why it worked. I changed the index to this:
CREATE NONCLUSTERED INDEX [OptimizadorProcesarEventos] ON [dbo].[Eventos]
(
[CodigoUnico] ASC,
[CodigoEvento] ASC,
[Procesado] ASC,
[FechaGPS] ASC
)
INCLUDE ( [ID_Evento],
[EventoData],
[Latitud],
[Longitud],
[Velocidad]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
Basically, I changed the order of the fields in the index and the query immediately started using the index as expected. It's still a mystery to me how SQL Server decides whether or not to use an index for a specific query. Thanks to everyone.
You must have found a lot of articles on how the query optimizer chooses the right index; if not, a quick search will turn up plenty.
I can point out one to start with:
Index Selection and the Query Optimizer
The simple answer is as follows:
"Based on the statistics, the number of rows inserted/updated/deleted, and so on, the query optimizer has worked out that using the PK index is less costly than using the other nonclustered index."
Now you will have lots of questions about how the query optimizer works that out, and answering them will take some homework.
In your specific situation, though, I don't agree with Femi's suggestion to try running UPDATE STATISTICS, because there are other situations in which updating statistics will not help either.
It sounds like you have tested this index with this query. If you are sure you want this index to be used by that query 100% of the time, use a query hint and specify that this index must be used; that way you can always be sure it will be used.
CAUTION: make sure you have done more than enough testing against various data loads to confirm there is no case in which using this index is unacceptable. Once you add the query hint, every execution will use that index, and the optimizer will always produce an execution plan built around it.
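For illustration, forcing the index from the question would look roughly like this (a simplified sketch of the original query, trimmed to show only where the table hint goes):
SELECT Ev.ID_Evento, Ev.EventoData, Ev.Latitud, Ev.Longitud, Ev.Velocidad
FROM dbo.Eventos AS Ev WITH (INDEX (OptimizadorProcesarEventos))
WHERE Ev.Procesado = 0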
It's difficult to tell in this specific case, but very often the query planner will look at the statistics it has for the specific table and decide to use the wrong index (for some definition of wrong; probably just not the index you think it should use). Try running UPDATE STATISTICS on the table and see if the query planner arrives at a different set of decisions.
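For example, a minimal sketch against the table from the question (WITH FULLSCAN is optional; it gives the most accurate statistics at the cost of reading the whole table):
UPDATE STATISTICS dbo.Eventos WITH FULLSCAN;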
Determining why the optimizer does or doesn't choose a given index can be somewhat of a dark art. I do notice, however, that there's likely a better index that you could be using. Specifically:
CREATE NONCLUSTERED INDEX [OptimizadorProcesarEventos] ON [dbo].[Eventos]
(
[Procesado] ASC,
[CodigoEvento] ASC,
[CodigoUnico] ASC,
[FechaGPS] ASC
)
INCLUDE ( [ID_Evento],
[EventoData],
[Latitud],
[Longitud],
[Velocidad])
WHERE Procesado = 0 -- this makes it a filtered index
WITH (SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
IGNORE_DUP_KEY = OFF,
ONLINE = OFF)
ON [PRIMARY]
This is based on my assumption that, at any given time, most of the rows in your table are already processed (i.e. Procesado = 1), so the index above would be much smaller than the non-filtered version.
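One usage note, based on an assumption about optimizer behavior that is worth verifying in your own plans: the query in the question compares against the literal 0, which matches the index filter exactly. If Procesado were ever compared to a variable instead, the filtered index might not be considered without a recompile, roughly like this:
DECLARE @procesado bit = 0;  -- hypothetical parameterized version
SELECT Ev.ID_Evento
FROM dbo.Eventos AS Ev
WHERE Ev.Procesado = @procesado
OPTION (RECOMPILE);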
Related
Hi guys,
I inherited a database with the following table with only 200 rows:
CREATE TABLE [MyTable](
[Id] [uniqueidentifier] NOT NULL,
[Name] [varchar](255) NULL,
[Value] [varchar](8000) NULL,
[EffectiveStartDate] [datetime] NULL,
[EffectiveEndDate] [datetime] NULL,
[Description] [varchar](2000) NOT NULL DEFAULT (''),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
As you can see there is a Clustered PK on a UniqueIdentifier column. I was doing some performance checks and the most expensive query so far (CPU and IO) is the following:
SELECT @Result = Value
FROM MyTable
WHERE @EffectiveDate BETWEEN EffectiveStartDate AND EffectiveEndDate
AND Name = @VariableName
The query above is encapsulated in a UDF. Usually the UDF is not called in a select list or WHERE clause; instead, its return value is assigned to a variable.
The execution plan shows a Clustered Index Scan
Our system is based on a large number of aggregations and math processing in real time. Every time our web application refreshes the main page, it calls a bunch of stored procedures and UDFs, and the query above runs around 500 times per refresh per user.
My question is: should I change the PK to nonclustered and create a clustered index on Name, EffectiveStartDate, EffectiveEndDate in such a small table?
No, you should not. You can just add another index, which will be a covering index:
CREATE INDEX [IDX_Covering] ON dbo.MyTable(Name, EffectiveStartDate, EffectiveEndDate)
INCLUDE(Value)
If @VariableName and @EffectiveDate are variables with the correct types, you should now see an index seek.
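For example, a sketch with assumed values (the declared types should match the column definitions from the table above):
DECLARE @VariableName varchar(255) = 'SomeSettingName';  -- hypothetical value
DECLARE @EffectiveDate datetime = GETDATE();
DECLARE @Result varchar(8000);

SELECT @Result = Value
FROM MyTable
WHERE @EffectiveDate BETWEEN EffectiveStartDate AND EffectiveEndDate
  AND Name = @VariableName;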
I am not sure this will help, but you need to try: an index scan of 200 rows is nothing on its own, but calling it 500 times may be a problem. By the way, if those 200 rows fit in one page, I suspect this will not help; the problem may be somewhere else, like opening a connection 500 times or something like that...
I was trying to see the kind of performance gains columnstore indexes can provide on a table. The table has about 3.7 million rows and 11 columns and is stored as a heap (i.e. without a primary key). I create a columnstore index on the table and run the following query:
SELECT
[Area], [Family],
AVG([Global Sales Value]) AS [Average GlobalSalesValue],
COUNT([Projected Sales])
FROM
dbo.copy_Global_Previous5FullYearSales
WHERE
[Year] > 2012
GROUP BY
[Area], [Family]
The create table statement is as follows:
CREATE TABLE [dbo].[copy_Global_Previous5FullYearSales]
(
[SBU] [NVARCHAR](10) NULL,
[Year] [INT] NULL,
[Global Sales Value] [MONEY] NULL,
[Area] [NVARCHAR](50) NULL,
[Sub Area] [NVARCHAR](50) NULL,
[Projected Sales] [MONEY] NULL,
[Family] [NVARCHAR](50) NULL,
[Sub Family 1] [NVARCHAR](50) NULL,
[Sub Family 2] [NVARCHAR](50) NULL,
[Manufacturer] [NVARCHAR](40) NULL,
[rowguid] [UNIQUEIDENTIFIER] NOT NULL,
[ID] [INT] IDENTITY(1,1) NOT NULL,
PRIMARY KEY CLUSTERED ([ID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The performance gain I get from the columnstore index in this case is negligible. The query with the columnstore index runs nearly as slowly as the original query without an index, and in some cases even slower, even though batch mode processing is used.
Surprisingly, when I create an ever-increasing primary key (ID) on the existing table and rebuild the columnstore index, I get a 15x improvement in CPU time and a 3x improvement in elapsed time.
I don't understand how the addition of a primary key could affect query performance for columnstore indexes, which store the data in a compressed format anyway. Also, a primary key only changes the ordering of the pages, which in this case should make no difference.
Below is the execution plan
The presence of a key changes how the columnstore is built. Because the builder gets its input in order, the resulting segments are better candidates for segment elimination. Read more at Ensuring Your Data is Sorted or Nearly Sorted by Date to Benefit from Date Range Elimination:
The most common type of filter in data warehouse queries is by date. Columnstore segment elimination helps you skip entire one-million-row segments if the system can determine that no rows qualify, simply by looking at the minimum and maximum values for a column in a segment. So you usually will want to make sure that your segments are sorted, or nearly sorted, by date, so date filters can be executed as fast as possible.
Your order is by ID, but I'm pretty sure that helps through functional-dependency side effects (the ever-increasing ID tends to correlate with the date).
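To illustrate the advice above, one way to get segments that are sorted (or nearly sorted) by the date column is to cluster the base table on [Year] before building the columnstore index. A rough sketch with assumed index names; note that the existing columnstore would have to be dropped first, and the clustered PK on ID shown in the DDL would need to be dropped or made nonclustered before another clustered index can be created:
DROP INDEX NCCI_copy_Global ON dbo.copy_Global_Previous5FullYearSales;  -- assumed name of the existing columnstore

CREATE CLUSTERED INDEX CIX_copy_Global_Year
    ON dbo.copy_Global_Previous5FullYearSales ([Year]);

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_copy_Global
    ON dbo.copy_Global_Previous5FullYearSales
    ([Year], [Area], [Family], [Global Sales Value], [Projected Sales]);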
Vague title I know.
At the moment I have 16,000 rows in my database. These were created during development; I now want to delete all of these rows so I can start again (so I don't have duplicate data).
The database is on SQL Azure.
If I run a select query
SELECT [Guid]
,[IssueNumber]
,[Severity]
,[PainIndex]
,[Status]
,[Month]
,[Year]
,[DateCreated]
,[Region]
,[IncidentStart]
,[IncidentEnd]
,[SRCount]
,[AggravatingFactors]
,[AggravatingFactorDescription]
FROM [dbo].[WeeklyGSFEntity]
GO
This returns all the rows, and SSMS says this takes 49 seconds.
If I attempt to drop the table, this goes on for 5 minutes plus.
DROP TABLE [dbo].[WeeklyGSFEntity]
GO
/****** Object: Table [dbo].[WeeklyGSFEntity] Script Date: 10/01/2013 09:46:18 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[WeeklyGSFEntity](
[Guid] [uniqueidentifier] NOT NULL,
[IssueNumber] [int] NULL,
[Severity] [int] NULL,
[PainIndex] [nchar](1) NULL,
[Status] [nvarchar](255) NULL,
[Month] [int] NULL,
[Year] [int] NULL,
[DateCreated] [datetime] NULL,
[Region] [nvarchar](255) NULL,
[IncidentStart] [datetime] NULL,
[IncidentEnd] [datetime] NULL,
[SRCount] [int] NULL,
[AggravatingFactors] [nvarchar](255) NULL,
[AggravatingFactorDescription] [nvarchar](max) NULL,
PRIMARY KEY CLUSTERED
(
[Guid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
GO
If I attempt to delete the rows, this also takes 5 minutes plus.
DELETE
FROM [dbo].[WeeklyGSFEntity]
GO
Am I doing something wrong or is it just that this is big data and I'm being impatient?
UPDATE:
Dropping the entire database took some 25 seconds.
Importing 22,000 rows (roughly the same 16,000 plus more) into localdb\v11.0 took 6 seconds. I know this is local but surely the local dev server is slower than Azure? Surely...
UPDATE the second:
Recreating the database and recreating the schema (with (Fluent) NHibernate), and then inserting some 20,000 rows took 2 minutes 6 seconds. All Unit Tests pass.
Is there anything I can look at to work out what was going on?
Dropping and recreating the database sped things up considerably.
The reason for this is unknown.
Possibly there is an open transaction causing a lock on the table. This could be caused by cancelling an operation halfway through, like we all do during dev.
Run sp_who2 and see which ID is in the BlkBy column. If there is one, that's it.
To kill that process, run KILL <id>.
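A minimal sketch of that check (the SPID below is hypothetical; only kill a session after confirming what it is):
EXEC sp_who2;   -- look at the BlkBy column for the blocking session id
-- KILL 53;     -- replace 53 with the blocking SPID, if killing it is safe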
I'm currently working on performance/memory-consumption optimizations for our application. One of the tasks is to replace all blobs in a table that correspond to empty arrays with NULL values; this should reduce DB size and memory consumption and speed up loading. Here is the table definition:
CREATE TABLE [dbo].[SampleTable](
[id] [bigint] NOT NULL,
[creationTime] [datetime] NULL,
[binaryData] [image] NULL,
[isEvent] [bit] NULL,
[lastSavedTime] [datetime] NULL,
CONSTRAINT [PK_SampleTable] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
I updated the table and replaced the image field values (binaryData) with NULL where appropriate (data corresponding to empty arrays in the application). Now I observe a performance deterioration when running a trivial SELECT * FROM SampleTable.
Originally, the fields that were updated had a length of 512 bytes; not sure if that matters, though.
Any ideas why selecting blobs containing NULL values takes longer than selecting real binary data even if the data is the same for different rows?
I don't know the answer to this question. I tried the following test though and got a result that I found surprising.
CREATE TABLE [dbo].[SampleTable](
[id] [BIGINT] NOT NULL,
[creationTime] [DATETIME] NULL,
[binaryData] [IMAGE] NULL,
[isEvent] [BIT] NULL,
[lastSavedTime] [DATETIME] NULL,
CONSTRAINT [PK_SampleTable] PRIMARY KEY CLUSTERED
(
[id] ASC
)
)
INSERT INTO [dbo].[SampleTable]
SELECT 1, GETDATE(),
0x1111,
1, GETDATE()
INSERT INTO [dbo].[SampleTable]
SELECT 2, GETDATE(),
0x2222,
2, GETDATE()
INSERT INTO [dbo].[SampleTable]
SELECT 3, GETDATE(),
NULL,
3, GETDATE()
UPDATE [dbo].[SampleTable] SET [binaryData] = NULL
WHERE [id]=2
Looking at this in SQL Internals Viewer I was surprised to see a difference between the row I inserted as NULL and the one I updated to NULL.
It looks as though even when the value is updated to NULL it doesn't just set the NULL bitmap for some reason and still needs to follow a pointer to another LOB_DATA page.
Inserted as NULL: http://img809.imageshack.us/img809/9301/row3.png
Updated to NULL: http://img84.imageshack.us/img84/420/row2.png
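If the goal is to reclaim that leftover LOB space (an assumption about intent, not something stated in the question), reorganizing the clustered index with LOB compaction is worth testing; the LOB_DATA allocation unit can be checked before and after:
SELECT alloc_unit_type_desc, page_count, record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SampleTable'), NULL, NULL, 'DETAILED');

ALTER INDEX PK_SampleTable ON dbo.SampleTable
    REORGANIZE WITH (LOB_COMPACTION = ON);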
Let me help you restate this:
On one side you have SQL Server doing a table scan, testing every single record for a NULL value; on the other, you have SQL Server doing a massive dump of ALL the records...
If your blobs are relatively small, then it's pretty obvious which one would be faster...
I have a 'SessionVisit' table which collects data about user visits.
The script for this table is below. There may be 25,000 rows added a day.
The table CREATE statement is below. My database knowledge is definitely not up to scratch as far as understanding the implications of such a schema.
Can anyone give me their 2c of advice on some of these issues:
Do I need to worry about row size for this schema in SQL Server 2008? I'm not even sure how the 8 KB row size works in 2008, and I don't know whether I'm wasting a lot of space if I'm not using all 8 KB.
How should I purge old records I don't want? Will new rows fill in the empty space left by deleted rows?
Any advice on indexes?
I know this is quite general in nature. Any 'obvious' or non obvious info would be appreciated.
Here's the table :
USE [MyDatabase]
GO
/****** Object: Table [dbo].[SessionVisit] Script Date: 06/06/2009 16:55:05 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[SessionVisit](
[SessionGUID] [uniqueidentifier] NOT NULL,
[SessionVisitId] [int] IDENTITY(1,1) NOT NULL,
[timestamp] [timestamp] NOT NULL,
[SessionDate] [datetime] NOT NULL CONSTRAINT [DF_SessionVisit_SessionDate] DEFAULT (getdate()),
[UserGUID] [uniqueidentifier] NOT NULL,
[CumulativeVisitCount] [int] NOT NULL CONSTRAINT [DF_SessionVisit_CumulativeVisitCount] DEFAULT ((0)),
[SiteUserId] [int] NULL,
[FullEntryURL] [varchar](255) NULL,
[SiteCanonicalURL] [varchar](100) NULL,
[StoreCanonicalURL] [varchar](100) NULL,
[CampaignId] [int] NULL,
[CampaignKey] [varchar](50) NULL,
[AdKeyword] [varchar](50) NULL,
[PartnerABVersion] [varchar](10) NULL,
[ABVersion] [varchar](10) NULL,
[UserAgent] [varchar](255) NULL,
[Referer] [varchar](255) NULL,
[KnownRefererId] [int] NULL,
[HostAddress] [varchar](20) NULL,
[HostName] [varchar](100) NULL,
[Language] [varchar](50) NULL,
[SessionLog] [xml] NULL,
[OrderDate] [datetime] NULL,
[OrderId] [varchar](50) NULL,
[utmcc] [varchar](1024) NULL,
[TestSession] [bit] NOT NULL CONSTRAINT [DF_SessionVisit_TestSession] DEFAULT ((0)),
[Bot] [bit] NULL,
CONSTRAINT [PK_SessionVisit] PRIMARY KEY CLUSTERED
(
[SessionGUID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [dbo].[SessionVisit] WITH CHECK ADD CONSTRAINT [FK_SessionVisit_KnownReferer] FOREIGN KEY([KnownRefererId])
REFERENCES [dbo].[KnownReferer] ([KnownRefererId])
GO
ALTER TABLE [dbo].[SessionVisit] CHECK CONSTRAINT [FK_SessionVisit_KnownReferer]
GO
ALTER TABLE [dbo].[SessionVisit] WITH CHECK ADD CONSTRAINT [FK_SessionVisit_SiteUser] FOREIGN KEY([SiteUserId])
REFERENCES [dbo].[SiteUser] ([SiteUserId])
GO
ALTER TABLE [dbo].[SessionVisit] CHECK CONSTRAINT [FK_SessionVisit_SiteUser]
I see SessionGUID and SessionVisitId, why have both a uniqueidentifier and an Identity(1,1) on the same table? Seems redundant to me.
I see referer and knownrefererid, think about getting the referer from the knownrefererid if possible. This will help reduce excess writes.
I see CampaignKey and CampaignId; again, get the key from the campaigns table if possible.
I see orderid and orderdate. I'm sure you can get the order date from the orders table, correct?
I see hostaddress and hostname, do you really need the name? Usually the hostname doesn't serve much purpose and can be easily misleading.
I see multiple dates and timestamps, is any of this duplicate?
How about that SessionLog column? I see that it's XML. Is it a lot of data, is it data you may already have in other columns? If so get rid of the XML or the duplicated columns. Using SQL 2008 you can parse data out of that XML column when reporting and possibly eliminate a few extra columns (thus writes). Are you going to be in trouble in the future when developers add more to that XML? XML to me just screams 'a lot of excessive writing'.
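If you do keep the XML and want to report on it without adding extra columns, the 2008 XML methods can pull values out directly. A hypothetical sketch (the path /log/page/@url is made up, since the actual SessionLog structure isn't shown):
SELECT SessionVisitId,
       SessionLog.value('(/log/page/@url)[1]', 'varchar(255)') AS FirstPageUrl
FROM dbo.SessionVisit
WHERE SessionLog IS NOT NULL;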
Mitch says to remove the primary key. Personally I would leave the index on the table. Since it is clustered that will help speed up write times as the DB will always write new rows at the end of the table on the disk.
Strip out some of this duplicate information and you'll probably do just fine writing a row each visit.
Well, I'd recommend NOT inserting a few k of data with EVERY page!
The first thing I'd do would be to see how much of this information I could get from a third-party analytics tool, perhaps combined with log analysis. That should allow you to drop a lot of the fields.
25k inserts a day isn't much, but the catch here is that the busier your site gets, the more load this is going to put on the DB. Perhaps you could build a queueing system that batches the writes, but really, most of this information is already in the logs.
Agree with Chris that you would probably be better off using log analysis (check out Microsoft's free Log Parser).
Failing that, I would remove the Foreign Key constraints from your SessionVisit table.
You mentioned row size; the varchars in your table do not pre-allocate their maximum length (roughly 4 + 4 bytes for an empty field, approx.). Having said that, a general rule is to keep rows as 'lean' as possible.
Also, I would remove the primary key from the SessionGUID (GUID) column. It won't help you much.
There are also an awful lot of NULLs in that table. I think you should group together the columns that must be non-null at the same time. In fact, you should do a better analysis of the data you're writing, rather than lumping it all together in a single table.