I have a scenario where files will be uploaded into a database table (dbo.FileImport) with each line of the file in a new row. Each row will contain the line data and the name of the file it came from. The file names are unique, but a file may contain a few million lines. Multiple files' data may exist in the table at one time.
Each file is processed and the results are stored in a separate table. After processing the data related to the file, the data is deleted from the import table to keep the table from growing indefinitely.
The table structure is as follows:
CREATE TABLE [dbo].[FileImport] (
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL
);
During the processing the data for the relevant file is loaded with the following query:
SELECT [LineData] FROM [dbo].[FileImport] WHERE [FileName] = @FileName
And then deleted with the following statement:
DELETE FROM [dbo].[FileImport] WHERE [FileName] = @FileName
My question pertains to the table design with regard to performance and longevity...
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
Should I add a PRIMARY KEY Constraint to the [Id] column?
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Any advice or thoughts would be much appreciated. Opinionated designs are welcome ;-)
Update 2017-12-10
I failed to mention that the lines of a file may not be unique. So please take this into account if this affects the recommendation.
An example script in the answer would be an added bonus! ;-)
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
It is not necessary to keep a column you never use. This is not a relational table and will not be referenced by a foreign key, so one could argue that a primary key is unnecessary.
I would not be concerned about running out of 64-bit integer values. bigint can hold a positive value of up to 9,223,372,036,854,775,807. Even at 1 billion rows per second, it would take almost three centuries to exhaust that range.
Should I add a PRIMARY KEY Constraint to the [Id] column?
I would create a composite clustered primary key on (FileName, Id). That provides an incremental value to facilitate retrieving rows in insertion order, and the leftmost FileName key column would benefit your queries greatly.
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
See above.
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
No. Assuming you query by FileName, only the rows requested will be touched with the suggested primary key.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Incremental keys largely avoid fragmentation from inserts, and because FileName leads the clustered key, deleting a file's rows removes a contiguous range of pages.
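If deleting a few-million-row file in a single statement ever becomes a problem for transaction log growth or lock escalation, a batched delete is a common mitigation. A minimal sketch, assuming the clustered key suggested below and a hypothetical @FileName value:
DECLARE @FileName VARCHAR(100) = 'example.txt'; -- hypothetical file name
DECLARE @BatchSize INT = 50000;

WHILE 1 = 1
BEGIN
    -- Each iteration deletes one small chunk in its own transaction.
    DELETE TOP (@BatchSize)
    FROM dbo.FileImport
    WHERE [FileName] = @FileName;

    IF @@ROWCOUNT < @BatchSize BREAK; -- last, partial batch was just deleted
END;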
EDIT:
Here's the suggested DDL for the table:
CREATE TABLE dbo.FileImport (
      FileName VARCHAR(100) NOT NULL
    , RecordNumber BIGINT NOT NULL IDENTITY
    , LineData NVARCHAR(300) NOT NULL
    , CONSTRAINT PK_FileImport PRIMARY KEY CLUSTERED (FileName, RecordNumber)
);
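With that key in place, the processing queries from the question become clustered index seeks. A small usage sketch (the file name is hypothetical):
DECLARE @FileName VARCHAR(100) = 'example.txt';

-- The trailing RecordNumber key returns lines in insertion order.
SELECT LineData
FROM dbo.FileImport
WHERE [FileName] = @FileName
ORDER BY RecordNumber;

-- The delete touches only the contiguous range for this file.
DELETE FROM dbo.FileImport
WHERE [FileName] = @FileName;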
Here is a rough sketch of how I would do it:
CREATE TABLE [FileImport].[FileName] (
[FileId] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL
);
go
alter table [FileImport].[FileName]
add constraint pk_FileName primary key nonclustered (FileId)
go
create clustered index cix_FileName on [FileImport].[FileName]([FileName])
go
CREATE TABLE [FileImport].[LineData] (
[FileId] BIGINT NOT NULL,
[LineDataId] BIGINT IDENTITY (1, 1) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL,
constraint fk_LineData_FileName foreign key (FileId) references [FileImport].[FileName]([FileId])
);
alter table [FileImport].[LineData]
add constraint pk_LineData primary key clustered (FileId, LineDataId)
go
This adds some normalization so you don't have to store the full file name in every row. You probably don't have to do it (if you prefer, keep FileName in the second table instead of FileId and cluster your index on (FileName, LineDataId)), but since we are using a relational database ...
No need for any additional indexes - tables are sorted by the right keys
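As a rough usage sketch against these two tables (the file name is hypothetical and the staged line data is a placeholder):
-- Register the file and capture its surrogate key.
declare @FileId bigint;

insert into [FileImport].[FileName] ([FileName])
values ('example.txt');

set @FileId = scope_identity();

-- Load the lines against the narrow bigint key.
insert into [FileImport].[LineData] (FileId, LineData)
select @FileId, ...; -- staged line data goes here

-- After processing, delete child rows first, then the parent.
delete from [FileImport].[LineData] where FileId = @FileId;
delete from [FileImport].[FileName] where FileId = @FileId;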
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
If your data means anything to you, don't use it. As a matter of fact, if you have to use NOLOCK, something is really wrong with your DB architecture. Indexed this way, SQL Server will use a seek operation, which is very fast.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
You can set up a maintenance job that rebuilds your indexes, and run it nightly with SQL Server Agent (or whatever scheduler you prefer).
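A minimal sketch of such a job step (REORGANIZE is the lighter, online option; REBUILD is more thorough):
-- Nightly index maintenance for the line-data table.
alter index all on [FileImport].[LineData] reorganize;
-- or, when fragmentation is heavy:
-- alter index all on [FileImport].[LineData] rebuild;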
Related
I am trying to come up with a design for my database where across all my tables I'd like to have the combination of a GUID column (uniqueidentifier data type) and an identity column (int data type).
The GUID column is going to be a NONCLUSTERED index whilst the identity column is going to be the CLUSTERED index. I was wondering if the script below is a correct/safe approach when it comes to database design:
CREATE TABLE country
(
guid uniqueidentifier DEFAULT NEWID() NOT NULL,
code int IDENTITY(1, 1) NOT NULL,
isoCode nvarchar(5) NOT NULL,
description nvarchar(255) NOT NULL,
created date NOT NULL DEFAULT GETDATE(),
updated date NOT NULL DEFAULT GETDATE(),
inactive bit DEFAULT 0,
CONSTRAINT NIX_guid PRIMARY KEY NONCLUSTERED(guid),
CONSTRAINT AK_code UNIQUE(code),
CONSTRAINT AK_isoCode UNIQUE(isoCode)
)
GO
CREATE UNIQUE CLUSTERED INDEX [IX_code] ON country ([code] ASC)
GO
Any tips would be much appreciated!
The domain of all possible countries is never going to be more than a few hundred, so performance should not be a concern.
You already have an isoCode. That is a canonically defined candidate key. I understand what you mean when talking about GUIDs being useful because they can never collide when created on separate servers/application instances/etc. But ISO country codes can never collide either, because they're already defined by an external authority. You don't need the GUID.
Why is your existing isoCode column an nvarchar(5)? There is a 2-letter and a 3-letter ISO 3166 standard. No Unicode characters are required, so you can use char(2) or char(3) depending on which standard you pick, both of which are narrower than a 4-byte int.
Yes, an identity-based clustered index does mean not having to worry about page splits on insert. But these are countries. We already know all of the countries you need to insert right now, and are you really worried that the handful of changes that might be made over the next few decades will kill the performance of your system due to page splits on insert? No, so you don't need the identity column either.
Eliminate both surrogates, and just go with the alpha-2 or alpha-3 ISO country code as your clustered primary key.
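A minimal sketch of that design, keeping the question's column names and arbitrarily picking alpha-2:
CREATE TABLE country
(
    isoCode char(2) NOT NULL, -- ISO 3166-1 alpha-2 code
    description nvarchar(255) NOT NULL,
    created date NOT NULL DEFAULT GETDATE(),
    updated date NOT NULL DEFAULT GETDATE(),
    inactive bit NOT NULL DEFAULT 0,
    CONSTRAINT PK_country PRIMARY KEY CLUSTERED (isoCode)
)
GO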
After rebuilding all of the tables in one of my SQL Server databases into a new database, I failed to set the 'ID' column to IDENTITY and PRIMARY KEY for many of the tables. Most of them have data.
I discovered this T-SQL, and have successfully implemented it for a couple of the tables already. The new/replaced ID column contains the same values from the previous column (simply because they were from an auto-incremented column in the table I imported from), and my existing stored procedures all still work.
Alter Table ExistingTable
Add NewID Int Identity(1, 1)
Go
Alter Table ExistingTable Drop Column ID
Go
Exec sp_rename 'ExistingTable.NewID', 'ID', 'Column'
--Then open the table in Design View, and set the new/replaced column as the PRIMARY KEY
--I understand that I could set the PK when I create the new IDENTITY column
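For reference, the Design View step can also be done in T-SQL; a sketch with an illustrative constraint name:
Alter Table ExistingTable
Add Constraint PK_ExistingTable Primary Key Clustered (ID)
Go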
The new/replaced ID column is now the last column in the table, and so far, I haven't run into issues with the ASP.Net/C# data access objects that call the stored procedures.
As mentioned, each of these tables had no PRIMARY KEY (nor FOREIGN KEY) set. With that in mind, are there any additional steps I should take to ensure the integrity of the database?
I ran across this SO post, which suggests that I should run the 'ALTER TABLE REBUILD' statement, but since there was no PK already set, do I really need to do this?
Ultimately, I just want to be sure I'm not creating issues that won't appear until later in the game, and be sure the methods I'm implementing are sound, logical, and ensure data integrity.
I suppose it might be a better option to DROP/RECREATE the table with the proper PK/IDENTITY column, and I could write some T-SQL to dump the existing data into a TEMP table, then drop/recreate, and re-populate the new table with data from the TEMP table. I specifically avoided this option as it seems much more aggressive, and I don't fully understand what it means for the Stored Procedures/Functions, etc., that depend on these tables.
Here is an example of one of the tables I've performed this on. You can see the NewID values are identical to the original ID.
Give this a go; it's rummaged up from a script we used a few years ago in a similar situation, though I can't remember which version of SQL Server it was used against. If it works out for your scenario, you can adapt it to your tables.
SELECT MAX(Id)+1 FROM causeCodes -- run and use value below
CREATE TABLE [dbo].[CauseCodesW](
    [ID] [int] NOT NULL IDENTITY(put_maxplusone_here,1),
    [Code] [varchar](50) NOT NULL,
    [Description] [varchar](500) NULL,
    [IsActive] [bit] NOT NULL
)
ALTER TABLE CauseCodes SWITCH TO CauseCodesW;
DROP TABLE CauseCodes;
EXEC sp_rename 'CauseCodesW','CauseCodes';
ALTER TABLE CauseCodes ADD CONSTRAINT PK_CauseCodes_Id PRIMARY KEY CLUSTERED (Id);
SELECT * FROM CauseCodes;
You can now find any tables that have FKs to this table and recreate those relationships.
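A sketch for finding those referencing foreign keys through the catalog views:
-- List foreign keys that reference CauseCodes so they can be recreated.
SELECT fk.name AS foreign_key_name,
       OBJECT_NAME(fk.parent_object_id) AS referencing_table
FROM sys.foreign_keys AS fk
WHERE fk.referenced_object_id = OBJECT_ID('dbo.CauseCodes');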
I'm hoping for help with the following problem. Assume we have a table:
CREATE TABLE [dbo].[dummy](
[id] [char](36) NOT NULL,
[name] [varchar](50) NOT NULL
) ON [PRIMARY]
If I create a primary key like this (version 1)
ALTER TABLE dummy ADD CONSTRAINT PK_dummy PRIMARY KEY (ID);
I get a fixed, predictable name, in this case PK_dummy.
But if I create a primary key like this (version 2)
ALTER TABLE dummy ADD PRIMARY KEY Clustered (ID);
The name changes with every recreation of this primary key.
The format is always PK__dummy__"a dynamic number"
What is the meaning of this number?
And how can I identify primary keys created with version 2 in a huge database?
Thanks for any hints.
What is the meaning of this number?
This depends on product version - it is either based on a unique id or generated randomly.
how can I identify primary keys created with version 2 in a huge database?
SELECT *
FROM sys.key_constraints
WHERE is_system_named = 1
If you don't define the name of a constraint, index, key, etc, SQL Server will give it a name. To ensure uniqueness across the database, it therefore will add "random" characters at the end.
If having a consistent name is important then define the name in your statement, as you did in the first statement.
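If you need to bring existing system-named keys in line, sp_rename also works on constraints. A sketch (the system-generated name here is made up):
-- Rename a system-named primary key to a predictable name.
EXEC sp_rename 'PK__dummy__3213E83F0AD2A005', 'PK_dummy', 'OBJECT';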
I want to store a single row in a configuration table for my application. I would like to enforce that this table can contain only one row.
What is the simplest way to enforce the single row constraint ?
You make sure one of the columns can only contain one value, and then make that the primary key (or apply a uniqueness constraint).
CREATE TABLE T1(
Lock char(1) not null,
/* Other columns */,
constraint PK_T1 PRIMARY KEY (Lock),
constraint CK_T1_Locked CHECK (Lock='X')
)
I have a number of these tables in various databases, mostly for storing config. It's a lot nicer knowing that, if the config item should be an int, you'll only ever read an int from the DB.
I usually use Damien's approach, which has always worked great for me, but I also add one thing:
CREATE TABLE T1(
Lock char(1) not null DEFAULT 'X',
/* Other columns */,
constraint PK_T1 PRIMARY KEY (Lock),
constraint CK_T1_Locked CHECK (Lock='X')
)
Adding the "DEFAULT 'X'", you will never have to deal with the Lock column, and won't have to remember which was the lock value when loading the table for the first time.
You may want to rethink this strategy. In similar situations, I've often found it invaluable to leave the old configuration rows lying around for historical information.
To do that, you actually have an extra column creation_date_time (date/time of insertion or update) and an insert or insert/update trigger which will populate it correctly with the current date/time.
Then, in order to get your current configuration, you use something like:
select * from config_table order by creation_date_time desc fetch first row only
(depending on your DBMS flavour).
That way, you still get to maintain the history for recovery purposes (you can institute cleanup procedures if the table gets too big but this is unlikely) and you still get to work with the latest configuration.
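On SQL Server, that query would look something like:
-- Latest configuration row wins.
SELECT TOP (1) *
FROM config_table
ORDER BY creation_date_time DESC;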
You can implement an INSTEAD OF Trigger to enforce this type of business logic within the database.
The trigger can contain logic to check if a record already exists in the table and if so, ROLLBACK the Insert.
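A minimal SQL Server sketch of that trigger, assuming a hypothetical dbo.Config table with no IDENTITY column:
CREATE TRIGGER trg_Config_OnlyOneRow
ON dbo.Config
INSTEAD OF INSERT
AS
BEGIN
    -- Reject if a row already exists, or if more than one row arrives at once.
    IF EXISTS (SELECT 1 FROM dbo.Config) OR (SELECT COUNT(*) FROM inserted) > 1
    BEGIN
        RAISERROR ('Only one row is allowed in dbo.Config.', 16, 1);
        ROLLBACK TRANSACTION;
        RETURN;
    END;

    -- Otherwise pass the single row through.
    INSERT INTO dbo.Config SELECT * FROM inserted;
END;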
Now, taking a step back to look at the bigger picture, I wonder if perhaps there is an alternative and more suitable way for you to store this information, perhaps in a configuration file or environment variable for example?
I know this is very old, but instead of thinking big, sometimes it's better to think small: use an identity integer like this:
Create Table TableWhatever
(
keycol int primary key not null identity(1,1)
check(keycol =1),
Col2 varchar(7)
)
This way, each time you try to insert another row, the check constraint raises an error and prevents the insert, since the identity primary key won't accept any value but 1.
Here's a solution I came up with for a lock-type table which can contain only one row, holding a Y or N (an application lock state, for example).
Create the table with one column. I put a check constraint on the one column so that only a Y or N can be put in it. (Or 1 or 0, or whatever)
Insert one row in the table, with the "normal" state (e.g. N means not locked)
Then create an INSERT trigger on the table that only has a SIGNAL (DB2) or RAISERROR (SQL Server) or RAISE_APPLICATION_ERROR (Oracle). This makes it so application code can update the table, but any INSERT fails.
DB2 example:
create table PRICE_LIST_LOCK
(
LOCKED_YN char(1) not null
constraint PRICE_LIST_LOCK_YN_CK check (LOCKED_YN in ('Y', 'N') )
);
--- do this insert when creating the table
insert into PRICE_LIST_LOCK
values ('N');
--- once there is one row in the table, create this trigger
CREATE TRIGGER ONLY_ONE_ROW_IN_PRICE_LIST_LOCK
NO CASCADE
BEFORE INSERT ON PRICE_LIST_LOCK
FOR EACH ROW
SIGNAL SQLSTATE '81000' -- arbitrary user-defined value
SET MESSAGE_TEXT='Only one row is allowed in this table';
Works for me.
I use a bit field as the primary key, named IsActive.
So there can be at most 2 rows, and the SQL to get the valid row is:
select * from Settings where IsActive = 1
if the table is named Settings.
The easiest way is to define the ID field as a computed column with the constant value 1 (or any number), then put a unique index on ID.
CREATE TABLE [dbo].[SingleRowTable](
[ID] AS ((1)),
[Title] [varchar](50) NOT NULL,
CONSTRAINT [IX_SingleRowTable] UNIQUE NONCLUSTERED
(
[ID] ASC
)
) ON [PRIMARY]
You can write a trigger on the insert action on the table. Whenever someone inserts a new row, have the trigger logic remove the previously stored row, so only the latest insert survives.
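A hedged sketch of that idea (the Settings table and its Id column are hypothetical):
CREATE TRIGGER trg_Settings_KeepLatest
ON dbo.Settings
AFTER INSERT
AS
BEGIN
    -- Delete every row that is not the one just inserted.
    DELETE s
    FROM dbo.Settings AS s
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.Id = s.Id);
END;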
Old question, but how about using IDENTITY(MAX, 1) with a small column type?
CREATE TABLE [dbo].[Config](
[ID] [tinyint] IDENTITY(255,1) NOT NULL,
[Config1] [nvarchar](max) NOT NULL,
[Config2] [nvarchar](max) NOT NULL
);
The first insert gets ID 255; any further insert overflows tinyint and fails.
Another option is to guard the insert:
IF NOT EXISTS (SELECT * FROM table)
BEGIN
    -- your insert statement
END
Another trick is to always supply the same fixed value for the key column. Example:
Student Table:
Id:int
firstname:char
Have the application always write the same value into the Id column; after the first entry, the primary key constraint rejects every further insert, so the table holds only one row forever, without needing an explicit lock column.
Hope this helps!
I have a stored procedure that works with a large amount of data, which it inserts into a temp table. The overall flow of events is something like:
CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
    ,Col2 INT NOT NULL
    ,Col3 BIGINT
    ,Col4 VARCHAR(25) NOT NULL
    --Etc...
    --
    --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types shown are the real ones. In the #TempTable table, Col1 and Col4 will make up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT ... INTO is a bulk operation, and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
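A rough harness for such a measurement (the row count and row source are arbitrary):
SET STATISTICS TIME ON; -- report CPU and elapsed time per statement

-- Variant A: constraint first, then load.
CREATE TABLE #A (Id INT NOT NULL PRIMARY KEY);
INSERT INTO #A (Id)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- Variant B: load first, then constraint.
CREATE TABLE #B (Id INT NOT NULL);
INSERT INTO #B (Id)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;
ALTER TABLE #B ADD PRIMARY KEY (Id);

SET STATISTICS TIME OFF;
DROP TABLE #A;
DROP TABLE #B;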
I was wondering if I could improve a very "expensive" stored procedure entailing a bunch of checks at each insert across tables, and I came across this answer. In the sproc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down severely. I highly recommend using primary keys; always at least try it out, even when you think you don't need them.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the insert checks amount to T(n) = 1 + 2 + 3 + ... + n comparisons in a naive model, because when you insert the x-th row it is checked only against the rows already inserted.
When you add the PK after inserting all the values, each of the n rows is validated against all n rows, for roughly n^2 comparisons.
The first approach is faster, but only by a constant factor: T(n) = n(n + 1)/2, about half of n^2, so both counts are quadratic in this naive model. In practice each check is an index seek rather than a row-by-row scan, which is far cheaper than either count suggests.
P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations.