Database Performance and maintenance with one thousand columns - sql-server

I need to create a table with one thousand fields (columns), and I don't know how to handle the performance or how to maintain it. Please help me with suggestions.

If most values are NULL most of the time, then you should upgrade to SQL Server 2008 and use sparse columns; see Using Sparse Columns and Using Column Sets.
If your column values are not mostly NULL, then I question the soundness of your data model.
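As a minimal sketch of what that looks like (the table and column names here are invented for illustration, not from the question):

-- Hypothetical table where most Attr columns are NULL for most rows.
CREATE TABLE dbo.WideEntity
(
    Id INT NOT NULL PRIMARY KEY,
    Attr1 INT SPARSE NULL,
    Attr2 NVARCHAR(100) SPARSE NULL,
    Attr3 DECIMAL(20,7) SPARSE NULL,
    -- ... further sparse columns ...
    AllAttributes XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);

-- Only non-NULL sparse values are stored; the column set returns them as one XML fragment.
INSERT INTO dbo.WideEntity (Id, Attr2) VALUES (1, N'only one attribute set');
SELECT Id, AllAttributes FROM dbo.WideEntity;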

The first things you will have to do:
Normalize. Define the entities and separate them out into different tables. Draw an ER diagram and you will get more ideas.
Don't go much beyond 15 columns per table if the columns are varchar or text, because SQL Server will then have to store the data on different pages. If the columns are boolean it can be around 30.
Define the clustered index properly based on your data, as this will optimize querying (a sketch follows below).
Since the question doesn't give much detail, the answers are also generic and from a 100-foot view.
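As a sketch of the clustered index point, assuming a hypothetical Orders table that is mostly queried by date range (none of these names come from the question):

-- Hypothetical table; pick the clustered key to match the most common access path.
CREATE TABLE dbo.Orders
(
    OrderId INT IDENTITY(1,1) NOT NULL,
    OrderDate DATETIME NOT NULL,
    CustomerId INT NOT NULL,
    Amount DECIMAL(18,2) NOT NULL
);

-- Cluster on the column(s) you range-scan most often, not necessarily the identity.
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON dbo.Orders (OrderDate, OrderId);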

Again, please don't do this.
Check out the Entity-Attribute-Value (EAV) model with respect to databases. It will let you store a large number of sparse attributes on an entity and doesn't make databases cry.
The basic concept is shown below:
-- The list of possible attributes (metadata).
create table #attributes
(
    id int identity(1,1),
    attribute varchar(20),
    attribute_description varchar(max),
    attribute_type varchar(20)
)

insert into #attributes values ('Column 1', 'what you want to put in column 1 of 1000', 'string')
insert into #attributes values ('Column 2', 'what you want to put in column 2 of 1000', 'int')

-- The entities themselves, holding only the always-present columns.
create table #entity
(
    id int identity(1,1),
    whatever varchar(max)
)

insert into #entity values ('Entity1')
insert into #entity values ('Entity2')

-- One row per entity/attribute combination that actually has a value.
create table #entity_attribute
(
    id int identity(1,1),
    entity_id int,
    attribute_id int,
    attribute_value varchar(max)
)

insert into #entity_attribute values (1, 1, 'e1value1')
insert into #entity_attribute values (1, 2, 'e1value2')
insert into #entity_attribute values (2, 2, 'e2value2')

select *
from #entity e
join #entity_attribute ea on e.id = ea.entity_id
The difference between what goes in the #entity table and what goes in the #attributes table is somewhat dependent on the application, but a general rule would be: anything that is never null and is accessed every time you need the entity belongs on the entity itself, and I would limit that to 10 or so items. A query that pulls the values back together with their attribute names is sketched below.
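For completeness, a sketch of reading the values back with their attribute names and types (using the temp tables above):

-- Join through #attributes so each value carries its name and declared type.
select e.id as entity_id,
       e.whatever,
       a.attribute,
       a.attribute_type,
       ea.attribute_value
from #entity e
join #entity_attribute ea on ea.entity_id = e.id
join #attributes a on a.id = ea.attribute_id
order by e.id, a.id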
Let me guess: is this a medical application?

Related

Bad design to compare to computed columns?

Using SQL Server I have a table with a computed column. That column concatenates 60 columns:
CREATE TABLE foo
(
    Id INT NOT NULL,
    PartNumber NVARCHAR(100),
    field_1 INT NULL,
    field_2 INT NULL,
    -- ... and so forth up to ...
    field_60 INT NULL
)

ALTER TABLE foo
ADD RecordKey AS CONCAT(field_1, '-', field_2, '-', /* ... and so on up to ... */ field_60) PERSISTED

CREATE INDEX ix_foo_RecordKey ON dbo.foo (RecordKey);
Why I used a persisted column:
Not having the need to index 60 columns
To test to see if a current record exists by checking just one column
This table will contain no fewer than 20 million records. Adds, inserts, and updates happen a lot, and some binaries do tens of thousands of inserts/updates/deletes per run; we want these to be quick and live.
Currently we have C# code that manages records in table foo. It has a function which concatenates the same fields, in the same order, as the computed column. If a record with that same concatenated key already exists, we might not insert, or we might insert but call other functions that we would not normally call.
Is this a bad design? The big danger I see is the code for some reason no longer matching the concatenation order of the computed column (if one is edited but not the other).
Rules/Requirements
We want to show records in JQGrid. We already have C# that can do so if the records come from a single table or view
We need the ability to check two records to verify if they both have the same values for all of the 60 columns
A better table design would be
parts table
-----------
id
partnumber
other_common_attributes_for_all_parts
attributes table
----------------
id
attribute_name
attribute_unit (if needed)
part_attributes table
---------------------
part_id (foreign key to parts)
attribute_id (foreign key to attributes)
attribute_value
It looks complicated, but with proper indexing this is super fast even if part_attributes contains billions of records!
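A minimal T-SQL sketch of that design (the column types and the comparison query are my own illustration, not from the original answer), including a check that two parts share the same values for every attribute:

CREATE TABLE dbo.parts
(
    id INT IDENTITY(1,1) PRIMARY KEY,
    partnumber NVARCHAR(100) NOT NULL
    -- other common attributes for all parts
);

CREATE TABLE dbo.attributes
(
    id INT IDENTITY(1,1) PRIMARY KEY,
    attribute_name NVARCHAR(100) NOT NULL,
    attribute_unit NVARCHAR(20) NULL
);

CREATE TABLE dbo.part_attributes
(
    part_id INT NOT NULL REFERENCES dbo.parts (id),
    attribute_id INT NOT NULL REFERENCES dbo.attributes (id),
    attribute_value NVARCHAR(100) NULL,
    PRIMARY KEY (part_id, attribute_id)
);

-- Do parts @p1 and @p2 have identical values for every attribute?
-- EXCEPT treats NULLs as equal, so this is a NULL-safe comparison in both directions.
DECLARE @p1 INT = 1, @p2 INT = 2;
SELECT CASE WHEN NOT EXISTS
            (SELECT attribute_id, attribute_value FROM dbo.part_attributes WHERE part_id = @p1
             EXCEPT
             SELECT attribute_id, attribute_value FROM dbo.part_attributes WHERE part_id = @p2)
        AND NOT EXISTS
            (SELECT attribute_id, attribute_value FROM dbo.part_attributes WHERE part_id = @p2
             EXCEPT
             SELECT attribute_id, attribute_value FROM dbo.part_attributes WHERE part_id = @p1)
       THEN 1 ELSE 0 END AS values_match;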

How can I insert rows of one table into multiple tables using a SQL Server stored procedure?

I am interested in inserting the rows of my tempDataTable into two tables.
This is the table design of my tempDataTable:
These are the two tables I want to create via the stored procedure from my TempDataTable (the one in the image).
The design for the two new tables would be something like:
Table one (Product): ProductID (PK), ProductName, Product URL
Table two (ProductPricing): ProductPricingID(PK), ProductId (FK), price, priceperunit, Date
I have been searching for a solution for a whole day and will keep doing so, but I have been unable to find an exact solution. I am not experienced with SQL, but this is something I have to do.
Okay, I'm not sure exactly where you are struggling, so here's a script that sort of does what you asked for. None of this is too hard to follow, so maybe have a scan through it, and then let me know which bits are confusing?
Set up the table structure:
CREATE TABLE tempDataTable (
TempProductId INT,
TempProductUrl VARCHAR(512),
TempProductPrice VARCHAR(50),
TempProductPricePerUnit VARCHAR(50),
TempProductName VARCHAR(512));
INSERT INTO tempDataTable SELECT 2491, 'https://yadayada1', '£1.65/unit', '46p/100g', 'Yeo Valley Little Yeos, blah';
INSERT INTO tempDataTable SELECT 2492, 'https://yadayada2', '60p/unit', '1p/ea', 'Sainsbury''s Little Ones, etc';
CREATE TABLE Product (
ProductId INT PRIMARY KEY,
ProductName VARCHAR(512),
ProductUrl VARCHAR(512));
CREATE TABLE ProductPricing (
ProductPricingId INT IDENTITY(1,1) PRIMARY KEY,
ProductId INT,
ProductPrice VARCHAR(50),
ProductPricePerUnit VARCHAR(50),
ProductPricingDate DATETIME);
ALTER TABLE ProductPricing ADD CONSTRAINT foreignkey$ProductPricing$Product FOREIGN KEY (ProductId) REFERENCES Product (ProductId);
This gives me three tables to play with, one with some temporary data in it, and two that you want to push the data into, with a couple of primary keys, and a foreign key constraint to ensure integrity between the two tables.
Good so far?
Now to split the data between the two tables is as simple as:
INSERT INTO Product (ProductId, ProductName, ProductUrl) SELECT TempProductId, TempProductName, TempProductUrl FROM tempDataTable;
INSERT INTO ProductPricing (ProductId, ProductPrice, ProductPricePerUnit, ProductPricingDate) SELECT TempProductId, TempProductPrice, TempProductPricePerUnit, GETDATE() FROM tempDataTable;
If you run that then you should end up with data in your two tables, like this:
Product

ProductId  ProductName                   ProductUrl
2491       Yeo Valley Little Yeos, blah  https://yadayada1
2492       Sainsbury's Little Ones, etc  https://yadayada2

ProductPricing

ProductPricingId  ProductId  ProductPrice  ProductPricePerUnit  ProductPricingDate
1                 2491       £1.65/unit    46p/100g             2020-04-27 14:29:14.657
2                 2492       60p/unit      1p/ea                2020-04-27 14:29:14.657
Now there's a whole load of questions that arise from this:
how are you going to cope with running this more than once, because the second time you run it there will be primary key violations?
do you want to clear down the temporary data somehow on successful completion?
do you want to use the system date as the pricing date, or are there more columns off the edge of your image?
do you want to check the data for duplicates and deal with them before running the script, or will it just fail?
if you do get a duplicate then do you skip it, or update the data (MERGE)?
why do you want this as a stored procedure? I mean it's easy enough to make into one, but I don't see why this would need to be repeatable... without seeing the other "moving parts" in this system anyway.
I'm guessing that you are loading bulk data into that temporary table somehow, from an Excel workbook, or XML, or similar. So all you want is a way to "tear the data up" into multiple tables. If this is indeed the case, then using a tool like SSIS might be more practical?
Okay, so that's 90% there, but you need two other things:
cope with situations where the product id already exists - don't try to insert it a second time as it will fail;
where the product id already exists then update the price data.
This should handle the first tweak:
INSERT INTO Product (ProductId, ProductName, ProductUrl) SELECT t.TempProductId, t.TempProductName, t.TempProductUrl FROM tempDataTable t
WHERE NOT EXISTS (SELECT * FROM Product p WHERE p.ProductId = t.TempProductId);
...and to UPDATE prices where the data already exists, or INSERT them if they don't exist, well you can use a MERGE statement:
MERGE
ProductPricing AS [target]
USING (SELECT TempProductId, TempProductPrice, TempProductPricePerUnit, GETDATE() AS ProductPricingDate FROM tempDataTable)
AS [source] (
ProductId,
ProductPrice,
ProductPricePerUnit,
ProductPricingDate)
ON ([target].ProductId = [source].ProductId)
WHEN MATCHED THEN
UPDATE SET
ProductPrice = [source].ProductPrice,
ProductPricePerUnit = [source].ProductPricePerUnit,
ProductPricingDate = [source].ProductPricingDate
WHEN NOT MATCHED THEN
INSERT (
ProductId,
ProductPrice,
ProductPricePerUnit,
ProductPricingDate)
VALUES (
[source].ProductId,
[source].ProductPrice,
[source].ProductPricePerUnit,
[source].ProductPricingDate);
Actually, re-reading your comment, I don't think you even need a MERGE (but I'm going to leave it there anyway, as it took me a little effort to write it).
I think your second case is as simple as just letting the second INSERT always run. There are two scenarios:
if there's already an entry for that product - then just add a new row to the ProductPricing table, so you will have one product, and two (or more) prices, each with a different date;
if it's a new product - then add the product and the price, so you will have one product and one price (until a new price arrives).
...and I can't resist adding: this is because you are using a natural key, i.e. a key from your data, so it doesn't change as you load it. If you were using a surrogate key (e.g. an IDENTITY that you got when you inserted the Product) then this wouldn't work; you would need to go and look up the surrogate key and use that, so your foreign key constraint worked properly. It's probably best not to think about this too hard?
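Purely for illustration (this reworks the schema above with hypothetical surrogate-key tables, so none of these names are in the original), that lookup would go something like:

-- Hypothetical variant: the product gets a surrogate identity key instead of reusing TempProductId.
CREATE TABLE ProductS (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY,
    NaturalProductId INT UNIQUE,      -- the id that arrives in tempDataTable
    ProductName VARCHAR(512),
    ProductUrl VARCHAR(512));

CREATE TABLE ProductPricingS (
    ProductPricingId INT IDENTITY(1,1) PRIMARY KEY,
    ProductKey INT REFERENCES ProductS (ProductKey),
    ProductPrice VARCHAR(50),
    ProductPricePerUnit VARCHAR(50),
    ProductPricingDate DATETIME);

-- Load products first so the surrogate keys exist...
INSERT INTO ProductS (NaturalProductId, ProductName, ProductUrl)
SELECT t.TempProductId, t.TempProductName, t.TempProductUrl
FROM tempDataTable t
WHERE NOT EXISTS (SELECT * FROM ProductS p WHERE p.NaturalProductId = t.TempProductId);

-- ...then look each surrogate key up when loading the prices.
INSERT INTO ProductPricingS (ProductKey, ProductPrice, ProductPricePerUnit, ProductPricingDate)
SELECT p.ProductKey, t.TempProductPrice, t.TempProductPricePerUnit, GETDATE()
FROM tempDataTable t
JOIN ProductS p ON p.NaturalProductId = t.TempProductId;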

About Surrogate key in Loading Process in DataWarehouse

When you do the loading process from the stage table to the fact and dimension tables, does that mean you also load the surrogate key from stage into the dimension table for the new rows?
Or do you create a new surrogate key in the dimension table by using the IDENTITY property (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017)?
Which approach is correct?
Other information:
* I'm a newbie in ETL and Business Intelligence.
* I'm using only T-SQL, no SSIS.
Thank you!
The question is not very clear. I'll attempt to answer based on what I think you are asking, but it would be better to make sure the question is crystal clear to people unfamiliar with the data, and to provide sample data.
I think you are asking whether you need to load entries into a dimension table, for records that are being loaded into a fact table, at the same time the fact table is being loaded.
Generally the dimension members are loaded into the dimension table before loading data into the fact table. It's just easier to do it this way if at all possible.
The steps I would use, in order, are:
Load the dimension with any new members in its own stored procedure. This ensures that you now have a surrogate key for any new members. Do this for all dimensions (a sketch of this step follows the example below).
Create a second stored procedure to load the fact table. Join the staging table to the dimension tables to get the surrogate keys. The code below shows an example for one dimension; just add more joins to more dimensions as needed.
The code below populates a sample dimension and fact staging table with contrived data, to show how to then get the surrogate key and the data to be inserted into the fact table.
create table #factstaging
(
dimension1Value nvarchar(20),
factmeasure1 int,
factmeasure2 int
)
create table #dimension1
(
ID int identity(1,1),
dimension1Value nvarchar(20)
)
insert into #dimension1
values
('d1 value 1'),
('d1 value 2'),
('d1 value 3')
insert into #factstaging
values
('d1 value 1',22,44),
('d1 value 1',22,44),
('d1 value 2',22,44),
('d1 value 3',22,44)
-- contents of the stored procedure that loads the fact table (the SELECT that would feed the fact INSERT)
select d1.ID as Dimension1SurrogateKey, s.factmeasure1,s.factmeasure2
from #factStaging s
join #dimension1 d1 on s.dimension1Value = d1.dimension1Value
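And for step 1, a minimal sketch of the dimension-load stored procedure, again using the contrived temp tables above:

-- Insert any member seen in staging that the dimension doesn't have yet;
-- the identity column hands out the surrogate key automatically.
insert into #dimension1 (dimension1Value)
select distinct s.dimension1Value
from #factstaging s
where not exists (select 1
                  from #dimension1 d
                  where d.dimension1Value = s.dimension1Value)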
Note:
your data needs to be clean.
if facts arrive before the dimension data, the pattern will be different, and you will need something like a late-arriving dimension pattern, which is a lot more complex.

Slow insert performance on a table with a lot of empty nvarchar columns

I have a table with 150 columns. Most of them are of type nvarchar(100) null, and some are decimal(20,7) null
Example:
Create table myTable
(
    ID bigint PRIMARY KEY,
    Col1 nvarchar(100) null,
    Col2 nvarchar(100) null,
    Col3 nvarchar(100) null,
    ....
    Col150 nvarchar(100) null
)
When I do an insert, I insert into only 20 of the columns. When I try to insert 1 or 2 million records it takes a lot of time (over a minute on a machine with 32 GB RAM).
When I insert the same number of records into a temp table it takes just 1-2 seconds.
I also tried removing the primary key, but the results are the same. How can I speed up inserts into a table with a lot of empty nvarchar columns?
Since you have a lot of empty columns, I'd suggest using XML as a column type if it's possible:
Create table myTable
(
    ID bigint PRIMARY KEY,
    Col1 XML
)
This will improve your performance in the circumstances you explained.
Since XML is a bigger topic, you can read more here.
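As a rough illustration of the idea (the element names Col7 and Col42 are invented, not from the question), the sparse values could be stored and read back like this:

-- Store only the columns that actually have values inside the XML document.
INSERT INTO myTable (ID, Col1)
VALUES (1, N'<row><Col7>some text</Col7><Col42>123.4</Col42></row>');

-- Pull individual values back out with the xml value() method.
SELECT ID,
       Col1.value('(/row/Col7)[1]',  'nvarchar(100)') AS Col7,
       Col1.value('(/row/Col42)[1]', 'decimal(20,7)') AS Col42
FROM myTable;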
Speaking about the design: having hundreds of nullable columns that will be NULL most of the time is obviously not best practice.
So, I would say:
If you can redesign the table, you may:
Use one XML column to store all of those nullable columns (as #seesharpguru mentioned)
Convert all of those columns into rows in a separate table
Use empty strings instead of NULL values, so you can make all the columns NOT NULL
If you don't have any choice, you may:
Use BULK INSERT
Just wait until it finishes and hope the users will be OK with the SLA ;-)
I'm sure there are many options out there, but at least I've been through all of these solutions before.
Sorry I cannot give you more than ideas.
It's likely you are experiencing massive page splits. This can be confirmed by monitoring (Profiler or the DMVs).
There are too many columns per table.
You will need to either:
redesign the table (normalize or split it: having 4 tables with 40 columns each is much faster than 1 table with 150 columns),
try to defragment the table, or
try a different fill factor (see the sketch below).
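A sketch of those last two points, assuming the question's table is dbo.myTable (check fragmentation first, then rebuild with a lower fill factor to leave free space on each page):

-- Check fragmentation for the table's indexes.
SELECT ips.index_id, ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.myTable'), NULL, NULL, 'LIMITED') ips;

-- Rebuild with a lower fill factor so inserts and updates cause fewer page splits.
ALTER INDEX ALL ON dbo.myTable REBUILD WITH (FILLFACTOR = 80);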

Storing detailed data in SQL Server

I’m designing a database in which I save votes.
I’ve created a table:
CREATE TABLE [dbo].[users_votes](
    [id] [bigint] NOT NULL,
    [like_votes] [int] NOT NULL DEFAULT ((0)),
    [dislike_votes] [int] NOT NULL DEFAULT ((0)),
    [commented_votes] [int] NOT NULL DEFAULT ((0)),
    [comments_likes] [int] NOT NULL DEFAULT ((0))
);
The issue is that there is a requirement to also store the breakdown of the data by location.
So, for example, if the users_votes table has 1,000 like_votes for a specific id, I need to know the breakdown by location, e.g.:
United States 340
France 155
Denmark 25
Brazil 290
Australia 190
I’m getting the data from the client as a comma-delimited string, for example (1,2,45,67,87,112,234), together with the country code for the location (us, au, ca, etc.).
I’ve been thinking about a few possibilities for storing this data, but wanted to know which of these approaches is best suited (if any).
As the number of country codes is finite, I could expand the users_votes table and add columns with country codes for each criterion, e.g. like_votes_us, dislike_votes_us, comment_votes_us, comment_likes_us.
In this case I would probably use dynamic SQL to insert/update the data.
Or, create a new table for each column. For example, I would have a table named like_votes with an id, an external_id which is the users_votes (table) id, a country_code, and a count column. The data would then be stored in users_votes and also in the like_votes table, with a record for each combination of external_id and country code.
In this case I would need to iterate over the inserted data to determine whether each external_id/country combination already exists (and then just increment it) or needs to be inserted.
Which approach, if any, is the optimal way to store this data so that it is easy to insert/update and also to query?
This type of table design you have at the moment isn't a good idea, in all honesty. One big, important point of building a good relational database is using Normal Form. I'm not going to explain what that is here, as there are tens of thousands of articles on the internet explaining it and its different iterations (from 1NF to 6NF, IIRC).
Anyway, you can easily do this with a few tables. I'm having to guess a lot about your set up here, but hopefully you'll be able to extrapolate what you need and adjust what doesn't fit.
Firstly, let's start with a client table:
CREATE TABLE dbo.Client (ClientID int IDENTITY(1,1),
ClientName varchar(100), --You should really split this into Title, Forename and Surname, I'm just being "lazy" here
ClientCountryID int, --Not sure if a Client is related to a country or the vote is, i've guessed the client is.
DOB date,
EmailAddress varchar(100));
GO
So, we have a simple Client Table now. Next, we want a Country Table. This is very simple:
CREATE TABLE dbo.Country (CountryID int IDENTITY(1,1),
CountryName varchar(100),
CountryCode char(2)); --For example UK for United Kingdom, FR for France, etc
GO
You might want to store additional content there, but I don't know your set up.
Now, this is where I'm really guessing a lot. I'm assuming that your likes and dislikes, etc, are linked to something. What, I have no idea, so, I'm going to have a table called "Content", however, not knowing what these likes are against, I have no context for this table, thus it's going to be very basic:
CREATE TABLE dbo.Content (ContentID int IDENTITY(1,1),
ContentType int, --Guessing might be types, maybe videos, Comments, articles? I have no idea to be honest)
ContentParent int, --Comments are joined to a Content (just like here on SO)? I'll guess it's possible
Content nvarchar(MAX)); --because I have no idea what's going in there
--Very simple Content Type Table
CREATE TABLE dbo.ContentType (TypeID int IDENTITY(1,1),
TypeDescription varchar(100));
GO
Now, finally, we can get onto the votes that you want to store; which might look something like this:
CREATE TABLE dbo.Vote (VoteID int IDENTITY(1,1),
ClientID int,
ContentID int,
Liked bit); --1 for Liked, 0 for Disliked, NULL for N/A perhaps?
GO
OK, now we have some tables. Now, I realise I haven't given any kind of sample data to go in here, so I'll provide a few INSERT statements for you, so you can get the idea:
INSERT INTO dbo.Country (CountryName, CountryCode)
VALUES ('United Kingdom','GB'),
('France','FR'),
('Germany','DE');
GO
INSERT INTO dbo.Client (ClientName, ClientCountryID, DOB, EmailAddress)
VALUES ('Mr John Smith',1, '19880106','Bob#gmial.com'),
('Ms Penelope Vert',2,'19930509','PVert#mfn.com');
GO
INSERT INTO dbo.ContentType (TypeDescription)
VALUES ('Video'),('Article'),('Comment');
GO
INSERT INTO dbo.Content (ContentType, ContentParent, Content)
VALUES (2, NULL, 'This is my first article, hi everyone!'),
(3, 1, 'Nice! Good to see you''re finally posting!'),
(1, NULL, 'http://youtube.com');
GO
--And now some votes:
INSERT INTO dbo.Vote (ClientID, ContentID, Liked)
VALUES (1, 1, 1),
(2, 1, 1),
(2, 2, 1),
(2, 3, 0);
GO
Notice how I've put the votes in. I've not aggregated them in the table; doing so is an awful idea. Instead, store each vote individually and use a query to aggregate.
SELECT C.ContentID,
Cy.CountryName,
COUNT(CASE V.Liked WHEN 1 THEN 1 END) AS LikedVotes,
COUNT(CASE V.Liked WHEN 0 THEN 1 END) AS DisLikedVotes
FROM dbo.Content C
JOIN dbo.Vote V ON C.ContentID = V.ContentID
JOIN dbo.Client CV ON V.ClientID = CV.ClientID
JOIN dbo.Country Cy ON CV.ClientCountryID = Cy.CountryID
GROUP BY C.ContentID,
Cy.CountryName;
This gives you the number of liked votes per content item, and splits it into countries as well for you. If you want to put these countries into their own columns, then I strongly suggest doing this in your presentation layer, not your SQL (as you'll have to use dynamic SQL, and (no offence) I imagine this is beyond your skills at the moment based on your current database design choice(s)). Excel is very good at doing this using pivot tables. If you want to retain the process in SQL Server, consider using SSRS and a matrix.
If you have any questions, please do ask.
Note: I have not added any kind of foreign keys, constraints, default values, etc. here. These are a definite must for any good database design.
Clean Up script:
DROP TABLE dbo.Client;
DROP TABLE dbo.Country;
DROP TABLE dbo.Vote;
DROP TABLE dbo.Content;
DROP TABLE dbo.ContentType;
GO
