Speed up table scans using a middle-man data dump table - sql-server

I'm a dev (not a DBA) working on a project where we have been managing "updates" to data via bulk insert. However, we first insert into a non-indexed "pre-staging" table, because we need to normalize a lot of denormalized data and make sure it's properly split up across our schema.
Naturally this makes the update and insert processes slow, since for each target table we have to check whether the information already exists, using non-indexed codes or identifiers.
Since the "pre-staging" table is truncated before each load, we didn't include auto-generated IDs either.
I'm looking into ways to speed up table scans on that particular table in our stored procedures. What would be the best approach to do so? Indexes? Auto-generated IDs as clustered indexes? This last one is tricky because we cannot establish relationships with our "staging" data since it is truncated per data dump.
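For illustration, a minimal sketch of the "index the pre-staging table after each load" idea (table, column, and file names are hypothetical):

    -- Hypothetical names throughout; adjust to the real pre-staging schema.
    -- 1. Bulk load into the unindexed heap, as today.
    BULK INSERT dbo.PreStaging
    FROM 'D:\dumps\latest_dump.txt'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);

    -- 2. Index the code/identifier used by the existence checks,
    --    so the lookups in the stored procedures can seek instead of scan.
    CREATE NONCLUSTERED INDEX IX_PreStaging_SourceCode
        ON dbo.PreStaging (SourceCode);

    -- 3. Run the normalization / split procedures here.

    -- 4. Drop the index before the next truncate-and-load cycle.
    DROP INDEX IX_PreStaging_SourceCode ON dbo.PreStaging;
    TRUNCATE TABLE dbo.PreStaging;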

Related

Performance improvement by selective updating of Fact table columns in dimensional data warehouse

I'm trying to improve the loading of a dimensional data warehouse. I need to load tens of thousands of records each time I run the loading operation; sometimes I need to load hundreds of thousands of records; semi-annually I need to load millions of records.
For the loading operation, I build an in-memory record corresponding to a 48-column Fact table record which is to be updated or inserted (depending on whether a previous version of the record is already in the Fact table). The in-memory record includes about 2 dozen foreign key pointers to various dimension tables. I have indexes on all of those Fact table foreign keys (and of course I can disable those indexes while loading the data warehouse).
After populating the in-memory record with fresh data for the Fact table record, I add it to the Fact table (update or insert, as above).
I am wondering if I could improve performance in update situations by updating only those columns which have changed, instead of mindlessly updating every column in the Fact table record. Doing this would add some overhead in that my loading program would become more complex. But would I gain any benefit anyway? Suppose for example that one of those dimension table foreign keys hasn't changed. Is there going to be a performance benefit if I do not update that particular column? I guess it comes down to whether SQL Server internally tracks (counts??) foreign key references between the Fact and dimension tables. If SQL Server does do such tracking, then perhaps there would be a performance benefit if I do not update the column, since then SQL Server would not need to perform its internal tracking operation. But if SQL Server does not do such tracking then I suppose that I should just update all 48 Fact table columns regardless of whether they have changed or not, since that would avoid adding complexity & overhead to my loading program.
I'd appreciate any comments or suggestions.
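To make the idea concrete, here is a rough sketch of the coarser, row-level variant (skip the UPDATE entirely when nothing in the row changed) using a change-detection predicate on the MERGE. It does not attempt per-column updates; the table and column names are hypothetical, and only three of the 48 columns are shown:

    -- Hypothetical fact/staging names; only three of the 48 columns are shown.
    MERGE dbo.FactSales AS tgt
    USING dbo.FactSales_Staging AS src
        ON tgt.FactKey = src.FactKey
    WHEN MATCHED AND EXISTS (                   -- only update rows where something differs;
            SELECT src.CustomerKey, src.ProductKey, src.SalesAmount   -- EXCEPT is NULL-safe
            EXCEPT
            SELECT tgt.CustomerKey, tgt.ProductKey, tgt.SalesAmount)
        THEN UPDATE SET
            tgt.CustomerKey = src.CustomerKey,
            tgt.ProductKey  = src.ProductKey,
            tgt.SalesAmount = src.SalesAmount
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (FactKey, CustomerKey, ProductKey, SalesAmount)
             VALUES (src.FactKey, src.CustomerKey, src.ProductKey, src.SalesAmount);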

SQL Server delete rows from large table and create indexes

I have a table on a production server with 350 million rows and approximately 25 GB in size. It has a single clustered identity index.
The queries targeting this table require some missing indexes for better performance.
I need to delete unnecessary data (approx. 200 million rows) and then create two non-clustered indexes.
However, I have some concerns:
I need to avoid increasing the log too much
Keep the database downtime as low as possible.
Keep the identity (primary key) the same in the remaining data.
I would like to hear your opinion on the best solution to adopt.
The following is a guideline on how you might do this:
Suspend insert/update operations or start logging them explicitly (this might result in degraded performance).
Select the records to keep into a new table.
Then you have two options. If this is the only table in your universe:
Build the indexes on the new table.
Stop the system.
Rename the existing table to something else.
Rename the new table to the real table name
Turn the system back on.
If other tables are involved (for example, foreign key relationships to this table):
Truncate the existing table
Insert the data into the existing table
Build the secondary indexes
Turn the system back on
Depending on your user requirements, one of the above variations is likely to work for your problem.
Note that there are other, more intensive techniques. For instance, create a replicated database; once that is working, you have two systems and can do the clean-up work on one at a time (such a method would be the preferred approach for a system with near-100% uptime requirements). Or create a separate table that is just right and swap the table spaces.
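A minimal T-SQL sketch of the "copy the keepers, index, then rename" variation described above (table, column, and filter names are hypothetical):

    -- Hypothetical names and keep-criteria throughout.
    -- SELECT INTO is minimally logged under SIMPLE or BULK_LOGGED recovery, and it
    -- copies the IDENTITY property when the column comes straight from a single
    -- source table, so the existing identity values are kept.
    SELECT *
    INTO   dbo.BigTable_New
    FROM   dbo.BigTable
    WHERE  EventDate >= '20240101';      -- the ~150 million rows you want to keep

    -- Rebuild the clustered primary key and the two new non-clustered indexes.
    ALTER TABLE dbo.BigTable_New
        ADD CONSTRAINT PK_BigTable_New PRIMARY KEY CLUSTERED (Id);
    CREATE NONCLUSTERED INDEX IX_BigTable_New_Col1 ON dbo.BigTable_New (Col1);
    CREATE NONCLUSTERED INDEX IX_BigTable_New_Col2 ON dbo.BigTable_New (Col2);

    -- During the short downtime window, swap the tables by renaming.
    BEGIN TRAN;
    EXEC sp_rename 'dbo.BigTable',     'BigTable_Old';
    EXEC sp_rename 'dbo.BigTable_New', 'BigTable';
    COMMIT;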

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists, as the first round of what will involve many rounds of analysis.
You basically want to scrub the data and objects, correct? Here is what I would do.
Restore a backup of the db.
Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.)
Create a table with two columns; populate it so that each row has the original table name and the new table name.
Write a script that iterates through that table, row by row, and renames your tables. Better yet, put the data into Excel, create a third column that builds the T-SQL you want to run, then cut/paste and execute it in SSMS.
Repeat step 4, but for all columns. It's best to query sys.columns to get all the objects you need, put the results into Excel, and build your T-SQL.
Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfers.
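A rough sketch of steps 4-6 that generates the rename statements straight from the system catalog instead of building them in Excel (review the generated output before running it in SSMS; the Table1/Column1 names follow the numbering scheme in the question):

    -- Rename the columns first (while the original table names still resolve),
    -- then rename the tables. Review and run each generated batch in SSMS.

    -- Columns -> Column1, Column2, ... (numbered per table).
    SELECT 'EXEC sp_rename ''' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + '.' + QUOTENAME(c.name)
         + ''', ''Column' + CAST(ROW_NUMBER() OVER (PARTITION BY t.object_id ORDER BY c.column_id) AS varchar(10))
         + ''', ''COLUMN'';'
    FROM sys.columns AS c
    JOIN sys.tables  AS t ON t.object_id = c.object_id
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id;

    -- Tables -> Table1, Table2, ...
    SELECT 'EXEC sp_rename ''' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
         + ''', ''Table' + CAST(ROW_NUMBER() OVER (ORDER BY t.object_id) AS varchar(10)) + ''';'
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id;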
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without an FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider an FK column customerID versus a materialID. Patterns have widely different meanings and analysis. I would correlate a quality measure with materialID or shiftID, but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows) so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup/restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget the triggers and constraints - they may disclose some trade secrets. Then rename the columns, and then the tables.

Quickly update a large amount of rows in a table, without blocking inserts on referencing tables, in SQL Server 2008

Context:
I have a system that acts as a Web UI for a legacy accounting system. This legacy system sends me a large text file several times a day so I can update a CONTRACT table in my database (the file can have new contracts, or just updated values for existing contracts). This table currently has around 2M rows and about 150 columns. I can't have downtime during these updates, since they happen during the day and there are usually about 40 logged-in users at any given time.
My system's users can't update the CONTRACT table, but they can insert records in tables that reference the CONTRACT table (Foreign Keys to the CONTRACT table's ID column).
To update my CONTRACT table I first load the text file into a staging table, using a bulk insert, and then I use a MERGE statement to create or update the rows, in batches of 100k records. And here's my problem - during the MERGE statement, because I'm using READ COMMITTED SNAPSHOT isolation, the users can keep viewing the data, but they can't insert anything - the transactions will time out because the CONTRACT table is locked.
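For reference, a rough sketch of that staging-plus-batched-MERGE pattern (table and column names are hypothetical, only a couple of the ~150 columns are shown, and the staging identity used for batching is also an assumption):

    -- Hypothetical names; assumes the staging table has an identity StagingId
    -- that can be used to carve out the 100k-row batches.
    DECLARE @BatchStart int = 1,
            @BatchSize  int = 100000,
            @MaxId      int;

    SELECT @MaxId = ISNULL(MAX(StagingId), 0) FROM dbo.ContractStaging;

    WHILE @BatchStart <= @MaxId
    BEGIN
        MERGE dbo.CONTRACT AS tgt
        USING (SELECT *
               FROM dbo.ContractStaging
               WHERE StagingId BETWEEN @BatchStart AND @BatchStart + @BatchSize - 1) AS src
            ON tgt.ContractNumber = src.ContractNumber
        WHEN MATCHED THEN
            UPDATE SET tgt.Amount = src.Amount,
                       tgt.Status = src.Status
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (ContractNumber, Amount, Status)
            VALUES (src.ContractNumber, src.Amount, src.Status);

        SET @BatchStart = @BatchStart + @BatchSize;   -- each batch commits on its own
    END;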
Question: does anyone know of a way to quickly update this large amount of rows, while enforcing data integrity and without blocking inserts on referencing tables?
I've thought about a few workarounds, but I'm hoping there's a better way:
Drop the foreign keys. - I'd like to enforce my data consistency, so this doesn't sound like a good solution.
Decrease the batch size on the MERGE statement so that each transaction is fast enough not to cause timeouts on other transactions. - I have tried this, but the sync process becomes too slow; as I mentioned above, I receive the update files frequently and it's vital that the updated data is available shortly after.
Create an intermediate table, with a single CONTRACTID column, and have the other tables reference that table instead of the CONTRACT table. This would allow me to update it much faster while keeping decent integrity. - I guess it would work, but it sounds convoluted.
Update:
I ended up dropping my foreign keys. Since the system has been in production for some time and the logs don't ever show foreign key constraint violations, I'm pretty sure no inconsistent data will be created. Thanks to everyone who commented.

Is there a reason for a Primary Key column that would never be used?

I have a routine that will be creating individual tables (SQL Server 2008) to store the results of reports generated by my application (ASP.NET 3.5). Each report will need its own table, as the columns for the table will vary based on the report settings. A table will contain somewhere between 10 and 5,000 rows, rarely more than 10,000.
The following usage rules will apply:
Once stored, the data will never be updated.
Whenever results for the table are accessed, all data will be retrieved.
No other table will need to perform a join with this table.
Knowing this, is there any reason to create a PK index column on the table? Will doing so aid the performance of retrieving the data in any way, and if it would, would this outweigh the extra load of updating the index when inserting data (I know that 10K records is a relatively small amount, but this solution needs to be able to scale).
Update: Here are some more details on the data being processed, which goes into the current design decision of one table per report:
Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
Whenever data is retrieved, some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).
I would also be suspicious of someone claiming that they had to create a new table each time the report was run. However, given that different columns (in number, name and datatype) could conceivably be needed every time the report is run, I don't see a great alternative.
The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), a ReferenceValues column (a varchar field containing all reference values, in a specified order, separated by some delimiter) and a NumericValues column (same as ReferenceValues, but for the numbers), and then, when I retrieve the results, put everything into specialized objects in the system, separating the values based on the defined delimiter. Does this seem preferable?
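For clarity, that alternative single-table shape would look something like this (table and column names are hypothetical):

    -- Hypothetical single results table replacing the per-report tables.
    CREATE TABLE dbo.ReportResultRow (
        ReportVersionID int          NOT NULL,  -- identifies the run; corresponds to another table
        ReferenceValues varchar(max) NOT NULL,  -- all reference values, delimited, in a specified order
        NumericValues   varchar(max) NOT NULL   -- the matching numeric values, same order and delimiter
    );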
Primary keys are not a MUST for any and all data tables. True, they are usually quite useful and to abandon them is unwise. However, in addition to the primary mission of speed (which I agree would doubtfully be positively affected here), there is also that of uniqueness. To that end, and valuing the consideration you've already obviously taken, I would suggest that the only need for a primary key would be to govern the expected uniqueness of the table.
Update:
You mentioned in a comment that if you did add a PK, it would be an identity column that presently does not exist and is not needed. In this case, I would advise against the PK altogether. As @RedFilter pointed out, surrogate keys never add any value.
I would keep it simple: just store the report results converted to JSON or XML in a VARCHAR(MAX) column.
One of the most useful and least emphasized (explicitly) benefits of data integrity (primary keys and foreign key references to start with) is that it forces a 'design by contract' between your data and your application(s); which stops quite a lot of types of bugs from doing any damage to your data. This is such a huge win and a thing that is implicitly taken for granted (it is not 'the database' that protects it, but the integrity rules you specify; forsaking the rules you expose your data to various levels of degradation).
This seems unimportant to you (judging from the fact that you did not even discuss what a possible primary key would be) and your data seems quite unrelated to other parts of the system (judging from the fact that you will not do joins to any other tables); but still - if all things are equal, I would model the data properly, and then, if the primary keys (or other data integrity rules) are not used and you are chasing every last bit of performance, I would consider dropping them in production (and test for any actual gains).
As for comments that creating tables is a performance hit - that is true, but you did not tell us how temporary these tables are. Once created, will they be heavily used before being scrapped? Or do you plan to create tables for just a dozen read operations?
If you will use these tables heavily, and if you provide a clean mechanism for managing them (removing them when not used, selecting them, etc.), I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; a use case would be nice).
Notes on other solutions:
EAV model
is horrible unless very specific conditions are met (for example: flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries you will have to deal with, and rigorous in validating data on the front end).
XML/BLOB approach
might be the right thing for you if you will consume the data as XML/BLOBs at the presentation layer (you always read all of the rows, you always write the whole 'object', and, finally, your presentation layer likes XML/BLOBs).
EDIT:
Also, depending on the usage patterns, having a primary key can indeed increase the speed of retrieval, and if I can read the fact that the data will not be updated as 'it will be written once and read many times', then there is a good chance that it will indeed outweigh the cost of updating the index on inserts.
Will it be one table for every run of a given report, or one table for all runs of a given report? In other words, if you have Report #1 and you run it 5 times, over different ranges of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?
If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. In this case, having a primary key will let you do the WHERE clause for that filter much faster.
If you are creating a new table for every run of the report, then you don't need a primary key. However, you are going to run into other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data/tables.
If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
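A minimal sketch of that single shared table, storing each run as one XML value (names are hypothetical):

    -- One shared table instead of a table per report run (hypothetical names).
    CREATE TABLE dbo.ReportRun (
        ReportRunID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ReportID    int               NOT NULL,
        RunDate     datetime          NOT NULL DEFAULT (GETDATE()),
        ResultData  xml               NOT NULL  -- the whole result set for one run, read back in full
    );

    -- Retrieval is then a single-row lookup:
    -- SELECT ResultData FROM dbo.ReportRun WHERE ReportRunID = @ReportRunID;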
What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.
Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)
If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.
By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.
