MSSQL: Creating and loading data

I'm interested to hear other developers views on creating and loading data as the current site I'm working on has a completely different take on DWH loading.
The protocol currently used to load a fact table has a number of steps:
1. Drop the old table
2. Recreate the table with no PK/clustered index
3. Load the cleaned/new data
4. Create the PK and indexes
I'm wondering how much work really goes on under the covers in step 4. The data are loaded without a clustered index, so I'm assuming that the natural order of the data load defines its order on disk. When step 4 creates a (clustered) primary key, it will reorder the data on disk into key order. Would it not be better to load the data with the PK/clustered index already defined, thereby reducing the server workload?
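The four steps above might look like this in T-SQL (the table and column names are illustrative, not from the original post):

```sql
-- Hypothetical fact table; all names are invented for illustration.
DROP TABLE dbo.FactSales;

CREATE TABLE dbo.FactSales (
    SaleID   INT   NOT NULL,   -- no PK/clustered index yet: this is a heap
    SaleDate DATE  NOT NULL,
    Amount   MONEY NOT NULL
);

-- Bulk load into the heap; rows land in natural load order
INSERT INTO dbo.FactSales (SaleID, SaleDate, Amount)
SELECT SaleID, SaleDate, Amount
FROM   staging.CleanedSales;

-- Step 4: creating the clustered PK rewrites the heap in key order
ALTER TABLE dbo.FactSales
    ADD CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SaleID);
```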

When inserting a large number of records, the overhead of updating the index can often be larger than that of simply creating it from scratch. The performance gain comes from inserting into a heap, which is the most efficient way to get data into a table.
The only way to know whether your import strategy is faster with the indexes left intact is to test both approaches in your own environment and compare.
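A rough sketch of such a comparison in SQL Server (table names are hypothetical): SET STATISTICS TIME reports the elapsed time of each statement, so you can time both strategies side by side.

```sql
SET STATISTICS TIME ON;

-- Strategy A: load into a heap, build the clustered PK afterwards
TRUNCATE TABLE dbo.FactSales_A;
INSERT INTO dbo.FactSales_A
SELECT * FROM staging.CleanedSales;
ALTER TABLE dbo.FactSales_A
    ADD CONSTRAINT PK_A PRIMARY KEY CLUSTERED (SaleID);

-- Strategy B: clustered PK already exists on this table, load in place
TRUNCATE TABLE dbo.FactSales_B;
INSERT INTO dbo.FactSales_B
SELECT * FROM staging.CleanedSales;

SET STATISTICS TIME OFF;
-- Compare the elapsed times reported for each strategy's statements.
```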

My take: indexes are good for SELECTs, but can be bad for DML operations.
If you are loading a huge amount of data, the indexes must be updated on every insert. This can drag down performance, sometimes beyond acceptable limits.

Related

SQL Server: Many columns in a table vs Fewer columns in two tables

I have a database table (called Fields) which has about 35 columns. 11 of them always contain the same constant values across roughly 300,000 rows, and act as metadata.
The downside of this structure is that when I need to update those 11 column values, I need to update all 300,000 rows.
I could move all the common data into a different table and update it only once, in one place, instead of in 300,000 places.
However, if I do that, then when I display the fields I need an INNER JOIN between the two tables, which I know makes the SELECT statement slower.
I should say that updating the columns occurs much more rarely than reading (displaying) the data.
How would you suggest I store the data to get the best performance?
I could move all the common data into a different table and update it only once, in one place, instead of in 300,000 places.
I.e. sane database design and standard normalization.
This is not about "many empty fields"; it is brutally about tons of redundant data. Constants should be isolated in a separate table. This may also make things faster: it allows the database to use memory more efficiently, because your database is a lot smaller.
I would suggest going with a separate table unless you've concealed something significant (of course it would be better to try and measure, but I suspect you already know the answer).
You can actually get faster SELECTs as well: joining a small table is cheaper than fetching the same data 300,000 times.
This is a classic example of denormalized design. Denormalization is sometimes done for (SELECT) performance, but always in a deliberate, measurable way. Have you actually measured whether you gain any performance from it?
If your data fits into cache, and/or the JOIN is unusually expensive [1], there may well be some performance benefit to avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing I/O and likely reversing any gains you reaped from avoiding the JOIN; you might actually lose performance.
And of course, getting incorrect data quickly is useless. Denormalization makes your database less resilient to data inconsistencies [2], and the performance difference would have to be pretty dramatic to justify this risk.
[1] Which doesn't appear to be the case here.
[2] E.g. have you considered what happens in a concurrent environment, where one application modifies existing rows while another inserts a new row with the old values (the first application hasn't committed yet, so there is no way for the second to know there was a change)?
The best approach is to separate the data out into a second table containing those 11 columns, treat it as a master data table, and give it a primary key.
That primary key can then be referenced as a foreign key from the 300,000 rows in the first table.
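A sketch of that split (all table and column names here are invented for illustration):

```sql
-- The 11 constant "metadata" columns move to a small master table
CREATE TABLE FieldsMeta (
    MetaID  INT PRIMARY KEY,
    Project VARCHAR(50),
    Version VARCHAR(20)
    -- ... remaining metadata columns
);

-- The per-row columns stay, plus a foreign key to the master table
CREATE TABLE Fields (
    FieldID INT PRIMARY KEY,
    MetaID  INT NOT NULL REFERENCES FieldsMeta (MetaID),
    Value   VARCHAR(100)
    -- ... remaining per-row columns
);

-- Updating the metadata now touches one row instead of 300,000:
UPDATE FieldsMeta SET Version = '2.0' WHERE MetaID = 1;

-- Reads join the small table back in:
SELECT f.FieldID, f.Value, m.Project, m.Version
FROM   Fields f
INNER JOIN FieldsMeta m ON m.MetaID = f.MetaID;
```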

Oracle Materialized view VS Physical Table

Note: Oracle 11gR2 Standard version (so no partitioning)
So I have to build a process to generate reports from a table containing about 27 million records. The dilemma I'm facing is that I can't create my own indexes on this table, as it's a 3rd-party table that we can't alter. So I started experimenting with materialized views, on which I can then create my own indexes, or a physical table that would basically just be a duplicate I'd truncate and repopulate on demand.
The advantage of the materialized view is that it pulls from the "live" table, so I don't have to worry about discrepancies as long as I refresh it before use; the problem is the refresh seems to take a significant amount of time. I then tried the physical-table approach: truncating and repopulating took around 10 minutes, and rebuilding the indexes takes another 10, give or take. I also tried loading only "new" records by performing:
INSERT ... SELECT ... WHERE NOT EXISTS (SELECT 1 FROM target WHERE target.pk = source.pk)
which also takes almost 10 minutes, regardless of my indexes, parallelism, etc.
Has anyone had to deal with this amount of data (which will keep growing) and found an approach that performs well and works efficiently?
It seems a plain view won't do, so I'm left with those two options, because I can't tweak the indexes on my primary table; any tips or suggestions would be greatly appreciated. The whole purpose of this process was to make reporting faster, but where I gain performance in some areas, I end up losing in others, given the amount of data I need to move around. Are there other options aside from:
Truncate / Populate Table, Rebuild indexes
Populate secondary table from primary table where PK not exist
Materialized view (Refresh, Rebuild indexes)
View that pulls from Live table (No new indexes)
Thanks in advance for any suggestions.
Does anyone know whether a "CREATE TABLE AS SELECT" performs better than an "INSERT ... SELECT" if I mark my indexes unusable when doing the insert in the second option, or should they be fairly similar?
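For reference, the materialized-view option might be set up along these lines (all object names are illustrative, not from the question):

```sql
-- On-demand complete-refresh MV over the unalterable 3rd-party table
CREATE MATERIALIZED VIEW report_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS SELECT * FROM third_party.big_table;

-- Indexes the base table lacks can be created on the MV itself
CREATE INDEX report_mv_date_ix ON report_mv (txn_date);

-- Refresh before running reports ('C' = complete refresh)
EXEC DBMS_MVIEW.REFRESH('REPORT_MV', method => 'C');
```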
I think there's a lot to be said for a very simple approach to this sort of task. Consider a truncate and direct-path (append) insert on the duplicate table without disabling/rebuilding indexes, with NOLOGGING set on the table. The direct-path insert has an index maintenance mechanism associated with it that is possibly more efficient than running multiple index rebuilds post-load: it logs the data required to build the indexes in temporary segments, and thus avoids multiple subsequent full table scans.
If you do want to experiment with index disable/rebuild then try rebuilding all the indexes at the same time without query parallelism, as only one physical full scan will be used -- the rest of the scans will be "parasitic" in that they'll read the table blocks from memory.
When you load the duplicate table consider ordering the rows in the select so that commonly used predicates on the reports are able to access fewer blocks. For example if you commonly query on date ranges, order by the date column. Remember that a little extra time spent in building this report table can be recovered in reduced report query execution time.
Consider compressing the table also, but only if you're loading with direct path insert unless you have the pricey Advanced Compression option. Index compression and bitmap indexes are also worth considering.
Also, consider not analyzing the reporting table. Report queries commonly use multiple predicates that are not well estimated using conventional statistics, and you have to rely on dynamic sampling for good cardinality estimates anyway.
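Putting those suggestions together, the reload might look roughly like this (object names are assumptions, not from the question):

```sql
-- Reduce redo for the direct-path load
ALTER TABLE report_copy NOLOGGING;

TRUNCATE TABLE report_copy;

-- Direct-path insert; ORDER BY clusters rows for the common
-- date-range predicates used by the reports
INSERT /*+ APPEND */ INTO report_copy
SELECT *
FROM   third_party.big_table
ORDER  BY txn_date;

-- A direct-path insert must be committed before the table is queried
COMMIT;
```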
"Create Table As Select" generate lesser undo. That's an advantage.
When data is "inserted" indexes also are maintained and performance is impacted negatively.
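A sketch of the CTAS variant (names are illustrative): the copy is built in one statement, with NOLOGGING also reducing redo.

```sql
-- Build (or rebuild, after a DROP) the reporting copy in one pass;
-- indexes would be created afterwards on the new table.
CREATE TABLE report_copy NOLOGGING AS
SELECT *
FROM   third_party.big_table
ORDER  BY txn_date;
```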

Oracle - Do you need to calculate statistics after creating index or adding columns?

We use an Oracle 10.2.0.5 database in Production.
Optimizer is in "cost-based" mode.
Do we need to calculate statistics (DBMS_STATS package) after:
creating a new index
adding a column
creating a new table
?
Thanks
There's no short answer. It totally depends on your data and how you use it. Here are some things to consider:
As @NullUserException pointed out, statistics are automatically gathered, usually every night. That's usually good enough; in most (OLTP) environments, if you just added new objects they won't contain a lot of data before the stats are automatically gathered. The plans won't be that bad, and if the objects are new they probably won't be used much right away.
creating a new index - No. "Oracle Database now automatically collects statistics during index creation and rebuild".
adding a column - Maybe. If the column will be used in joins and predicates you probably want stats on it. If it's just used for storing and displaying data it won't really affect any plans. But, if the new column takes up a lot of space it may significantly alter the average row length, number of blocks, row chaining, etc., and the optimizer should know about that.
creating a new table - Probably. Oracle is able to compensate for missing statistics through dynamic sampling, although this often isn't good enough. Especially if the new table has a lot of data; bad statistics almost always lead to under-estimating the cardinality, which will lead to nested loops when you want hash joins. Also, even if the table data hasn't changed, you may need to gather statistics one more time to enable histograms. By default, Oracle creates histograms for skewed data, but will not enable those histograms if those columns haven't been used as a predicate. (So this applies to adding a new column as well). If you drop and re-create a table, even with the same name, Oracle will not maintain any of that column use data, and will not know that you need histograms on certain columns.
Gathering optimizer statistics is much more difficult than most people realize. At my current job, most of our performance problems are ultimately because of bad statistics. If you're trying to come up with a plan for your system you ought to read the Managing Optimizer Statistics chapter.
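For reference, a typical manual DBMS_STATS call might look like this (the schema and table names are placeholders):

```sql
BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'APP_OWNER',
      tabname          => 'NEW_TABLE',
      cascade          => TRUE,                         -- include the table's indexes
      method_opt       => 'FOR ALL COLUMNS SIZE AUTO',  -- histograms where column usage warrants
      estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
END;
/
```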
Update:
There's no need to gather statistics for empty objects; dynamic sampling will work just as quickly as reading stats from the data dictionary. (Based on a quick test hard parsing a large number of queries with and without stats.) If you disable dynamic sampling then there may be some weird cases where the Oracle default values lead to inaccurate plans, and you would be better off with statistics on an empty table.
I think the reason Oracle automatically gathers stats for indexes at creation time is because it doesn't cost much extra. When you create an index you have to read all the blocks in the table, so Oracle might as well calculate the number of levels, blocks, keys, etc., at the same time.
Table statistics can be more complicated, and may require multiple passes of the data. Creating an index is relatively simple compared to the arbitrary SQL that may be used as part of a create-table-as-select. It may not be possible, or efficient, to take those arbitrary SQL statements and transform them into a query that also returns the information needed to gather statistics.
Of course it wouldn't cost anything extra to gather stats for an empty table. But it doesn't gain you anything either, and it would just be misleading to anyone who looks at USER_TABLES.LAST_ANALYZED: the table appears to be analyzed, but not with any meaningful data.

Is it better to create Oracle SQL indexes before or after data loading?

I need to populate a table with a huge amount of data (many hours of loading) in an Oracle database, and I was wondering which would be faster: creating an index on the table before loading it, or after. I initially thought that inserting into an indexed table is penalized, but if I create the index on the full table, that will take a long time too. Which is best?
Creating indexes after loading the data is much faster. If you load data into a table with indexes, the loading will be very slow because of the constant index updates. If you create the index later, it can be efficiently populated just once (which may of course take some time, but the grand total should be smaller).
Similar logic applies to constraints. Also enable those later (unless you expect data to fail the constraints and want to know that early on).
The only reason why you might want to create the index first is to enforce unique constraints. Otherwise, loading is much faster with a naked table - no indexes, no constraints, no triggers enabled.
Creating the index after the data load is the recommended practice for bulk loads. You must be sure about the quality of the incoming data, though, especially if you are using unique indexes: without the index in place, the validation a unique index would normally perform does not happen.
Another issue to consider is whether this is a one-time load or a regular affair. If it is regular, you can drop the indexes before each data load and recreate them after a successful load.
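A sketch of that regular-load cycle (all object names are invented for illustration):

```sql
-- Before the load: strip indexes and constraints for a "naked" table
ALTER TABLE big_load DROP CONSTRAINT big_load_pk;
DROP INDEX big_load_date_ix;

-- ... bulk load here ...

-- After a successful load: rebuild once, in bulk.
-- Note that duplicates in the incoming data will only surface
-- when the PK constraint is re-enabled.
CREATE INDEX big_load_date_ix ON big_load (load_date);
ALTER TABLE big_load ADD CONSTRAINT big_load_pk PRIMARY KEY (id);
```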

What are some best practices and "rules of thumb" for creating database indexes?

I have an app, which cycles through a huge number of records in a database table and performs a number of SQL and .Net operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the fields and, as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of DBMS performance, I want to make better-educated choices about what to index in all my projects.
I understand that there is a detriment to insert performance (as the database needs to update the index as well as the data), but what suggestions and best practices should I consider when creating database indexes? How do I best select the fields, or combinations of fields, for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash, a GiST, or a GIN index (and what are they, anyway)?
Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If the query for a table only uses a few columns, put all those columns into an index, then you can help the RDBMS to only use the index.
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to less. This means: first index the columns with more distinct values, followed by columns with fewer distinct values.
If a query needs to access more than 10% of the data, a full scan is normally better than an index.
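As a small illustration of the cardinality rule above (table and column names are hypothetical):

```sql
-- customer_id has ~100,000 distinct values; status has only 3.
-- Per the rule of thumb, the high-cardinality column goes first:
CREATE INDEX orders_cust_status_ix ON orders (customer_id, status);
```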
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index, I've always felt that the designer ought to know the business well enough to take a very good first pass at which columns are likely to benefit. Other than that, it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal performance metrics to capture long-running operations).
As @David Aldridge mentioned, the majority of databases perform many more reads than writes, and in addition, appropriate indexes will often be used even when performing INSERTs (to determine the correct place to insert).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, I will create indexes on all foreign keys, profile under a workload (think UAT release), then remove the indexes that are not being used and add any important missing ones.
Also, make sure a scheduled index-maintenance plan is put in place.
