Note: Oracle 11gR2 Standard Edition (so no partitioning)
I have to build a process for reporting off a table containing about 27 million records. The dilemma I'm facing is that I can't create my own indexes on this table, as it's a third-party table that we can't alter. So I started experimenting with two alternatives: a materialized view, on which I can create my own indexes, or a physical table that would basically just be a duplicate I'd truncate and repopulate on demand.
The advantage of the materialized view is that it pulls from the "live" table, so I don't have to worry about discrepancies as long as I refresh it before use; the problem is that the refresh takes a significant amount of time. I then tried the physical table approach: truncating and repopulating took around 10 minutes, and rebuilding the indexes took another 10, give or take. I also tried loading only "new" records by performing:
INSERT ... SELECT ... WHERE NOT EXISTS (SELECT 1 FROM copy_table c WHERE c.pk = source.pk)
which also takes almost 10 minutes, regardless of my indexes, parallelism, etc.
Has anyone had to deal with this amount of data (which will keep growing) and found an approach that performs well and works efficiently?
It seems a plain view won't do, so I'm left with those two options because I can't tweak indexes on my primary table; any tips or suggestions would be greatly appreciated. The whole purpose of this process was to make reporting "faster", but where I'm gaining performance in some areas, I end up losing it in others because of the amount of data I need to move around. Are there other options aside from:
Truncate / Populate Table, Rebuild indexes
Populate secondary table from primary table where PK not exist
Materialized view (Refresh, Rebuild indexes)
View that pulls from Live table (No new indexes)
Thanks in advance for any suggestions.....
Does anyone know if doing a "Create Table As Select..." performs better than an "Insert... Select" if I mark my indexes and such unusable when doing the insert in the second option, or should it be fairly similar?
I think that there's a lot to be said for a very simple approach to this sort of task. Consider a truncate and direct path (append) insert into the duplicate table without disabling/rebuilding indexes, with NOLOGGING set on the table. The direct path insert has an index maintenance mechanism associated with it that is possibly more efficient than running multiple index rebuilds post-load: it logs the data required to build the indexes in temporary segments and thus avoids multiple subsequent full table scans.
If you do want to experiment with index disable/rebuild then try rebuilding all the indexes at the same time without query parallelism, as only one physical full scan will be used -- the rest of the scans will be "parasitic" in that they'll read the table blocks from memory.
When you load the duplicate table consider ordering the rows in the select so that commonly used predicates on the reports are able to access fewer blocks. For example if you commonly query on date ranges, order by the date column. Remember that a little extra time spent in building this report table can be recovered in reduced report query execution time.
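As a minimal sketch of that load (the table names report_copy and src_tab, and the report_date ordering column, are hypothetical):

ALTER TABLE report_copy NOLOGGING;

TRUNCATE TABLE report_copy;

-- Direct-path (append) insert: index maintenance is deferred and done in bulk
-- at the end of the load rather than row by row.
INSERT /*+ APPEND */ INTO report_copy
SELECT *
FROM   src_tab
ORDER  BY report_date;  -- cluster rows around the most common report predicate

COMMIT;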
Consider compressing the table as well, though unless you have the pricey Advanced Compression option, compression only applies to data loaded with direct path insert. Index compression and bitmap indexes are also worth considering.
Also, consider not analyzing the reporting table. Report queries commonly use multiple predicates that are not well estimated using conventional statistics, and you have to rely on dynamic sampling for good cardinality estimates anyway.
"Create Table As Select" generate lesser undo. That's an advantage.
When data is "inserted" indexes also are maintained and performance is impacted negatively.
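As a rough sketch (table names are hypothetical), the CTAS variant of the load would look like:

-- Drop and recreate the copy each cycle; CTAS generates little undo,
-- and NOLOGGING keeps redo to a minimum.
CREATE TABLE report_copy NOLOGGING PARALLEL 4
AS
SELECT * FROM src_tab;

-- Indexes then have to be created afterwards, since CTAS builds a brand new table.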
Related
I have an Audit table in a SQL Server database with the following columns:
Sequence --- bigint (primary key)
TableName --- varchar(50) (nonclustered index)
ColumnName --- varchar(50) (nonclustered index)
Control --- char(10) (nonclustered index)
BeforeValue --- varchar(500) (nonclustered index)
AfterValue --- varchar(500)
DateChanged --- datetime
ChangedBy --- char(20)
CompanyCode --- char(5) (nonclustered index)
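For reference, that corresponds roughly to the following definition (types taken from the list above; the constraint name is a placeholder):

CREATE TABLE dbo.Audit (
    Sequence     bigint       NOT NULL CONSTRAINT PK_Audit PRIMARY KEY,
    TableName    varchar(50)  NULL,
    ColumnName   varchar(50)  NULL,
    Control      char(10)     NULL,
    BeforeValue  varchar(500) NULL,
    AfterValue   varchar(500) NULL,
    DateChanged  datetime     NULL,
    ChangedBy    char(20)     NULL,
    CompanyCode  char(5)      NULL
);
-- plus nonclustered indexes on TableName, ColumnName, Control,
-- BeforeValue and CompanyCode, as noted above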
It has 5 billion+ rows of data. Around 200+ triggers insert data into this table, and around 50+ stored procedures insert into and query it. Whenever a column is updated or deleted in any of the 200+ tables of the transactional database, a corresponding row is inserted into the Audit table.
I inherited this table recently. We have been experiencing performance issues lately, and I have been told to redesign this Audit table to address them.
I am looking for suggestions, next steps, and performance metrics ideas; any help will be appreciated.
Thanks in advance.
I think you do not need to change much, but just need to redesign your process as follows:
Create an archive table exactly the same as your current audit table.
Transfer everything in your current audit table into this new archive table, so the current audit table starts out empty.
Schedule a daily (or weekly) job to move data from your audit table to the archive table.
Based on your data retention policy, clean up your archive table.
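A sketch of step 3, assuming the archive table is named dbo.AuditArchive, a 30-day retention cutoff, and batches of 10,000 rows (all of these are placeholders) so each transaction stays short:

DECLARE @cutoff datetime;
SET @cutoff = DATEADD(DAY, -30, GETDATE());

WHILE 1 = 1
BEGIN
    -- Move one batch: OUTPUT captures the deleted rows into the archive table.
    DELETE TOP (10000) a
    OUTPUT deleted.* INTO dbo.AuditArchive
    FROM dbo.Audit AS a
    WHERE a.DateChanged < @cutoff;

    IF @@ROWCOUNT = 0 BREAK;
END;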
But for the performance issue, you have to confirm whether inserting into your audit table really is the culprit. If it is, then the approach above may relieve your pain; otherwise, it may not help.
I had to do something similar. If you must maintain the 5 billion records, then your best solution is to partition the table. You will need to do the following:
Create partition function
Create partition scheme
Create partition filegroups/files
The partition function will govern how the data is partitioned: typically it is either by row count (e.g., sequence) or by date (e.g., monthly, quarterly, yearly, etc.). Note: partitioning is only available in the Enterprise edition of SQL Server (prior to SQL Server 2016 SP1).
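For illustration, a monthly scheme on the DateChanged column might look something like this (boundary dates and the filegroup mapping are placeholders):

CREATE PARTITION FUNCTION pfAuditByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01');

CREATE PARTITION SCHEME psAuditByMonth
AS PARTITION pfAuditByMonth ALL TO ([PRIMARY]);

-- The table's clustered index (and any aligned indexes) is then created or
-- rebuilt ON psAuditByMonth (DateChanged) so the data is physically split by month.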
https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes
I highly recommend that you first read Microsoft's white paper on this before doing anything. Also, once you implement it, this should be done over the weekend to allow for processing as I imagine it will take some time to complete.
https://technet.microsoft.com/en-us/library/dd578580(v=sql.100).aspx
For comparison's sake, a query that would not complete after 10 minutes of run time in the pre-partitioned state completed within 10 seconds after partitioning the table. Note: once the table is partitioned, you will need to tune your queries to include the partitioning column in the predicate; otherwise, you probably won't notice much of a difference in response times.
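For example (the date range and company code are made up), a query shaped like this lets the engine touch only one partition:

SELECT Sequence, TableName, ColumnName, BeforeValue, AfterValue
FROM   dbo.Audit
WHERE  DateChanged >= '2018-02-01'
  AND  DateChanged <  '2018-03-01'
  AND  CompanyCode = 'C0001';
-- Filtering on CompanyCode alone would still scan every partition.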
You're not entirely clear on where the performance issues are: in reading the table, or in writing to it? Typically, if the issue is in reading, you'll notice a particular area of the system that is slow because of those reads; if the issue is in writing, the impact is more subtle but much more widespread, slowing everything down to varying degrees instead of just certain hotspot areas.
Another unclear point is whether this is primarily write-only data or heavily read data. Audit tables are typically write-optimized, to be as fast as possible at adding new data while slower to read from when you do need the data (typically we're writing far more often than reading: the opposite of "normal" transactional rows in an RDBMS).
Thus, the first step is to determine where the performance issues are, and gather some clues from there to determine whether you need to be optimizing the read portion or the write portion of the table.
If it is indeed a write performance issue (a bit harder to profile and recognize than read issues), I would look at dropping some of the indexes on the table. For a table that ought to be write-optimized, there are a lot of indexes here, each adding overhead to every write operation (remember that each index needs to be maintained with every write to the table, so while indexes are great for reading, they take away write performance, especially nonclustered indexes).
If it is a read issue, I might still opt to write-optimize anyway, then devise some means to improve read performance as best as possible once the table is write-optimized. Other answers here address that in particular, but without being more familiar with the system requirements, it's hard to say which direction is the best one to move in.
I'm interested to hear other developers' views on creating and loading data, as the current site I'm working on has a completely different take on DWH loading.
The protocol currently used to load a fact table has a number of steps:
Drop old table
Recreate Table with no PK/Clustered Index
Load cleaned/new data
Create PK & Indexes
I'm wondering how much work really goes on under the covers in step 4. The data is loaded without a clustered index, so I'm assuming that the natural order of the data load defines its order on disk. When step 4 creates a (clustered) primary key, it will re-order the data on disk into that order. Would it not be better to load the data with the PK/clustered index already defined, thereby reducing server workload?
When inserting a large number of records, the overhead of updating an existing index can often be larger than simply creating it from scratch. The performance gain comes from inserting into a heap, which is the most efficient way to get data into a table.
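As a rough sketch of that pattern (the fact table, columns and staging source are all hypothetical; minimal logging also depends on the recovery model):

-- Load into an unindexed heap
CREATE TABLE dbo.FactSales (
    SaleID   int           NOT NULL,
    SaleDate datetime      NOT NULL,
    Amount   decimal(18,2) NOT NULL
);

INSERT INTO dbo.FactSales WITH (TABLOCK)  -- TABLOCK allows minimal logging on a heap
SELECT SaleID, SaleDate, Amount
FROM   staging.CleanedSales;

-- Build the clustered PK (and any other indexes) once, at the end
ALTER TABLE dbo.FactSales
    ADD CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SaleID);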
The only way to know whether your import strategy is faster with the indexes left intact is to test both approaches in your own environment and compare.
In my opinion, indexes are good for SELECTs and may be bad for DML operations.
If you are loading a huge amount of data, the indexes need to be updated for every insert. This can drag performance down, sometimes beyond acceptable limits.
How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have a table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
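One concrete way to do that monitoring (the table name is a placeholder) is to compare how often each index is read versus written after it has seen a representative workload:

SELECT i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups,  -- read benefit
       s.user_updates                               -- write cost
FROM   sys.dm_db_index_usage_stats AS s
JOIN   sys.indexes AS i
  ON   i.object_id = s.object_id
 AND   i.index_id  = s.index_id
WHERE  s.database_id = DB_ID()
  AND  s.object_id   = OBJECT_ID('dbo.MyTable');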
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete within a certain time.
Case 1: if the table is static and queried heavily (e.g. an item table in a shopping cart application), then indexes on the appropriate fields are highly beneficial.
Case 2: if the table is highly dynamic and not queried much on a daily basis (e.g. log tables used for auditing purposes), then indexes will slow down the writes.
Treat those two cases as the boundary cases: whether or not to build indexes on a table depends on which of them the table in question comes closest to.
If in doubt, leave it to the judgement of the Database Engine Tuning Advisor. Good luck.
I have a SQL Server 2008 database with 30,000,000,000 records in one of its major tables. Now we are looking at the performance of our queries. We have already created all the indexes we can. I found that we can split our database tables into multiple partitions, so that the data is spread over multiple files and query performance increases.
But unfortunately this functionality is only available in SQL Server Enterprise edition, which is unaffordable for us.
Is there any way to optimize query performance? For example, the query
select * from mymajortable where date between '2000/10/10' and '2010/10/10'
takes around 15 min to retrieve around 10000 records.
A SELECT * will obviously be less efficiently served than a query that uses a covering index.
First step: examine the query plan and look for any table scans and for the steps taking the most effort (%).
If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
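For example (the index name and included columns are placeholders), a covering index for the date-range query might look like:

CREATE NONCLUSTERED INDEX IX_mymajortable_date
    ON dbo.mymajortable ([date])
    INCLUDE (Col1, Col2, Col3);  -- only the few columns the report actually needs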
You could break your data up into separate tables (say by a date range) and combine via a view.
It is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth)
Suggest you post your table and index definitions.
First, always avoid SELECT *, as that will cause the select to fetch all columns; even if there is an index with just the columns you need, you are still fetching a lot of unnecessary data. Using only the exact columns you need to retrieve lets the server make better use of indexes.
Secondly, have a look at included columns for your indexes; that way, frequently requested data can be included in the index to avoid having to fetch the rows.
Third, you might try using an int column for the date and converting the date into an int. Ints are usually more effective in range searches than dates, especially if you have time information too; and if you can skip the time information, the index will be smaller.
One more thing to check is the execution plan the server uses. You can see this in Management Studio if you enable the execution plan display in the menu. It can indicate where the problem lies: you can see which indexes it tries to use, and sometimes it will suggest new indexes to add.
It can also indicate other problems. A table scan or index scan is bad, as it means the whole table or index has to be scanned, while an index seek is good.
It is a good source to understand how the server works.
If you add an index on date, you will probably speed up your query thanks to an index seek plus key lookup instead of a clustered index scan. But if your filter on date returns too many records, the index will not help you at all, because the key lookup is executed for each row returned by the index seek; SQL Server will then switch to a clustered index scan.
To get the best performance you need to create a covering index, that is, include all the columns you need in the "included columns" part of your index; but that will not help you if you use SELECT *.
Another issue with the SELECT * approach is that you can't use the plan cache or the execution plans in an efficient way. If you really need all the columns, make sure you specify all of them instead of using *.
You should also fully qualify the object name to make sure your plan is reusable.
You might consider creating an archive database and moving anything older than, say, 10-20 years into it. This should drastically speed up your primary production database while retaining all of your historical data for reporting needs.
What type of queries are we talking about?
Is this a production table? If yes, look into normalizing a bit more and see whether you can take the normalization of the DB a bit further.
If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.
I would create a DW with separate pre-processed reports that include all the calculation and aggregation you could expect.
I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.
I have to design a database to store log data, but I don't have prior experience with this. My table contains about 19 columns (about 500 bytes per row) and grows by up to 30,000 new rows daily. My app must be able to query this table efficiently.
I'm using SQL Server 2005.
How can I design this database?
EDIT: the data I want to store includes a lot of types: datetime, string, short and int. NULL cells are about 25% of the total :)
However else you'll do lookups, a logging table will almost certainly have a timestamp column. You'll want to cluster on that timestamp first to keep inserts efficient. That may mean also always constraining your queries to specific date ranges, so that the selectivity on your clustered index is good.
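A minimal sketch of that layout, with hypothetical column names and types:

CREATE TABLE dbo.AppLog (
    LogID    bigint IDENTITY(1,1) NOT NULL,
    LoggedAt datetime             NOT NULL DEFAULT GETDATE(),
    Source   varchar(100)         NULL,
    Severity smallint             NULL,
    Message  varchar(500)         NULL
    -- ...remaining columns...
);

-- Cluster on the timestamp so inserts append to the end of the index.
CREATE CLUSTERED INDEX CIX_AppLog_LoggedAt ON dbo.AppLog (LoggedAt);

-- A surrogate key can still be enforced with a nonclustered primary key.
ALTER TABLE dbo.AppLog ADD CONSTRAINT PK_AppLog PRIMARY KEY NONCLUSTERED (LogID);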
You'll also want indexes for the fields you'll query on most often, but don't jump the gun here. You can add the indexes later. Profile first so you know which indexes you'll really need. On a table with a lot of inserts, unwanted indexes can hurt your performance.
Well, given the description you've provided all you can really do is ensure that your data is normalized and that your 19 columns don't lead you to a "sparse" table (meaning that a great number of those columns are null).
If you'd like to add some more data (your existing schema and some sample data, perhaps) then I can offer more specific advice.
Throw an index on every column you'll be querying against.
Huge amounts of test data, and execution plans (with query analyzer) are your friend here.
In addition to the comment on sparse tables, you should index the table on the columns you wish to query.
Alternatively, you could test it using the profiler and see what the profiler suggests in terms of indexing based on actual usage.
Some optimisations you could make:
Cluster your data based on the most likely look-up criteria (e.g. clustered primary key on each row's creation date-time will make look-ups of this nature very fast).
Assuming that rows are written one at a time (not in batch) and that each row is inserted but never updated, you could code all select statements to use the "with (NOLOCK)" option. This will offer a massive performance improvement if you have many readers as you're completely bypassing the lock system. The risk of reading invalid data is greatly reduced given the structure of the table.
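For example (table and column names are hypothetical), a typical reader under that approach would look like:

SELECT TOP (100) LoggedAt, Source, Severity, Message
FROM   dbo.AppLog WITH (NOLOCK)  -- no shared locks taken; acceptable for insert-only rows
WHERE  LoggedAt >= DATEADD(HOUR, -1, GETDATE())
ORDER  BY LoggedAt DESC;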
If you're able to post your table definition I may be able to offer more advice.