I have 2 databases: the first is the main database that many users work on, and the second is a test database loaded from a dump of the main DB.
I have a select query with join conditions and a UNION ALL on a table TAB11 that contains 40 million rows.
The problem is that the query is using the wrong index in the main DB, but in the test DB it is using the correct index. Note that both have freshly gathered statistics on the table and the same row count. I started to dig into histograms and data skew, and I noticed that in the main DB the table has histograms on 37 of its columns, whereas in the test DB only 14 columns have histograms. So apparently those histograms are affecting the query plan and steering it to the wrong index (right?). (Those histograms were created by Oracle, not by anyone manually.)
My question:
- Should I remove the histograms from those columns, so that when I gather statistics again Oracle will create only the needed ones and use them correctly? I am afraid this will affect the performance of the table.
- Should I add method_opt => 'for all columns size skewonly' when I gather the table statistics? I am not sure whether the data is actually skewed.
- Should I gather index statistics on the desired index so that Oracle might use it?
How can I make the query use the right index, without dropping the other index or forcing it with an index hint?
There are too many possible reasons for choosing a different index in one DB vs. another (including object life-cycle differences, e.g. when data gets loaded, deletions/truncations/inserts, stats gathering, index rebuilds, ...). Having said that, in cases like this I usually do a parameter-by-parameter comparison of the initialization parameters on each DB, as well as an object-by-object comparison (you've already observed a delta in the histograms; there may be others that are impacting this).
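For the object-by-object comparison, a query like the one below (run on both DBs; the schema name is a placeholder) shows which columns Oracle built histograms on, and a re-gather with an explicit method_opt is one way to take control of them if you decide they are the culprit:

    -- List the columns on which Oracle has created histograms (run on both DBs)
    SELECT column_name, histogram, num_distinct, num_buckets
      FROM dba_tab_col_statistics
     WHERE owner = 'APP_OWNER'          -- placeholder schema
       AND table_name = 'TAB11'
       AND histogram <> 'NONE'
     ORDER BY column_name;

    -- Regather statistics without histograms (SIZE 1 removes the existing ones),
    -- or with SIZE SKEWONLY to let Oracle keep histograms only where it sees skew.
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname    => 'APP_OWNER',
        tabname    => 'TAB11',
        method_opt => 'FOR ALL COLUMNS SIZE 1',   -- or 'FOR ALL COLUMNS SIZE SKEWONLY'
        cascade    => TRUE);                      -- also gathers index statistics
    END;
    /

Keep in mind that with the default FOR ALL COLUMNS SIZE AUTO preference Oracle may re-create histograms on the next gather; DBMS_STATS.SET_TABLE_PREFS can pin a METHOD_OPT preference for just this table if you want the choice to stick.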
Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still change from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row, and then selecting the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE these few million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. a traditional RDBMS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only selected bytes of an existing object; the entire file object must be uploaded and replaced. So while in, say, SQL Server data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan of the table file, creating an immutable copy of the original with the applicable rows filtered out (deleted) or modified (updated), which then replaces the file just scanned.
So, whether updating 1 row, or 1M rows, at minimum the entirety of the micro-partitions that the modified data exists in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time Travel costs down vs. constantly wiping and rewriting tables. Another consideration is that since Snowflake is column oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would rewrite the S3 files for all columns, which would lower performance.
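A minimal sketch of that differential MERGE, assuming the re-derived rows are staged in a table first (extended_fact, fact_rework, and all column names are placeholders):

    -- Apply only the re-assessed rows to the extended fact table.
    MERGE INTO extended_fact tgt
    USING fact_rework src              -- staging table holding the few million re-derived rows
       ON tgt.fact_id = src.fact_id
    WHEN MATCHED THEN UPDATE SET
         derived_group = src.derived_group,
         derived_label = src.derived_label
    WHEN NOT MATCHED THEN INSERT (fact_id, derived_group, derived_label)
         VALUES (src.fact_id, src.derived_group, src.derived_label);

This way only the micro-partitions containing the matched rows get rewritten, rather than wiping and reloading the whole extended fact table.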
I have a database table which has more than 1 million records uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row will be returned as a single row; sometimes it will be part of a set of rows. A select query does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the database table. I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a column LAST_ACCESSED and update it whenever I select a row from this table. But this is a costly operation for me in terms of the time taken for the whole process: at least 1,000 to 10,000 records will be selected from the table in a single operation. Is there any more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Be careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
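For reference, a rough sketch of enabling and inspecting that segment-level tracking (the schema and table names are placeholders):

    -- Enable heat map tracking (requires the appropriate license)
    ALTER SYSTEM SET heat_map = ON;

    -- Last read/write times per segment (table or partition)
    SELECT object_name, subobject_name,
           segment_read_time, segment_write_time, full_scan, lookup_scan
      FROM dba_heat_map_segment
     WHERE owner = 'APP_OWNER'          -- placeholder schema
       AND object_name = 'MY_TABLE'     -- placeholder table
     ORDER BY subobject_name;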
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
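A hedged sketch of those two purge paths, with placeholder object names and a placeholder predicate for deciding what counts as archivable:

    -- Option 1: In-Database Archiving - mark rows as archived instead of deleting them
    ALTER TABLE my_table ROW ARCHIVAL;
    UPDATE my_table
       SET ora_archive_state = '1'                     -- archived rows become invisible to normal queries
     WHERE created_date < ADD_MONTHS(SYSDATE, -60);    -- placeholder predicate for "not used in 5 years"

    -- Option 2: if the table is partitioned by date, purge whole partitions instead
    ALTER TABLE my_table DROP PARTITION p_2018 UPDATE GLOBAL INDEXES;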
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and similarly the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
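For what it's worth, the final purge with that cross-check can be written as an anti-join; the table and column names below are placeholders for the setup described above:

    -- Delete old rows, except those recorded as still being used
    DELETE FROM big_table b
     WHERE b.created_date < ADD_MONTHS(SYSDATE, -60)   -- older than 5 years
       AND NOT EXISTS (SELECT 1
                         FROM still_used_entries u     -- the separate "still used" table
                        WHERE u.guid = b.guid);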
I use SSMS 2016. I have a view that returns a few million records. The view is not indexed, and should not be, as the underlying data is being updated (insert, delete, update) every 5 minutes by a job on the server in order to display up-to-date data sets in the calling client application's GUI.
The view does a very heavy volume of conversions of INT values to VARCHAR, appending some string values to them.
The view also does some CAST operations on NULLs, assigning them column name aliases. And the worst performance hit is that the view uses the FOR XML PATH('') function on 20 columns.
The view also uses two CTEs as its source, as well as subqueries to define a single column's value.
I made sure I created the right indexes (clustered, nonclustered, composite, and covering) for the columns used in the view's SELECT, JOIN, and WHERE clauses.
The Database Tuning Advisor also has not suggested anything that could substantially improve performance.
As a workaround, I decided to create two identical physical tables with a clustered index on each and keep them updated using a MERGE statement (further converted into a stored procedure and then into a SQL Server Agent job). To ensure there is no long locking of the view, I will then swap (rename) the table names immediately after each merge finishes. So in this case all the heavy workload falls onto the SQL Server Agent job that keeps the tables updated.
The problem is that the merge will take roughly 15 minutes given the current size of the data, which may increase in the future. So I need a near-real-time design to ensure that the view has the most up-to-date information.
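A rough sketch of what that rename swap could look like (object names are placeholders; in practice a synonym swap is an alternative, but this illustrates the idea):

    -- After MERGE has refreshed dbo.ReportData_Staging, swap it with the live table.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.ReportData',         'ReportData_Old';
        EXEC sp_rename 'dbo.ReportData_Staging', 'ReportData';
        EXEC sp_rename 'dbo.ReportData_Old',     'ReportData_Staging';
    COMMIT TRANSACTION;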
Any ideas?
I have a SQL Server 2014 database where I need to page through a very large amount of data to feed a webservice. The problem I currently have is that there are 2 indexes on this table: one has 79 included columns and the other has 80. Both indexes are 9+ GB and they are very similar, but different queries use them. Looking at the data my webservice actually needs, it looks like around 60 columns from the database.
So I want to kill these 2 indexes and create a single index that will serve both of these queries.
I believe I have 2 options:
Create an index that includes these 60 columns - which I know is super large and can slow things down and cause bloat.
OR
Create an index that does not include these columns but then causes lookups to get the needed column data.
I am having a hard time determining which of these is a better approach.
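To make the trade-off concrete, the two options would look roughly like this (index, table, and column names are placeholders; the real key columns should match the paging predicate):

    -- Option 1: one wide covering index (no lookups, but large and slower to maintain)
    CREATE NONCLUSTERED INDEX IX_BigTable_Paging_Covering
        ON dbo.BigTable (CustomerId, CreatedDate)
        INCLUDE (Col1, Col2, Col3 /* ... the ~60 columns the webservice reads */);

    -- Option 2: a narrow key-only index; the remaining columns come via key lookups
    CREATE NONCLUSTERED INDEX IX_BigTable_Paging_Narrow
        ON dbo.BigTable (CustomerId, CreatedDate);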
I have the following problem. I have a SQL Server database with a total size of 3 GB. The contents of this database are used for analysis in a data cube. I want to test the performance of this data cube with a database size of 30 GB. What is the best way to do this? Duplicate the content of the database multiple times? In that case the foreign keys would be a real problem, because I want to keep the relationships between the tuples.
Thanks in advance.
I found a tricky way to do it. Here is a sketch of my approach:
Turn off the identities in the tables whose content has to be duplicated
Turn off the constraint check for all tables
Make a copy of the DB and work with the copy from now on
Increase all IDs in the relevant tables by a high value, for example by 10,000,000 (see the sketch after these steps)
Merge the manipulated DB copy with the original DB
Make a copy of the merged DB, perform step 4 again with a smaller increment (for example 5,000,000), and merge it with the original DB again
Repeat steps 3 to 5 until the desired DB size is reached
After you finish, don't forget to turn the identities and constraint checks back on for all affected tables
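As an illustration, here is one way the identity/constraint handling and the ID shift could be expressed for a single table in T-SQL (database, table, and column names are placeholders):

    -- Suspend FK checks on the target copy during the load
    ALTER TABLE TargetDB.dbo.Movies NOCHECK CONSTRAINT ALL;

    SET IDENTITY_INSERT TargetDB.dbo.Movies ON;               -- allow explicit values for the identity column
    INSERT INTO TargetDB.dbo.Movies (MovieId, Title, ReleaseYear)
    SELECT MovieId + 10000000, Title, ReleaseYear             -- shift IDs by a high offset to avoid collisions
      FROM SourceDB.dbo.Movies;
    SET IDENTITY_INSERT TargetDB.dbo.Movies OFF;

    ALTER TABLE TargetDB.dbo.Movies CHECK CONSTRAINT ALL;     -- re-enable the checks afterwards

Child tables referencing MovieId would need the same offset applied to their foreign key columns so the relationships still line up.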
Using the described approach I was able to increase the number of movies in my DB from 5 to 60 in four iterations and successfully perform the necessary load test.
Be careful when choosing the values for the increment! First I used 10,000,000, then 5,000,000, then 2,000,000, and finally 1,000,000 to avoid ID collisions.