I have a database with one critical table that is widely used and queried. The lifetime of the data in this table is divided into two stages, Table1 and Table1_Hist: once the user has finished their work, the record is moved from Table1 to Table1_Hist for consultation and reports.
The structure of Table1 is
ID (Long) KEY,
Priority (INT),
val1 (VARBINARY(XXXX)),
val2 (VARBINARY(XXXX))
This table is filled with millions of records per month, and records are moved to Table1_Hist in no specific order: User1 may insert a record in Table1 today and finish working with it today too, while User2 may insert a similar record in Table1 and not finish working with it until next week, next month, or three months from now.
The issue arises when Table1_Hist grows to the point of affecting the performance of queries against it. I had in mind splitting this table into Table1_Hist_1, Table1_Hist_2... Table1_Hist_n and creating an ID map table where I register the range of IDs stored in each table. For example, the map might say that Table1_Hist_1 stores IDs 1 to 10M, and so on. But as I said before, there is no chronological order in which records are inserted into these historical tables, so I could have a map entry saying IDs 1 to 10M point to Table1_Hist_1 and still end up with record ID=3 in Table1_Hist_2, simply because ID=3 was finished two months later and stored in the next table.
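For illustration, the map table I had in mind would be something like this (names are just illustrative, not my actual schema):

CREATE TABLE Table1_Hist_Map (
    HistTable VARCHAR(50) NOT NULL,  -- e.g. 'Table1_Hist_1'
    MinID     BIGINT      NOT NULL,  -- e.g. 1
    MaxID     BIGINT      NOT NULL   -- e.g. 10000000
);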
So, does anyone know of an efficient approach to split a table into multiple tables with their respective mapping?
Is there any way of converting the a.ROWID > b.ROWID comparison in the Oracle code below to Snowflake? I need to bring the ROWID logic over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the same result and work around the ROWID issue?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
              FROM rev_tag.emp_site_weekly b
              WHERE a.number = b.ID
                AND a.accountno = b.account_no
                AND a.ROWID > b.ROWID)
So this Oracle code seems quite broken, because ROWID is a table-specific pseudo column, so comparing its values between two tables is dubious. Unless there is some alignment magic happening, like rev_tag.emp_site_weekly also being written whenever user_tag.user_dim_default is inserted into. But even then I can imagine data flows where this will not get you what you want.
So, as with most things Snowflake, "there is no free lunch": the part of the data life cycle that relies on ROWID needs to be implemented explicitly.
Which implies that if you want to use two sequences, you should define them explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table insert or MERGE should be used, so you can access the first table's sequence value and relate it in the second.
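For example, a rough sketch of that idea, assuming a hypothetical staging table and an explicit row_seq column added to both tables (none of these names are your actual schema):

CREATE SEQUENCE shared_seq;

-- One sequence value per source row, shared by both target tables
INSERT ALL
  INTO user_tag.user_dim_default (row_seq, number, accountno)
    VALUES (rs, src_number, src_account)
  INTO rev_tag.emp_site_weekly (row_seq, id, account_no)
    VALUES (rs, src_number, src_account)
SELECT shared_seq.NEXTVAL AS rs,
       s.number           AS src_number,
       s.accountno        AS src_account
FROM my_staging_table s;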
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value while your code is running, for example when a DB maintenance job runs or someone else updates the table. Some of these internal columns may even have the same value for more than one row.
When joining tables, the ROWID on one table has no relation to the ROWID on another table. When writing dedup logic or delete-before-insert logic, you should use the primary key, plus an audit column holding the date of insert or date of last update. Check the data model or ERD diagram for the PK/FK relationships between the tables and for which audit columns are available.
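As a minimal sketch, assuming both tables carry a hypothetical load_date audit column, the original delete could be expressed without ROWID along these lines:

DELETE FROM user_tag.user_dim_default
WHERE EXISTS (SELECT 1
              FROM rev_tag.emp_site_weekly b
              WHERE user_dim_default.number = b.id
                AND user_dim_default.accountno = b.account_no
                AND user_dim_default.load_date > b.load_date);  -- audit column instead of ROWID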
I am trying to understand how fact tables are formed in relation to the dimension tables.
E.g. Sale Fact Table
If there is a query for sales of a product by year/month/week/day, do I create a dimension for each type of period: Dim_Year, Dim_Month, Dim_Week and Dim_Day, each with their own respective keys?
Or is it possible to just use one dimension for all periods: Dim_Date and only have one date key?
Another area I am confused about is why some fact tables do not contain their own ID. E.g. the Sale fact table does not have a SaleID included in the fact table.
(Image: sale fact table textbook example)
DATES
Your date dimension needs to correspond to the grain of your fact table. So if you had daily sales you would have a Dim_Day, weekly sales you would have a Dim_Week, etc.
You would normally have multiple date dimensions (at different grains) in your data warehouse as you would have facts at different date grains.
Each date dimension would hold attributes applicable to levels higher up in the date hierarchy. So a Dim_Day might hold day, week, month and year attributes; Dim_Month might hold month, quarter and year attributes; etc.
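For illustration, a minimal sketch of a day-grain date dimension along those lines (column names and types are assumptions):

CREATE TABLE Dim_Day (
    day_key      INT         NOT NULL,  -- surrogate key, e.g. 20240115
    full_date    DATE        NOT NULL,
    day_of_week  VARCHAR(10) NOT NULL,
    week_of_year INT         NOT NULL,
    month_number INT         NOT NULL,
    month_name   VARCHAR(10) NOT NULL,
    quarter      INT         NOT NULL,
    year_number  INT         NOT NULL
);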
PRIMARY KEYS
Primary keys are rarely (never?) a technical requirement when creating tables in a database, i.e. you can create a table without defining a PK. So you need to consider why we normally (at least in OLTP DBs) include PKs. Common reasons include:
To easily identify an individual record
To ensure that duplicate records (those with the same PK value) are not created
So there are good reasons for creating PKs; however, there are cost overheads, e.g. the PK needs to be checked every time a new record is inserted into the table.
In a dimensional model where you are performing bulk inserts/updates, having PKs would cause a significant performance hit. Additionally, the insert logic/checks should always be implemented in your ETL processes so there is no need to include these types of checks/constraints in the DB itself.
Fact tables do have a primary key, but it is often implicit rather than explicit: a group of the FKs in the fact table uniquely identifies each record. This compound PK may be documented but it is never enabled/implemented.
Occasionally a fact table will have an explicit, single-column PK. This is normally used when the fact table needs to be updated and its implicit PK involves a large number of columns. There is normally logic required to identify the record to be updated using its FKs, but this returns the PK; the update statement then just has a clause like this:
WHERE table_pk = 12345678
rather than having to include all the columns in the implicit PK:
WHERE table_sk1 = 1234
AND table_sk2 = 5678
AND table_sk3 = 9876
....
Hope this helps?
Premise:
Orders table - id, rate, unit, amount, timestamp
Order States table - id, order_id, state, timestamp
Both tables are insert only (no updates)
For each order created, the state table can possibly have one or more states of the order, like 0 for open, 1 for completed etc.
Purpose:
Retrieve all open orders
Retrieve all open orders ordered by rate desc/asc
This is currently achieved using a subquery with GROUP BY and HAVING clauses. I have also tried joins with GROUP BY and HAVING.
Problem:
Very slow select time - around 500 to 1000 ms for an orders table of approximately a million records.
Required Help:
Indexing suggestions
Query re-write suggestions
Any help greatly appreciated.
Without knowing the full table structure (indexes and constraints), here are a couple of suggestions.
If your Order States table contains every state each order went through, and considering the queries you need to run, I would suggest maintaining a table with only the last state of each order (which I guess is also its current state) and keeping the previous states in a history table.
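A minimal sketch of that idea, with hypothetical names (order_current_state holds exactly one row per order, maintained by the same process that inserts the state history):

CREATE TABLE order_current_state (
    order_id   BIGINT    PRIMARY KEY REFERENCES orders (id),
    state      INT       NOT NULL,
    updated_at TIMESTAMP NOT NULL
);

-- All open orders, sorted by rate (0 = open, per the question)
SELECT o.*
FROM orders o
JOIN order_current_state c ON c.order_id = o.id
WHERE c.state = 0
ORDER BY o.rate DESC;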
Without knowing what indexes you already have, I would start with:
Orders table
id: primary key (automatically indexed using b-tree)
rate: indexed (because you need to sort your result by rate)
Order States table
id: primary key (automatically indexed using b-tree)
order_id: foreign key + index
state: index
Maybe a composite index on (order_id, state) could help too (give it a try); see the sketch below.
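A minimal sketch of those indexes, assuming generic SQL syntax and the column names given in the question (index names are made up):

CREATE INDEX idx_orders_rate         ON orders (rate);
CREATE INDEX idx_order_states_order  ON order_states (order_id);
CREATE INDEX idx_order_states_state  ON order_states (state);
CREATE INDEX idx_order_states_ord_st ON order_states (order_id, state);  -- the "give it a try" composite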
I am designing an Oracle database for an ordering system. Each row will be a schedule that stores can be assigned, designating if/when they will order from a specific vendor for each day of the week.
It will be keyed by vendor id and a unique schedule id. I started out with those columns, and then a column for each day of the week like TIME_SUN, TIME_MON, TIME_TUE... to contain the order time for each day.
I'm normally inclined to normalize the data and have another table referencing the schedule id, with columns like DAY_OF_WEEK and ORDER_TIME, so potentially 7 rows for the same data.
Is it really necessary for me to do this, or is it just over-complicating what can be handled as a simple single row?
Normalization is the best way. Reasons:
The table will act as a master table
The table can be used for reference in future needs
It will be costly to normalize later
If there are a huge number of rows with repeated column values, the database grows in size unnecessarily
Using a master table limits the redundant data to just the foreign key
Normalization would be advisable. If, in future, you are required to store two or more order times for the same day, you will only need to add rows to your vendor_day_order table. If you go with the first approach, you will have to modify the table structure itself.
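For illustration, a minimal sketch of the normalized design, assuming Oracle syntax and hypothetical names based on the question:

CREATE TABLE vendor_schedule (
    schedule_id  NUMBER  PRIMARY KEY,
    vendor_id    NUMBER  NOT NULL
);

CREATE TABLE vendor_day_order (
    schedule_id  NUMBER       NOT NULL REFERENCES vendor_schedule (schedule_id),
    day_of_week  VARCHAR2(3)  NOT NULL,  -- 'SUN', 'MON', ...
    order_time   VARCHAR2(5)  NOT NULL   -- e.g. '14:30'; a DATE or INTERVAL column would also work
);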
I have a table called Products.
This table contains over 3 million entries. Every day there are approximately 5000 new entries, which are all inserted during the night within about 2 minutes.
But this table gets queried every night, maybe over 20,000 times, with this query:
SELECT Price
FROM Products
WHERE Code = #code
AND Company = #company
AND CreatedDate = #createdDate
Table structure:
Code nvarchar(50)
Company nvarchar(10)
CreatedDate datetime
I can see that this query takes about a second to return a result from the Products table.
There is no productId column in the table as it is not needed. So there is no primary key in the table.
I would like to somehow improve this query to return the result faster.
I have never used indexes before. What would be the best way to use indexes on this table?
If I add a primary key, do you think it would speed up the query? Keep in mind that I will still have to query the table by providing 3 parameters, as in
WHERE Code = #code
AND Company = #company
AND CreatedDate = #createdDate.
This is mandatory.
As I mentioned, the table gets its new entries within 2 minutes every night. How would this affect the indexes?
If I use indexes, which columns would be best to use, and should I use a clustered or non-clustered index?
The best thing to do would depend on what other fields the table has and what other queries run against that table.
Without more details, a non-clustered index on (code, company, createddate) that includes the price column will certainly improve performance:
CREATE NONCLUSTERED INDEX IX_code_company_createddate
ON Products(code, company, createddate)
INCLUDE (price);
That's because with that index in place, SQL Server will not need to access the actual table at all when running the query. It can find all rows with a given (code, company, createddate) directly in the index, and it can do that really fast because the index is keyed precisely on those fields; since the index also includes the price value for each row, nothing else has to be read.
Regarding the inserts: for each row added, SQL Server will have to add an entry to the index as well, so insert performance will be impacted. I think you should expect the gains in SELECT performance to outweigh the impact on the inserts, but you should test that.
Also, you will be using more space, as the index stores all those fields for each row in addition to the space used by the original table.
As others have noted in the comments, adding a PK to your table (even if that means adding a ProductId column you don't actually need) might be a good idea as well.
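If you do go that way, a minimal sketch of the idea (the ProductId column and constraint name are assumptions, and building a clustered PK will rebuild the table, so do it in a maintenance window):

ALTER TABLE Products
    ADD ProductId INT IDENTITY(1,1) NOT NULL;

ALTER TABLE Products
    ADD CONSTRAINT PK_Products PRIMARY KEY CLUSTERED (ProductId);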