We are ingesting data using a COPY INTO statement triggered by an external orchestration tool. The ingest is a full table load each time - I know, not ideal but it is what we currently have.
To get the ingested data into the final table, we clone the target, truncate the clone, and insert the newly ingested data into it. We then swap the clone with the target. This was working well until someone put a materialized view on the target; now the table swap invalidates the materialized view each time the target is updated.
I should probably rewrite this to perform a merge into the target instead, but that would be more complicated to write since there are no keys on this table, so I am wondering what the preferred solution is here.
Is it possible to force a refresh of a materialized view after a table swap, or does it have to be rebuilt each time?
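For reference, a rough sketch of the current workflow in Snowflake SQL; the stage and table names (my_stage, RAW_STAGE, TARGET_TBL, TARGET_TBL_STAGE) are placeholders, not our real objects:

COPY INTO RAW_STAGE FROM @my_stage;  -- full load triggered by the orchestration tool

CREATE OR REPLACE TABLE TARGET_TBL_STAGE CLONE TARGET_TBL;  -- zero-copy clone of the target
TRUNCATE TABLE TARGET_TBL_STAGE;
INSERT INTO TARGET_TBL_STAGE SELECT * FROM RAW_STAGE;

ALTER TABLE TARGET_TBL_STAGE SWAP WITH TARGET_TBL;  -- this swap is what invalidates the materialized view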
I think you have found a bug in Snowflake and you should report it to Snowflake Support. My investigation shows that after executing the SWAP command, the view definition still refers to the correct table, but in the metadata it points to the other table; it probably should not behave this way.
The output of the SHOW MATERIALIZED VIEWS command shows this mismatch.
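If the view needs to be usable again right after the swap, one workaround (not a true in-place refresh) is to recreate it; this is a sketch only, and MY_MV and its query are placeholders for your actual definition:

SHOW MATERIALIZED VIEWS LIKE 'MY_MV';  -- inspect the view's state after the swap

-- Recreating the view re-points it at the swapped-in target, at the cost of a full rebuild.
CREATE OR REPLACE MATERIALIZED VIEW MY_MV AS
SELECT col1, COUNT(*) AS cnt
FROM TARGET_TBL
GROUP BY col1;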
Is there something similar to master-slave database replication, but at the table level within the database?
For example, I have the following scenario:
I have a table with millions of records, because the system is more than 15 years old.
I only want to show the records from the last year (2019-2020).
So I decided to create a view that only shows the records in that one-year range from the table containing millions of records.
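Roughly something like the following; the table and column names are just illustrative:

CREATE OR REPLACE VIEW recent_records_v AS
SELECT *
FROM   big_table
WHERE  record_date >= DATE '2019-01-01'
AND    record_date <  DATE '2020-01-01';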
Thanks to the view, that system screen loads faster because it has fewer records to deal with.
The problem: what if a user adds a new record to the table that contains millions of records? How do I make my view reflect changes when the underlying table is modified?
I think I could use triggers to update the view, but is there functionality in Oracle similar to what I just described (master-slave), where the "slave" table is updated as the "master" table changes?
First of all, you have misunderstood views. A view is not a physical table and does not store any data. If you insert data into a view, you are actually inserting into the source table.
Since the view is not physical, you are just filtering the data; this does not provide any performance benefit by itself.
For big tables you can use partitioning, which can drastically improve performance. And if you still need archival, you can archive the partitioned data.
Partitioning is generally the best method, because you can typically archive old data simply by issuing an "exchange" command.
Data doesn't "move" in that scenario; it simply gets "detached" from the table via data dictionary manipulation.
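A hedged sketch of that exchange pattern in Oracle SQL; the table, partition, and archive names are assumptions:

-- Empty table with the same structure as the (assumed range-partitioned) big_table
CREATE TABLE big_table_archive AS
SELECT * FROM big_table WHERE 1 = 0;

-- Detach the old partition into the archive table; no rows are physically moved,
-- only the data dictionary is updated.
ALTER TABLE big_table
  EXCHANGE PARTITION p_2018 WITH TABLE big_table_archive;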
Is there something similar to master-slave database replication, but at the table level within the database?
If you are asking about master/slave replication at the table level, then I suppose the table/materialized view relationship is appropriate to call master-slave. Quoting the Oracle docs:
A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. Collectively these objects are called master tables (a replication term)...
When you need to "update" or, more appropriately, refresh the mview, you have different options:
refresh the mview periodically, on a schedule
refresh the mview each time the data in the master table is changed and committed
refresh it manually by calling DBMS_MVIEW.REFRESH or DBMS_SNAPSHOT.REFRESH
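A rough sketch of those options, assuming a hypothetical master table big_table with a primary key:

-- Materialized view log, needed for fast (incremental) refresh
CREATE MATERIALIZED VIEW LOG ON big_table WITH PRIMARY KEY;

-- Scheduled refresh: here, once a day
CREATE MATERIALIZED VIEW big_table_mv
  REFRESH FAST
  START WITH SYSDATE NEXT SYSDATE + 1
AS SELECT * FROM big_table;

-- For refresh-on-commit, the mview would be created with REFRESH FAST ON COMMIT instead.

-- Manual refresh on demand
EXEC DBMS_MVIEW.REFRESH('BIG_TABLE_MV');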
An mview can be faster than a view because each time you select from an mview you are selecting from a separate "table" that was replicated from the original one. Especially if you have complex logic in the SQL, you can put that logic into the mview definition.
The drawbacks are that the mview needs extra disk space, and there will be a delay before the data is refreshed.
I have a table named abcTbl whose data is populated from tables in a different database. Every time I load data into abcTbl, I delete everything from it and then load the buffered data into it.
This package runs daily. My question is: how do I avoid losing the data in abcTbl if we fail to load new data into it? My first step is to delete all the data in abcTbl, then select the data from various sources into a buffer, and finally load the buffered data into abcTbl.
We can encounter issues like failed connections, the package stopping prematurely, or supernatural forces trying to stop/break my package from running smoothly, any of which would leave the package with no buffered data after I have already deleted the data from abcTbl.
My first intuition was to save the data from abcTbl into a backup table and then delete the data in abcTbl, but my DBAs wouldn't be too thrilled about creating a backup table in every environment just for this package, and giving me the rights to create backup tables on the fly and then drop them again is out of the question too. This data is not business critical and can be repopulated if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of copying data from the original table to a backup table, you can simply rename the original table to a backup name, recreate the original table with the same structure as the backup, and drop the renamed table only when your data load has succeeded. This avoids the time spent transferring data from one table to another. You may want to test which approach is faster for your data and table structure, but this is another way to do it; if the table holds a lot of data, the approach below may be faster.
EXEC sp_rename 'abcTbl', 'abcTbl_bkp';

-- Recreate abcTbl with the same column structure as abcTbl_bkp, without copying any rows
-- (indexes and constraints are not copied by SELECT INTO and would need to be recreated)
SELECT * INTO abcTbl FROM abcTbl_bkp WHERE 1 = 0;

-- Load your new data into abcTbl here

DROP TABLE abcTbl_bkp;
I'm trying to figure this out, but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBAs that a separate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that mirrors your load table but adds a single DateTime field (say, history_date). On each load you would flow all the data from your primary table into this history table, using a Derived Column task in the Data Flow to populate the history_date value.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
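If you want the same thing in plain T-SQL rather than a Data Flow, here is a minimal sketch; abcTbl_history is a hypothetical table with abcTbl's columns plus a trailing history_date column:

-- Copy the current contents into the history table, stamping the load time.
-- Relies on history_date being the last column of abcTbl_history.
INSERT INTO abcTbl_history
SELECT t.*, GETDATE()
FROM abcTbl AS t;

-- Only then clear the current table and load the new data.
TRUNCATE TABLE abcTbl;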
Instead of creating additional tables, you can set the package to execute as a single transaction. By doing this, if any component fails, all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package, which means the package will begin a transaction. Then set this property to Supported for all components that you want to succeed or fail together; the Supported level has these tasks join the transaction already in progress in the parent container, the package in this case.

If there are other components in the package that you want to commit or roll back independently of these tasks, you can place the related objects in a Sequence container and apply the Required level to the Sequence instead. An important thing to note is that if anything performs a TRUNCATE, then all other components that access the truncated object will need ValidateExternalMetadata set to false to avoid the known blocking issue that results from this.
I have noticed quite a few mentions of the word "materializing" when people talk about using temporary tables in SQL Server. Can someone expand on what that means? I am just trying to get a better understanding of the term as it relates to temp tables.
Thanks!
S
The term "materializing" is normally used in the context of a view. When you create a clustered index on a view, you materialize the view; this means that the view's data is stored like a table on disk, and updated automatically when the tables which participate in the view get updated.
If a view is not materialized, SQL Server must compute the data in the view by performing the joins in the view definition every time a query is performed (though the results may be cached, or what-have-you).
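For example, a minimal sketch of materializing a view in SQL Server as an indexed view; the table and column names are hypothetical, and Amount is assumed non-nullable:

-- The view must be schema-bound before it can be indexed.
CREATE VIEW dbo.OrderTotals
WITH SCHEMABINDING
AS
SELECT o.CustomerId,
       COUNT_BIG(*)  AS OrderCount,
       SUM(o.Amount) AS TotalAmount
FROM dbo.Orders AS o
GROUP BY o.CustomerId;
GO

-- Creating a unique clustered index is what "materializes" the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_OrderTotals
    ON dbo.OrderTotals (CustomerId);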
I'm not sure the term "materializing" properly applies to temporary tables. The word "materialized" in a database context implies caching query results in a concrete table that may be updated from the original base tables. Perhaps it is internal jargon for converting the results held in a temp table into a permanent table.
Are there best practices out there for loading data into a database, to be used with a new installation of an application? For example, for application foo to run, it needs some basic data before it can even be started. I've used a couple of options in the past:
TSQL for every row that needs to be preloaded:
IF NOT EXISTS (SELECT * FROM [Master].[Site] WHERE [Name] = @SiteName)
    INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
    VALUES (@EnterpriseId, @SiteName, GETDATE(), @LastModifiedUser);
Another option is a spreadsheet. Each tab represents a table, and data is entered into the spreadsheet as we realize we need it. Then, a program can read this spreadsheet and populate the DB.
There are complicating factors, including the relationships between tables, so it's not as simple as loading tables by themselves. For example, if we create Security.Member rows and then want to add those members to Security.Role, we need a way of maintaining that relationship.
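For that relationship case, a sketch of the same idempotent-insert pattern; the Security.Member and Security.Role names come from the example above, while the column names and the Security.MemberRole link table are assumptions:

-- Seed the member if it is missing.
IF NOT EXISTS (SELECT * FROM [Security].[Member] WHERE [Name] = @MemberName)
    INSERT INTO [Security].[Member] ([Name]) VALUES (@MemberName);

-- Link the member to the role, looking both keys up by natural key so the
-- script works whether or not the rows already existed.
IF NOT EXISTS (SELECT *
               FROM [Security].[MemberRole] mr
               JOIN [Security].[Member] m ON m.[MemberID] = mr.[MemberID]
               JOIN [Security].[Role]   r ON r.[RoleID]   = mr.[RoleID]
               WHERE m.[Name] = @MemberName AND r.[Name] = @RoleName)
    INSERT INTO [Security].[MemberRole] ([MemberID], [RoleID])
    SELECT m.[MemberID], r.[RoleID]
    FROM [Security].[Member] m, [Security].[Role] r
    WHERE m.[Name] = @MemberName AND r.[Name] = @RoleName;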
Another factor is that not all databases will be missing this data. Some locations will already have most of the data, while others (which may be new locations around the world) will start from scratch.
Any ideas are appreciated.
If it's not a lot of data - the bare initialization of configuration data - we typically script it along with any database creation/modification scripts.
With scripts you have a lot of control, so you can insert only missing rows, remove rows that are known to be obsolete, avoid overwriting columns that have been customized, and so on.
If it's a lot of data, then you probably want external files - I would avoid a spreadsheet and use plain text files instead (BULK INSERT). You can load these into a staging area and still use the same techniques you would use in a script to ensure you don't clobber any customization in the destination. And because it's under script control, you control the order of operations to ensure referential integrity.
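A minimal sketch of that staging approach in T-SQL; the file path, Staging.Site table, and column names are assumptions:

-- Load the flat file into a staging table first.
BULK INSERT Staging.Site
FROM 'C:\seed\site.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Then move only the missing rows into the real table, leaving any rows already
-- customized in the destination untouched.
INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
SELECT s.EnterpriseID, s.Name, GETDATE(), 'seed-script'
FROM Staging.Site AS s
WHERE NOT EXISTS (SELECT * FROM [Master].[Site] m WHERE m.[Name] = s.[Name]);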
I'd recommend a combination of the 2 approaches indicated by Cade's answer.
Step 1. Load all the needed data into temp tables (on Sybase, for example, load the data for table "db1..table1" into "temp..db1_table1"). To handle large datasets, use the bulk copy mechanism (whichever one your DB server supports) without writing to the transaction log.
Step 2. Run a script whose main step iterates over each table to be loaded: if needed, create indexes on the newly created temp table, compare the data in the temp table to the main table, and insert/update/delete the differences. Then, as needed, the script can perform auxiliary tasks like the security role setup you mentioned.
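A rough T-SQL-style sketch of the compare-and-apply step for one table; temp_table1/table1 and the id/value columns are placeholders, and Sybase syntax will differ slightly:

-- Insert rows that exist only in the staged copy.
INSERT INTO table1 (id, value)
SELECT s.id, s.value
FROM temp_table1 s
WHERE NOT EXISTS (SELECT 1 FROM table1 t WHERE t.id = s.id);

-- Update rows whose values differ.
UPDATE t
SET    t.value = s.value
FROM   table1 t
JOIN   temp_table1 s ON s.id = t.id
WHERE  t.value <> s.value;

-- Delete rows that are no longer present in the staged copy.
DELETE t
FROM   table1 t
WHERE  NOT EXISTS (SELECT 1 FROM temp_table1 s WHERE s.id = t.id);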
What is the best approach to synchronizing a DataSet with data in a database? Here are the parameters:
We can't simply reload the data because it's bound to a UI control which a user may have configured (it's a tree grid that they may expand/collapse)
We can't use a change flag (like an UpdatedTimeStamp) in the database because changes don't always flow through the application (e.g. a DBA could update a field with a SQL statement)
We cannot use an update trigger in the database because it's a multi-user system
We are using ADO.NET DataSets
Multiple fields of a given row can change
I've looked at the DataSet's Merge capability, but this doesn't seem to keep the notion of an "ID" column. I've looked at the DiffGram capability, but those seem to be generated from changes within the same DataSet rather than from changes that occurred in some external data source.
I've been running from this solution for a while, but the approach I know would work (with a lot of inefficiency) is to build a separate DataSet and then iterate over all rows, applying changes field by field to the DataSet the UI is bound to.
Has anyone had a similar scenario? What did you do to solve the problem? Even if you haven't run into a similar problem, any recommendation for a solution is appreciated.
Thanks
DataSet.Merge works well for this if you have a primary key defined for each DataTable; the DataSet will raise changed events to any databound GUI controls.
If your table is small you can just re-read all of the rows and merge periodically; otherwise, limiting the set to be read with a timestamp is a good idea - just tell the DBAs to follow the rules and update the timestamp ;-)
Another option - which is a bit of work - is to keep a changed-row queue (timestamp, row ID) using a trigger or stored procedure, and base the refresh queries on the timestamps in the queue. This is more efficient if the base table has a lot of rows, since an inner join on the queue records lets you pull only the rows changed since the last poll time.
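A minimal sketch of such a changed-row queue in T-SQL; the base table dbo.MyTable, its Id key, and the queue table are assumptions:

-- Queue of changed row IDs plus when they changed.
CREATE TABLE dbo.MyTableChanges (
    Id        INT      NOT NULL,
    ChangedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

-- Record every insert/update/delete, whether it came from the application
-- or from an ad-hoc SQL statement.
CREATE TRIGGER trg_MyTable_Changes
ON dbo.MyTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    INSERT INTO dbo.MyTableChanges (Id)
    SELECT Id FROM inserted
    UNION
    SELECT Id FROM deleted;
END;
GO

-- Refresh query: pull only rows touched since the last poll time
-- (@LastPollTime is supplied by the polling code).
SELECT t.*
FROM dbo.MyTable t
JOIN dbo.MyTableChanges c ON c.Id = t.Id
WHERE c.ChangedAt > @LastPollTime;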
I think it would be easier to store a list of the nodes that the user has expanded (assuming you can uniquely identify each one), then re-load the data and re-bind it to the tree view, and then expand all the nodes previously expanded.