In dbt, why is my incremental model not registering as incremental, and why is the compiled SQL not what I would expect? - data-modeling

My incremental model is not working, and I don't know why! Thank you in advance.
My intention is to process only records that are within 60 days of today and materialize them into a table that holds many years' worth of data. Since a record can change until it is more than 60 days old, I need to reprocess all records younger than 60 days every day.
So I set up an incremental model like this.
Which, compiled, looks like this (notice that the 60-day clause does not show up, and I don't know why).
When I run this model, there is no mention of a temp table being created,
and also no mention of a merge action for the unique_key clause at the top of my model.
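For reference, a model set up the way the question describes would look roughly like this (a sketch only; the source, column, and key names are placeholders, not the actual SQL from the question):

{{
    config(
        materialized='incremental',
        unique_key='record_id',
        incremental_strategy='merge'
    )
}}

select *
from {{ source('raw', 'transactions') }}

{% if is_incremental() %}
-- reprocess only the trailing 60-day window on incremental runs (Snowflake syntax)
where record_date >= dateadd(day, -60, current_date)
{% endif %}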

is_incremental() behavior
Have you run this model before? How are you running this model? I ask because is_incremental() will only be true, and therefore your extra WHERE clause will only be included if certain criteria are met.
The docs here say:
The is_incremental() macro will return True if:
the destination table already exists in the database
dbt is not running in full-refresh mode
the running model is configured with materialized='incremental'
In this case, you're right that #2 & #3 are both True, but my guess is that #1 is False
To learn more about is_incremental() check out the source code, which, IMHO, clearly lays out the logic.
no TEMP table being created?
When I run this model, there is no mention of a temp table being created
I do see a create or replace transient table statement, which is how dbt makes a temp table on Snowflake (source code)

In the dbt Cloud development environment, is_incremental() always evaluates to False. In dbt Cloud production runs (and when using the CLI), is_incremental() evaluates to True if the table already exists.

Related

dbt incremental load merge operation without IS_INCREMENTAL() macro

I have set my model to be incremental by
{{
    config(
        materialized='incremental',
        unique_key='my_key',
        incremental_strategy='merge'
    )
}}
Do I absolutely have to have IS_INCREMENTAL() macro at the end of the statement?
If not, then without IS_INCREMENTAL(),
1- Is dbt running a full refresh in the pipeline?
2- Is dbt still merging the output on my_key in Snowflake? Or is dbt going to completely overwrite the entire Snowflake table with the full refresh output?
is_incremental() is just a convenience macro. It evaluates to True if the model is being run in "incremental mode" (i.e., the table already exists and the --full-refresh flag has not been passed).
If a model is running for the first time or --full-refresh is set, dbt will drop the target table and create a new table with the results of the select statement in your model.
Otherwise, in incremental mode, an incremental model will attempt to upsert any records returned by the select statement.
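With unique_key set and the merge strategy on Snowflake, that upsert boils down to something like the following (a simplified sketch with hypothetical schema and column names, not dbt's exact compiled SQL):

merge into analytics.my_model as tgt
using analytics.my_model__dbt_tmp as src
    on tgt.my_key = src.my_key
when matched then update set
    col_a = src.col_a,
    col_b = src.col_b
when not matched then insert (my_key, col_a, col_b)
    values (src.my_key, src.col_a, src.col_b);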
Typically, an incremental model's source continues to grow over time (maybe it's a table of 100M records growing by 2M records/week). In this case, you would use something like:
select *
from source
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
so that you're only selecting the new records from your source table.
However, if your source table already purges old records, you may not need the special where clause. Just note that this isn't the pattern dbt was designed for, and if someone ever runs --full-refresh on your prod target you'll lose all of the old data.
If you do have a source that purges (deletes/truncates) old records, I would consider creating a snapshot of that source, instead of loading the values into an incremental model directly. Snapshots are designed for this kind of use case, and will generalize better across multiple dev environments, etc.
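A minimal snapshot along those lines might look like this (a sketch; the source and column names are assumptions):

{% snapshot my_source_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='my_key',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'my_source') }}

{% endsnapshot %}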
Incremental docs
Snapshot docs

Ghost data rows added into Firebird database table?

I faced a strange case today when receiving a customer database for investigation.
System settings:
Firebird server v 2.5.9.26074
Firebird client v 2.6.5
Database file is accessed directly by the application, i.e., it is NOT registered via aliases.conf.
When I first looked into the database, everything seemed to be pretty consistent. However, during the first startup two rows are added to a certain table without any detected SQL execution. I have confirmed with the debugger that the application is not adding these rows. I also used the Audit and Trace interface (fbtracemgr) and saw in the log file that no such rows are added to the database.
There is one hint that something is wrong in the original database. The table that shows the problem uses an INSERT trigger to set the row's ID column value from a generator. The generator value seems to be one too high in the original database. This leads me to think that the "ghost data" has already been entered into the file in some sort of cache, as the generator has already been incremented by one.
The result is that after these two ghost rows are added, the next real addition to the table leads to an exception:
FirebirdSql.Data.FirebirdClient.FbException (0x80004005): violation of
PRIMARY or UNIQUE KEY constraint "INTEG_275" on table "DATALOG" --->
violation of PRIMARY or UNIQUE KEY constraint "INTEG_275" on table
"DATALOG"
as there already exists a row with the ID that the generator suggests.
Is there a persistent "unsaved data cache" that could contain row data entered during previous application runs? What could lead to this situation? A power failure during database writing or backup?
Any thoughts?
Firebird server v 2.5.9.26074
There is no such version released.
Firebird-2.5.8.27089
http://www.firebirdsql.org/en/firebird-2-5/
Basically you seem to be using some destabilized internal Firebird developers' build, which can have any number of strange adverse effects.
So I would advise using a standard released version, or, if using snapshot builds is required for some untold reason, asking the developers on the firebird-support mailing list - http://www.firebirdsql.org/en/support/
Though don't hold your breath for much support of exotic Firebird builds.
UPD. Thanks to Mark, here it is: https://www.firebirdsql.org/en/firebird-2-5-0/
2.5.0 was the first release after a significant reworking of the engine, and not the most stable one, obviously. For example, there was an issue with indices that was addressed right in the next version, 2.5.1.
If the behavior is repeated on a standard 2.5.8 Firebird, then I would suggest exporting the whole database (at least all the metadata, but maybe the data as well) into a long text file, an SQL script, and then searching for the said table name in it. For example, there might be on-database-connect triggers adding some data. Or stored procedures. Or views built on triggers. Or something else yet. For example - though it is malpractice - even a UDF function may make its own database connection and do things, though this should show up in FBTrace.
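As a starting point, a couple of queries against the system tables can show database-level triggers and the objects that depend on the table (a sketch; the trigger-type codes are the Firebird 2.5 values, and DATALOG is the table name from the error message above):

-- database-level triggers (ON CONNECT etc.); types 8192-8196 in Firebird 2.5
select rdb$trigger_name, rdb$trigger_type
from rdb$triggers
where rdb$trigger_type >= 8192;

-- any trigger, procedure or view that references the table
select rdb$dependent_name, rdb$dependent_type
from rdb$dependencies
where rdb$depended_on_name = 'DATALOG';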
However, during the first startup there are two rows added in certain table
Startup of what?
Will those rows still be added if you use standard tools like iSQL/FlameRobin/IBExpert/etc. just to connect and then disconnect from the database?
as there already exist row with equal ID that the generator suggests
A generator cannot suggest things like that. It can only suggest that such a number was once reserved for possibly being added to one table or another. It does not mean the row was actually inserted, that it was inserted into that table, or that it was not deleted later.
You may try to search with the index bypassed, in case of index corruption, with something like
select id+0, count(*) from tableName group by 1
Also http://www.firebirdfaq.org/faq324/
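To go straight for duplicated IDs while still bypassing the index, something along these lines should work (the table and column names are taken from the question; adjust as needed):

select id+0, count(*)
from DATALOG
group by id+0
having count(*) > 1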
when receiving customer database for investigation
BTW, how exactly did they create a copy of the database to give to you?
Did they make a backup (FBK)? If not, did they stop the Firebird server before making the copies?

Different ways to define which database to use?

I am trying to find out the differences in the way you can define which database to use in SSMS.
Is there any functional difference between using the 'Available Databases' drop-down list,
the database being defined in the query
SELECT * FROM AdventureWorks2008.dbo.Customers
and
stating the database at the start?
USE AdventureWorks2008
GO
SELECT * FROM dbo.Customers
I'm interested to know if there is a difference in terms of performance or something that happens behind the scenes for each case.
Thank you for your help
Yes, there is. A very small overhead is added when you use "USE AdventureWorks2008", since it is executed against the database every time you run the query. It also prints "Command(s) completed successfully.". However, the overhead is so small that, if you are OK with that message, you do not need to care about it.
Yes, there can be a difference.
When you execute a statement like SELECT * FROM AdventureWorks2008.dbo.Customers in the context of another database (not AdventureWorks2008), that other database's settings are applied.
First of all, every database has its own Compatibility Level, which can differ and can limit the use of some code; for example, you cannot use the APPLY operator in the context of a database with CL set to 80, but you can within a database with CL >= 90.
Second, every database has its own set of options such as AUTO_UPDATE_STATISTICS_ASYNC and Forced Parameterization that can affect your query plan.
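You can see which settings apply to each database with a quick query (a sketch; these are columns exposed by sys.databases):

SELECT name, compatibility_level, is_parameterization_forced
FROM sys.databases;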
I did encounter some cases where the database context influenced the plan:
One case was when I created a filtered index on a table: it was used in the plan when I executed my query in the context of a database with simple parameterization, but it was not used for the same query executed in the context of a database with forced parameterization. When I used a hint to force that index, I got an error that the query plan could not be produced due to the query hint, so I had to investigate, and I found out that my query had been parameterized: instead of my condition fld = 0 there was fld = @p, and the filtered index with the fld = 0 condition could not be used.
The second case was regarding table cardinality estimation: we use staging tables to load the data in our ETL procedures and then switch them to the actual tables like this:
insert into stg with(tablock)
...
truncate table actual;
alter table stg switch to actual;
All the staging tables are empty when the procedure compiles, but within the proc they are filled with data, so when we join them they are not empty anymore. Passing from 0 rows to non-0 rows triggers a statement recompilation that should take the actual number of rows into consideration, but it did not happen on the production server, so all the estimates were completely wrong (1 row for every table) and I had to investigate. The cause was AUTO_UPDATE_STATISTICS_ASYNC set to ON in the production database.
Now imagine you have two databases, db1 and db2, with this option set to ON and OFF respectively: in db1 this code will get wrong estimates, while if you execute it in db2 (referencing db1.dbo.stg) it will get correct estimates. The execution time can be very different in these two databases.
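A sketch of how that setting can be checked and changed per database (the database names are the illustrative db1/db2 from above):

SELECT name, is_auto_update_stats_async_on
FROM sys.databases
WHERE name IN ('db1', 'db2');

ALTER DATABASE db1 SET AUTO_UPDATE_STATISTICS_ASYNC OFF;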

Does Oracle (RDB in general?) take a snapshot of the table affected by DML?

Objective
To understand the mechanism/implementation when processing DMLs against a table. Does a database (I work on Oracle 11G R2) take snapshots (for each DML) of the table to apply the DMLs?
Background
I run a SQL to update the AID field of the target table containing old values with the new values from the source table.
UPDATE CASES1 t
SET t.AID = (
    SELECT DISTINCT NID
    FROM REF1
    WHERE oid = t.aid
)
WHERE EXISTS (
    SELECT 1
    FROM REF1
    WHERE oid = t.aid
);
I thought the 'OLD01' could be updated twice (OLD01 -> NEW01 -> SCREWED).
However, it did not happen.
Question
For each DML statement, does a database take a snapshot of table X (call it X+1) for the first DML, then keep taking snapshots (call it X+2) of the result (X+1) for the next DML on the table, and so on for each DML that is successively executed? Is this also used as a mechanism to implement rollback/commit?
Is this expected behaviour specified as a standard somewhere? If so, kindly suggest relevant references. I Googled but am not sure what the keywords should be to get the right results.
Thanks in advance for your help.
Update
I started reading Oracle Core (ISBN 9781430239543) by Jonathan Lewis and saw the diagram. So my current understanding is that UNDO records are created in the UNDO tablespace for each update and the original data is reconstructed from there, which is what I initially thought of as snapshots.
In Oracle, if you ran that update twice in a row in the same session, with the data as you've shown, I believe you should get the results that you expected. I think you must have gone off track somewhere. (For example, if you executed the update once, then without committing you opened a second session and executed the same update again, then your result would make sense.)
Conceptually, I think the answer to your question is yes (speaking specifically about Oracle, that is). A SQL statement effectively operates on a snapshot of the tables as of the point in time that the statement starts executing. The proper term for this in Oracle is read-consistency. The mechanism for it, however, does not involve taking a snapshot of the entire table before changes are made. It is more the reverse - records of the changes are kept in undo segments, and used to revert blocks of the table to the appropriate point in time as needed.
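A small, self-contained illustration of that behaviour, using hypothetical tables shaped like the OLD01/NEW01 example in the question:

CREATE TABLE CASES1 (AID VARCHAR2(10));
CREATE TABLE REF1   (OID VARCHAR2(10), NID VARCHAR2(10));

INSERT INTO CASES1 VALUES ('OLD01');
INSERT INTO REF1   VALUES ('OLD01', 'NEW01');
INSERT INTO REF1   VALUES ('NEW01', 'SCREWED');

UPDATE CASES1 t
SET t.AID = (SELECT DISTINCT NID FROM REF1 WHERE oid = t.aid)
WHERE EXISTS (SELECT 1 FROM REF1 WHERE oid = t.aid);

-- CASES1.AID is now 'NEW01', not 'SCREWED': the statement reads the table
-- as of the moment it started, so the value it has just written is not
-- re-read and re-mapped within the same statement.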
The documentation you ought to look at to understand this in some depth is in the Oracle Concepts manual: http://docs.oracle.com/cd/E11882_01/server.112/e40540/part_txn.htm#CHDJIGBH

For Oracle Database, how to find when a row was inserted? (timestamp) [duplicate]

Can I find out when the last INSERT, UPDATE or DELETE statement was performed on a table in an Oracle database and if so, how?
A little background: The Oracle version is 10g. I have a batch application that runs regularly, reads data from a single Oracle table and writes it into a file. I would like to skip this if the data hasn't changed since the last time the job ran.
The application is written in C++ and communicates with Oracle via OCI. It logs into Oracle with a "normal" user, so I can't use any special admin stuff.
Edit: Okay, "Special Admin Stuff" wasn't exactly a good description. What I mean is: I can't do anything besides SELECTing from tables and calling stored procedures. Changing anything about the database itself (like adding triggers) is sadly not an option if I want to get it done before 2010.
I'm really late to this party but here's how I did it:
SELECT SCN_TO_TIMESTAMP(MAX(ora_rowscn)) from myTable;
It's close enough for my purposes.
Since you are on 10g, you could potentially use the ORA_ROWSCN pseudocolumn. That gives you an upper bound of the last SCN (system change number) that caused a change in the row. Since this is an increasing sequence, you could store off the maximum ORA_ROWSCN that you've seen and then look only for data with an SCN greater than that.
By default, ORA_ROWSCN is actually maintained at the block level, so a change to any row in a block will change the ORA_ROWSCN for all rows in the block. This is probably quite sufficient if the intention is to minimize the number of rows you process multiple times with no changes if we're talking about "normal" data access patterns. You can rebuild the table with ROWDEPENDENCIES which will cause the ORA_ROWSCN to be tracked at the row level, which gives you more granular information but requires a one-time effort to rebuild the table.
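A sketch of that pattern (the table name and bind variable are illustrative):

-- first run: remember the highest SCN seen so far
SELECT MAX(ora_rowscn) AS max_seen_scn FROM my_table;

-- later runs: read only rows changed since the remembered SCN
SELECT * FROM my_table WHERE ora_rowscn > :max_seen_scn;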
Another option would be to configure something like Change Data Capture (CDC) and to make your OCI application a subscriber to changes to the table, but that also requires a one-time effort to configure CDC.
Ask your DBA about auditing. He can start an audit with a simple command like:
AUDIT INSERT ON user.table
Then you can query the table USER_AUDIT_OBJECT to determine if there has been an insert on your table since the last export.
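Once auditing is in place, a check along these lines should do it (a sketch; the column names are from the USER_AUDIT_OBJECT view, so verify them against your version):

SELECT COUNT(*)
FROM USER_AUDIT_OBJECT
WHERE obj_name = 'MYTABLE'
  AND action_name = 'INSERT'
  AND timestamp > :last_export_time;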
google for Oracle auditing for more info...
SELECT * FROM all_tab_modifications;
Could you run a checksum of some sort on the result and store that locally? Then when your application queries the database, you can compare its checksum and determine if you should import it?
It looks like you may be able to use the ORA_HASH function to accomplish this.
Update: Another good resource: 10g’s ORA_HASH function to determine if two Oracle tables’ data are equal
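A sketch of that checksum idea (the columns and delimiter are illustrative; any stable per-row expression works):

SELECT SUM(ORA_HASH(col_a || '|' || col_b)) AS table_checksum
FROM my_table;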
Oracle can watch tables for changes, and when a change occurs it can execute a callback function in PL/SQL or OCI. The callback gets an object that is a collection of the tables which changed, each with a collection of the rowids which changed and the type of action (insert, update, delete).
So you don't even go to the table, you sit and wait to be called. You'll only go if there are changes to write.
It's called Database Change Notification. It's much simpler than CDC as Justin mentioned, but both require some fancy admin stuff. The good part is that neither of these require changes to the APPLICATION.
The caveat is that CDC is fine for high volume tables, DCN is not.
If auditing is enabled on the server, simply use
SELECT *
FROM ALL_TAB_MODIFICATIONS
WHERE TABLE_NAME IN ()
You would need to add a trigger on insert, update, delete that sets a value in another table to sysdate.
When you run the application, it would read the value and save it somewhere so that the next time it runs it has a reference to compare against.
Would you consider that "Special Admin Stuff"?
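A sketch of that trigger approach (all names are hypothetical):

CREATE TABLE my_table_last_change (last_change DATE);
INSERT INTO my_table_last_change VALUES (SYSDATE);

CREATE OR REPLACE TRIGGER my_table_last_change_trg
AFTER INSERT OR UPDATE OR DELETE ON my_table
BEGIN
    UPDATE my_table_last_change SET last_change = SYSDATE;
END;
/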
It would be better to describe what you're actually doing so you get clearer answers.
How long does the batch process take to write the file? It may be easiest to let it go ahead and then compare the file against a copy of the file from the previous run to see if they are identical.
If anyone is still looking for an answer, they can use the Oracle Database Change Notification feature that comes with Oracle 10g. It requires the CHANGE NOTIFICATION system privilege. You can register listeners so that a notification is triggered back to the application when changes occur.
Please use the below statement
select * from all_objects ao where ao.OBJECT_TYPE = 'TABLE' and ao.OWNER = 'YOUR_SCHEMA_NAME'
