Using Delta Lake tables as lookups on another changing Delta table - ACID

I have a scenario where I'm using one Delta table as a lookup table for another Delta table. If a lookup value gets added to the underlying table during the operation, will it be picked up in my lookup join?

At the beginning of each job, Delta will select a snapshot of the table that will be used for the entire duration of the job. Delta will always select the latest snapshot available when the job starts, but if the table changes during the execution of the job it will not see those changes.
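As a rough sketch of how you could make that behaviour explicit, assuming both tables are registered Delta tables in the metastore (the table and column names below are placeholders, not from the question), you can inspect the lookup table's versions and pin the join to one snapshot with Delta time travel in Spark SQL:
-- Each write to a Delta table creates a new version; this lists them.
DESCRIBE HISTORY lookup_table;

-- Pin the join to an explicit snapshot so rows added to lookup_table while
-- the job is running cannot appear in the result.
SELECT f.*, l.lookup_value
FROM fact_table AS f
LEFT JOIN lookup_table VERSION AS OF 42 AS l
  ON f.lookup_key = l.lookup_key;
Without the VERSION AS OF clause you simply get the latest snapshot available when the job starts, which is the behaviour described above.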

Related

What is the purpose of automatically creating statistics for a non-indexed column?

In SQL Server, creating an index automatically creates a statistics object for that index, which is used to decide the best query execution plan.
A statistics object is also automatically created for columns used in the WHERE clause - for example:
SELECT *
FROM AWSales
WHERE ProductID = 898
The above query automatically creates a statistics object for ProductID. What purpose does this serve?
Since the non-indexed column is unsorted and is not a B-tree structure, how do statistics help choose a better query plan than a table scan?
I thought the purpose of statistics was to allow the engine to choose between using an index or not; and whether to use seek or scan. What knowledge am I missing?
It serves the same purpose as the statistics created for an index. The optimizer uses the statistics for estimates to choose the best execution plan based on CPU time and I/O; the plan with the lowest cost is selected.
When the indexes on the table do not cover the column in the WHERE clause (ProductID, in your example), SQL Server creates statistics on the column to build a histogram and estimate row counts for the value you've supplied, unless it already has a cached plan.
In your execution plan you can see which statistics the engine used to pick the plan by viewing the properties on the SELECT object in the plan (the leftmost object). Expand the OptimizerStatsUsage property.
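As an illustration (assuming the table lives in the dbo schema; the statistics name below is only an example, since SQL Server generates the _WA_Sys_ names itself), you can list the statistics objects on the table and inspect the histogram behind the estimate:
-- List all statistics on the table; auto-created ones have auto_created = 1
-- and names that typically start with _WA_Sys_.
SELECT s.name AS stats_name, s.auto_created, c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
  ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'dbo.AWSales');

-- Show the histogram SQL Server built for one statistics object
-- (substitute the actual statistics name returned by the query above).
DBCC SHOW_STATISTICS (N'dbo.AWSales', N'_WA_Sys_00000002_0519C6AF') WITH HISTOGRAM;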

SSIS Lookup Transform use Table or Query

I have a Lookup Transformation on a table with 30 columns, but I am only using two columns: the ID column for the join and the Update column as output.
On the connection, should I enter a query Select ID, Update From T1 or use Table in the drop-down?
Using the table in the drop-down, would this be like doing Select * From T1, or is SSIS clever enough to know I only need 2 columns?
I'm thinking I should go with the Query Select ID, Update From T1.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
It is best to specify which columns you want.
Using table in Drop down, would this be like doing Select * From T1
Yes, it is a SELECT *.
or is SSIS clever enough to know I only need 2 columns?
Nope.
Keep in mind that Lookups are good for pulling data from Dimension Tables where the row count and record set is small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN, instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.
Lookups have the drawback of behaving like correlated sub-queries in that they fire off a query to the server for every row passing through them. You can have the Lookup cache the data, which means SSIS will store the results in memory and then check the memory before going to the server for all subsequent rows passing through the Lookup. As a result, this is only effective if there are a large number of matching records for a small cache set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up; at that point, caching the data is almost pointless.
This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
Handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees - lookups can be huge performance bottlenecks. Handled correctly, though, a Lookup can simplify the design of the data flow and speed development by removing the extra work required to MERGE JOIN data flows.
The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.
If you need only two columns from the lookup table, it is better to use a SELECT query than to pick the table from the drop-down list, but the columns specified must contain the primary key (ID). Reading all columns will consume more resources, even if the effect may not be noticeable on small tables.
You can refer to the following answer on database administrators community for more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Note that what @JWeezy mentioned about lookups against large tables is right. Lookups are not designed for large tables; I would use SQL JOINs instead.
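As a hedged sketch of that last point (the source table name dbo.SourceData is an assumption; ID, Update and T1 come from the question), the set-based T-SQL equivalent of the Lookup is a single join that reads only the two needed columns:
-- One set-based query instead of one lookup query per row.
-- [Update] is bracketed because UPDATE is a reserved word.
SELECT s.*,
       t.[Update]
FROM dbo.SourceData AS s
LEFT JOIN (SELECT ID, [Update] FROM dbo.T1) AS t
  ON t.ID = s.ID;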

Is using a wide temporal table with only one regularly updated column efficient?

I have been unable to pin down how temporal table histories are stored.
If you have a table with several columns of nvarchar data and one stock quantity column that is updated regularly, does SQL Server store copies of the static columns for each change made to stock quantity, or is there an object-oriented method of storing the data?
I want to include all columns in the history because it is possible there will be rare changes to the nvarchar columns, but I am wary of the history table size if millions of quantity updates duplicate the other columns.
I suggest that you use the SQL Server temporal table only for the values that need monitoring; otherwise the fixed, unchanging attribute values get duplicated with every change. SQL Server stores a whole new row in the history table whenever a row update occurs. See the docs:
UPDATES: On an UPDATE, the system stores the previous value of the row
in the history table and sets the value for the SysEndTime column to
the begin time of the current transaction (in the UTC time zone) based
on the system clock
You need to move your fixed nvarchar attributes/fields to another table and use a relation, 1:1 or whatever is suitable.
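A minimal sketch of that split, assuming a hypothetical Product table where only the stock quantity needs history (names and types are illustrative, not from the question):
-- Static, rarely changing attributes stay in a plain table.
CREATE TABLE dbo.Product
(
    ProductID   INT            NOT NULL PRIMARY KEY,
    Name        NVARCHAR(200)  NOT NULL,
    Description NVARCHAR(2000) NULL
);

-- Only the frequently updated quantity is system-versioned, so each update
-- copies a narrow row into the history table instead of the whole wide row.
CREATE TABLE dbo.ProductStock
(
    ProductID    INT NOT NULL PRIMARY KEY
                 REFERENCES dbo.Product (ProductID),  -- 1:1 relation to the static table
    Quantity     INT NOT NULL,
    SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductStockHistory));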
Check also other relevant questions under the temporal-tables tag:
SQL Server - Temporal Table - Storage costs
SQL Server Temporal Table Creating Duplicate Records
Duplicates in temporal history table

SQL Server - Temporal Table - Storage costs

Is there any information on the net where I can verify how high the storage costs are for the temporal tables feature?
Will the server create a full hard copy of the row/tuple that was modified?
Or will the server use references/links to the original values of the master table that are not modified?
For example: I have a row with 10 columns = 100 KB of storage. I change one value of that row, two times. After those changes I have two rows in the history table. Is the full storage cost for the master and history tables then ~300 KB?
Thanks for every hint!
Regards
Will the server create a full hard copy of the row/tuple that was modified? Or will the server use references/links to the original values of the master table that are not modified?
Here is a quote from the book Pro SQL Server Internals by Dmitri Korotkevitch that answers your question:
In a nutshell, each temporal table consists of two tables — the
current table with the current data, and a history table that stores
old versions of the rows. Every time you modify or delete data in
the current table, SQL Server adds an original version of those rows
to the history table.
A current table should always have a primary key defined. Moreover,
both current and history tables should have two datetime2 columns,
called period columns, that indicate the lifetime of the row. SQL
Server populates these columns automatically based on transaction
start time when the new versions of the rows were created. When a row
has been modified several times in one transaction, SQL Server does
not preserve uncommitted intermediary row versions in the history
table.
SQL Server places the history tables in a default filegroup, creating
non-unique clustered indexes on the two datetime2 columns that
control row lifetime. It does not create any other indexes on the
table.
In both the Enterprise and Developer Editions, history tables use
page compression by default.
So it's not
reference/links to the original values of the master table
The previous row version is simply copied as-is into the history table on every modification.
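If you want to verify the storage cost on your own system, one way (the table names here are placeholders) is to find the history table SQL Server linked to the current table and compare their sizes:
-- Find system-versioned tables and their history tables.
SELECT t.name AS current_table, h.name AS history_table
FROM sys.tables AS t
JOIN sys.tables AS h ON t.history_table_id = h.object_id
WHERE t.temporal_type = 2;  -- SYSTEM_VERSIONED_TEMPORAL_TABLE

-- Compare the space used by the current table and its history table.
EXEC sp_spaceused N'dbo.MasterTable';
EXEC sp_spaceused N'dbo.MasterTableHistory';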

How to update the origin table in the CDC workflow (via SSIS)?

I have a CDC process set up, whereby TableA's additional rows (or updates) are automatically picked up by an ETL and put into TableB:
TableA >>CDC>> TableB
The CDC works fine, except that I want to update the first table once the CDC process is finished. I want to update the table by populating it with the "extraction date". So my TableA has, let's say: Name, Age, OtherInfo, ExtractionDate. CDC is set up on the Name, Age and OtherInfo columns (the ExtractionDate column is excluded for obvious reasons).
Then, once CDC is performed on TableA and the data is taken to TableB, I'd like to populate TableA's ExtractionDate with the current date. However, given that I do not know which rows are being moved, I am having difficulty populating the column. Specifically, how can I make a "selective" WHERE clause to select the "changed" rows, when that is only known to SSIS?
In the TableA database there are system tables that were created as part of enabling CDC. You should be able to easily find the change table associated with TableA; this is where MSSQL keeps track of all the changes.
The __$start_lsn column identifies when the change was made, and your SSIS imports use this value to bring across a range of changes. The cdc.lsn_time_mapping table lets you look up the corresponding timestamp so it is easier to understand.
In my processing I store the start and end lsn values so I know what was brought across with each SSIS run. I could then use these lsn values to go back to this CDC source table and see all the changes that MSSQL has tracked during that time-span.
Keep in mind that the CDC system tables are automatically cleaned out every few days - so you wouldn't be able to apply this logic historically, only for recent imports.
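As a hedged sketch of how the stored LSN range could drive the update of TableA (this assumes CDC was enabled on dbo.TableA with the default capture instance dbo_TableA and with net changes enabled, and that Name identifies a row - adjust the join key to your real one):
-- LSN range covered by the last SSIS run (min/max used here only for illustration;
-- in practice use the start and end LSN values you stored for that run).
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_TableA');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- Stamp only the rows that CDC saw change in that range.
UPDATE a
SET a.ExtractionDate = SYSDATETIME()
FROM dbo.TableA AS a
JOIN cdc.fn_cdc_get_net_changes_dbo_TableA(@from_lsn, @to_lsn, N'all') AS c
  ON c.Name = a.Name;

-- An LSN can also be translated back to a readable time via cdc.lsn_time_mapping.
SELECT sys.fn_cdc_map_lsn_to_time(@to_lsn) AS mapped_time;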
