SSIS Lookup Failure - sql-server

In an SSIS data flow there is a Lookup component that looks up against a table with 18 million records. I have configured the lookup with full cache.
Default buffer size: 20485760
Default Buffer Max rows: 100000
The lookup join is based on an ID column of type varchar(13).
It gives the error shown below. What lookup configuration is suitable for caching this many records?
Error: The buffer manager cannot write 8 bytes to file "C:\Users\usrname\AppData\Local\Temp\16\DTS{B98CD347-1EF1-4BC1-9DD9-C1B3AB2B8D73}.tmp". There was insufficient disk space or quota.
What would be the difference in performance if I use a lookup with no cache?
I understand that in full cache mode the data is cached before the pre-execute stage and the component does not have to go back to the database. The full cache takes a large amount of memory and adds additional startup time to the data flow. My question is: what configuration do I have to set up in order to handle a large amount of data in full cache mode?
What's the solution if the lookup table has millions of records and they don't fit in a full cache?

Use a Merge Join component instead. Sort both inputs on the join key, and choose an inner/left/full join depending on your requirements. Use the different outputs to get the same functionality as the Lookup component; a sketch of the sorted source queries is shown below.
Merge Join usually performs better on larger datasets.
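As a minimal sketch (table and column names here are hypothetical), both source queries can be sorted in the database rather than with a Sort transformation; you then mark each source output as IsSorted and set SortKeyPosition on the ID column so the Merge Join accepts them:
-- Left input: rows to be looked up, sorted on the join key in the database
SELECT src.ID, src.SomeColumn
FROM dbo.SourceTable AS src
ORDER BY src.ID;

-- Right input: the 18-million-row reference table, also sorted on the join key
SELECT ref.ID, ref.LookupValue
FROM dbo.ReferenceTable AS ref
ORDER BY ref.ID;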

You can set the BufferTempStoragePath (and BLOBTempStoragePath) property of the Data Flow Task to a folder on one of your faster drives with plenty of free space. By default both properties fall back to the TEMP and TMP environment variables, so if the TEMP drive cannot hold the large dataset spilled by the lookup, the data flow fails as in your case. Pointing these properties at a larger drive lets the spill use that drive's space and the job can complete.

Related

Snowflake query performance is unexpectedly slower for external Parquet tables vs. internal tables

When I run queries on external Parquet tables in Snowflake, the queries are orders of magnitude slower than on the same tables copied into Snowflake or with any other cloud data warehouse I have tested on the same files.
Context:
I have tables belonging to the 10TB TPC-DS dataset in Parquet format on GCS and a Snowflake account in the same region (US Central). I have loaded those tables into Snowflake using create as select. I can run TPC-DS queries (here #28) on these internal tables with excellent performance. I was also able to query those files on GCS directly with data lake engines with excellent performance, as the files are "optimally" sized and internally sorted. However, when I query the same external tables on Snowflake, the query does not seem to finish in a reasonable time (>4 minutes and counting, as opposed to 30 seconds, on the same virtual warehouse). Looking at the query profile, it seems that the number of records read in the table scans keeps growing indefinitely, resulting in a proportional amount of spilling to disk.
The table happens to be partitioned, but that does not matter for the query of interest (which I tested with other engines).
What I would expect:
Assuming proper data "formatting", I would expect no major performance degradation compared to internal tables, as the setup is technically the same - data stored in columnar format in cloud object store - and as it is advertised as such by Snowflake. For example I saw no performance degradation with BigQuery on the exact same experiment.
Other than double-checking my setup, I don't see many things to try...
This is what the "in progress" part of the plan looks like 4 minutes into execution on the external table. All other operators are at 0% progress. You can see that external bytes scanned equals bytes spilled and that 26G (!) rows are produced. And this is what it looked like for a finished execution on the internal table, which ran in ~20 seconds. You can see that the left-most table scan should produce 1.4G rows, but with the external table it had already produced 23G rows.
This is a sample of the DDL I used (I also tested without defining the partitioning column):
create or replace external table tpc_db.tpc_ds.store_sales (
ss_sold_date_sk bigint as
    cast(split_part(split_part(metadata$filename, '/', 4), '=', 2) as bigint),
ss_sold_time_sk bigint as (value:ss_sold_time_sk::bigint),
ss_item_sk bigint as (value:ss_item_sk::bigint),
ss_customer_sk bigint as (value:ss_customer_sk::bigint),
ss_cdemo_sk bigint as (value:ss_cdemo_sk::bigint),
ss_hdemo_sk bigint as (value:ss_hdemo_sk::bigint),
ss_addr_sk bigint as (value:ss_addr_sk::bigint),
ss_store_sk bigint as (value:ss_store_sk::bigint),
ss_promo_sk bigint as (value:ss_promo_sk::bigint),
ss_ticket_number bigint as (value:ss_ticket_number::bigint),
ss_quantity bigint as (value:ss_quantity::bigint),
ss_wholesale_cost double as (value:ss_wholesale_cost::double),
ss_list_price double as (value:ss_list_price::double),
ss_sales_price double as (value:ss_sales_price::double),
ss_ext_discount_amt double as (value:ss_ext_discount_amt::double),
ss_ext_sales_price double as (value:ss_ext_sales_price::double),
ss_ext_wholesale_cost double as (value:ss_ext_wholesale_cost::double),
ss_ext_list_price double as (value:ss_ext_list_price::double),
ss_ext_tax double as (value:ss_ext_tax::double),
ss_coupon_amt double as (value:ss_coupon_amt::double),
ss_net_paid double as (value:ss_net_paid::double),
ss_net_paid_inc_tax double as (value:ss_net_paid_inc_tax::double),
ss_net_profit double as (value:ss_net_profit::double)
)
partition by (ss_sold_date_sk)
with location = @tpc_ds/store_sales/
file_format = (type = parquet)
auto_refresh = false
pattern = '.*sales.*[.]parquet';
Snowflake didn't originally have external file queries, nor did it originally have Parquet support. As I recall, when external queries arrived it was a simple "read 100% of all the files into the system and then start processing", which aligns with what you are seeing. This was still a blessing, because the prior state was having to load all files into a staging table and then run a filter on that; for a one-off query (though it almost never stays one-off in the end), executing SQL against the raw files was rather helpful.
Yes, it should be possible to optimize Parquet file reads, gather the metadata first and then eliminate wasteful partition reads, but that is not the order of evolution. So I am not surprised by your findings.
I would not suggest using an external data model for general day-to-day Snowflake operations, as it is not presently optimized for that. Two reasons: the cost of storing the data in Snowflake is the same as storing it in S3, and with internal tables Snowflake has complete control over metadata and read/write sync between nodes, all of which amounts to performance.
Also, spilling to local storage is not bad per se; spilling to remote storage is the worst kind of spilling. But it does appear that you are getting the effective result of a full file import followed by processing.
Probably the Snowflake plan assumes it must read every Parquet file because it cannot tell beforehand whether the files are sorted, or what the number of unique values, nulls, and minimum and maximum values for each column are.
This information is stored as an optional field in Parquet, but you would need to read the Parquet metadata first to find out.
When Snowflake uses internal tables, it has full control over storage, has information about indexes (if any) and column stats, and knows how to optimize a query from both a logical and a physical perspective.
I don't know what your query looks like, but there is also a small chance that you suffer from a known issue where Snowflake interprets a function in the partition filter as dynamic and thus scans all the data; see the details in Joao Marques' blog on Medium: Using Snowflake External Tables? You must read this!.
Example of how not to do it from the blog
SELECT COUNT(*)
FROM EXTERNAL_TABLE
WHERE PARTITION_KEY = dateadd(day, -1, current_date)
Example of how to do it from the blog
SET my_var = (select dateadd(day, -1, current_date));
SELECT COUNT(*)
FROM EXTERNAL_TABLE
WHERE PARTITION_KEY = $my_var
All credit to the blog author; I merely stumbled across this issue myself and found his blog.

Lookup transformation in SSIS bug: doesn't prevent insertion of already existing record

I have an SSIS package that reads from a source and performs a lookup transformation to check whether the record exists in the destination; if it exists it is redirected to the match output and updated, otherwise it goes to the no-match output and is inserted. The problem is that sometimes it inserts a record that should have been redirected for update. This runs as a job; if I execute the package manually, everything is fine. The lookup component is set up correctly with the matching column.
I can't find out why this happens, and the silliest thing is that I cannot debug it, because manually everything works.
Any ideas?
There are two common scenarios where you get inserts that should have been updates.
Duplicate source values
The first is that you have duplicate keys in your source data and nothing in the target table.
Source data
Key|Value
A |abc
B |bcd
A |cde
Destination data
Key|Value
C |yza
B |zab
In this situation, assuming the default behaviour of the Lookup component (full cache), SSIS will run the source query for the lookup reference table before the package begins execution. Only once all the lookup table data has been cached will the data flow begin flowing data.
The first row, A:abc hits the lookup. Nope, no data and off to the Insert path.
The second row B:bcd hits the lookup. Nope, no data and off to the Insert path.
The third row A:cde hits the lookup. Nope, no data and off to the Insert path (and hopefully a primary/unique key violation)
When the package started, it only knew about data in the destination table. During the run you added the same key value to the table, but you never asked the lookup component to check for it again.
In this situation, there are two resolutions. The first is to change the cache mode from Full to None (or Partial). This will have the lookup component issue a query against the target table for every row that flows through the data flow, which can get expensive for large row volumes. It also isn't foolproof, because the data flow has the concept of buffers, and in a situation like our sample 3-row load, everything fits in one buffer. All the rows in the buffer hit the Lookup at approximately the same time, so the target table still will not contain an A value when the third row flows through the component. You can put the brakes on the data flow and force it to process one row at a time by setting the buffer size to 1, but that is generally not a good solution.
The other resolution is to dedupe the data and handle survivorship. Which A row should make it to the database when the source has different values for the same business key: first, last, pick one? If you can't eliminate the duplicates before they hit the data flow, you'll need to deduplicate the data, for example with an Aggregate component to roll your data up as best you can; a SQL sketch of the same idea is below.
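If the source is a relational query, one way to pick a single survivor per key before the data flow is a ROW_NUMBER-based dedup (a rough sketch; the table name, the LoadDate ordering column, and the survivorship rule are hypothetical):
-- Keep one row per business key; the ORDER BY decides which duplicate survives
WITH ranked AS (
    SELECT src.[Key], src.[Value],
           ROW_NUMBER() OVER (PARTITION BY src.[Key]
                              ORDER BY src.LoadDate DESC) AS rn
    FROM dbo.SourceTable AS src
)
SELECT [Key], [Value]
FROM ranked
WHERE rn = 1;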
Case sensitive lookups
Source data
Key|Value
A |abc
B |bcd
a |cde
Destination data
Key|Value
C |yza
B |zab
The other scenario where the Lookup component bites you is that in the default Full Cache mode, matching is based on .NET matching rules for strings. Thus AAA is not equal to AaA. If your lookup is doing string matching, then even if your database is case insensitive, the SSIS lookup will not be.
In situations where I need to match alpha data, I usually make an extra/duplicate column in my source data which is the key data all in upper or lower case. If I am querying the data, I add it to my query. If I am loading from a flat file, then I use a Derived Column Component to add my column to the data flow.
I then ensure the data in my reference table is cased the same way when I use the Lookup component; a sketch of both queries is below.
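For example (a minimal sketch with hypothetical table and column names), both the source query and the Lookup reference query carry an upper-cased copy of the key, and the Lookup matches on that column:
-- Source query: add an upper-cased matching column
SELECT s.BusinessKey,
       UPPER(s.BusinessKey) AS BusinessKeyUpper,
       s.SomeValue
FROM dbo.SourceTable AS s;

-- Lookup reference query: same upper-cased column, selecting only what is needed
SELECT UPPER(r.BusinessKey) AS BusinessKeyUpper,
       r.SurrogateKey
FROM dbo.DestinationTable AS r;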
Lookup Component caveats
Full Cache mode:
- insensitive to changes to the reference data
- Case sensitive matches
- generally faster overall
- delays data flow until the lookup data has been cached
- NULL matches empty string
- Cached data requires RAM
No Cache mode:
- sensitive to changes in the reference
- Case sensitivity matching is based on the rules of the lookup system (DB is case sensitive, you're getting a sensitive match)
- It depends (100 rows of source data and 1,000 rows of reference data: no one will notice. 1B rows of source data and 10B rows of reference data: someone will notice. Are there indexes to support the lookups, etc.)
- NULL matches nothing
- No appreciable SSIS memory overhead
Partial Cache:
The partial cache is mostly like the No Cache option, except that once it gets a match against the reference table, it caches that value until execution is over or until it gets pushed out due to memory pressure.
Lookup Cache NULL answer

Oracle Histogram and reading the wrong index

I have 2 databases: the main database, which many users work on, and a test database that is loaded from a dump of the main DB.
I have a select query with join conditions and UNION ALL on a table TAB11 that contains 40 million rows.
The problem is that the query uses the wrong index in the main DB, but in the test DB it uses the correct index. Note that both have up-to-date gathered statistics on the table and the same row count. I started to dig into histograms and skewed data and noticed that in the main DB the table has histograms created on 37 of its columns, whereas in the test DB only 14 columns have histograms. So apparently those histograms are affecting the query plan and causing the wrong index to be used (right?). (Those histograms were created by Oracle automatically, not by anyone.)
My questions:
- Should I remove the histograms from those columns, so that when I gather statistics again Oracle creates only the needed ones and the query uses the right index? I am afraid it will affect the performance of the table.
- Should I add method_opt=>'for all columns size skewonly' when I gather table statistics? I am not sure whether the data is skewed or not.
- Should I gather index statistics on the desired index so that Oracle might use it?
How do I make the query use the right index, without dropping the other index or forcing the index with a hint?
There are too many possible reasons for choosing a different index in one DB vs. another (including object life-cycle differences, e.g. when data gets loaded, deletions/truncations/inserts, stats gathering, index rebuilds, ...). Having said that, in cases like this I usually do a parameter-by-parameter comparison of the initialization parameters on each DB, and also an object-by-object comparison (you've already observed a delta in the histograms; there may be others impacting this as well); a couple of queries for that comparison are sketched below.
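As a rough sketch of that comparison (the schema name is a placeholder), you can list the histogram delta in each database, and, if you decide to remove the automatically created histograms, regather with SIZE 1, which replaces the column stats without histograms. Test this on the test DB first:
-- Compare which columns have histograms in each database
SELECT column_name, histogram, num_buckets
FROM   dba_tab_col_statistics
WHERE  owner = 'MYSCHEMA'
AND    table_name = 'TAB11'
ORDER  BY column_name;

-- Regather column stats without histograms
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'MYSCHEMA',
    tabname    => 'TAB11',
    method_opt => 'FOR ALL COLUMNS SIZE 1');
END;
/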

Large file, SSIS to Small chunks, parallel enrichment

I have a pretty large file, 10GB in size, and I need to load the records into the DB.
I want to have two additional columns:
LoadId, which is a constant (this indicates the file's unique load number)
ChunkNumber, which would indicate the chunk within the batch size.
So if I have a batch size of 10,000 records I want
LoadId = {GUID}
ChunkNumber = 1
For the next 10,000 records I want
LoadId = {GUID}
ChunkNumber = 2
Is this possible in SSIS? I suppose I could write a custom component for this, but there should be a built-in ID I could use, since SSIS is already running things in batches of 10,000.
Can someone help me figure out whether this parameter exists and whether it can be used?
OK, a little more detail on the background of what and why.
Once we get the data into slices of 10,000 records, we can start calling stored procedures to enrich the data in chunks. All I am trying to do is see whether SSIS can help here by stamping a chunk number and a GUID.
This helps the stored proc move the data in chunks. Although I could do this after the fact with a row number, the SELECT would have to travel through the whole set again to update the chunk numbers, which is double the effort.
A GUID will represent the complete dataset, and the individual chunks are related to it.
Some more insight: there is a working table we import this large file into, and if we start enriching all the data at once the transaction log gets used up. It is more manageable if we can get the data into chunks, so the transaction log does not blow up and we can also parallelize the enrichment process.
The data moves from de-normalized format to normalized format from here. A stored procedure is more maintainable in terms of release and day-to-day management, so any help is appreciated.
Or is there another, better way of dealing with this?
For the LoadID, you could use the SSIS variable
System::ExecutionInstanceGUID
which is generated by SSIS when the package runs.
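For the ChunkNumber, as far as I know there is no built-in batch counter exposed to the data flow, so one option is the after-the-fact row-number approach you mention, done as a single set-based pass rather than per chunk. A rough sketch (dbo.WorkingTable, RowId, and the 10,000 batch size are assumptions):
-- Stamp LoadId and ChunkNumber after the bulk load, in one set-based UPDATE
DECLARE @LoadId UNIQUEIDENTIFIER = NEWID();  -- or pass in System::ExecutionInstanceGUID

;WITH numbered AS (
    SELECT RowId,
           ROW_NUMBER() OVER (ORDER BY RowId) AS rn
    FROM dbo.WorkingTable
)
UPDATE w
SET    w.LoadId      = @LoadId,
       w.ChunkNumber = ((n.rn - 1) / 10000) + 1
FROM   dbo.WorkingTable AS w
JOIN   numbered        AS n ON n.RowId = w.RowId;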

SSIS - out of memory error again

I have about 25 databases which I need to consolidate into one database. First I tried to build an SSIS package that would copy all data from each table into one place, but then I got this error:
Information: The buffer manager failed a memory allocation call for
10485760 bytes, but was unable to swap out any buffers to relieve
memory pressure. 1892 buffers were considered and 1892 were locked.
Either not enough memory is available to the pipeline because not
enough are installed, other processes were using it, or too many
buffers are locked.
Then I realized this was not a good idea and that I only need to insert new records and update existing ones. After that I tried this approach:
Get a list of all connection strings
For each DB, copy new records and update existing ones (those which need to be updated are copied from the source to a temp table, deleted from the destination, and then copied from the temp table to the destination table); a rough SQL equivalent of that pattern is sketched below.
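For reference, the staged update pattern described above corresponds roughly to the following T-SQL (database, table, and key names are hypothetical, and dbo.TempChanged is an assumed staging table):
-- 1. Stage rows that exist in both databases and have changed
INSERT INTO dbo.TempChanged (Id, Col1, Col2)
SELECT s.Id, s.Col1, s.Col2
FROM   SourceDb.dbo.SomeTable AS s
JOIN   DestDb.dbo.SomeTable   AS d ON d.Id = s.Id
WHERE  s.Col1 <> d.Col1 OR s.Col2 <> d.Col2;

-- 2. Delete the stale versions from the destination
DELETE d
FROM   DestDb.dbo.SomeTable AS d
JOIN   dbo.TempChanged      AS t ON t.Id = d.Id;

-- 3. Copy the staged rows back into the destination
INSERT INTO DestDb.dbo.SomeTable (Id, Col1, Col2)
SELECT Id, Col1, Col2 FROM dbo.TempChanged;

-- 4. Insert records that are entirely new
INSERT INTO DestDb.dbo.SomeTable (Id, Col1, Col2)
SELECT s.Id, s.Col1, s.Col2
FROM   SourceDb.dbo.SomeTable AS s
WHERE  NOT EXISTS (SELECT 1 FROM DestDb.dbo.SomeTable AS d WHERE d.Id = s.Id);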
Here's what the data flow task looks like:
In some cases the data flow processes more than a million rows. BUT I still get the same error - it runs out of memory.
In Task Manager the situation is the following:
I have to note that there are 28 databases being replicated on this same server, and even when this package is not running SQL Server is still using more than 1 GB of memory. I've read that this is normal, but now I'm not so sure...
I have installed hotfix for SQL Server I've found in this article: http://support.microsoft.com/kb/977190
But it doesn't help...
Am I doing something wrong, or is this just the way things work and I am supposed to find a workaround?
Thanks,
Ile
You might run into memory issues if your Lookup transformation is set to Full cache. From what I have seen, the Merge Join performs better than the Lookup transformation when the number of rows exceeds 10 million.
Have a look at the following, where I have explained the differences between Merge Join and the Lookup transformation:
What are the differences between Merge Join and Lookup transformations in SSIS?
I found a solution, and the problem was in SQL Server - it was consuming too much memory. Max server memory was set to 2147483647 (the default value). Since my server has 4 GB of RAM, I limited this to 1100 MB. Since then there have been no memory problems, but my data flow tasks were still very slow. That problem was in using the Lookup: by default the Lookup selects everything from the lookup table. I changed this and selected only the columns I need for the lookup, which sped the process up several times; both changes are sketched below.
Now the whole process of consolidation takes about 1:15h.
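For reference, the two changes correspond roughly to the following (the 1100 MB value and the lookup column list are specific to this setup and only an illustration):
-- Cap SQL Server's memory so the SSIS process has room to work
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 1100;
RECONFIGURE;

-- Lookup reference query: select only the join key and the columns actually needed,
-- instead of letting the Lookup cache the whole table
SELECT Id, BusinessKey
FROM dbo.DestinationTable;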
