I have a Power BI report pulling from SQL Server that needs to be set up for incremental refresh because of the size of the data pull. As the load is fairly complex (and the Power Query editor is tedious and often breaks folding), I need to use a SQL query (a "native query" in Power BI speak) while retaining query folding (so that incremental refresh works).
I've been using the nice...
Value.NativeQuery( Source, query, null, [EnableFolding = true])
... trick found here to get that working.
BUT it only seems to work if the native query finishes fairly quickly. When my WHERE clause only pulls data for this year, it's no problem. When I remove the date filter from the WHERE clause (so as not to conflict with the incremental refresh filter), or simply push the year further back, it takes longer, and Power BI seemingly decides that:
"We cannot fold on top of this native query. Please modify the native query or remove the 'EnableFolding' option."
The error above comes up after a few minutes, both in the Power Query editor and if I try to "bypass" it by quickly clicking Close & Apply. And unfortunately, the underlying SQL is probably about as good as it gets due to our not-so-great data structures. I've tried tricking what seems to be a Power BI time-out with an OPTION (FAST 1) hint in the script, but it just can't return anything quickly enough.
I'm now stuck. This seems like a silly barrier, as all I need is for that first import to complete; it obviously can fold the query for the shorter loads. How do I work past this?
In retrospect, it's silly that I didn't try this initially, but even though the Value.NativeQuery() M step doesn't allow a command time-out option, you can still set one manually on a preceding Sql.Database() step and it carries forward. I also removed some common table expressions from my query, which were also breaking query folding (not sure why, but it was an easy fix: I saved the complex logic as a view in SQL Server itself and just joined to that). It takes a while to run now, but it doesn't time out!
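A minimal sketch of the pattern, with placeholder server, database, and object names (the CommandTimeout value is just an example):

    let
        // The time-out set here carries forward to the native query step,
        // since Value.NativeQuery() has no time-out option of its own.
        Source = Sql.Database("MyServer", "MyDatabase", [CommandTimeout = #duration(0, 2, 0, 0)]),
        // The CTE logic now lives in a view on the server, so the query stays foldable.
        Data = Value.NativeQuery(
            Source,
            "SELECT f.* FROM dbo.BigFactTable f JOIN dbo.ComplexLogicView v ON v.KeyCol = f.KeyCol",
            null,
            [EnableFolding = true]
        )
    in
        Data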
I've got an SSIS package - the primary function of which is to precalculate some data and invoke a parameterized SSRS report. The SSRS report has multiple datasets, which it retrieves through stored procedures. It takes around 2-2.5 seconds to generate.
When I loop through the report within the package, the loop obviously executes one report at a time. To speed up this process, I split up the dataset into two and tried passing each dataset into its own loop container. The problem is that although the loops process simultaneously, the step at which the report is generated (script task) halts the process for the other loop - that is, while one report is generating, the other waits.
Given this, it seems that SSRS locks and only allows one execution at a time. The profiler showed "sp_WriteLockSession" being invoked, but according to this it appears that that behaviour is by design. I've also read up on the "no lock" hint, but I'm not sure that's the route I want to go down either.
I'm not sure if I'm approaching this in the right way. Am I missing something? The only other thing I can think of is to create a second report and invoke that instead, but if it's locking due to the underlying datasets, then I'm really not sure what else to do. The datasets are primarily just SELECT statements, with a couple of them inserting one row into a single table at the very end.
I'd appreciate any advice, thanks in advance!
I have a controlling package to manage data transfers between two different data sources, so we can copy multiple tables and capture how many rows were transferred and the duration of each transfer.
The package was working as intended until yesterday, when I promoted a change to the controlling package to give us variables which determine whether each transfer requires the destination table to be truncated or not. The next day I saw the bizarre behaviour of the package log stating all rows had transferred, but the actual table counts were not what the log stated (see attached file). Thinking, as usual, that it might be something my change did, I reverted the change and, guess what, the same errors happened, but on some of the same tables and some different ones.
As it is the child packages doing the transfers and nothing has changed in them, has anyone got any ideas how this could possibly happen? We did have a SAN controller fail a week or so ago and we are now back on that same controller, but either way I cannot see how the log could say all rows were transferred and yet not all of the rows were actually moved.
If you need anything further please let me know as this has me well and truly baffled...
Process output - where you can see counts and log
Table Transfer Controller (as txt not dtsx)
An example of the data transfer child packages (as txt not dtsx)
regards,
Anthony
I have a problem with an application that runs on a machine in my factory (where I work, I mean).
Anyway, the application creates unique numbers for packages, which are tied to the unique box numbers they are loaded into. As such, they should always be unique numbers.
Now it was seen in the report file that two box numbers were identical, as were the components within them. That means two identical numbers were loaded into the database table from which the report is generated.
Now the coding part
The programmers of the application want to solve this by using the
SELECT DISTINCT sql command
from the database when generating the file, so that it will only ever write one version of the double-registered number. (This is because we don't know how the number was put into the database twice.) I don't want this solution because it only treats one known symptom and not the cause. There might be other effects that we are not aware of.
I have suggested that they use
INSERT WHERE NOT EXISTS sql command
so that the same item can never be registered twice.
Now they come back to me and say that that condition already exists...? I can't understand that; how is it possible?
The only scenario I can think of is that the INSERT WHERE NOT EXISTS is performed not just on the number field but on a combination of fields.
Is it possible that a command can fail in another way? I have no access to the code so I can't give an example of it, but my problem is a concrete one. I don't want to be railroaded by them, so that is why I am asking you guys.
It seems probable that this is a concurrency problem.
The statement
insert into [] where not exists []
could be submitted twice at the same time and run in parallel. Both statements check their condition before either has inserted, and then both insert. This can be made more likely by long-running transactions.
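A sketch of how that can happen, with placeholder table, column, and variable names:

    -- Two sessions can both evaluate the NOT EXISTS before either row is committed,
    -- so both INSERTs go through and the number ends up in the table twice.
    DECLARE @BoxNumber varchar(20) = 'BOX-000123';

    INSERT INTO dbo.Boxes (BoxNumber, RegisteredAt)
    SELECT @BoxNumber, SYSDATETIME()
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Boxes WHERE BoxNumber = @BoxNumber);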
To uncover the bug (something will fail somewhere), actually tell the database that this value is meant to be unique. First you would of course have to remove the duplicates; then you could, for example, use a
create unique index on [ ]
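For example, with placeholder table and column names:

    -- First find the duplicates that need cleaning up
    SELECT BoxNumber, COUNT(*) AS Occurrences
    FROM dbo.Boxes
    GROUP BY BoxNumber
    HAVING COUNT(*) > 1;

    -- Then let the database enforce uniqueness, so any future double insert fails loudly
    CREATE UNIQUE INDEX UX_Boxes_BoxNumber ON dbo.Boxes (BoxNumber);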
I am working on a project to add logging to our SSIS packages. I am doing my own custom logging by implementing some of the event handlers. I have implemented the OnInformation event to write the time, source name, and message to the log file. When data is moved from one table to another, the OnInformation event will give me a message such as:
component "TABLENAME" (1)" wrote 87 rows.
In the event that one of the rows fails, let's say only 85 rows were processed out of the expected 87, I would assume that the above line would read "wrote 85 rows". How do I track how many rows SHOULD HAVE been processed in this case? I would like to see something like "wrote 85 of 87 rows". Basically, I think I need to know how to get the number of rows returned by the source's query. Is there an easy way to do this?
Thank you
You can use the Row Count transformation after the data source and save the count to a variable. This will be the number of rows to be processed. Once the data has loaded into the destination, use an Execute SQL Task in the control flow with SELECT COUNT(*) FROM <<DestinationTable>> and save the count into another variable (you should use a WHERE clause in your query to identify the current load). Then you will have the number of rows processed for logging.
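A sketch of that second query (table and column names are placeholders; the ? parameter assumes an OLE DB connection in the Execute SQL Task and would be mapped to whatever identifies your current load):

    -- Run from an Execute SQL Task after the data flow and store the result
    -- in an SSIS variable (e.g. User::RowsLoaded) via the Result Set page.
    SELECT COUNT(*) AS RowsLoaded
    FROM dbo.DestinationTable
    WHERE LoadBatchId = ?;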
Hope this helps!
Not enough space in comments to provide feedback. Posting an incomplete answer as I need to leave for the day.
You are going to have trouble accomplishing what you are asking for. Based on your comments on Gowdhaman008's answer, the value of a variable is not visible outside of a data flow until after the finalizer event fires (OnPostExecute, I think). You can cheat and get that data out by making use of a script task to count rows passing through and firing off events, custom or predefined, to report package progress. In fact, just capture the OnPipelineRowsSent event. That will record how many rows are passing through a particular juncture and the time surrounding it (see the SSIS Performance Framework). Plus, you don't have to do any custom work or maintenance on your stuff. Out-of-the-box functionality is a definite win.
That said, you aren't really going to know how many rows are coming out of a source until it's finished. That sounds stupid and I completely agree but it's the truth. Imagine a simple case, an OLE DB Source that is going to send 1,000,000 rows straight into an OLE DB Destination. Most likely, not all 1M rows are going to start in the pipeline, maybe only 10k will be in the first buffer. Those buffers are pushed to the destination and now you know 10k rows out of 10k rows have been processed. Lather, rinse, repeat a few times and in this buffer, a row has a NULL where it shouldn't. Boom goes the dynamite and the process fails. We have had 60k rows flow into the pipeline and that's all we know about because of the failure.
The only way to ensure we have accounted for all the source rows is to put an asynchronous transformation into the pipeline to block all downstream components until all the data has arrived. This will obliterate any chance you have of getting good performance out of your packages. You'd still be subject to the aforementioned restrictions on updating variables, but your FireXEvent message would accurately describe how many rows could have been processed in the queue.
If you started an explicit transaction, you could do something ugly like an Execute SQL Task just to get the expected count, write that to a variable and then log rows processed, but then you're double-querying your data and you increase the likelihood of blocking on the source system because of the double pump. And that's only going to work for something database-like. The same concept would apply for a flat file, except now you'd need a script task to read all the rows first.
Where this gets uglier is with a slow-starting data source, like a web service. The default buffer size might cause the entire package to run much longer than it needs to simply because we are waiting on the data to arrive (see Slow starts).
What I'd do
I'd record my starting and error counts (and more) using the Row Count transformation. This will help you account for all the data that came in and where it went. I'd then turn on the OnPipelineRowsSent event to allow me to query the log and see how many rows are flowing through it RIGHT NOW.
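If the package logs to SQL Server through the built-in log provider, the entries land in dbo.sysssislog (dbo.sysdtslog90 on 2005) and can be queried while the package is still running; a rough sketch:

    -- The row count itself is embedded in the message text along with the path name,
    -- so this just surfaces the raw OnPipelineRowsSent entries as they arrive.
    SELECT source, starttime, message
    FROM dbo.sysssislog
    WHERE [event] = 'OnPipelineRowsSent'
    ORDER BY starttime DESC;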
What you want is the Row Count transformation. Just add that to your data flow after your source query, and assign its output to a variable. Then you can write that variable to your log file.
Here is what I currently do. It's super tedious, but it works.
1)
2) I have a constant "1" value on all of the records. They are literally all the same.
3) Using a multicast step, I send the data flow off in 2 directions. Despite all being the same, we still have to sort by that constant value.
4) Use an aggregate step to aggregate on that constant, then re-sort it in order to join with the bottom data flow (which holds all of the actual data records with no aggregation).
Doing this allows me to have my initial row count.
Later on, shown below, I use a conditional split step and do the same thing again after applying your condition. If the row count is the same, everything is fine and there are no problems.
If the row count is not the same, something is wrong.
This is the general idea for the approach for solving your problem without having to use another data flow step.
TLDR:
Get a row count for one of the conditions by using a multicast, a sort by some constant value, and an aggregation step.
Do a sort and merge to grab the row count.
Use a conditional split and do it again.
If the pre and post row counts are the same, do this.
If the pre and post row counts are not the same, do that.
This MAY help if you have a column which has no bad data. Add a second Flat File Source to the package, use the same connection as your existing file source, choose the first column only, and direct the output to a Row Count transformation.