I have a task to import, transform, and extract zipped binary files that contain both text data and embedded binary data. The data is relational in nature and needs to be processed into a defined database structure. Currently I have a single-threaded C# app that grabs all the files from the directory (currently there are 13K files of varying sizes), extracts the data, and inserts it into the database line by line on a single thread. As you can imagine, this is very slow and unacceptable. There are several different parsing routines, chosen by the header record in the file. There are potentially up to a million rows per file when all the data is extracted to the row level of detail. The follow-on task is to parse those rows into their appropriate tables based on their content, i.e. the textual content has to be parsed further into "buckets" of like data in the database. That about sums up the big picture. Now for the problem task list.
How do I iterate through a packet of data using SSIS? In the app the file is decompressed, parsed using streams and byte arrays, and routed to the required parsing routine based on the header data of each packet. There is bit swapping involved as well. Should I wrap up the app code into one or more script tasks and let it do the custom processing? The data is separated by year and the SQL Server tables are partitioned by year as well. I also need to be able to "catch" bad file data and most likely process it by hand.
Should I simply load the zipped file into SQL as a blob and parse the file with T-SQL? Would that be multi-threaded if done that way? I am not sure how to do the parsing involved here in T-SQL. Which do you think would be faster?
Potentially, the data that is currently processed via files could come to us via a socket. Can SSIS collect that data in real time? How would I go about setting that up?
Processing these new files from the directories will become a daily task.
I can manage the data once I get it to SQL Server. Getting it there in a timely fashion seems to be the long pole in the tent for me. I would appreciate any comments or suggestions from the group.
Rick
I think you are out of luck here - SSIS is just not the tool for that. Binary manipulation is not what they had in mind when they were conceptualizing it. At its core, SSIS is a tool for ETL processes that load data warehouses with all kinds of data.
SSIS will work just fine. You can improve the process by not grabbing all 13k files in a single task. Instead, pull in the files round robin, splitting them based on the number of CPUs you have on your SSIS box. If the C# app can be slimmed down, you can put it into a script task. I have a framework that will enable you to move the files in parallel. I use it to move .pdf files into SQL Server. If you send me your email I will forward it to you.
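The round-robin split suggested above can be sketched outside SSIS as follows. This is only an illustration in Python, not the framework mentioned in the answer: `round_robin`, `process_bucket`, and `run` are hypothetical names, and the per-file work is a placeholder for the real decompress/parse/insert step.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def round_robin(items, n_buckets):
    """Deal items into n_buckets lists one at a time, like dealing cards."""
    buckets = [[] for _ in range(n_buckets)]
    for index, item in enumerate(items):
        buckets[index % n_buckets].append(item)
    return buckets


def process_bucket(paths):
    # Placeholder (assumption): the real work would decompress, parse and
    # bulk-insert each file. Here we only count the files handled.
    return len(paths)


def run(paths, workers=None):
    """Split the files across one bucket per worker and process in parallel."""
    workers = workers or os.cpu_count() or 1
    buckets = round_robin(sorted(paths), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_bucket, buckets))


demo_buckets = round_robin(list(range(7)), 3)
demo_counts = run([f"file{n:05d}.zip" for n in range(13)], workers=4)
```

Each worker gets roughly the same number of files, so no single task has to walk all 13k of them.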
I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO, and the incoming file is split into two files based on that answer.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment) to control the number of scatter/worker orchestrations I spin up. The workers call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the Gather/Aggregator orchestration, which collects all the messages processed by the workers from the messagebox via correlation and creates two files to be routed to Agency A and Agency B.
So every file that gets dropped will have its own set of workers and an aggregator to process it.
This works well for files with fewer records, but if a file has over 100k records, I see throttling and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and does not really aggregate the records processed by the workers until all of them are processed, and I think that since the ratio of msgs published to msgs processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating the records in an orchestration, I pushed the processed records to a SQL db and 'split' them into two XML files (basically concatenating the msgs going to Agency A/B, wrapping them in an XML declaration, and using the correct msg type by writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goalposts have moved again with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing (duh):
It looks like when each worker had fewer records in its queue/subscription, it finished its queue quickly. When testing this 100k record file, using 100 workers completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1000 calls; if so, maybe the existing solution would scale, based on my observations.
I have adjusted a few host settings for message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on which counters to monitor.
The post is a bit long, but I am hoping this gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
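The four steps above can be sketched with a staging table. This is a minimal illustration only: it uses Python's built-in `sqlite3` as a stand-in for SQL Server, a hypothetical `lookup` function in place of the real web service, and made-up table and column names.

```python
import sqlite3


def lookup(record_id):
    # Hypothetical stand-in for the web service call: even ids answer "Y".
    return "Y" if record_id % 2 == 0 else "N"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT, result TEXT)"
)

# Step 1: bulk load the incoming records (SSIS would do this part).
conn.executemany(
    "INSERT INTO staging (id, payload) VALUES (?, ?)",
    [(n, f"record-{n}") for n in range(1, 7)],
)

# Steps 2 and 3: drain the pending records, call the service, store the answer.
pending = conn.execute("SELECT id FROM staging WHERE result IS NULL").fetchall()
conn.executemany(
    "UPDATE staging SET result = ? WHERE id = ?",
    [(lookup(record_id), record_id) for (record_id,) in pending],
)


# Step 4: query out each batch as one large message.
def batch(flag):
    rows = conn.execute(
        "SELECT payload FROM staging WHERE result = ? ORDER BY id", (flag,)
    )
    return "\n".join(payload for (payload,) in rows)


yes_batch = batch("Y")
no_batch = batch("N")
```

Because the result lives in the table, a failed run can resume from the rows whose `result` is still NULL instead of starting over.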
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider following approach:
De-batch the file and store individual records to MSMQ. You can easily achieve this without any extra coding effort, all you need is to create a send port using MSMQ adapter or WCF custom with netmsmq binding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging-only scenarios; you can handle the service response using a pipeline component if required. You can use a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of one message, without any complex pattern.
You can then push the messages to two MSMQ queues, one per agency, based on the web service response.
You can then receive those messages again and write them to file. You can simply use a send port with the FileAppend option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need a complex orchestration pattern, which usually ends up having many persistence points.
If the web service becomes a bottleneck, you can control the rate at which messages are received from MSMQ using 1) Ordered Delivery on the MSMQ receive location and, if required, 2) BizTalk host throttling, by changing two properties: set Message Count in Db to a very low number (e.g. 1000 instead of the 50K default) and increase Spool and Tracking Data Multiplier accordingly (e.g. 500 instead of the default 10), so that the product of the two numbers stays high enough (1000 x 500 = 500,000, the same as the default 50,000 x 10) not to cause throttling due to messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it through add/remove features. You can also use IBM MQ if your organization has the infrastructure. But for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service belongs to the same organization and network (although it uses a different stack), they have agreed to let us query the lookup table their web service is based on. We use a 'merge' between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BT. A 1.5 million record file is now processed, and the split files sent, within a couple of minutes.
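A set-based version of that Y/N split might look like the sketch below. It is an assumption-laden illustration: Python's `sqlite3` stands in for SQL Server, a LEFT JOIN stands in for the 'merge' described above, and the table and column names (`incoming`, `lookup`, `record_id`) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE incoming (record_id TEXT PRIMARY KEY, line TEXT);
    CREATE TABLE lookup (record_id TEXT PRIMARY KEY);
    """
)
conn.executemany(
    "INSERT INTO incoming VALUES (?, ?)",
    [("A1", "line one"), ("B2", "line two"), ("C3", "line three")],
)
conn.executemany("INSERT INTO lookup VALUES (?)", [("A1",), ("C3",)])

# One pass over both tables: a record is 'Y' when the lookup table knows it.
rows = conn.execute(
    """
    SELECT i.line,
           CASE WHEN l.record_id IS NULL THEN 'N' ELSE 'Y' END AS flag
    FROM incoming AS i
    LEFT JOIN lookup AS l ON l.record_id = i.record_id
    ORDER BY i.record_id
    """
).fetchall()

# Split the result into the two outgoing files' contents.
split = {"Y": [], "N": []}
for line, flag in rows:
    split[flag].append(line)
```

Replacing a million web service calls with one join is what makes the run time drop from hours to minutes.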
Appreciate all the advice provided here.
We're using NServiceBus 4.6 with SQL Server transport (2012). The SQL Server database is set up for high availability using Availability Groups.
Our DBAs are complaining about the amount of "churn" [a] we have happening in the transport database, and especially the load that this is placing on our WAN.
We are currently using XML Serialization, so I started looking at the other serialization options that are available to us (would probably favour JSON so that it's still readable). However, in starting to look into this, I've realised that our message bodies are typically between 600 and 1000 bytes, whereas our message headers are regularly in the range of 1200 - 1800 bytes [1]. So, even if I achieve a great saving in terms of body sizes, it's not going to produce the large scale improvements that I'm looking for.
The Question
Given, as I understand it, that the headers don't have to be readable when the messages are stored in the SQL Server database, is there any way that I can compress them?
Or other strategies to reduce the amount of data that we're adding and deleting from this database? (Whilst staying on NSB 4.6 for now)
[1] We are adding a few custom headers ourselves for metadata that really doesn't belong in the message classes.
[a] Since every message at least goes into a queue table, then is removed from that table and is placed in the audit table, before we later remove older audit entries, and we've got a lot of messages, we're putting a lot in the SQL Server transaction log.
You can compress and decompress the message content via a Mutator. The mutator example is actually based on compressing a message, so it should be an easy solution for that part:
http://docs.particular.net/samples/messagemutators/#code-walk-through-transportmessagecompressionmutator
You can probably add some code that will do the same with the headers, compressing your custom attributes before writing them and decompressing them before reading them.
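The compress-before-write, decompress-before-read idea for headers can be illustrated independently of NServiceBus. The sketch below is Python, not the mutator API: `pack_headers` and `unpack_headers` are hypothetical helpers, and the sample header names and values are made up. How much is saved depends entirely on how repetitive the real headers are.

```python
import base64
import json
import zlib


def pack_headers(headers):
    """Serialize the custom headers, compress them, and make them text-safe."""
    raw = json.dumps(headers, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def unpack_headers(packed):
    """Reverse pack_headers: decode, decompress, and parse back to a dict."""
    raw = zlib.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))


# Made-up custom headers, deliberately repetitive for the demonstration.
headers = {
    f"Custom.Header{n}": "tenant=example;region=eu-west;trace=0000"
    for n in range(20)
}
packed = pack_headers(headers)
restored = unpack_headers(packed)
```

The packed string would travel as a single header value, with the receiving side unpacking it before any handler reads the individual entries.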
We have four BizTalk servers in the production environment. The send port is configured to write incoming messages to one text file. This port receives thousands of messages a day, so multiple host instances try to write to the file at the same time: before one instance finishes writing a complete record, another instance starts writing a new record, leaving data scattered all over the file.
What can we do to resolve this issue?
...before one instance finishes writing a complete record, another instance starts writing a new record, leaving data scattered all over the file.
What can we do to resolve this issue?
The easy way is to use only a single Host Instance to write data to the file; however, you may then start to experience throttling issues. Alternatively, you could explore the 'Allow Cache on write' option on the File Adapter, which may offer some improvement.
However, I think your approach is wrong. You cannot expect four separate and totally disconnected processes (across 4 servers, no less) to reliably append to a single file - IN ORDER.
I therefore think you should look at re-architecting this solution:
As each message is received, write the contents of the message to a database table (a simple INSERT) with an 'unprocessed' flag. You can reliably have four Host Instances banging data into SQL without fear of them tripping over each other.
At a scheduled time, have BizTalk extract all of the records that are marked as unprocessed in that SQL Table (the WCF-SQL Adapter can help you here). Once you have polled the records, mark them as 'in-process'.
You should now have a single message containing all of the currently unprocessed records (as retrieved from SQL). Using a single (or multiple) Host Instance/s, write the message to disk, appending each of the records to the file in a single write. The key here is that you are only writing a single message to the one file, not lots and lots and lots :-)
If the write is successful, update each of the records in the SQL table with a 'processed' flag so they are not picked-up again on the next poll.
You might want to consider a singleton orchestration for this piece to ensure that there is only ever one poll-write-update process taking place at the same time.
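The poll-write-update cycle above can be sketched as follows. This is an illustration under stated assumptions, not BizTalk configuration: Python's `sqlite3` stands in for SQL Server and the WCF-SQL adapter, and the `outbox` table, its `status` column, and the `receive`/`flush` functions are hypothetical names.

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "body TEXT NOT NULL, "
    "status TEXT NOT NULL DEFAULT 'unprocessed')"
)


def receive(body):
    """Any number of writers can insert safely; the database serializes them."""
    with conn:
        conn.execute("INSERT INTO outbox (body) VALUES (?)", (body,))


def flush(path):
    """Poll unprocessed rows, append them in ONE write, then mark them done."""
    rows = conn.execute(
        "SELECT id, body FROM outbox WHERE status = 'unprocessed' ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    ids = [(row_id,) for row_id, _ in rows]
    with conn:
        conn.executemany("UPDATE outbox SET status = 'in-process' WHERE id = ?", ids)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write("".join(body + "\n" for _, body in rows))
    # Only reached if the write succeeded.
    with conn:
        conn.executemany("UPDATE outbox SET status = 'processed' WHERE id = ?", ids)
    return len(rows)


for n in range(3):
    receive(f"message {n}")
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
written = flush(out_path)
second_run = flush(out_path)
```

Running `flush` from a single scheduled process plays the role of the singleton orchestration: only one poll-write-update cycle touches the file at a time.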
If FIFO is important, BizTalk has an ordered delivery mechanism (supported by the FILE adapter), but it comes at a performance cost.
A better solution would be to let the instances write to individual files and then have another scheduled process (or orchestration) combine them into one file. You can enforce FIFO using timestamps. This would provide better performance and resource utilization than the singleton orchestration mentioned earlier. Another option may be to use any suitable implementation of a queue.
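A combine step of that kind can be sketched like this. The sketch assumes each instance writes one record per line as `timestamp<TAB>body`, with ISO-8601 timestamps (which sort correctly as plain strings) and each file already in time order; the file names and the `merge_files` helper are hypothetical.

```python
import heapq
import os
import tempfile


def read_records(path):
    """Yield (timestamp, body) pairs from one instance's file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stamp, _, body = line.rstrip("\n").partition("\t")
            yield stamp, body


def merge_files(paths, out_path):
    """Merge the per-instance files into one file in timestamp order."""
    streams = [read_records(path) for path in paths]
    with open(out_path, "w", encoding="utf-8") as out:
        # heapq.merge streams the inputs, so no file is loaded fully in memory.
        for _, body in heapq.merge(*streams, key=lambda record: record[0]):
            out.write(body + "\n")


# Demonstration with two made-up per-instance files.
tmp_dir = tempfile.mkdtemp()
parts = {
    "host1.txt": ["2024-01-01T10:00:00\tfirst", "2024-01-01T10:00:02\tthird"],
    "host2.txt": ["2024-01-01T10:00:01\tsecond"],
}
paths = []
for name, lines in parts.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    paths.append(path)

merged_path = os.path.join(tmp_dir, "merged.txt")
merge_files(paths, merged_path)
```

Since every instance owns its file, no two writers ever contend for the same handle, and ordering is decided once, at merge time.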
You can move to a database system instead of a file. That would be a very simple solution and also very efficient.
If you don't want to go that way, you must implement file locking or a semaphore inside of your application so the new threads will wait for other threads to finish writing.
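A minimal sketch of the locking idea, in Python, is shown below. Note the limitation: `threading.Lock` only coordinates threads inside one process, so writers on separate servers would need a lock that all of them can see (that part is not shown). The `append_record` helper and the file name are hypothetical.

```python
import os
import tempfile
import threading

write_lock = threading.Lock()


def append_record(path, record):
    """Hold the lock for the whole write so records never interleave."""
    with write_lock:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(record + "\n")


log_path = os.path.join(tempfile.mkdtemp(), "records.txt")
expected = [f"record-{n}" * 50 for n in range(20)]
threads = [
    threading.Thread(target=append_record, args=(log_path, record))
    for record in expected
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

with open(log_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

The order in which threads acquire the lock is not guaranteed, so this keeps records intact but does not by itself preserve arrival order.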
I'm building a DB2 "Infosphere" data warehouse and am expecting to have 8-16 nodes or partitions.
Since I'll be loading from 130-300 million rows a day, and my load process is also my recovery process - I want the loads to be as fast as possible. I'm not surprised to find this tip in the IBM "infocenter" documentation:
"Better performance can be expected if the database partitions participating in the distribution process are different from the loading database partitions, since there is less contention for CPU cycles."
I'd prefer not to dedicate an expensive DB2 node just to splitting load files by hashkey - since my ETL servers are so cheap (we use python, not a licensed commercial product). Plus, since I rely on archived loads for recovery - I may have to convert them in case we add nodes to the database. I'd like that also done on an ETL server. Note - I believe DataStage also performs this task on the ETL server rather than through DB2.
Can anyone suggest how our python ETL process can efficiently use the same hashing algorithm and mapping tables that DB2 will use? And other tips?
Thanks
First of all:
You do not need to pre-split the data inside your ETL process. The LOAD utility will handle splitting the data for you. Your python process can either write the data to load to a flat file or write directly to a pipe (that the LOAD utility reads from). In almost every case, it is easier to let the database handle partitioning the data for you.
The InfoCenter comment about the splitters taking up CPU cycles is probably not something you need to worry about. This generally applies only in extreme situations, where there are many more database partitions (i.e., when you need to have multiple processes splitting the data) and when CPU utilization on the database nodes is very high.
From a LOAD perspective, the amount of time you'll save by having pre-split data is negligible. The limiting factor when loading data is writing the data out to disk – not partitioning it. If reloading data is your primary method of recovery, then I wouldn't worry too much about this.
If all of this does not convince you and you really want to go down the path of having your ETL process split the data, DB2 does provide an API (in C) that applications can call to handle this: db2GetDistMap() and db2GetRowPartNum(). You may be able to write a native python module to handle this.
These are most useful in cases where an application is using SQL to INSERT rows into the table (as opposed to using the LOAD utility), and spawns multiple threads to write data to each partition independently (i.e., each thread is doing the transformation and loading in parallel). If you can't parallelize the transformation portion, then don't bother with this.
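To make the idea of splitting rows by hash key concrete, here is a conceptual sketch in Python. It is NOT DB2's hashing algorithm and will not reproduce DB2's row placement: `zlib.crc32` is a stand-in hash, the map size of 4096 and the round-robin map layout are assumptions, and all function names are hypothetical. Matching DB2 exactly would require the C API named above.

```python
import zlib


def build_dist_map(n_partitions, map_size=4096):
    """Assumed layout: map slots assigned to partitions round robin."""
    return [slot % n_partitions for slot in range(map_size)]


def partition_for(key, dist_map):
    """Hash the key to a map slot, then look up the slot's partition."""
    # Stand-in hash (assumption); DB2 uses its own hashing function.
    slot = zlib.crc32(key.encode("utf-8")) % len(dist_map)
    return dist_map[slot]


def split_rows(rows, key_index, dist_map):
    """Group rows by target partition, e.g. to write one load file each."""
    by_partition = {}
    for row in rows:
        target = partition_for(row[key_index], dist_map)
        by_partition.setdefault(target, []).append(row)
    return by_partition


dist_map = build_dist_map(8)
rows = [(f"cust{n}", n) for n in range(100)]
parts = split_rows(rows, 0, dist_map)
```

Keeping the map separate from the hash means adding nodes only changes the map, which is why archived loads would need to be re-split when the partition count changes.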
Obviously, there are a lot of variables, so YMMV.
We have a process that is getting data in real time and adding records to a database. We're using SQL Server 2008 Integration Services to run our Extract Transform Load (ETL) process. We download about 50 files from an FTP site, process them and then archive the files.
The problem is that the processing is taking about 17s per file even though the files are really small (about 10 lines) and the processing code is fairly simple. Looking at the load on the machine, it is CPU bound and there is not a lot of traffic on the network, disk, or memory.
I suspect that SSIS might be re-compiling the C# code every time it is run. Has anyone run into similar problems? Or have you used a similar process without problems?
Are there any tools that can allow us to profile a dtsx package?
Since you're using SSIS 2008, your Script Tasks are always precompiled.
Are you sure it's the script task in the first place?
I had some extensive script tasks which built many dictionaries, checked whether an incoming value was in various dictionaries according to crazy complex business logic, and did a translation or other work. By building the dictionaries once in the task initialization instead of in the per-row method, the processing improved vastly, as you might expect. But this was a very special case.
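The build-once pattern described above looks roughly like this. It is a Python illustration, not SSIS script component code: `RowTranslator`, `process_row`, and the sample codes are hypothetical, and `build_count` exists only to show how often the dictionary is built.

```python
class RowTranslator:
    build_count = 0  # Counts how many times the lookup table is built.

    def __init__(self, code_pairs):
        # Expensive setup happens once, at initialization time.
        self.lookup = dict(code_pairs)
        RowTranslator.build_count += 1

    def process_row(self, value):
        # The per-row method only does a cheap dictionary lookup.
        return self.lookup.get(value, "UNKNOWN")


translator = RowTranslator([("A", "Alpha"), ("B", "Bravo")])
results = [translator.process_row(value) for value in ["A", "B", "Z", "A"]]
```

If the dictionary were rebuilt inside `process_row`, the setup cost would be paid once per row instead of once per run.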
The package components will be validated (either at the beginning or right before each control flow component is run), that's some overhead you can't get away from.
Are you processing all the files in a single loop within SSIS? In that case, the data flow validation shouldn't be repeated.