Character-level profiling in SSIS or SQL Server

I need to profile reference fields in a database to understand the patterns they are composed of. This needs to be done at a character level as there will be no spaces or punctuation in the reference fields.
As an example I'm looking for a solution that will take input like:
ABA1235DV6778
ABA1235DV6788
ABA2335DV6778
And suggest patterns like:
ABA\d\d35DV67\d\d
This will later be used to validate those reference fields, once I understand the permissible values in those columns.
I have looked at the profiling functionality in SSIS but it seems to lack granularity. Does anybody know how I can tune the profiling in SSIS 2008 or have an efficient function for SQL Server 2008 that can be used to achieve this?
Any help would be greatly appreciated,
Niall

It's not really clear from your post exactly what logic you want to apply to the strings. I'm guessing you want to use some form of edit distance calculation to identify similar strings, then generate a regular expression that matches them all. Those are typically tasks that would be implemented in an external program written in an appropriate language, not in SSIS or SQL Server. It is certainly not something you can do with pre-existing SSIS functionality.
So I would forget SSIS for now and work out the best way to implement your algorithm in .NET (or whatever other language you're comfortable with). Once you've done that you can decide whether to:
Write a self-contained executable and call it from an Execute Process task
Write a .NET DLL and use it in a Script Task, Script Component or CLR stored procedure
Write your own custom SSIS component
Write a complete program instead of using SSIS
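That said, a crude first pass is possible entirely in T-SQL: map every character to a class (letters to A, digits to 9) and group by the resulting mask to see how consistent the column is. The sketch below assumes SQL Server 2008 and made-up table and column names; it will not infer position-level patterns like ABA\d\d35DV67\d\d on its own, but it shows whether one overall pattern dominates before you invest in a custom algorithm.

```sql
-- Map each character to a class: digits become '9', letters become 'A',
-- anything else is kept as-is. SQL Server 2008 has no TRANSLATE(), hence the loop.
CREATE FUNCTION dbo.CharacterMask (@Value NVARCHAR(400))
RETURNS NVARCHAR(400)
AS
BEGIN
    DECLARE @i INT = 1, @c NCHAR(1), @Mask NVARCHAR(400) = N'';
    WHILE @i <= LEN(@Value)
    BEGIN
        SET @c = SUBSTRING(@Value, @i, 1);
        SET @Mask = @Mask + CASE WHEN @c LIKE N'[0-9]' THEN N'9'
                                 WHEN @c LIKE N'[A-Z]' THEN N'A'
                                 ELSE @c END;
        SET @i = @i + 1;
    END;
    RETURN @Mask;
END;
GO

-- Profile an (illustrative) Reference column by mask frequency;
-- ABA1235DV6778 and ABA2335DV6778 both collapse to AAA9999AA9999.
SELECT dbo.CharacterMask(Reference) AS Mask, COUNT(*) AS ValueCount
FROM dbo.ReferenceTable
GROUP BY dbo.CharacterMask(Reference)
ORDER BY ValueCount DESC;
```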

Related

Why ODI if we have the PL/SQL language

Someone asked me why we need the ODI tool if we have PL/SQL. ODI generates the PL/SQL code in the back end, so why do we need an ODI interface at all if we can use the code ODI generates, and even save a step: instead of putting the data into the I$ table, we could push it directly with PL/SQL.
Let's take an example:
If we have to insert 2,000 records from one table into another, we can use PL/SQL directly instead of designing an ODI interface, which leaves me confused about whether ODI is anything more than just a tool.
There is a lot to say, but I can mention the most important aspects, in my opinion:
In ODI you can write KMs (Knowledge Modules: generic SQL, OS commands, Groovy or Java code that generates the statements you need, based on the source and target tables). Once written, a KM can be used in many mappings. Conclusion: write once, use many times;
ODI has an API: with it, you can generate mappings and objects automatically. So you don't need to create 100 mappings (for example) by hand; instead you maintain a metadata repository from which the mappings can be generated automatically;
The fact that you can combine SQL with Groovy gives you a degree of power that, as far as I know, you can't find in other ETL tools;
ODI Contexts permit you to run the same mapping on different servers or in parallel;
For your example, it's clearly easy to do it in plain SQL if it's a one-off. But if you have ten similar loads to build, you may save time by writing a KM that meets your needs and then generating the ten mappings from it.
There is more to say; if you need it, I can expand this post. Feel free to ask.

What is the best way to get XML data to multiple SSIS Data Flow Tasks?

This is a question about how to structure an SSIS package to solve a very specific problem (I'm new to SSIS and have not found anything on the correct approach).
My Problem: I have an SSIS package that reads a very simple XML file. The XML Source sees the information as a single table. One of the table columns is a qualifier that affects the way a record is processed. Rather than having the processing for all of the qualifiers in a single task, I'd like to have a separate task for each qualifier (for modularity). I could have the task for each qualifier read, shred, and process the XML file, but reading and shredding the XML file multiple times seems like an inefficient way of doing this. I'd think it would be better to have a task with an XML Source that persists the data, and then have that data used by a number of other tasks that process it.
A Possible Solution: From what I’ve read, the correct approach is to save the data into a Raw File Destination, and then to have the various tasks use a Raw File Source. This seems too much like a global variable to me. Is there a better way? I can figure out the specifics, so I don’t need a detailed answer, just the best approach.
Thanks
I would use the SSIS Conditional Split transformation for this. It can evaluate your "Qualifier" column and send specified instances down different paths within that Data Flow Task.
https://msdn.microsoft.com/en-us/library/ms137886.aspx
There doesn't seem to be a way of factoring a DFT (Data Flow Task) built into SSIS's design. SSIS is structured so that each DFT provides a complete ETL function. The only way to "factor" a DFT seems to be to create an ad hoc data flow in the Control Flow, using Raw File sources and destinations to fake DFT input and output parameters. This means creating and managing lots of files (and variables for the file names). Recordset sources and destinations can also be used, but the coding overhead might be higher (I haven't tried using Recordsets, only Raw Files).
Not being able to factor DFTs makes creating and validating complex SSIS packages extremely difficult, and Microsoft really needs to come up with a solution. I found another use case in a web forum: being able to recover a task without having to go all the way back to the beginning. I'll add a link to that post here if I can find it again.
One solution might be to allow a DFT to execute another DFT, just as a package can execute another package in the Control Flow. This breaks the convention that each DFT provides a complete ETL function, but if this is the best option, the benefit of being able to factor DFTs outweighs any added conceptual complexity.
Disclaimer: I am an experienced LabVIEW programmer, so my view of dataflow is likely biased. I could be missing an obvious solution.

Is there a replacement for Transact-SQL

For the first time in years I've been doing some T-SQL programming in SQL Server 2008 and had forgotten just how bad the language really is:
Flow control (all the begin/end stuff) feels clunky
Exception handling is poor. Exceptions don't bubble the way they do in every other language. There's no re-throwing unless you code it yourself, and the raiserror function isn't even spelt correctly (this caused me some headaches!)
String handling is poor
The only sequence type is a table. I had to write a function to split a string on a delimiter and store the parts in a table, along with a value indicating their position in the sequence (roughly the kind of helper sketched after this list).
If you need to do a lookup in the stored proc then manipulating the results is painful. You either have to use cursors, or hack together a while loop with a nested lookup if the results contain some sort of ordering column.
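For reference, the kind of delimiter-split helper described above ends up looking roughly like this on SQL Server 2008 (which has no built-in STRING_SPLIT); all names here are illustrative:

```sql
-- Splits @Input on a single-character delimiter and records each part's position.
CREATE FUNCTION dbo.SplitString (@Input NVARCHAR(MAX), @Delimiter NCHAR(1))
RETURNS @Parts TABLE (Position INT IDENTITY(1,1), Part NVARCHAR(MAX))
AS
BEGIN
    DECLARE @Pos INT = CHARINDEX(@Delimiter, @Input);
    WHILE @Pos > 0
    BEGIN
        INSERT INTO @Parts (Part) VALUES (LEFT(@Input, @Pos - 1));
        SET @Input = SUBSTRING(@Input, @Pos + 1, LEN(@Input));
        SET @Pos = CHARINDEX(@Delimiter, @Input);
    END;
    INSERT INTO @Parts (Part) VALUES (@Input);   -- trailing piece
    RETURN;
END;
GO

-- Usage: returns (1, 'a'), (2, 'b'), (3, 'c')
SELECT Position, Part FROM dbo.SplitString(N'a,b,c', N',');
```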
I realize I could code up my stored procedures using C#, but this will require permissioning the server to allow CLR functions, which isn't an option at my workplace.
Does anyone know if there are any alternatives to T-SQL within SQL Server, or if there are any plans to introduce one? Surely there's got to be a more modern alternative...
PS: This isn't intended to start a flame-war, I'm genuinely interested in what the options are.
There is nothing wrong with T-SQL; it does the job it was intended for (except perhaps for the addition of control flow structures, but I digress!).
Perhaps take a look at LINQ? You can write CLR stored procedures, but I don't recommend this unless it is to cover some missing feature (or heavy string handling).
All the other database stored-procedure languages (PL/SQL, SQL/PSM) have about the same issues. Personally, I think these languages are exactly right for what they are intended to be used for: they are best used to code data-driven logic, especially if you want to reuse it across multiple applications.
So I guess my counter question to you is, why do you want your program to run as part of the database server process? Isn't what you're trying to do better solved at the application or middle-ware level? There you can take any language or data-processing tool of your choosing.
From my point of view, the only alternative to T-SQL within SQL Server is to not use SQL Server.
Regarding your point about handling strings with a delimiter:
Where do these strings come from?
You could try Integration Services and SSIS packages for converting data from one form to another.
There is also a nice way to access non-SQL data over linked servers.
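To illustrate that last point, a minimal sketch assuming a linked server has already been configured (the names EXCEL_SRC, LINKED_SQL and the table names are invented for the example):

```sql
-- Pass a query through to a non-SQL source exposed as a linked server
-- (for example an Excel workbook registered via sp_addlinkedserver).
SELECT *
FROM OPENQUERY(EXCEL_SRC, 'SELECT RefCode, Description FROM [Sheet1$]');

-- Or use four-part naming against a linked SQL Server instance.
SELECT TOP (10) *
FROM LINKED_SQL.SourceDb.dbo.SourceTable;
```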

Are there any automated white box testing tools for T-SQL stored procedures and/or functions?

I was wondering if there are any tools similar to Pex that analyze T-SQL stored procedures and functions (instead of managed code) in order to generate meaningful and interesting input values for parameterized unit tests.
AFAIK, no. I've never come across one and another look around has failed to throw one up (I did come across this article on the subject).
The only semi-relevant tools I can suggest are:
TSQLUnit - testing framework for TSQL
Red Gate's Data Generator - for automated test data generation
Or, just writing tests in NUnit. You could create a basic data access layer in (e.g.) .NET, each method wrapping a call to a different sproc with the same parameters to pass through. You could then use a tool like Pex on that data access layer - a sort of proxy approach.
In addition to the tools mentioned by @AdaTheDev, have you looked at DbUnit.NET?

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research CREATE ASSEMBLY, moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
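For reference, deploying a .NET library that way looks roughly like this; the assembly, class, procedure and path names below are purely illustrative, and the DLL must expose a matching public static method:

```sql
-- CLR integration must be enabled once per server.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Load the client library into the database (name and path are hypothetical).
-- EXTERNAL_ACCESS may be needed if the code opens its own connections.
CREATE ASSEMBLY PropertyWriter
FROM 'C:\Deploy\PropertyWriter.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose one of its methods as a CLR stored procedure.
CREATE PROCEDURE dbo.WriteProperty
    @ObjectId INT,
    @PropertyName NVARCHAR(128),
    @Value NVARCHAR(MAX)
AS EXTERNAL NAME PropertyWriter.[MyCompany.Migration.PropertyWriter].WriteProperty;
GO
```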
Bad news: you have many options
Use flat-file transformations: extract all the data into flat files, manipulate them with grep, awk, sed, C or Perl into the required insert/update statements, and execute those against the target database.
PRO: fast. CON: extremely ugly, and a nightmare for maintenance; don't do this if you need it for longer than a week or more than a couple of dozen runs.
Use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within another, so one of the fastest ways to do this is to write it as a collection of insert/update/merge statements fed by select statements.
PRO: fast, one technology only. CON: requires a direct connection between the databases, and you might hit the limits of SQL (or of the available SQL knowledge) pretty quickly, depending on the kind of transformation.
Use T-SQL, or whatever procedural language the database provides; everything else is similar to the pure SQL approach.
PRO: pretty fast, since you don't leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL it is not the nicest language for complex transformations.
Use an ETL tool: there are special tools for extracting, transforming and loading data. They often support various databases and have many strategies readily available for deciding whether an update or an insert is in order.
PRO: sorry, you'll have to ask somebody else about that; so far I have had nothing but bad experience with those tools.
CON: a highly specialized tool that you need to master. In my personal experience: slower in implementation and execution of the transformation than handwritten SQL, and a nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if anything.
Use a high-level language (Java, C#, VB...): you would load your data into proper business objects, manipulate those, and store them in the target database. Pretty much what you seem to be doing right now, although it sounds as if there are better ORMs available, e.g. NHibernate.
PRO: even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tooling (good IDEs, testing frameworks, CI systems) to support you while developing the transformation.
CON: it adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling the objects back into the target database). I'd go this way if the process is going to be around for a long time.
Building on the last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am leaving the scope of your question.
I'm with John: SSIS is the way to go for any repeatable process that imports large amounts of data, and it should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need a hybrid of set-based and looping code that runs in batches (of, say, 2000 records at a time) rather than locking up the table for the whole time a large insert would take.
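A sketch of that batched T-SQL pattern for one of the 15 tables; the database, table and column names are invented, and any transformation logic would go into the SELECT:

```sql
-- Insert missing rows 2000 at a time so the target table is never locked
-- for the duration of one giant transaction (SQL Server 2005 syntax).
DECLARE @BatchSize INT, @RowsAffected INT;
SET @BatchSize = 2000;
SET @RowsAffected = 1;

WHILE @RowsAffected > 0
BEGIN
    BEGIN TRAN;

    INSERT INTO TargetDb.dbo.ObjectProperty (ObjectId, PropertyName, PropertyValue)
    SELECT TOP (@BatchSize)
           s.ObjectId, s.PropertyName, s.PropertyValue   -- apply transforms here as needed
    FROM   SourceDb.dbo.ObjectProperty AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   TargetDb.dbo.ObjectProperty AS t
                       WHERE  t.ObjectId = s.ObjectId
                         AND  t.PropertyName = s.PropertyName);

    SET @RowsAffected = @@ROWCOUNT;

    COMMIT TRAN;
END;

-- Rows that already exist can then be refreshed with one set-based update
-- (or the same TOP (@BatchSize) pattern if that statement is also too large).
UPDATE t
SET    t.PropertyValue = s.PropertyValue
FROM   TargetDb.dbo.ObjectProperty AS t
JOIN   SourceDb.dbo.ObjectProperty AS s
  ON   s.ObjectId = t.ObjectId
 AND   s.PropertyName = t.PropertyName;
```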

Resources