How do I get matched data from a flat file into a database?

I took a flat file, looked up one of its fields in a database, and added a field from the lookup as a new column alongside the flat-file data.
But when I directed the matched output to another database, the matched field is NULL when I inspect it with a SELECT statement.
What did I do wrong?

I would check for any of the following on either the flat file or lookup data, which may cause a non-match:
- text data with trailing blanks
- text data with upper case vs lower case
- numeric data of varying data types, or even just different precisions
- probably other issues I haven't listed above - it's just ridiculously fussy
To avoid these issues I always explicitly use SQL CAST or Derived Column transforms to make sure the key fields are all text, all upper case and all exactly the same, byte by byte.
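For example, a minimal sketch of the database side of the lookup (the table and column names are made up), with the flat-file side normalised the same way by a Derived Column expression such as (DT_WSTR,20)UPPER(RTRIM(LTRIM(CustomerCode))):
-- Hypothetical lookup query: trim, upper-case and cast the key so it is
-- byte-for-byte identical to the normalised flat-file column.
SELECT CAST(UPPER(RTRIM(LTRIM(CustomerCode))) AS NVARCHAR(20)) AS CustomerCode,
       CustomerName
FROM dbo.Customers;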

Related

SQL: Update Unicode Data in column to Accented characters

I have a column which was set to Varchar, in a database whose collation is SQL_Latin1_General_CP1_CI_AS.
When a user entered their name into our web front end and saved the data, accented characters were not being saved correctly.
The web user was entering "Béala", but this was being saved in the database as the mangled string "BÃ©ala".
I believe that changing the column from Varchar to NVarchar should prevent this from happening going forward(?), however, I have two questions.
1) How do I perform a select on the existing data in the column and display it correctly?
select CONVERT(NVARCHAR(100),strAddress1) from [dbo].[tblCustomer]
This still shows the data incorrectly.
2) How do I update the data in the column once converted to NVarchar to save the accented characters correctly?
Many thanks,
Ray.
The only idea that comes to my mind is to prepare an UPDATE that repairs the badly loaded data. Each mangled sequence (for example 'Ã©') always corresponds to exactly one character (in this case 'é'), so you have to catch all of the special-character sequences and replace them (just a simple UPDATE with CASE and REPLACE). Of course, the column must first be of the NVARCHAR type.
That solves both problems 1 and 2: the data will be correct in the table and will display correctly (the update is what I described above).
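A minimal sketch of such an update, assuming the column has already been altered to NVARCHAR and that the text was double-encoded UTF-8; only a few example sequences are shown, so extend the REPLACE chain for every sequence you find in your data:
-- Fix a few common mangled sequences in place; add more REPLACEs as needed.
UPDATE [dbo].[tblCustomer]
SET strAddress1 = REPLACE(REPLACE(REPLACE(strAddress1,
                      N'Ã©', N'é'),
                      N'Ã¨', N'è'),
                      N'Ã¡', N'á')
WHERE strAddress1 LIKE N'%Ã%';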
Here is a way to convert the text to a plain, unaccented character scheme:
select 'Réunion', cast('Réunion' as varchar(100)) COLLATE SQL_Latin1_General_CP1253_CI_AI
Moreover, to check all the collations available in SQL Server you can try this query:
SELECT name, description
FROM sys.fn_helpcollations();

What is a good substitute for DSPF with physical files that contain nulls?

I have found that the command DSPF (Display Physical file) does not correctly display records that contain a null. If I define a physical file in DDS with the ALWNULL keyword on a field, then fill the file with data, DSPF will correctly display the data for records without nulls, but all records that contain at least one null will display only blanks in both the null and the non-null fields.
This can be misleading. For example, in one file I looked at, the seemingly blank records have data in most of their fields, with null only in the date field; DSPF shows them as blanks in character mode and as zeros in hex mode, giving no indication of the real values stored in the physical file.
Is there a different system command or freely available utility that shows what the data really is? I have found DSPF to be quite useful in debugging and would like to be able to see what the characters and hex values (especially for packed decimals) really are. I could use SQL to see the data, but sometimes it is better to get a raw dump, especially if you are using RPG statements like SETLL or CHAIN and don't want to be misled by SQL ordering.
DSPPFM shows the data for fields that aren't null, and it shows the default value for any fields that are null, usually blanks or zeros, but you can set a different default value when you create the file.
As for SQL: it doesn't apply any ordering unless you give it one, so if you want to see the data in the order that an RPG RLA program would be using it, specify ORDER BY KEY1, KEY2.
There are other commercial options, such as ProData's DBU utility, but SQL is your best bet.
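If you do go the SQL route, DB2 for i's HEX function will show the raw bytes of a field (useful for packed decimals) alongside the character data. A made-up example, ordered the way a keyed RPG read would see the records:
-- Hypothetical file and field names; HEX() exposes the raw bytes of the packed field.
SELECT CUSTNO,
       CUSTNAME,
       ORDERDATE,
       HEX(BALANCE) AS BALANCE_HEX
FROM MYLIB.CUSTMAST
ORDER BY CUSTNO;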

SSIS Lookup finds no match on varchar field

I have a pretty basic Lookup transformation that is matching on two varchar fields. The source is varchar(13) and the lookup field is varchar(20). I have a clear match between the two, yet the rows are directed to the No Match output.
Whenever I have come across this before, it's usually a leading or trailing space, or a mismatch between data types, that causes the problem, but I have checked and double-checked and can't see any issue. I even joined the tables with a SQL query and that does return rows.
What other possibilities are there?
SSIS performs comparisons differently from SQL Server. It follows more strict rules, so if you are matching strings, make sure the columns are exactly the same: string lengths, padding, casing, code page, ANSI / Unicode, etc.
Putting Derived Column transformations before the lookup that would normalise these parameters usually helps.
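When the obvious fixes still don't help, it can be worth comparing the two key values byte-for-byte in T-SQL. A hypothetical diagnostic query, assuming the flat-file rows have been staged into a table (the table and column names are made up):
-- DATALENGTH and VARBINARY expose trailing spaces, non-printing characters
-- and case differences that a normal equality join hides.
SELECT s.KeyCol                        AS SourceKey,
       l.KeyCol                        AS LookupKey,
       DATALENGTH(s.KeyCol)            AS SourceBytes,
       DATALENGTH(l.KeyCol)            AS LookupBytes,
       CAST(s.KeyCol AS VARBINARY(40)) AS SourceHex,
       CAST(l.KeyCol AS VARBINARY(40)) AS LookupHex
FROM dbo.SourceStage AS s
LEFT JOIN dbo.LookupTable AS l
       ON l.KeyCol = s.KeyCol;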

Does the number of fields in a table affect performance even if not referenced?

I'm reading and parsing CSV files into a SQL Server 2008 database. This process uses a generic CSV parser for all files.
The CSV parser is placing the parsed fields into a generic field import table (F001 VARCHAR(MAX) NULL, F002 VARCHAR(MAX) NULL, Fnnn ...) which another process then moves into real tables using SQL code that knows which parsed field (Fnnn) goes to which field in the destination table. So once in the table, only the fields that are being copied are referenced. Some of the files can get quite large (a million rows).
The question is: does the number of fields in a table significantly affect performance or memory usage? Even if most of the fields are not referenced. The only operations performed on the field import tables are an INSERT and then a SELECT to move the data into another table, there aren't any JOINs or WHEREs on the field data.
Currently, I have three field import tables: one with 20 fields, one with 50 fields and one with 100 fields (this being the maximum number of fields I've encountered so far). There is logic to pick the smallest of these tables that will hold a given file.
I'd like to make this process more generic, and have a single table of 1000 fields (I'm aware of the 1024 columns limit). And yes, some of the planned files to be processed (from 3rd parties) will be in the 900-1000 field range.
For most files, there will be less than 50 fields.
At this point, dealing with the existing three field import tables (plus planned tables for more fields (200, 500, 1000?)) is becoming a logistical nightmare in the code, and dealing with a single table would resolve a lot of issues, provided I don't give up much performance.
First, to answer the question as stated:
Does the number of fields in a table affect performance even if not referenced?
If the fields are fixed-length (*INT, *MONEY, DATE/TIME/DATETIME/etc, UNIQUEIDENTIFIER, etc.) and the field is not marked as SPARSE and Compression hasn't been enabled (both options started in SQL Server 2008), then the full size of the field is taken up (even if NULL), and this does affect performance, even if the fields are not in the SELECT list.
If the fields are variable length and NULL (or empty), then they just take up a small amount of space in the Page Header.
Regarding space in general, is this table a heap (no clustered index) or clustered? And how are you clearing the table out for each new import? If it is a heap and you are just doing a DELETE, then it might not be getting rid of all of the unused pages. You would know if there is a problem by seeing space taken up even with 0 rows when doing sp_spaceused. Suggestions 2 and 3 below would naturally not have such a problem.
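For example, a quick check against one of the staging tables (the table name here is made up):
-- Space still reserved with 0 rows suggests the heap is holding on to empty pages;
-- TRUNCATE TABLE (rather than DELETE) deallocates them.
EXEC sys.sp_spaceused @objname = N'dbo.FieldImport100';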
Now, some ideas:
1. Have you considered using SSIS to handle this dynamically?
2. Since you seem to have a single-threaded process, why not create a global temporary table at the start of the process each time? Or drop and recreate a real table in tempdb? Either way, if you know the destination, you can dynamically create this import table with the destination field names and datatypes (see the sketch after this list). Even if the CSV importer doesn't know the destination, at the beginning of the process you can call a proc that does know it and can create the "temp" table; the importer can then still import generically into a standard table name with no fields specified, without erroring, as long as the fields in the table are NULLable and there are at least as many of them as there are columns in the file.
3. Does the incoming CSV data have embedded returns, quotes, and/or delimiters? Do you manipulate the data between the staging table and the destination table? Depending on the answers, it might be possible to dynamically import directly into the destination table, with proper datatypes but no in-transit manipulation.
4. Another option is doing this in SQLCLR. You can write a stored procedure to open a file and spit out the split fields while doing an INSERT INTO ... EXEC. Or, if you don't want to write your own, take a look at the SQL# SQLCLR library, specifically the File_SplitIntoFields stored procedure. (This proc is only available in the Full / paid-for version, and I am the creator of SQL#, but it does seem ideally suited to this situation.)
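As a rough illustration of idea 2, one way to build the staging table dynamically from the destination's column names (dbo.Customer and ##ImportStage are made-up names; every staging column is created as a NULLable NVARCHAR(MAX)):
-- Build a CREATE TABLE statement from the destination table's column list,
-- then execute it to get a matching global temp staging table.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = N'CREATE TABLE ##ImportStage (' + STUFF((
           SELECT N', ' + QUOTENAME(c.name) + N' NVARCHAR(MAX) NULL'
           FROM sys.columns AS c
           WHERE c.object_id = OBJECT_ID(N'dbo.Customer')   -- assumed destination
           ORDER BY c.column_id
           FOR XML PATH(''), TYPE
       ).value('.', 'NVARCHAR(MAX)'), 1, 2, N'') + N');';

EXEC (@sql);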
Given that:
all fields import as text
destination field names and types are known
number of fields differs between destination tables
what about having a single XML field and importing each line as a single-level document with each field being <F001>, <F002>, etc? By doing this you wouldn't have to worry about number of fields or have any fields that are unused. And in fact, since the destination field names are known to the process, you could even use those names to name the elements in the XML document for each row. So the rows could look like:
ID   LoadFileID   ImportLine
1    1            <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
2    1            <row><Number>555-555-5555</Number><Type>Cell</Type></row>
Yes, the data itself will take up more space than the current VARCHAR(MAX) fields, both due to XML being double-byte and the inherent bulkiness of the element tags to begin with. But then you aren't locked into any physical structure. And just looking at the data will be easier to identify issues since you will be looking at real field names instead of F001, F002, etc.
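For instance, pulling a named field back out of the XML line is straightforward with the xml type's value() method (assuming ImportLine is an xml column; the table name dbo.ImportTable is made up):
-- Extract individual fields from the XML import line for rows of load file 1.
SELECT ID,
       ImportLine.value('(/row/FirstName/text())[1]', 'NVARCHAR(100)') AS FirstName,
       ImportLine.value('(/row/LastName/text())[1]',  'NVARCHAR(100)') AS LastName
FROM dbo.ImportTable
WHERE LoadFileID = 1;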
In terms of at least speeding up the process of reading the file, splitting the fields, and inserting, you should use Table-Valued Parameters (TVPs) to stream the data into the import table. I have a few answers here that show various implementations of the method, differing mainly based on the source of the data (file vs a collection already in memory, etc):
How can I insert 10 million records in the shortest time possible?
Pass Dictionary<string,int> to Stored Procedure T-SQL
Storing a Dictionary<int,string> or KeyValuePair in a database
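For reference, the database side of a TVP import is quite small; the streaming happens in the client code (for example an IEnumerable<SqlDataRecord> passed as the parameter value). A sketch with illustrative names, trimmed to two of the generic fields:
-- Table type matching the staging table's shape (only two fields shown here).
CREATE TYPE dbo.ImportRowType AS TABLE
(
    F001 VARCHAR(MAX) NULL,
    F002 VARCHAR(MAX) NULL
);
GO
-- The importer streams rows into @Rows; SQL Server inserts them in one statement.
CREATE PROCEDURE dbo.ImportRows
    @Rows dbo.ImportRowType READONLY
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.FieldImport (F001, F002)
    SELECT F001, F002
    FROM @Rows;
END;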
As was correctly pointed out in comments, even if your table has 1000 columns, but most of them are NULL, it should not affect performance much, since NULLs will not waste a lot of space.
You mentioned that you may have real data with 900-1000 non-NULL columns. If you are planning to import such files, you may come across another limitation of SQL Server. Yes, the maximum number of columns in a table is 1024, but there is a limit of 8060 bytes per row. If your columns are varchar(max), then each such column will consume 24 bytes out of 8060 in the actual row and the rest of the data will be pushed off-row:
SQL Server supports row-overflow storage which enables variable length columns to be pushed off-row. Only a 24-byte root is stored in the main record for variable length columns pushed out of row; because of this, the effective row limit is higher than in previous releases of SQL Server. For more information, see the "Row-Overflow Data Exceeding 8 KB" topic in SQL Server Books Online.
So, in practice you can have a table with at most 8060 / 24 = 335 non-NULL nvarchar(max) columns. (Strictly speaking even a few less, since there are other row headers as well.)
There are so-called wide tables that can have up to 30,000 columns, but the maximum size of the wide table row is 8,019 bytes. So, they will not really help you in this case.
Yes. Large records take up more space on disk and in memory, which means loading them is slower than loading small records, and fewer of them fit in memory. Both effects will hurt performance.

SSIS: Capture Truncation Warning from Flat File Source with "Ignore Failure" Enabled

I have a 2005 SQL Server Integration Services (SSIS) package that is loading delimited flat files into some tables. A very small percentage of records have a text field that is longer than the file format specification says it can be. Rather than try to play an ongoing game of "guess the real maximum length", the customer has requested I just truncate anything over the size in the spec.
I have set the Truncation event to "Ignore Failure" in the Flat File Source Editor, and that takes care of my extra data. However, it seems to be a completely silent truncation (it does not write any warning to the log). I am concerned that if there is ever a question about what data has been truncated, I have no way to identify it.
What is a simple way to log the fact the truncation happened?
It would be enough to identify that the file had a truncated row in it, but if I could also specify the actual row that would be great. Whether it is captured as part of the built in package logging or I have to make a special call makes no difference to me.
Before you do the actual insert, have a Conditional Split that takes the records longer than the actual field length and puts them into a logging table. Then you can truncate the data and rejoin them to the original path using a Merge or Merge Join transformation.
You can do the truncation yourself as part of the data flow. Set the flat file column width to a value that is very big (larger than any expected values). You can use a conditional split to identify rows that violate the length.
In the data flow path for invalid rows, you can record the information to your log. Then, you can convert the values to the valid length and merge them back with the valid rows. And, finally add the rows to the destination.
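As an alternative to doing it all in the data flow, if the rows land in a staging table with an oversized column first, the same "log, then truncate" idea can be expressed in T-SQL; the table names and the 50-character limit below are made up:
-- Record which rows exceed the spec before cutting them down.
INSERT INTO dbo.TruncationLog (RowNumber, OriginalValue)
SELECT RowNumber, LongTextCol
FROM dbo.FlatFileStage
WHERE LEN(LongTextCol) > 50;

-- Then enforce the spec's maximum length.
UPDATE dbo.FlatFileStage
SET LongTextCol = LEFT(LongTextCol, 50)
WHERE LEN(LongTextCol) > 50;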
