Visual Studio Data Comparison with fewer columns - sql-server

My project currently has a database which contains several tables, the most important of which has one binary column with very large entries (representing serialized C# objects). There are a large number of entries in the production database, and when debugging, it is often necessary to pull these entries down into the local development database (as remote debugging does not seem to work, which is a separate issue).
If I attempt to compare the local and production databases on this table with all columns, the comparison can take up to an hour or eventually time out, although it has worked in the past and allowed me to download the entries and debug them successfully. If I compare on all table columns except the binary data column, the comparison is almost instantaneous, but that column is then not transferred to the target database.
My question is: is there any way to run a data comparison between two tables, excluding a particular column for the comparison itself (other fields give enough information to differentiate without it) but including it when updating the target database?

You could apply a hash function to your large varbinary fields and compare the hashes. HASHBYTES with MD5 is a good method for comparing, as it is astronomically unlikely to generate the same hash value for two different inputs. The problem is that HASHBYTES only works on inputs of up to 8000 bytes (a limit that was lifted in SQL Server 2016). There are some workarounds that involve creating a function. A few are posted here:
SQL Server 2008 and HashBytes
You would have the option of storing the hash values in your table at the time of insert or update by using a persisted computed column. Or you could just generate the hash values while running your comparison query.
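The idea of comparing digests instead of the large values themselves can be sketched outside the database. This is a minimal illustration in Python, not T-SQL: the dictionaries stand in for the two tables (key to binary value), the function names are hypothetical, and SHA-256 is used here in place of the MD5 mentioned above.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return a short, fixed-size fingerprint of a large binary value."""
    return hashlib.sha256(blob).hexdigest()


def rows_to_transfer(source: dict, target: dict) -> list:
    """Return keys whose blob is missing from the target or differs by digest."""
    target_digests = {key: digest(blob) for key, blob in target.items()}
    return sorted(
        key for key, blob in source.items()
        if target_digests.get(key) != digest(blob)
    )


# Hypothetical data: key -> serialized object
production = {1: b"a" * 100_000, 2: b"b" * 100_000, 3: b"c" * 100_000}
local = {1: b"a" * 100_000, 2: b"x" * 100_000}

changed = rows_to_transfer(production, local)
print(changed)  # [2, 3]
```

If the digests are stored alongside the rows, only the short digest values need to be compared, and the large column is read only for the rows that actually have to be copied.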

Related

SQL Server Except operator - how to identify culprit columns

Just to give some background, we have created new SSIS packages to load feeds into SQL Server tables.
To make sure the new SSIS packages load data in the same way as the current live process does, we have written a comparison script to compare the Live and Test tables, which will be run in parallel for a month.
I have used the EXCEPT operator to get the differences, and it does return the rows that differ.
The problem is that I have around 50 columns, and I need to check the data in each column and compare it with the other table to find the culprit.
In some cases I am getting around 40,000 rows as a difference, and identifying the responsible columns is too cumbersome.
In most cases, the difference was caused by a NULL value in Live and a blank value in Test. I have done the following to identify the columns:
Limit the number of columns
Look at the values for a few columns, then change the SQL statement to include the ISNULL function, and so on
This is time-consuming.
Is there a good way to get the names of the columns that are causing the rows to differ?
It would be good if I could find a better way to handle this.
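One way to narrow this down is to match rows on a key and count, per column, how many matched rows differ. The following is only a sketch in Python rather than T-SQL, and it assumes something the question does not state: that both tables share a key column (called `id` here). The function name and sample rows are hypothetical.

```python
from collections import Counter


def differing_columns(live_rows, test_rows, key):
    """Count, per column, how many key-matched rows have different values."""
    test_by_key = {row[key]: row for row in test_rows}
    counts = Counter()
    for live in live_rows:
        test = test_by_key.get(live[key])
        if test is None:
            continue  # row missing from Test; not a column-level difference
        for column, value in live.items():
            if value != test.get(column):
                counts[column] += 1
    return counts


# Hypothetical sample: a NULL in Live versus a blank in Test shows up under "city"
live = [
    {"id": 1, "name": "Ann", "city": None},
    {"id": 2, "name": "Bob", "city": "Oslo"},
]
test = [
    {"id": 1, "name": "Ann", "city": ""},
    {"id": 2, "name": "Bobby", "city": "Oslo"},
]

diff_counts = differing_columns(live, test, "id")
print(dict(diff_counts))
```

Columns with high counts are the first candidates to wrap in ISNULL (or otherwise normalize) before re-running the EXCEPT comparison.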

Does the number of fields in a table affect performance even if not referenced?

I'm reading and parsing CSV files into a SQL Server 2008 database. This process uses a generic CSV parser for all files.
The CSV parser is placing the parsed fields into a generic field import table (F001 VARCHAR(MAX) NULL, F002 VARCHAR(MAX) NULL, Fnnn ...) which another process then moves into real tables using SQL code that knows which parsed field (Fnnn) goes to which field in the destination table. So once in the table, only the fields that are being copied are referenced. Some of the files can get quite large (a million rows).
The question is: does the number of fields in a table significantly affect performance or memory usage? Even if most of the fields are not referenced. The only operations performed on the field import tables are an INSERT and then a SELECT to move the data into another table, there aren't any JOINs or WHEREs on the field data.
Currently, I have three field import tables, one with 20 fields, one with 50 fields and one with 100 fields (this being the max number of fields I've encountered so far). There is currently logic to use the smallest table possible.
I'd like to make this process more generic, and have a single table of 1000 fields (I'm aware of the 1024-column limit). And yes, some of the planned files to be processed (from 3rd parties) will be in the 900-1000 field range.
For most files, there will be fewer than 50 fields.
At this point, dealing with the existing three field import tables (plus planned tables for more fields (200, 500, 1000?)) is becoming a logistical nightmare in the code, and dealing with a single table would resolve a lot of issues, provided I don't give up much performance.
First, to answer the question as stated:
Does the number of fields in a table affect performance even if not referenced?
If the fields are fixed-length (*INT, *MONEY, DATE/TIME/DATETIME/etc, UNIQUEIDENTIFIER, etc) AND the field is not marked as SPARSE or Compression hasn't been enabled (both started in SQL Server 2008), then the full size of the field is taken up (even if NULL) and this does affect performance, even if the fields are not in the SELECT list.
If the fields are variable-length and NULL (or empty), then they just take up a small amount of space in the row overhead (the NULL bitmap and the variable-length column offset array).
Regarding space in general, is this table a heap (no clustered index) or clustered? And how are you clearing the table out for each new import? If it is a heap and you are just doing a DELETE, then it might not be getting rid of all of the unused pages. You would know if there is a problem by seeing space taken up even with 0 rows when doing sp_spaceused. Suggestions 2 and 3 below would naturally not have such a problem.
Now, some ideas:
1. Have you considered using SSIS to handle this dynamically?
2. Since you seem to have a single-threaded process, why not create a global temporary table at the start of the process each time? Or, drop and recreate a real table in tempdb? Either way, if you know the destination, you can even dynamically create this import table with the destination field names and datatypes. Even if the CSV importer doesn't know of the destination, at the beginning of the process you can call a proc that would know of the destination, can create the "temp" table, and then the importer can still generically import into a standard table name with no fields specified and not error if the fields in the table are NULLable and are at least as many as there are columns in the file.
3. Does the incoming CSV data have embedded returns, quotes, and/or delimiters? Do you manipulate the data between the staging table and destination table? It might be possible to dynamically import directly into the destination table, with proper datatypes, but no in-transit manipulation. Another option is doing this in SQLCLR. You can write a stored procedure to open a file and spit out the split fields while doing an INSERT INTO...EXEC. Or, if you don't want to write your own, take a look at the SQL# SQLCLR library, specifically the File_SplitIntoFields stored procedure. This proc is only available in the Full / paid-for version, and I am the creator of SQL#, but it does seem ideally suited to this situation.
4. Given that:
all fields import as text
destination field names and types are known
number of fields differs between destination tables
what about having a single XML field and importing each line as a single-level document with each field being <F001>, <F002>, etc? By doing this you wouldn't have to worry about number of fields or have any fields that are unused. And in fact, since the destination field names are known to the process, you could even use those names to name the elements in the XML document for each row. So the rows could look like:
ID  LoadFileID  ImportLine
1   1           <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
2   1           <row><Number>555-555-5555</Number><Type>Cell</Type></row>
Yes, the data itself will take up more space than the current VARCHAR(MAX) fields, both due to XML being double-byte and the inherent bulkiness of the element tags to begin with. But then you aren't locked into any physical structure. And it will be easier to identify issues just by looking at the data, since you will be looking at real field names instead of F001, F002, etc.
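How each parsed line could be turned into such a single-level document can be sketched as follows. This is an illustration in Python using the standard library, not part of the import process described above; the function name `row_to_xml` is hypothetical, and it assumes the destination field names are valid XML element names.

```python
import xml.etree.ElementTree as ET


def row_to_xml(field_names, values):
    """Build a single-level <row> document with one element per parsed field."""
    row = ET.Element("row")
    for name, value in zip(field_names, values):
        ET.SubElement(row, name).text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(row, encoding="unicode")


line_xml = row_to_xml(["FirstName", "LastName"], ["Bob", "Villa"])
print(line_xml)
# <row><FirstName>Bob</FirstName><LastName>Villa</LastName></row>
```

Building the document through an XML library rather than by string concatenation means that characters such as & and < in the CSV data are escaped automatically.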
In terms of at least speeding up the process of reading the file, splitting the fields, and inserting, you should use Table-Valued Parameters (TVPs) to stream the data into the import table. I have a few answers here that show various implementations of the method, differing mainly based on the source of the data (file vs a collection already in memory, etc):
How can I insert 10 million records in the shortest time possible?
Pass Dictionary<string,int> to Stored Procedure T-SQL
Storing a Dictionary<int,string> or KeyValuePair in a database
As was correctly pointed out in comments, even if your table has 1000 columns, but most of them are NULL, it should not affect performance much, since NULLs will not waste a lot of space.
You mentioned that you may have real data with 900-1000 non-NULL columns. If you are planning to import such files, you may come across another limitation of SQL Server. Yes, the maximum number of columns in a table is 1024, but there is a limit of 8060 bytes per row. If your columns are varchar(max), then each such column will consume 24 bytes out of 8060 in the actual row and the rest of the data will be pushed off-row:
SQL Server supports row-overflow storage which enables variable length
columns to be pushed off-row. Only a 24-byte root is stored in the
main record for variable length columns pushed out of row; because of
this, the effective row limit is higher than in previous releases of
SQL Server. For more information, see the "Row-Overflow Data Exceeding
8 KB" topic in SQL Server Books Online.
So, in practice you can have a table with only 8060 / 24 = 335 non-NULL varchar(max) columns. (Strictly speaking, slightly fewer, since there are other headers as well.)
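The arithmetic can be checked with a short sketch. The two constants are the figures quoted above; the function name is hypothetical, and the calculation deliberately ignores the other row headers, so it is an upper bound rather than an exact limit.

```python
MAX_IN_ROW_BYTES = 8060      # per-row limit quoted above
OFF_ROW_POINTER_BYTES = 24   # in-row root for each value pushed off-row

# Upper bound on non-NULL columns whose data has been pushed off-row
max_columns = MAX_IN_ROW_BYTES // OFF_ROW_POINTER_BYTES


def fits_in_row(non_null_columns: int) -> bool:
    """Rough check that ignores other row headers, as the estimate above does."""
    return non_null_columns * OFF_ROW_POINTER_BYTES <= MAX_IN_ROW_BYTES


print(max_columns)       # 335
print(fits_in_row(900))  # False
```

By this estimate, a file in the 900-1000 field range with every field populated and pushed off-row would not fit in a single row.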
There are so-called wide tables that can have up to 30,000 columns, but the maximum size of the wide table row is 8,019 bytes. So, they will not really help you in this case.
Yes. Large records take up more space on disk and in memory, which means they are slower to load than small records and fewer of them fit in memory. Both effects will hurt performance.

Does a Full-Text Index work well for columns with embedded code values

Using SQL Server 2012, I've got a table that currently has several hundred thousand rows, and will grow.
In this table, I've got an nvarchar(30) field that contains Medical Record Number (MRN) values. These values can be just about any alphanumeric value, but are not words.
For example:
DR-345687
34568523
*45612345;T
My application allows the end user to enter a value, say '456' in the search field. The application would need to return all three of the example records.
Currently, I'm using Entity Framework 5.0, and asking for a field.Contains('456') type of search.
This always takes 3-5 seconds to return, since it appears to do a table scan.
My question is: Would creating a Full Text Index on this column help performance? I haven't tried it yet because the only copy of the database that I have with lots of data in it is currently in QA trials.
Looking at the documentation for the Full Text Indexes it appears that it is optimized around separate words in the field value, so I am hesitant to take the performance hit to create the index without knowing how it is likely to affect my query performance.
EF won't use the T-SQL keywords needed to access the SQL Server full text index (http://msdn.microsoft.com/en-us/library/ms142571.aspx#queries) so your solution won't fly without more work.
I think you would have to create a stored procedure to get the data using the full-text index and then have EF call it. I have a similar issue and would be interested to know your results.
Andy

Combining multiple text fields into one text field

I'm trying to merge multiple text columns into one concatenated text column. Each of the fields was previously used for various descriptions, but per new requirements, I need all of those fields to be combined into one.
I tried converting them to varchar(max) first then concatenating, but some of the rows have values in these columns which are longer than the max and are being truncated in the result.
Is there a way to combine multiple text fields in SQL Server 2000?
The best advice I have for you is to either
perform the concatenation in your middle or presentation tier (or add an abstraction layer that allows this, including routing your query through a newer version of SQL Server which performs the concatenation after pulling through a linked server to 2000); or,
upgrade.
You can't fool SQL Server 2000 into supporting [n]varchar(max), and the limitation you've come across is just one of many, many, many reasons the [n]text data types were deprecated.

ADO - Can I edit results of a complex query with multiple join statements?

I'm working on a data conversion utility which can push data from one master database out to a number of different databases. The utility itself will have no knowledge of how data is kept in the destination (table structure), but I would like to allow a SQL statement to be written that returns data from the destination using a complex SQL query with multiple join statements, as long as the data comes back in a standardized format (field names) that the utility can recognize in an ADO query.
What I would like to do is then modify the live data in this ADO Query. However, since there are multiple join statements, I'm not sure if it's possible to do this. I know at least with BDE (I've never used BDE), it was very strict and you had to return all fields (*) and such. ADO I know is more flexible, but I don't know quite how flexible in this case.
Is it supposed to be possible to modify data in a TADOQuery in this manner, when the results include fields from different tables? And even if so, suppose I want to append a new record to the end (TADOQuery.Append). Would it append to two different tables?
The actual primary table I'm selecting from has a complementary table which is joined by the same primary key field. One is a "Small" table (brief info) and the other is a "Detail" table (more info for each record in the Small table). So, a typical statement would include something like this:
select ts.record_uid, ts.SomeField, td.SomeOtherField from table_small ts
join table_detail td on td.record_uid = ts.record_uid
There are also a number of other joins to records in other tables, but I'm not worried about appending to those ones. I'm only worried about appending to the "Small" and "Detail" tables - at the same time.
Is such a thing possible in an ADO Query? I'm willing to tweak and modify the SQL statement in any way necessary to make this possible. I have a bad feeling though that it's not possible.
Compatibility:
SQL Server 2000 through 2008 R2
Delphi XE2
Editing fields that have no influence on the joins is usually no problem.
Appending is ... you can limit the Append to one of the Tables by
procedure TForm.ADSBeforePost(DataSet: TDataSet);
begin
  inherited;
  TCustomADODataSet(DataSet).Properties['Unique Table'].Value := 'table_small';
end;
but without a Requery you won't get much further.
A better way is to set the values through a procedure, e.g. in BeforePost, followed by Requery and Abort.
If your view were persistent, you would be able to use INSTEAD OF triggers.
Jerry,
I encountered the same problem on Firebird, and from experience I can tell you that it can be done (up to a certain level of complexity) by using CachedUpdates. A very good resource is this one: http://podgoretsky.com/ftp/Docs/Delphi/D5/dg/11_cache.html. This article has the answers to all your questions.
I have abandoned the original idea of live ADO query updates, as it has become more complex than I can wrap my head around. The scope of the data push project has changed, so this is no longer an issue for me, though it is still an interesting subject.
The new structure of the application consists of attaching multiple "Field Links" on various fields from the original set of data. Each of these links references the original field name and a SQL Statement which is to be executed when that field is being imported. Multiple field links can be on one single field, therefore can execute multiple statements, placing the value in various tables, etc. The end goal was an app which I can easily and repeatedly export a common dataset from an original source to any outside source with different data structures, without having to recompile the app.
However, the concept of cached updates was not appealing to me, simply because, as pointed out in the link in RBA's answer, data can be changed in the database in the meantime. So I will instead integrate my own method of customizable data pushes.
