Compare two datasets from different databases using VB.NET - sql-server

For data assurance, my task is to compare two datasets from different databases. Currently I am performing a cell-by-cell value comparison, which is a brute-force method and is consuming a lot of time.
I would like to know if there are any methods that would save time and memory and simply return a result such as "Tables are identical" or "Tables are not identical".
Thank you for your assistance.

How about creating a checksum for each table and comparing them?
Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM TableX
This might need an ORDER BY to be more precise.
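For example, a minimal sketch of that approach, assuming both databases are on the same server; DatabaseA, DatabaseB and TableX are placeholder names, and since checksums can collide, a match is a strong indication rather than proof:
DECLARE @checksumA int, @checksumB int;
-- Aggregate a per-row checksum over every column of each table
SELECT @checksumA = CHECKSUM_AGG(CHECKSUM(*)) FROM DatabaseA.dbo.TableX;
SELECT @checksumB = CHECKSUM_AGG(CHECKSUM(*)) FROM DatabaseB.dbo.TableX;
IF @checksumA = @checksumB
    PRINT 'Tables are identical (checksums match)';
ELSE
    PRINT 'Tables are not identical';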

If they are from different sources, there is no other way than comparing them cell by cell, AFAIK. However, I can suggest something that will probably increase comparison speed many fold. If your DataTables have identical structures, which they hopefully do since you're already comparing them cell by cell, try comparing the ItemArray of each pair of rows instead of accessing cells by column index or column name (or by row properties if you're using strongly typed DataSets). This should give you much better results.
If you're using .NET 3.5 or above, this line should do it:
Enumerable.SequenceEqual(row1.ItemArray, row2.ItemArray);

Related

What is the best solution to store a volunteers availability data in access 2016 [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
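As a rough illustration of the normalized alternative mentioned in the question, here is a sketch in SQL Server syntax; the table and column names are invented for the example:
CREATE TABLE Volunteer (
    VolunteerId int IDENTITY(1,1) PRIMARY KEY,
    Name        nvarchar(100) NOT NULL
);
CREATE TABLE AvailabilitySlot (
    SlotId      int PRIMARY KEY,          -- one row per checkbox option
    Description nvarchar(50) NOT NULL
);
CREATE TABLE VolunteerAvailability (
    VolunteerId int NOT NULL REFERENCES Volunteer(VolunteerId),
    SlotId      int NOT NULL REFERENCES AvailabilitySlot(SlotId),
    PRIMARY KEY (VolunteerId, SlotId)     -- enforces uniqueness, enables joins
);
With this structure, the type, uniqueness, referential-integrity, search, count, join and sort problems listed above disappear without extra application code.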
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma-separated list
how to get records that have only certain specific values (2/3/etc.) in that comma-separated list
Another problem with the comma-separated list is ensuring the values are consistent: storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
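For what that middle ground could look like, here is a hedged sketch in SQL Server syntax; the bit assignments and names are invented for the example:
-- Each checkbox gets one bit: 1 = Monday, 2 = Tuesday, 4 = Wednesday, ...
CREATE TABLE VolunteerFlags (
    VolunteerId  int PRIMARY KEY,
    Availability int NOT NULL DEFAULT 0
);
-- Available Monday and Wednesday: 1 | 4 = 5
INSERT INTO VolunteerFlags (VolunteerId, Availability) VALUES (1, 5);
-- Everyone available on Wednesday (bit value 4)
SELECT VolunteerId FROM VolunteerFlags WHERE Availability & 4 = 4;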
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
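To make the binding point concrete, here is a minimal sketch of parameterizing an insert at the database layer with sp_executesql; in a real application you would normally bind the parameter in the data-access code instead, and the table and column names are invented for the example:
DECLARE @availability nvarchar(200) = N'1,2,3';   -- raw form input
-- The input travels as a typed parameter and is never concatenated into the SQL text
EXEC sp_executesql
    N'INSERT INTO VolunteerAvailabilityCsv (VolunteerId, Availability) VALUES (@id, @csv)',
    N'@id int, @csv nvarchar(200)',
    @id = 1,
    @csv = @availability;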
I needed a multi-value column; it could be implemented as an XML field.
It can be converted to a comma-delimited list as necessary, and queried as an XML list in SQL Server using XQuery.
By being an XML field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited lists AND can be converted to a delimited list as needed.
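As a rough sketch of what that looks like in SQL Server (table and element names invented for the example):
DECLARE @t TABLE (Id int, IdList xml);
INSERT INTO @t VALUES (1, N'<ids><id>1</id><id>2</id><id>3</id></ids>');
-- Shred the list into rows
SELECT t.Id, x.n.value('.', 'int') AS ListValue
FROM @t AS t
CROSS APPLY t.IdList.nodes('/ids/id') AS x(n);
-- Find rows whose list contains the value 2
SELECT t.Id
FROM @t AS t
WHERE t.IdList.exist('/ids/id[. = 2]') = 1;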
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a key/value pair, tab-separated list in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key-value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
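A sketch of that fixed-columns approach in SQL Server syntax, with invented day names standing in for the checkboxes (SET is MySQL-specific and has no direct SQL Server equivalent):
CREATE TABLE VolunteerDays (
    VolunteerId int PRIMARY KEY,
    Monday      bit NOT NULL DEFAULT 0,
    Tuesday     bit NOT NULL DEFAULT 0,
    Wednesday   bit NOT NULL DEFAULT 0
    -- ...one bit column per checkbox
);
-- Everyone available on Wednesday
SELECT VolunteerId FROM VolunteerDays WHERE Wednesday = 1;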

how to compare two databases in matlab

I have two databases in Excel that were extracted outside of MATLAB. How can I compare them within MATLAB? Is storing them in a structure the right way? How can I find the similarities and differences?
Each database has 4 columns and around 400 rows of data.
If you are interested in the structure, you could create a struct for each database where each fieldname is equal to a column name.
In this case one can use visdiff.
However, if you are going to compare a lot of numbers this is not practical.
To confirm that they are equal, one can use something like isequal.
To see how they are different, plot them or plot the difference of them.
To see whether the data behaves differently calculate some basic statistics like max, min, mean, std, you may also be interested in the correlation between columns of the two datasets.

To use or not to use computed columns for performance and maintainability

I have a table where I am storing a startingDate in a DateTime column.
Once I have the startingDate value, I am supposed to calculate the
number_of_days,
number_of_weeks,
number_of_months and
number_of_years,
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application, and you care about the application's response time, would you rather make the calculations in a view or create computed columns for each so you can query the table directly?
Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware the values are calculated when requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views also don’t exist until they are requested (unless they are materialised) so, again, there will be an overhead at runtime, but, again it’s on the database server, not the application server.
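As a minimal sketch of what those computed columns could look like (the table name is invented; GETDATE() makes them non-deterministic, so they cannot be PERSISTED and are recalculated on every SELECT, exactly as described above):
CREATE TABLE Enrollment (
    Id               int IDENTITY(1,1) PRIMARY KEY,
    startingDate     datetime NOT NULL,
    number_of_days   AS DATEDIFF(day,   startingDate, GETDATE()),
    number_of_weeks  AS DATEDIFF(week,  startingDate, GETDATE()),
    number_of_months AS DATEDIFF(month, startingDate, GETDATE()),
    number_of_years  AS DATEDIFF(year,  startingDate, GETDATE())
);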
Like nearly everything: It depends.
As @RedX suggests, it's probably not much of a performance difference either way, so it becomes a question of how you will use them. To me this is more of a feel thing.
Using them more than once wouldn't necessarily drive me immediately to either a view or computed columns. If I only use them in a few places or in low-volume code paths, I might calculate them inline in those places or use a CTE. But if they are in widespread or heavy use, I would agree with a view or computed column.
You would also want them in a view or computed column if you want them available via ORM tools.
Am I using those "computed columns" individually in places, or am I using them in sets? If using them in sets, I probably want a view of the table that includes them all.
When I need them, do I usually want them associated with data from a particular other table? If so, that would suggest a view.
Am I basing updates to the original table on those computed values? If so, I want computed columns to avoid joining the view in those cases.
Calculated columns may seem an easy solution at first, but I have seen companies have trouble with them: when they try to do ETL with CDC (real-time change data capture) using tools like Attunity, the tool will not recognize the calculated columns since the values are not stored permanently. So there are some issues. Also, if the columns will be retrieved many, many times by users, you will save time in the long run by putting that logic in the ETL tool or a procedure and writing the result once to the database instead of calculating it for each request.

Store array of numbers in database field

Context: SQL Server 2008, C#
I have an array of integers (0-10 elements). Data doesn't change often, but is retrieved often.
I could create a separate table to store the numbers, but for some reason it feels like that wouldn't be optimal.
Question #1: Should I store my array in a separate table? Please give reasons for one way or the other.
Question #2: (regardless of what the answer to Q#1 is), what's the "best" way to store int[] in database field? XML? JSON? CSV?
EDIT:
Some background: numbers being stored are just some coefficients that don't participate in any relationship, and are always used as an array (i.e. never a value is being retrieved or used in isolation).
Separate table, normalized.
Not as XML or JSON, but as separate numbers in separate rows.
No matter what you think, it's the best way. You can thank me later.
The "best" way to store data in a database is the way that is most conducive to the operations that will be performed on it and the one which makes maintenance easiest. It is this latter requirement which should lead you to a normalized solution, which means storing the integers in a table with a relationship. Beyond being easier to update, it is easier for the next developer that comes after you to understand what information is stored and how.
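A sketch of that normalized option, with invented table and column names; an ordinal column preserves the array order:
CREATE TABLE EntityCoefficient (
    EntityId int     NOT NULL,            -- key of the owning row
    Position tinyint NOT NULL,            -- preserves array order (0-10 items)
    Value    int     NOT NULL,
    PRIMARY KEY (EntityId, Position)
);
-- Read the whole array back in order
SELECT Value FROM EntityCoefficient WHERE EntityId = 42 ORDER BY Position;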
Store it as a JSON array but know that all accesses will now be for the entire array - no individual read/writes to specific coefficients.
In our case, we're storing them as a JSON array. Like your case, there is no relationship between individual array numbers; the array only makes sense as a unit, and as a unit it DOES have a relationship with other columns in the table. By the way, everything else IS normalized. I liken it to this: if you were going to store a 10-byte chunk, you'd save it packed in a single column of VARBINARY(10). You wouldn't shard it into 10 bytes, store each in a column of VARBINARY(1), and then stitch them together with a foreign key. I mean you could, but it wouldn't make any sense.
YOU as the developer will need to understand how 'monolithic' that array of int's really is.
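A sketch of the array-as-a-unit option; on the SQL Server 2008 named in the question the column is just opaque text that the application parses, and the OPENJSON query at the end requires SQL Server 2016 or later. Names are invented for the example:
CREATE TABLE EntityJson (
    EntityId     int PRIMARY KEY,
    Coefficients nvarchar(200) NOT NULL   -- e.g. N'[3, 7, 42]'
);
INSERT INTO EntityJson (EntityId, Coefficients) VALUES (1, N'[3, 7, 42]');
-- SQL Server 2016+ only: shred the array back into rows if ever needed
SELECT j.[value] AS Coefficient
FROM EntityJson AS e
CROSS APPLY OPENJSON(e.Coefficients) AS j
WHERE e.EntityId = 1;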
A separate table would be the most "normalized" way to do this. And it is better in the long run, probably, since you won't have to parse the value of the column to extract each integer.
If you want you could use an XML column to store the data, too.
Sparse columns may be another option for you, too.
If you want to keep it really simple you could just delimit the values: 10;2;44;1
I think since you are talking about SQL Server, that indicates your app may be a data-driven application. If that is the case, I would definitely keep the array in the database as a separate table with a record for each value. It will be normalized and optimized for retrieval. Even if you only have a few values in the array, you may need to combine that data with other retrieved data that may need to be "joined" with your array values, which is exactly what SQL is optimized for, using indexes, foreign keys, etc. (normalization).
That being said, you can always hard code the 10 values in your code and save the round trip to the DB if you don't need to change the values. It depends on how your application works and what this array is going to be used for.
I agree with all the others that the best option is a separate, normalized table. But if you insist on having it all in the same table, don't place the array in one single column. Instead, create 10 columns and store each array value in a different column. It will save you the parsing and update problems.

Adequately Good Way to Store Variable Amounts of Data in a Single Column

I need to find a relatively robust method of storing variable types of data in a single column of a database table. The data may represent a single value or multiple values and may contain any of a long list of characters (too long to enumerate easily). I'm wondering what approaches might work here. I've toyed with the idea of adding some form of separator, but I'm worried that any simple separator or combination might occur naturally in the data. I'd also like to avoid XML snippets, since the data itself could be XML. Arguably I could encode the XML, but that still seems fragile.
I realize this is naturally a bit of an opinion question, but I lack the mojo to make it community.
Edit for Clarification:
Background for the problem: the column will hold data that is then used to make an evaluation based on another column. Functionally, it's the test criteria for a decision engine. Other columns hold the evaluation's nature and the source of the value to test. The data doesn't need to be searchable.
Does the data need to be searchable? If not, slap it in a varbinary(MAX) and have a field to assist in deserialization.
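A sketch of that suggestion with invented names; the PayloadType column is the field that assists in deserialization:
CREATE TABLE CriteriaBlob (
    CriteriaId  int IDENTITY(1,1) PRIMARY KEY,
    PayloadType varchar(20)    NOT NULL,   -- e.g. 'int-list', 'xml', 'utf8-text'
    Payload     varbinary(MAX) NOT NULL    -- opaque bytes; the application deserializes
);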
Incidentally, though, using the right XML API there should be no trouble storing XML inside an XML node.
But my guess is there has to be a better way to do this... it seems... ugh!
JSON format, though I agree with djacobson: your question is like asking for the best way to saw a 2x4 in half with a teaspoon.
EDIT: The order in which data are stored in the JSON string is irrelevant; each datum is stored as a key-value pair.
There's not a "good" way to do this. There is a reason that data types exist in SQL.
The only conceivable way I can think of to make it close is to make your column a lookup column, which refers to a GUID or ID in another table, which itself has additional columns indicating which table and row have your data.
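As a rough sketch of that indirection (all names invented): the criteria column holds a GUID, and a reference table records which table and row actually hold the typed value.
CREATE TABLE CriteriaValueRef (
    RefId       uniqueidentifier PRIMARY KEY DEFAULT NEWID(),
    SourceTable sysname NOT NULL,          -- e.g. 'IntValues' or 'TextValues'
    SourceRowId int     NOT NULL           -- the row in that table holding the value
);
-- The decision-engine table then stores RefId in its single criteria column.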
