SQL Data verification framework? - sql-server

I receive a variety of flat files that need to be transformed and aggregated in several stages of an ETL process before being loaded into a SQL Server database.
After each stage, I'd like to verify the data in several ways, and I'm looking into existing technologies that can help.
Upon receiving the data, it needs to be validated for things such as truncated values and date formatting, and generally checked to ensure it is ready for transformation.
After the data is cleaned in this way, I want to verify the data. This would consist of comparing values such as row counts, % nulls, average values etc. to previous loads, or predefined values. If the verification fails, the developer should be alerted.
tSQLt, the database unit testing framework, has several assertions that can be used to do what I want. It's easy to set up and has decent documentation. It's the nearest tool I can see, but using it this way is a long way from what it was designed for.
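For illustration, a minimal sketch of what such a check could look like as a tSQLt test (the staging and history table names here are invented):

    -- Assumed objects: dbo.StagedOrders (current load) and dbo.LoadHistory (stats from previous loads)
    EXEC tSQLt.NewTestClass 'LoadVerification';
    GO
    CREATE PROCEDURE LoadVerification.[test row count is close to the previous load]
    AS
    BEGIN
        DECLARE @previous INT = (SELECT TOP (1) [RowCount]
                                 FROM dbo.LoadHistory
                                 ORDER BY LoadDate DESC);
        DECLARE @current  INT = (SELECT COUNT(*) FROM dbo.StagedOrders);

        -- Fail the test if the new load deviates from the last one by more than 10%
        IF ABS(@current - @previous) > @previous * 0.10
            EXEC tSQLt.Fail 'Row count deviates more than 10% from the previous load';
    END;
    GO
    EXEC tSQLt.Run 'LoadVerification';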
The alternative is to create my own tool, but I want to know - does something like this already exist?

After a bit of searching I found a commercial solution which I think would solve the problem: QuerySurge. There are a couple of similar tools (ETL Validator, for example), though QuerySurge claims to be unique.
It works by:
Using set comparison between 2 queries, raising errors if they do not match. This could be row counts before/after transformations, or simply checking that a result returns nothing.
Queries can be performed against any JDBC-compliant data source using ANSI SQL and any connection-specific SQL. The results are stored on a separate server using a MySQL backend, and you can choose either to host this yourself or to use their servers.
It permits command-line usage and therefore supports continuous integration tools.
A nice feature is the grouping of tests (test suites), although it is not clear how the results of a group would affect an overall test.
The built-in reporting tools also look nice.
That's the majority of what I gleaned from the website. I haven't downloaded the trial as the software itself is outside of my price range.
The tool is not complicated in principle, so we'll be developing our own framework to fill the gap.
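For reference, the core set-comparison idea is easy to sketch in plain T-SQL (the table names here are invented): two EXCEPT queries that should both come back empty when the loads match.

    -- Rows in staging that never made it to the warehouse ...
    SELECT CustomerId, OrderDate, OrderTotal FROM staging.Orders
    EXCEPT
    SELECT CustomerId, OrderDate, OrderTotal FROM dw.Orders;

    -- ... and rows in the warehouse that have no counterpart in staging.
    SELECT CustomerId, OrderDate, OrderTotal FROM dw.Orders
    EXCEPT
    SELECT CustomerId, OrderDate, OrderTotal FROM staging.Orders;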

Related

Which is better: iterating and sorting data in the backend, or letting the database handle it?

I'm trying to design a database schema for a Django REST Framework web application.
At some point, I have two choices:
1- Choose a schema in which, in one or several APIs, I have to get a queryset from the database and iterate over and order it with Python. (For example, I can store some data in an array-typed column, get it from the database, and sort it with Python.)
2- Store the data in another table, inserting a fairly large number of rows with each insert. This way, I can get the data in the format I want with far fewer lines of ORM code.
I tried some basic tests and benchmarking to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means to set up a more realistic situation, and here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests from other APIs and clients each second?
Is the database (and ORM) usually faster and more reliable than the backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system, so it has access to more resources.
Databases are designed to handle large volumes of data, so they are not limited by the memory available to a single thread.
When this question comes up, more data often has to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something (see the sketch after this list).
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
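To make the top-10 example concrete (table and column names are invented), pushing the sort and the limit into the database means only ten rows ever cross the wire:

    -- The database sorts the full set and returns only what the application needs.
    SELECT TOP (10) ProductId, SUM(Quantity) AS TotalSold
    FROM dbo.OrderLines
    GROUP BY ProductId
    ORDER BY TotalSold DESC;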
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.

Multiple tables vs one big table with JSON serialized data

Here is my situation,
I have an application in which I need to store information about the results of different tests made on blood samples. I am currently using ASP.NET Core for the web application and SQL Server for the database. (I might switch to Postgres, as I will surely host on Linux and SQL Server for Linux is not fully available yet.)
All the tests have some information in common, who performed it, at what time, any other related information for tracking purposes. But then all of them also have specific variables that I need to save for reporting/further calculations.
As of now I have about 20 different types of tests we perform on the samples we receive. The question I have is what would be the best way to save that data?
The two options I see are the following:
Have 20 different tables, all containing the general sample tracking info + specific test variables. This way, when I need to fetch the info, everything for a specific type of test is easily accessible. But then I need to query all these tables with join queries whenever I want to generate a report or modify sample results information (as all the test results/variables entry forms are on a single page). There are very few moments where I need to query only a specific type of test; most of the time I need to retrieve them all at once, which means I will (mostly) query the 20+ tables every time I need to access sample data.
Have one big table containing all the results for the different tests performed and serialize (JSON format) only the specific test variables. So I would have all tracking information available (queryable, searchable, etc....) but the variables and results of each test would be in a single serialized column.
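For illustration, a rough sketch of what the single table in option 2 might look like (all names here are placeholders):

    CREATE TABLE dbo.SampleTest (
        SampleTestId BIGINT IDENTITY(1,1) PRIMARY KEY,
        SampleId     BIGINT        NOT NULL,
        TestType     VARCHAR(50)   NOT NULL,   -- e.g. 'CBC', 'Glucose'
        PerformedBy  VARCHAR(100)  NOT NULL,
        PerformedAt  DATETIME2     NOT NULL,
        ResultsJson  NVARCHAR(MAX) NULL        -- test-specific variables, serialized as JSON
    );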
It is important to know that the variables/results won't be queried directly, I don't need to filter by them or anything like that (yet at the very least).
Now I wonder what would give me the best performance in the long term between using the multiple tables with join queries vs using serialization/deserialization that needs to take place whenever I access the data.
Also, I am aware that by serializing the test results/variables, I am losing the ability to query by the information they contain (except that SQL Server 2016 now includes a way to query JSON information, if I'm not mistaken...).
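If it ever becomes necessary, SQL Server 2016's JSON functions can reach into that serialized column; a minimal sketch against the hypothetical table above:

    -- Pull one test-specific variable out of the JSON column without deserializing in the app.
    SELECT SampleId,
           JSON_VALUE(ResultsJson, '$.hemoglobin') AS Hemoglobin
    FROM dbo.SampleTest
    WHERE TestType = 'CBC';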
I also try to follow best practices by normalizing the database, but I'm not a pro and I don't know which of my two options would be best (or any other option, if there is a better alternative; I'm totally open to better ideas).
So what would be the best approach and why?
Usage estimate
There might be around 15 to 30 million tests performed every year. I would say two thirds of those would be of 5 different blood tests, and the other third would be all the other tests performed.
A different table for each type of test is a good approach.
Reason 1: If only 10 tests are performed on a sample, the remaining columns will unnecessarily waste DB space.
Reason 2: Creating reports per sample will be easy in the future.
Reason 3: Filtering the data will be easy.
Reason 4: Maintenance will be easy.
If all the tests are mandatory, go with 1 table.

Database queries as application healthchecks - management tool

Hey there fellow Stackoverflowers,
In our company we have several application stacks running on different types of databases (MySQL, PostgreSQL, MS SQL, Azure SQL,..). For monitoring purposes we use some scripted queries on the databases of all these application stacks, with Nagios reporting back the results in an email.
Now, since our support team would also like easy access to these queries in order to run or modify them manually, we were considering building an application designed specifically to store, run and modify queries that can be executed against any of the database types listed above, offering both a user-friendly web interface and a REST API with JSON output for our new reporting stack based on Sensu, to be deployed in a few months.
My personal belief is that a tool like this must already be out there, since the use case for it is so generic. However, googling did not yield any results even closely resembling what I am looking for.
So my question to you is: Do you know of such a tool? If you had to build it yourself: what would your approach be? We're mostly a Java/C++ team, but are open to all options.
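For context, the sort of thing we would store per check is just a named query plus a target database type and thresholds, roughly like this (a sketch; all names are placeholders):

    CREATE TABLE dbo.HealthcheckQuery (
        QueryId      INT IDENTITY(1,1) PRIMARY KEY,
        Name         VARCHAR(100)  NOT NULL,
        TargetDbType VARCHAR(20)   NOT NULL,   -- 'mysql', 'postgresql', 'mssql', 'azuresql'
        SqlText      NVARCHAR(MAX) NOT NULL,   -- the check query; returns a single numeric value
        WarnAbove    INT           NULL,       -- threshold for a warning alert
        CritAbove    INT           NULL        -- threshold for a critical alert
    );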
Some, or maybe all, of this can be done by an existing API called NAGIRA. Look it up on Google. This will definitely give you all the results in JSON format. I also think it would allow you to run checks manually, so you could build a little front end and call this API to achieve what you want.
A little late of a reply, but check out http://cloudmonix.com -- it offers the ability to create metrics based on custom SQL queries and supports SQL Azure, SQL Server, MySQL, and Oracle. It also integrates with Nagios (and Zabbix).

Data Correlation in large Databases

We're trying to identify the locations of certain information stored across our enterprise in order to bring it into compliance with our data policies. On the file end, we're using Nessus to search through differing files, but I'm wondering about on the database end.
Using Nessus would seem largely pointless because it would output the raw data and wouldn't tell us what table or row it was in, or give us much useful information, especially considering these databases are quite large (hundreds of gigabytes).
Also worth noting, this system needs to be able to do pattern-based matching (such as using regular expressions). Not just a "dumb search" engine.
I've investigated the use of Data Mining and Data Warehousing in order to find this data but it seems like they're more for analysis of data than actually just finding data.
Is there a better method of searching through large amounts of data in a database to try and find this information? We're using both Oracle 11g and SQL Server 2008 and need to perform the searches on both, so I'd like to stay away from server-specific paradigms (although if I have to rewrite some code to translate from T-SQL to PL/SQL, and vice versa, I don't mind).
On SQL Server for searching through large amounts of text, you can look into Full Text Search.
Read more here http://msdn.microsoft.com/en-us/library/ms142559.aspx
But if I am reading right, you want to spider your database in a similar fashion to how a web search engine spiders web sites and web pages.
You could use a set of full text queries that bring back the results spanning multiple tables.
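A single such probe might look like this (assuming a full-text index already exists on the column; all names are invented):

    -- Returns every document whose body contains the exact phrase.
    SELECT DocumentId, Title
    FROM dbo.Documents
    WHERE CONTAINS(Body, '"account number"');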
Oracle supports regular expressions with the REGEXP_LIKE() function, and it ought to be fairly straightforward to automate the generation of the code you need based on system metadata (to find all text columns over a certain length, for example, and include them in a predicate against that table to find the rows and values that match your regexp). Doesn't sound too challenging really. In theory you could also add check constraints on columns to prevent the insertion of values that match a regexp, but that might be overkill.
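A rough sketch of that metadata-driven approach on Oracle (the pattern and the probed table are placeholders):

    -- 1. Find candidate text columns from the data dictionary.
    SELECT table_name, column_name
    FROM   user_tab_columns
    WHERE  data_type IN ('VARCHAR2', 'CHAR', 'CLOB');

    -- 2. For each candidate, generate and run a probe like this one
    --    (here looking for SSN-shaped values).
    SELECT COUNT(*) AS hits
    FROM   some_table
    WHERE  REGEXP_LIKE(some_column, '[0-9]{3}-[0-9]{2}-[0-9]{4}');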
Oracle Text is suited to searching for words/phrases in large(ish) bits of text (e.g. PDF, HTML, TXT or DOC files) held in the database. There is some limited fuzzy searching, but not regular expressions per se.
You don't really go into what sort of data you are looking for or what you have in your databases. Nessus indicates you are looking for security issues, but the title of "Data Correlation" suggests something completely different.
Really the data structures should provide the information about what to look for and where. That's what databases are about - structuring data for accessibility. A database backing a CMS, forum software or similar would be a different kettle of fish.

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema).
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research CREATE ASSEMBLY (SQL CLR), moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
Bad news: you have many options
Use flat-file transformations: extract all the data into flat files, manipulate them using grep, awk, sed, C or Perl into the required insert/update statements, and execute those against the target database.
PRO: Fast. CON: Extremely ugly; a nightmare for maintenance. Don't do this if you need it for longer than a week, or for more than a couple dozen executions.
Use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within another, so one of the fastest ways to do this is to write it as a collection of insert/update/merge statements fed by select statements (see the sketch after this list).
PRO: Fast, one technology only. CON: Requires a direct connection between the databases. You might hit the limits of SQL, or of the available SQL knowledge, pretty quickly, depending on the kind of transformation.
Use T-SQL, or whatever procedural language the database provides; everything else is similar to the pure-SQL approach.
PRO: Pretty fast, since you don't leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL, it is not the nicest language for complex transformations.
Use an ETL tool: there are special tools for extracting, transforming and loading data. They often support various databases, and have many strategies readily available for deciding whether an update or an insert is in order.
PRO: Sorry, you'll have to ask somebody else about that; so far I have had nothing but bad experience with these tools.
CON: A highly specialized tool that you need to master. In my personal experience, implementation and execution of the transformation are slower than with hand-written SQL, and it is a nightmare for maintainability, since everything is hidden away in proprietary repositories; for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if anything.
Use a high-level language (Java, C#, VB ...): you would load your data into proper business objects, manipulate those, and store them in the database. This is pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate.
PRO: Even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tools like good IDEs, testing frameworks and CI systems to support you while developing the transformation.
CON: It adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling them back into the target database). I'd go this way if it is a process that is going to be around for a long time.
Building on that last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am leaving the scope of your question.
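Since the question mentions SQL Server 2005, which predates MERGE, the pure-SQL option sketched above would typically be an update/insert pair across the two databases, along these lines (object names are invented and the real model would span all 15 tables):

    -- Update rows that already exist in the target ...
    UPDATE t
    SET    t.Name  = s.Name,
           t.Price = s.Price
    FROM   TargetDb.dbo.BusinessObject AS t
    JOIN   SourceDb.dbo.BusinessObject AS s ON s.Id = t.Id;

    -- ... then insert the rows that don't.
    INSERT INTO TargetDb.dbo.BusinessObject (Id, Name, Price)
    SELECT s.Id, s.Name, s.Price
    FROM   SourceDb.dbo.BusinessObject AS s
    WHERE  NOT EXISTS (SELECT 1 FROM TargetDb.dbo.BusinessObject AS t WHERE t.Id = s.Id);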
I'm with John: SSIS is the way to go for any repeatable process that imports large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need to do a hybrid of set-based and looping code that runs in batches (of, say, 2000 records at a time) rather than lock up the table for the whole time a large insert would take.
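A minimal sketch of that batching idea, reusing the invented tables from the previous example:

    -- Copy missing rows over in batches of 2000 to keep transactions and locks small.
    DECLARE @batchSize INT;
    SET @batchSize = 2000;
    WHILE 1 = 1
    BEGIN
        INSERT INTO TargetDb.dbo.BusinessObject (Id, Name, Price)
        SELECT TOP (@batchSize) s.Id, s.Name, s.Price
        FROM   SourceDb.dbo.BusinessObject AS s
        WHERE  NOT EXISTS (SELECT 1 FROM TargetDb.dbo.BusinessObject AS t WHERE t.Id = s.Id);

        IF @@ROWCOUNT < @batchSize BREAK;
    END;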
