Query equivalence evaluation - sql-server

My question is rooted in T-SQL and the SQL Server environment, but its scope is not confined to this technology. I am working on a database with quite complex business logic, with existing views and stored procedures and new ones to be designed. By comparing different queries, or parts of them, I have a strong feeling that there are sections performing the same job with a different arrangement, but of course to refactor the whole mess I need something more than a feeling; so I am trying to find a way to demonstrate that two statements are equivalent.
An obvious but weak approach would be to check that the two queries A and B produce the same recordset: if A is a subset of B and B is a subset of A, they are the same recordset. I am not sure this is a good idea because, of course, a recordset is not a query; the results could depend on the data and on specific parameter values. My question is: is there a method to prove the equivalence of two different queries? I would say yes, because the optimization performed by the database should rely on this. Could someone point me to documentation or books that dig into this? If there is no general method to prove equivalence, is there some smart approach based on regression testing, performed according to some effective heuristic, that does the job?
Edited later: would reverse engineering the queries (by hand?) by means of relational algebra be a better method to assess query equivalence than using other queries and/or the computer? If so, are there automated tools that help with this "reverse engineering"?
Thanks a lot for helping

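You probably can't prove it in general: equivalence of arbitrary relational queries is undecidable, and even for the restricted class of conjunctive queries the problem is NP-complete. Check this SO question on query equivalence (that one is about Oracle, but there are a couple of answers / links that should be relevant for you).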

You can check the execution plans of the two queries. If they are the same, you have your answer!

You can only check it via the execution plan. Apart from that, I don't think there is any way to prove this.

You'll need to implement some "canonical query plan" generator for this (an "optimal query plan" as generated by the DBMS can be nondeterministic). In most cases, using alphabetical ordering of terms and tables as a tie-breaker will get you there.

I doubt you are going to be able to formally prove or disprove this, but my take on this would be to
identify all use cases
identify all boundary values
identify all parameters
and derive a test plan from that. It would require you to
create testdata for each case
run both queries against that data
compare the results
If you don't find any differences after testing, you can be reasonably assured that both statements are equivalent.
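The test plan above can be sketched roughly as follows. This is only an illustration, not a proof technique: it uses Python's sqlite3 module as a stand-in for SQL Server, and the orders table, the three queries and the helper same_results are all made up for the example. Rows are compared as multisets, so duplicate rows count, which a plain set comparison would miss.

```python
import sqlite3
from collections import Counter

def same_results(conn, query_a, query_b, params=()):
    """Return True if both queries yield the same multiset of rows."""
    rows_a = Counter(conn.execute(query_a, params).fetchall())
    rows_b = Counter(conn.execute(query_b, params).fetchall())
    return rows_a == rows_b

conn = sqlite3.connect(":memory:")
# Hypothetical test data: includes a duplicate row, a zero and a NULL,
# because those are the cases where rewrites usually diverge.
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
    INSERT INTO orders (customer, amount) VALUES
        ('ann', 10), ('ann', 10), ('bob', 0), ('cy', NULL);
""")

query_a = "SELECT customer, amount FROM orders WHERE amount > ?"
query_b = "SELECT customer, amount FROM orders WHERE NOT (amount <= ?)"
query_c = "SELECT DISTINCT customer, amount FROM orders WHERE amount > ?"

# Run every candidate pair against each boundary value of the parameter.
for boundary in (-1, 0, 10):
    print(boundary,
          same_results(conn, query_a, query_b, (boundary,)),
          same_results(conn, query_a, query_c, (boundary,)))
```

With this data, query_a and query_b agree for every boundary, while query_a and query_c differ whenever the duplicate row is selected and agree only when the parameter filters everything out. That is exactly why the choice of test data and boundary values matters.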

Related

Stored procedure with multiple selects - interaction with client tool?

Suppose I have a stored procedure as follows:
create procedure p_x
as
begin
select 'a','b','c'
select 'c','d','e'
select 'e','f','g'
end
go
This is of course not the real code, but it illustrates enough to be able to ask my questions.
I'm looking for the best performance and the best practices to deal with it.
How will the client tool (eg Informatica Data Quality) calling this procedure react?
Will it receive 3 separate results, just the last query result or all results at once?
Will each separate query be sent to the client directly (and will the procedure halt until that completes)? Or is this done after the procedure has finished?
Is it good practice to work this way? I was looking into passing the data back through a table-type OUTPUT parameter, but, based on what I have read elsewhere, this doesn't seem possible if I'm correct (table types only work as input).
Is there a performance impact in this way? And if so what is the way to do this as efficient as possible (e.g. to just send one result back to the client)
You would be better served by posting your question to the Informatica forums. They should be able to answer your questions precisely and accurately. But I'll give it a go.
How will the tool react? Don't know, but often tools that support using stored procedures as a datasource will assume and will consume a single (and the first) resultset. Any others will be ignored. Go ask in their forums.
Will it receive 3 ...? Roughly the same question and answer as the first.
Will each separate query ...? Your procedure produces three resultsets. How the client consumes them is, again, something you should ask in their forums. The procedure itself will not "halt" waiting for the client to do anything.
Is it good practice...? Not in my opinion. Nor is posting a complete nonsense procedure a useful tool for discussing the pros/cons of this approach. Can it be a useful thing to do? Likely. But it is not often done IME. In addition, you are dealing with a tool with which you are not familiar. The simpler you keep things the better you are off in the long run regardless of your tools.
A procedure is a unit of work and should do one "thing". If it produces multiple resultsets, one can argue that it ceases to do a single thing since, logically, each resultset represents a set of different (even if related) things. And typically one would expect to see some relationship among the resultsets. If there are no relationships, then the resultsets are obviously different things which violates the idea of a procedure. You might want to review the topic of coupling and cohesion. But I think I see a bigger issue - which I'll address with the next item.
Is there a performance impact ...? This can't really be answered. Performance is always, ALWAYS specific to a particular situation (query, schema, etc). Based on that last sentence, I think you have not made the adjustment to thinking in terms of sets - something that is critical to writing efficient sql. Rather, I'll guess that you are thinking in terms of a loop which includes a select statement and each iteration will produce a set of (perhaps 1 but who knows) rows. If you think you have the "option" to produce just one resultset of 3 rows vs. 3 resultsets of 1 row, then you are most likely stuck in RBAR land. Regardless, this can't really be answered. It is also a question for the Informatica people.
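If the three selects have compatible column shapes, one way to hand the client a single resultset is to combine them with UNION ALL. The sketch below only illustrates that idea, using Python's sqlite3 module as a stand-in. The source column is a hypothetical discriminator added for the example, and nothing here says how SQL Server or Informatica would behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One statement, one result set. The first column records which of the
# original SELECTs each row came from (a made-up discriminator column).
combined = """
    SELECT 1 AS source, 'a' AS c1, 'b' AS c2, 'c' AS c3
    UNION ALL SELECT 2, 'c', 'd', 'e'
    UNION ALL SELECT 3, 'e', 'f', 'g'
    ORDER BY source
"""
rows = conn.execute(combined).fetchall()
for row in rows:
    print(row)
```

This only works when every branch returns the same number of columns with compatible types; if the real selects return different shapes, they are different things and are better served by separate procedures.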

Is it bad design to use arrays within a database?

So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant (Arrays are not atomic, right?) with 1NF. So my question is: Is there a lack of efficiency or data safety this way? Should I learn early to not use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves; they are as atomic as a character varying field (an array of characters, no?), and they exist to make our lives easier and our databases faster and lighter. There are portability issues to consider (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column to the posts table and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no extra tables or rows needed), simplifies our select queries by joining one fewer table, and seems easier for a human to understand (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
The example may be poor but it's the first that came to mind.
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrong. You can live without them and have a great, fast and optimized database. If portability matters (e.g. rewriting your system to work with other databases), then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array then it's atomic. If you are more interested in the individual elements then it is being used as structure. A text field is basically a list of characters. However, we're usually interested in the whole string.
Now - from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL then you'll
Likewise foreign-key constraints can't be added to an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating values. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases, as examples.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem in getting third-party client apps to be able to handle the array fields in ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
IF, however, you interface with third-party software that directly queries the DB, and arrays are used to produce queries, then I'd avoid them in favor of simpler lookup tables and other traditional relational approaches.

db design: efficiency consideration when adding an intermediate class into a Many-Many relationship

I understand an intermediate class is often introduced to capture information in a situation where for example, a team has many players, and a player plays for many teams over the years. The intermediate class introduced is contract with cardinality as shown:
Team -1----N- Contract -N----1- Player
Let's say however that 98% of all queries only want current information and don't care about historical information. Given the name of a player, they want to know information about his current team, and perhaps current contract.
Given the above relationship, should all the contracts always be looked through to find the current one first, and then from there access information about the team? Or should an optimization be made with direct linkage between the player and his current team?
Thanks
If it is assured that there is only one team for each player at a given time, you just add a currentTeam column to the Player table and that's it. But remember you must update it every time you update the Contracts table! And it must be done within the same transaction, so that the database is kept consistent at all times.
You violate some normal form this way, but you know what and why you are doing that - for efficiency and optimization. I do this trick many times.
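A minimal sketch of this trick, assuming a made-up schema (the table and column names, including current_team_id, are hypothetical) and using Python's sqlite3 module as a stand-in for whatever database is in use. The point is only that the contract insert and the shortcut update commit or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE player (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        current_team_id INTEGER REFERENCES team(id)  -- denormalized shortcut
    );
    CREATE TABLE contract (
        id INTEGER PRIMARY KEY,
        team_id INTEGER NOT NULL REFERENCES team(id),
        player_id INTEGER NOT NULL REFERENCES player(id),
        start_date TEXT NOT NULL
    );
    INSERT INTO team (id, name) VALUES (1, 'Reds'), (2, 'Blues');
    INSERT INTO player (id, name) VALUES (1, 'Sam');
""")

def sign_contract(conn, player_id, team_id, start_date):
    # "with conn" commits if both statements succeed and rolls back if
    # either raises, so the shortcut column never disagrees with contract.
    with conn:
        conn.execute(
            "INSERT INTO contract (team_id, player_id, start_date) VALUES (?, ?, ?)",
            (team_id, player_id, start_date),
        )
        conn.execute(
            "UPDATE player SET current_team_id = ? WHERE id = ?",
            (team_id, player_id),
        )

sign_contract(conn, 1, 1, "2020-01-01")
sign_contract(conn, 1, 2, "2023-01-01")

# Reading the current team no longer needs the contract table at all.
current = conn.execute(
    "SELECT t.name FROM player p JOIN team t ON t.id = p.current_team_id "
    "WHERE p.id = 1"
).fetchone()
print(current)
```

Funnelling every contract change through one function (or one stored procedure) is what keeps the redundant column trustworthy; any code path that writes to contract directly reintroduces the inconsistency risk.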
This seems to be under the context of some kind of ORM, so I'll run with that. (Even if it isn't, keep reading.)
Objects are useful for modeling complex operations. For example, adding a new Contract causes all sorts of crazy things to happen to the Team, the Players, and various PayChecks (I made the last one up, but you get the point). This is the perfect kind of thing to be handled in code rather than in, say, a hideously complex T-SQL stored procedure.
But when it comes to querying, I find that it often makes sense to write a view/SQL statement/projection that is shamelessly tailored to the set of information that you need to perform a function. As long as you do this for reading data, and not for writing it, then you're not really subverting your object model; you are just looking at it a different way, and you're just making a pragmatic observation that most of the time, you only need the information from an IPlayerCurrentContractQuery and not the whole list of Contracts within the Player. Since it is a method that is called a bajillion times, you've written an integration test to make sure that the SQL produces correct results, and you've looked closely at its query plan to make sure that it's not doing awful things like table scans to the database. This commonly-used screen in your app is fast and everyone is happy.
One could make the case that creating such a separate query is a premature optimization, but it probably isn't. I mean, if a player usually only has a few Contracts, then it might not be worth separating out the query and interface. Sucking down all of the Contracts from the database to loop through them and pluck out the current one is going to perform worse than selecting the right one from the database first, but if it's just a handful of Contracts, then a "yeah I'm fully aware it's kinda dumb but it's fast enough" approach is probably good enough, just move on. But if these Contracts stretch back years or are large objects, then separating out the query becomes a no-brainer.
If that starts performing badly because of the joins (which is unlikely unless you start seeing significant traffic), then you add a cache. And if that doesn't work due to lots of writes, then you can start denormalizing your database by adding a direct reference. But unless you are writing the next Facebook of baseball then YAGNI, and at that point you're sharding across servers and throwing away most of the benefits of the relational model anyway so who cares.
A similar situation is posed in my answer to this question.
(If this question isn't about ORM, and really is just about modeling how the tables are designed, then you make sure that you have an index that covers the query that selects the current contract--such as start and stop dates--and you are pretty much done unless you have really exceptional scaling requirements as mentioned above. If you're writing a particular set of joins very often, then you might write a function or stored procedure to remove the boilerplate.)
That's my brain dump. Hope this helps!
Given the above relationship, should all the contracts always be
looked through to find the current one first, and then from there
access information about the team?
A modern query optimizer will use the most selective index first. Assuming that player_id is in that index in a usable position, the optimizer will probably find all the rows for that player first--and there won't be many, right?--then do another index scan on the contract dates to find the current contract.
If I were you, I'd create a view that returns only the "current" rows. Let application code run against that view.
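Such a view could look roughly like the sketch below. It assumes a made-up schema in which a contract with a NULL end_date is the current one (that convention, the index and all names are assumptions for illustration) and uses Python's sqlite3 module as a stand-in database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE contract (
        id INTEGER PRIMARY KEY,
        team_id INTEGER NOT NULL REFERENCES team(id),
        player_id INTEGER NOT NULL REFERENCES player(id),
        start_date TEXT NOT NULL,
        end_date TEXT              -- NULL means "still running" (assumption)
    );
    -- Index so a lookup by player touches only that player's contracts.
    CREATE INDEX contract_player_dates
        ON contract (player_id, end_date, start_date);

    -- The view hides the "which contract is current?" rule from callers.
    CREATE VIEW current_contract AS
        SELECT p.name AS player, t.name AS team, c.start_date
        FROM contract c
        JOIN player p ON p.id = c.player_id
        JOIN team t ON t.id = c.team_id
        WHERE c.end_date IS NULL;

    INSERT INTO team VALUES (1, 'Reds'), (2, 'Blues');
    INSERT INTO player VALUES (1, 'Sam');
    INSERT INTO contract (team_id, player_id, start_date, end_date) VALUES
        (1, 1, '2019-01-01', '2021-12-31'),
        (2, 1, '2022-01-01', NULL);
""")

row = conn.execute(
    "SELECT team, start_date FROM current_contract WHERE player = ?", ("Sam",)
).fetchone()
print(row)
```

Application code then queries current_contract and never needs to know how "current" is defined; if the definition changes (say, to a date-range check), only the view has to be rewritten.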

When do you have too many tables?

Two of my colleagues and I are building a system to do all sorts of hydrology and related stuff. It has a lot of requirements and a good number of tables.
We are handling all sorts of sampling that it is done within this scope (hydrology) and we are trying to figure out a way to do it in a less painful way.
Sometimes we need to get all that sampling together and I'm starting to think we are over-complicating our database design.
How or when do you know that you are over-designing a database? Of course we are considering a lot of normal form rules and other good practices, but when is it OK to drop one of those rules, e.g. by not normalizing something?
What are your opinions on this?
Short Answer
You can't; worry about something else.
Long Answer
This sounds like yet another form of premature optimization. (YAFPO?)
You should design your schema using third normal form (3NF). Once designed, you should populate your tables with data and begin profiling.
If a particular query is deemed too costly then you should look into denormalization on a case by case basis.
Technical Answer (for the nitpickers who will inevitably object to: "you can't")
You will reach a limit at some point based on your choice of RDBMS and/or storage engine. Likely ceilings will be memory consumption or open file descriptors.
"When do you have too many tables?"
At the level of logical design, the correct answer is "never".
At the level of physical design (insofar as "having a table" really refers to some concept that pertains to the physical design), the correct answer is "if and when the queries that you need to do, given the restrictions of the DBMS you are using, are causing performance to be unacceptably low.".
We have a system with literally hundreds of tables - it's no big deal, it's just that a lot of different things are stored in the database.
We have a ton of tables in our system as well. What we did was normalize the database to a good point, then created a few views that encompass the most common table usage needs of our system. Something like that could help you as well.

Can you use Decision Tables in Relational Databases

I heard that decision tables in relational database have been researched a lot in academia. I also know that business rules engines use decision tables and that many BPMS use them as well.
I was wondering if people today use decision tables within their relational databases?
A decision table is a cluster of conditions and actions. A condition can be simple enough that you can represent it with a simple "match a column against this value" string. Or a condition could be hellishly complex. An action, similarly, could be as simple as "move this value to a column". Or the action could involve multiple parts or steps or -- well -- anything.
A CASE function in a SELECT or WHERE clause is a decision table. This is the first example of decision table "in" a relational database.
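To make that concrete, here is a small sketch of a CASE expression read as a decision table. The orders table, its columns and the rule outcomes are invented for the example, and Python's sqlite3 module stands in for the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, is_member INTEGER);
    INSERT INTO orders VALUES (1, 50, 0), (2, 50, 1), (3, 500, 0), (4, 500, 1);
""")

# Each WHEN is one rule: the conditions on the left, the action (here just
# a value) on the right. Rules are tried top to bottom; first match wins.
rows = conn.execute("""
    SELECT id,
           CASE
               WHEN amount >= 100 AND is_member = 1 THEN 'free shipping + discount'
               WHEN amount >= 100                   THEN 'free shipping'
               WHEN is_member = 1                   THEN 'discount'
               ELSE 'standard'
           END AS treatment
    FROM orders
    ORDER BY id
""").fetchall()
for row in rows:
    print(row)
```

Because the first matching rule wins, the order of the WHEN branches is part of the table's meaning, which is one way an inline CASE differs from a decision table where every combination is listed explicitly.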
You can have a "transformation" table with columns that have old-value and replacement-value. You can then write a small piece of code like the following.
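def decision_table( aRow ):
    # Look up the replacement for this row's value in the transformation table.
    # Assumes an open DB-API connection named "connection" using "?" placeholders.
    result= connection.execute( "SELECT replacement_value FROM transformation WHERE old_value = ?", (aRow['somecolumn'],) )
    replacement= result.fetchone()
    # fetchone() returns a row tuple; the single selected column is at index 0.
    aRow['anotherColumn']= replacement[0]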
Each row of the decision table has a "match this old_value" and "move this replacement_value" kind of definition.
The "condition" parts of a decision table have to be evaluated somewhere. Your application is where this will happen. You will fetch the condition values from the database. You'll use those values in some function(s) to see if the rule is true.
The "action" parts of a decision table have to be executed somewhere; again, your application does stuff. You'll fetch action values from the database. You'll use those values to insert, update or delete other values.
Decision tables are used all the time; they've always been around in relational databases. Each table requires a highly customized data model. It also requires a unique condition function and action procedure.
It doesn't generalize well. If you want, you could store XML in the database and invoke some rules engine to interpret and execute the BPEL rules. In this case, the rules engine does the condition and action processing.
If you want, you could store Python (or Tcl or something else) in the database. Seriously. You'd write the conditions and actions in Python. You'd fetch it from the database and run the Python code fragment.
Lots of choices. None of them "academic". Indeed, the basic condition - action stuff is done all the time.
Whether or not to put decision tables in a database depends on a number of other questions.
Will your conditions be calculated inside the RDBMS or elsewhere? If the data used for evaluating these conditions is in the database, and a suitable method for evaluating them inside the RDBMS can be devised, it is probably a good idea. Maybe your actions also happen inside your database, which would make it even more attractive.
Your conditions, and even the execution of your actions, might be outside the RDBMS, but you could still keep the mapping between combinations of conditions and actions inside it, probably because most of your other data is there and all you have is a web server sitting on top of it.
I can think of two ways to model this, depending on how many conditions you have (and whether they are binary), and what the capacity for columns per table is.
Let's say you have 6 conditions that are binary, this means you have 2^6 = 64 possible combinations. Then you could have one column for every combination, and one row for every action.
Or you could have 16 conditions which means you would have almost an incalculable number of combinations (actually 65536). Which is a ridiculous number of columns. Better then to have a column for each condition and a column for each action and 65536 rows of what to do in each possible situation. Each row would represent a situation and what to do in that situation. The only datatype you use would be bool. You could also package these bools into bitmasked integers.
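The bitmask idea can be sketched in a few lines. The condition names and the actions below are hypothetical, and a Python dict stands in for the table; in a database the packed integer would be the key column of the decision table.

```python
from itertools import product

CONDITIONS = ("is_member", "is_large_order", "is_rush")  # hypothetical names

def pack(flags):
    """Pack booleans into an integer, first condition in the lowest bit."""
    mask = 0
    for bit, flag in enumerate(flags):
        if flag:
            mask |= 1 << bit
    return mask

# One entry per combination: 2 ** 3 = 8 rows for 3 binary conditions.
# The action chosen for each row is made up for the example.
decision_rows = {
    pack(flags): ("expedite" if flags[2] else "discount" if flags[0] else "standard")
    for flags in product((False, True), repeat=len(CONDITIONS))
}

def decide(**facts):
    """Look up the action for one situation, given every condition by name."""
    return decision_rows[pack(facts[name] for name in CONDITIONS)]

print(len(decision_rows))
print(decide(is_member=True, is_large_order=False, is_rush=False))
```

Enumerating every combination up front is what lets you check that no situation is left unmapped; the same enumeration with 16 conditions yields the 65536 rows mentioned above.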
Actually, bigger decision tables are better avoided. Divide and conquer: using several smaller tables is a much better way. Usually a subject matter expert will get tired if asked to give opinions on too high a number of conditions.
The strength of the decision table is really in the modelling stage where the developer and the subject matter expert can find out if every possible situation is mapped, and no blind spots can exist.
I think they will contribute to the already too much declined state of what used to be "in-person" communications- enough hide behind the screen as it is..... come out of the closet, get out - got the picture.
I would look into using an Object database rather than a traditional RDBMS (Relational Database Management System). Object databases are designed to be fast at handling hierarchical relationships between objects, whereas in an RDBMS, you have to represent these relationships across multiple table rows, or even tables so your queries (tree traversals) will be slow.

Resources