Optimize SQL Server Query using "OR" or "UNION" - sql-server

We have a SQL Server table called Deals which is a general table including all financial trade entries, the number of which can be hundreds of thousands or even millions. Each row in Deals has one column called Product, which is of type varchar and corresponds to different trade type: bond trade, equity trade, future or option trade, some are simple and some are more complex. Now we want to search the whole table to get all deals which are not matured. Depending on the type of the trade, the term "matured" has different meanings. For example, for Product type "BOND" and some others, we check the value of column MaturityDate in Deals table, is larger than the given date, i.e., MaturityDate > #DATE. And for the deal type "FUTURE", we have the column ValueDate in the same table larger than given date AND MaturityDate larger than given date. So my first draft of the query would look like this:
SELECT * FROM Deals
WHERE
(
(Product IN ('ProdA', 'ProdB') AND MaturityDate > #CD) OR
(Product IN ('ProdC', 'ProdD') AND (MaturityDate > #CD AND ValueDate > #CD)) OR
(Product IN ('ProdE', 'ProdF', SOME_OTHER_PRODUCT_TYPE) AND (MaturityDate > #CD AND ValueDate > #CD) AND SOME_OTHER_CRITERIA) OR
...
)
We have over 30 Product values and and least 8 sets of criteria (the most common criteria set applies to maybe 7 or 8 Product values, and the least common criteria sets apply to only one or two Product values). The column Product is indexed, some columns (but not all) in the criteria, for example, MaturityDate are also indexed. Most of the criteria sets only check the column values in Deals table, but there are indeed some criteria sets which involve with JOIN to some other tables and check the values there as well.
So now my question is, how I can optimize such a query (as a SW developer, I am really not an expert in DB and seldom write data access code)? Because I read from somewhere that it might be a good idea to replace OR clause with UNION. However, when I executed queries using UNION and using OR, the former took 5 seconds on my development machine (with less than 100,000 items in Deals table) while the latter took 3 seconds. And like I said, I have limited knowledge so I don't know if there is some other way to optimize such query. Could anyone share some experience? Thanks!

Related

Excel formula to retrieve conditional sumproduct for massive data set

I have a data table of over 50,000 rows. The data contains SALES information for permutations of STORES, DATES, and PRODUCTS. However, the PRODUCTS are actually a combination of PRIMARY PRODUCTS (PP) and SECONDARY PRODUCTS (SP), where a sale QUANTITY of 1 PP should convert to the sale of 1 or more SPs. I have another sheet containing the CONVERSIONS of PP to SP containing the respective MULTIPLIERS (over 500 rows). PPs and SPs have a many-to-many relationship; PPs may convert to several different SPs, and the same SP may be converted from other PPs.
At the moment, only unconverted sales quantities exist for the PRODUCTS, and it's my job to convert those figures to each PP's respective SP if a MULTIPLIER exists.
Sample: https://i.stack.imgur.com/YdGHn.png
I am able to do that with the following SUMPRODUCT() formula, which appears more efficient than an array formula:
=SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
)
However, given the size of my data set, it still takes forever to process. Is there a more efficient way to do this?
EDIT:
I tried wrapping the formula in a conditional so that SUMPRODUCT only evaluates if the Product in question can be found in the Conversion table as an SP (and it also now displays the values of PRODUCTS that don't have any conversions). This seems to have sped things up a little, but still nowhere near quick enough...
=IFERROR(IF(MATCH([#Product],Conversions[SP],0)>0,
SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
),0),0)+[#Quantity]
if you have the possibility to import your data to a database. you then can work with indexed tables. should be faster.

Need help in developing DB logic

This is a mini-project of mine - Airline reservation system - lets call this airline FlyMi : I have a database (Not decided which one, friend of mine wants to go with MongoDB). Anyhoo, this is my requirement :
I have a table which has details of the flight - Flight number, schedule etc. I'm going to use this table to perform various operations - booking , cancellation , modification
This is where I'm stuck : For the desktop app and the web application - I'm offering an option to select seats. This means I've got to keep track of which seats are booked , which ones are not. And assume I have an UI , which shows seats as Red - Booked Green - Not Booked.And all of this - for each and every flight. My question is : What do you think would be the most efficient way to track seat bookings , for each flight in that airline?
Current Idea : Keep a table named passenger - with all the details such as name , address etc. which keep track of all passengers, and maintain a passenger ID such that , first 4 characters are flight ID, Last 2 character are seat numbers they have chosen, with random number in-between ( I say random because I think it is immaterial here). So, for any flight , If I have to find out number of un-booked seats, I will have to scan through every passenger , who has booked, and who has booked in that flight. I think this is really in-efficient. Provide me with the most efficient logic to do this.
Don't use "smart keys".
This is a bad idea called "smart keys" or "encoding information in keys".
See this answer which contains this excerpt:
Despite it now being easy to implement a Smart Key, it is hard to recommend that you create one of your own that isn't a natural key, because they tend to eventually run into trouble, whatever their advantages, because it makes the databases harder to refactor, imposes an order which is difficult to change and may not be optimal for your queries, requires a string comparison if the Smart Key includes non-numeric characters, and is less effective than a composite key in helping range-based aggregations. It also violates the basic relational guideline that every column should store atomic values
Smart Keys also tend to outgrow their original coding constraints
(Notice that seat locations are typically identified by smart keys in that they are a row number and a count across a row. But they are also typically visibly physically permanently bolted into that formation. And imagine if they were labelled and rearranged.)
Educate yourself about database design.
Just describe your business in the most straightforward terms. That is how relational model databases & DBMSs work.
Find enough fill-in-the-[named-]blanks sentence templates to describe your business situations:
"customer [cid] has name [firstname] [lastname]
AND customer [cid] has a phone number [phonenumber] of type [type] ..."
"customer [cid] can use credit card #[card_no]"
"seat [seatid] is at row [row] and column [column]"
"seat [seatid] is booked"
"seat [seatid] is temporarily committed to an unfinished booking"
...
For each such parameterized sentence template (aka predicate) have a base table where the names of the blanks/parameters are column names. Each row in a table states the statement (proposition) got from filling in the blanks per its column values; each row not in a table states NOT the statement from filling in the blanks per its column values.
Then for each table find every functional dependency (FD) that holds. (When a predicate can be expressed in the form "... AND column = F(column1,...)" then we say that column set {column1,...} functionally determines column column and that FD set → column holds.) Then identify every candidate key (CK). (A superkey is a column set that functionally determines every column. Ie that is unique, ie where each subrow of values for those columns appears only in one row of a table. A CK is a superkey that doesn't contain a smaller superkey.) Then find every join dependency (JD). (Some predicates say "... AND ..." for some number of ANDs & "..."s. There is a JD when the table for each predicate "..." would look like what you get from taking only its columns from the original table.) Note that every FD comes with an associated (binary) JD.
Then normalize your tables to fifth normal form (5NF). This means decomposing (ie replacing a table in which JD "... AND ..." holds by tables whose predicates are the "..."s) until each JD that holds is implied by the CKs (ie must hold when the JDs from the FDs from the CKs hold.) (For performance reasons one can also then denormalize by combining to base tables that aren't in 5NF.)
See this answer and this one.
Then we query by describing the rows we want. We do this by connecting base table predicates with logical operators (ie AND, OR, NOT, FOR SOME, FOR ALL etc) and function calls to give the predicates for the tables we want and/or by connecting base table names by relation operators (ie JOIN, UNION, MINUS/EXCEPT, PROJECT/SELECT, RENAME/AS) to give the values of the tables we want and/or both (eg RESTRICT/WHERE).
The JOIN of two tables holds the rows that make a true statement from, ie has as predicate, the AND of their predicates; and the UNION the OR, the MINUS/EXCEPT the AND NOT; and that PROJECT/SELECT columns of a table puts FOR SOME all-other-columns before its predicate; and RESTRICT/WHERE puts AND condition after its predicate; and the RENAME/AS of column renames that parameter in its predicate. So a table expression corresponds to a predicate: A table (base table or query result) value contains the rows that make a true statement from its (base table's or query expression's) predicate.
See this answer.
The same goes for constraints, which are true statements that collectively describe the application situations and database states than can arise given the situations that can arise and the base table predicates.
See this answer.

T-SQL: query which joins with all dependent tables and produce cartesian product

I have a bunch of tables which refer to some number of other tables (zero, one, two or more).
My example tables might contain following columns:
Id | StatementTable1Id | StatementTable2Id | Value
where StatementTable1 will contain following columns:
Id | Name | Label
I wish to get all possible combinations and join all of them.
I found this link very useful (query which produce information about dependencies).
I would imagine my code as follows:
Prepare list of tables which I wish to query.
Query link for all my tables and save results into temporary table.
Check maximum number of dependent tables. Prepare query template - for example if maximum number of dependent tables is equal two:
Select
Id, '%Table1Name%' as Table1Name,
'%StatementLabelTable1%' as StatementLabelTable1,
'%Table2Name%' as Table2Name,
'%StatementLabelTable2%' as StatementLabelTable2, Value"
Use cursor - for each dependent table replace appropriate part with dependent table name and label of elements within it.
When all dependent tables have been used - replace all remaining columns with empty string.
add "UNION ALL" and proceed to next table
Run query
Could you tell me if there's any easier or better way?
What you've listed there sounds like you'll need to do if you don't know the column details ahead of time. There's likely going to be some trial-and-error to get the details correct, but it's a good plan to start.
That being said, why on earth would you want to do such a thing? It sounds like you need to narrow down your requirements on what data is actually needed. Otherwise, as you add data to your database, this query and resulting data set is going to quickly become quite unwieldy (these data sets are the kinds you hear about becoming daily "door-stop reports"; no one uses them, but they never remember why it was created, so they keep running the report, and just use it as a door-stop).

Database table structure for price list

I have like about 10 tables where are records with date ranges and some value belongin to the date range.
Each table has some meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then table dates
day DATE
where are dates for each day for 2 years ahead.
Final result is joining these 10 tables to dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing all the 10 tables data for each day, but final table would have about 1.5M - 2M records.
From testing it seems to be quicker (0.2s instead of about 1s) to search in this table instead of joining tables and searching in the joined result.
Is there any real reason why it should be bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that dates from one table don't line up with dates from another table, leading to creating extra boundaries for some attributes because being in one table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table was calculated from the separate tables when ever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours except I had only 3
tables: I had availability, pricing and margin. The fact was that the 3 were unrelated, so date ranges never aligned, leasing to lots of artificial rows in the big table.

schema for storing different varchar fields over time?

This app I'm working on needs to store some meta data fields about an entity. The problem is that we can already foresee that these fields are going to change a lot in the future. Right now every entity's property is translated to one column in the entity table, but altering table columns later down the road will be costly and error-prone right?
Should I go for something like this (key-value store) instead?
MetaDataField
-----
metaDataFieldID (PK), name
FieldValue
----------
EntityID (PK, FK), metaDataFieldID (PK, FK), value [varchar(255)]
p.s. I also thought of using XML on SQL Server 05+. After talking to some ppl, seems like it is not a viable solution 'cause it will be too slow for doing certain query for reporting purposes.
You're right, you don't want to go changing your data schema any time a new parameter comes up!
I've seen two ways of doing something like this. One, just have a "meta" text field, and format the value to define both the parameter and the value. Joomla! does this, for example, to track custom article properties. It looks like this:
ProductTable
id name meta
--------------------------------------------------------------------------
1 prod-a title:'a product title',desc:'a short description'
2 prod-b title:'second product',desc:'n/a'
3 prod-c title:'3rd product',desc:'please choose sm med or large'
Another way of handling this is to use additional tables, like this:
ProductTable
product_id name
-----------------------
1 prod-a
2 prod-b
3 prod-c
MetaParametersTable
meta_id name
--------------------
1 title
2 desc
ProductMetaMapping
product_id meta_id value
-------------------------------------
1 1 a product title
1 2 a short description
2 1 second product
2 2 n/a
3 1 3rd product
3 2 please choose sm med or large
In this case, a query will need to join the tables, but you can optimize the tables better, can query for independent meta without returning all parameters, etc.
Choosing between them will depend on complexity, whether data rows ever need to have differing meta, and how the data will be consumed.
The Key Value table is a good idea and it works much faster than the SQL Server 2005 XML indexes. I started the same type of solution with XML in a project and had to change it to a indexed Key Value table to gain performance. I think SQL Server 2008 XML Indexes are faster, but have not tried them yet.
The XML speed only factors in depending on the size of the data going into the xml column. We had a project that stuffed data into and processed data from an xml column. It was very fast.. until you hit around 64kb. 63KB and less took milliseconds to get the data out or insert into. 64KB and the operations jumped to a full minute. Go figure.
Other than that the main issue we had was complexity. Working with xml data in sql server is not for the faint of heart.
Regardless, your best bet is to have a table of name / value pairs tied to the entity in question. Then it's easy to support having entities with either different properties or dynamically adding / removing properties. This too has it's caveats. For example, if you have more than say 10 properties, then it will be much faster to do pivots in code.
There is also a pattern for this to consider -- called the observation pattern.
See similar questions/answers: one, two, three.
The pattern is described in Martin Fowler's book Analysis Patterns, essentially it is an OO pattern, but can be done in DB schema too.
"altering table columns later down the road will be costly and error-prone right?"
A "table column", as you name it, has exactly two properties : its name and its data type. Therefore, "altering a table column" can refer only to two things : altering the name or altering the data type.
Wanting to alter the name is indeed a costly and error-prone operation, but fortunately there should never be a genuine business need for it. If a certain established column seems somewhat inappropriate, with afterthought, and "it might have been given a better name", then it is still not the case that the business incurs losses from that fact! Just stick with the old name, even if with afterthought, it was poorly chosen.
Wanting to alter the data type is indeed a costly operation, susceptible to breaking business operations that were running smoothly, but fortunately it is quite rare that a user comes round to tell you that "hey, I know I told you this attribute had to be a Date, but guess what, I was wrong, it has to be a Float.". And other changes of the same nature, but more likely to occur (e.g. from shortint to integer or so), can be avoided by being cautious when defining the database.
Other types of database changes (e.g. adding a new column) are usually not that dangerous and/or disruptive.
So don't let yourself be scared by those vague sloganesque phrases such as "changing a database is expensive and dangerous". They usually come from ignorants who know too little about database management to be involved in that particular field of our profession anyway.
Maintaining queries, constraints and constraint enforcement on an EAV database is very likely to turn out to be thousands of times more expensive than "regular" database structure changes.

Resources