SQL Invoice Query Performance converting Credits to negative numbers - sql-server

I have a 3rd-party database that contains invoice data I need to report on. The Quantity and Amount fields are stored as positive numbers regardless of whether the "invoice" is a Credit Memo or an actual Invoice. A single-character field holds the type: "I" = Invoice, "R" = Credit.
In a report that aggregates 1.4 million records, I need to sum this data so that Credits subtract from the total and Invoices add to it, and I need to do this for 8 different columns in the report (CurrentYear, PreviousYear, etc.).
My problem is the performance of the various ways to achieve this.
The best performing seems to be using a CASE expression within the calculation, like so:
CASE WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * (CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END) ELSE 0 END AS PPY_INVOICED_QTY
But readability-wise this is super ugly, since I have to do it for 8 different columns. Performance is good, though: it runs against all 1.4M records in 16 seconds.
Using a scalar UDF kills performance:
CASE WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * dbo.fn_GetMultiplier(ARH.InvoiceType) ELSE 0 END AS PPY_INVOICED_QTY
This takes almost 5 minutes, so that's not an option.
Other options I can think of would be:
Multiple levels of views: use a new view to add a Multiplier column, then SELECT from that and do the multiplication using the new column.
Build a table with 2 columns and 2 records, (R, -1) and (I, 1), and join it on InvoiceType, but this seems excessive.
Any other ideas I am missing, or suggestions on best practice for this sort of thing? I cannot change the stored data, that is established by the 3rd party application.

I decided to go with the multiple-views approach as Igor suggested, actually using the nested version. Even though readability is lower, maintenance is easier because there is only one named view instead of two. Performance is similar to the 8 separate CASE expressions, so overall it runs in just under 20 seconds.
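For reference, a minimal sketch of that nested shape. The table names (ARHeader, ARLine) and the join column are assumptions, not the actual 3rd-party schema:
DECLARE @iCurrentYear int
SET @iCurrentYear = 2023   -- a parameter in the real report

-- The inner query adds the Multiplier column once;
-- the outer query reuses it for every reporting column.
SELECT
    SUM(CASE WHEN x.AccountingYear - 2 = @iCurrentYear
             THEN x.ShipQuantity * x.Multiplier ELSE 0 END) AS PPY_INVOICED_QTY
    -- ...repeat the same pattern for the other seven reporting columns,
    -- plus whatever grouping the report needs...
FROM (
    SELECT ARH.AccountingYear,
           ARL.ShipQuantity,
           CASE WHEN ARH.InvoiceType = 'R' THEN -1 ELSE 1 END AS Multiplier
    FROM ARHeader AS ARH                                  -- assumed table names
    JOIN ARLine  AS ARL ON ARL.InvoiceId = ARH.InvoiceId  -- assumed join
) AS x;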
Thanks for the insights.

Related

Calculate working days between two dates: null value returned

I'm trying to figure out the number of working days between two dates. The table (dfDates) is laid out as follows:
Key  StartDateKey  EndDateKey
1    20171227      20180104
2    20171227      20171229
I have another table (dfDimDate) with all the relevant date keys and whether the date key is a working day or not:
DateKey   WorkDayFlag
20171227  1
20171228  1
20171229  1
20171230  0
20171231  0
20180101  0
20180102  1
20180103  1
20180104  1
I'm expecting a result as so:
Key  WorkingDays
1    6
2    3
So far (I realise this isn't complete to get me the above result), I've written this:
workingdays = []
for i in range(0, len(dfDates)):
    value = dfDimDate.filter((dfDimDate.DateKey >= dfDates.collect()[i][1]) & (dfDimDate.DateKey <= df.collect()[i][2])).agg({'WorkDayFlag': 'sum'})
    workingdays.append(value.collect())
However, only null values are being returned. Also, I've noticed this is very slow and took 54 seconds before it errored.
I think I understand what the error is about but I'm not sure how to fix it. Also, I'm not sure how to optimise the command so it runs faster. I'm looking for a solution in pyspark or spark SQL (whichever is easiest).
Many thanks,
Carolina
Edit: The error below was resolved thanks to a suggestion from @samkart who said to put the agg after the filter
AnalysisException: Resolved attribute(s) DateKey#17075 missing from sum(WorkDayFlag)#22142L in operator !Filter ((DateKey#17075 <= 20171228) AND (DateKey#17075 >= 20171227)).;
A possible and simple solution:
from pyspark.sql import functions as F
dfDates \
.join(dfDimDate, dfDimDate.DateKey.between(dfDates.StartDateKey, dfDates.EndDateKey)) \
.groupBy(dfDates.Key) \
.agg(F.sum(dfDimDate.WorkDayFlag).alias('WorkingDays'))
That is, first join the two datasets in order to link each date with all the dimDate rows in its range (dfDates.StartDateKey <= dfDimDate.DateKey <= dfDates.EndDateKey).
Then simply group the joined dataset by the Key column and sum the WorkDayFlag values to count the working days in each range.
In the solution you proposed, you are performing the calculation directly on the driver, so you are not taking advantage of the parallelism that Spark offers. This should be avoided when possible, especially for large datasets.
Apart from that, you are issuing repeated collects in the for-loop, even for the same data, which slows things down further.
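Since the question also allows Spark SQL, the same join can be written that way; a sketch, assuming both DataFrames have been registered as temporary views named dfDates and dfDimDate (e.g. via createOrReplaceTempView):
-- Spark SQL equivalent of the join-and-aggregate above
SELECT d.Key,
       SUM(w.WorkDayFlag) AS WorkingDays
FROM dfDates d
JOIN dfDimDate w
  ON w.DateKey BETWEEN d.StartDateKey AND d.EndDateKey
GROUP BY d.Key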

Extracting all records using a conditional SQL Server query?

I have a long database of observations for individuals. There are multiple observations for each individual, each assigned a different medcodeid.
I want to extract all records of individuals who have certain medcodeids assigned, but only if they have, at some point, also been assigned one of a smaller list of specific codes.
This is an example of what I start with: a long dataset with multiple observations per patient.
And these are the records I'd like to extract: all observations are kept, but patients 3 and 5 are not extracted, as they never had a medcode 12.
Would this be an additional WHERE clause? I am struggling because that would only extract the rows matching the second medcodeid list, but I want to extract all of an individual's rows if they have had one of these fewer codes at some point. I hope that makes some sense. I am unfamiliar with the IF command, and cannot see how CASE WHEN would work either.
Thank you very much in advance!
You definitely don't want to filter out all the rows, so you're right that an additional condition won't help with that. And WHERE only lets you look at the current row, while you're trying to make a decision based on all the rows belonging to the patient.
This query uses a common table expression and an analytic count() that tags each row with the number of matches; a window function lets you look outside the current row, which is just what you need.
-- my additions to your query are in lowercase
with data as (
    SELECT obs.patid, yob, obsdate, medcodeid,
           count(case when medcodeid IN (<list of mandatory codes>) then 1 end)
               over (partition by obs.patid) as medcode_count
    -- assuming the relationship looks something like this
    from obs inner join medcode on medcode.patid = obs.patid
    WHERE medcodeid IN (<list of codes>)
      AND obsdate BETWEEN '2004-12-31' AND GETDATE()
      AND patienttypeid = 3 AND acceptable = 1 AND gender = 2
      AND YEAR(obsdate) - yob > 15 AND YEAR(obsdate) - yob < 45
)
select * from data where medcode_count > 0;
At first I thought you were requiring that at least five of the codes from the full set were found. Now that you've edited the question I believe that you want to require that at least one code from a smaller subset is present. Either way this approach will work.
If I'm understanding what you're asking, I think what you need is an additional WHERE clause with a subquery. This could be done with an EXISTS or a join, but I find an IN query easier to work with.
You left the FROM out of your query so I had to guess at it but try this:
SELECT
obs.patid,
yob,
obsdate,
medcodeid
FROM
obs
WHERE
medcodeid IN (list of 20 codes)
AND (obsdate BETWEEN '2004-12-31' AND GETDATE())
AND patienttypeid = 3
AND acceptable = 1
AND gender = 2
AND ((YEAR(obsdate))-yob) > 15
AND ((YEAR(obsdate)) - yob) < 45
AND obs.patid IN (
SELECT
obs.patid
FROM
obs
WHERE
medcodeid IN (5 of the 20 codes)
);
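For completeness, a sketch of the EXISTS variant mentioned above (same assumed table and columns; the code-list placeholders are kept as written, and the other WHERE conditions from the query above would carry over unchanged):
SELECT
    obs.patid,
    yob,
    obsdate,
    medcodeid
FROM
    obs
WHERE
    medcodeid IN (list of 20 codes)
    AND EXISTS (
        SELECT 1
        FROM obs o2
        WHERE o2.patid = obs.patid
            AND o2.medcodeid IN (5 of the 20 codes)
    );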

Using a sort order column in a database table

Let's say I have a Product table in a shopping site's database to keep the description, price, etc. of the store's products. What is the most efficient way to let my client re-order these products?
I created an Order column (integer) to use for sorting records, but that gives me some performance headaches due to the primitive method I use: changing the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was create a procedure which checks whether there is a record at the target order (3) and updates the order of the row (ID=26) if not. If there is a record at the target order, the procedure calls itself, passing that row's ID with target order + 1 as parameters.
That causes every single record after the one I want to change to be updated, to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the order column of an item to be enough for sorting on its own, with no secondary keys involved. The order column alone must specify a unique place for its record.
In addition to all this, I wonder if I could implement something like a linked list: a 'Next' column instead of an 'Order' column to keep the next item's ID. But I have no idea how to write the query that retrieves the records in the correct order. If anyone has an idea about this approach as well, please share.
UPDATE product SET [order] = [order] + 1 WHERE [order] >= @valueChanged
Though over time you'll get larger and larger "spaces" in your order, it will still "sort".
This adds 1 to the value being changed and to every value after it in one statement, but the above caveat still holds: larger and larger "spaces" will form in your order, possibly to the point of exceeding an INT value.
Alternate solution given desire for no spaces:
Imagine a procedure UpdateSortOrder with parameters @NewOrderVal, @IDToChange, @OriginalOrderVal.
It is a two-step process, depending on whether the row is moving up or down the sort order:
IF @NewOrderVal < @OriginalOrderVal  -- moving down the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE product SET [order] = [order] + 1
    WHERE [order] BETWEEN @NewOrderVal AND @OriginalOrderVal - 1;
END

IF @NewOrderVal > @OriginalOrderVal  -- moving up the chain
BEGIN
    -- Create space for the movement; no point in changing the original
    UPDATE product SET [order] = [order] - 1
    WHERE [order] BETWEEN @OriginalOrderVal + 1 AND @NewOrderVal;
END

-- Finally update the one we moved to the correct value
UPDATE product SET [order] = @NewOrderVal WHERE ID = @IDToChange;
Regarding best practice: most environments I've been in typically want products grouped by category and sorted alphabetically or by "popularity on sale", thus negating the need to provide a user-defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
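A minimal sketch of that gapped numbering (the Name column and the specific values are just illustrations; the Order column follows the question):
-- Existing rows sit at 10, 20, 30, ... so a new row can drop in between
-- without touching any other record.
INSERT INTO Product (Name, [Order]) VALUES ('New product', 25);

-- Moving the row currently at 370 to sit just before 570 touches one row only.
UPDATE Product SET [Order] = 565 WHERE [Order] = 370;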
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE @id int = 26
DECLARE @sortOrder int = 3

;WITH Sorted AS (
    SELECT Id,
           ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
    FROM Product
    WHERE Id <> @id
)
UPDATE p
SET p.SortOrder =
    (CASE
        WHEN p.Id = @id THEN @sortOrder
        WHEN s.RowNumber >= @sortOrder THEN s.RowNumber + 1
        ELSE s.RowNumber
    END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
It is very simple: you need a "cardinality hole".
Structure: you need 2 columns:
pk = 32-bit int
order = 64-bit bigint (BIGINT, not DOUBLE!)
Insert/Update:
When you insert the first record, set order = round(max_bigint / 2).
If you insert at the beginning of the table, set order = round(order of first record / 2).
If you insert at the end of the table, set order to the midpoint between the order of the last record and max_bigint.
If you insert in the middle, set order to the midpoint between the order of the record before and the order of the record after.
This method leaves very large gaps. If you hit a constraint error, or you think the gaps have become too small, you can rebuild (normalize) the order column.
In the worst case, with periodic normalization, this structure still leaves a "cardinality hole" of around 32 bits between neighbouring values.
It is very simple and fast!
Remember: no DOUBLE, only integer types. The order must be an exact value.
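A minimal sketch of the mid-insert case (the Product, Name and SortOrder names follow the earlier answers and are assumptions; the ids are just examples):
-- Insert a product between two existing neighbours by taking the midpoint
-- of their bigint sort values.
DECLARE @prevOrder bigint, @nextOrder bigint, @newOrder bigint;

SELECT @prevOrder = SortOrder FROM Product WHERE Id = 8;    -- record before
SELECT @nextOrder = SortOrder FROM Product WHERE Id = 26;   -- record after

-- midpoint, written so the addition cannot overflow
SET @newOrder = @prevOrder + (@nextOrder - @prevOrder) / 2;

INSERT INTO Product (Name, SortOrder) VALUES ('New product', @newOrder);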
One solution I have used in the past, with some success, is to use a 'weight' instead of an 'order'. The metaphor is the obvious one: the heavier an item (i.e. the lower the number), the further it sinks towards the bottom; the lighter (the higher the number), the further it rises towards the top.
In the event I have multiple items with the same weight, I assume they are of the same importance and order them alphabetically.
This means your SQL will look something like this:
ORDER BY weight DESC, itemName
Hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a linked-list kind of method that will be ordered on the client (not in the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing
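For reference, a sketch of what that recursive ordering query could look like, assuming a hypothetical NextId column that holds the Id of the following product (NULL on the last item):
-- Walk the linked list from its head and assign a position to each row.
WITH Ordered AS (
    SELECT Id, NextId, 1 AS Position
    FROM Product
    WHERE Id NOT IN (SELECT NextId FROM Product WHERE NextId IS NOT NULL)  -- the head
    UNION ALL
    SELECT p.Id, p.NextId, o.Position + 1
    FROM Product p
    JOIN Ordered o ON p.Id = o.NextId
)
SELECT Id, Position
FROM Ordered
ORDER BY Position;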

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows, but when we are getting >10-20k rows we get SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing it increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more detail:
We have a table with 2.6 million 4-digit alphanumeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5,000 capacity, a random set of 5,000 of these will be drawn from the table and issued to each customer as a bar-code; the bar-code scanning app at the door will then have the same list of 5,000. The reason for using a 4-digit alphanumeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend), bring the number along, and have it entered manually, so we don't want a large number of characters. Customers love that last bit, by the way.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
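One common form of that filter-style approach looks like the sketch below (the table and column names are assumptions, and the exact expression in the article differs slightly). It keeps each row with roughly a 1-in-26 probability, so it returns about 100k of the 2.6M rows rather than an exact count:
SELECT Code
FROM VenueCodes
WHERE ABS(CHECKSUM(NEWID())) % 26 = 0
-- no ORDER BY, so nothing has to be spooled to tempdb for sorting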
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,615 (1,679,616 is the number of distinct four-character alphanumeric codes, ignoring case, so your 2.6 million rows must contain duplicates) and convert them to your four-character codes.
You don't have to sort.
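A sketch of that number-to-code mapping, written for SQL Server 2005 (the variable names and the base-36 alphabet are assumptions):
-- Convert an integer in the range 0..1679615 into a 4-character code
-- using digits 0-9 and letters A-Z (base 36).
DECLARE @n int, @alphabet char(36), @code char(4)
SET @n = 123456                     -- the random number to encode
SET @alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
SET @code = SUBSTRING(@alphabet, (@n / 46656) % 36 + 1, 1)  -- 36^3
          + SUBSTRING(@alphabet, (@n / 1296)  % 36 + 1, 1)  -- 36^2
          + SUBSTRING(@alphabet, (@n / 36)    % 36 + 1, 1)  -- 36^1
          + SUBSTRING(@alphabet, @n % 36 + 1, 1)            -- 36^0
SELECT @code AS FourCharCode        -- '2N9C' for 123456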
DECLARE @RandomNumber int
DECLARE @Threshold float
SELECT @RandomNumber = COUNT(*) FROM customers
SELECT @Threshold = 50000.0 / @RandomNumber
SELECT TOP 50000 * FROM customers WHERE rand() > @Threshold ORDER BY newid()
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
One thought is to break the process down into steps. Add a GUID column to the table, then run an UPDATE statement to populate the GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,2,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,3,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,4,1))as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code will take all of your alphanumeric venue codes, convert them to an integer, and then split the entire table into 500,000 buckets, of which you are taking the top 50,000 that fall between 0 and 50,000. You can play with the number after the % sign (500,000) and you can play with the BETWEEN range. This should randomize it for you. Not sure if the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY, there is no guarantee of the order (if you have multiple CPUs and threading).

How do I get around the Sum(First(...)) not allowed limitation in SSRS 2005

The problem that I have is SQL Server Reporting Services does not like Sum(First()) notation. It will only allow either Sum() or First().
The Context
I am creating a reconciliation report, i.e. what stock we had at the start of a period, what was ordered, and what stock we had at the end.
Dataset returns something like
Type,Product,Customer,Stock at Start(SAS), Ordered Qty, Stock At End (SAE)
Export,1,1,100,5,90
Export,1,2,100,5,90
Domestic,2,1,200,10,150
Domestic,2,2,200,20,150
Domestic,2,3,200,30,150
I group by Type, then Product and list the customers that bought that product.
I want to display the total for SAS, Ordered Qty, and SAE but if I do a Sum on the SAS or SAE I get a value of 200 and 600 for Product 1 and 2 respectively when it should have been 100 and 200 respectively.
I thought that I could do a Sum(First()), but SSRS complains that I cannot have an aggregate within an aggregate.
Ideally SSRS needs a Sum(Distinct())
Solutions So Far
1. Don't show the Stock at Start and Stock at End as part of the totals.
2. Write some code directly in the report to do the calculation. Tried this one; it didn't work as I expected.
3. Write an assembly to do the calculation. (Have not tried this one.)
Edit - Problem clarification
The problem stems from the fact that this is actually two reports merged into one (as I see it): a production report and a sales report.
The report tries to address these criteria:
the market that we sold it to (export, domestic)
how much did we have in stock,
how much was produced,
how much was sold,
who did we sell it to,
how much do we have left over.
The complicating factor is the "who did we sell it to". Without that, it would have been relatively easy. But including it means that the other top-line figures (stock at start and stock at end) have nothing to do with what is sold, other than relating to the particular product.
I had a similar issue and ended up using ROW_NUMBER in my query to provide an integer for the row value, and then using SUM(IIF(myRowNumber = 1, myValue, 0)).
I'll edit this when I get to work and provide more data, but thought this might be enough to get you started. I'm curious about Adolf's solution too.
Pooh! Where's my peg?!
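A sketch of the ROW_NUMBER approach above, end to end. The table name is an assumption, the column names follow the example dataset in the question, and the SSRS expression is shown as a comment:
-- Number the customer rows within each product so the report can add
-- SAS/SAE only once per product.
SELECT [Type], Product, Customer, SAS, OrderedQty, SAE,
       ROW_NUMBER() OVER (PARTITION BY [Type], Product
                          ORDER BY Customer) AS RowNum
FROM StockReconciliation;

-- Then, in the SSRS group total cell, something along the lines of:
-- =Sum(IIF(Fields!RowNum.Value = 1, Fields!SAS.Value, 0))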
Have you thought about using windowing/ranking functions in the SQL for this?
This allows you to aggregate data without losing detail
e.g. imagine that, for a range of values, you want the Min and Max returned, but you also wish to return the original detail rows (with no summarising of the data).
Group  Value  Min  Max
A      3      2    9
A      7      2    9
A      9      2    9
A      2      2    9
B      5      5    7
B      7      5    7
C      etc.
The syntax looks odd, but it's just:
AggregateFunctionYouWant OVER (PARTITION BY WhatYouWantItGroupedBy ORDER BY WhatYouWantItOrderedBy) AS AggVal
Windowing
Ranking
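A sketch that produces the table above (the table name is an assumption):
-- Per-group Min/Max alongside every detail row, with no loss of detail.
SELECT [Group],
       Value,
       MIN(Value) OVER (PARTITION BY [Group]) AS [Min],
       MAX(Value) OVER (PARTITION BY [Group]) AS [Max]
FROM SampleValues;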
Your dataset is a little weird, but I think I understand where you're going.
Try making the dataset return in this order:
Type, Product, SAS, SAE, Customer, Ordered Qty
What I would do is create a report with a table control. I would set up Type, Product, and Customer as three separate groups, put the SAS and SAE data in the same group as the Product, and put the quantity in the Customer group. This should resemble what I believe you are trying to go for. Your SAS and SAE should be in a First().
Write a subquery.
Ideally SSRS needs a Sum(Distinct())
Re-write your query to do this correctly.
I suspect your problem is that you've written a query that gets you the wrong results, or you have poorly designed tables. Without knowing more about what you're trying to do, I can't tell you how to fix it, but it has a bad "smell".
