Joining poorly designed SQL tables?


I've tried searching for information on joining tables without foreign keys, but it seems the answer is always to create the foreign key. I cannot modify the tables in question to do this, and I must report on data that is already in production. The following is a portion of the data in the tables involved, to illustrate the issue.
Table A
Journal  Account  Debit   Credit  Sequence
--------------------------------------------------
87041    150-00   100.00  0.00    16384
87041    150-10   0.00    100.00  32768
87041    150-00   50.0    0.0     49152
87041    210-90   0.0     50.0    65536
Then the second table, tracking additional bits of information, is largely the same but missing the Sequence number that would tie the line items together properly. It has its own Sequence Number that is unrelated.
Table B
Journal  Account  Label    Artist    Sequence
--------------------------------------------------
87041    150-00   Label02  Artist12  1
87041    150-10   Label09  Artist03  2
87041    150-00   Label04  Artist01  3
87041    210-90   Label01  Artist05  4
At present the best I can come up with is to join on Journal and Account, but that duplicates records. I have gotten close by playing around with grouping and max() on the sequence number, but not all duplicates are removed for journal entries with a very large number of rows, and the first match from the second table is always displayed for lines that have the same account.
Closest - but bad - result
Journal  Account  Debit   Credit  Sequence  Label    Artist
----------------------------------------------------------------------
87041    150-00   100.00  0.00    16384     Label02  Artist12
87041    150-10   0.00    100.00  32768     Label09  Artist03
87041    150-00   50.0    0.0     49152     Label02  Artist12  <-- wrong
87041    210-90   0.0     50.0    65536     Label01  Artist05
How can I join the tables such that duplicates are excluded but also so that the correct Label and Artist are displayed? It sort of feels like I have to produce a query which knows that one of the records from Table B has already been used when the 49152 record from Table A comes looking for a match.
EDIT:
@Justin Crabtree A.Sequence will be the order in which the line items were entered. So a user could have entered the last line in the example first, then the first line, then the third, and finally the second.
@Edper Microsoft SQL Server...hmm, I cannot remote into the client's machine this morning...otherwise I would provide the version.
@Abe Miessler yes, you are correct.
As soon as I can get back into the server I will try your suggestion @pkuderov

Try this
;WITH a AS
(
    SELECT Journal,
           Account,
           Debit,
           Credit,
           Sequence,
           Id = ROW_NUMBER() OVER(PARTITION BY Journal ORDER BY Sequence)
    FROM dbo.tablea
)
, b AS
(
    SELECT Journal,
           Account,
           Label,
           Artist,
           Id = ROW_NUMBER() OVER(PARTITION BY Journal ORDER BY Sequence)
    FROM dbo.tableb
)
SELECT a.Journal,
       a.Account,
       a.Debit,
       a.Credit,
       a.Sequence,
       b.Label,
       b.Artist
FROM a
JOIN b ON b.Journal = a.Journal
      AND b.Account = a.Account
      AND b.Id = a.Id

Hi, that's just an idea:
select
    a.Journal, a.Account, a.Debit, a.Credit, a.Sequence, b.Label, b.Artist
from (
    select
        *,
        row_number() over(partition by Journal, Account order by Sequence) as idInGroup
    from a
) as a
join (
    select
        *,
        row_number() over(partition by Journal, Account order by Sequence) as idInGroup
    from b
) as b on
    a.Journal = b.Journal
    and a.Account = b.Account
    and a.idInGroup = b.idInGroup
Here I assume that the line items were entered in the same Sequence order in both tables; that assumption is the only thing that ties the rows together for the join.
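To make the pairing idea easier to check by hand, here is a minimal Python sketch of the same logic, using the sample rows from the question. The helper names (number_within_group, pair_rows) are made up for illustration, and it relies on the same assumption as the query: within each Journal/Account group, both tables list the lines in the same relative order.

```python
from collections import defaultdict

# Sample rows from the question: (Journal, Account, Debit, Credit, Sequence)
table_a = [
    (87041, "150-00", 100.00, 0.00, 16384),
    (87041, "150-10", 0.00, 100.00, 32768),
    (87041, "150-00", 50.00, 0.00, 49152),
    (87041, "210-90", 0.00, 50.00, 65536),
]
# Sample rows from the question: (Journal, Account, Label, Artist, Sequence)
table_b = [
    (87041, "150-00", "Label02", "Artist12", 1),
    (87041, "150-10", "Label09", "Artist03", 2),
    (87041, "150-00", "Label04", "Artist01", 3),
    (87041, "210-90", "Label01", "Artist05", 4),
]


def number_within_group(rows, seq_index):
    """Emulate ROW_NUMBER() OVER (PARTITION BY Journal, Account ORDER BY Sequence).

    Returns a dict keyed by (Journal, Account, row_number).
    """
    counters = defaultdict(int)
    numbered = {}
    for row in sorted(rows, key=lambda r: r[seq_index]):
        group = (row[0], row[1])
        counters[group] += 1
        numbered[group + (counters[group],)] = row
    return numbered


def pair_rows(a_rows, b_rows):
    """Inner-join A and B on (Journal, Account, row number within the group)."""
    a_numbered = number_within_group(a_rows, seq_index=4)
    b_numbered = number_within_group(b_rows, seq_index=4)
    result = []
    for key, a in a_numbered.items():
        b = b_numbered.get(key)
        if b is not None:  # rows of A with no partner in B are dropped
            result.append(a + (b[2], b[3]))
    return sorted(result, key=lambda r: r[4])  # back to A's Sequence order


joined = pair_rows(table_a, table_b)
for row in joined:
    print(row)
```

With the sample data, the second 150-00 line (Sequence 49152) is paired with Label04/Artist01 rather than reusing the first match. If A has more rows than B for some Journal/Account, the extra A rows simply disappear here, which is the same caveat an inner join has.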

If you ordered the rows of the 2 tables by their own sequence numbers, would the rows align in the same order?
If so, this is a possible solution for SQL Server:
You can create 2 CTEs, one for each table, each with a ROW_NUMBER column. That way, both tables will have a matching row number column that you can use to join. Let me know if you need an example.

If I'm reading your requirements correctly and you want all rows from Table A, but only the first matching row from Table B, your best bet would be to do an OUTER APPLY with a TOP(1). That would look something like this:
select *
from TableA
OUTER APPLY
    (select TOP(1) Journal, Account, Label, Artist, Sequence
     FROM TableB
     WHERE Journal = TableA.Journal AND Account = TableA.Account
     ORDER BY Sequence) as B
(Definitely pseudo-code, but that should be somewhat close.)
If it comes down to it, you could use ROW_NUMBER(), partition that by Journal and Account and then match on those Row_Number values for each result set. You'd generate one sub-query/CTE for TableA and another CTE for TableB - each with a RowNumber value that would be essentially a new sequence integer. The first row in TableA would match the first row in TableB, Second row in TableA would match the second in TableB, etc. Of course, you'd run into some issues if there are more rows for Journal/Account in "A" than there are in "B".
A better question might be - "How does your code determine all matches between TableA and TableB if they can't use any data columns to tie them together?"


SQL Server - deducting from available credit

I have an invoicing solution that uses Azure SQL to store and calculate invoice data. I have been asked to provide 'credit' functionality so that, rather than recovering customers' charges, the totals are deducted from an amount of available credit and reflected in the invoice (solution xyz may have 1,500 worth of charges, but deducting that from an available credit of 10,000 means it is effectively zeroed and leaves 8,500 credit remaining). Unfortunately, after several days I haven't been able to work out how to do this.
I am able to get a list of items and their costs from sql easily:
invoice_id  contact_id  solution_id  total     date
----------------------------------------------------------
202104-015  52          10000        30317.27  2021-05-22
202104-015  52          10001        2399.90   2021-05-22
202104-015  52          10005        8302.27   2021-05-22
202104-015  52          10060        3625.22   2021-05-22
202104-015  52          10111        22.87     2021-05-22
202104-015  52          10115        435.99    2021-05-22
I have another table that shows the credit available for the given contact:
id  credit_id  owner_id  total_applied  date_applied
----------------------------------------------------------
1   C00001     52        500000.00      2021-05-14
I have tried using the following SQL statement, based on another stackoverflow question to subtract from the previous row, thinking each row would then reflect the remaining credit:
Select
    invoice_id,
    solution_id,
    sum(total) as 'total',
    cr.total_remaining - coalesce(lag(total) over (order by s.solution_id), 0) as credit_available,
    date
from
    invoices s
    left join credits cr on
        cr.credit_id = 'C00001'
Whilst this does subtract, it only subtracts from the row above it, not all of the rows above it:
invoice_id  solution_id  total     credit_available  date
----------------------------------------------------------------
202104-015  10000        30317.27  500000.00         2021-05-22
202104-015  10001        2399.90   469682.73         2021-05-22
202104-015  10005        8302.27   497600.10         2021-05-22
202104-015  10060        3625.22   491697.73         2021-05-22
202104-015  10111        22.87     496374.78         2021-05-22
202104-015  10115        435.99    499977.13         2021-05-22
I've also tried various queries with a mess of case statements.
I'm at the point where I am contemplating using PowerShell or similar to do the task instead (loop through each solution, check if there is enough available credit, update a deduction table, go to the next, etc.), but I'd rather keep it all in SQL if I can.
Anyone have some pointers for this beginner?

You don't need window functions; use a sub-query that sums the totals of the previous invoice lines. Be sure to index the table correctly so that performance is not a problem.
There are two sub-queries, one for the previous total sum and another to get the date of the next credit for contact_id.
SELECT [inv].[invoice_id],
       [inv].[solution_id],
       [inv].[total],
       -- subquery that sums the previous totals
       [cr].[total_applied] - COALESCE((
           SELECT SUM([inv_inner].[total])
           FROM [dbo].[invoices] AS [inv_inner]
           WHERE [inv_inner].[solution_id] < [inv].[solution_id]
       ), 0) AS [credit_available],
       [inv].[date]
FROM [dbo].[invoices] [inv]
LEFT JOIN [dbo].[credits] [cr]
    ON [cr].[owner_id] = [inv].[contact_id]
    -- here, we make sure that the credit is available for the correct period
    -- invoice date >= credit date_applied
    AND [inv].[date] >= [cr].[date_applied]
    -- and invoice date < next date_applied or tomorrow, in case there are no next date_applied
    AND [inv].[date] < COALESCE((
        SELECT MIN([cr2].[date_applied])
        FROM [dbo].[credits] [cr2]
        WHERE [cr2].[owner_id] = [cr].[owner_id]
        AND [cr2].[date_applied] > [cr].[date_applied]
    ), GETDATE()+1)
    AND [cr].[credit_id] = 'C00001';
This query works, but it is for this question only. Please study it and adapt to your real world problem.
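For anyone who wants to sanity-check the numbers, the arithmetic behind "credit minus the sum of all previous lines" can be sketched in a few lines of Python. This is only an illustration with the sample data from the question; the function name credit_before_each_line is made up, and it assumes a single credit record and that lines are ordered by solution_id, as in the query above.

```python
from decimal import Decimal
from itertools import accumulate

# Single credit record from the question (total_applied for C00001)
credit_total = Decimal("500000.00")

# (solution_id, total) for invoice 202104-015, copied from the question
invoice_lines = [
    (10000, Decimal("30317.27")),
    (10001, Decimal("2399.90")),
    (10005, Decimal("8302.27")),
    (10060, Decimal("3625.22")),
    (10111, Decimal("22.87")),
    (10115, Decimal("435.99")),
]


def credit_before_each_line(lines, credit):
    """Return (solution_id, total, credit_available) per line.

    credit_available is the credit minus the sum of all earlier lines,
    so the first line still shows the full credit.
    """
    ordered = sorted(lines)  # order by solution_id
    totals = [total for _, total in ordered]
    # Running sum of the previous totals: 0, t1, t1 + t2, ...
    previous_sums = [Decimal("0")] + list(accumulate(totals))[:-1]
    return [
        (solution_id, total, credit - previous)
        for (solution_id, total), previous in zip(ordered, previous_sums)
    ]


results = credit_before_each_line(invoice_lines, credit_total)
for solution_id, total, available in results:
    print(solution_id, total, available)
```

The third line now shows 467282.83 instead of the 497600.10 produced by the LAG attempt, because every earlier line is deducted rather than just the one directly above. On SQL Server 2012 or later, a windowed SUM(total) OVER (ORDER BY solution_id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) should express the same running sum, if you do prefer window functions.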

This is a pretty complex scenario. Sadly, I cannot spend the time to offer a complete solution here, but I can provide some tips and points of attention:
Be sure to determine the actual remaining credit based on the complete invoice history. If you introduce filtering (in a WHERE-clause, for example, or by including joins with other tables), the results should not be affected by it. You should probably pre-calculate the available credit per invoice detail record in a temporary table or in a CTE and use that data in your main query.
Make sure that you regard the date_applied value of the credit. Before a credit is applied to a customer, that customer should probably have less credit or no credit at all. That should be reflected correctly on historical invoices, I guess.
Make sure you determine the correct amount of total credit. It is unclear from the information provided in your question how that should be determined/calculated. Is only the latest total_applied value from the credits table active? Or should all the historical total_applied values be summed to get the total available credit?
Include a correct join between your invoices table and your credits table. Currently, this join is hard coded in your query.
Also regard actual payments by customers. Payments have effect on the available credit, I assume. Also note that, unless you are OK with a history that changes, you need to regard the payment dates as well (just like the credit change dates).
I'm not sure how you would solve your scenario using PowerShell... I do know for sure, that this can be tackled with SQL.
I cannot say anything about the resulting performance, however. These kinds of calculations surely come with a price tag attached in that regard. If you need high performance, I guess it might be more practical to include columns in your invoices table to physically store the available credit with each invoice detail record.
Edit
I have experimented a little with your scenario and your additional comments.
My solution implementation uses two CTEs:
The first CTE (cte_invoice_credit_dates) retrieves the date of the active credit record for specific invoice IDs.
The second CTE (cte_contact_invoice_summarized_totals) calculates the invoice totals of all the invoices of a specific contact. Since you want to summarize on solution detail per invoice as well, I also included the solution ID per invoice in the querying logic.
The main query selects all columns from the invoices table and uses the data from the two CTEs to calculate three additional columns in the result set:
Column credit_assigned represents the total assigned credit at the invoice's date.
Column summarized_total shows the contact's cumulative invoice total.
Column credit_available shows the remaining credit.
WITH
    [cte_invoice_credit_dates] AS (
        SELECT DISTINCT
            I.[invoice_id],
            C.[date_applied]
        FROM
            [invoices] AS I
            OUTER APPLY (SELECT TOP (1) [date_applied]
                         FROM [credits]
                         WHERE
                             [owner_id] = I.[contact_id] AND
                             [date_applied] <= I.[date]
                         ORDER BY [date_applied] DESC) AS C
    ),
    [cte_contact_invoice_summarized_totals] AS (
        SELECT
            I.[contact_id],
            I.[invoice_id],
            I.[solution_id],
            SUM(H.[total]) AS [total]
        FROM
            [invoices] AS I
            INNER JOIN [invoices] AS H ON
                H.[contact_id] = I.[contact_id] AND
                H.[invoice_id] = I.[invoice_id] AND
                H.[solution_id] <= I.[solution_id] AND
                H.[date] <= I.[date]
        GROUP BY
            I.[contact_id],
            I.[invoice_id],
            I.[solution_id]
    )
SELECT
    I.[invoice_id],
    I.[contact_id],
    I.[solution_id],
    I.[total],
    I.[date],
    COALESCE(C.[total_applied], 0) AS [credit_assigned],
    H.[total] AS [summarized_total],
    COALESCE(C.[total_applied] - H.[total], 0) AS [credit_available]
FROM
    [invoices] AS I
    INNER JOIN [cte_contact_invoice_summarized_totals] AS H ON
        H.[contact_id] = I.[contact_id] AND
        H.[invoice_id] = I.[invoice_id] AND
        H.[solution_id] = I.[solution_id]
    LEFT JOIN [cte_invoice_credit_dates] AS CD ON
        CD.[invoice_id] = I.[invoice_id]
    LEFT JOIN [credits] AS C ON
        C.[owner_id] = I.[contact_id] AND
        C.[date_applied] = CD.[date_applied]
ORDER BY
    I.[invoice_id],
    I.[solution_id];

Conditionally Change String Name

I have a large data source that's automatically uploaded in a SQL Server Table so I am unable to manually change the data. Every now and then there are records that are mislabeled. 98% of the dataset contains unique Patient_fins; however, for patients that have been to both locations (ED and EDU), Patient_fin are duplicated, which is fine. For example,
Patient_fin CHECKIN_DATE_TIME TRACKING_GROUP
1 2018-01-01 01:37:00 EDU
1 2018-01-01 04:37:00 ED
I'm running into issues when the patient's tracking group is not correctly labeled (both labels are the same when the CHECKIN_DATE_TIMEs are different). For example, I can tell from the CHECKIN_DATE_TIME that the patient has been to two different locations, ED and EDU, yet the tracking group is the same. For the second row for Patient_fin 1, the tracking group should read 'ED'.
Patient_fin CHECKIN_DATE_TIME TRACKING_GROUP
1 2018-01-01 01:37:00 EDU
1 2018-01-01 04:37:00 EDU
For instances where the TRACKING_GROUP is incorrect, is there a way in SQL to recode the record with the later CHECKIN_DATE_TIME so that the TRACKING_GROUP reads ED? A priori knowledge tells me the later CHECKIN_DATE_TIME will always be associated with ED and not EDU.

If there will only ever be two records with the same Patient_fin, the update below works. It does not account for the first record already being ED, though; in that case you would be left with two records having TRACKING_GROUP = ED:
--This will do pretty much what Sean Lange described except instead of a cte, it uses
--a subquery to get the records with a row number, partitioned by the Patient_fin
--and ordered by CHECKIN_DATE_TIME so that row 2 is the later check-in.
--It then joins this on the table by Patient_fin and CHECKIN_DATE_TIME and updates the second record for a Patient_fin
UPDATE st
SET TRACKING_GROUP = 'ED'
FROM dbo.SomeTable AS st
INNER JOIN
(
    SELECT Patient_fin, CHECKIN_DATE_TIME, ROW_NUMBER() OVER(PARTITION BY Patient_fin ORDER BY CHECKIN_DATE_TIME) AS [RowNum]
    FROM dbo.SomeTable
) AS x
    ON x.CHECKIN_DATE_TIME = st.CHECKIN_DATE_TIME AND x.Patient_fin = st.Patient_fin
WHERE x.RowNum = 2
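If it helps to see the rule outside of SQL, here is a small Python sketch of the same idea: group by Patient_fin, order each group by check-in time, and relabel the second record. The third sample row (Patient_fin 2) is invented for illustration, and relabel_second_visit is a hypothetical helper name. Like the update above, it assumes at most two records per patient and that the later one is always ED.

```python
from collections import defaultdict
from datetime import datetime

visits = [
    {"Patient_fin": 1, "CHECKIN_DATE_TIME": datetime(2018, 1, 1, 1, 37), "TRACKING_GROUP": "EDU"},
    {"Patient_fin": 1, "CHECKIN_DATE_TIME": datetime(2018, 1, 1, 4, 37), "TRACKING_GROUP": "EDU"},
    # Hypothetical patient with a single visit; it must stay untouched.
    {"Patient_fin": 2, "CHECKIN_DATE_TIME": datetime(2018, 1, 2, 9, 0), "TRACKING_GROUP": "EDU"},
]


def relabel_second_visit(rows):
    """Set TRACKING_GROUP to 'ED' on each patient's second check-in (in place)."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["Patient_fin"]].append(row)
    for patient_rows in by_patient.values():
        # Equivalent of ROW_NUMBER() ... ORDER BY CHECKIN_DATE_TIME
        patient_rows.sort(key=lambda r: r["CHECKIN_DATE_TIME"])
        if len(patient_rows) >= 2:
            patient_rows[1]["TRACKING_GROUP"] = "ED"  # row number 2
    return rows


relabel_second_visit(visits)
for visit in visits:
    print(visit["Patient_fin"], visit["CHECKIN_DATE_TIME"], visit["TRACKING_GROUP"])
```

Patients with only one record are left alone, which is why ordering by the check-in time (rather than by Patient_fin) matters: it is the only thing that makes "row 2" mean "the later visit".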

Measure all distances return shortest value

I have one table with a list of stores, approximately 100 or so with lat/long. The second table I have a list of customers, with lat/long and has more than 500k.
I need to find the closest store to each customer. Currently I am using the geography data type with the STDistance function to calculate the distance between two points. This is functioning fine, but I am getting hung up on the most efficient ways to process this.
Option #1 - Cartesian join Customer_table to Store_table, process the distance calculation, rank the results and filter to #1. My concern is that with a 1 million row customer list and 100 stores, you are creating a 100 million row table, and the rank function thereafter may be taxing.
Option #2 - With some dynamic sql, create a pivoted table that has each customer in the first column, and each subsequent column has the calculated distance to each branch. From there, I can unpivot and then do the same rank/over function described in the first.
EXAMPLE
CUST_ID LAT LONG STORE1DIST STORE2DIST STORE3DIST
1 20.00 30.00 4.5 5.6 7.8
2 20.00 30.00 7.4 8.1 8.5
I'm not clear which would be the most efficient and would keep the DBAs from wanting to come find me.
Thanks for the input in advance!

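You can unpivot the data into one row per store distance, then rank the rows for each customer by distance and keep the first one. (A plain GROUP BY with MIN(STOREDIST) gives the right distance, but MIN(STORES) alongside it would return the alphabetically first column name rather than the name of the nearest store.)
select CUST_ID, STOREDIST as StoreDistance, STORES as StoreName
from
(
    select CUST_ID, STOREDIST, STORES,
           row_number() over (partition by CUST_ID order by STOREDIST) as rn
    from
    (select CUST_ID, LAT, LONG, STORE1DIST, STORE2DIST, STORE3DIST from Cus/*Your table*/) p
    UNPIVOT
    (
        STOREDIST FOR STORES IN (STORE1DIST, STORE2DIST, STORE3DIST)
    ) as unpvt
) ranked
where rn = 1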
This will give you the result as:
CUST_ID StoreDistance StoreName
-----------------------------------
1 4.5 STORE1DIST
2 7.4 STORE1DIST

I have a similar situation on my job. I use a distance function like this (returns kms, use 3960* to return miles):
CREATE FUNCTION dbo.MySTDistance(@lat1 float, @lon1 float, @lat2 float, @lon2 float)
RETURNS smallmoney
AS
BEGIN
    -- Spherical law of cosines; 6373 is the Earth radius in km
    RETURN IsNull(6373*acos((sin(radians(@lat1))*sin(radians(@lat2)))
        +(cos(radians(@lat1))*cos(radians(@lat2))*cos(radians(@lon1-@lon2)))),0)
END
then you look for the closest store by doing something like...
select C.Cust_Id
,Store_id=
(select top (1) Store_id
from Store_Table S
order by dbo.MySTDistance(S.lat, S.long, C.lat, C.long)
)
from Customer_Table C
Now you have each customer id with his closest store id. It's quite fast with a huge volume of customers (at least in my case).
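As a rough cross-check of the formula and the "top 1 ordered by distance" lookup, here is the same approach sketched in Python. The store and customer coordinates are invented for illustration, and nearest_store is a hypothetical helper; the distance function uses the same spherical law of cosines and the same 6373 km radius as the SQL function, plus a clamp so that rounding never pushes the cosine outside the range acos accepts.

```python
import math

EARTH_RADIUS_KM = 6373.0  # same constant as the SQL function (use 3960 for miles)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon1 - lon2)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Floating-point rounding can give values slightly above 1 for identical points.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_KM * math.acos(cos_angle)


# Hypothetical data: store_id -> (lat, long), cust_id -> (lat, long)
stores = {"S1": (40.0, -75.0), "S2": (34.0, -118.0)}
customers = {1: (39.9, -75.2), 2: (33.9, -118.1)}


def nearest_store(customer_pos, store_positions):
    """Equivalent of SELECT TOP (1) Store_id ... ORDER BY distance."""
    return min(
        store_positions,
        key=lambda store_id: distance_km(*customer_pos, *store_positions[store_id]),
    )


closest = {cust_id: nearest_store(pos, stores) for cust_id, pos in customers.items()}
print(closest)
```

This still evaluates every customer against every store, just like the correlated subquery, but it never materialises the full customer-by-store table; only the best store per customer is kept.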

MSAccess/SQL lookup table for match field based on sum of current table.field

I've been battling this for the last week with many attempted solutions. I want to return the unique names in a table with the sum of their points and their current dance level based on that sum. Ultimately I want to compare the returned dance level with what is stored against the customer in the Customer table, and show only the records where the two dance levels differ (the stored dance level and the calculated dance level based on the current sum of the points).
The final solution will be a web page using ADODB connection to MSAccess DB (2013). But for starters just want it to work in MSAccess.
I have a MSAccess DB (2013) with the following tables.
PointsAllocation
CustomerID Points
100 2
101 1
102 1
100 1
101 4
DanceLevel
DLevel Threshold
Beginner 2
Intermediate 4
Advanced 6
Customer
CID Firstname Dancelevel1
100 Bob Beginner
101 Mary Beginner
102 Jacqui Beginner
I want to find the current DLevel for each customer by using the SUM of their Points in the first table. I have this first...
SELECT SUM(Points), CustomerID FROM PointsAllocation GROUP BY CustomerID
Works well and gives me total points per customer. I can then INNER JOIN this to the customer table to get the persons name. Perfect.
Now I want to add the DLevel from the DanceLevel table to the results where the SUM total is used to lookup the Threshold and not exceed the value so I get the following:
(1) (2) (3) (4)
Bob 3 Beginner Intermediate
Mary 5 Beginner Advanced
Where...
(1) Customer.Firstname
(2) SUM(PointsAllocation.Points)
(3) Customer.Dancelevel1
(4) Dancelevel.DLevel
Jacqui is not shown, as her SUM of Points is less than or equal to 2, giving her a calculated dance level of Beginner, and this already matches her Dancelevel1 in the Customer table.
Any ideas anyone?

You can start from the customer table because you want to list every customer. Then left join it with a subquery that calculates the dance levels and point totals. The innermost subquery totals the points; the next level joins on the valid dance levels (those whose threshold is at least the point total) and selects the minimum threshold among them. Then left join on the DanceLevel table again on the threshold value to get the level's description.
Select Customer.Firstname,
CustomerDanceLevels.Points,
Customer.Dancelevel1,
Dancelevel.DLevel
from Customer
left join
(select CustomerID, Points, Min(Threshold) Threshold
from
(select CustomerID, sum(Points) Points
from PointsAllocation
group by CustomerID
) PointsTotal
left join DanceLevel
on PointsTotal.Points <= DanceLevel.Threshold
group by CustomerID, Points
) CustomerDanceLevels
on Customer.CID = CustomerDanceLevels.CustomerID
left join DanceLevel
on CustomerDanceLevels.Threshold = DanceLevel.Threshold
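The lookup rule (take the level with the smallest threshold that is at least the point total) can be checked against the sample data with a short Python sketch. The function names are made up for illustration, and the sketch also applies the "only show mismatches" filter from the question, which the query above leaves out. What should happen above the highest threshold is not stated in the question, so the sketch returns None there.

```python
from bisect import bisect_left
from collections import Counter

# Sample data from the question
points_allocation = [(100, 2), (101, 1), (102, 1), (100, 1), (101, 4)]
dance_levels = [("Beginner", 2), ("Intermediate", 4), ("Advanced", 6)]  # sorted by threshold
customers = {
    100: ("Bob", "Beginner"),
    101: ("Mary", "Beginner"),
    102: ("Jacqui", "Beginner"),
}


def calculated_level(total_points):
    """Level with the smallest threshold that is >= total_points."""
    thresholds = [threshold for _, threshold in dance_levels]
    index = bisect_left(thresholds, total_points)
    if index == len(dance_levels):
        return None  # above the highest threshold; behaviour not specified in the question
    return dance_levels[index][0]


def level_mismatches():
    """Rows of (Firstname, total points, stored level, calculated level) that differ."""
    totals = Counter()
    for customer_id, points in points_allocation:
        totals[customer_id] += points
    rows = []
    for customer_id, (firstname, stored_level) in customers.items():
        level = calculated_level(totals[customer_id])
        if level != stored_level:
            rows.append((firstname, totals[customer_id], stored_level, level))
    return rows


mismatches = level_mismatches()
for row in mismatches:
    print(row)
```

In SQL terms, the extra filter would be a WHERE clause comparing Customer.Dancelevel1 with the looked-up DLevel.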

Removing Duplicate Records from table based on a column [duplicate]

I created a table with multiple inner joins from 4 tables, but the results bring back duplicate records. Here is the code that I am using:
SELECT tblLoadStop.LoadID,
tblCustomer.CustomerID,
tblLoadMaster.BillingID,
tblLoadMaster.LoadID,
tblLoadMaster.PayBetween1,
LoadStopID,
tblLoadMaster.Paybetween2,
tblStopLocation.StopLocationID,
tblStopLocation.city,
tblStopLocation.state,
tblStopLocation.zipcode,
tblLoadSpecifications.LoadID,
tblLoadSpecifications.LoadSpecificationID,
Picks,
Stops,
Typeofshipment,
Weight,
LoadSpecClass,
Miles,
CommodityList,
OriginationCity,
OriginationState,
DestinationCity,
DestinationState,
LoadRate,
Status,
CompanyName,
Customerflag,
tblCustomer.CustomerID,
tblCustomer.AddressLine1,
tblCustomer.City,
tblCustomer.State,
tblCustomer.Zipcode,
CompanyPhoneNumber,
CompanyFaxNumber,
SCAC,
tblLoadMaster.Salesperson,
Change,
StopType
FROM tblLoadMaster
INNER JOIN tblLoadSpecifications
ON tblLoadSpecifications.LoadID = tblLoadMaster.LoadID
INNER JOIN tblLoadStop
ON tblLoadStop.LoadID = tblLoadMaster.LoadID
INNER JOIN tblStopLocation
ON tblStopLocation.StopLocationID = tblLoadStop.StopLocationID
INNER JOIN tblCustomer
ON tblCustomer.CustomerID = tblLoadMaster.CustomerID
WHERE tblLoadMaster.Phase LIKE '%2%'
ORDER BY tblLoadMaster.LoadID DESC;
This is the result that I get
Load ID Customer Salesperson Origin Destination Rate
-------------------------------------------------------------------------
13356 FedEx Alex Duluth New York 300
13356 FedEx Steve Florida Kansas 400
I only want the first row to show,
13356 FedEx Alex Duluth New York 300
and remove the bottom row,
13356 FedEx Steve Florida Kansas 400
The tblLoadStop table has the duplicate record, with a LoadID duplicated from the tblLoadMaster table.

One approach would be to use a CTE (Common Table Expression) if you're on SQL Server 2005 and newer (you aren't specific enough in that regard).
With this CTE, you can partition your data by some criteria - i.e. your LoadID - and have SQL Server number all your rows starting at 1 for each of those "partitions", ordered by some criteria (you're not very clear on how you decide which row to keep and which to ignore in your question).
So try something like this:
;WITH CTE AS
(
SELECT
LoadID, Customer, Salesperson, Origin, Destination, Rate,
RowNum = ROW_NUMBER() OVER(PARTITION BY LoadID ORDER BY tblLoadstopID ASC)
FROM
dbo.tblLoadMaster lm
......
WHERE
lm.Phase LIKE '%2%'
)
SELECT
LoadID, Customer, Salesperson, Origin, Destination, Rate
FROM
CTE
WHERE
RowNum = 1
Here, I am selecting only the "first" entry for each "partition" (i.e. for each LoadId) - ordered by some criteria (updated: order by tblLoadstopID - as you mentioned) you need to define in your CTE.
Does that approach do what you're looking for?
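To illustrate what "RowNum = 1 per partition" does, here is a minimal Python sketch using the two result rows from the question. The LoadStopID values are invented (the question does not show them), and first_row_per_load is a hypothetical helper name; it keeps the row with the lowest LoadStopID for each LoadID.

```python
# Result rows from the question; the LoadStopID values are assumptions for illustration.
rows = [
    {"LoadID": 13356, "LoadStopID": 2, "Customer": "FedEx", "Salesperson": "Steve",
     "Origin": "Florida", "Destination": "Kansas", "Rate": 400},
    {"LoadID": 13356, "LoadStopID": 1, "Customer": "FedEx", "Salesperson": "Alex",
     "Origin": "Duluth", "Destination": "New York", "Rate": 300},
]


def first_row_per_load(all_rows):
    """Keep one row per LoadID: the one with the lowest LoadStopID.

    Mirrors ROW_NUMBER() OVER (PARTITION BY LoadID ORDER BY LoadStopID) = 1.
    """
    first = {}
    for row in sorted(all_rows, key=lambda r: (r["LoadID"], r["LoadStopID"])):
        first.setdefault(row["LoadID"], row)  # only the first row per LoadID is stored
    return list(first.values())


deduplicated = first_row_per_load(rows)
for row in deduplicated:
    print(row["LoadID"], row["Customer"], row["Salesperson"], row["Origin"],
          row["Destination"], row["Rate"])
```

Which row counts as "first" depends entirely on the ordering column, so pick one that reflects the row you actually want to keep.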