SQL - aggregate related accounts - maually set up ID - sql-server

I have two tables:
Account & Amount column
list of related accounts
Data samples:
Account | Amount
--------+---------
001 | $100
002 | $150
003 | $200
004 | $300
Account | Related Account
--------+------------------
001 | 002
002 | 003
003 | 002
My goal is to be able to aggregate all related accounts. From table two - 001,002 & 003 are actually all related to each other. What I would like to be able to do is to get a sum of all related accounts. Possibly ID 001 to 003 as Account #1, so I can aggregate them.
Result below
ID | Account | Amount
-----+-----------+--------
#1 | 001 | $100
#1 | 002 | $150
#1 | 003 | $200
#2 | 004 | $300
I can then manipulate the above table as below (final result)
ID | Amount
-----+--------
#1 | $450
#2 | $300
I tried doing a join, but it doesn't quite achieve what I want. I still have a problem relating account 001 with 003 (they are indirectly related because 002 is related with both 001 and 003.
If anyone can point me to the right direction, will be much appreciated.

Well, you really made this harder then it should be.
If you could change the data in the second table, so it will not contain reversed duplicates (in your sample data - 2,3 and 3,2) it would simplify the solution.
If you could refactor both tables into a single table, where the related column is a self referencing nullable foreign key, it would simplify the solution even more.
Let's assume for a minute you can't do either, and you have to work with the data as provided. So the first thing you want to do is to ignore the reversed duplicates in the second table. This can be done using a common table expression and a couple of case expressions.
First, create and populate sample tables (Please save us this step in your future questions):
DECLARE #TAccount AS TABLE
(
Account int,
Amount int
)
INSERT INTO #TAccount (Account, Amount) VALUES
(1, 100),
(2, 150),
(3, 200),
(4, 300)
DECLARE #TRelatedAccounts AS TABLE
(
Account int,
Related int
)
INSERT INTO #TRelatedAccounts (Account, Related) VALUES
(1,2),
(2,3),
(3,2)
You want to get only the first two records from the #TRelatedAccounts table.
This is the AccountAndRelated CTE.
Now, you want to left join the #TAccount table with the results of this query, so for each Account we will have the Account, the Amount, and the Related Account or NULL, if the account is not related to any other account or it's the first on the relationship chain.
This is the CTERecursiveBase CTE.
Then, based on that you can create a recursive CTE (called CTERecursive), and finally select the sum of amount from the recursive CTE based on the root of the recursion.
Here is the entire script:
;WITH AccountAndRelated AS
(
SELECT DISTINCT CASE WHEN Account > Related THEN Account Else Related END As Account,
CASE WHEN Account > Related THEN Related Else Account END As Related
FROM #TRelatedAccounts
)
, CTERecursiveBase AS
(
SELECT A.Account, Related, Amount
FROM #TAccount As A
LEFT JOIN AccountAndRelated As R ON A.Account = R.Account
)
, CTERecursive AS
(
SELECT Account As Id, Account, Related, Amount
FROM CTERecursiveBase
WHERE Related IS NULL
UNION ALL
SELECT Id, B.Account, B.Related, B.Amount
FROM CTERecursiveBase AS B
JOIN CTERecursive AS R ON B.Related = R.Account
)
SELECT Id, SUM(Amount) As TotalAmount
FROM CTERecursive
GROUP BY Id
Results:
Id TotalAmount
1 450
4 300
You can see a live demo on rextester.
Now, Let's assume you can modify the data of the second table. You can use the AccountAndRelated cte to get only the records you need to keep in the #TRelatedAccounts table - This means you can skip the AccountAndRelated cte and use the #TRelatedAccounts directly in the CTERecursiveBase cte.
You can see a live demo of that as well.
Finally, let's assume you can refactor your database. In that case, I would recommend joining the two tables together - so your #TAccount table would look like this:
Account Amount Related
1 100 NULL
2 150 1
3 200 2
4 300 NULL
Then you only need the recursive cte.
Here is a live demo of that option as well.

Related

How to add data to a single column

I have a question in regards to adding data to a particular column of a table, i had a post yesterday where a user guided me (thanks for that) to what i needed and said an update was the way to go for what i need, but i still can't achieve my goal.
i have two tables, the tables where the information will be added from and the table where the information will be added to, here is an example:
source_table (has only a column called "name_expedient_reviser" that is nvarchar(50))
name_expedient_reviser
kim
randy
phil
cathy
josh
etc.
on the other hand i have the destination table, this one has two columns, one with the ids and the other where the names will be inserted, this column values are null, there are some ids that are going to be used for this.
this is how the other table looks like
dbo_expedient_reviser (has 2 columns, unique_reviser_code numeric PK NOT AI, and name_expedient_reviser who are the users who check expedients this one is set as nvarchar(50)) also this is the way this table is now:
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | NULL
2 | NULL
3 | NULL
4 | NULL
5 | NULL
6 | NULL
what i need is the information of the source_table to be inserted into the row name_expedient_reviser, so the result should look like this
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | kim
2 | randy
3 | phil
4 | cathy
5 | josh
6 | etc.
how can i pass the information into this table? what do i have to do?.
EDIT
the query i saw that should have worked doesn't update which is this one:
UPDATE dbo_expedient_reviser
SET dbo_expedient_reviser.name_expedient_reviser = source_table.name_expedient_reviser
FROM source_table
JOIN dbo_expedient_reviser ON source_table.name_expedient_reviser = dbo_expedient_reviser.name_expedient_reviser
WHERE dbo_expedient_reviser.name_expedient_reviser IS NULL
the query was supposed to update the information into the table, extracting it from the source_table as long as the row name_expedient_reviser is null which it is but is doesn't work.
Since the Names do not have an Id associated with them I would just use ROW_NUMBER and join on ROW_NUMBER = unique_reviser_code. The only problem is, knowing what rows are null. From what I see, they all appear null. In your data, is this the case or are there names sporadically in the table like 5,17,29...etc? If the name_expedient_reviser is empty in dbo_expedient_reviser you could also truncate the table and insert values directly. Hopefully that unique_reviser_code isn't already linked to other things.
WITH CTE (name_expedient_reviser, unique_reviser_code)
AS
(
SELECT name_expedient_reviser
,ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
FROM source_table
)
UPDATE er
SET er.name_expedient_reviser = cte.name_expedient_reviser
FROM dbo_expedient_reviser er
JOIN CTE on cte.unique_reviser_code = er.unique_reviser_code
Or Truncate:
Truncate Table dbo_expedient_reviser
INSERT INTO dbo_expedient_reviser (name_expedient_reviser, unique_reviser_code)
SELECT DISTINCT
unique_reviser_code = ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
,name_expedient_reviser
FROM source_table
it is not posible to INSERT the data into a single column, but to UPDATE and move the data you want is the only way to go in that cases

SQL Server Insert Using View

This is a simplified example of what I want to do. Assume there is table named contractor that looks like this:
name | paid_adjustment_amount | adj_date
Bob | 1000 | 4/7/2016
Mary | 2000 | 4/8/2016
Bill | 5000 | 4/8/2016
Mary | 4000 | 4/10/2016
Bill | (1000) | 4/12/2016
Ann | 3000 | 4/30/2016
There is a view of the contractor table, let's call it v_sum, that is just a SUM of the paid_adustment_amount grouped by name. So it looks like this:
name | total_paid_amount
Bob | 1000
Mary | 6000
Bill | 4000
Ann | 3000
Finally, there is another table called to_date_payment that looks like this:
name | paid_to_date_amount
Bob | 1000
Mary | 8000
Bill | 3000
Ann | 3000
Joe | 4000
I want to compare the information in the to_date_payment table to the v_sum view and insert a new row in the contractor table to show an adjustment. Something like this:
INSERT INTO contractor
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount - v_sum.total_paid_amount,
GETDATE()
FROM to_date_payment
LEFT JOIN v_sum ON to_date_payment.name = v_sum.name
WHERE to_date_payment.paid_to_date_amount - v_sum.total_paid_amount <> 0
OR v_sum.name IS NULL
Are there any issues with using a view for this? My understanding, please correct me if I'm wrong, is that a view is just a result set of a query. And, since the view is of the table I'm inserting new records into, I'm afraid there could be data integrity problems.
Thanks for the help!
In order to fully understand what you are doing, you should also provide the definition for v_sum. Generally speaking, views might provide some advantages, especially when indexed. More details can be found here and here.
Simple usage of views do not provide performance benefits, but they are very good of providing abstraction over tables.
In your particular case, I do not see any problem in JOINing with the view, but I would worry about potential problems related to:
1) JOIN using VARCHARs instead of integer - ON to_date_payment.name = v_sum.name - if possible, try to JOIN on integer (ids or foreign keys ids) values, as it is faster (indexes applied on integer columns will have a smaller key, comparisons are slightly faster).
2) OR in queries - usually leads to performance problems. One thing to try is to change the SELECT like this:
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount - v_sum.total_paid_amount,
GETDATE()
FROM to_date_payment
JOIN v_sum ON to_date_payment.name = v_sum.name
WHERE to_date_payment.paid_to_date_amount - v_sum.total_paid_amount <> 0
UNION ALL
SELECT to_date_payment.name,
to_date_payment.paid_to_date_amount, -- or NULL if this is really intended
GETDATE()
FROM to_date_payment
-- NOT EXISTS is usually faster than LEFT JOIN ... IS NULL
WHERE NOT EXISTS (SELECT 1 FROM v_sum V WHERE V.name = to_date_payment.name)
3) Possible undesired result - by default, arithmetic involving NULL returns NULL. When there is no match in v_sum, then v_sum.total_paid_amount is NULL and to_date_payment.paid_to_date_amount - v_sum.total_paid_amount will evaluate to NULL. Is this correct? Maybe to_date_payment.paid_to_date_amount - ISNULL(v_sum.total_paid_amount, 0) is intended.

Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
------------|-----
a | 1
a | 2
a | 3
a | 4
b | 2
b | 3
c | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
------|------|------
1 | 2 | 1
1 | 3 | 1
1 | 4 | 1
2 | 3 | 2
2 | 4 | 1
3 | 4 | 1
The idea is to store a denormalised table that allows one to look up for a given sku, what other skus are related to sessions it has been related to, and how many times both skus are related to the same session.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or using a facetting search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. Also looking into the potential for using a graph database of some sort, but wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations) with around 2-3k variations of sku in about 5 minutes on my iMac. I will realistically be denormalising only up to 10k rows per day. Question is whether this is actually the best approach. Seems a little dirty to me.
BEGIN;
CREATE
TEMPORARY TABLE working_tags(tag_id int, session_id varchar, sku varchar) ON COMMIT DROP;
INSERT INTO working_tags
SELECT id,
session_id,
sku
FROM tags
WHERE time < now() - interval '12 hours'
AND processed_product_relation IS NULL
AND sku IS NOT NULL LIMIT 200000;
CREATE
TEMPORARY TABLE working_relations (sku1 varchar, sku2 varchar, score int) ON COMMIT DROP;
INSERT INTO working_relations
SELECT a.sku AS sku1,
b.sku AS sku2,
count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b ON a.session_id = b.session_id
AND a.sku < b.sku
WHERE a.sku IS NOT NULL
AND b.sku IS NOT NULL
GROUP BY a.sku,
b.sku;
UPDATE product_relations
SET score = working_relations.score+product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2;
INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
working_relations.sku2,
working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations ON (working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;
UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN
(SELECT tag_id
FROM working_tags);
COMMIT;
If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
If you're creating a materialized view, just CREATE UNLOGGED TABLE whatever AS SELECT .... . That way you minimise the numer of writes/rewrites/overwrites.

Can I do it only with SQL Queries or do I need a stored procedure?

I came across with a problem in which I need some advice/help.
Background:
I'm programming a map in an accounting software which requires special formatting and a tree-like hierarchy between the account codes, i.e.:
Account 1 is the father of accounts 11, 12 and 13;
On the other hand, account 11 is the father of accounts 111, 112 and 113;
There can also be an account relationship like this one: account 12 is the father of account 12111, because there is no 121 nor 1211, which means that the hierarchy is not a "one digit plus hierarchy" if I make my self clear;
And this logic continues till the end of the map.
Database table background:
There are 3 tables involved in all process.
A general ledger table (named ContasPoc) which I can relate with table number 2 (general ledger entries accounts) by each account code, in other words, the account code in table 1 is equal to the account code in table 2 (this field is named CodigoConta in both tables).
The general ledger entries accounts (named LancamentosContas) which I can relate with table number 1 (general ledger) as I said. This table is also related with table number 3 (general ledger entries) by an Id.
Table number 3 is as I said general ledger entries (named Lancamentos) and the field name that relates this table with number 2 is IdLancamento. I need this table for filtering purposes.
A real example:
I want all the accounts from 24 till 26 that have values/movements but I also want their parents. Moreover, their parents will have to contain the sums of their children accounts just like the example output bellow:
Account Code | Debit Sum | Credit Sum | Balance /*Balance = Debit Sum - Credit Sum*/
24 | 12,000 | 184,000 | -172,000 /*Sum of all children (242+243) */
242 | 12,000 | 48,000 | -36,000 /*Sum of all children (2423+2424) */
2423 | 12,000 | 0,000 | 12,000 /*Sum of all children (24231) */
24231 | 0,000 | 0,000 | 12,000 /*Account with values/movements */
2424 | 0,000 | 48,000 | -48,000 /*Sum of all children (24241) */
24241 | 0,000 | 48,000 | -48,000 /*Account with values/movements */
243 | 0,000 | 136,000 | -136,000 /*Sum of all children (2433) */
2433 | 0,000 | 136,000 | -136,000 /*Sum of all children (24331) */
24331 | 0,000 | 136,000 | -136,000 /*Sum of all children (243313) */
243313 | 0,000 | 136,000 | -136,000 /*Account with values/movements */
As you can see there are lots of accounts but only the ones that have the following comment /*Account with values/movements*/ have values, the others are calculated by me, which brings me to the real problem:
Can I do it? Of course, but at the moment I'm using recursive methods and several queries to do it and this approach takes a awful lot of time so I definitely need a new one because most of the customers use old computers.
I've tried to get this kind of structure/data using a query with INNER JOINS. Didn't work because I don't get all the data I need, only the lines with movements.
I also and tried with LEFT JOINS. Didn't work so well as well because I get all the data I need plus a lot of which I don't need. But using a LEFT JOIN has one more problem, I only get this extra data (also the one I need) if I do not include any fields from table number 3 in the WHERE clause. This obviously has an answer but because I'm not such an expert in SQL Server I'm not seeing what is the reason for that.
Here are the query I built (the other one has only 1 difference, a LEFT JOIN instead of the first INNER JOIN):
SELECT IdConta, Min(IdContaPai) AS IdContaPai, Min(ContasPoc.CodigoConta) AS CodigoConta
FROM ContasPoc INNER JOIN (LancamentosContas INNER JOIN Lancamentos ON
Lancamentos.IdLancamento = LancamentosContas.IdLancamento) ON ContasPoc.CodigoConta =
LancamentosContas.CodigoConta
WHERE Lancamentos.IdEmpresa=17 AND ContasPoc.IdEmpresa=17 AND Lancamentos.IdLancamento =
LancamentosContas.IdLancamento AND ContasPoc.CodigoConta>='24' AND ContasPoc.CodigoConta<='26'
GROUP BY IdConta
ORDER BY Min(ContasPoc.CodigoConta)
After this long explanation (sorry about that) my 3 questions are:
Is it possible to get such results (as I demonstrated above) only with database queries, in other words, without using recursive methods?
If not (most probable) are stored procedures the way to go and if so how can I do it (never used them before)?
Is there another method I'm not seeing?
Can someone help me on this one?
I hope I got all the relevant data in, but if not please tell me what I forgot and I will gladly answer back. Thanks in advance!
Miguel
WITH q (AccountCode, Balance) AS (
SELECT 24231, 12000
UNION ALL
SELECT 243313, -136000
UNION ALL
SELECT 24241, -48000
UNION ALL
SELECT 24, NULL
UNION ALL
SELECT 242, NULL
UNION ALL
SELECT 2423, NULL
)
SELECT qp.AccountCode, SUM(qc.Balance)
FROM q qp
JOIN q qc
ON SUBSTRING(CAST(qc.AccountCode AS VARCHAR), 1, LEN(qp.AccountCode)) = qp.AccountCode
GROUP BY
qp.AccountCode
ORDER BY
AccountCode
Unless I'm mistaken, this becomes easy if you think of Account Code as a string. You could query like:
select *,
(select sum(vw2.DebitSum)
from SomeView vw2
where vw.code < vw2.code and
vw2.code < cast(cast(vw.code as int) + 1 as varchar)
) as DebitSum
from SomeView vw
where '24' <= vw.code and vw.code < '27'
The subquery sums the debits for all children.
You never "need" stored procedures. Think of it this way, a stored procedure is just code with DB queries inside it which runs entirely on the server. You could always do that code in your client language and have the DB queries return result sets which you operate on on the client side.
i don't understand why you don't just have an account table with an id for the primary key, and for every child account have a parent account field. then you just reference back to the account table and get your hierarchy in a looping cte.

Detecting Correlated Columns in Data

Suppose I have the following data:
OrderNumber | CustomerName | CustomerAddress | CustomerCode
1 | Chris | 1234 Test Drive | 123
2 | Chris | 1234 Test Drive | 123
How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that Sql Server data mining is probably the right tool for the job, but I don't have too much experience with that.
Thanks in advance.
UPDATE:
By "correlate", I mean in the statistics sense, that whenever column a is x, column b will be y. In the above data, The last three columns correlate with each other, and the first column does not.
The input of the operation would be the name of the table, and the output would be something like :
Column 1 | Column 2 | Certainty
CustomerName | CustomerAddress | 100%
CustomerAddress | CustomerCode | 100%
There is a 'functional dependency' test built in to the SQL Server Data Profiling component (which is an SSIS component that ships with SQL Server 2008). It is described pretty well on this blog post:
http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx
I have played a little bit with accessing the data profiler output via some (under-documented) .NET APIs and it seems doable. However, since my requirement dealt with distribution of column values, I ended up going with something much simpler based on the output of DBCC STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:
select distinct
case when a.OrderNumber < b.OrderNumber then a.OrderNumber
else b.OrderNumber
end as FirstOrderNumber,
case when a.OrderNumber < b.OrderNumber then b.OrderNumber
else a.OrderNumber
end as SecondOrderNumber
from
MyTable a
inner join MyTable b on
a.CustomerName = b.CustomerName
and a.CustomerAddress = b.CustomerAddress
and a.CustomerCode = b.CustomerCode
This would return you:
FirstOrderNumber | SecondOrderNumber
1 | 2
Correlation is defined on metric spaces, and your values are not metric.
This will give you percent of customers that don't have customerAddress uniquely defined by customerName:
SELECT AVG(perfect)
FROM (
SELECT
customerName,
CASE
WHEN COUNT(customerAddress) = COUNT(DISTINCT customerAddress)
THEN 0
ELSE 1
END AS perfect
FROM orders
GROUP BY
customerName
) q
Substitute other columns instead of customerAddress and customerName into this query to find discrepancies between them.

Resources