I'm attempting to determine the best way, either in SSIS or straight TSQL, to merge two rows based on a given key, but taking specific data from each row based on various aggregate rules (MAX and SUM specifically). As an example, given the following dataset:
Customer   Name   Total   Date        Outstanding
12345      A      100     7/15/2015   500
12345             200     1/1/2015    300
456        B      500     1/2/2010    100
456        B      250     2/1/2015    900
78         C      100     9/15/2015   500
I wish to consolidate those to a single row per customer key, with the following rules as an example:
If any name is null, use a corresponding value for that customer that is not null
MAX(Total)
MAX(Date)
SUM(Outstanding)
The result set would be:
Customer   Name   Total   Date        Outstanding
12345      A      200     7/15/2015   800
456        B      500     2/1/2015    1000
78         C      100     9/15/2015   500
What's the best approach here? My first instinct is to query the table joined to itself on Customer to get all values onto a single row, and then use formulas in a Derived Column task in SSIS to determine which values to use. My concern is that this isn't scalable: it works fine if a customer occurs only twice in the main dataset, but the goal is for the logic to work for N rows per customer without a ton of rework. I'm sure there's also a T-SQL approach I'm missing here. Any help would be appreciated.
If the Name column in your data is never null, then you can do this simply by using aggregate functions in one query:
DECLARE @Customer TABLE
(
    Customer INT, Name VARCHAR(10), Total INT, PurchaseDate DATE, Outstanding INT
)

INSERT INTO @Customer
SELECT 12345, 'A', 100, '7/15/2015', 500 UNION
SELECT 12345, 'A', 200, '1/1/2015', 300 UNION
SELECT 456, 'B', 500, '1/2/2010', 100 UNION
SELECT 456, 'B', 250, '2/1/2015', 900 UNION
SELECT 78, 'C', 100, '9/15/2015', 500

SELECT Customer, Name, MAX(Total), MAX(PurchaseDate), SUM(Outstanding)
FROM @Customer
GROUP BY Customer, Name
Now, if your Name column is null in a few cases, as in your example, you can either update the table with the correct Name values first, or fold the rule into the aggregation itself, as sketched below.
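A minimal sketch of the aggregate-only variant, assuming each customer has at most one distinct non-null Name (as in the sample data): MAX ignores NULLs, so grouping by Customer alone lets MAX(Name) pick up the non-null value.

SELECT Customer,
       MAX(Name)         AS Name,          -- NULLs are ignored, so the non-null name wins
       MAX(Total)        AS Total,
       MAX(PurchaseDate) AS PurchaseDate,
       SUM(Outstanding)  AS Outstanding
FROM @Customer
GROUP BY Customer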
I'm using PostgreSQL 9.2 and have a simple students table as follows:
id | proj_id | mark | name | test_date
I have two queries, described below:
select *
from (select distinct on (proj_id) proj_id, mark, name, test_date
      from students) t
where t.mark <= 1000
VS
select distinct on (proj_id) proj_id, mark, name, test_date
from students
where mark <= 1000
When I run each query against more than 10,000 records, the two queries return different results, in particular different row counts, although with fewer than 3,000 records the results are the same. Is this a PostgreSQL 9.2 bug, or am I missing something?
Your queries are producing two different sets of results because they are applying the logic differently.
The first query is getting a distinct set of results, and then applying the 'mark' filter.
The second query is applying the 'mark' filter, and then getting a distinct set of results.
As you don't have any ordering applied, the first query could potentially return a different number of rows each time it is run, as the mark field of the row that DISTINCT ON happens to keep could contain any of the values that relate to that proj_id.
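To make the choice of surviving row deterministic, DISTINCT ON should be paired with an ORDER BY that leads with the same expression. A sketch, assuming you want the highest mark per project:

select distinct on (proj_id) proj_id, mark, name, test_date
from students
where mark <= 1000
order by proj_id, mark desc  -- keeps exactly one row per proj_id: the one with the highest mark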
I have a table with approximately 10 million rows (and 20 columns, about 4 GB) where about 10% of the rows have a duplicate ID. The database is SQL Server 2014 Express, and I'm using SSMS.
I created a new column CNT (int, null) to number the occurrences of each duplicate ID. The desired result would look like:
ID    CNT
100   1
100   2
101   1
102   1
102   2
103   1
104   1
Not being really familiar with advanced SQL capabilities, I did some research and came up with using a CTE to set the CNT column. It worked fine on a small test table, but it was obvious this was not the way to go for a large table (I killed it after 5+ hours on a pretty decent system).
Here's the code that I attempted to implement:
with CTE as
(
    select dbo.database.id, dbo.database.cnt,
           RN = row_number() over (partition by id order by id)
    from dbo.database
)
update CTE set CNT = RN
Column ID is of type Int. All columns allow nulls - there are no keys or indexed columns.
Edit: Martin is right; at the moment I can only offer an alternative to the CTE. Make a new table exactly like your old one, and insert the old table's data into it with this:
INSERT INTO newTable
SELECT ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID)  -- list the table's other columns here as well so they carry over
FROM oldTable;
Then you can delete your old table. Definitely not a perfect solution, but it should work.
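For what it's worth, the CTE update from the question can also be viable at this scale once the window function has an index to lean on; without any index, SQL Server has to sort all 10 million rows before it can number them. A sketch, using dbo.MyTable as a stand-in name for the real table:

-- Hypothetical table name; substitute the real 10M-row table.
-- An index on ID lets ROW_NUMBER() read rows in order instead of sorting the whole heap.
CREATE INDEX IX_MyTable_ID ON dbo.MyTable (ID);

WITH CTE AS
(
    SELECT CNT,
           RN = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID)
    FROM dbo.MyTable
)
UPDATE CTE SET CNT = RN;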
I need to select random rows from a table based on a weight in another column. For example, if the user enters 50, I need to select 50 random rows from the table, such that rows with a higher weight are returned more often. I've seen NEWID() used to select n random rows, and this link:
Random Weighted Choice in T-SQL
shows how to select one row based on the weight in another column, but I need to select several rows based on the user's input number. Would the best way be to use the suggested answer from that link and loop over it n times (though I think that would return the same row)? Is there any other easy solution?
My table is like this:

ID   Name   Freq
1    aaa    50
2    bbb    30
3    ccc    10
So when the user enters 50, I need to return 50 random names, with more aaa and bbb than ccc; it might be 25 aaa, 15 bbb, and 10 ccc. Anything close to this will work too. I saw this answer, but when I execute it against my DB it seems to run for five minutes with no results yet:
SQL : select one row randomly, but taking into account a weight
I think the difficult part here is getting any individual row to potentially appear more than once. I'd look into doing something like the following:
1) Build a temp table, duplicating records according to their frequency. (I'm sure there's a better way of doing this, but the first answer that came to my mind was a simple while loop; a set-based alternative is sketched after step 3. This particular one really only works if the frequency values are integers.)
create table #dup
(
    id int,
    nm varchar(10)
)

declare @curr int, @maxFreq int

select @curr = 0, @maxFreq = max(freq)
from tbl

while @curr < @maxFreq
begin
    insert into #dup
    select id, nm
    from tbl
    where freq > @curr

    set @curr = @curr + 1
end
2) Select your top records, ordered by a random value
select top 10 *
from #dup
order by newID()
3) Cleanup
drop table #dup
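And the set-based alternative to step 1 mentioned above: a sketch that expands each row freq times in a single statement, assuming the same table and column names (tbl, id, nm, freq) and integer frequencies.

-- Build a numbers set 0..max(freq)-1, then join it against the weights
-- so each row lands in #dup exactly freq times.
;with nums as
(
    select top (select max(freq) from tbl)
           row_number() over (order by (select null)) - 1 as n
    from sys.all_objects a cross join sys.all_objects b
)
insert into #dup (id, nm)
select t.id, t.nm
from tbl t
join nums on nums.n < t.freq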
Maybe you could try something like the following in your SQL:

ORDER BY Freq * RAND(CHECKSUM(NEWID()))

(Note that plain RAND() is evaluated only once per query in SQL Server, so it needs a per-row seed such as CHECKSUM(NEWID()).) Rows with a higher Freq value should then, in theory, be returned more often than those with a lower Freq value. It seems a bit hackish, but it might work!
How can I show a row number for each record in a table, such that when a new record is added the numbering extends, and when a record is deleted the numbers are updated accordingly?
To be more clear, suppose I have a simple table like this:
ID int (primary key) Name varchar(5)
The ID is set to increment by itself (using the identity specification), so it can't represent the row (record) number. For example, if I have three records:
ID   NAME
1    Alex
2    Scott
3    Sara
and I delete Alex and Scott and add a new record it will be:
3    Sara
4    Mina
So basically I'm looking for a SQL-side solution for doing this, so that I don't have to change the source code in multiple places.
I tried to write something to get the job done, but it fails. Here it is:
SELECT COUNT(*) AS [row number], Name
FROM dbo.Test
GROUP BY ID, Name
HAVING (ID = ID)
This shows as:
row number   Name
1            Alex
1            Scott
1            Sara
while I want it to get shown as:
row number   Name
1            Alex
2            Scott
3            Sara
If you just want the number shown against the rows when selecting the data, and not stored in the database, then you can use this:

select row_number() over (order by id) as [row number], Name
from dbo.Test

This will give row number n for the nth row.
Try
SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS RowNumber
FROM MyTable
What you want is called an auto increment.
For SQL Server this is achieved by adding the IDENTITY(1,1) attribute to the column definition.
Other RDBMSs use different syntax. Firebird, for example, has generators, which do the counting; in a BEFORE INSERT trigger you would assign the ID field the current value of the generator (which is increased automatically).
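A minimal sketch of the SQL Server flavour (table and column names are just the ones from the question):

-- ID values are generated by the engine: 1, 2, 3, ...
CREATE TABLE dbo.Test
(
    ID   INT IDENTITY(1,1) PRIMARY KEY,
    Name VARCHAR(5)
);

INSERT INTO dbo.Test (Name) VALUES ('Alex');  -- ID is assigned automatically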
I had this exact problem a while ago, but I was using SQL Server 2000, so although ROW_NUMBER() is the best solution, it isn't available there. A workaround is to create a temporary table, insert all the values with an auto-increment column, and replace the current table with the new table in T-SQL.
Is there a way in MS Access to return a dataset between specific row indexes?
So let's say my dataset is:

rank   first_name   age
1      Max          23
2      Bob          40
3      Sid          25
4      Billy        18
5      Sally        19
But I only want to return the records between 'rank' 2 and 4, so my result set is Bob, Sid, and Billy. However, rank is not part of the table; it should be generated when the query is run. Why don't I use an autonumber field? Because if a record is deleted the numbering becomes inconsistent, and what if I wanted the results in reverse order?
This is obviously very simple, and the reason I ask is that I am working on a product catalogue and am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, that is obviously going to be quicker than returning a complete set of 3,000 records and then sub-selecting from that set.
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after the comment that rank does not exist in the structure:
Select top 100 * from table;
And if you want the subsequent results, you can take the ID of the last record from the first query, say it was 100, and use a WHERE clause to get the next 100:
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly; e.g., I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
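If you only want ranks 2 through 4, the subquery can be repeated in the WHERE clause (Access, like most engines, won't let you reference the rank alias there). A sketch:

SELECT T1.first_name, T1.age
FROM MyTable AS T1
WHERE (SELECT COUNT(*) + 1
       FROM MyTable AS T2
       WHERE T1.first_name > T2.first_name) BETWEEN 2 AND 4
ORDER BY T1.first_name;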
The bad news is that the Access database engine is poorly optimized for this kind of query; in my experience, performance will start to noticeably degrade beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings, but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows into an ADO recordset and then finding the page (then perhaps fabricating a smaller ADO recordset consisting of just that page's worth of rows) than to perform a correlated subquery that fetches only the rows for the page.
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder IN
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC
If I understand you correctly, there are only first_name and age columns in your table. If that is the case, then there is no way to return Bob, Sid, and Billy with a single query, unless you do something like:
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table and make it an auto-incrementing integer (see the Access documentation) so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had two people named Max, and they were both 23, how would you delete one row without deleting the other? With an auto-incrementing unique column, you could specify the unique ID in your query to delete only one, as in the sketch below.
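For instance (the table name and ID value are placeholders):

DELETE FROM People
WHERE ID = 7;  -- hypothetical ID: removes exactly one of the two otherwise-identical 'Max' rows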
[ADDITION]
Upon reading your comment: if you add an auto-increment field, want to read 3 rows, and know the ID of the first row you want, then you can use TOP to read 3 rows.
Assuming your data looks like this:

ID   first_name   age
1    Max          23
2    Bob          40
6    Sid          25
8    Billy        18
15   Sally        19
You can query Bob, Sid, and Billy with the following query:
SELECT TOP 3 FirstName, Age
FROM Table
WHERE ID >= 2
ORDER BY ID