Choose 2 Random Values from 2 Separate Columns SQL - sql-server

I have to prepare test data to send to a third party, but I don't want to send the customers' real names or their real dates of birth.
I could solve the D.O.B. issue by just randomly increasing the DOB by several years. The name is different, however: is there any way I can have a list of, say, 10 customer names and just choose a different first name and surname each time?
I want to mix and match the names, so that it randomly picks one first name, then randomly picks a surname, and puts them together on the same line.
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
This returns a random first name each time, but if I add the surname column it also returns the matching surname. I don't want that; I want a random surname from the list.
I tried doing this via a UNION, but you can't use ORDER BY NEWID() in the UNION.
Cheers.

I think this one might help...
WITH fn AS
(
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
),
sn AS
(
SELECT TOP 1 opde.surname
FROM Table AS opde
ORDER BY NEWID()
)
SELECT first_name, surname
FROM fn
CROSS APPLY sn;
The fn CTE selects a random first name. The sn CTE does the same, but with a surname.
The CROSS APPLY then combines those two single-row results into one row.
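A self-contained way to try this idea is sketched below. The table variable @names and its sample rows are made up for illustration (the question's real table is only shown as Table), and a CROSS JOIN is swapped in for CROSS APPLY, which should behave the same here because neither side refers to the other.

```sql
-- Hypothetical sample data standing in for the question's table.
DECLARE @names TABLE (first_name varchar(50), surname varchar(50));

INSERT INTO @names (first_name, surname)
VALUES ('Anna', 'Smith'), ('Ben', 'Jones'), ('Cara', 'Taylor');

-- Each derived table picks one row independently, so the first name
-- and the surname do not have to come from the same source row.
SELECT fn.first_name, sn.surname
FROM (SELECT TOP (1) first_name FROM @names ORDER BY NEWID()) AS fn
CROSS JOIN (SELECT TOP (1) surname FROM @names ORDER BY NEWID()) AS sn;
```

Each execution returns a single row. The two picks are independent, so the pair can still happen to match an original customer by chance.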

You may use a union of subqueries, each of which uses order by with NEWID:
SELECT first_name
FROM
(SELECT TOP 1 opde.first_name FROM Table AS opde ORDER BY NEWID()) t1
UNION ALL
SELECT first_name
FROM
(SELECT TOP 1 opde.first_name FROM Table AS opde ORDER BY NEWID()) t2;

Related

detecting duplicates and removing them

I've been trying to solve a problem in my database which is quite common but I couldn't find a solution so far and I hope you could help me with this.
I have a database with people and their associated addresses. My primary goal is to find out how many unique households are in there. For example, I want to count a family as one. So far I ran a query to display the last_name and address combinations that occur more than once:
select Last_Name ,add_line1, count(*) from ##all_people
group by Last_Name,ADD_LINE1
having count(*) > 1
This shows me people with the same last_name and address but I need their IDs in order to remove them from my temptable.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
This is the structure of my temptable:
ID First_name Last_Name add_line1
Thank you so much for your help!!!
To find duplicates, you can use COUNT() OVER() and partition by the grouping you want:
select * from (
select Id, Last_Name ,add_line1, count(*) over (partition by Last_Name, add_line1) dupe_count from ##all_people
) t
where t.dupe_count > 1
To find the ones you want to delete, you can use ROW_NUMBER():
select * from (
select Id, Last_Name ,add_line1, row_number() over (partition by Last_Name, add_line1 order by ID) extras from ##all_people
) t
where t.extras > 1
Use t.extras = 1 instead to see one row per grouping.
You seem to have a lot of questions here...
My primary goal is to find out how many unique households are in there.
You can do this with a distinct count:
SELECT COUNT(DISTINCT Last_Name + add_line1)
FROM ##all_people
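One caveat with concatenating the two columns: different pairs can produce the same string (for example 'Lee' + 'ds Road' and 'Leeds' + ' Road'), and under the default CONCAT_NULL_YIELDS_NULL setting a NULL in either column makes the whole expression NULL, which COUNT(DISTINCT ...) skips. A sketch of an alternative that counts the distinct pairs directly, using the same ##all_people table (the alias names are illustrative):

```sql
-- Count distinct (Last_Name, add_line1) pairs without concatenating them.
SELECT COUNT(*) AS household_count
FROM (
    SELECT DISTINCT Last_Name, add_line1
    FROM ##all_people
) AS households;
```

Note that DISTINCT treats NULLs as equal to each other, so people with the same last name and a NULL address are counted as one household here.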
...but I need their IDs in order to remove them from my temptable
I think this is solved by the new count query.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
Just use distinct last name and address:
SELECT DISTINCT last_name, add_line1
FROM ##all_people

SQL Server: select random rows with distinct id from table where id is not distinct

I have a simple table named Tickets with the following columns:
ticketId, userId
where ticketId is the primary key, UserId is not unique.
A user can therefore have several tickets, each with unique ticketId's.
I'm struggling to find a solution to my problem, which is that I need to select 5 random tickets belonging to 5 unique userIds.
I know how to select the random tickets by using the following query:
SELECT TOP 5 *
FROM Tickets
ORDER BY RAND(CHECKSUM(*) * RAND())
Which returns something like:
Ticket id: UserId:
--------------------------
10 1
25 1
31 2
42 2
56 3
My question is: what do I need to add to the query so that it selects the random rows across distinct userIds, i.e. so that it returns no more than one ticket per user?
Note that I need the best-performing correct solution, since the table could potentially be filled with millions of rows in the long run.
Thanks in advance,
Christian
Edit:
The more tickets a user has, the higher the chances of selection. However it should still be randomly selected and not just select the user with the highest amount of tickets. Just like in a lottery.
In other words it should select 5 random rows between all rows, but ensure that the 5 rows have a unique userId.
Please try it like this, using NEWID():
Select UserId
from
(
SELECT TOP 5 UserId
FROM Tickets
ORDER BY NEWID()
)k
CROSS APPLY
(
select top 1 TicketId
from Tickets T WHERE T.UserId = k.UserId
ORDER BY NEWID()
)u
Edit: As pointed out in the comments, this solution doesn't properly weight the users by number of tickets (so a user with 1000 tickets incorrectly has the same chance of winning as a user with 1 ticket). This was particularly dumb of me since I pointed out this problem on other answers.
Given that Steve now has his solution working, I think that is the better answer.
Original answer:
I think something like the following works:
SELECT top 5 ticketid, userid
FROM
(
SELECT ticketid, userid, ROW_NUMBER() OVER (PARTITION BY userid ORDER BY NEWID()) as nid
FROM tickets
) a
WHERE nid = 1
ORDER BY NEWID()
Credit where credit is due: I based this on Steve's solution which I don't think works correctly as written.
Something like the following I think.
Please note this code is untested, so please excuse any small syntax errors.
WITH randomised_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (ORDER BY NEWID() ASC) AS random_order
FROM Tickets
)
,ordered_winning_tickets AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY userId ORDER BY random_order ASC) AS user_win_order
FROM randomised_tickets
)
SELECT TOP 5
*
FROM
ordered_winning_tickets
WHERE
user_win_order = 1 --eliminate 2nd wins from the list
ORDER BY
random_order
You could try something like this, using ignore_dup_key on a temp table to eliminate duplicates for a user:
drop table if exists #WinningTickets
create table #WinningTickets(PickId int identity primary key, TicketId int, UserId int)
create unique index ix_unique_user on #WinningTickets(UserId) with (ignore_dup_key=on)
while ( select count(*) from #WinningTickets ) < 5
begin
insert into #WinningTickets
select top 10 TicketId, UserId
from Tickets
order by newid()
end
select top 5 *
from #WinningTickets
order by PickId
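One thing to watch with this loop: if Tickets holds fewer than 5 distinct users, the count never reaches 5 and the loop does not terminate. A minimal sketch of a guard follows, assuming the #WinningTickets table and its unique index have already been created as shown; the variables @target and @distinctUsers are introduced here for illustration only.

```sql
-- Cap the target at the number of distinct users actually available.
DECLARE @target int = 5;
DECLARE @distinctUsers int = (SELECT COUNT(DISTINCT UserId) FROM Tickets);

IF @distinctUsers < @target
    SET @target = @distinctUsers;

WHILE (SELECT COUNT(*) FROM #WinningTickets) < @target
BEGIN
    -- Rows whose UserId is already present are discarded by ignore_dup_key.
    INSERT INTO #WinningTickets (TicketId, UserId)
    SELECT TOP (10) TicketId, UserId
    FROM Tickets
    ORDER BY NEWID();
END;

SELECT TOP (5) *
FROM #WinningTickets
ORDER BY PickId;
```

With an empty Tickets table the target becomes 0, so the loop body is skipped and the final SELECT returns no rows.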

How to find and delete all duplicates from SQL Server database

I'm new to SQL in general and I need to delete all duplicates in a given database.
For the moment, I use this DB to experiment some things.
The table currently looks like this :
I know I can find all duplicates using this query :
SELECT COUNT(*) AS NBR_DOUBLES, Name, Owner
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
but I'm having a lot of trouble finding a suitable, up-to-date solution that not only finds all the duplicates but also deletes them, leaving only one of each.
Thanks a lot for taking some of your time to help me.
;WITH numbered AS (
SELECT ROW_NUMBER() OVER(PARTITION BY Name, Owner ORDER BY Name, Owner) AS _dupe_num
FROM dbo.Animals
)
DELETE FROM numbered WHERE _dupe_num > 1;
This will delete all but one of each occurrence with the same Name & Owner. If you need it to be more specific, extend the PARTITION BY clause; if you want it to take the entire record into account, add all your fields.
The record left behind is currently arbitrary, since it seems you do not have any field that provides a meaningful ordering.
What you want to do is use a projection that numbers each record within a given duplicate set. You can do that with a Windowing Function, like this:
SELECT Name, Owner
,Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
ORDER BY Name, Owner
This should give you results like this:
Name     Owner  RowNum
Ecstasy  Sacha  1
Ecstasy  Sacha  2
Ecstasy  Sacha  3
Gremlin  Max    1
Gremlin  Max    2
Gremlin  Max    3
Outch    Max    1
Outch    Max    2
Outch    Max    3
Now you want to convert this to a DELETE statement that has a WHERE clause targeting rows with RowNum > 1. The way to use a windowing function with a DELETE is to first include the windowing function as part of a common table expression (CTE), like this:
WITH dupes AS
(
SELECT Name, Owner,
Row_Number() OVER ( PARTITION BY Name, Owner ORDER BY Name, Owner, Birth) AS RowNum
FROM dbo.animals
)
DELETE FROM dupes WHERE RowNum > 1;
This will delete later duplicates, but leave row #1 for each group intact. The only trick now is to make sure row #1 is the correct row, since not all of your duplicates have the same values for the Birth or Death columns. This is the reason I included the Birth column in the windowing function, while other answers (so far) have not. You need to decide if you want to keep the oldest animal or the youngest, and optionally change the Birth order in the OVER clause to match your needs.
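Since a DELETE like this is hard to undo, one cautious way to run it is inside a transaction, with a check before committing. The sketch below assumes the Birth column mentioned above and keeps the row with the earliest Birth in each group; whether to commit or roll back is left to whoever runs it.

```sql
BEGIN TRANSACTION;

-- Number the rows in each (Name, Owner) group, earliest Birth first.
-- In SQL Server, NULL Birth values sort before non-NULL ones.
WITH dupes AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Name, Owner ORDER BY Birth ASC) AS RowNum
    FROM dbo.animals
)
DELETE FROM dupes WHERE RowNum > 1;

-- Should return no rows if every (Name, Owner) pair is now unique.
SELECT Name, Owner, COUNT(*) AS remaining
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1;

-- After inspecting the result, run exactly one of:
-- COMMIT TRANSACTION;
-- ROLLBACK TRANSACTION;
```

To keep the youngest animal instead, change the ordering to Birth DESC.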
Use a CTE. Here is a sample:
Create table #Table1(Field1 varchar(100));
Insert into #Table1 values
('a'),('b'),('f'),('g'),('a'),('b');
Select * from #Table1;
WITH CTE AS(
SELECT Field1,
RN = ROW_NUMBER()OVER(PARTITION BY Field1 ORDER BY Field1)
FROM #Table1
)
--SELECT * FROM CTE WHERE RN > 1
DELETE FROM CTE WHERE RN > 1
What I am doing is numbering the rows. If there are duplicates based on the PARTITION BY columns, they are numbered sequentially; otherwise the row gets 1.
Then delete those records whose row number is greater than 1.
I won't spoon-feed you the solution, so you will have to play with PARTITION BY to reach your desired output.
Output:
Select * from #Table1;
Field1
---------
a
b
f
g
a
b
/*with cte as (...) SELECT * FROM CTE;*/
Field1 RN
------- -----
a 1
a 2
b 1
b 2
f 1
g 1
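If dbo.animals had an ID field, I believe you could use this:
DELETE FROM dbo.animals WHERE ID IN
(
SELECT MAX(ID)
FROM dbo.animals
GROUP BY Name, Owner
HAVING COUNT(*) > 1
)
Each run removes only one row per duplicated group, so repeat it until no rows are affected if a group has more than two copies.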

Find max value and show corresponding value from different field in SQL server

I have a table with data about cities that includes their name, population and other fields irrelevant to my question.
ID  Name  Population
1   A     45667
2   B     123456
3   C     3005
4   D     13769
To find the max population is basic, but I need a resulting table that has the max population in one column, and the corresponding city's name in another column
Population  Name
123456      B
I've looked through similar questions, but for some reason the answers look over-complicated. Is there a way to write the query in 1 or 2 lines?
There are several ways that this can be done:
A filter in the WHERE clause:
select id, name, population
from yourtable
where population in (select max(population)
from yourtable)
Or a join against a subquery:
select id, name, population
from yourtable t1
inner join
(
select max(population) MaxPop
from yourtable
) t2
on t1.population = t2.maxpop;
Or you can use TOP WITH TIES, which will include any rows that share the same maximum population value. If there can be no ties, you can remove the WITH TIES:
select top 1 with ties id, name, population
from yourtable
order by population desc
Since you are using SQL Server you can also use ranking functions to get the result:
select id, name, population
from
(
select id, name, population,
row_number() over(order by population desc) rn
from yourtable
) src
where rn = 1
As a side note on the ranking function, you might want to use dense_rank() instead of row_number(). Then, in the event that more than one city has the same maximum population, you will get all of the tied city names.
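The dense_rank() variant could look like the following sketch, using the same yourtable placeholder as the queries above; the alias rnk is just an illustrative name.

```sql
SELECT id, name, population
FROM
(
    SELECT id, name, population,
           DENSE_RANK() OVER (ORDER BY population DESC) AS rnk
    FROM yourtable
) AS src
WHERE rnk = 1;  -- every city tied for the highest population
```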

Is it possible to get two sets of data with one query

I need to get some items from database with top three comments for each item.
Now I have two stored procedures GetAllItems and GetTopThreeeCommentsByItemId.
In application I get 100 items and then in foreach loop I call GetTopThreeeCommentsByItemId procedure to get top three comments.
I know that this is bad from performance standpoint.
Is there some technique that allows to get this with one query?
I can use OUTER APPLY to get one top comment (if any) but I don't know how to get three.
Items {ItemId, Title, Description, Price etc.}
Comments {CommentId, ItemId etc.}
Sample data that I want to get
Item_1
-- comment_1
-- comment_2
-- comment_3
Item_2
-- comment_4
-- comment_5
One approach would be to use a CTE (Common Table Expression) if you're on SQL Server 2005 and newer (you aren't specific enough in that regard).
With this CTE, you can partition your data by some criteria - i.e. your ItemId - and have SQL Server number all your rows starting at 1 for each of those "partitions", ordered by some criteria.
So try something like this:
;WITH ItemsAndComments AS
(
SELECT
i.ItemId, i.Title, i.Description, i.Price,
c.CommentId, c.CommentText,
ROW_NUMBER() OVER(PARTITION BY i.ItemId ORDER BY c.CommentId) AS 'RowNum'
FROM
dbo.Items i
LEFT OUTER JOIN
dbo.Comments c ON c.ItemId = i.ItemId
WHERE
......
)
SELECT
ItemId, Title, Description, Price,
CommentId, CommentText
FROM
ItemsAndComments
WHERE
RowNum <= 3
Here, I am selecting up to three entries (i.e. comments) for each "partition" (i.e. for each item) - ordered by the CommentId.
Is that approach what you're looking for?
You can write a single stored procedure that calls GetAllItems and GetTopThreeeCommentsByItemId, stores the results in temp tables, and joins those tables to produce the single result set you need.
If you cannot use a stored procedure, you can still do the same by running a single SQL script from the data access tier that calls GetAllItems and GetTopThreeeCommentsByItemId, stores the results in temp tables, and joins them to return a single result set.
This gets the two nearest elder brothers of each member using OUTER APPLY:
select m.*, elder.*
from Member m
outer apply
(
select top 2 ElderBirthDate = x.BirthDate, ElderFirstname = x.Firstname
from Member x
where x.BirthDate < m.BirthDate
order by x.BirthDate desc
) as elder
order by m.BirthDate, elder.ElderBirthDate desc
Source data:
create table Member
(
Firstname varchar(20) not null,
Lastname varchar(20) not null,
BirthDate date not null unique
);
insert into Member(Firstname,Lastname,Birthdate) values
('John','Lennon','Oct 9, 1940'),
('Paul','McCartney','June 8, 1942'),
('George','Harrison','February 25, 1943'),
('Ringo','Starr','July 7, 1940');
Output:
Firstname Lastname BirthDate ElderBirthDate ElderFirstname
-------------------- -------------------- ---------- -------------- --------------------
Ringo Starr 1940-07-07 NULL NULL
John Lennon 1940-10-09 1940-07-07 Ringo
Paul McCartney 1942-06-08 1940-10-09 John
Paul McCartney 1942-06-08 1940-07-07 Ringo
George Harrison 1943-02-25 1942-06-08 Paul
George Harrison 1943-02-25 1940-10-09 John
(6 row(s) affected)
Live test: http://www.sqlfiddle.com/#!3/19a63/2
marc's answer is better; use OUTER APPLY only if you need to query "near" entities (e.g. geospatial, elder brothers, nearest date to due date, etc.) relative to the main entity.
Outer apply walkthrough: http://www.ienablemuch.com/2012/04/outer-apply-walkthrough.html
You might need DENSE_RANK instead of ROW_NUMBER/RANK though, as the criteria for a comment being a top one could yield ties: a top 1 could then yield more than one row, and a top 3 more than three. Example of that scenario (DENSE_RANK walkthrough): http://www.anicehumble.com/2012/03/postgresql-denserank.html
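Applied to the Items and Comments tables from the question, the OUTER APPLY approach might be sketched as follows. It assumes that a higher CommentId marks a "more top" comment, which the question does not actually specify, so the ORDER BY inside the APPLY is a placeholder for the real ranking rule.

```sql
SELECT i.ItemId, i.Title, c.CommentId
FROM dbo.Items AS i
OUTER APPLY
(
    -- Up to three comments per item; the ordering rule is an assumption.
    SELECT TOP (3) cm.CommentId
    FROM dbo.Comments AS cm
    WHERE cm.ItemId = i.ItemId
    ORDER BY cm.CommentId DESC
) AS c
ORDER BY i.ItemId, c.CommentId DESC;
```

Because it is an OUTER APPLY rather than a CROSS APPLY, items without any comments still appear once, with a NULL CommentId.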
It's better to number the rows with row_number() and then select only the top 3 per group:
select a.* from
(
select *, row_number() over (partition by ItemId order by CommentId) as [dup]
from dbo.Comments
) as a
where a.dup <= 3

Resources