I am trying to find potential duplicates in a many-to-many join in a SQL Server database.
I have a database of students attending classes and have the following tables: Lessons, Attendees, Classrooms and Students.
I am trying to find duplicates where the same group of students may have been entered twice for the same date and classroom.
Students to Lessons is many-to-many broken down by the Attendees table. The LessonID, StudentID, ClassroomID fields are SQL Server Identity primary keys. Attendees is simply the join table with a compound key of student and lesson.
Lessons:
LessonID
LessonDate
ClassroomID
Students:
StudentID
Attendees:
LessonID
StudentID
Classrooms:
ClassroomID
It is legitimate that the same group of students may have attended different classes on the same day in the same classroom, but I want to flag them up as potential duplicates, in case the record has erroneously been entered twice.
I can’t figure out how to find matching sets of students for the same classroom on the same date.
So, an example of duplicate data I would expect to find would be:
Lessons:
+----------+-------------+------------+
| LessonID | ClassroomID | LessonDate |
+----------+-------------+------------+
| 335867   | 347         | 06/01/2020 |
| 335872   | 347         | 06/01/2020 |
+----------+-------------+------------+
Attendees:
+----------+-----------+
| LessonID | StudentID |
+----------+-----------+
| 335867   | 432       |
| 335867   | 1398      |
| 335867   | 5107      |
| 335872   | 432       |
| 335872   | 1398      |
| 335872   | 5107      |
+----------+-----------+
Another way to look at this would be: for any given Lesson, which other lessons (if any) have the same students in the same classroom on the same day.
I found a solution myself using the STRING_AGG function to flatten out the hierarchy. I added the following query to the database:
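SELECT Lessons.LessonID, Lessons.ClassroomID, Lessons.LessonDate,
-- WITHIN GROUP fixes the order, so identical sets of students always give identical strings
string_agg(Attendees.StudentID, '-') WITHIN GROUP (ORDER BY Attendees.StudentID) AS team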
FROM Lessons INNER JOIN
Attendees ON Lessons.LessonID = Attendees.LessonID
GROUP BY Lessons.LessonID, Lessons.ClassroomID, Lessons.LessonDate
This gives lesson data that looks like this:
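+----------+-------------+------------+--------------+
| LessonID | ClassroomID | LessonDate | team         |
+----------+-------------+------------+--------------+
| 1        | 17          | 2006-01-04 | 3-5-10-23    |
| 2        | 18          | 2006-01-04 | 2-17-252     |
| 3        | 18          | 2006-01-04 | 2-16-18      |
| 4        | 18          | 2006-01-04 | 2-6-11-16-18 |
+----------+-------------+------------+--------------+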
which I can then simply query against.
I will turn this into a stored procedure passing in for my chosen lesson: LessonDate, ClassroomID and its own "STRING_AGG" team of students, as filters.
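To check the whole table at once rather than one chosen lesson, "querying against" the flattened result could look roughly like the sketch below. It assumes the table and column names from the question; LessonTeams and PossibleDuplicateOf are names made up for illustration, and it needs SQL Server 2017 or later because of STRING_AGG.

```sql
-- Sketch: pair up lessons that share classroom, date and student set.
WITH LessonTeams AS (
    SELECT l.LessonID, l.ClassroomID, l.LessonDate,
           STRING_AGG(CAST(a.StudentID AS varchar(10)), '-')
               WITHIN GROUP (ORDER BY a.StudentID) AS team
    FROM Lessons AS l
    INNER JOIN Attendees AS a
        ON a.LessonID = l.LessonID
    GROUP BY l.LessonID, l.ClassroomID, l.LessonDate
)
SELECT t1.LessonID,
       t2.LessonID AS PossibleDuplicateOf,
       t1.ClassroomID,
       t1.LessonDate,
       t1.team
FROM LessonTeams AS t1
INNER JOIN LessonTeams AS t2
    ON  t2.ClassroomID = t1.ClassroomID
    AND t2.LessonDate  = t1.LessonDate
    AND t2.team        = t1.team
    AND t2.LessonID    < t1.LessonID   -- report each pair once, later lesson first
ORDER BY t1.LessonDate, t1.ClassroomID, t1.LessonID;
```

Lessons with no attendees never appear here because of the inner join, which seems reasonable for a duplicate check but is worth knowing.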
The STRING_AGG function is only available from SQL Server 2017. So for older versions you can use the FOR XML PATH('') syntax, concatenating with a hyphen, with a STUFF to remove the leading hyphen:
SELECT dbo.Lessons.LessonID, dbo.Lessons.ClassroomID, dbo.Lessons.LessonDate,
(
stuff(
(select '-' + cast(StudentId as varchar(10))
FROM Attendees
WHERE Attendees.LessonId = Lessons.Lessonid
ORDER BY StudentId -- fixed order, so identical sets of students give identical strings
FOR XML path('')
),1,1,'')
)
as Team
FROM dbo.Lessons
You could concatenate with a comma instead for standard CSV format if preferred.
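If comparing concatenated strings feels fragile, a set-based comparison is also possible. This is only a sketch and not the approach used above: it assumes the same tables, uses a hypothetical @LessonID variable for the chosen lesson, and treats two lessons as matching when neither has a student that the other lacks.

```sql
-- Sketch: find other lessons with exactly the same students,
-- in the same classroom on the same date, without building strings.
DECLARE @LessonID int = 335867;   -- hypothetical: the lesson being checked

SELECT o.LessonID
FROM Lessons AS t
INNER JOIN Lessons AS o
    ON  o.ClassroomID = t.ClassroomID
    AND o.LessonDate  = t.LessonDate
    AND o.LessonID   <> t.LessonID
WHERE t.LessonID = @LessonID
  -- no student of the chosen lesson is missing from the other lesson
  AND NOT EXISTS (
        SELECT a.StudentID FROM Attendees AS a WHERE a.LessonID = t.LessonID
        EXCEPT
        SELECT b.StudentID FROM Attendees AS b WHERE b.LessonID = o.LessonID
      )
  -- and no student of the other lesson is missing from the chosen lesson
  AND NOT EXISTS (
        SELECT b.StudentID FROM Attendees AS b WHERE b.LessonID = o.LessonID
        EXCEPT
        SELECT a.StudentID FROM Attendees AS a WHERE a.LessonID = t.LessonID
      );
```

Note that two lessons with no attendees at all would count as a match under this definition. It also works on versions older than SQL Server 2017, since it needs neither STRING_AGG nor FOR XML.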
I have two different tables. I need to get a name from the TMK table (table 1, shown below) and bring in the number from my second table. I can't write join. Can you help me?
TMK Table;
| tmkName |
| George |
| Jacob |
flowNewStatus Table;
|statusId|
| 1 |
| 2 |
If George has status number 1, I want this join result:
| tmkName | statusId |
| George  | 1        |
Before getting to possible SQL queries... from the tables you show you'd need an additional table that associates the person to status, a join table. Essentially a TMK_status table:
TMK_status table
| personID | statusID |
|----------|----------|
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
Alternatively, the statusID could be stored as a column of TMK thus,
TMK table
| personID | tmkName | statusID |
|----------|----------|----------|
| 1 | George | 1 |
| 2 | Jacob | 3 |
If by "I can't write join", you mean you don't know how, check this answer: What is the difference between "INNER JOIN" and "OUTER JOIN"? - you will need an inner join.
If, on the other hand, you mean you can't use join statements, then you could write a subselect statement. There could be other solutions but they depend on how you decide to join/relate the 2 tables.
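As a rough sketch of both options, assuming the TMK_status join table suggested above and a personID column on TMK (neither exists in the tables as posted, so treat the names as hypothetical):

```sql
-- Option 1: inner joins through the assumed TMK_status join table.
SELECT t.tmkName, s.statusId
FROM TMK AS t
INNER JOIN TMK_status AS ts
    ON ts.personID = t.personID
INNER JOIN flowNewStatus AS s
    ON s.statusId = ts.statusID;

-- Option 2: the same lookup written as a subselect instead of a join.
-- This assumes each person has at most one status row; the scalar
-- subquery raises an error if it returns more than one value.
SELECT t.tmkName,
       (SELECT ts.statusID
        FROM TMK_status AS ts
        WHERE ts.personID = t.personID) AS statusId
FROM TMK AS t;
```

The two are not quite equivalent: the inner join drops people without a status, while the subselect keeps them with a NULL statusId.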
Is it possible in SQL Server to take two select statements and combine them into a single row without knowing how many entries one of the select statements got?
I've been looking at various join solutions, but they all seem to assume that the number of columns is predetermined. In my case one table (t1) has a fixed number of columns, while the other table (t2) has an undetermined number of entries, all of which use a key that matches one entry in t1.
+----+------+-----+
| id | name | ... |
+----+------+-----+
| 1 | John | ... |
+----+------+-----+
And
+-------------+----------------+
| activity_id | account_number |
+-------------+----------------+
| 1 | 12345467879 |
| 1 | 98765432515 |
| ... | ... |
| ... | ... |
+-------------+----------------+
The number of account numbers belonging to the first query is unknown.
After the query it would become:
+----+------+-----+----------------+------------------+-----+------------------+
| id | name | ... | account_number | account_number_2 | ... | account_number_n |
+----+------+-----+----------------+------------------+-----+------------------+
| 1 | John | ... | 12345467879 | 98765432515 | ... | ... |
+----+------+-----+----------------+------------------+-----+------------------+
So I don't know how many account numbers could be associated with the id beforehand.
I'm using SQL Server 2012. I have a table CustomerMaster. Here is some sample content:
+--------+---------------+-----------------+-------------+
| CustNo | NewMainCustNo | Longname | NoOfMembers |
+--------+---------------+-----------------+-------------+
| 3653   | 3653          | GroupId:003     |             |
| 3654   | 3654          | GroupId:004     |             |
| 11     | 3653          | Peter Evans     |             |
| 155    | 3653          | Harold Charley  |             |
| 156    | 3654          | David Arnold    |             |
| 160    | 3653          | Mickey Willson  |             |
| 2861   | 3653          | Jonathan Wickey |             |
| 2871   | 3653          | William Jason   |             |
+--------+---------------+-----------------+-------------+
The NewMainCustNo for Customer records is equivalent to CustNo from Group records. Basically each customer belongs to a particular group.
My question is how to update the NoOfMembers column for group records with the total number of customers belonging to each group.
Please share your ideas on how to do this.
Thank you...
This is the solution I came up with
update CustomerMaster
set NoOfMembers = (select count(*) from CustomerMaster m2 where m2.NewMainCustNo = CustomerMaster.CustNo and m2.CustNo <> CustomerMaster.CustNo)
where LongName like 'GroupId:%'
Check this SQL Fiddle to see the query in action.
However I disagree with your data structure. You should have a separate table for your groups. In your customer table you only need to reference the ID of the group in the group table. This makes everything (including the query above) much cleaner.
If I understand correctly, you can use a window function for the update. Here is an example with an updatable CTE:
with toupdate as (
select cm.*, count(*) over (partition by NewMainCustNo) as grpcount
from customermaster cm
)
update toupdate
set NoOfMembers = grpcount;
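As written, that update writes the count to every row and the count includes the group record itself. If only the group rows should be updated, a variant along these lines might be closer to what was asked. It is a sketch that assumes NoOfMembers is a numeric column and that every group row has NewMainCustNo equal to its own CustNo, as in the sample data.

```sql
-- Sketch: same updatable CTE, restricted to group rows.
WITH toupdate AS (
    SELECT cm.NoOfMembers,
           cm.Longname,
           COUNT(*) OVER (PARTITION BY cm.NewMainCustNo) AS grpcount
    FROM CustomerMaster AS cm
)
UPDATE toupdate
SET NoOfMembers = grpcount - 1       -- subtract the group row itself
WHERE Longname LIKE 'GroupId:%';     -- filter here, not inside the CTE,
                                     -- so the window still sees all rows
```

With the sample data this would set NoOfMembers to 5 for group 3653 and 1 for group 3654.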
You may not have the option to do so, but I would separate groups out into their own table.
create table Groups (
GroupID int primary key,
Name varchar(200)
)
Then, change NewMainCustNo to GroupID, purge your customer table of groups, and go from there. Then, getting a group count would be:
select g.GroupID,
g.Name [Group Name],
COUNT(*) [Members]
from Groups g
join Customers c on
c.GroupID = g.GroupID
group by g.GroupID, g.Name
I am selecting data specific to certain clients out of multiple tables, where data for one client spans multiple rows. However, I would like the duplicate entries to be combined onto one row. One basic example would be as follows:
+------------+-------+-------------------------------+
| ClientCode | Name | Email |
+------------+-------+-------------------------------+
| CAL01 | Doug | itsjustdoug#internet.org |
| CAL01 | Doug | doug#email.com |
| MER03 | Jane | janehasemail#email.com |
| MER03 | Jane | janerocks#web.com |
| MER03 | Jane | janehatesspam#justforspam.net |
+------------+-------+-------------------------------+
The results I am looking for would be more like
+------------+-------+-------------------+-------------------+-----------------------+
| ClientCode | Name | Email1 | Email2 | Email3 |
+------------+-------+-------------------+-------------------+-----------------------+
| CAL01 | Doug | itsjustdoug#inte | doug#email.com | NULL |
| MER03 | Jane | janehasemail#ema | janerocks#web.com | janehatesspam#justfor |
+------------+-------+-------------------+-------------------+-----------------------+
Here is what I have tried.
Select * From
(Select
ClientCode
,Name
,Email
From dbo.Clients) T
PIVOT(Max (Email) for Email in (Email1, Email2, Email3)) T2
This does not seem to be the correct way to achieve what I want. Any suggestions would be appreciated. It is worth noting that the actual query is much more complicated and contains many joins and perhaps several different instances where I would use this sort of "pivoting?"
Thanks
Generate a Row_number per ClientCode in the pivot source query, and concatenate the text 'Email' with the generated row number. This produces the values Email1, Email2, Email3 that the pivot column list matches on.
SELECT *
FROM (SELECT ClientCode,
NAME,
Email,
'Email'+ CONVERT(VARCHAR(50), Row_number() OVER(partition BY ClientCode ORDER BY email)) Emails
FROM dbo.Clients) T
PIVOT(Max (Email)
FOR Emails IN( [Email1],
[Email2],
[Email3])) T2
SQLFIDDLE DEMO
So my query:
SELECT Tags, COUNT(Tags) AS Listings
FROM Job
WHERE datepart(year, dateposted)=2013
GROUP BY Tags
ORDER BY Listings DESC
Outputs:
+----------------------+----------+
| Tags | Listings |
+----------------------+----------+
| java c++             | 41       |
| software development | 41       |
| java c++ c#          | 31       |
|                      | 25       |
| sysadmin             | 25       |
| see jd               | 24       |
| java c++ ood         | 23       |
| java                 | 23       |
+----------------------+----------+
I want it to come out like so:
+----------------------+----------+
| Tags | Listings |
+----------------------+----------+
| java                 | 118      |
| c++                  | 95       |
| ood                  | 23       |
| see                  | 24       |
| jd                   | 24       |
| software development | 41       |
| sysadmin             | 25       |
| c#                   | 31       |
+----------------------+----------+
How can I count each individual word in the field instead of the entire field? The tags column is nvarchar.
First, your table structure is awful. Storing data in a list like that is going to cause you headaches similar to what you are trying to do right now.
The problem with a split function is you have no idea what software development or other multi-word tags are - Is that one word or two?
I think the only way you will solve this is by creating a table with your tags or using a derived table similar to the following:
;with cte (tag) as
(
select 'java' union all
select 'c++' union all
select 'software development' union all
select 'sysadmin' union all
select 'ood' union all
select 'jd' union all
select 'see' union all
select 'c#'
)
select c.tag, count(j.tags) listings
from cte c
inner join job j
on j.tags like '%'+c.tag+'%'
group by c.tag
See SQL Fiddle with Demo. Using this you can get a result:
| TAG | LISTINGS |
| java | 9 |
| c++ | 10 |
| software development | 4 |
| sysadmin | 2 |
| ood | 6 |
| jd | 3 |
| see | 2 |
| c# | 1 |
The issue with the above, as was pointed out in the comments, is that LIKE matches substrings: if you also have separate tags such as software and development, any row tagged software development will match those as well.
The best solution to this problem would be to store the tags in a separate table similar to:
create table tags
(
tag_id int,
tag_name varchar(50)
);
Then you could use a JOIN table to connect your jobs to the tag:
create table tag_job
(
job_id int,
tag_id int
);
Once you have a setup similar to this, it becomes much easier to query your data:
select t.tag_name,
count(tj.tag_id) listings
from tags t
inner join tag_job tj
on t.tag_id = tj.tag_id
group by t.tag_name
See demo
You will probably need to split out the individual words.
Here's a good series on splitters in SQL Server:
SqlServerCentral.com
I don't see how you will be able to differentiate "software development" as a single tag though. If you have a list of acceptable tags elsewhere, you could probably use that to perform a count.
If you have a list of Available Tags, here is one approach that doesn't require a split.
Sql Fiddle Example
There could be an issue with this approach if you have a tag that is contained in another, e.g. 'software' and 'software development'.
This is how I solved my issue.
SELECT TOP 50 Tags.s Tag, COUNT(Tags.s) AS Listings
FROM Job
CROSS APPLY [dbo].[SplitString](Tags,' ') Tags
WHERE NOT Job.Tags IS NULL and datepart(year,job.datecreated) = 2013
GROUP BY Tags.s
ORDER BY Listings DESC
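The dbo.SplitString function used there is a user-defined function that is not shown. On SQL Server 2016 or later (database compatibility level 130 and up), the built-in STRING_SPLIT function could probably stand in for it. The following is a sketch under that assumption, using the Job, Tags and datecreated names from the query above and assuming datecreated is a date or datetime column:

```sql
-- Sketch: same count per word, using the built-in STRING_SPLIT.
-- STRING_SPLIT returns a single column named value.
SELECT TOP (50)
       s.value  AS Tag,
       COUNT(*) AS Listings
FROM Job AS j
CROSS APPLY STRING_SPLIT(j.Tags, ' ') AS s
WHERE j.Tags IS NOT NULL
  AND s.value <> ''                    -- skip empty pieces from repeated spaces
  AND j.datecreated >= '20130101'      -- range filter instead of datepart(),
  AND j.datecreated <  '20140101'      -- which lets an index on the column be used
GROUP BY s.value
ORDER BY Listings DESC;
```

Like any split on spaces, this still counts software and development as two separate tags.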