db optimization: computing rank - database

This question asks how to select a user's rank by his id.
id  name   points
1   john   4635
3   tom    7364
4   bob    234
6   harry  9857
The accepted answer is
SELECT uo.*,
(
SELECT COUNT(*)
FROM users ui
WHERE (ui.points, ui.id) >= (uo.points, uo.id)
) AS rank
FROM users uo
WHERE id = @id
which makes sense. I'd like to understand the performance tradeoffs between this approach and modifying the db structure to store a calculated rank (I guess that would require massive changes every time there's a rank change), or any other approaches that I'm too new to think of. I'm a db noob.

The performance tradeoff would basically be what you described:
If you modified the structure to store a rank, queries would be very, very simple and fast. However, this would require some overhead any time "points" changed, as you'd have to verify that the rank hasn't changed. If the ranking had changed, you'd have to do multiple updates.
This causes more work (with the potential for bugs) at every update/insert. The tradeoff is very fast reads. If your typical usage is very few modifications compared to millions of reads, AND you found this query to be a bottleneck, it might be worth considering reworking this. However, I would avoid the added complexity and maintainability headaches unless you truly found this to be a problem, since the current solution requires less storage and is very flexible.
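As a rough illustration of the compute-on-read side of that tradeoff, here is a minimal sketch using Python's built-in sqlite3 module (SQLite stands in for MySQL here, which is an assumption). The helper name rank_of and the index name idx_users_points are hypothetical. The row comparison from the accepted answer is written out in its expanded form so it does not depend on row-value support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "john", 4635), (3, "tom", 7364), (4, "bob", 234), (6, "harry", 9857)],
)
# Hypothetical index: it may let the engine answer the count from the index,
# but the query still has to visit every entry ranked at or above the user.
conn.execute("CREATE INDEX idx_users_points ON users (points, id)")


def rank_of(conn, user_id):
    """Rank computed at read time: count users at or above this one, ties broken by id."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM users uo
        JOIN users ui
          ON ui.points > uo.points
          OR (ui.points = uo.points AND ui.id >= uo.id)
        WHERE uo.id = ?
        """,
        (user_id,),
    ).fetchone()
    return row[0]


for user_id in (6, 3, 1, 4):
    print(user_id, rank_of(conn, user_id))
```

Nothing extra is stored and no write ever has to touch other rows; the cost is paid on each read instead, and it grows with the number of users ranked above the one being looked up.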

The link you reference is a MySQL question. If the original database had been Oracle the accepted answer would be to use an analytic function, which does scale, very nicely:
SQL> select id, name, points from users order by id
2 /
ID         NAME       POINTS
---------- ---------- ----------
1          john       4635
3          tom        7364
4          bob        234
6          harry      9857
8          algernon   1
9          sebastian  234
10         charles    888
7 rows selected.
SQL> select name, id, points, rank() over (order by points)
2 from users
3 /
NAME       ID         POINTS     RANK()OVER(ORDERBYPOINTS)
---------- ---------- ---------- -------------------------
algernon   8          1          1
bob        4          234        2
sebastian  9          234        2
charles    10         888        4
john       1          4635       5
tom        3          7364       6
harry      6          9857       7
7 rows selected.
SQL> select name, id, points, dense_rank() over (order by points desc)
2 from users
3 /
NAME       ID         POINTS     DENSE_RANK()OVER(ORDERBYPOINTSDESC)
---------- ---------- ---------- -----------------------------------
harry      6          9857       1
tom        3          7364       2
john       1          4635       3
charles    10         888        4
bob        4          234        5
sebastian  9          234        5
algernon   8          1          6
7 rows selected.
SQL>
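
The same analytic functions are not Oracle-only. As a sketch, assuming a SQLite build recent enough to support window functions (3.25 or later), the session above can be reproduced with Python's sqlite3 module. The column aliases rnk and dense_rnk are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "john", 4635), (3, "tom", 7364), (4, "bob", 234), (6, "harry", 9857),
        (8, "algernon", 1), (9, "sebastian", 234), (10, "charles", 888),
    ],
)

# RANK() leaves a gap after ties; DENSE_RANK() does not.
rows = conn.execute(
    """
    SELECT name,
           RANK()       OVER (ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk
    FROM users
    ORDER BY points DESC, id
    """
).fetchall()

for name, rnk, dense_rnk in rows:
    print(f"{name:<10} {rnk:>3} {dense_rnk:>3}")
```

bob and sebastian tie on 234 points, so both get rank 5; algernon then gets 7 under RANK() and 6 under DENSE_RANK().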

Does not the 'where' portion of that query internally require reading the entire table? I understand about premature optimization. Academically, it seems that this wouldn't scale further than a few thousand rows.

Related

How to update in master-detail tables

Example:
Master
ID  Student
1   Cindy
2   Barbie
Detail
ID  ID_FK  Subject
1   1      Math
2   1      Science
3   1      English
4   2      English
5   2      History
Scenario: if I update Barbie to have three subjects (Math, Science and English), should I delete all her records first and then add the new ones, or is there another way to do this? Thanks.

Table Design for Data with multiple Granularity

I have been trying to understand data modelling and warehousing. Given the best practice of having only one level of granularity per table, can I have one table store both the low-granularity data and the aggregate data?
Two Table Structure
Table 1
TransactionID  Transaction_Dt  ProductID  Items  Cost_Total
11111          1/1/2020        1          10     100
11111          1/1/2020        2          5      200
11111          1/1/2020        3          4      400
11111          1/1/2020        4          5      100
11111          1/1/2020        5          6      600
11111          1/1/2020        6          10     100
Table 2 (Aggregated)
TransactionID  Transaction_Dt  Total_Items  Cost_Total
11111          1/1/2020        40           1500
One table Structure
Aggregate Data in the table
TransactionID  Transaction_Dt  ProductID  Items  Cost_Total  Type
11111          1/1/2020        1          10     100         ind_Item
11111          1/1/2020        2          5      200         ind_Item
11111          1/1/2020        3          4      400         ind_Item
11111          1/1/2020        4          5      100         ind_Item
11111          1/1/2020        5          6      600         ind_Item
11111          1/1/2020        6          10     100         ind_Item
**11111        1/1/2020        ALL        40     1500        all_Item**
Here we have one record for the entire transaction with the sum of all items and sum of all cost.
Can anyone help me with the cons of the second approach, where we have aggregated data in the same table?
Some thoughts about this:
I am not a fan of storing data at multiple levels of aggregation in one table for the reason that @Marmite Bomber suggested - if you do a select sum and don't filter out the aggregates, you'll get a multiple of the answer you're looking for.
If you do still want to put everything into one table, I'd add another column, perhaps called agg_level, and indicate the aggregation level of the row in that table. (you are already kind of doing this with your 'Type' column although Type is a very ambiguous term).
I'd recommend against changing the TransactionID value (you propose adding some asterisks to indicate that it's an aggregate). Modifying it will make it harder to search for the one you want and users will have to understand your notation to get the right records. If you do add an agg_level column and leave the TransactionIDs in their original form, you could put an easily recognizable term in the agg_level column. For example, a record could say "raw", or "Transaction Total", or "Monthly Aggregate"...
If you have to put your aggregates into your base data table, like you've shown, you should consider creating views on top of the table, each view providing only detail at one level of aggregation. You likely would give users access only to these views, and not to the base data. In this way, you store everything in one table but, to users, it looks like you have multiple tables, and you needn't worry about a user accidentally writing a malformed query that brings back duplicate totals.
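A minimal sketch of that view-per-level idea, using Python's sqlite3 module. The table name transactions, the view names, and the agg_level values 'raw' and 'transaction_total' are all hypothetical; the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        transaction_id INTEGER,
        transaction_dt TEXT,
        product_id     TEXT,     -- 'ALL' on the aggregate row, as in the question
        items          INTEGER,
        cost_total     INTEGER,
        agg_level      TEXT      -- hypothetical: 'raw' or 'transaction_total'
    )
    """
)
detail = [(1, 10, 100), (2, 5, 200), (3, 4, 400), (4, 5, 100), (5, 6, 600), (6, 10, 100)]
rows = [(11111, "1/1/2020", str(p), i, c, "raw") for p, i, c in detail]
rows.append((11111, "1/1/2020", "ALL", 40, 1500, "transaction_total"))
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)", rows)

# One view per aggregation level; users would query these, not the base table.
conn.execute(
    "CREATE VIEW v_transaction_items AS "
    "SELECT * FROM transactions WHERE agg_level = 'raw'"
)
conn.execute(
    "CREATE VIEW v_transaction_totals AS "
    "SELECT * FROM transactions WHERE agg_level = 'transaction_total'"
)

naive = conn.execute("SELECT SUM(cost_total) FROM transactions").fetchone()[0]
safe = conn.execute("SELECT SUM(cost_total) FROM v_transaction_items").fetchone()[0]
print(naive, safe)
```

The unfiltered sum over the base table counts the transaction twice (3000), while the same sum through the detail view gives the real total (1500).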
It's a good question, Snehasish, and it shows that you've been giving it some thought. Best of luck as you navigate the need going forward!

How to design a Db table for attendance

I am currently working on a school management system but can't seem to figure out the best way to design my student attendance table.
INFO
School runs for 14 weeks and classes are held 5 times a week. There can be up to 2000 students per term, meaning attendance can be up to 14 x 5 x 2000 = 140,000 rows per term.
I am developing the application for a desktop using VB.Net and MS Access.
PROGRESS SO FAR
I have so far designed something that I am skeptical about.
table name: attendance
_____________________________________________
| id |std_id | att_week | att_date | status |
''''''''''''''''''''''''''''''''''''''''''''''
| 1 | 0001 | 1 |29/9/2015 | yes |
''''''''''''''''''''''''''''''''''''''''''''''
| 2 | 0002 | 1 |29/9/2015 | yes |
''''''''''''''''''''''''''''''''''''''''''''''
I quickly found that designing it like this can easily yield 140,000 rows in a term.
I also thought of making the week days column names, which would result in 14 x 5 = 70 columns.
What is the best way to design this table?
Friend, I think you should construct your table like this:
Table would accept only the absentees
id student_id class date
________________________________________
1 11 7a 11/11/2020
2 21 6b 10/12/2020
and so on.....
You could easily retrieve details like
1] total absentees per class
2] total absent of a student in date range
3] Per day report of attendance of student can be easily prepared based on this data
Also, this would be extremely fast because there are far fewer records, especially if you index on class and partition the table by date range.
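Here is a small sketch of the absentee-only table and the first two reports, using Python's sqlite3 module in place of MS Access (an assumption; the SQL itself is plain enough to carry over). The table name absences, the index name, the helper names, and the sample rows are hypothetical. Dates are stored as ISO yyyy-mm-dd strings so that date ranges compare correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE absences (
        id         INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        class      TEXT    NOT NULL,
        date       TEXT    NOT NULL,      -- ISO yyyy-mm-dd
        UNIQUE (student_id, date)         -- at most one absence per student per day
    )
    """
)
conn.execute("CREATE INDEX idx_absences_class_date ON absences (class, date)")
conn.executemany(
    "INSERT INTO absences (student_id, class, date) VALUES (?, ?, ?)",
    [(11, "7a", "2020-11-11"), (11, "7a", "2020-11-12"), (21, "6b", "2020-12-10")],
)


def absences_for_student(conn, student_id, start, end):
    """Total absences of one student within an inclusive date range."""
    return conn.execute(
        "SELECT COUNT(*) FROM absences WHERE student_id = ? AND date BETWEEN ? AND ?",
        (student_id, start, end),
    ).fetchone()[0]


def absentees_per_class(conn, day):
    """Number of absentees in each class on a given day."""
    return dict(
        conn.execute(
            "SELECT class, COUNT(*) FROM absences WHERE date = ? GROUP BY class",
            (day,),
        ).fetchall()
    )


print(absences_for_student(conn, 11, "2020-11-01", "2020-11-30"))
print(absentees_per_class(conn, "2020-11-11"))
```

One thing to keep in mind with this design: a missing row means "present", so the application still needs to know on which days class was actually held.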
Thank You!

How do I normalize data in a database?

I'm in an intro to database management course and we're learning about normalizing data (1NF, 2NF, 3NF, etc.) and I'm super confused on how to actually go about and do it. I've read up on this, consulted various sites and youtube videos and I still can't seem to get it to click. I am using Microsoft Access 2013 if that's of any help.
This is the data I'm working with.
Thanks.
Edit1: Alright, I think I have the tables set up correctly. But now I'm having trouble actually inputting data to go from one table to the next. Here's my relationship table.
On a very basic level, any repeating values in a table are candidates for normalization. Duplicated data is usually a bad idea. Say you needed to update a patient's surname - you now have to update all the occurrences in this table, and possibly many others throughout the rest of the database. Much better to store each patient's details in one place only.
This is where normalization comes in. Looking down the columns, you can see that there are repeating values for data about dentists, patients and surgeries, so we should normalize towards having tables for each of these entities, as well as the original table that contains appointments, giving you four tables in total.
Extract the entities out into their own tables, and give each row a primary (unique) key - just use an incrementing integer for now. (Edit: as suggested in the comment we could use the natural keys of PatientNo, StaffNo and SurgeryNo instead of creating surrogates.)
Then, instead of each patient's name and number appearing multiple times in the appointments table, we just reference the key of the master record in the Patient table. This is called a foreign key.
Then, do the same for Dentist and Surgery.
You will end up with tables looking something like this:
APPOINTMENT
AppointmentID  DentistID  PatientID  AppointmentTime  SurgeryID
----------------------------------------------------------------
1              1          1          12 Aug 03 10:00  1
2              1          2          ...              2
3              2          3          ...              1
4              2          3          ...              1
5              3          2          ...              2
6              3          4          ...              3
DENTIST
DentistID  Name           StaffNo
--------------------------------------
1          Tony Smith     S1011
2          Helen Pearson  S1024
3          Robin Plevin   S1032
PATIENT
PatientID  Name           PatientNo
---------------------------------------
1          Gillian White  P100
2          Jill Bell      P105
3          Ian MackKay    P108
4          John Walker    P110
SURGERY
SurgeryID  SurgeryNo
-------------------------
1          S10
2          S15
3          S13
The first step in data modelling and normalization is to understand your data. Study it and understand the domain "objects" or tables that exist within your model. That will give you an idea of how to start. Sometimes a single table or query sample is not enough to fully understand the database, but in your case, we can use the sample data and make some assumptions.
Secondly, look for repeated / redundant data. If you see copies of names, there is a good chance that is a candidate for a foreign key. Our assumption tells us that STAFF_NO is a primary key candidate for DENTIST because each unique STAFF_NO correlates to a unique DENTIST_NAME, so I see a good candidate DENTIST table (STAFF_NO, DENTIST_NAME)
Example in some table of SURGERY:
ID STAFF_NO DENTIST_NAME
1 1 Fred Sanford
2 1 Fred Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
Why store these over and over? What happens when Fred says "But my correct name is Fred G Sanford", so you have to update your database? In the current table, you have to update the name in many rows. If you had normalized it, you'd have a single location for the name, in the DENTIST table.
So I can take the unique dentists and store them in DENTIST
create table DENTIST(staff_no integer primary key, dentist_name varchar(100));
-- One possible way to populate our dentist table is to use a distinct query from surgery
insert into DENTIST
select distinct staff_no, dentist_name from surgery;
STAFF_NO DENTIST_NAME
1 Fred Sanford
3 Lamont Sanford
SURGERY table now points to DENTIST table
ID STAFF_NO
1 1
2 1
3 3
4 3
And you can now create a view, VIEW_SURGERY, to join the DENTIST_NAME back in to satisfy the needs of typical queries.
create view view_surgery as
select s.id, d.staff_no, d.dentist_name
from surgery s join dentist d
on s.staff_no = d.staff_no; -- join here
So now an update to DENTIST by the dentist primary key will update a single row.
update dentist set dentist_name = 'Fred G Sanford' where staff_no = 1;
And querying the view will show the updated name for N rows:
select * from view_surgery
ID STAFF_NO DENTIST_NAME
1 1 Fred G Sanford
2 1 Fred G Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
In short, you are removing redundancy.
This is just a sample, and one way to do it. Manual normalization like this is not as common when you have modelling tools, but the point is, we can look at data, spot redundancies and factor those redundancies into new tables, and relate those new tables by foreign keys and joins, then build views to represent the original data.
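For anyone who wants to try the whole walkthrough in one go, here is a sketch using Python's sqlite3 module. The name surgery_raw for the original denormalized table is hypothetical, and SQLite is only a stand-in for whatever database you are actually using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The original table, with the dentist name repeated on every row.
    CREATE TABLE surgery_raw (id INTEGER PRIMARY KEY, staff_no INTEGER, dentist_name TEXT);
    INSERT INTO surgery_raw VALUES
        (1, 1, 'Fred Sanford'), (2, 1, 'Fred Sanford'),
        (3, 3, 'Lamont Sanford'), (4, 3, 'Lamont Sanford');

    -- Factor the repeated names out into their own table.
    CREATE TABLE dentist (staff_no INTEGER PRIMARY KEY, dentist_name TEXT);
    INSERT INTO dentist SELECT DISTINCT staff_no, dentist_name FROM surgery_raw;

    -- The normalized table keeps only the foreign key.
    CREATE TABLE surgery (
        id       INTEGER PRIMARY KEY,
        staff_no INTEGER NOT NULL REFERENCES dentist (staff_no)
    );
    INSERT INTO surgery SELECT id, staff_no FROM surgery_raw;

    -- The view joins the name back in for typical queries.
    CREATE VIEW view_surgery AS
        SELECT s.id, d.staff_no, d.dentist_name
        FROM surgery s JOIN dentist d ON s.staff_no = d.staff_no;
    """
)

# A single-row update now changes the name everywhere it is shown.
cur = conn.execute("UPDATE dentist SET dentist_name = 'Fred G Sanford' WHERE staff_no = 1")
updated = cur.rowcount
rows = conn.execute("SELECT * FROM view_surgery ORDER BY id").fetchall()
for row in rows:
    print(row)
```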

how to see a difference between entity and a column

Sometimes I have a hard time seeing the difference between an entity and a column when I am starting to make a diagram. I don't know when something is supposed to be an entity or a column. For example, in some game a user can play alone or in a group. Would you make that two different entities, User and GroupUser?
Also, for example, if the User has levels, status and badges they earn as part of the game, would these be entities too, or would they just be columns of the User entity?
Entity could be a Person (e.g. Student), Place (e.g. Room Name), Object (e.g. Books), Abstract Concept (e.g. Course, Order) that could be represented in your database and normally could become a Table in your Database.
Column(s) on the other hand is/are the attribute(s) of your Entity.
So, in your case you have a User entity and the possible columns or attributes (or fields) are
UserID, UserLevel, UserStatus, Badges, PlayStatus (values could be individual or group).
Badges, although a column here, could turn into an entity if it violates the normalization rules.
For example if you have this Table for User:
Table: Users
UserID  UserName     UserStatus  PlayStatus  Badges
------  --------     ----------  ----------  ------
1       Surefire     Active      Single      Private, Warrior, Platoon Leader
2       FastMachine  Active      Group       Private, Warrior
3       BeatTheGeek  Inactive    Group       Private
The Badges column here violates 1NF (First Normal Form), which says that there should be no repeating groups or, in this case, no multi-valued columns. So, this could be normalized like:
Table: Users
UserID  UserName     UserStatus  PlayStatus
------  --------     ----------  ----------
1       Surefire     Active      Single
2       FastMachine  Active      Group
3       BeatTheGeek  Inactive    Group
Table: Badges
BadgeID BadgeName
------ --------
1 Private
2 Indie
3 Warrior
4 Platoon Leader
5 Colonel
6 1 Star General
7 2 Star General
8 3 Star General
9 4 Star General
10 5 Star General
11 Hero
Table: UserBadgesHistory
UserID BadgeID ReceiveDate
------ -------- -----------
1 1 12/01/2013
1 3 12/05/2013
1 4 1/5/2014
2 1 2/5/2014
2 3 2/10/2014
3 1 11/10/2013
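To see why the separate tables are easier to work with than the comma-separated column, here is a sketch using Python's sqlite3 module. Only a subset of the rows above is loaded (users 1 and 2 and the badges they hold), the snake_case names are my own, and the dates are rewritten as ISO strings on the assumption that the originals are month/day/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users  (user_id  INTEGER PRIMARY KEY, user_name  TEXT);
    CREATE TABLE badges (badge_id INTEGER PRIMARY KEY, badge_name TEXT);
    CREATE TABLE user_badges_history (
        user_id      INTEGER REFERENCES users (user_id),
        badge_id     INTEGER REFERENCES badges (badge_id),
        receive_date TEXT,                      -- ISO yyyy-mm-dd (assumed reading)
        PRIMARY KEY (user_id, badge_id)
    );
    INSERT INTO users  VALUES (1, 'Surefire'), (2, 'FastMachine'), (3, 'BeatTheGeek');
    INSERT INTO badges VALUES (1, 'Private'), (3, 'Warrior'), (4, 'Platoon Leader');
    INSERT INTO user_badges_history VALUES
        (1, 1, '2013-12-01'), (1, 3, '2013-12-05'), (1, 4, '2014-01-05'),
        (2, 1, '2014-02-05'), (2, 3, '2014-02-10');
    """
)

# "Who holds the Warrior badge?" is a plain join, with no string parsing needed.
warriors = [
    r[0]
    for r in conn.execute(
        """
        SELECT u.user_name
        FROM users u
        JOIN user_badges_history h ON h.user_id = u.user_id
        JOIN badges b ON b.badge_id = h.badge_id
        WHERE b.badge_name = 'Warrior'
        ORDER BY u.user_id
        """
    )
]

# One user's badges in the order they were received.
badges_of_user_1 = [
    r[0]
    for r in conn.execute(
        """
        SELECT b.badge_name
        FROM user_badges_history h
        JOIN badges b ON b.badge_id = h.badge_id
        WHERE h.user_id = ?
        ORDER BY h.receive_date
        """,
        (1,),
    )
]
print(warriors, badges_of_user_1)
```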
In general, an entity has multiple columns (i.e. attributes) of its own, and a column (or attribute) does not.
In your example, if the only data you're interested in storing is a User's current level, then level is unlikely to be an entity. This is because it would have only a single attribute of name/number. If you wanted to find all Users currently at level 4, you would simply do a query with level = 4.
On the other hand, if you had a reason to add additional data about the level, such as what abilities are associated with that level or the date a given User achieved the level, then you would want to make Level a separate entity.
A Level entity would have an ID, a number or name, and whatever other attributes you need as data.
ID | Prerequisite | Ability
----+--------------+--------------
1 | NULL | May gain foos
2 | Gain 10 foos | May gain bars
3 | Gain 20 bars | 30 free foos
In a fully normalized state, you would have another entity called UserLevel in which you would store data about, for example, when a certain User gained a level.
The UserLevel entity would contain the LevelID and the UserID as foreign keys (links back to the other entities), and a DateAchieved column for when the User achieved the level.
LevelID | UserID | DateAchieved
---------+--------+-------------
1 | 1 | 2014-02-01
1 | 2 | 2014-02-01
2 | 1 | 2014-02-05
3 | 1 | 2014-02-09
2 | 2 | 2014-02-11
4 | 1 | 2014-02-13
This shows User 1 and User 2 starting at Level 1 on the same day and leveling up at different rates.
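With the history in its own table, a User's current level becomes something you query for rather than store. A sketch with Python's sqlite3 module, assuming (my assumption, not stated above) that a higher LevelID always means a later level; the snake_case names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_level (
        level_id      INTEGER NOT NULL,
        user_id       INTEGER NOT NULL,
        date_achieved TEXT    NOT NULL,
        PRIMARY KEY (level_id, user_id)
    );
    INSERT INTO user_level VALUES
        (1, 1, '2014-02-01'), (1, 2, '2014-02-01'), (2, 1, '2014-02-05'),
        (3, 1, '2014-02-09'), (2, 2, '2014-02-11'), (4, 1, '2014-02-13');
    """
)

# Current level per user = the highest level that user has reached.
current_level = dict(
    conn.execute(
        "SELECT user_id, MAX(level_id) FROM user_level GROUP BY user_id"
    ).fetchall()
)

# All users currently at level 4.
at_level_4 = [
    r[0]
    for r in conn.execute(
        "SELECT user_id FROM user_level GROUP BY user_id HAVING MAX(level_id) = 4"
    )
]
print(current_level, at_level_4)
```

Compare this with the single-column design, where the same question is just level = 4: the extra entity buys you the history at the cost of a slightly more involved query.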

Resources