To Delete or Update (Impact on Storage) - sql-server

Looking for some advice.
Have a table where the average record size is 2.3 KB. Average 60,000 records per user per month, for a total of 180,000 to 200,000 per month. The table has 9 different indexes. All data is stored in the same table, separated by a FilingID.
Each month 3 users import their data and prepare it under unique FilingIDs. Once each individual has completed their process, the data needs to be combined under a single FilingID to be submitted. For example:
User A = FilingID 1
User B = FilingID 2
User C = FilingID 3
Combined = FilingID 4
Each month will have new FilingIDs and previous month’s data is retained.
As I see it I have 2 options.
1.) When all users have finished their prep, copy the data from FilingIDs 1-3 to FilingID 4. When FilingID 4 has been filed successfully, delete the data from FilingIDs 1-3.
2.) When all users have finished their prep, update the FilingIDs for 1-3 to FilingID 4.
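For illustration, the two options might look roughly like this in T-SQL (the table name `dbo.FilingData` and column `RecordData` are hypothetical stand-ins, since the real schema is controlled by the application):

```sql
-- Option 1: copy the prepared rows to the combined FilingID,
-- then delete the originals once FilingID 4 has been filed successfully.
INSERT INTO dbo.FilingData (FilingID, RecordData)
SELECT 4, RecordData
FROM dbo.FilingData
WHERE FilingID IN (1, 2, 3);

-- Later, after a successful filing:
DELETE FROM dbo.FilingData
WHERE FilingID IN (1, 2, 3);

-- Option 2: re-tag the existing rows in place.
UPDATE dbo.FilingData
SET FilingID = 4
WHERE FilingID IN (1, 2, 3);
```

Note that option 1 touches roughly twice as many rows (an insert plus a delete, each maintaining all 9 indexes), while option 2 updates a presumably indexed key column in place.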
I prefer option 1 for a number of reasons; however, I am concerned about what this will do to the size of my database (bloat, fragmented indexes, etc.). I don't understand the inner workings of the storage engine that well and would appreciate any insight anyone can provide.
NOTE I do not control the table schema and don’t have an option to use a different table as this is part of a larger application.

Related

What is the correct data structure for a workout journal?

I have trouble getting my head around the correct data structure for the following application. Although I am sure this is quite standard for people in the field.
I want to create, for learning purposes, a workout journal app. So the idea is to be able to log, each day, a particular workout.
A workout is comprised of exercises. And each exercise has particular attributes.
For example: Workout 1 is a strength session. So I will have e.g. dumbbell press, squats, ... which are all sets and reps based. So for workout 1 I need to be able to enter, for each exercise, the sets, reps and weight used for that particular workout.
Workout 2 is say a running session. This is time based and distance based. So for that workout 2 I need to be able to enter time and distance.
What would be the structure I need to have in my database for such application ?
I guess I should have an "exercise" table. Then this should somehow be a foreign key in the "workout" table. But how can the workout table accommodate varying attributes ? As well as varying number of entries ? (since a workout can be one, three or ten exercises) Also all this should constitute only one record of the "workout" table.
EDIT :
I have tried to come up with a structure. Could someone confirm or refute that this is the correct way to do this?
So the final result is the one below, for human representation :
Final Result (sport journal)

Date   | Timestamp | Exercise 1  | Sets | Reps | Weight | Exercise 2 | Time | Distance | Exercise 3 | Sets | Reps | Weight
120821 | 10.30     | Bench press | 5    | 10   | 40     | Run        | 60   | 400      | NULL       | NULL | NULL | NULL
120821 | 17.00     | Bench press | 5    | 10   | 40     | NULL       | NULL | NULL     | Squats     | 3    | 5    | 120
But I guess this can't really be achieved as such, as this is not (I think) possible in a relational database. So I need to have separate tables, and the human view shown above will be a join of those various tables. For example, one "record" of the human view can be obtained by a join of various tables based on the date and timestamp, i.e. the actual timing of the workout.
If that is correct, then I think a structure like this could work (at least, the ideas are there I think) :
Exercise database (so simply the list of exercises with their type which determines the attributes needed)
Name        | Type
Bench press | Setsreps
Run         | Run
Squats      | Setsreps
Attributes (the attributes depending on exercise type. Maybe this should be split further to avoid a varying number of columns depending on the exercise type, i.e. run vs setsreps?)
Attributes | Attribute1 | Attribute2 | Attribute3
setsreps   | Sets       | Reps       | Weight
Run        | Time       | Distance   | NULL
Carry      | Weight     | Distance   | NULL
Setsreps instances database (so the actual realization of the exercise on a certain day. This table will be huge !)
Date   | Timestamp | Exercise    | Sets | Reps | Weight
120821 | 10.30     | Bench press | 5    | 10   | 40
120821 | 17.00     | Squats      | 3    | 5    | 120
Run instances database (same as above but for run instances. Since a run instance has different attributes than a setsreps instance. Is this the correct way to do this ?)
Date   | Timestamp | Time | Distance
120821 | 10.30     | 60   | 400
170821 | 17.00     | 120  | 800
So then I could have the "human" view by performing a join of the setsreps & run tables on a particular date and timestamp (which together form a primary key).
Is this a correct way of thinking ?
Thanks for the support
A workout is comprised of exercises. And each exercise has particular attributes.
One way to do this is to have a Workout table, an Exercise table, and an Attribute table. The Wikipedia article, Database normalization, will help you understand how to create a normalized database.
I'm assuming that you're not going to share attributes between exercises. If you do, this changes the one-to-many relationship between exercises and attributes into a many-to-many relationship, and you would need a junction table to model it.
The Exercise table will have a foreign key for the Workout table. The Attribute table will have a foreign key for the Exercise table.
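A minimal sketch of those three tables, assuming PostgreSQL; the column names here are illustrative, not prescribed by the question:

```sql
CREATE TABLE workout (
    id         serial PRIMARY KEY,
    started_at timestamp NOT NULL          -- replaces the Date/Timestamp pair
);

CREATE TABLE exercise (
    id         serial PRIMARY KEY,
    workout_id integer NOT NULL REFERENCES workout (id),
    name       text NOT NULL               -- e.g. 'Bench press', 'Run'
);

CREATE TABLE attribute (
    id          serial PRIMARY KEY,
    exercise_id integer NOT NULL REFERENCES exercise (id),
    name        text NOT NULL,             -- e.g. 'sets', 'reps', 'weight', 'distance'
    value       numeric NOT NULL
);
```

With this shape, a workout can hold any number of exercises, and each exercise carries exactly the attributes it needs, so there are no NULL columns and no per-type instance tables.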

Product revenue cube (using SSAS) too big and slow browsing

Got a question regarding big size SSAS cube:
Scenario: .. building product revenue cube on SQL Server Analysis Server (SSAS) 2012:
the product hierarchy is a dimension with 7 levels: cat1, cat2, … cat7(product_id);
The fact table is at product id level, key = (customer id + bill month + product id), this gives around 200M rows per month, may still go up.
Concern: The prototype currently has only 3 months of data, and cube browsing is already slow. Dragging and dropping 2-3 dimensions can take a couple of minutes or more.
Later on, 30+ other dimensions need to be added, and the data extended to 1-2 years (2-3B rows). So we are concerned about performance: can that much data be handled with an acceptable browsing speed?
Question: Is there any better way (design, performance tuning) for the above scenario?
E.g. another thought is to make the fact table flat, i.e. make it customer level, key = (customer id + bill month), so one customer has only one row per bill month, while adding 40+ columns, one column for each product. That way the fact row count per month would go down to, say, 5M. But then we can't build/browse the product dimension anymore (which is an important requirement), can we?
Thanks in advance for any light shedding.

How to prevent overwriting records in a shared appointment book?

PostgreSQL 9.3
I am writing a standard appointment book. The book will have three columns per 15-minute interval and will have multiple clients writing to it simultaneously from different locations.
If two users, A and B, are making appointments at the same time, how do I prevent user B from overwriting user A's appointment? (That is, once user A has written an appointment, user B's information will be out of date: he won't know what user A did until he attempts to save the appointment record.) If the appointment time which user B wants to use has already been assigned, I want user B to be notified that it is not available, without overwriting user A.
I'm sure this is a basic and common problem, but I can't wrap my head around it.
Thanks for any help, suggestions, or references.
If you store the appointments as for example
id, customer_id, time
1 ,32, 2016-03-10 10:15:00
...
just add a unique index on the time column. Then you will get an error if you try to insert a row with the same time. If you want to allow more than one booking at the same time, you could add a time_slot column and include it in the index.
id, customer_id, time, time_slot
1 ,32, 2016-03-10 10:15:00, 1
2 ,15, 2016-03-10 10:15:00, 2
create unique index appointments_time_timeslot_idx on appointments (time, time_slot);
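Putting the pieces together, a sketch of how the collision surfaces (the table definition here is an assumption based on the sample rows): the second INSERT for the same (time, time_slot) fails with a unique_violation error (SQLSTATE 23505), which the application can catch and report to user B as "slot not available".

```sql
CREATE TABLE appointments (
    id          serial PRIMARY KEY,
    customer_id integer NOT NULL,
    "time"      timestamp NOT NULL,
    time_slot   integer NOT NULL
);

CREATE UNIQUE INDEX appointments_time_timeslot_idx
    ON appointments ("time", time_slot);

INSERT INTO appointments (customer_id, "time", time_slot)
VALUES (32, '2016-03-10 10:15:00', 1);   -- user A: succeeds

INSERT INTO appointments (customer_id, "time", time_slot)
VALUES (15, '2016-03-10 10:15:00', 1);   -- user B: fails, SQLSTATE 23505
```

Because the index is enforced by the database itself, this works no matter how many clients write concurrently; there is no window in which both inserts can succeed.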

One large table or many small ones in database?

Say I want to create a typical todo web app using a DB like PostgreSQL. A user should be able to create todo-lists. On these lists he should be able to make the actual todo-entries.
I regard the todo-list as an object which has different properties like owner, name, etc, and of course the actual todo-entries which have their own properties like content, priority, date ... .
My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the question which arises is how to store the todo-entries themselves? Of course in an additional table, but should I rather:
1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:
todo-list: id, owner, ...
todo-entries: list.id, content, ...
which would give 2 tables in total. The todo-entries table could get very large. Although we know that entries expire, hence the table only grows with more usage but not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id where id is the id of the list we are trying to retrieve.
OR
2. Create a todo-entries table on a per user basis.
todo-list: id, owner, ...
todo-entries-owner: list.id, content,. ..
The number of entries tables depends on the number of users in the system. Something like SELECT * FROM todo-entries-owner. Mid-sized tables, depending on the number of entries users make in total.
OR
3. Create one todo-entries-table for each todo-list and then store a generated table name in a field for the table. For instance could we use the todos-list unique id in the table name like:
todo-list: id, owner, entries-list-name, ...
todo-entries-id: content, ... //the id part is the id from the todo-list id field.
In the third case we could potentially have quite a large number of tables. A user might create many 'short' todo-lists. To retrieve the list we would then simply go along the lines SELECT * FROM todo-entries-id where todo-entries-id should be either a field in the todo-list or it could be done implicitly by concatenating 'todo-entries' with the todos-list unique id. Btw.: How do I do that, should this be done in js or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to have the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.
The three possibilities go from few large to many small tables. My personal feeling is that the second or third solution is better. I think they might scale better. But I'm not quite sure of that, and I would like to know what the 'typical' approach is.
I could go more in depth of what I think of each of the approaches, but to get to the point of my question:
Which of the three possibilities should I go for? (or anything else, has this to do with normalization?)
Follow up:
What would the (PostgreSQL) statements then look like?
The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.
Imagine you have 1 million users, with an average of 3 to-do lists each, with an average of 5 entries per list.
Scenario 1
In the first scenario you have three tables:
todo_users: 1 million records
todo_lists: 3 million records
todo_entries: 15 million records
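A minimal sketch of that schema (the column names are assumptions consistent with the queries below, not part of the question):

```sql
CREATE TABLE todo_users (
    id       serial PRIMARY KEY,
    username text NOT NULL UNIQUE   -- the index on username used below
);

CREATE TABLE todo_lists (
    id     serial PRIMARY KEY,
    userid integer NOT NULL REFERENCES todo_users (id),
    name   text NOT NULL
);

CREATE TABLE todo_entries (
    id      serial PRIMARY KEY,
    listid  integer NOT NULL REFERENCES todo_lists (id),
    content text NOT NULL
);

-- Indexes on the foreign keys keep the per-user and per-list lookups fast
CREATE INDEX ON todo_lists (userid);
CREATE INDEX ON todo_entries (listid);
```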
Such table sizes are no problem for PostgreSQL, and with the right indexes you will be able to retrieve any data in less than a second. That applies to simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made less than 3 todo_lists in the 3-month period with the highest todo_entries entered), it will obviously be slower, as in the other scenarios. The queries are very straightforward:
-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;
-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;
-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
You can also combine the three queries into one:
SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;
Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).
Inserting, updating and deleting records can be done with very similar statements and similarly fast.
PostgreSQL stores data in "pages" (8 kB in size by default) and most pages will be filled, which is a good thing because reading and writing pages are very slow compared to other operations.
Scenario 2
In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.
1 million todo_lists tables with a few records each
1 million todo_entries tables with a few dozen records each
The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:
username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';
And then you query with those table names. More likely you will need a todo_users table anyway to store personal data, usernames and passwords of your 1 million users.
In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot optimize a query plan. A bigger problem is creating the tables for new users (todo_list and todo_entries) or deleting obsolete lists or users. This typically requires behind-the-scenes housekeeping that you avoid with the first scenario. And the biggest performance penalty is that most pages will have little content, so you waste disk space and lots of time reading and writing those partially filled pages.
Scenario 3
This scenario is even worse than scenario 2. Don't do it, it's madness.
3 million todo_entries tables with a few records each
So...
Stick with option 1. It is your only real option.

correct approach to store in database

I'm developing an online website (using Django and MySQL). I have a Tests table and a User table.
I have 50 tests within the table and each user completes them at their own pace.
How do I store the status of the tests in my DB?
One idea that came to mind is to create an additional column in the User table, containing the completed test IDs separated by commas (or some other delimiter).
userid | username | testscompleted
1      | john     | 1, 5, 34
2      | tom      | 1, 10, 23, 25
Another idea was to create a separate table to store userid and testid. So I'll have only 2 columns but thousands of rows (number of tests × number of users), and they will always continue to increase.
userid | testid
1      | 1
1      | 5
2      | 1
1      | 34
2      | 10
Your second option is vastly preferred... your first solution breaks normalization rules by trying to store multiple values in a single field, which will cause you headaches down the road.
The second table will not only be easier to maintain when trying to add or remove values, but will also likely perform better since you'll be able to effectively index those columns.
There are two phrases that should automatically disqualify any database design idea.
create one table per [anything]
a column containing [anything] separated by a comma
Separate table, two columns--you're on the right track there.
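For concreteness, a sketch of that two-column junction table in MySQL, with a composite primary key that both prevents duplicate completions and serves as the index mentioned above (the referenced table and column names are assumptions):

```sql
CREATE TABLE user_tests (
    userid INT NOT NULL,
    testid INT NOT NULL,
    PRIMARY KEY (userid, testid),
    FOREIGN KEY (userid) REFERENCES users (id),
    FOREIGN KEY (testid) REFERENCES tests (id)
);

-- Mark a test completed
INSERT INTO user_tests (userid, testid) VALUES (1, 5);

-- All tests completed by user 1
SELECT testid FROM user_tests WHERE userid = 1;
```

Adding or removing a completion is a single-row INSERT or DELETE, with no string parsing of a delimited column.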