One large table or break them down? (Optimal database)

I've looked around trying to figure out the best way to handle my database, and I have some questions.
Is it better to have a large table or separate tables? Is there any real difference as far as server load?
I paid a guy to put together a database and some PHP so I could break into it, and he made two separate tables for what, I think, should be one. That seems redundant, and I don't like repeating myself.
Basically, x_content is: ID | section | page | heading | content
and x_menu is ID | PARENT | LINK | DISPLAY | HASCHILD
(Personally, the all-caps bugs me. I think I'll go standard case, since everything else on the site is all-lowercase or camelCase in script.)
Anyway, ID, heading/DISPLAY, and page/LINK are more or less (or at least can be) the same. Seems to me I'd be doing myself a favor by combining these and adding the rest of what I want.
What I'd Like: ID | Category | Name | URL | Description | Keywords | Content | Theme
So- should I delete the x_menu and combine them?
*If I were to link all of the pages in my site right now, it would be something like 40+.

If the menu is built dynamically in your application, I would suggest having a separate table for the menu, for easier maintenance.
However, in your combined table I don't see any "menu" column. I don't know whether you intend URL to replace it, but the difference I see is that every page has a URL, while a page may or may not be included in the menu.
If the menu is not built dynamically, I think more information is needed to decide whether the menu should be a separate entity.

Related

Pivot Table for Wagtail Form Builder for polling and voting purposes

I am trying to use Wagtail Form Builder for voting and polling purposes, and HighCharts to display the results interactively on the webpage.
The problem is that the Wagtail FormSubmission class only stores the information of each individual vote:
| vote user | question 1 | question 2 |
|-----------|------------|------------|
| jason     | A          | C          |
| lily      | D          | B          |
But I want to get information like:
how many users voted for A, B, C, or D on questions 1 and 2 respectively, and who those users are. Essentially, a pivot table over the FormSubmission results.
I understand I can use a QuerySet API aggregation to get what I want, but I do not want to perform this expensive manipulation every time a user visits the webpage.
I am thinking about using class-level attributes to achieve this.
Q: What is the best practice for storing those aggregation results in the DB and updating them every time a vote is submitted?
Wagtail form builder is not really suitable for this task. It's designed to allow non-programmers to construct forms for simple data collection - where it just needs to be stored and retrieved, with no further processing - without having to know how to define a Django model. To achieve this, all the data is stored in the FormSubmission model in a single field as JSON text, so that the same model can be re-used for any form. Since this isn't a format that the database engine understands natively, there's no way to perform complex queries on it efficiently - the only way is to unpack each submission individually and run calculations on it in Python code, which is going to be less efficient than any queryset functionality.
Instead, I would recommend writing a custom Django app for this. The tutorial in the Django documentation is a polls app, which should give you some idea of the way to go about it, but in short, you'll most likely need three models: a Question model containing the text of each question, an AnswerChoice model where each item is one of the possible answers for one question, and a Response model indicating which AnswerChoice a given user has chosen. With these models, you'll be able to perform queries such as "how many users answered A for question 1" with a queryset expression such as:
Response.objects.filter(question=1, answer_choice='A').count()
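For reference, a rough sketch of the tables those three models would map to, with the same count expressed in SQL (table and column names here are illustrative assumptions, not what Django would generate):
CREATE TABLE question (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(255) NOT NULL
);
CREATE TABLE answer_choice (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    question_id INT NOT NULL REFERENCES question(id),
    label       VARCHAR(10) NOT NULL    -- e.g. 'A', 'B', 'C', 'D'
);
CREATE TABLE response (
    id               INT AUTO_INCREMENT PRIMARY KEY,
    user_name        VARCHAR(100) NOT NULL,
    answer_choice_id INT NOT NULL REFERENCES answer_choice(id)
);
-- "How many users answered A for question 1":
SELECT COUNT(*)
FROM response r
JOIN answer_choice ac ON ac.id = r.answer_choice_id
WHERE ac.question_id = 1 AND ac.label = 'A';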

Is it good practice to store UI logic in the database?

Recently I started working at this company, and they showed me this "framework" they use within the company for their projects. The main goal is to drive everything from the database, because that way it can be changed later on without touching the code, only the database, which is "easier". It looks like the following:
|id |component_id |default |read_only|visible|form_id|
|---|-------------|---------|---------|-------|-------|
|1 |2 |now() + 4|false |true |3 |
|2 |1 |null |false |true |4 |
|3 |5 |null |true |true |1 |
The component_id points to a component table where they define each field (date pickers, inputs, selects, and so on), and the form_id points to another table listing the different forms in the app (register, login, add_order, and so on).
This is an over-simplification; they have more columns and more tables just to display the data in the UI this way, plus actions that trigger different things.
So my question is: is this good practice? The code looks very complicated just to allow this, and the database is a mess with a lot of different tables that only store logic. Or is this something people use very often that I just haven't encountered before?
We're using Dart/Flutter on the front end and I love strongly typed languages, but this throws the strong typing away: we're left guessing what value the DB will hand us, and we have a huge file of switch statements checking which component to render and applying all the other values.
I think it's easier to just write the code when needed, since it's simpler and better to look at than trying to figure out all this madness... Am I right?
This is a perfect example of over-engineering. There are numerous issues with this design. One of the main ones is the amount of risk that this introduces. Not only does it make development a nightmare, but it also allows developers to bypass any sort of risk-controls such as code scans. It also introduces a possible attack vector as it relies on an external mutable source for runtime behavior.
Data from your database should be just that, data. The business layer should be a stable set of logical instructions that manipulates that data. The less cross-over the better.
This kind of design also introduces problems with what amounts to versioning your dataset as you would your codebase. Then you have to make sure they sync up together.
Unfortunately, you probably have the original architect of this nightmare still around where you work, or your development team has gotten so used to such lax risk controls that a transition to a proper design will be like pulling teeth. If you are aiming to eventually push them in the right direction, your best bet would be to present the issues as a matter of risk versus reward and have a solution ready to propose.

When is it OK to blur the abstraction between data and logic?

I mean referring to specific database rows by their ID, from code, or specifying a class name in the database. Example:
You have a database table called SocialNetwork. It's a lookup table; the application doesn't write to or delete from it. It's mostly there for database integrity; let's say the whole shebang looks like this:
SocialNetwork table:
Id | Description
-----------------------------
1 | Facebook
2 | Twitter
SocialNetworkUserName table:
Id | SocialNetworkId | Name
---------------------------------------------------
1 | 2 | #seanssean
2 | 1 | SeanM
In your code, there's some special logic that needs to be carried out for Facebook users. What I usually do is make either an enum or some class constants in the code to easily refer to it, like:
if (socialNetwork.Id == SocialNetwork.FACEBOOK) // SocialNetwork.FACEBOOK = 1
// special facebook-specific functionality here
That's a hard-coded database ID. It's not a huge crime since it's just referencing a lookup table, but there's no longer a clean division between data and logic, and it bothers me.
The other option I can think of would be to specify the name of a class or delegate in the database, but that's even worse IMO, because now you've not only broken the division between data and logic, you've also tied yourself to one language.
Am I making much ado about nothing?
I don't see the problem.
At some point your code needs to do things. Facebook is a real social network, with its own real API, and you want it to do Facebook-specific things in your code. Unless your tasks are trivial, to put all of the Facebook-specific stuff in the database would mean a headache in your code. (What's the equivalent of "Like" in Twitter, for example?)
If the Facebook entry isn't in your database, then the Facebook-specific code won't be executed. You can do that much.
Yep, but with the caveat that "it depends." It's unlikely to change, but still.
Storing the name of a class or delegate is probably bad, but storing a token used by a class or delegate factory isn't, because it's language-neutral--but you'll always have the problem of having to maintain the connection somewhere. Unless you have a table of language-specific things tied to that table, at which point I believe you'd be shot.
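To make the token idea concrete, a small sketch (the HandlerToken column and its values are purely illustrative):
CREATE TABLE SocialNetwork (
    Id           INT PRIMARY KEY,
    Description  VARCHAR(50) NOT NULL,
    HandlerToken VARCHAR(50) NOT NULL UNIQUE  -- language-neutral key a factory maps to a class
);
INSERT INTO SocialNetwork VALUES (1, 'Facebook', 'facebook');
INSERT INTO SocialNetwork VALUES (2, 'Twitter',  'twitter');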
Rather than keep the constant comparison in mainline code, IMO this kind of situation is a nice fit for a factory pattern or enum lookup to implement network-specific class lookup and behavior. The mainline code shouldn't have to care how it's implemented, which it does right now; that part is a genuine concern.
With the caveat that ultimately it may never matter. If it were me, I'd at least de-couple the mainline code, because stuff like that makes me twitchy.

Deletion / invalidation approaches for reference data

Based on the discussion I found here: Database: To delete or not to delete records, I want to focus on reference data in particular, add a few thoughts on that, and ask for your preferred approach in general, or for the criteria by which you decide which of the available approaches to go for.
Let's assume the following data structure for a 'request database' for customers, where requests may be delivered via various channels (phone, mail, fax, ...; the reference data table I mainly want to focus on):
Request (ID, Text, Channel_ID)
Channel(ID, Description)
Let's, for the beginning, assume the following data within those two tables:
Request:
ID | Text | Channel_ID
===============================================================
1 | How much is product A currently? | 1
2 | What about my inquiry from 2011/02/13? | 1
3 | Did you receive my payment from 2011/03/04? | 2
Channel:
ID | Description
===============================================================
1 | Phone
2 | Mail
3 | Fax
So, how do you attack this assuming the following requirements:
Channels may change over time. That means: their descriptions may change; new ones may be added, valid only from some particular date onward; and channels may be invalidated (as of some particular date).
For reporting and monitoring purposes, it needs to be possible to identify through which channel a request was originally filed.
For new requests, only the currently 'valid' channels should be allowed, whereas for pre-existing ones, the channels that were valid at that particular date should also be allowed.
In my understanding, that clearly asks for a richer invalidation approach that goes beyond a deletion flag, probably something incorporating a 'ValidFrom / ValidTo' approach for the reference data table.
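Something like this, roughly (the column names are just for illustration):
CREATE TABLE Channel (
    ID          INT PRIMARY KEY,
    Description VARCHAR(100) NOT NULL,
    ValidFrom   DATE NOT NULL,
    ValidTo     DATE NULL  -- NULL = still valid
);
-- Channels offered when capturing a brand-new request:
SELECT ID, Description
FROM Channel
WHERE ValidFrom <= CURRENT_DATE
  AND (ValidTo IS NULL OR ValidTo >= CURRENT_DATE);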
On the other hand, this introduces several difficulties during data capture of requests: for new requests, you only display the currently available channels, whereas for maintenance of pre-existing ones, all channels available as of the creation of that record need to be displayed. This is not only complicated from a development point of view, but may also be non-intuitive to users.
How do you commonly set up your data model for reference data that might change over time? How do you design your user interface then? Which further parameters do you take into account for proper database design?
In such cases I usually create another table, for example channel_versions, that duplicates all fields from channel and has an extra create_date column (and its own PK, of course). For channel I define after-insert and after-update triggers that copy the new values into channel_versions. Now all requests in the Request table refer to records from channel_versions. For new requests you fetch the most recent version of each channel from channel_versions; for old requests you always know how the channel looked when the request was fulfilled.
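A minimal sketch of that setup, using MySQL-style trigger syntax (all names assumed):
CREATE TABLE channel_versions (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    channel_id  INT NOT NULL,
    description VARCHAR(100) NOT NULL,
    create_date DATETIME NOT NULL
);
CREATE TRIGGER channel_ai AFTER INSERT ON channel FOR EACH ROW
    INSERT INTO channel_versions (channel_id, description, create_date)
    VALUES (NEW.id, NEW.description, NOW());
CREATE TRIGGER channel_au AFTER UPDATE ON channel FOR EACH ROW
    INSERT INTO channel_versions (channel_id, description, create_date)
    VALUES (NEW.id, NEW.description, NOW());
-- Most recent version of each channel, for capturing new requests:
SELECT cv.*
FROM channel_versions cv
JOIN (SELECT channel_id, MAX(create_date) AS latest
      FROM channel_versions
      GROUP BY channel_id) m
  ON m.channel_id = cv.channel_id AND m.latest = cv.create_date;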

Join-Free Table structure for Tags

I'm working on a little blog software, and I'd like to have tags attached to a post. Each Post can have between 0 and infinite Tags, and I wonder if it's possible to do that without having to join tables?
As the number of tags is not limited, I cannot just create n fields (Tag1 to TagN), so another approach (apparently the one Stack Overflow takes) is to use one large text field and a delimiter, i.e. "<Tag1><Tag2><Tag3>".
The problem there: if I want to display all posts with a given tag, I would have to use a "LIKE '%<Tag2>%'" clause, and AFAIK those cannot use any indexes, so they require a full table scan.
Is there any suitable way to solve this?
Note: I know that a separate Tag-Link-Table offers benefits and that I should possibly not worry about performance without measuring etc. I'm more interested in the different ways to design a system.
Wanting to do this without joins strikes me as a premature optimisation. If this table is being accessed frequently, its pages are very likely to be in memory and you won't incur an I/O penalty reading from it, and the plans for the queries accessing it are likely to be cached.
A separate tag table is really the only way to go here. It is THE only way to allow an infinite number of tags.
This sounds like an exercise in denormalization. All that's really needed is a table that can naturally support any query you happen to have, by repeating any information you would otherwise have to join to another table to satisfy. A normalized database for something like what you've got might look like:
Posts:
PostID | PostTitle | PostBody | PostAuthor
--------+--------------+-------------------+-------------
1146044 | Join-Free... | I'm working on... | Michael Stum
Tags:
TagID | TagName
------+-------------
1 | Architecture
PostTags:
PostID | TagID
--------+------
1146044 | 1
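With only these normalized tables, summarizing the posts for a given tag needs exactly the join the question wants to avoid; roughly:
SELECT p.PostID, p.PostTitle
FROM Posts p
JOIN PostTags pt ON pt.PostID = p.PostID
JOIN Tags t ON t.TagID = pt.TagID
WHERE t.TagName = 'Architecture';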
Then you can add columns to optimise your queries. If it were me, I'd probably just leave the Posts and Tags tables alone and add extra info to the PostTags join table. Of course, what I add might depend a bit on the queries I intend to run, but I'd probably at least add Posts.PostTitle, Posts.PostAuthor, and Tags.TagName, so that I need only run two queries to show a blog post:
SELECT * FROM `Posts` WHERE `Posts`.`PostID` = $1
SELECT * FROM `PostTags` WHERE `PostTags`.`PostID` = $1
And summarizing all the posts for a given tag requires even less,
SELECT * FROM `PostTags` WHERE `PostTags`.`TagName` = $1
Obviously the downside to denormalization is that you have to do a bit more work to keep the denormalized tables up to date. A typical way of dealing with this is to put some sanity checks in your code that detect when a denormalized result is out of sync, by comparing it to other information the code already has available. In the above example, such a check might compare the post titles in the PostTags result set against the title in the Posts result; this doesn't cost an extra query. If there's a mismatch, the program could notify an admin, e.g. by logging the inconsistency or sending an email.
Fixing it is easy (but costly in terms of server workload): throw out the extra columns and regenerate them from the normalized tables. Obviously you shouldn't do this until you have found the cause of the database going out of sync.
If you're using SQL Server, you could use a single text field (varchar(max) seems appropriate) and full-text indexing. Then just do a full-text search for the tag you're looking for.
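That lookup might look something like this in T-SQL (it assumes a full-text index already exists on the Tags column; how '<' and '>' are tokenized depends on the word breaker):
SELECT PostID, PostTitle
FROM Posts
WHERE CONTAINS(Tags, '"<Tag2>"');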
