Database design linked list vs. order by - sql-server

I have a table which needs to persist some user actions in sequence. I can either save them using a self-reference, so the table works like a linked list, or skip the self-reference entirely and just use a timestamp to keep the sequence.
This table has references to other tables, such as the user and the files associated with an action.
The operations will need to support CRUD. The frequency of operations is in this order: Retrieve > Insert > Update > Delete
What is your design preference and why?
Thanks!

I would avoid "linked lists" like the plague. About the only thing they are good for is retrieving the "next" item. The problem is that every extra "hop" requires a join, so if you want to parameterise over that (e.g. to provide a function that retrieves N items following a given item) then you need one of: (1) machine-generated joins, (2) multiple selects, or (3) SQL that's unlikely to be portable and/or supported by your ORM.
This is the same problem that makes trees notoriously nasty in SQL. It's "fixed" by recursive joins, but that's (3) above (maybe I am old-fashioned and someone will say that these are well supported now; if so, I guess I will learn...).
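To make the trade-off concrete, here is a rough T-SQL sketch (all table and column names are invented for illustration): with a timestamp the sequence is a single indexed sort, while the self-referencing design needs a recursive CTE, i.e. option (3) above, just to walk the chain in order.

-- Schema sketch supporting both variants
CREATE TABLE UserAction (
    ActionId     INT PRIMARY KEY,
    UserId       INT NOT NULL,
    PrevActionId INT NULL REFERENCES UserAction (ActionId), -- linked-list variant
    CreatedAt    DATETIME2 NOT NULL,                        -- timestamp variant
    ActionType   VARCHAR(30) NOT NULL
);

DECLARE @UserId INT = 1;

-- Timestamp design: retrieval is one indexable ORDER BY
SELECT ActionId, ActionType, CreatedAt
FROM UserAction
WHERE UserId = @UserId
ORDER BY CreatedAt;

-- Linked-list design: a recursive CTE is needed just to walk the chain
WITH Chain AS (
    SELECT ActionId, PrevActionId, 0 AS Depth
    FROM UserAction
    WHERE UserId = @UserId AND PrevActionId IS NULL  -- head of the chain
    UNION ALL
    SELECT a.ActionId, a.PrevActionId, c.Depth + 1
    FROM UserAction AS a
    JOIN Chain AS c ON a.PrevActionId = c.ActionId
)
SELECT ActionId FROM Chain ORDER BY Depth;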

Related

What is the best solution to store a volunteer's availability data in Access 2016 [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
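For comparison, a minimal sketch of the normalized design (table and column names are invented, not from the question): one row per selected checkbox, which lets the database enforce data types, uniqueness, and referential integrity for free.

-- The form submission itself
CREATE TABLE response (
    response_id INT PRIMARY KEY
    -- ... the other form fields ...
);

-- Lookup table of the available checkbox options
CREATE TABLE option_def (
    option_id INT PRIMARY KEY,
    label     VARCHAR(50) NOT NULL
);

-- One row per (response, selected option)
CREATE TABLE response_option (
    response_id INT NOT NULL REFERENCES response (response_id),
    option_id   INT NOT NULL REFERENCES option_def (option_id),
    PRIMARY KEY (response_id, option_id)  -- duplicates are impossible
);

-- "Which responses picked option 2?" becomes an indexable equality search
SELECT response_id FROM response_option WHERE option_id = 2;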
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma-separated list
how to get records that have only the specified 2/3/etc. values from that comma-separated list
Another problem with the comma-separated list is ensuring the values are consistent: storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
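A rough sketch of that bit-flag idea (the flag names and values are invented): each checkbox maps to one bit of an integer column.

-- 1 = newsletter, 2 = sms, 4 = phone, ... (one bit per checkbox)
CREATE TABLE form_response (
    response_id INT PRIMARY KEY,
    options     INT NOT NULL DEFAULT 0
);

-- "newsletter" and "phone" checked: 1 | 4 = 5
INSERT INTO form_response (response_id, options) VALUES (1, 5);

-- every response with the "phone" bit set; note that a bitwise predicate
-- cannot use a plain index, so this still implies a scan
SELECT response_id FROM form_response WHERE options & 4 = 4;

This stays compact and rules out typos and stray values, but it shares the CSV approach's search problem and only works while the set of checkboxes stays fixed.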
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
If I needed a multi-value column, I would implement it as an XML field. It can be converted to a comma-delimited list as necessary, and an XML list can be queried in SQL Server using XQuery. By being an XML field, some of the concerns can be addressed:
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited lists, and it can be converted to a delimited list as needed.
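For illustration, a small sketch of querying such an XML column in SQL Server with the built-in XQuery methods (table and element names are invented):

-- Selections stored as XML, e.g. '<options><id>1</id><id>2</id></options>'
CREATE TABLE form_response (
    response_id      INT PRIMARY KEY,
    selected_options XML
);

-- Responses that selected option 2 (an XML index can support this)
SELECT response_id
FROM form_response
WHERE selected_options.exist('/options/id[. = 2]') = 1;

-- Shred the list into rows, e.g. to join against a lookup table
SELECT r.response_id, o.id.value('.', 'INT') AS option_id
FROM form_response AS r
CROSS APPLY r.selected_options.nodes('/options/id') AS o(id);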
Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting NoSQL projects out there with some really advanced features.
Well, I've been using a tab-separated key/value-pair list in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key-value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or a nullable CHAR(0) for each. You could also use a SET (I forget the exact syntax).
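A quick sketch of both variants in MySQL terms (column and member names are invented; the SET syntax is per the MySQL documentation):

CREATE TABLE form_response (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    opt_newsletter BIT NOT NULL DEFAULT 0,  -- one BIT column per checkbox
    opt_sms        BIT NOT NULL DEFAULT 0,
    -- or a single SET column holding any combination of fixed members:
    toppings SET('ham','mushroom','onion') NOT NULL DEFAULT ''
);

-- SET membership can be tested with FIND_IN_SET
SELECT id FROM form_response WHERE FIND_IN_SET('ham', toppings) > 0;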

How does the SETCURRENTKEY() C/AL function in Navision work?

I have the following questions:
What does SETCURRENTKEY actually do?
What is the benefit of SETCURRENTKEY?
Why would I use SETCURRENTKEY?
When would I use SETCURRENTKEY?
What is the advantage of using an index, and how does it tie in with the analogy of a library's old sorting system?
What type of database querying efficiency problems does this function solve?
I have been searching all over the internet and the 'IT Pro Developer Help' internal Navision documentation for this poorly documented function, and I cannot find the right answers to my questions.
The only thing I know is that SETCURRENTKEY sets the current key for a record variable and it sorts the recordset based on it. When SETCURRENTKEY is used with only a few keys, it can improve query performance. I have no idea what is actually happening when a database uses an index versus not using an index.
Someone told me this is how SETCURRENTKEY works:
It is like the old sorting card system in a library: without SETCURRENTKEY you would have to go through each shelf and manually filter out for the book you want. You would find a mix of random books and you would have to say: "No, not this one. Yes, this one". With SETCURRENTKEY you can have an index analogous to the old system where you would just go to a book or music CD based on its 'Author' or 'Artist' etc.
That's all fine, but I still can't properly answer my questions.
With SETCURRENTKEY you declare the key (table index, which can consist of many fields) to be used when querying the database with FINDSET/FINDFIRST/FINDLAST statements, and the order of records you will receive while iterating the recordset with the NEXT statement.
Performance. The database server uses the selected key (table index) to retrieve the record set. You are always better off explicitly stating SETCURRENTKEY in your code, as it makes you think about your database structure and the indices required.
Performance, and so that you know ahead of time the order of records you will receive when iterating through a recordset.
When to use:
The typical use is this:
RecordVar.SETCURRENTKEY(...);      // declare the key (index) and sort order
RecordVar.SETRANGE(Field, ...);    // declare filters; nothing is queried yet
RecordVar.SETFILTER(Field, ...);
RecordVar.SETRANGE(Field, ...);
...
IF RecordVar.FINDSET THEN REPEAT   // the query is executed here
  // do something with the records
UNTIL RecordVar.NEXT = 0;
SETCURRENTKEY is declarative, and comes into effect only when FINDSET is executed. At the moment FINDSET is executed, the database will be queried on the table represented by RecordVar, using the filters declared by SETRANGE/SETFILTER, and the key/index declared by SETCURRENTKEY.
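As a rough analogy in SQL terms (the table, fields, and parameters here are invented for illustration), the C/AL pattern above amounts to something like:

SELECT *
FROM CustLedgerEntry
WHERE CustomerNo = @CustomerNo              -- the SETRANGE/SETFILTER filters
  AND PostingDate BETWEEN @From AND @To
ORDER BY CustomerNo, PostingDate;           -- the key chosen via SETCURRENTKEY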
For 5. and 6., and generally, I would truly recommend you familiarize yourself with basic database index theory. That is exactly what this is, and you explained it pretty well yourself with the library/book analogy.
If modifying key fields (or filtered fields, even if not in the key) in a loop, the standard way to do this in NAV is to declare a second record variable, do a GET on it using the primary key fields from the record variable you are looping through, then change and MODIFY the second record variable.

Hierarchical SQL data (Recursive CTE vs HierarchyID vs closure table)

I have a set of hierarchical data being used in a SQL Server database. The data is stored with a guid as the primary key and a parentGuid as a foreign key pointing to the object's immediate parent. I access the data most often through Entity Framework in a WebApi project. To make the situation a little more complex, I also need to manage permissions based on this hierarchy, such that a permission applied to a parent applies to all of its descendants. My question is this:
I have searched all over and cannot decide which would be best to handle this situation. I know I have the following options.
I can create recursive CTEs (Common Table Expressions, aka RCTEs) to handle the hierarchical data. This seems to be the simplest approach for normal access, but I'm worried it may be slow when used to determine permission levels for child objects.
I can create a hierarchyId data type field in the table and use SQL Server provided functions such as GetAncestor(), IsDescendantOf(), etc. This seems like it would make querying fairly easy, but it seems to require a fairly complex insert/update trigger to keep the hierarchyId field correct through inserts and moves.
I can create a closure table, which would store all of the relationships in the table. I imagine it as a parent column and a child column, with each ancestor -> descendant relationship represented (i.e. the chain 1->2->3 would be represented in the database as 1-2, 1-3, 2-3). The downside is that this requires insert, update, and delete triggers, even though they are fairly simple, and this method generates a lot of records.
I have tried searching all over and can't find anything giving any advice between these three methods.
PS I am also open to any alternative solutions to this problem
I have used all three methods. It's mostly a question of taste.
I agree that hierarchy with parent-child relationships in the table is the simplest. Moving a subtree is simple and it's easy to code the recursive access with CTEs. Performance is only going to be an issue if you have very large tree structures and you are frequently accessing the hierarchical data. For the most part, recursive CTEs are very fast when you have the correct indexes on the table.
The closure table is more like a supplement to the above. Finding all the descendants of a given node is lightning fast; you don't need the CTEs, just one extra join, so it's sweet. Yes, the number of records blows up, but I think it is no more than N-1 times the number of nodes for a tree of depth N (e.g. a ternary tree of depth 5 would require 1 + 3 + 9 + 27 + 81 = 121 connections when storing only the parent-child relationship vs. 1 + 3 + (9 * 2) + (27 * 3) + (81 * 4) = 427 for the closure table). In addition, the closure table records are so narrow (just 2 ints at a minimum) that they take up almost no space. Generating the list of records to insert into the closure table when a new record is inserted into the hierarchy adds a tiny bit of overhead.
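For reference, a sketch of a closure table and the classic insert pattern, using guid keys to match the question (names are invented). Note this variant also stores each node's self-path, which keeps the insert uniform, so the row counts differ slightly from the arithmetic above.

CREATE TABLE TreePath (
    Ancestor   UNIQUEIDENTIFIER NOT NULL,
    Descendant UNIQUEIDENTIFIER NOT NULL,
    PRIMARY KEY (Ancestor, Descendant)
);

DECLARE @Parent  UNIQUEIDENTIFIER;          -- existing node the new one goes under
DECLARE @NewNode UNIQUEIDENTIFIER = NEWID();

-- copy every path that ends at the parent, then add the new node's self-path
INSERT INTO TreePath (Ancestor, Descendant)
SELECT Ancestor, @NewNode FROM TreePath WHERE Descendant = @Parent
UNION ALL
SELECT @NewNode, @NewNode;

-- "all descendants of a node" (e.g. for the permission check) is one seek
SELECT Descendant FROM TreePath WHERE Ancestor = @Parent;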
I personally like HierarchyId since it really combines the benefit of the above two, which is compact storage, and lightning fast access. Once you get it set up, it is easy to query and takes very little space. As you mentioned, it's a little tricky to move subtrees around, but it's manageable. Anyway, how often do you really move a subtree in a hierarchy? There are some links you can find that will suggest some methods, e.g.:
http://sqlblogcasts.com/blogs/simons/archive/2008/03/31/SQL-Server-2008---HierarchyId---How-do-you-move-nodes-subtrees-around.aspx
The main drawback I have found to hierarchyId is the learning curve. It's not as obvious how to work with it as the other two methods. I have worked with some very bright SQL developers who would frequently get snagged on it, so you end up with one or two resident experts who have to field questions from everyone else.
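For completeness, a minimal hierarchyid sketch of the descendant query (table and column names are invented):

CREATE TABLE OrgNode (
    NodeId    UNIQUEIDENTIFIER PRIMARY KEY,
    Node      HIERARCHYID NOT NULL UNIQUE,
    NodeLevel AS Node.GetLevel() PERSISTED  -- handy for breadth-first indexing
);

DECLARE @id     UNIQUEIDENTIFIER;           -- the node whose subtree we want
DECLARE @parent HIERARCHYID =
    (SELECT Node FROM OrgNode WHERE NodeId = @id);

-- the whole subtree, e.g. for cascading permissions down the hierarchy
SELECT NodeId
FROM OrgNode
WHERE Node.IsDescendantOf(@parent) = 1;     -- includes the node itself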

Database design question - which is the best solution?

I'm using Firebird 2.1 and I'm looking for the best way to solve this issue.
I'm writing a calendaring application. Different users' calendar entries are stored in a big Calendar table. Each calendar entry can have a reminder set - only one reminder/entry.
Statistically, the Calendar table could grow to hundreds of thousands of records over time, while there are going to be far fewer reminders.
I need to query the reminders on a constant basis.
Which is the best option?
A) Store the reminders' info in the Calendar table (in which case I'm going to query hundreds of thousands of records for IsReminder = 1)
B) Create a separate Reminders table which contains only the ID of calendar entries which have reminders set, then query the two tables with a JOIN operation (or maybe create a view on them)
C) I can store all information about reminders in the Reminders table and query only this table. The downside is that some information needs to be duplicated in both tables: for example, in order to show the reminder I'll need to know the event's start time, so it must be stored in the Reminders table as well, which means I'm maintaining the same values in two tables.
What do you think?
And one more question: the Calendar table will contain the calendars of multiple users, separated only by a UserID field. Since there can be only 4-5 users, even if I put an index on this field, its selectivity is going to be very bad, which is not good for a table with hundreds of thousands of records. Is there a workaround here?
Thanks!
There are advantages and drawbacks to all three choices. Which one is best depends on details you have not provided. In general, don't worry too much about selecting three or four entries out of a hundred thousand, provided the indexes you have set up allow the right retrieval strategy. If you don't understand indexing, you're likely to be in trouble no matter which of the three choices you make.
If it were me, I would go with choice B. I'd also store any attributes of a reminder in the table for reminders.
Be very careful about whether you identify an event by EventId alone or by (UserId, EventId). If you choose the latter, it behooves you to use a compound primary key for the Event table. Don't worry too much about compound primary keys, especially with Firebird.
If you declare a compound primary key, be aware that declaring (UserId, EventId) will not have the same consequences as declaring (EventId, UserId). They are logically equivalent, but the structure of the automatically generated index will be different in the two cases.
This in turn will affect the speed of queries like "find all the reminders for a given user".
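To make that concrete (names are invented): with the key below, "find all the reminders for a given user" can be answered by a prefix seek on the primary key's index; with the column order reversed it could not.

CREATE TABLE Event (
    UserId    INTEGER NOT NULL,
    EventId   INTEGER NOT NULL,
    StartTime TIMESTAMP NOT NULL,
    PRIMARY KEY (UserId, EventId)  -- leading column = UserId
);

-- served by the (UserId, EventId) index; with PRIMARY KEY (EventId, UserId)
-- the same query would need a full scan
SELECT EventId, StartTime FROM Event WHERE UserId = 4;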
Again, if it were me, I'd avoid choice C. The introduction of harmful redundancy into a schema carries with it the responsibility for some very careful programming when you go to update the data. Otherwise, you can end up with a database that stores contradictory versions of the same fact in different places.
And if you really want to know the effect on performance, try all three ways, load them with test data, and do your own benchmarks.
I think you need to create realistic, fake user data and measure the difference with some typical queries you expect to run.
Indexing, query optimization, and the types of query results you need can make a big difference, so it's not easy to say what's best without knowing more.
When choosing Option (A) you should:
provide an index on "IsReminder" (or a combined index on IsReminder and UserId, whichever fits your intended queries best)
make sure your queries use this index
Option B is preferable over A if you have more than a boolean flag for each reminder to store (for example, the number of minutes the user shall be notified before the event). You should, however, make some guessing how often in your program you will have to JOIN both tables.
If you can, avoid option C. If you don't want to benchmark all three cases, I suggest start with A or B, according to the described circumstances, and probably the solution you choose will be fast enough, so you don't have to bother with the other cases.
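A sketch of option B in Firebird terms (names are invented; a Calendar table with an Id primary key is assumed): the Reminder table stays small, so the recurring poll never touches the big Calendar table until the final join.

CREATE TABLE Reminder (
    CalendarId    INTEGER NOT NULL PRIMARY KEY REFERENCES Calendar (Id),
    RemindAt      TIMESTAMP NOT NULL,
    MinutesBefore INTEGER
);
CREATE INDEX IX_Reminder_RemindAt ON Reminder (RemindAt);

-- the recurring poll scans only the narrow Reminder index, then joins
SELECT c.*, r.RemindAt
FROM Reminder r
JOIN Calendar c ON c.Id = r.CalendarId
WHERE r.RemindAt <= CURRENT_TIMESTAMP;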

What will be the best way to keep track of modified tuples in a database?

I am currently working on a project in which I have to keep track of the tuples that are modified in a relational database. This should include updated tuples, but also inserted and deleted tuples. My question is what will be the best way to accomplish this? I have several ideas of my own, but maybe there are easier/better ways that I did not think of, or there already exists a project that exactly does this.
The final goal of the project is that it will work for relational databases of different vendors, but the first implementation will use a MySQL database. Other database systems can be supported later. But it would be nice if the solution that works for MySQL can be easily adapted to another database.
My first idea was to parse log files. However, I am not certain whether these logfiles contain the actual modified tuples, and furthermore I can imagine that these logfiles will not always be available (e.g. on shared hosting).
My second idea was to intercept the queries at the application level. When an INSERT, DELETE, or UPDATE query is performed, it can be parsed and the tuples it will affect can be determined beforehand. For an INSERT operation this is simply the inserted tuple; for a DELETE or UPDATE operation the tuples can be identified by applying the WHERE clause in a new SELECT statement.
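For example (with a hypothetical product table), the tuples an UPDATE will affect can be captured beforehand by reusing its WHERE clause:

-- the statement being intercepted
UPDATE product SET price = price * 1.10 WHERE category = 'books';

-- the affected tuples, captured beforehand with the same WHERE clause
SELECT * FROM product WHERE category = 'books';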
As a last remark I want to add that performance is not an important factor at this stage of development.
If more details are needed I am happy to provide them.
Use triggers to capture the INSERT, UPDATE, and DELETE and log your entries to a new table. You can use a timestamp on that table to note when the transactions occurred. In the future you can query that table for your modification information.
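Since the first implementation targets MySQL, here is a minimal sketch of that idea (table and column names are invented); only the UPDATE trigger is shown, and the INSERT/DELETE triggers would follow the same shape:

-- the table being tracked
CREATE TABLE account (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- the audit log, timestamped as suggested above
CREATE TABLE account_audit (
    audit_id   INT AUTO_INCREMENT PRIMARY KEY,
    account_id INT NOT NULL,
    action     VARCHAR(6) NOT NULL,  -- 'INSERT', 'UPDATE' or 'DELETE'
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER account_after_update
AFTER UPDATE ON account
FOR EACH ROW
BEGIN
    INSERT INTO account_audit (account_id, action)
    VALUES (NEW.id, 'UPDATE');
END//
DELIMITER ;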
This will require some database-dependent features, but you can encapsulate them depending on your architecture. You could use database triggers, which I normally advise against except for this very thing: auditing. In each kind of trigger, you could simply write whatever info you need to a log table. Just one suggestion.
