Implementing tag system with multiple tables - database

I am trying to implement a tag system similar to the one Stack Overflow has. I've read multiple articles, including this answer.
However, my scenario is a little different:
There will be a limited number of tags, which can only be created by a user with higher privileges (anybody can assign a tag). I guess this excludes option #1 from the SO question linked above, where each tag is inserted directly into the table's tags column and then queried with LIKE.
There are also multiple tables in the DB which can be tagged (currently five).
The second criterion in particular makes it harder, so these are my thoughts:
I could follow option #3: have a tags table with an M:N relationship to each table. However, that would make searching harder (imagine that join if the number of tables grows), and I also need to tell which table (application module) each search result belongs to.
I could use some kind of polymorphism, but I am pretty new to this concept with regard to databases. Is it a good fit for this problem?
I use the newest version of PostgreSQL.

Since you are using PostgreSQL, you have the option of some field types which aren't available in other databases, particularly arrays and JSON fields. I did some performance comparisons of the various methods in a blog post. Arrays and JSONB were definitely better options than a tags table for any search which needed to combine multiple tags.
Given that, I would recommend creating a tags column for each table on which you want to have tags, either an array or a JSONB column, depending on your needs. If you need to search over multiple tables, I'd suggest a UNION query instead of having a single monolithic tags table which joins to everything.
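To make the UNION suggestion concrete, here is a rough sketch that only builds the query text. It assumes each taggable table has an `id` column and an array column named `tags`; the table names, the `source_table` column and the `%(wanted)s` named placeholder are all my own illustrative choices. The literal `source_table` column is one way to tell which table (application module) a hit came from.

```python
def build_tag_search_sql(tables):
    """Return one UNION ALL query that searches every tagged table.

    Assumes (hypothetically) that each table has an `id` column and a
    PostgreSQL array column called `tags`.
    """
    parts = []
    for table in tables:
        # Table names cannot be bound as parameters, so only accept
        # plain identifiers before interpolating them into the SQL.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        parts.append(
            f"SELECT '{table}' AS source_table, id "
            f"FROM {table} WHERE tags @> %(wanted)s"
        )
    return "\nUNION ALL\n".join(parts)


# Hypothetical table names, purely for illustration.
sql = build_tag_search_sql(["articles", "photos", "events"])
```

The `@>` operator is PostgreSQL's "contains" test for arrays, so each branch keeps only rows whose `tags` include every wanted tag. The query text would then be executed with whatever driver you use, binding `wanted` to the list of tags.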

Related

Database, Using json field instead of ManyToManyField?

Suppose reviews can have zero or more tags.
One could implement this using three tables: Review, Tag, and ReviewTagRelation.
ReviewTagRelation would have foreign keys to the Review and Tag tables.
Or one could use two tables, Review and Tag, where Review has a JSON field holding the list of tag ids.
The traditional approach seems to be the one using three tables.
I wonder if it is OK to use the two-table approach when there's no need to reference reviews from tags,
i.e. I only need to know which tags are associated with a given review.
In my experience it is always best to keep the data in your database normalized, unless there is a clean and clear cut reason for not doing so that makes sense as per your business requirements.
With normalized data, you know that no matter what, you will always be able to write a query to receive exactly what you are looking for, and if for some reason you want to return data as json, you can do so in your select query.
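As a minimal sketch of the normalized three-table layout, assuming SQLite as a stand-in database and table and column names of my own choosing, the following fetches the tags of one review and turns them into JSON only at the edge:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Link table: one row per (review, tag) pair.
    CREATE TABLE review_tag_relation (
        review_id INTEGER NOT NULL REFERENCES review(id),
        tag_id    INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (review_id, tag_id)
    );
    INSERT INTO review VALUES (1, 'Great phone'), (2, 'Untagged review');
    INSERT INTO tag VALUES (1, 'battery'), (2, 'camera');
    INSERT INTO review_tag_relation VALUES (1, 1), (1, 2);
""")


def tags_for_review(review_id):
    """Return the tag names attached to one review, sorted by name."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM review_tag_relation AS r
        JOIN tag AS t ON t.id = r.tag_id
        WHERE r.review_id = ?
        ORDER BY t.name
        """,
        (review_id,),
    ).fetchall()
    return [name for (name,) in rows]


# The data stays normalized; JSON is produced only when it is needed.
tags_json = json.dumps(tags_for_review(1))
```

A review with no tags simply has no rows in the link table, so `tags_for_review(2)` returns an empty list.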

Is there a pattern to avoid ever-multiplying link tables in database design?

Currently scoping out a new system. Like many systems, it will be required to store documents and link them to other kinds of item. In this instance a Document object can belong to a Job or it can belong to an Item (which in turn belongs to a Job).
We could do this by having a JobId and an ItemId against a Document and leaving one or the other blank as necessary, but that's going to mean annoying conditional logic in the handling code. So, two link tables seem a better idea.
However, it is likely that we will need to link Documents to other items in the system at some point in the future. There are Company and User objects, for example, and we might want to record Documents against those. There may be more.
That would entail a proliferation of link tables which, while effective, is messy and hard to follow.
This solution is in SQL Server and will be handled in code via Entity Framework.
Are there any design principles that can allow us to hook up Document objects with a variety of other system objects as required in a neater and more flexible way?
You could store two values: the id, and the type of object to which the document is attached. It doesn't allow the use of foreign keys, but is compatible with many application development frameworks.
If you have the partitioning option then you could dedicate different partitions to different object types.
You could also have multiple tables, one for job documents, one for item documents, and get an overview of all of them with a view that UNION ALL's them together. If you need uniqueness in that result set then you could use UUIDs for the primary key, or add an extra column to the view to express from which table the row was read.
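The last option (one table per owner type, plus a UNION ALL view with an extra column naming the source) might look roughly like this. It is only a sketch: SQLite stands in for SQL Server here, and every table, view and column name is an assumption of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_document (
        id INTEGER PRIMARY KEY, job_id INTEGER NOT NULL, title TEXT NOT NULL
    );
    CREATE TABLE item_document (
        id INTEGER PRIMARY KEY, item_id INTEGER NOT NULL, title TEXT NOT NULL
    );

    -- The literal owner_type column records which table each row came
    -- from, so rows stay distinguishable even when their ids collide.
    CREATE VIEW all_documents AS
        SELECT 'job' AS owner_type, job_id AS owner_id,
               id AS document_id, title
        FROM job_document
        UNION ALL
        SELECT 'item' AS owner_type, item_id AS owner_id,
               id AS document_id, title
        FROM item_document;

    INSERT INTO job_document VALUES (1, 10, 'Contract');
    INSERT INTO item_document VALUES (1, 77, 'Datasheet');
""")

rows = conn.execute(
    "SELECT owner_type, owner_id, document_id, title "
    "FROM all_documents ORDER BY owner_type"
).fetchall()
```

Both documents have id 1 in their own tables, yet the pair (owner_type, document_id) remains unique in the view. Supporting a new owner type means one more table and one more branch in the view.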

Solr - How to index on multiple entities?

I have two tables contacts and inventory. These two tables are not related. I want to index these two tables and search using Solr.
Is this possible?
If some part of your application needs to search for contacts, and another one needs to search in the inventory, create two separate indices. Storing wildly different data in the same index is almost never a good idea; it complicates things unnecessarily. As the Solr wiki wisely says:
"The more heterogeneous (different kinds of data) you have in one field or in one index, the less useful it is."
You don't need to have multiple Solr instances to accommodate multiple indices; you can easily manage this with multi-core.
I found a very helpful answer to this question here, including some guidance on using "multiple indexes" vs. "multiple document types in one index". The post also links to example code on github that I found very useful.
Yes, you can do that. Simply create a Solr schema that contains all fields necessary for both tables, and add another field that contains the table name. During indexing, set the table name field on every document you index. During searching, always include a query parameter for the table name field.
As an alternative, you can set up multiple instances of Solr. But you should do this only if we are talking about massive amounts of data (like millions of table rows).
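The single-schema approach can be sketched without touching Solr at all. The helper names and the `table_name` field below are hypothetical; the only Solr-specific parts are the standard `q` and `fq` request parameters. Prefixing the id with the table name is my own addition, so that a contact and an inventory row with the same numeric id do not overwrite each other in a shared index.

```python
from urllib.parse import urlencode


def to_solr_doc(table_name, row):
    """Turn one table row (a dict) into a document for a shared index."""
    # Prefix the id so rows from different tables cannot collide.
    doc = {"id": f"{table_name}-{row['id']}", "table_name": table_name}
    for key, value in row.items():
        if key != "id":
            doc[key] = value
    return doc


def search_params(text, table_name):
    """Build a query string that restricts hits to one source table."""
    return urlencode({"q": text, "fq": f"table_name:{table_name}"})


contact_doc = to_solr_doc("contacts", {"id": 1, "name": "Ada"})
inventory_doc = to_solr_doc("inventory", {"id": 1, "sku": "X-100"})
query_string = search_params("Ada", "contacts")
```

The documents would then be posted to the index and the query string appended to the select URL of your own installation; both steps are left out here because they depend on your setup.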

Should the descriptive tags associated with an entity be stored in a separate database table?

I have a Questions model, and just like StackOverflow, each question can be tagged with multiple descriptive tags by a user.
What I'm trying to decide is whether it's necessary for the Tags associated with a question to be stored in a separate table in the database.
Or could I store the Tags as a single field of the Questions table as a list of space-separated strings?
I'm not sure which makes more sense - is there any good reason to separate the data?
Using a comma-separated string for a multi-valued attribute is another SQL Antipattern. :-)
How long does the string need to be? Stated another way: how many tags can a given entry have? (It depends on how long the individual tags are.)
How do you account for strings that contain the separator character? What if a character you currently use as a separator becomes a legitimate character in a tag?
How do you insert or delete elements from the list in SQL? (You have to fetch the whole list into the application, explode the list, filter through it, and re-post it to the database.)
How can you do aggregates like COUNT(*) in SQL?
How do you search efficiently for all entries that share a given tag? (You have to use costly pattern-matching queries.)
The solution is to use a separate table, as most other folks on this thread are advising.
Separating tags into their own table, plus a further table with a many:many relationship between Tags and Questions, is what's known in relational land as "normal form". It makes it easier and faster to perform tasks such as getting all questions tagged with a certain tag, finding the most popular tags, &c.
(Just in case you don't know -- a "many:many relationship" is a table with just two columns [a foreign key into Tags and one into Questions] and no uniqueness constraint on either column alone; the pair of columns together usually serves as the key.)
I would put the questions in one table, the tags in another, and have a separate table to connect the tags to questions. This would be the best way to build that database. It keeps all tags consistent and greatly reduces redundancy.
By separating the data like this, you can be sure that searching for a specific tag will always bring back the same items. You don't have to worry about whether the tag is spelled the same way throughout all the questions. Also, you can limit the tag options more easily this way.
You should definitely store the tags in a separate table; it makes everything easier, and that's the whole idea of a 'relational' database.
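For what it's worth, here is a small sketch of the separate-table design the answers above recommend, showing the two queries that are awkward with a separator-delimited string: counting tag usage and finding all questions for a tag. SQLite is used as a stand-in, and the table names and the composite primary key on the link table are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE question_tags (
        question_id INTEGER NOT NULL REFERENCES questions(id),
        tag_id      INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (question_id, tag_id)
    );
    INSERT INTO questions VALUES
        (1, 'Join syntax?'), (2, 'Index tuning?'), (3, 'List slicing?');
    INSERT INTO tags VALUES (1, 'sql'), (2, 'python');
    INSERT INTO question_tags VALUES (1, 1), (2, 1), (3, 2);
""")

# Most popular tags: a plain aggregate, no string splitting needed.
popular = conn.execute("""
    SELECT t.name, COUNT(*) AS uses
    FROM question_tags AS qt
    JOIN tags AS t ON t.id = qt.tag_id
    GROUP BY t.name
    ORDER BY uses DESC, t.name
""").fetchall()

# All questions carrying one tag: an equality match, not pattern matching.
sql_questions = conn.execute("""
    SELECT q.title
    FROM questions AS q
    JOIN question_tags AS qt ON qt.question_id = q.id
    JOIN tags AS t ON t.id = qt.tag_id
    WHERE t.name = ?
    ORDER BY q.id
""", ("sql",)).fetchall()
```

Adding or removing a tag from a question is then a single INSERT or DELETE on the link table, with no need to fetch and rewrite a whole string.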

ID for tags in tag systems

I'm implementing a tag system similar to Stack Overflow's. When storing tags and relating them to a question, should the relationship point directly at the tag name, or is it better to create a tagID field to link the question with the tag? Linking directly to the tag name looks easier, but it doesn't seem right, mainly because statistics and tag categorization could (IMHO) become hard to manage. Another problem arises when an admin decides to "fix" a tag name: if there is no tagID separate from the tag name, I would be changing the key of the table...
What are your thoughts?
Thanks for all the replies. I will delete this post since there are other posts on the same subject. I wonder why the search and the suggestions didn't show those results to me...
Have a look at these related earlier SO questions:
What is the most efficient way to store tags in a database
Database design for tagging
How to design a database schema to support tagging with categories
Is there an ideal schema for tagging?
The last sentence of your question seems to answer it. Assuming the tags are stored in a tag table, I would always have an ID column (int or GUID) and a varchar/string column for the tag name. The many-to-many junction table that relates some other entity to one or more tags would then have two columns: the other entity's ID and the tag's ID.
It's then easy to edit a tag (to correct a mis-spelling for example) without touching the key. You should get much better performance when using queries that include joins with your junction table and it also means you're normalizing your data better.
Remember, "the key, the whole key and nothing but the key, so help me Codd"! :)
If you foresee many tags, and are using a relational database, using an ID that the database supports natively (e.g. RID) internally may just give you better performance.
If that's not a concern: go by simple short tag names. You can give the tags long names which will be displayed in the user interface too where it makes sense (e.g. ask the user for one when creating a new tag). You are more likely to have to edit the long names, which nothing refers to directly, so this is not a problem.
Aside, if you are using a relational database, it is probably not very difficult to change a tag name together with all its references with a simple query, it may just be a slightly more expensive operation, but it is probably not going to be done frequently enough that you need to optimize for it. And consider that you may have duplicate tags that you will want to merge too, so you might want to be able to do that anyway.
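A rough sketch of the ID-based layout discussed above, showing why a rename leaves the links untouched and how two duplicate tags could be merged. SQLite is a stand-in here, the names are hypothetical, and `UPDATE OR IGNORE` is SQLite-specific syntax (other databases would need a different way to skip pairs that already exist).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE question_tags (
        question_id INTEGER NOT NULL,
        tag_id      INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (question_id, tag_id)
    );
    INSERT INTO tags VALUES (1, 'javascrpt'), (2, 'js');
    INSERT INTO question_tags VALUES (100, 1), (101, 2), (102, 1), (102, 2);
""")


def links():
    """Return all (question_id, tag_id) pairs in a stable order."""
    return conn.execute(
        "SELECT question_id, tag_id FROM question_tags "
        "ORDER BY question_id, tag_id"
    ).fetchall()


def rename_tag(tag_id, new_name):
    """Fix a tag's spelling; only the tags table changes."""
    conn.execute("UPDATE tags SET name = ? WHERE id = ?", (new_name, tag_id))


def merge_tags(keep_id, drop_id):
    """Repoint links from drop_id to keep_id, then delete drop_id."""
    # OR IGNORE skips rows that would duplicate an existing pair.
    conn.execute(
        "UPDATE OR IGNORE question_tags SET tag_id = ? WHERE tag_id = ?",
        (keep_id, drop_id),
    )
    # Whatever still points at drop_id was a duplicate; remove it.
    conn.execute("DELETE FROM question_tags WHERE tag_id = ?", (drop_id,))
    conn.execute("DELETE FROM tags WHERE id = ?", (drop_id,))


links_before = links()
rename_tag(1, "javascript")
links_after_rename = links()

merge_tags(keep_id=1, drop_id=2)
links_after_merge = links()
tags_after_merge = conn.execute("SELECT id, name FROM tags").fetchall()
```

The rename touches a single row in `tags`, while the merge is the more expensive operation mentioned above, since it rewrites link rows.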
