High-performance wiki-schema - sql-server

I'm using MS SQL Server 2005.
What's the best schema for a wiki-like system, where users edit/revise a submission and the system keeps track of these submissions?
Let's say we're doing a simple wiki-based system. It will keep track of each revision, plus the views and latest activity of each revision. In other screens, the system will list "Latest Submissions" and "Most Viewed", plus search by title.
My current schema (and I know it's bad) uses a single table. When I need to see the "Latest Submissions" I sort by "LatestActivity", group by "DocumentTitle", then take the first N records. I assume a lot of grouping (especially grouping on nvarchar) is bad news. For listing the most viewed I do the same: sort by views, group by name, take the first N records. Most of the time, I will also be doing a "WHERE DocumentName LIKE '%QUERY-HERE%'".
My current schema is "Version 1", see below:
[Image: schema diagram, http://www.anaimi.com/junk/schemaquestion.png]
I assume this is not acceptable, so I'm trying to come up with another, more performant design. How does Version 2 sound to you? In version 2 I get the advantage of grouping on WikiHeadId, which is a number. I'm assuming grouping on a number is better than grouping on nvarchar.
Or there is the extreme case, version 3, where I do no grouping at all, but which has several disadvantages such as duplicating values, maintaining these values in code, etc.
Or is there a better/known schema for such systems?
Thanks.
(moved from ServerFault - I think it's a development question more than an IT question)

Firstly (and out of curiosity) how does the current schema indicate what the current version is? Do you just have multiple 'WikiDocument' entries with the same DocumentTitle?
I'm also not clear on why you need a 'LastActivity' at a Version level. I don't see how 'LastActivity' fits with the concept of a 'Version' -- in most wikis, the 'versions' are write-once: if you modify a version, then you're creating a new version, so the concept of a last-updated type value on the version is meaningless -- it's really just 'datecreated'.
Really, the 'natural' schema for your design is #2. Personally, I'm a bit of a fan of the old DB axiom 'normalize until it hurts, then denormalize until it works'. #2 is a cleaner, nicer design (simple, with no duplication), and if you have no urgent reason to denormalize to version 3, I wouldn't bother.
Ultimately, it comes down to this: are you worrying about a 'more performant' design because you've observed performance problems, or because you hypothetically might have some? There's no real reason #2 shouldn't perform well. Grouping isn't necessarily bad news in SQL Server. In fact, if there's an appropriate covering index for the query, it can perform extremely well, because the engine can navigate to a particular level in the index to find the grouped values, then use the remaining columns of the index to compute MIN/MAX/whatever. Grouping by NVARCHAR isn't particularly bad either, although (non-binary) collations can make it a little tricky; if it's not observed to be a problem, don't fret about it. And in version 2, wherever you need to GROUP BY, you can do it by WikiHeadId, right?
One thing that may make life easier, if you do a lot of operations on the current version (as I assume you would), is to add an FK back from the head table to the body table, indicating the current version. If you want to view the current versions with the highest number of hits, with #2 as it stands now it might be:
SELECT TOP ...
FROM WikiHead
INNER JOIN
(SELECT WikiHeadId, MAX(WikiBodyVersion) /* or LastUpdated? */ AS Latest
FROM WikiBody GROUP BY WikiHeadId) AS LatestVersions
ON (WikiHead.WikiHeadId = LatestVersions.WikiHeadId)
INNER JOIN WikiBody ON
(LatestVersions.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiBody.WikiBodyVersion = LatestVersions.Latest)
ORDER BY
Views DESC
or alternatively
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiBody.WikiBodyVersion =
(SELECT MAX(WikiBodyVersion) FROM WikiBody WHERE WikiBody.WikiHeadId = WikiHead.WikiHeadId))
...
both of which are icky. If the WikiHead keeps a pointer to the current version, it's just
...
INNER JOIN WikiBody ON
(WikiHead.WikiHeadId = WikiBody.WikiHeadId)
AND (WikiHead.Latest = WikiBody.WikiBodyVersion)
...
or whatever, which may be a useful denormalization just because it makes your life easier, not for performance.
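To make the comparison concrete, here is a minimal sketch of the two approaches. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (so it has no TOP, and the sample rows, the Title column and the exact column types are assumptions made for illustration). It only shows that the GROUP BY form and the "pointer to the current version" form return the same rows; it says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WikiHead (
    WikiHeadId INTEGER PRIMARY KEY,
    Title      TEXT NOT NULL,
    Latest     INTEGER            -- pointer to the current WikiBodyVersion
);
CREATE TABLE WikiBody (
    WikiHeadId      INTEGER NOT NULL REFERENCES WikiHead(WikiHeadId),
    WikiBodyVersion INTEGER NOT NULL,
    Views           INTEGER NOT NULL DEFAULT 0,
    Body            TEXT,
    PRIMARY KEY (WikiHeadId, WikiBodyVersion)
);
-- Hypothetical sample data: 'Alpha' has two versions, 'Beta' has one.
INSERT INTO WikiHead VALUES (1, 'Alpha', 2), (2, 'Beta', 1);
INSERT INTO WikiBody VALUES
    (1, 1, 50, 'alpha v1'),
    (1, 2, 10, 'alpha v2'),
    (2, 1, 30, 'beta v1');
""")

# Without the pointer: find each head's highest version, then join back.
via_group_by = conn.execute("""
    SELECT h.Title, b.WikiBodyVersion, b.Views
    FROM WikiHead AS h
    JOIN (SELECT WikiHeadId, MAX(WikiBodyVersion) AS Latest
          FROM WikiBody GROUP BY WikiHeadId) AS lv
      ON lv.WikiHeadId = h.WikiHeadId
    JOIN WikiBody AS b
      ON b.WikiHeadId = lv.WikiHeadId AND b.WikiBodyVersion = lv.Latest
    ORDER BY b.Views DESC
""").fetchall()

# With the pointer: a plain two-table join, no grouping needed.
via_pointer = conn.execute("""
    SELECT h.Title, b.WikiBodyVersion, b.Views
    FROM WikiHead AS h
    JOIN WikiBody AS b
      ON b.WikiHeadId = h.WikiHeadId AND b.WikiBodyVersion = h.Latest
    ORDER BY b.Views DESC
""").fetchall()

print(via_group_by == via_pointer)
```

The price of the pointer is that whatever code inserts a new WikiBody row must also update WikiHead.Latest, ideally in the same transaction.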

Check this out.
It's the database schema for MediaWiki, the software Wikipedia is based on.
It looks pretty well documented and would be an interesting read for you.
From this page.

Related

Does writing the full path in SELECT statements enhance performance SQL?

Is the performance of queries impacted when writing the full query path? And what is the best practice when writing such queries? Assume the script is way more complex and longer than the following.
Example #1:
SELECT Databasename.Tablename.NameofColumn
FROM databasename.tablename
Example #2:
SELECT NameofColumn
FROM tablename
OR using aliases - example #3:
SELECT t.NameofColumn
FROM tablename t
There are a number of considerations when you're writing queries that are going to be released into a production environment, and how and when to use fully qualified names is one of those considerations.
A fully qualified table name has four parts: [Server].[Database].[Schema].[Table]. You missed Schema in your examples above, but it's actually the one that makes the most difference. SQL Server will allow you to have objects with the same name in different schemas; so you could have dbo.myTable and staging.myTable in the same database. SQL Server doesn't care, but your query probably does.
Even if there aren't identically named objects, adding the schema still helps the engine find the object you're querying a little bit faster, so there's your performance boost, albeit a small one, and only in the compile/execution plan step.
Besides performance, though, you need to worry about readability for your own sake when you need to revisit your code, and conventionality for when somebody else needs to look at your code. Conventions vary slightly from shop to shop, but here are a couple of generalities that will at least make your code easier to look at, say, on Stack Overflow.
1. Use table aliases.
This gets almost unreadable after about three column names:
SELECT
SchemaName.Tablename.NameofColumn1,
SchemaName.Tablename.NameofColumn2,
SchemaName.Tablename.NameofColumn3
FROM SchemaName.TableName
This is just easier on the brain:
SELECT
tn.NameofColumn1,
tn.NameofColumn2,
tn.NameofColumn3
FROM SchemaName.TableName as tn
2. Put the alias in front of every column reference, everywhere in your query.
There should never be any ambiguity about which table a particular column is coming from, either for you, when you're trying to troubleshoot it at 3:00 AM, or for anyone else, when you're sipping margaritas on the beach and your buddy's on call for you.
3. Make your aliases meaningful.
Again, it's about keeping things straight in your head later on. Aaron Bertrand wrote the definitive post on it almost ten years ago now.
4. Include the database name in the FROM clause if you want, but...
If you have to restore a database under a different name, your procedures won't run. In my shop, we prefer a USE statement at the top of each script (SQL Server does not allow USE inside a stored procedure body). Fewer places to change a name if need be.
tl;dr
Your example #3 is pretty close. Just add the table schema to the FROM clause.

Can Joins between views and table hurt performance?

I am new to SQL Server.
My manager has given me the job of measuring the performance of a query in SQL Server 2008.
The query is very complex, with joins between views and tables. I read on the internet that joins between views and tables hurt performance. Is that true?
Can any expert help me with this? Is there a good link where I can learn about it? How do I measure query performance in SQL Server?
Look at the query plan - it could be anything. Missing indexes on one of the underlying view tables, missing indexes on the join table or something else.
Take a look at this article by Gail Shaw about finding performance issues in SQL Server - part 1, part 2.
A view (that isn't indexed/materialised) is just a macro: no more, no less.
That is, it expands into the outer query. So a join with 3 views (with 4, 5, and 6 joins respectively) becomes a single query with at least 15 JOINs, plus the joins between the views themselves.
This is a common "I can reuse it" mistake: DRY doesn't usually apply to SQL code.
Otherwise, see Oded's answer
Oded is correct, you should definitely start with the query plan. That said, there are a couple of things that you can already see at a high level, for example:
CONVERT(VARCHAR(8000), CN.Note) LIKE '%T B9997%'
LIKE searches with a wildcard at the front are bad news in terms of performance, because of the way indexes work. If you think about it, it's easy to find all people in the phone book whose name starts with "Smi". If you try to find all people who have "mit" anywhere in their name, you will find that you need to read the entire phone book. SQL Server does the same thing - this is called a full table scan, and is typically quite slow. The other issue is that the left side of the condition uses a function to modify the column (specifically, converting it to a varchar). Essentially, this again means that SQL Server cannot use an index, even if there was one for the column CN.Note.
My guess is that the column is a text column, and that you will not be allowed to change the filter logic to remove the wildcard at the beginning of the search. In this case, I would recommend looking into Full-Text Search / Indexing functionality. By enabling full text indexing, and using specific keywords such as CONTAINS, you should get better performance.
Again (as with all performance optimisation scenarios), you should still start with the query plan to see if this really is the biggest problem with the query.
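The phone-book argument can be checked against a query planner. The sketch below is an illustration only: it uses SQLite (via Python's sqlite3 module) rather than SQL Server, and the people table, its index and the sample names are made up. SQLite's EXPLAIN QUERY PLAN reports a SEARCH when it can seek into the index and a SCAN when it has to read everything; SQL Server's execution plans make the same distinction between an index seek and a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- NOCASE collation lets SQLite's default (case-insensitive) LIKE use the index.
CREATE TABLE people (name TEXT COLLATE NOCASE);
CREATE INDEX idx_people_name ON people (name);
INSERT INTO people VALUES ('Smith'), ('Smithers'), ('Jones'), ('Hammit');
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

prefix_sql = "SELECT name FROM people WHERE name LIKE 'Smi%'"
infix_sql = "SELECT name FROM people WHERE name LIKE '%mit%'"

prefix_plan = plan(prefix_sql)   # expected to start with SEARCH (index range seek)
infix_plan = plan(infix_sql)     # expected to start with SCAN (reads every entry)

prefix_rows = sorted(r[0] for r in conn.execute(prefix_sql))
infix_rows = sorted(r[0] for r in conn.execute(infix_sql))

print(prefix_plan, infix_plan)
```

Both queries return correct results; the difference is only in how much of the index or table has to be read to get them.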

Please help explain if I'm destroying my DB Schema for the sake of performance :(

I've had a database in production for nearly 3 years, on SQL 2008 (it was '05 before that). It has been fine, but it isn't very performant, so I'm tweaking the schema and queries to help speed some things up. Also, a score of the main tables contain around 1-3 million rows each (to give you an estimate of sizes).
Here's a sample database diagram (Soz, under NDA so i can't display the original) :-
[Image: sample database diagram, http://img11.imageshack.us/img11/4608/dbschemaexample.png]
Things to note (which are directly related to my problem) :-
A vehicle can have 0 (NULL) or 1 Radio. (Left Outer Join)
A vehicle can have 0 (NULL) or 1 Cupholder (Left Outer Join)
A vehicle has 1 Tyre Type (Inner Join).
Firstly, this looks like a normalised database schema. I suck at DB theory, so I'm guessing this is 3NF (at least) ... famous last words :)
Now, this is killing my database performance because these two outer joins and the inner join are getting called a lot AND there are also a few more joins in many statements.
To try and fix this, I thought I might try an indexed view. Creating the view is a piece of cake. But indexing it doesn't work -> can't create indexed views with outer joins OR self-referencing tables (also another prob :( ).
So, i've cried for hours (and /wrists, dyed hair and wrote an emo song about it and put it on myfailspace) and did the following...
Added a new row to each 'optional' outer-join table (in this example, Radios and CupHolders). ID = 0, rest of the data = 'Unknown Blah' or 0's.
Updated the parent tables, so that any NULL values are now 0.
Updated the relationships from outer joins to inner joins.
Now, this works. I can even make my indexed view, which is very fast now.
So ... i'm in pain. This just goes against everything I've been taught. I feel dirty. Alone. Infected.
Is this a bad thing to do? Is this a common scenario of denormalizing a database for the sake of performance?
I would love some thoughts on this, please :)
PS. Those images are random Google finds -- so not me.
NULL values are often awkward for indexes: Oracle, for example, leaves entirely NULL keys out of its B-tree indexes, and in SQL Server the outer joins that NULLable foreign keys force on you cannot appear in an indexed view. What you've done is to provide a sentinel value so that the column always has a value, which allows your indexes to be used more effectively.
You didn't change the structure of your database either, so I wouldn't call this denormalizing. I've done the same with date values, where a NULL "end date" denoted "not ended yet". Instead I made it a known date way in the future, which allowed for indexing.
I think this is fine.
Databases should always be designed and initially implemented in 3NF. But the world is a place of reality, not ideals, and it's okay to revert to 2NF (or even 1NF) for performance reasons. Don't beat yourself up about it; pragmatism beats dogmatism in the real world all the time.
Your solution, if it improves performance, is a good one. The idea of having an actual radio (for example), manufactured by nobody and having no features, is not a bad one - it's been done a lot before, believe me :-) The only reason you would use that field as NULL was to see which vehicles have no radio and there's little difference between these queries:
select Registration from vehicles where RadioId is null
select Registration from vehicles where RadioId = 0
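As a rough illustration of the sentinel-row idea, the sketch below walks through the three steps from the question (add an ID = 0 row, replace NULLs with 0, switch to an inner join). It runs on SQLite through Python's sqlite3 module, not SQL Server, and the table contents are invented; the Vehicles, Radios, RadioId and Registration names follow the question and the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Radios (RadioId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Vehicles (
    VehicleId    INTEGER PRIMARY KEY,
    Registration TEXT NOT NULL,
    RadioId      INTEGER REFERENCES Radios(RadioId)   -- NULL means "no radio"
);
-- Hypothetical sample data.
INSERT INTO Radios VALUES (1, 'Radio A');
INSERT INTO Vehicles VALUES (1, 'ABC123', 1), (2, 'XYZ789', NULL);
""")

# Before: the optional relationship needs a LEFT OUTER JOIN.
before = conn.execute("""
    SELECT v.Registration, r.Name
    FROM Vehicles AS v
    LEFT OUTER JOIN Radios AS r ON r.RadioId = v.RadioId
    ORDER BY v.VehicleId
""").fetchall()

# Steps 1 and 2: add the sentinel row and point NULLs at it.
conn.executescript("""
INSERT INTO Radios VALUES (0, 'Unknown Radio');
UPDATE Vehicles SET RadioId = 0 WHERE RadioId IS NULL;
""")

# Step 3: an INNER JOIN now returns every vehicle.
after = conn.execute("""
    SELECT v.Registration, r.Name
    FROM Vehicles AS v
    INNER JOIN Radios AS r ON r.RadioId = v.RadioId
    ORDER BY v.VehicleId
""").fetchall()

# "Which vehicles have no radio?" becomes an equality test.
no_radio = conn.execute(
    "SELECT Registration FROM Vehicles WHERE RadioId = 0"
).fetchall()

print(before, after, no_radio)
```

One consequence worth noting: every report that lists radios now has to decide whether the 'Unknown Radio' row should be shown or filtered out.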
My first thought was to simply combine the four tables into one and hang the duplicate data issue. Most problems with DBMSs stem from poor performance rather than low storage space.
Maybe keep that as your fallback position if your current de-normalized schema becomes slow as well.
"...So i'm tweaking the schema and queries to help speed some things up..." - I would beg to differ about this. It seems that you're slowing things down. (Just kidding.)
I like the Database Programmer blog. He has two columns for and against normalization that you might find helpful:
http://database-programmer.blogspot.com/2008/10/argument-for-normalization.html
http://database-programmer.blogspot.com/2008/10/argument-for-denormalization.html
I'm not a DBA, but I think the evidence is in front of your eyes: Performance is worse. I don't see what splitting these 1:1 relationships into separate tables is buying you, but I'll be happy to take instruction.
Before I changed anything, I'd ask SQL Server for the execution plan of every slow query (its counterpart to Oracle's EXPLAIN PLAN, for example via SET SHOWPLAN_XML ON) and use that information to see exactly what should be changed. Don't guess because a normalization guru told you so. Get the data to back up what you're doing. What you're doing sounds like optimizing middle-tier code without profiling. Gut feelings aren't very accurate.
I'm running into the same issue of performance vs. academic excellence. We have a large view on a customer database with 300 columns and 91,000 records. We use outer joins to create the view and the performance is pretty bad. We have considered changing to inner joins by putting in dummy records with a value of zero in the columns we join on (instead of NULL) to enable a unique index on the view.
I have to agree that if performance is important, sometimes strange things have to be done to make it happen. Ultimately, those who pay our bills don't care if the architecture is perfect.

Help with designing a schema for a lyrics database

I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).
I have normalized the tables like this (formatted to show hierarchy).
artists
artistNames
albums
albumNames
tracks
trackNames
trackLyrics
user
So questions,
First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I will be working with so many one-to-many relationships, so my mapper really only takes one row as one object. For example,
id      name
------  ------------
1       Jackie Chan
2       Stephen Chow
So it's super easy to map objects. But with those one to many relationships...
id      language    name
------  ----------  ------------
1       en          Jackie Chan
1       zh          陳港生
2       en          Stephen Chow
2       zh          周星馳
...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?
<?php
array(
    array(
        'id' => 1,
        'names' => array(
            'en' => 'Jackie Chan',
            'zh' => '陳港生'
        )
    ),
    array(
        'id' => 2,
        'names' => array(
            'en' => 'Stephen Chow',
            'zh' => '周星馳'
        )
    )
);
?>
I have an itch to just denormalize these tables so I can get my one row per object application working, but I've always read this is not the way to go.
Third, does this schema sound right for the job?
Twelve way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. Seven way joins should be a breeze for any good environment.
You separate out data, as needed, to avoid difficulties like database update anomalies. These anomalies are what you get when you don't follow the normalization rules. You join data as needed to get the data that you need in a single result.
Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles in order to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or unfortunate mapping between tuples and objects.
Most of the mappings between tuples (table rows) and objects are done in an incorrect fashion. A tuple is an object, but it isn't an application-oriented object. This can cause either performance problems or difficult programming or both.
As far as you can avoid it, don't loop through results, one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.
The general rule is to make the schema as normalized as possible. Then perform stress tests with the expected amount of data. If you find performance bottlenecks, you should try to optimize in the following order:
1. Profile and optimize queries:
   - add indices to the schema
   - add hints to the query optimizer (don't know if SQLite has any, but most databases do)
2. If 1. does not gain any performance benefits, consider denormalizing the database.
Denormalizing a database is usually needed only if you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found has lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics performed by 500,000 artists. That is an amount of data that any database can easily handle on an average modern computer.
Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite to know if it's in the "serious" category. The only way to find out would be to create your schema, load up a lot of data and start looking at query plans (visual explains are very useful here). When I am doing these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work ok with this much data, I know I should be ok with real data.
Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:
select a.*, (select r.name from artist r where r.id = a.artist and r.locale = 'en') as artist_name from album a where a.id = 1;
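A runnable version of the subquery idea, as a sketch: it assumes an artist table keyed by (id, locale) and an album table with an artist column, which is one reading of the query above rather than the asker's actual schema, and the album title is a placeholder. SQLite is used through Python's sqlite3 module since that is the engine the question plans to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed layout: one artist row per (id, locale) pair.
CREATE TABLE artist (
    id     INTEGER NOT NULL,
    locale TEXT    NOT NULL,
    name   TEXT    NOT NULL,
    PRIMARY KEY (id, locale)
);
CREATE TABLE album (
    id     INTEGER PRIMARY KEY,
    artist INTEGER NOT NULL,
    title  TEXT    NOT NULL
);
INSERT INTO artist VALUES
    (1, 'en', 'Jackie Chan'), (1, 'zh', '陳港生'),
    (2, 'en', 'Stephen Chow'), (2, 'zh', '周星馳');
INSERT INTO album VALUES (1, 1, 'Example Album');  -- placeholder title
""")

def album_with_artist(album_id, locale):
    """Fetch one album row plus the artist name in the requested locale."""
    return conn.execute("""
        SELECT a.id, a.title,
               (SELECT r.name FROM artist AS r
                WHERE r.id = a.artist AND r.locale = ?) AS artist_name
        FROM album AS a
        WHERE a.id = ?
    """, (locale, album_id)).fetchone()

print(album_with_artist(1, 'en'))
print(album_with_artist(1, 'zh'))
```

Each call still yields exactly one row per album, which fits a one-row-per-object mapper; the cost is one scalar subquery per localized column.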
I've helped a friend optimize a web storefront. In your case, it's a lot the same.
First. What is your priority, webpage speed or update speed?
Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that the database will do a 7-way join no problem, but that will be dozens of I/Os: an index lookup on every table, then table access by rowid, again and again. If you denormalize for webpage loading speed you may do 2 or 3 I/Os. That also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.
But there's no reason not to do both. You can keep the base data, the official copy, in a normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs to be changed, keeping change vectors in an intermediate driving table.
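A minimal sketch of the "keep both" approach, assuming a simplified artists/albums/tracks schema (not the asker's full design) and placeholder titles. It uses SQLite via Python's sqlite3 module; the hypothetical rebuild_track_page function plays the role of the regeneration script and simply rebuilds the whole denormalized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE albums  (id INTEGER PRIMARY KEY,
                      artist_id INTEGER NOT NULL REFERENCES artists(id),
                      title TEXT NOT NULL);
CREATE TABLE tracks  (id INTEGER PRIMARY KEY,
                      album_id INTEGER NOT NULL REFERENCES albums(id),
                      title TEXT NOT NULL);
INSERT INTO artists VALUES (1, 'Prince');
INSERT INTO albums  VALUES (1, 1, 'Example Album');   -- placeholder
INSERT INTO tracks  VALUES (1, 1, 'Example Track');   -- placeholder
""")

def rebuild_track_page(conn):
    """Regenerate the read-only, denormalized copy from the normalized tables."""
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS track_page")
        conn.execute("""
            CREATE TABLE track_page AS
            SELECT t.id AS track_id, t.title AS track_title,
                   al.title AS album_title, ar.name AS artist_name
            FROM tracks AS t
            JOIN albums AS al ON al.id = t.album_id
            JOIN artists AS ar ON ar.id = al.artist_id
        """)

def page_rows(track_id):
    """What the web page would read: one flat row, no joins."""
    return conn.execute(
        "SELECT track_title, album_title, artist_name "
        "FROM track_page WHERE track_id = ?", (track_id,)
    ).fetchall()

rebuild_track_page(conn)
first = page_rows(1)

# The official copy changes in exactly one row...
with conn:
    conn.execute("UPDATE artists SET name = 'The Artist' WHERE id = 1")
stale = page_rows(1)          # ...but the web copy is stale until regenerated.
rebuild_track_page(conn)
fresh = page_rows(1)

print(first, stale, fresh)
```

The stale read in the middle is the trade-off: between rebuilds the page table lags behind the normalized data, so the rebuild interval has to match how fresh the pages need to be.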
But at the heart of your design I have a question.
Artist names change over time. John Cougar became John Cougar Melonhead (or something) and then later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with from and to valid dates?
It looks like you have a 1-n relationship from artists to albums, but that really should be many-to-many.
Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases. Or bonus tracks. How will you know that's all the same album?
If you don't care about those details then why bother with normalization? If Jon and Vangelis is 1 artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.

Database design question. BIT column for deletions

Part of my table design is to include an IsDeleted BIT column that is set to 1 whenever a user deletes a record. Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition.
I read in a previous question (I cannot for the love of God re-find that post and reference it) that this might not be the best design and an 'Audit Trail' table might be better.
How are you guys dealing with this problem?
Update
I'm on SQL Server. Solutions for other DB's are welcome albeit not as useful for me but maybe for other people.
Update2
Just to encapsulate what everyone said so far. There seems to be basically 3 ways to deal with this.
Leave it as it is
Create an audit table to keep track of all the changes
Use of views with WHERE IsDeleted = 0
"Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition."
This is not a really good way to do it, as you probably noticed, it is quite error-prone.
You could create a VIEW which is simply
CREATE VIEW myview AS SELECT * FROM yourtable WHERE IsDeleted = 0;
Then you just use myview instead of yourtable and you don't have to think about this damn column in SELECTs.
Or, you could move deleted records to a separate "archive" table, which, depending on the proportion of deleted versus active records, might make your "active" table a lot smaller, better cached in RAM, ie faster.
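A small sketch of the view approach, run on SQLite through Python's sqlite3 module rather than SQL Server. The Customers table, the ActiveCustomers view and the sample rows are hypothetical names chosen for the example; only the IsDeleted column comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (
    CustomerId INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    IsDeleted  INTEGER NOT NULL DEFAULT 0   -- stands in for a BIT column
);
-- The filter lives in one place instead of in every SELECT.
CREATE VIEW ActiveCustomers AS
    SELECT CustomerId, Name FROM Customers WHERE IsDeleted = 0;
INSERT INTO Customers (CustomerId, Name) VALUES (1, 'Ann'), (2, 'Bob');
""")

def soft_delete(customer_id):
    """'Delete' a customer by flagging the row instead of removing it."""
    with conn:
        conn.execute(
            "UPDATE Customers SET IsDeleted = 1 WHERE CustomerId = ?",
            (customer_id,),
        )

soft_delete(2)

active = conn.execute(
    "SELECT Name FROM ActiveCustomers ORDER BY CustomerId"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]

print(active, total_rows)
```

Application code reads from ActiveCustomers only; the flagged row is still in Customers for anyone who needs to report on deletions.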
If you have to have this kind of Deleted Bit column, then you really should consider setting up some VIEWs with the WHERE clause in it, and use those rather than the underlying tables. Much less error prone.
For example, if you have this view:
CREATE VIEW [Current Product List] AS
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
Then someone who wants to see current products can simply write:
SELECT * FROM [Current Product List]
This is much less error prone than writing:
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
As you say, people will forget that WHERE clause, and get confusing and incorrect results.
P.S. the example SQL comes from Microsoft's Northwind database. Normally I would recommend NOT using spaces in column and table names.
We're actively using the "Deleted" column in our enterprise software. It is however a source of constant errors when forgetting to add "WHERE Deleted = 0" to an SQL query.
Not sure what is meant by "Audit Trail". You may wish to have a table to track all deleted records. Or there may be an option of moving the deleted content to paired tables (like Customer_Deleted) to remove the passive content from tables to minimize their size and optimize performance.
A while ago there was some blog uproar on this issue, Ayende and Udi Dahan both posted on this.
Nai, this is totally up to you.
Do you need to be able to see who has deleted / modified / inserted what and when? If so, you should design the tables for this and adjust your procs to write these values when they are called.
If you don't need an audit trail, don't waste time with one. Just do as you are with IsDeleted.
Personally, I flag things right now, as an audit trail wasn't specified in my spec, but that said, I don't like to actually delete things. Hence, I chose to flag it. I'm not going to waste a client's time writing something they didn't request. I won't mess about with other tables because that's another thing for me to think about. I'd just make sure my indexes were up to the job.
Ask your manager or client. Plan out how long the audit trail would take so they can cost it and let them make the decision for you ;)
Udi Dahan said this:
Model the task, not the data
Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).
As we broaden our perspective to more parts of the system, we see this same pattern repeating:
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.
Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.
Jobs aren’t deleted – they’re filled (or their requisition is revoked).
In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.
If you have an Oracle DB, then you can use its audit trail for auditing. Check the Audit Vault tool from OTN, here. It even supports SQL Server.
Views (or stored procs) to get at the underlying table data are the best way. However, if you have the problem with "too many cooks in the kitchen" like we do (too many people have rights to the data and may just use the table without knowing enough to use the view/proc) you should try using another table.
We have a complete mimic of the base table with a few extra columns for tracking. So Employee table has an EmployeeDeleted table with the same schema but extra columns for when it was deleted and who deleted it and sometimes even the reason for deletion. You can even get fancy and have triggers do the insertion directly instead of going through applications/procs.
Biggest Advantage: no flag to worry about during selects
Biggest Disadvantage: any schema changes to the base table also have to be made on the "deleted" table
Best for: situations where for whatever reason (usually political with us) many not-as-experienced people have rights to the data but still expect it to be accurate without having to understand flags or schemas, etc
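The "mimic table plus trigger" setup described above could look roughly like the sketch below. It is written for SQLite (via Python's sqlite3 module), so the trigger syntax differs from T-SQL, and the column list is an assumption. SQLite has no notion of a current database user, so the "who deleted it" column is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (
    EmployeeId INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);
-- Same columns as Employee, plus tracking columns.
CREATE TABLE EmployeeDeleted (
    EmployeeId INTEGER NOT NULL,
    Name       TEXT NOT NULL,
    DeletedAt  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Copy the row into the archive just before it disappears.
CREATE TRIGGER Employee_archive BEFORE DELETE ON Employee
BEGIN
    INSERT INTO EmployeeDeleted (EmployeeId, Name)
    VALUES (OLD.EmployeeId, OLD.Name);
END;
INSERT INTO Employee VALUES (1, 'Ann'), (2, 'Bob');
""")

# A plain DELETE is all the application (or an ad hoc user) has to issue.
with conn:
    conn.execute("DELETE FROM Employee WHERE EmployeeId = ?", (2,))

remaining = conn.execute("SELECT Name FROM Employee").fetchall()
archived = conn.execute(
    "SELECT EmployeeId, Name FROM EmployeeDeleted"
).fetchall()

print(remaining, archived)
```

As the disadvantage above says, any column added to Employee also has to be added to EmployeeDeleted and to the trigger's INSERT.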
I've used soft deletes before on a number of applications I've worked on, and overall it's worked out quite well. Yes, there is the issue of always having to remember to add AND IsActive = 1 to all of your SELECT queries, but really that's not so bad. You can create views if you don't want to have to remember to always do that.
The reason we've done this is because we had very specific business needs to be able to report on records that have been deleted. The reporting needs varied widely - sometimes they'd need to see just the active records, or just the inactive records, or sometimes a mix of both - so pushing all the deleted records into an audit table wasn't a very good option.
So, depending on your particular business needs, I think this approach is certainly a viable option.

Resources