Indexing pros and cons in SQL Server 2008 - sql-server

I am working on a social networking site. Our team has decided to store user profiles in a denormalized manner, so our table structure is as follows.
Here an attribute means one field of the user profile, e.g. FirstName, LastName, BirthDate, etc.,
and a group means the name of a group of fields, e.g. Personal Details, Academic Info, Achievements, etc.
Attribute/Groups master - creates the hierarchy of groups and attributes:
Attribute_GroupId bigint
ParentId bigint
Attribute_GroupName nvarchar(1000)
ISAttribute bit
DisplayName nvarchar(1000)
DisplaySequence int
Attribute Control Info - stores which control has to be populated at run time for each attribute, as well as its validation criteria:
Attribute_ControlInfoId bigint
AttributeId bigint
ControlType nvarchar(1000)
DataType nvarchar(1000)
DefaultValue nvarchar(1000)
IsRequired bit
RegularExpression nvarchar(1000)
And finally Attribute Values, where the values are stored per attribute and per user:
AttributeId bigint
IsValueOrRefId bit
Value nvarchar(MAX)
ReferenceDataId bigint
UserId bigint
Now they are saying that we'll create an index on the Attribute Values table; there is no primary key on it either.
A huge amount of data is going to be stored in this table: if there are 50 million users and 30 attributes, it will hold 1,500 million records. In that case, if we create an index on the table, won't INSERT and UPDATE statements become very slow, and won't the queries that fetch the data for one user also be very slow?
One option I thought of is to store one XML record per user instead of attribute-wise values.
So, can anybody help me find the best option for this case? How should I store the data?
I cannot hard-code the table here, because the administrator can add new fields at any time, so I need a data structure where I can easily add any field to the user profile in only one or two steps.
Please reply if anybody has a better solution.

You guys need a DBA!
This is one of those EAV (entity-attribute-value) tables that is going to bite you down the road!

Bill Karwin (his blog) put together a SQL Anti-patterns PPT
Link 1
Link 2
He offers three alternative solutions to EAV.
Indexing is the least of your worries...

Check out those articles which highlight just how bad that design choice is, and what potential problems you're getting yourself into if you stick to that design:
Five Simple Database Design Errors You Should Avoid
Joe Celko: Avoiding the EAV of Destruction
Bad CaRMa
It seems to be a fairly common design problem - and it seems like a good idea to programmers to solve it that way, with an attribute/value table - but it's really not a good idea from a database performance point of view.
Also:
Now they are saying that we'll create
an index on the Attribute Values table;
there is no primary key on it either.
As some SQL gurus like to say: "If it doesn't have a primary key, it's not a table".
You definitely need to find a way to get a primary key onto your tables - if you don't have anything that you can use per se, add a column "ID" of type "INT IDENTITY(1,1)" to it and put the primary key on that column. You need a primary key! Database design, first lesson, first five minutes....
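As a minimal sketch of that fix, assuming the table is called AttributeValues (the question doesn't give its actual name):

-- Add a surrogate key column, as suggested above.
-- (bigint may be safer headroom than int at 1,500 million rows,
-- since int tops out at roughly 2.1 billion.)
ALTER TABLE AttributeValues
    ADD ID int IDENTITY(1,1) NOT NULL;

-- Make it the primary key.
ALTER TABLE AttributeValues
    ADD CONSTRAINT PK_AttributeValues PRIMARY KEY (ID);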
You need to rethink your design and come up with something more clever to store the data you need.

Related

Database Normalization and Nested Lists -- Cannot Think of a Solution

I am trying to implement a system on my website similar to Facebook's "Like" feature, where users can click a button which increments a counter. However, I have run into a problem in terms of efficiently storing the data in my DB.
Each story has its own row in the stories table in my DB, with the columns like and users_like.
I want each person to be able to like a story only once, therefore I need to somehow store data that shows that the user has, in fact, liked the post.
All I could think of was to have a column named users_like and then append each user, followed by a comma, to the column using CONCAT, and then split the data apart with PHP's explode function.
However, this method, as far as I know, goes in the opposite direction of database normalization.
What is the best way to do this? I understand that "best" is subjective.
I cannot add a liked flag to the user table because there will be a vast number of stories the person could like.
Thanks
You need a many-to-many table in your database that stores a foreign key to the stories table and a foreign key to the users table. You put a constraint on this table saying that the story FK / user FK combination must be unique.
You then don't even need a like column; you just count the number of rows in the many-to-many table corresponding to your story.
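As a minimal sketch of such a table (table and column names are assumptions, with stories(id) and users(id) presumed to exist):

-- One row per (story, user) pair; the composite primary key
-- enforces the one-like-per-user-per-story rule.
CREATE TABLE story_likes (
    story_id INT NOT NULL,
    user_id  INT NOT NULL,
    PRIMARY KEY (story_id, user_id),
    FOREIGN KEY (story_id) REFERENCES stories (id),
    FOREIGN KEY (user_id)  REFERENCES users (id)
);

-- The like count for a story is then just a row count:
SELECT COUNT(*) FROM story_likes WHERE story_id = 42;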

Single column/primary key only table for referential integrity?

Maybe I'm going about this wrong, but I'm working on a database design for one of my projects.
I have an entity with a classification column which groups entities into convenient categories for the user. These classifications are predefined and unchangeable by the user (at least that's the current design).
I'm trying to decide if I should have an EntityClassification table which contains simply an Id column as the primary key, with no other information, in order to have an enforced relationship Entity:Classification -> EntityClassification:Id.
I don't plan to have a name/description column in EntityClassification, since my current thought is that I'll need to support localization of these predefined names, which will be done with static string-table resource files downloaded to the client based on their country/language. There really isn't any other data associated with this EntityClassification that I would want, and a table seems like it might be overkill?
Is this common/recommended practice for this type of problem? We're using SQL Server 2008, which doesn't have an enum datatype - that seems to be really what I'm trying to achieve.
You should have the table with name and description, not only for end-user display but for internal documentation, so that when the users say 'my query based on this classification doesn't work!' someone hired in the future will know which ID they're talking about.
Do you just want to ensure that the values in Entity:Classification are restricted to your predetermined list? If so, a check constraint might be what you need.
Such constraints aren't as flexible as foreign keys: to alter the checked values you have to drop and recreate the constraint. But since you say there are no plans to change the values, that shouldn't matter.
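As a minimal sketch of that constraint (the table/column names and the value list are assumptions):

-- Restrict Classification to the predefined ids without a lookup table.
ALTER TABLE Entity
    ADD CONSTRAINT CK_Entity_Classification
    CHECK (Classification IN (1, 2, 3, 4));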

What is the best way to keep a history of changes to database fields?

For example, I have a table which stores details about properties, which could have owners, a value, etc.
Is there a good design for keeping the history of every change to owner and value? I want to do this for many tables, kind of like an audit of the table.
What I thought of was keeping a single table with the fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables; no structural changes are necessary for a new table.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but then only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table per type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert), etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
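As a minimal sketch of such an audit table in SQL Server syntax (the column types are assumptions):

-- One row per changed field; the same table serves every audited table.
CREATE TABLE audit_field (
    table_name  nvarchar(128) NOT NULL,
    id          bigint        NOT NULL,   -- key of the audited row
    field_name  nvarchar(128) NOT NULL,
    field_value nvarchar(max) NULL,       -- every type denormalized to text
    changed_at  datetime      NOT NULL DEFAULT GETDATE()
);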
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database, create a generalized table that has all the same fields as the original record, plus a versioning field (additional metadata is again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or not indexed at all, with no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, though functionality is then greatly reduced. (Basically it depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
    my_table_id number not null primary key,
    attr1 varchar2(10) not null,
    attr2 number null,
    constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
    my_table_id number not null,
    attr1 varchar2(10) not null,
    attr2 number null,
    effective_date date not null,
    is_deleted number(1,0) default 0 not null,
    constraint my_table_ak unique (attr1, attr2, effective_date),
    constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to turn UPDATE activity into INSERT activity, and to change DELETE activity into an UPDATE of the IS_DELETED boolean.
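As a minimal sketch of the trigger route in SQL Server syntax (the DDL above is Oracle-flavored, so the table is re-declared in T-SQL here; names and types are assumptions):

CREATE TABLE my_table (
    my_table_id    int         NOT NULL,
    attr1          varchar(10) NOT NULL,
    attr2          int         NULL,
    effective_date datetime    NOT NULL DEFAULT GETDATE(),
    is_deleted     bit         NOT NULL DEFAULT 0,
    CONSTRAINT my_table_pk PRIMARY KEY (my_table_id, effective_date)
);
GO
-- Turn UPDATEs into INSERTs: the old row stays behind as history,
-- and the new values become a fresh version.
CREATE TRIGGER my_table_versioning ON my_table
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO my_table (my_table_id, attr1, attr2, effective_date, is_deleted)
    SELECT i.my_table_id, i.attr1, i.attr2, GETDATE(), i.is_deleted
    FROM inserted i;   -- 'inserted' holds the post-update values
END;

A real implementation would also intercept DELETEs, and note that two updates of the same row within datetime's ~3 ms resolution would collide on the primary key.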
Unreason:
You are correct that this solution is similar to record-based auditing; I initially read it as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table when making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform the change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger against it which intercepts DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add a table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required)
RecordID - points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
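As a minimal T-SQL sketch of the trigger variant (the column types and the ChangedAt column are assumptions):

CREATE TABLE properties (
    ID     int IDENTITY(1,1) PRIMARY KEY,
    value1 nvarchar(100) NULL,
    value2 nvarchar(100) NULL
);
CREATE TABLE properties_audit (
    ID        int IDENTITY(1,1) PRIMARY KEY,   -- id of the history record
    RecordID  int      NOT NULL,               -- points to properties.ID
    ChangedAt datetime NOT NULL DEFAULT GETDATE(),
    value1    nvarchar(100) NULL,
    value2    nvarchar(100) NULL
);
GO
-- Copy the pre-update values into the audit table.
CREATE TRIGGER properties_history ON properties
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO properties_audit (RecordID, value1, value2)
    SELECT d.ID, d.value1, d.value2
    FROM deleted d;   -- 'deleted' holds the values before the update
END;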
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save both current and previous values in the audit tables. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table by time and know what the previous value of the field was prior to your change.

Ship management database structure discussion (should I denormalize?)

My software went into production some days ago and now I want to discuss the database structure a bit.
The software collects data about ships, currently 174 details for each ship. Each detail can be a text value, a long text value, a number (of a specified length, with or without a specified number of decimals), a date, a date with time, a boolean field, a menu with many values, a list of data, and more.
I solved the problem with the following tables:
Ship:
- ID - smallint, Autoincrement identity
- IMO - int, A number that does not change for the life of the ship
ShipDetailType:
- ID - smallint, Autoincrement identity
- Description - nvarchar(200), The description of the value the field contains
- Position - smallint, The position of the field in the data input form
- ShipDetailGroup_ID - smallint, A key to the group the field belongs to in the data input form
- Type - varchar(4), The type of the field as mentioned above
ShipDetailGroup
- ID - smallint, Autoincrement identity
(snip...)
ShipMenuPresetValue
- ID - smallint, Autoincrement identity
- ShipDetailType_ID - smallint, A key to the detail the values belongs to
- Value - nvarchar(100), The values preset in the menu type detail
ShipTextDetail
- ID - smallint, Autoincrement identity
- Ship_ID - smallint, A Key to the ship the detail belongs to
- ShipDetailType_ID - smallint, a Key to the detail type of the value
- Text - nvarchar(500), the field containing the detail's value
- ModifiedDate - smalldatetime
- User_ID - smallint, A key to the user table
ShipTextDetailHistory
(snip...)
This table is the same as ShipTextDetail and contains every change to the details.
There are other tables for the list detail type, each with the fields required for that list, ...
I just read these articles: http://thedailywtf.com/Articles/The_Inner-Platform_Effect.aspx and http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
The articles say that this is not the right way to handle the problem.
My customer has a management GUI for the details and groups, as he changes the detail descriptions and adds more details.
The data input form is dynamically built by reading the structure from the DetailGroups and DetailTypes; each detail type generates a specific input control.
The comments suggest that another way of solving this matter is dynamically creating and removing columns from the table.
What do you think?
Diagram Screenshot: http://img24.imageshack.us/my.php?image=66604496uk3.png
I would refactor your code if:
Your customer complained
You found something that didn't work
You found a way that the code couldn't handle a change you knew was going to happen in the future.
You remembered to write unit tests that will allow you to refactor, right?
As far as the structure you have there, I've seen structures like it before. It's a little cumbersome, but it is standard in many places. One thing to remember is that while it's possible to dynamically add and remove columns from databases, the internal storage mechanism of the database doesn't necessarily expect you to be adding and removing these columns continuously. But I don't think this is very relevant compared to the points above, which boil down to: does it work?
I've seen this approach before, and it presented loads of performance issues once the data volume grew. The kind of problems you'll encounter come when you need to return multiple items and use multiple criteria in your WHERE clause. You join back and forth between Ship and ShipTextDetail to get all your select columns - maybe you have to do that 10 or 20 times? You then do the same for your criteria, maybe 2 or 3 times. Now you have a query with so many joins it runs really slowly. Next you 'pre-cook' some of the data to improve performance, i.e. you drag common data out into a fixed table structure - and you've returned to a semi-normalised model.
My recommendation would be this: you know the information for 174 fields; those are your core attributes. Your customer may add to that list, and may change the descriptions of the fields, but it's still a really good starting point. Create a proper data model based around those, and then build in an extensibility mechanism, as you have already done, but only for the new fields. The metadata - the descriptions of the fields - can reside in another table, or potentially in a resource file (useful for internationalisation?), and that gives some flexibility for existing fields.
I agree with Joe: you may not have problems if your DB is small, i.e. fewer than 1000 ships with simple selects, although with 174 attributes to choose from this doesn't appear likely. I think you should change some of the 'obvious' fields first, i.e. I'd assume you have a Ship.Name, Ship.Owner, Ship.Weight, Ship.Registration ...
Good Luck.
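As a minimal sketch of that hybrid model (core columns beyond ID and IMO are assumptions, taken from the 'obvious' fields guessed above):

-- Core attributes become real, typed, indexable columns...
CREATE TABLE Ship (
    ID           smallint IDENTITY(1,1) PRIMARY KEY,
    IMO          int           NOT NULL,
    Name         nvarchar(100) NOT NULL,
    Owner        nvarchar(100) NULL,
    Registration nvarchar(50)  NULL
    -- ...plus the rest of the 174 known details as typed columns
);

-- ...and only fields added later by the customer go through a
-- generic detail table like the ones already in place.
CREATE TABLE ShipExtendedDetail (
    ID                smallint IDENTITY(1,1) PRIMARY KEY,
    Ship_ID           smallint NOT NULL REFERENCES Ship (ID),
    ShipDetailType_ID smallint NOT NULL,   -- metadata describing the new field
    Value             nvarchar(500) NULL
);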
I've done similar things, but there are a couple of problems with this specific implementation:
You are storing numbers, booleans, dates, etc. as strings, which might be less than ideal. An alternative is to implement separate classes (inheriting from a base) for the different data types and then store them in tables made for their data type (see the sketch after this answer).
Do the properties that you track change very frequently? Are they a different set per tanker? If not, it might be better to make objects rather than property bags to store all the data. Those objects can then be persisted to the database.
From a performance standpoint, either approach will be fine. How many ships could there possibly be? All the data is going to fit into RAM on any server.
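As a minimal sketch of the per-type tables mentioned in the first point (modeled on the ShipTextDetail table from the question; names and types are assumptions):

-- Same shape as ShipTextDetail, but with properly typed value columns.
CREATE TABLE ShipNumberDetail (
    ID                smallint IDENTITY(1,1) PRIMARY KEY,
    Ship_ID           smallint NOT NULL,
    ShipDetailType_ID smallint NOT NULL,
    NumberValue       decimal(18,6) NOT NULL
);
CREATE TABLE ShipDateDetail (
    ID                smallint IDENTITY(1,1) PRIMARY KEY,
    Ship_ID           smallint NOT NULL,
    ShipDetailType_ID smallint NOT NULL,
    DateValue         datetime NOT NULL
);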

What are the best practices in DB design when I want to store a value that is either selected from a dropdown list or user-entered?

I am trying to find the best way to design the database in order to allow the following scenario:
The user is presented with a dropdown list of universities (for example).
The user selects his/her university from the list if it exists.
If the university does not exist, he should enter his own university in a text box (sort of like Other: [___________]).
How should I design the database to handle this situation, given that I might want to sort by university ID, for example (probably only for the built-in universities and not the ones entered by users)?
Thanks!
I just want to make it similar to how Facebook handles this situation. If the user selects his education (by actually typing in the combobox, which is not my concern) and chooses one of the returned values, what would Facebook do?
My guess is that it would insert the UserID and the EducationID into a many-to-many table. Now what if what the user entered is not in the database at all? It is still stored in his profile, but where?
CREATE TABLE university
(
    id smallint NOT NULL,
    name text,
    public smallint,
    CONSTRAINT university_pk PRIMARY KEY (id)
);
CREATE TABLE person
(
    id smallint NOT NULL,
    university smallint,
    -- more columns here...
    CONSTRAINT person_pk PRIMARY KEY (id),
    CONSTRAINT person_university_fk FOREIGN KEY (university)
        REFERENCES university (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
public is set to 1 for the unis in the system, and 0 for user-entered unis.
You could cheat: if you're not worried about the referential integrity of this field (i.e. it's just there to show up in a user's profile and isn't required for strictly enforced business rules), store it as a simple VARCHAR column.
For your dropdown, use a query like:
SELECT DISTINCT University FROM Profiles
If you want to filter out typos or one-offs, try:
SELECT University FROM Profiles
GROUP BY University
HAVING COUNT(University) > 10 -- where 10 is an arbitrary threshold you can tweak
We use this code in one of our databases for storing the trade descriptions of contractor companies; since this is informational only (there's a separate "Category" field for enforcing business rules) it's an acceptable solution.
Keep a flag on the rows entered through user input, in the same table where you have your other data points. Then you can sort using the flag.
One way this was solved at a previous company I worked at:
Create two columns in your table:
1) a nullable id of the system-supplied string (stored in a separate table)
2) the user supplied string
Only one of these is populated. A constraint can enforce this (and, additionally, that at least one of the columns is populated, if appropriate).
It should be noted that the problem we were solving with this was a true "Other:" situation - a textual description of an item with some preset defaults. Your situation sounds like an actual entity that isn't in the list, such that more than one user might want to input the same university.
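As a minimal sketch of those two columns and the constraint (table and column names are assumptions):

CREATE TABLE profile_university (
    person_id        int NOT NULL,
    university_id    int NULL,             -- FK to the system-supplied list
    university_other nvarchar(200) NULL,   -- the user-supplied string
    -- exactly one of the two must be populated:
    CONSTRAINT ck_one_university CHECK (
        (university_id IS NOT NULL AND university_other IS NULL) OR
        (university_id IS NULL AND university_other IS NOT NULL)
    )
);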
This isn't a database design issue. It's a UI issue.
The dropdown list of universities is based on rows in a table. That table must have a new row inserted when the user types a new university into the text box.
If you want to separate the list you provided from the ones added by users, you can have a column in the university table with the origin (or provenance) of the data.
I'm not sure if the question is very clear here.
I've done this quite a few times at work, where the user picks from either the dropdown list or a text box. If the data is entered in the text box, then I first insert it into the database and then use SCOPE_IDENTITY() to get the unique identifier of that inserted row for further queries.
INSERT INTO MyTable (Name) VALUES ('myval'); SELECT SCOPE_IDENTITY();
This is against MS SQL 2008, though; I'm not sure whether the SCOPE_IDENTITY() function exists in other versions of SQL Server, but I'm sure there are equivalents.
