SQL: Joining views on computed columns vs performance - sql-server

I have some SQL tables whose primary key consists of multiple columns. I created a view on these
tables and added a computed column that is a concatenation of the primary key columns, joined by a separator (for example, ColumnA$ColumnB$ColumnC is the concatenation of columns A, B and C, which make up the table's key).
When I use this view, I filter on the computed column to work with the primary key.
In another case I have a query that joins several such views. The foreign key on each view is computed the same way as the primary key, and the joins are on the computed columns.
The goal of this work is to simplify the key in order to simplify integration with other software.
Could this scenario significantly affect performance?
Thanks in advance
Luca

A better idea would be to keep these columns separate, just as you have them natively in your tables; then you can create your index/PK on all 3 columns rather than on a single concatenated one. For performance, I would probably suggest an indexed view here. Alternatively, if we are talking about 3 string columns, you could use a hashing technique, as long as you can handle the (extremely rare) hash-collision case on your application end.
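To illustrate the first suggestion, here is a minimal sketch; the table names, column names and datatypes (dbo.Parent, dbo.Child, ColumnA/B/C) are all invented. The point is to keep the three key columns separate and join on them directly rather than on a concatenated string.

-- Illustrative tables only; names and datatypes are assumptions.
CREATE TABLE dbo.Parent (
    ColumnA int NOT NULL,
    ColumnB int NOT NULL,
    ColumnC int NOT NULL,
    CONSTRAINT PK_Parent PRIMARY KEY (ColumnA, ColumnB, ColumnC)
);

CREATE TABLE dbo.Child (
    ChildId int NOT NULL PRIMARY KEY,
    ColumnA int NOT NULL,
    ColumnB int NOT NULL,
    ColumnC int NOT NULL,
    CONSTRAINT FK_Child_Parent FOREIGN KEY (ColumnA, ColumnB, ColumnC)
        REFERENCES dbo.Parent (ColumnA, ColumnB, ColumnC)
);

CREATE INDEX IX_Child_Key ON dbo.Child (ColumnA, ColumnB, ColumnC);

-- Joining on the native columns lets the optimizer seek on the composite indexes.
SELECT c.ChildId
FROM dbo.Child  AS c
JOIN dbo.Parent AS p
  ON  p.ColumnA = c.ColumnA
  AND p.ColumnB = c.ColumnB
  AND p.ColumnC = c.ColumnC;

-- Joining on an expression such as
--   CONCAT(p.ColumnA, '$', p.ColumnB, '$', p.ColumnC) = CONCAT(c.ColumnA, '$', c.ColumnB, '$', c.ColumnC)
-- has to be evaluated row by row and generally rules out index seeks, unless the
-- computed column is persisted and indexed on both sides.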

Related

Entity Framework shows the same row multiple times

This is weird: I am using an ASP.NET MVC application and Entity Framework to map a view from my database.
I don't know why, but the query returns the same rows multiple times (5 rows, 2 times each), while in the database the view shows me 10 distinct rows.
I don't understand what is going on.
Please help!
This is a well-known issue with views. Since a view (contrary to an actual table) in SQL Server doesn't have a defined primary key, EF will use all non-nullable columns as the primary key. These might be strings or other datatypes - and they might just really not make up a "good" primary key.
Now when EF reads the data, it comes across the first row in question, reads it into the dataset, and determines what the "substitute primary key" for that row is. When it then reads the next row from the database view, if the non-nullable columns are all the same, EF will interpret this as "this is the same row again" and it will NOT actually store the values from the database view, but it'll just use the row that it had just read before - since the primary key is the same, that's a valid approach.
How to solve this?
You can either explicitly define an EF-based primary key for your view entity that is in fact distinct for each row read,
or you can include the primary key columns of all the tables involved in the view - that way, the unique values from each table will be present in the view and will cause EF to properly recognize those distinct rows as distinct (see the sketch below).
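For the second option, here is a minimal sketch of a view that carries the primary key column of every joined table; the table and column names (dbo.[Order], dbo.Customer) are invented for illustration.

-- Because OrderId and CustomerId (the PKs of the underlying tables) are part of the
-- view's output, the key EF infers from the non-nullable columns is unique per row.
CREATE VIEW dbo.vOrderSummary
AS
SELECT
    o.OrderId,       -- PK of dbo.[Order]
    c.CustomerId,    -- PK of dbo.Customer
    c.CustomerName,
    o.OrderDate
FROM dbo.[Order]  AS o
JOIN dbo.Customer AS c ON c.CustomerId = o.CustomerId;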

Minimum number of candidate keys for a relation?

My question is, is it necessary for a relation/table in a database to have a candidate key and hence a primary key? Is it possible to have a relation where a row cannot be uniquely identified by any combination of attributes?
If not, why? And if so, how does a DBMS make operations like search, delete, etc. efficient?
Relations always have distinct tuples which means that in a Relational DBMS a table always has at least one candidate key.
SQL is a different case. SQL tables are "tuple bags", not relations. SQL tables can have duplicate rows, which is one of SQL's biggest flaws. Despite the fact that SQL supports duplicate rows the language is ill-suited to cope with them. In the presence of duplicate rows the SQL standard UPDATE and DELETE for instance have no guaranteed way to reference individual rows without resorting to some complex cursor-based operations.
Consequent problems of duplicate rows are certain inefficiencies and complexities of SQL DBMSs and a lack of orthogonality in their features. SQL DBMS engines have to use internal structures and support special features as a prerequisite in order to deal with duplicate rows. Some DBMS vendors try to get around the difficulties by disabling certain features for tables that don't have keys.
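A small illustration of that point, using an invented keyless table:

-- A table without any key can hold duplicate rows (names are illustrative).
CREATE TABLE dbo.Payments (Amount decimal(10,2) NOT NULL, Payee varchar(50) NOT NULL);

INSERT INTO dbo.Payments (Amount, Payee)
VALUES (100.00, 'Acme'), (100.00, 'Acme');   -- two identical rows

-- A plain DELETE cannot target just one of the two copies; this would remove both:
-- DELETE FROM dbo.Payments WHERE Amount = 100.00 AND Payee = 'Acme';

-- Removing a single, arbitrary copy needs a workaround such as TOP:
DELETE TOP (1) FROM dbo.Payments
WHERE Amount = 100.00 AND Payee = 'Acme';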
A database does not require a primary key. A table is just an unordered set of rows. Without any indexes, the only mechanism for accessing rows in a table is a full table scan (or a full partition scan, if the table is partitioned). Such operations are only efficient for very small numbers of rows.
Tables are more useful when you can refer to particular rows. Often, the best primary keys are auto incremented/identity primary keys. These are maintained by the database. In practice, all tables in a well-designed database are going to have primary keys. Here are three reasons:
Rows can be referred to by other tables.
Individual rows can be updated and deleted.
Individual rows can be selected efficiently and unambiguously.
Note: you can have indexes on a table without a primary key, and combinations of one or more columns can be made unique even if the combination is not a primary key. The reverse is not true: a primary key is itself backed by an index, so you cannot have a primary key without an index. Also, all rows in a table have unique "row addresses"; whether or not these are available for queries depends on the database engine.
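For example (a sketch with invented names), SQL Server will happily index a table that has no primary key at all:

-- No primary key declared anywhere on this table.
CREATE TABLE dbo.EventLog (
    LoggedAt datetime2    NOT NULL,
    Source   varchar(50)  NOT NULL,
    Message  varchar(400) NULL
);

-- A plain index for lookups...
CREATE INDEX IX_EventLog_LoggedAt ON dbo.EventLog (LoggedAt);

-- ...and uniqueness enforced without a primary key.
CREATE UNIQUE INDEX UX_EventLog_Source_LoggedAt ON dbo.EventLog (Source, LoggedAt);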
Yes, this is possible.
Just note that some identifier does exist behind the scenes (example from SQL Server):
When a table is stored as a heap, individual rows are identified by reference to a row identifier (RID) consisting of the file number, data page number, and slot on the page.
How will operations be performed?
A table scan will be needed for almost any operation:
If a table is a heap and does not have any nonclustered indexes, then the entire table must be examined (a table scan) to find any row.

Redundant DB column for indexing

I'm defining a few database tables, roughly looking like this:
In order to quickly run a query where a Person's MailMessages are retrieved in time order, regardless of what MailAccount they were sent to, I want an index for the MailMessage table, sorted by (PersonId, ReceivedTime). That means adding a redundant PersonId column to the MailMessage table, like this:
...or does it? Is there any neater way of doing this? If not, is the best practice to make PersonId a foreign key in the MailMessage table, or should this not be done, as it's conceptually not a foreign key but rather just a column used for the (PersonId, ReceivedTime) index?
Yes you could do that, but it would require having a key in table MailAccount on {MailAccountId, PersonId}, so it can be referenced by the FK in table MailMessage. From the perspective of enforcing uniqueness, this is redundant, since {MailAccountId} alone is already unique.
There is an alternative: use identifying relationships and natural keys. For example:
This achieves essentially the same goal, but with just one key (and the underlying index) per table.
Note the order of PK fields in the bottom table: it allows a query...
SELECT *
FROM MailMessage
WHERE PersonId = ?
ORDER BY ReceivedTime
...to be satisfied by an index range scan on the primary index. And if the table happens to be clustered, the DBMS won't even have to access the table heap after that (there is no table heap at all - rows are stored directly in the B-Tree).
Avoidance of JOINs without resorting to redundant keys (which is also good for clustering) is one of the pros of natural keys versus surrogate keys. As you can imagine, the list of pros and cons does not end there.
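The original answer illustrated the identifying-relationship design with a diagram, so the DDL below is only one plausible reconstruction; the datatypes and the MessageNo tiebreaker column are assumptions.

CREATE TABLE dbo.Person (
    PersonId int NOT NULL,
    CONSTRAINT PK_Person PRIMARY KEY (PersonId)
);

CREATE TABLE dbo.MailAccount (
    PersonId      int NOT NULL,
    MailAccountId int NOT NULL,
    CONSTRAINT PK_MailAccount PRIMARY KEY (PersonId, MailAccountId),
    CONSTRAINT FK_MailAccount_Person FOREIGN KEY (PersonId)
        REFERENCES dbo.Person (PersonId)
);

CREATE TABLE dbo.MailMessage (
    PersonId      int       NOT NULL,
    MailAccountId int       NOT NULL,
    ReceivedTime  datetime2 NOT NULL,
    MessageNo     int       NOT NULL,   -- invented tiebreaker to keep the key unique
    CONSTRAINT PK_MailMessage PRIMARY KEY (PersonId, ReceivedTime, MailAccountId, MessageNo),
    CONSTRAINT FK_MailMessage_MailAccount FOREIGN KEY (PersonId, MailAccountId)
        REFERENCES dbo.MailAccount (PersonId, MailAccountId)
);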
What you are doing is called denormalization. A full discussion of the pros and cons of this concept is a bit much for SO.
This type of optimization is also possible using a Materialized View (called an Indexed View in SQL Server).
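Assuming the question's original surrogate-key layout (MailAccountId as the key of MailAccount with a PersonId foreign key, and a MailMessageId key plus MailAccountId foreign key in MailMessage - all names assumed), an indexed-view sketch might look like this:

-- The view pre-joins the two tables so (PersonId, ReceivedTime) can be indexed
-- without adding a redundant PersonId column to MailMessage.
CREATE VIEW dbo.vPersonMessage
WITH SCHEMABINDING
AS
SELECT m.MailMessageId, a.PersonId, m.ReceivedTime
FROM dbo.MailMessage AS m
JOIN dbo.MailAccount AS a ON a.MailAccountId = m.MailAccountId;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vPersonMessage
    ON dbo.vPersonMessage (PersonId, ReceivedTime, MailMessageId);

-- On non-Enterprise editions, query the view WITH (NOEXPAND) so the optimizer uses this index.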

SQL Server 2008 - Database Design Query

I have to load the data shown in the below image into my database.
For a particular row, either PartID will be NULL or GroupID will be NULL, and the other available columns refer to the non-NULL entity. I have the following three options:
To use one database table with one unified column, say ID, which will hold both PartID and GroupID data. But in this case I won't be able to apply a foreign key constraint, as this column will contain both entities' data.
To use one database table with columns for both PartID and GroupID, which will contain the respective data. For each row, one of them will be NULL, but in this case I will be able to apply foreign key constraints.
To use two database tables with a similar structure, the only difference being the PartID and GroupID columns. In this case I will be able to apply foreign key constraints.
One thing to note here is that the table(s) will be used in import processes to load about 30,000 rows in one go and will also be heavily used in data retrieval operations. Also, the other columns will be used as pivot columns.
Can someone please suggest what the best approach would be?
I would use option 2 and add a constraint that only one can be non-null and the other must be null (just to be safe). I would not use option 1 because of the lack of a FK and the possibility of linking to the wrong table when not obeying the type identifier in the join.
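A sketch of that constraint; the table name and the RowId column are invented, and dbo.Part and dbo.[Group] are assumed to exist with matching key columns.

CREATE TABLE dbo.ItemData (
    RowId   int IDENTITY(1,1) PRIMARY KEY,
    PartID  int NULL REFERENCES dbo.Part (PartID),        -- FK to the parts table
    GroupID int NULL REFERENCES dbo.[Group] (GroupID),    -- FK to the groups table
    -- ... other (pivot) columns ...
    CONSTRAINT CK_ItemData_ExactlyOne CHECK (
        (PartID IS NOT NULL AND GroupID IS NULL) OR
        (PartID IS NULL AND GroupID IS NOT NULL)
    )
);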
There is a 4th option, which is to normalize them as "items" with another (surrogate) key and two link tables which link items to either parts or groups. This eliminates NULLs. There are further problems with that approach (items might be in both again or neither without any simple constraint), so unless that is necessary for other reasons, I wouldn't generally go down that path.
Option 3 could be fine - it really depends if these rows are a relation - i.e. data associated with a primary key. That's one huge problem I see with the data presented, the lack of a candidate key - I think you need to address that first.
IMO option 2 is the best - it's not perfectly normalized but will be the easiest to work with. 30K rows is not a lot of rows to import.
I would modify the table so it has one ID column and then add an IDType that is either "G" for Group or "P" for Part.

When having an identity column is not a good idea?

In tables where you need only one column as the key, and values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate its values manually rather than use an autogenerated value for each record?
I guess it would be the case when there are lots of inserts and deletes to the table. Am I right? What other situations could there be?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent key collisions (see the sketch below).
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
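A sketch of the GUID approach from the first point above (all names invented); NEWSEQUENTIALID() keeps the generated values roughly ordered, which is kinder to a clustered index than NEWID():

CREATE TABLE dbo.Customer (
    CustomerId uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_Id DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Customer PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);

-- Rows inserted in different databases get non-colliding keys,
-- so the data can later be merged without regenerating keys.
INSERT INTO dbo.Customer (Name) VALUES (N'Contoso');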
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need IDENTITY_INSERT switched on for that table - this is not intended to be on during normal operation, and it can only be ON for one table at a time in a given session.
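For reference, the override looks like this:

SET IDENTITY_INSERT MyTable ON;

INSERT INTO MyTable (id, name) VALUES (1, 'Smith');

SET IDENTITY_INSERT MyTable OFF;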
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure: NodeID, Character, ParentID, etc., where NodeID was the primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.
