I need to store entries with a schema like (ID: int, description: varchar, updatetime: DateTime), where ID is the unique primary key. The usage scenario is: I will frequently insert new entries, frequently query entries by ID, and less frequently remove expired entries (by the updatetime field, using a SQL Job run daily so the database does not grow forever). Each entry is about 0.5 KB in size.
My question is: how should I optimize the database schema design (e.g. which indexes to add, transaction/lock levels, or other options) in this scenario to improve performance? Currently I plan to store all the information in a single table, but I am not sure whether that is the best option.
BTW: I am using SQL Server 2005/2008.
Thanks in advance,
George
In addition to your primary key, just add an index on updatetime.
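For example (assuming the table is called Entries, and a 30-day retention window just for illustration):
CREATE NONCLUSTERED INDEX IX_Entries_updatetime ON Entries (updatetime);
-- The daily cleanup job can then seek on this index instead of scanning the table:
DELETE FROM Entries WHERE updatetime < DATEADD(DAY, -30, GETDATE());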
Your decision to store everything in a single table needs to be reviewed. There are very few subject matters that can really be well modeled by just one table.
The problems that arise from using just one table are usually less obvious than the problems that arise from not creating the right indexes and things like that.
I'm interested in the "description" column (field). Do all descriptions describe the same kind of thing? Do you ever retrieve sets of descriptions, rather than just one description at a time? How do you group descriptions into sets?
How do you know the ID for the description you are trying to retrieve? Do you store copies of the ID in some other place, in order to reference which ones you want?
Do you know what a "foreign key" is? Was your choice not to include any foreign keys in this table deliberate?
These are some of the questions that need to be answered before you can know whether a single table design really suits your case.
Your ID is your primary key, so it automatically has an index.
You can add another index for the expiration date. Indexes help with searching but decrease performance when inserting, deleting, and updating. Anyway, one extra index is not an issue.
It sounds somewhat strange to me - I am not saying it is an error - that you have ALL the information in one table. Re-think that point and see if you can refactor something.
It sounds as simple as it gets, except for possibly adding an index on updatetime as OMax suggested (which I recommend).
If you would also like to fetch items by description, you should also consider a text index or full-text index on that column.
Other than that - you're ready to go :)
I have a table user with multiple columns, and every user has a unique userid.
Because it is unique, I don't have to specify a clustering key unless I want to use that column in queries. Is this bad, given that every partition consists of a single row? If it is bad for whatever reason, what is the best practice in this case?
Thank you for your help!
Edit: If I have a query that needs to return all usernames, how can I do that with good performance? Doing it from this table does not seem very efficient to me; should I make another table where I simply duplicate all usernames in a collection? Then they are all in one place and the read doesn't have to jump across multiple nodes.
I just answered a similar question. Short story: it really depends on the access patterns and table settings. You may need to tune the table parameters to get the best performance, but the right settings depend on the amount of data and other requirements.
There are always two (main) considerations when defining your primary keys in Cassandra:
Data distribution
Query pattern match
From a data distribution standpoint, you can't get much better than using a unique key as the partition key. The more distinct values there are, the more evenly they should hash out, and thus the more evenly the rows will be distributed.
However, a key which distributes well but doesn't fit the desired query pattern, is pretty useless.
tl;dr;
If that unique key is all you'll ever query the table by, then it makes a great choice for a partition key.
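A quick CQL sketch of what that looks like (table and column names are purely illustrative):
-- Partition key = the unique userid: rows hash out evenly across the cluster,
-- and "fetch this user by id" hits exactly one partition.
CREATE TABLE users (
    userid   uuid,
    username text,
    email    text,
    PRIMARY KEY (userid)
);
For the "all usernames" query from your edit, the usual Cassandra approach is a second table laid out for that query (duplicating the usernames), rather than scanning this one.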
We are working on a project at the moment and have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field to each table in the database and set it to '1' if a particular user role hits a delete button on a specific record.
For future maintenance, every SELECT query will now need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECTs, just get rows WHERE deleted_at IS NULL.
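Something along these lines (SQL Server syntax; the users table and @user_id parameter are just placeholders):
ALTER TABLE users ADD deleted_at DATETIME NULL;
-- "Delete" a row by stamping it instead of removing it:
UPDATE users SET deleted_at = GETDATE() WHERE user_id = @user_id;
-- Normal reads only see live rows:
SELECT * FROM users WHERE deleted_at IS NULL;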
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
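For example (a hypothetical products table):
CREATE VIEW active_products AS
SELECT * FROM products WHERE is_deleted = '0';
The application then selects from active_products and never has to remember the filter.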
Having an is_deleted column is a reasonably good approach.
If this is Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Deleted and non-deleted rows will then physically live in different partitions, though to you it will be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform 'partition pruning' and only look into the appropriate partition. Internally each partition behaves like a separate table, but that is transparent to you as a user: you can still select across the entire table whether it is partitioned or not, yet Oracle will query ONLY the partition it needs. For example, assume you have 1,000 rows with is_deleted = 0 and 100,000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include the condition
WHERE ... AND IS_DELETED=0
then Oracle will scan ONLY the partition with 1,000 rows. If the table weren't partitioned, it would have to scan all 101,000 rows.
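As a rough sketch (Oracle syntax, illustrative names):
CREATE TABLE orders (
    order_id   NUMBER PRIMARY KEY,
    customer   VARCHAR2(100),
    is_deleted NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_live    VALUES (0),
    PARTITION p_deleted VALUES (1)
);
-- Soft-deleting updates the partition key, so rows must be allowed to move between partitions:
ALTER TABLE orders ENABLE ROW MOVEMENT;
-- This query is pruned down to the p_live partition only:
SELECT * FROM orders WHERE is_deleted = 0;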
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table that has additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the *AND deleted_at IS NULL* condition across all your queries (and all of your developers).
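Technique 2 in SQL Server terms might look roughly like this (illustrative names):
ALTER TABLE users
    ADD CONSTRAINT UQ_users_login_deleted_at UNIQUE (login, deleted_at);
-- SQL Server treats NULL as a duplicate inside a unique constraint, so at most one
-- "live" row (deleted_at IS NULL) per login can exist, while soft-deleted rows
-- with distinct deleted_at values are all allowed.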
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding WHERE deleted = 0 to all your queries will slow them down significantly and hinder the use of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted = 0, mitigating some of the negatives of this particular approach.
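For example, a filtered index in SQL Server 2008 (the key and INCLUDE columns here are only placeholders):
CREATE NONCLUSTERED INDEX IX_orders_live
    ON orders (customer_id, order_date)
    INCLUDE (total)
    WHERE is_deleted = 0;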
Something that I use on projects is a statusInd TINYINT NOT NULL DEFAULT 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
This scales well and is data-centric, keeping the data footprint pretty small - key for 350 GB+ databases with real-time concerns. The alternatives (extra tables, triggers) have overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a single field, but this may help in your case.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different states, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all privileges on your data schema.
Implement VPD (Virtual Private Database) on your new schema so that each and every query automatically gets a predicate appended that allows selection of only the non-deleted rows.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
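A very rough sketch of that setup, just to show the shape of the DBMS_RLS API (all schema, table, and function names here are made up):
-- Policy function: returns the predicate Oracle appends to every query.
CREATE OR REPLACE FUNCTION hide_deleted_rows (
    p_schema IN VARCHAR2,
    p_object IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
    RETURN 'is_deleted = 0';
END;
/
BEGIN
    DBMS_RLS.ADD_POLICY(
        object_schema   => 'APP_DATA',
        object_name     => 'ORDERS',
        policy_name     => 'orders_hide_deleted',
        function_schema => 'APP_SEC',
        policy_function => 'HIDE_DELETED_ROWS',
        statement_types => 'SELECT');
END;
/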
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete
To start off, I'm not that great with database strategies, so I don't know really how to even approach this.
What I want to do is store some info in a database. Essentially the data is going to look like this
SensorNumber (int)
Reading (int)
Timestamp (Datetime?)(I just want to track down to the minute, nothing further is needed)
The only thing about this is that over a few months of tracking I'm going to have millions of rows (~5 million rows).
I really only care about searching by Timestamp and/or SensorNumber. The data in here is pretty much going to be never edited (insert once, read many times).
How should I go about building this? Is there anything special I should do other than creating the table and creating the one index for SensorNumber and Temp?
Based on your comment, I would put a clustered index on (Sensor, Timestamp).
This will always cover when you want to search for SENSOR alone, but will also cover both fields checked in combination.
If you want to ever search for Timestamp alone, you can add a nonclustered index there as well.
One issue you will have with this design is the need to rebuild the table since you are going to be inserting rows non-sequentially - the new rows won't always belong at the end of the index.
Also, please do not name a field timestamp - this is a keyword in SQL Server and can cause you all kinds of issues if you don't delimit it everywhere.
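For instance (SQL Server syntax; the time column is named ReadingTime here to avoid the keyword issue just mentioned):
CREATE TABLE SensorReading (
    SensorNumber INT      NOT NULL,
    Reading      INT      NOT NULL,
    ReadingTime  DATETIME NOT NULL
);
CREATE CLUSTERED INDEX CIX_SensorReading
    ON SensorReading (SensorNumber, ReadingTime);
-- Only if you ever filter on time alone:
CREATE NONCLUSTERED INDEX IX_SensorReading_ReadingTime
    ON SensorReading (ReadingTime);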
You definitely want to use a SQL-Server "clustered index" for the most selective data you're likely to search on.
Here's more info:
http://www.sql-server-performance.com/2007/clustered-indexes/
http://odetocode.com/articles/70.aspx
http://www.sql-server-performance.com/2002/index-not-equal/
ELABORATION:
"Sensor" would be a poor choice - you're likely to have few sensors, many rows. This would not be a discriminating index.
"Time" would be discriminating... but it would also be a poor choice. Because the time itself, independent of sensor, temperature, etc, is probably meaningless to your query.
A clustered index on "sensor,time" might be ideal. Or maybe not - it depends on what you're after.
Please review the above links.
PS:
Please, too, consider using "datetime" instead of "timestamp". They're two completely different types under MSSQL ... and "datetime" is arguably the better, more flexible choice:
http://www.sqlteam.com/article/timestamps-vs-datetime-data-types
I agree with using a clustered index; you are almost certainly going to end up with one anyway, so it's better to define it yourself.
A clustered index determines the order in which the data is stored; adding to the end is cheaper than inserting into the middle.
Think of a deck of cards you are trying to keep in rank order as you add cards. If the highest rank is an 8, adding a 9 is trivial - put it on top.
If you add a 5, it gets more complex, you have to work out where to put it and then insert it.
So adding items with a clustered index in order is optimal.
Given that, I would suggest having a clustered index on (Timestamp, Sensor).
Clustering on (Sensor, Timestamp) will cause a LOT of changes to the physical ordering of the data, which is very expensive (even on SSDs).
If the (Timestamp, Sensor) combination is unique, then define the index as UNIQUE; otherwise SQL Server will add a hidden uniquifier to the index to resolve duplicates.
Primary keys are automatically unique, almost all tables should have a primary key.
If (Timestamp,Sensor) is not unique, or you want to reference this data from another table, consider using an identity column as the clustered Primary Key.
Good Luck!
Is there any benefit to storing content alphabetically in columns? Maybe to make lookups faster? If yes, then when I add new lookup values to my tables, do I need to rebuild the PK so the new text fits in? Say a table like this:
City_tbl
city_id: example: 1120
City_name: example: New York.
If I need to add Chicago to it, do I add it at the bottom of the list with the next ID (which may be 2000), or do I insert it after the right city in alphabetical order, which would mean incrementing the PK ID of all following rows by 1?
The only benefit I know about is that when I have to manually add lookup values without querying the database, I can quickly check the lookup value list for existing items. But I'm not sure whether it might make lookups faster if the system knows the text is in alphabetical order.
No, I see no value in it. Better to use a proper primary key and add an index to the column. The people who have spent years writing relational databases know how to optimize access far better than you do.
I'd make the PK column auto-increment, leaving the numbering to the database. I'd add an index to the city name column so you can search by name as quickly as possible.
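Something like this (SQL Server syntax, using the names from your example):
CREATE TABLE City_tbl (
    city_id   INT IDENTITY(1,1) PRIMARY KEY,  -- the database hands out the next ID
    city_name VARCHAR(100) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_City_tbl_city_name ON City_tbl (city_name);
-- Chicago simply gets the next ID; no existing keys are renumbered:
INSERT INTO City_tbl (city_name) VALUES ('Chicago');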
You're presuming that you understand something about the physical storage of the database. At best, your efforts will have no effect; at worst, you'll screw up the fast access that a properly indexed b-tree will already give you.
Because I have a poor memory, I want to write a simple application to store table column information, especially the meaning of table columns. Now I have a problem with my table design. I plan to make the table have the following columns:
id, table_name, column_name, data_type, is_pk, meaning
But this design can't express foreign key relationships. For example, table1.column1, table2.column3, and table8.column5 all have the same data type and the same meaning; how can I modify my table design to express this information (relationship)?
Great thanks!
PS:
In fact, I'm currently working on a legacy application. The database is poorly designed: the foreign key relationships are not expressed in the database layer but in the application layer. My boss does not allow us to modify the database; we just need to make the application work. So I can't do this work in the database directly.
Depending on your DBMS, you could probably use comments on the table / column to record the meaning of each one of those columns. Most DBMS allow you to perform some kind of annotation.
If you must have it in your table, you have a few choices.
Free text: if this is just to serve as a memory aid, it doesn't really need to be machine readable. This makes it easier for you to read / use directly.
fk_id: store the ID of the field this foreign key maps onto. You could then define a view that pulls in the meaning column from that foreign key.
Meaning table: store the meaning as an ID into a separate table and use a view to make it easier to work with.
Create a document: keep it in a document instead. That way you can print it out and have it handy.
You could try designing a fully de-normalized schema for this, but I'd argue that's seriously over-thinking something that's just meant as a memory aid.
I would just add a column "FK_Column_ID" to your design that holds a reference to the column ID in the case of an FK constraint.
The other way would be to create a duplicate of your DB as DBDefinitions or something like that.
Almost all DBMS allow you to attach descriptions or comments to table, index, and column definitions.
Oracle:
COMMENT ON COLUMN employees.job_id IS 'abbreviated job title';
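The SQL Server equivalent, if that is what you are working in, is an extended property (the description text here is just an example):
EXEC sp_addextendedproperty
    @name = N'MS_Description',
    @value = N'abbreviated job title',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'employees',
    @level2type = N'COLUMN', @level2name = N'job_id';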
If you specify foreign key relationships as part of the schema, the database will keep track of them, and will display them for you.
It is not possible to define a compound foreign key relationship with a single additional column. I would suggest that you create a second table to define the foreign keys, perhaps with the following columns:
id, fk_name, primary_table_id, foreign_table_id
and add a fk_id column to relate the fields used in the foreign key relationship. This works for both the single column foreign key and the compound foreign key.
Alternatively, and with some attempt at diplomacy, tell your boss that if you can't fix the root cause of an issue, then the time required to complete the project will be much longer than expected. First you will take some time to implement a work around which will not perform adequately, then you will take more time to implement the fix you should have implemented in the first place (which in this case is fixing the database.)
If you're not allowed to edit the database, then presumably you're creating this in another standalone DBMS. I don't think this is something you can achieve simply, and you may well be better off just writing it up in a text document.
I think that you need more than one table. If you create a table of tables:
id, table_name, meaning
And then a table of columns:
id, column_name, datatype, meaning
You can then create a link table:
table_id, column_id, is_pk, meaning
This will enable you to have the same column linked to more than one table - thus expressing your foreign keys. As I said above though, it may be more effort than it's worth.
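Expressed as DDL, the sketch might look like this (column names and sizes are arbitrary):
CREATE TABLE doc_table (
    id         INT PRIMARY KEY,
    table_name VARCHAR(128) NOT NULL,
    meaning    VARCHAR(500)
);
CREATE TABLE doc_column (
    id          INT PRIMARY KEY,
    column_name VARCHAR(128) NOT NULL,
    data_type   VARCHAR(50),
    meaning     VARCHAR(500)
);
CREATE TABLE doc_table_column (
    table_id  INT NOT NULL REFERENCES doc_table (id),
    column_id INT NOT NULL REFERENCES doc_column (id),
    is_pk     BIT NOT NULL DEFAULT 0,
    meaning   VARCHAR(500),
    PRIMARY KEY (table_id, column_id)
);
A row in doc_column being referenced from several rows in doc_table_column is what expresses the shared/foreign-key relationship.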
FWIW, I do this quite often and the best "simple application" I've found is a spreadsheet.
I use a page for table/column defs, and extra pages as I need them for things like FK relationships, lookup values etc.
One of the great things about a spreadsheet for this app, is adding columns to the sheet as you need them, & removing them when you don't.
The indexing ability of a spreadsheet is also v. useful when you have a large number of tables to work with.
I know this does not answer your question directly, but how about using a database diagram?
I also have a poor memory (age I guess) and I always have an up to date diagram on my wall.
You can show all the tables, fields and foreign keys and also add comments.
I use PowerAMC (aka PowerDesigner, from Sybase) as the database designer; it also generates the SQL script to create the database, which is perhaps not very useful for legacy databases, although it will reverse-engineer the database and create the diagram automatically (it can take some time to make the diagram readable).
I don't see a reason why you should implement an app just to store this info. You could just as well use something like OneNote or any other available organizer, development wiki, etc.: there are tons of ways to store info so that it comes in handy when you look it up in the future.
If you can make some internal changes, you can rename key constraints to a readable pattern, like table1_colName_table2_colName.
And at the very least you can make a diagram, whether hand-drawn or using some design application.
If all this doesn't solve your problem, some more details are needed on what exactly you need to solve :)