Is a unique column good as partition key in Cassandra?

I have a table user with multiple columns; every user has a unique userid.
Because it is unique, I don't have to specify a clustering key unless I want to use the column in queries. Is this bad, because every partition consists of a single row? If it is bad for whatever reason, what is the best practice in this case?
Thank you for your help!
Edit: If I have a query that needs to return all usernames, how can I do that with good performance? Doing it from this table doesn't seem very efficient to me; should I make another table where I simply duplicate all usernames in a collection? Then they are all in one place and the read doesn't have to jump across multiple nodes.

I just answered a similar question. Short story - it really depends on the access patterns and table settings. You may need to tune the table parameters to get the best performance, and the right settings may depend on the amount of data and other requirements.

There are always two (main) considerations when defining your primary keys in Cassandra:
Data distribution
Query pattern match
From a data distribution standpoint, you can't get much better than using a unique key as the partition key. The more unique values there are, the more evenly they should hash out, and thus the more evenly the data will be distributed.
However, a key which distributes well but doesn't fit the desired query pattern is pretty useless.
tl;dr;
If that unique key is all you'll ever query the table by, then it makes a great choice for a partition key.
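As a concrete illustration, here is a minimal CQL sketch; the table and column names are invented for the example, and the bucketed usernames table is just one hedged way to serve the "all usernames" query from the edit, at the cost of keeping two tables in sync:

    -- Hypothetical schema: userid alone is the partition key, so each
    -- partition holds exactly one row and hashes evenly across the cluster.
    CREATE TABLE users (
        userid   uuid PRIMARY KEY,   -- partition key; no clustering column needed
        username text,
        email    text
    );

    -- Lookups by the partition key hit a single partition:
    SELECT username, email FROM users WHERE userid = ?;

    -- Returning ALL usernames from users is a full cluster scan. One
    -- denormalization option: spread usernames over a few bucket partitions.
    CREATE TABLE usernames_by_bucket (
        bucket   int,                -- e.g. hash(userid) % 10, computed client-side
        username text,
        PRIMARY KEY (bucket, username)
    );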

Related

Best practice for storing millions of rows with TSQL (Sql Server 2008)

To start off, I'm not that great with database strategies, so I don't really know how to approach this.
What I want to do is store some info in a database. Essentially the data is going to look like this:
SensorNumber (int)
Reading (int)
Timestamp (datetime?) (I just want to track down to the minute; nothing finer is needed)
The only thing about this is that over a few months of tracking I'm going to have millions of rows (~5 million rows).
I really only care about searching by Timestamp and/or SensorNumber. The data in here is pretty much never going to be edited (insert once, read many times).
How should I go about building this? Is there anything special I should do other than create the table and create the one index for SensorNumber and Temp?
Based on your comment, I would put a clustered index on (Sensor, Timestamp).
This will cover searches on Sensor alone, but will also cover both fields queried in combination.
If you want to ever search for Timestamp alone, you can add a nonclustered index there as well.
One issue you will have with this design is the need to rebuild the index periodically, since you are going to be inserting rows non-sequentially - the new rows won't always belong at the end of the index.
Also, please do not name a field timestamp - this is a keyword in SQL Server and can cause you all kinds of issues if you don't delimit it everywhere.
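A minimal T-SQL sketch of this answer's suggestion; the table name and the ReadingTime column (renamed to avoid the timestamp keyword, per the advice above) are assumptions:

    -- Hypothetical table; smalldatetime gives one-minute accuracy,
    -- which matches the stated requirement.
    CREATE TABLE dbo.SensorReadings (
        SensorNumber int           NOT NULL,
        Reading      int           NOT NULL,
        ReadingTime  smalldatetime NOT NULL
    );

    -- Clustered index covering Sensor alone and (Sensor, Time) together:
    CREATE CLUSTERED INDEX CIX_SensorReadings
        ON dbo.SensorReadings (SensorNumber, ReadingTime);

    -- Only if you also search by time alone:
    CREATE NONCLUSTERED INDEX IX_SensorReadings_ReadingTime
        ON dbo.SensorReadings (ReadingTime);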
You definitely want to use a SQL-Server "clustered index" for the most selective data you're likely to search on.
Here's more info:
http://www.sql-server-performance.com/2007/clustered-indexes/
http://odetocode.com/articles/70.aspx
http://www.sql-server-performance.com/2002/index-not-equal/
ELABORATION:
"Sensor" would be a poor choice - you're likely to have few sensors, many rows. This would not be a discriminating index.
"Time" would be discriminating... but it would also be a poor choice. Because the time itself, independent of sensor, temperature, etc, is probably meaningless to your query.
A clustered index on "sensor,time" might be ideal. Or maybe not - it depends on what you're after.
Please review the above links.
PS:
Also, please consider using "datetime" instead of "timestamp". They're two completely different types in MSSQL ... and "datetime" is arguably the better, more flexible choice:
http://www.sqlteam.com/article/timestamps-vs-datetime-data-types
I agree with using a clustered index; you are almost certainly going to end up with one anyway, so it's better to define it yourself.
A clustered index determines the order in which the data is stored, and adding to the end is cheaper than inserting into the middle.
Think of a deck of cards you are trying to keep in rank order as you add cards. If the highest rank is an 8, adding a 9 is trivial - put it at the top.
If you add a 5, it gets more complex, you have to work out where to put it and then insert it.
So adding items with a clustered index in order is optimal.
Given that, I would suggest having a clustered index on (Timestamp, Sensor).
Clustering on (Sensor, Timestamp) will cause a LOT of changes to the physical ordering of data, which is very expensive (even on SSDs).
If the (Timestamp, Sensor) combination is unique, then define the index as UNIQUE; otherwise SQL Server will add a hidden uniquifier to the index to resolve duplicates.
Primary keys are automatically unique; almost all tables should have a primary key.
If (Timestamp, Sensor) is not unique, or you want to reference this data from another table, consider using an identity column as the clustered primary key.
Good Luck!
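A hedged sketch of this alternative design, reusing the assumed dbo.SensorReadings table from the earlier example (in place of its clustered index there):

    -- Insert-order clustering: new readings land at the end of the index.
    CREATE UNIQUE CLUSTERED INDEX CIX_SensorReadings
        ON dbo.SensorReadings (ReadingTime, SensorNumber);

    -- Alternative if the combination is not unique, or if other tables
    -- need to reference individual readings:
    -- ALTER TABLE dbo.SensorReadings
    --     ADD ReadingID int IDENTITY(1,1)
    --         CONSTRAINT PK_SensorReadings PRIMARY KEY CLUSTERED;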

Primary key question

Is there a benefit to having a single column primary key vs a composite primary key?
I have a table that consists of two id columns which together make up the primary key.
Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on its own?
Database Normalization nuts will tell you one thing.
I'm just going to offer my own opinion of what I've learned over the years... I stick an auto-incrementing ID field on every ($&(##$)# one of my tables. It makes life a million times easier in the long run to be able to single out a single row with impunity.
This is from a "down in the trenches" developer.
Single column keys are simple to write, simple to maintain, and simple to understand.
If you're going to have a huge number of rows - billions? - maybe saving a byte here and there will help.
But if you're not looking at extreme cases, optimizing for "simple" is often the best way to go.
If you are a coder and the database is nothing to you but a glorified object-store, then sure, by all means inject surrogate keys willy nilly. In fact go one better and just delegate all DB schema design and DB interaction to your favourite ORM and be done with it. Indeed, when I want a small or medium scale object-store, that's exactly what I do.
If you are approaching an information systems or information management problem, then it is a completely different story. When you start dealing with 10's (or more likely 100's) of millions of dirty records integrated from multiple sources, several or all of which are not under your control; at that point the seductive lure of an easy answer to the problems of 'identity' is a trap.
Yes, you sometimes still introduce a surrogate key internally to allow for concise FK relationships and improved cache efficiency on covering indices; but you gain those benefits at the cost of substantial pain in managing the natural-key/surrogate-key relationship.
In this case it will be important to make sure you don't allow the surrogate key to leak. Your public APIs at the business-logic layer should use the natural key; nothing above a document/record-cache should be aware of the existence of a surrogate key. Be aware that the cost of matching updates against the existing surrogate keys can be prohibitive, and a far larger scalability hit than the incremental cost of moving a few extra bytes per request over the internal network.
So in conclusion:
If the DB is just being used as an object-store: let the ORM worry about object identity, and there should almost certainly be a surrogate key.
If the DB is being used as a database: the introduction of a surrogate key is an engineering design decision with serious tradeoffs in both directions. The decision will need to be made on a case by case basis, with full recognition of the resulting costs to be accepted in exchange for the benefits gained either way.
Update
The 'convenience' of a surrogate key is really just the ability to punt on the question of identity. This is often necessary in a database, and reasonable in the caching layer as I allow, but beyond that it leads to brittle data designs. The problem is that identity is not something that has one correct answer. For non-trivial data-intensive systems you will routinely find yourself needing to work in terms of equivalence classes, rather than the reference identity that object-oriented programming lulls us into thinking is normal.
What it really comes down to is a realization that the whole concept of a 'primary key' is a fiction invented to help the relational model work efficiently; adopting a surrogate key cements that fiction and makes the whole system brittle and inflexible. Business logic needs to be able to provide its own definitions of equality - sometimes four copies of the same file need to be considered four files, and sometimes they should be considered indistinguishable from the original file. When you edit one of them, is that then a new file? The same file? The answer to both questions is of course "yes, when...". Working with natural keys provides this critical ability to work in terms of conceptual equivalence classes. If you let surrogate keys infect your business logic, you quickly lose this.
I have had to use multi-column primary keys in the past, and it became quite a nightmare very quickly.
If you have one table that references your first table, how does it contain that primary key? Now add another table that references only the second table but needs to find data in the first. Now another... on down the rabbit hole.
If you know that you will only have the one table, there's probably not an issue either way- use whichever represents your data better. But if you'll be using it in joins, you can lose performance pretty quickly.
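To make the "rabbit hole" concrete, here is a hedged SQL sketch with invented table names, showing how a composite key forces every referencing table to carry both columns:

    -- Parent has a two-column primary key.
    CREATE TABLE Parent (
        id1 int NOT NULL,
        id2 int NOT NULL,
        PRIMARY KEY (id1, id2)
    );

    -- Child must repeat both columns just to hold the reference.
    CREATE TABLE Child (
        child_id int PRIMARY KEY,
        id1      int NOT NULL,
        id2      int NOT NULL,
        FOREIGN KEY (id1, id2) REFERENCES Parent (id1, id2)
    );

    -- A table referencing Child that also needs Parent data must either
    -- join through Child or duplicate both columns yet again.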
Is there a benefit to having a single column primary key vs a composite primary key?
Yes. If the primary key also happens to be the clustered index, it is common that the clustered index is duplicated fully for each secondary index in the table. Therefore, having a fatter clustered index, which is what one would get with a composite, implies an increase in storage cost. Also, foreign references to this table would need to specify both fields to refer to a unique entry, which implies a further storage cost. There is also an arguably greater cost in development time because there is a slight increase in the complexity of the join.
On the other hand, depending on the distribution of the values of your two key fields, it may be the case that concurrent access to your table is greatly improved, because chronologically-successive inserts could occur on different physical pages; this could be the case, for example, if your fields are time-independent (and non-monotonic, unlike an auto-incrementer), like clientID or something like that. This could be significant for performance in a high-concurrency environment.
I have a table that consists of two id columns which together make up the primary key. Are there any disadvantages to this? Is there a compelling reason for me to throw in a third column that would be unique on its own?
If the most common way in which your table is queried is to specify those three fields as restrictions, then having all three in a composite key would likely be the fastest lookup.
And there is another important point that I almost forgot. Since having a composite key means that foreign references to this table from other tables must specify all fields in the key, it also means that some queries on those other tables that require a restriction on one or more parts of this table's composite key can be performed without a join. This could be considered similar to denormalization for the sake of performance (arguably sacrificing a little maintainability).
In general I prefer to have a surrogate key, because there are very few truly good natural keys (the key problem is not uniqueness but that they change over time), and the longer the natural key, the more it affects performance when used as a PK. If you have a natural key, you should create a unique index on it and then use the surrogate key as the PK for joining to other tables. That enforces the uniqueness of the natural key data but fixes the problems of join performance and the extra time to update all child records when the natural key changes.
There is one case where I ignore this, and that is a joining table. If it is a table that is used to enforce a many-to-many relationship and consists only of two surrogate keys from other tables, then you really gain nothing from adding a surrogate key. Typically the individual keys are used for joins, not the PK, and surrogate keys almost never change. In a joining table, I just add the two columns I need and nothing else.
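A hedged sketch of both points, with invented names: a surrogate PK plus a unique index on the natural key, and a pure joining table that keys on the two borrowed surrogates:

    -- Surrogate PK for joins; natural key still enforced via UNIQUE.
    CREATE TABLE Customer (
        customer_id int IDENTITY(1,1) PRIMARY KEY,
        tax_number  varchar(20) NOT NULL UNIQUE  -- the natural key
    );

    CREATE TABLE Product (
        product_id int IDENTITY(1,1) PRIMARY KEY,
        sku        varchar(20) NOT NULL UNIQUE
    );

    -- Joining table: just the two surrogate keys, no extra surrogate.
    CREATE TABLE CustomerProduct (
        customer_id int NOT NULL REFERENCES Customer (customer_id),
        product_id  int NOT NULL REFERENCES Product (product_id),
        PRIMARY KEY (customer_id, product_id)
    );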
In most databases I know (MySQL, PostgreSQL) the composite key will generate an index, so if you declare your key as composite, the DB should provide you an efficient way to look up tuples using that key. I think this is the case for all DBs, so you should not have to worry about performance there.
Don't use multi-column keys. They get very difficult to maintain, especially if the components of the key are not human-understandable.
Use an internally generated key instead.
Imagine you have a composite primary key (field1 and field2, for example) instead of just one auto-incremental identifier. Clients' requirements are very changeable; if after some development the client says that field2 is not compulsory and can be nullable, it can no longer remain part of the primary key. Imagine this table is one of the most important in your model: then all the foreign keys would have to be changed if field2 cannot stay in the composite primary key. Changing the primary key all over the model is a nightmare.
Likewise, if there are a lot of foreign keys, I think it is not a very good idea to add several columns to each referencing table just to make the link.
I'm not sure there's enough information for us to make your call for you. Here are a few observations that might be helpful though.
Is the primary key a clustered index? Is the table referenced by other tables through a foreign key? If yes, then you may benefit from a single-column key, because that key will appear in those other tables. This is how you would save space.
If the table is not referenced by other tables, then you would be using extra space in your table without much additional benefit. And, if this table only contains the two columns now, then you would increase the table size by 50%.
If you use an extra column for the primary key, do not forget your natural key (the two-column key). Create a unique constraint on the composite key. You still want to maintain the integrity of the real data.
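A minimal sketch of that advice, with hypothetical names: the surrogate becomes the PK, while a unique constraint keeps guarding the real two-column identity:

    CREATE TABLE OrderLine (
        order_line_id int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        order_id      int NOT NULL,
        line_no       int NOT NULL,
        -- the natural (composite) key is still enforced:
        CONSTRAINT UQ_OrderLine UNIQUE (order_id, line_no)
    );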
The decision should always be based on requirements and the intended meaning of the data. A table with only a single-attribute key clearly enforces a different kind of constraint and implies that your table has a very different meaning than the same table with a multi-attribute key. On the other hand, adding an additional unique column would also be a waste of resources and add meaningless complexity if you don't actually need to use it anywhere.
One caveat to the auto-incrementing column is that it can give a false impression of uniqueness. Sure, your identity column is always unique, but that's just a meaningless value you've attached to the table. Unless you also have a unique constraint attached to the set of columns that represent the actual semantic primary key of the table, you have no guarantee of meaningful uniqueness.

Database design - asking for advice

I need to store entries of a schema like (ID: int, description: varchar, updatetime: DateTime). ID is a unique primary key. The usage scenario is: I will frequently insert new entries, frequently query entries by ID, and less frequently remove expired entries (by the updatetime field, using a SQL job run daily to keep the database from growing forever). Each entry is about 0.5 KB in size.
My question is how to optimize the database schema design (e.g. which indexes to add, transaction/lock levels, or other options) in my scenario to improve performance? Currently I plan to store all information in a single table; I am not sure whether that is the best option.
BTW: I am using SQL Server 2005/2008.
thanks in advance,
George
In addition to your primary key, just add an index on updatetime.
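A hedged T-SQL sketch of that one-liner plus the daily purge the question describes; the table name and retention window are assumptions:

    -- Supports the daily purge by updatetime; the PK already covers ID lookups.
    CREATE NONCLUSTERED INDEX IX_Entries_updatetime
        ON dbo.Entries (updatetime);

    -- Daily job: delete expired rows in batches to keep locks and
    -- transaction-log growth small (30-day retention is an assumption).
    DELETE TOP (5000) FROM dbo.Entries
    WHERE updatetime < DATEADD(day, -30, GETDATE());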
Your decision to store everything in a single table needs to be reviewed. There are very few subject matters that can really be well modeled by just one table.
The problems that arise from using just one table are usually less obvious than the problems that arise from not creating the right indexes and things like that.
I'm interested in the "description" column (field). Do all descriptions describe the same kind of thing? Do you ever retrieve sets of descriptions, rather than just one description at a time? How do you group descriptions into sets?
How do you know the ID for the description you are trying to retrieve? Do you store copies of the ID in some other place, in order to reference which ones you want?
Do you know what a "foreign key" is? Was your choice not to include any foreign keys in this table deliberate?
These are some of the questions that need to be answered before you can know whether a single table design really suits your case.
Your ID is your primary key and it automatically has an index. You can put another index on the expiration date. Indexes help you when searching but decrease performance when inserting, deleting, and updating. Anyway, one index is not an issue.
It sounds somewhat strange to me - I am not saying it is an error - that you have ALL the information in one table. Re-think that point; see if there is something you can refactor.
It sounds about as simple as it gets, except for possibly adding an index on updatetime as OMax suggested (which I recommend).
If you would also like to fetch items by description, you should also consider a text index or full-text index on that column.
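Should description searches become a requirement, a hedged SQL Server sketch of that option; the catalog name and the PK_Entries key-index name are assumptions:

    -- Full-text search requires a catalog and a unique, single-column
    -- key index (here assumed to be the PK on ID).
    CREATE FULLTEXT CATALOG EntriesCatalog AS DEFAULT;

    CREATE FULLTEXT INDEX ON dbo.Entries (description)
        KEY INDEX PK_Entries;

    -- Word-based search instead of a LIKE '%...%' table scan:
    SELECT ID, description
    FROM dbo.Entries
    WHERE CONTAINS(description, 'expired');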
Other than that - you're ready to go :)

Is it ok to use character values for primary keys?

Is there a performance gain or best practice when it comes to using unique, numeric ID fields in a database table compared to using character-based ones?
For instance, if I had two tables:
athlete
id ... 17, name ... Rickey Henderson, teamid ... 28
team
teamid ... 28, teamname ... Oakland
The athlete table, with thousands of players, would be easier to read if the teamid was, say, "OAK" or "SD" instead of "28" or "31". Let's take for granted that the teamid values would remain unique and consistent in character form.
I know you CAN use characters, but is it a bad idea for indexing, filtering, etc for any reason?
Please ignore the normalization argument as these tables are more complicated than the example.
I find primary keys that are meaningless numbers cause less headaches in the long run.
Text is fine, for all the reasons you mentioned.
If the string is only a few characters, then it will be nearly as small as an integer anyway. The biggest potential drawback to using strings is the size: database performance is related to how many disk accesses are needed. Making the index twice as big, for example, could create disk-cache pressure and increase the number of disk seeks.
I'd stay away from using text as your key - what happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly what a primary key lets you avoid. Also, though I don't have any empirical evidence, I'd think the INT key would be significantly faster than the text one.
Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
I'm just going to roll with your example. Doug is correct when he says that text is fine. Even for a medium-sized (~50 GB) database, having a three-letter code as a primary key won't kill the database. If it makes development easier, reduces joins on the other table, and it's a field that users would be typing in... I say go for it. Don't do it if it's just an abbreviation that you show on a page or because it makes the athletes table look pretty. I think the key question is: "Is this a code that the user will type in and not just pick from a list?"
Let me give you an example of when I used a text column for a key. I was making software for processing medical claims. After the claim got all digitized a human had to look at the claim and then pick a code for it that designated what kind of claim it was. There were hundreds of codes...and these guys had them all memorized or crib sheets to help them. They'd been using these same codes for years. Using a 3 letter key let them just fly through the claims processing.
I recommend using ints or bigints for primary keys. Benefits include:
This allows for faster joins.
Having no semantic meaning in your primary key allows you to change the fields with semantic meaning without affecting relationships to other tables.
You can always have another column to hold team_code or something for "OAK" and "SD".
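A hedged sketch of that layout, using the question's tables: a numeric PK for joins, with the human-readable code kept as a separate unique column (column sizes are assumptions):

    CREATE TABLE team (
        teamid   int PRIMARY KEY,              -- meaningless, stable join key
        teamcode varchar(3) NOT NULL UNIQUE,   -- 'OAK', 'SD', ...
        teamname varchar(50) NOT NULL
    );

    CREATE TABLE athlete (
        id     int PRIMARY KEY,
        name   varchar(100) NOT NULL,
        teamid int NOT NULL REFERENCES team (teamid)
    );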
The standard answer is to use numbers because they are faster to index; no need to compute a hash or whatever.
If you use a meaningful value as a primary key, you'll have to update it all through your database if the team name changes.
To satisfy the above, but still make the database directly readable,
use a number field as the primary key
immediately create a view Athlete_And_Team that joins the Athlete and Team tables
Then you can use the view when you're going through the data by hand.
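A sketch of the suggested view, reusing the assumed tables from the example above:

    CREATE VIEW Athlete_And_Team AS
    SELECT a.id, a.name, t.teamcode, t.teamname
    FROM athlete AS a
    JOIN team    AS t ON t.teamid = a.teamid;

    -- Browsing by hand now shows 'OAK' instead of 28:
    SELECT * FROM Athlete_And_Team WHERE teamcode = 'OAK';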
Are you talking about your primary key or your clustered index? Your clustered index should be the column which you will most often use to uniquely identify a row. It also defines the logical ordering of the rows in your table. The clustered index will almost always be your primary key, but there are circumstances where they can be different.

Tables with no Primary Key

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is no true right or wrong answer; it depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is that most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page-locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using dbcc reindex. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question, then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit of using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
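A hedged T-SQL illustration of the quoted advice, with invented names; ALTER INDEX ... REBUILD is the modern replacement for the dbcc reindex mentioned above:

    -- Random GUID keys spread inserts across pages, so a sub-100 fill
    -- factor leaves room for them between scheduled rebuilds.
    CREATE TABLE dbo.Orders (
        OrderID uniqueidentifier NOT NULL
            CONSTRAINT DF_Orders_OrderID DEFAULT NEWID()
            CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
        Placed  datetime NOT NULL
    );

    -- Leave ~20% free space per page; run off-peak:
    ALTER INDEX PK_Orders ON dbo.Orders REBUILD WITH (FILLFACTOR = 80);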
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
documents the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
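A minimal sketch of that conclusion, with assumed names: the guid column is declared as the primary key, nonclustered so the random values don't dictate physical row order:

    CREATE TABLE dbo.Documents (
        DocID uniqueidentifier NOT NULL
            CONSTRAINT PK_Documents PRIMARY KEY NONCLUSTERED,
        Title varchar(200) NOT NULL
    );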
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered the actual question: what are the pluses/minuses of a table with NO PK and NO clustered index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table - with NO clustered index, NO constraints, NO foreign keys, NO defaults and NO primary key, in a database with the Simple recovery model - is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety), you may want to add non-clustered, non-unique indexes as needed, but keep them to the minimum.
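A hedged sketch of that bulk-load shape, with invented names: a bare heap for loading, with a minimal non-clustered index added only once querying begins:

    -- Bare heap: no PK, no clustered index, no constraints or defaults.
    CREATE TABLE dbo.StagingRows (
        RowGuid uniqueidentifier NOT NULL,
        Payload varchar(500)     NULL
    );

    -- Added later, only if the table must be queried rather than scanned:
    CREATE NONCLUSTERED INDEX IX_StagingRows_RowGuid
        ON dbo.StagingRows (RowGuid);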
I too have always heard that having an auto-incrementing int is good for performance, even if you don't actually use it.
A Primary Key needn't be an auto-incrementing field; in many cases that just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field (or fields) that every other field in the tuple is fully functionally dependent upon. (If it isn't, you might need to normalise.)
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key, but nonclustered, since you can't use newsequentialid. That strikes me as your best course. If you don't make it a PK but put a unique index on it, sooner or later that may cause people who maintain the system to misunderstand the FK relationships, introducing bugs.
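A hedged sketch for the merge-replication case, with assumed names: the existing client-generated guid column becomes a nonclustered PK and is marked ROWGUIDCOL so replication reuses it rather than adding its own column:

    -- Mark the client-generated guid column for replication's use.
    ALTER TABLE dbo.Items
        ALTER COLUMN ItemGuid ADD ROWGUIDCOL;

    -- Nonclustered PK: uniqueness and a valid FK target without forcing
    -- physical order onto random GUID values.
    ALTER TABLE dbo.Items
        ADD CONSTRAINT PK_Items PRIMARY KEY NONCLUSTERED (ItemGuid);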
