Sorting on ID (int PK) vs creation time - sql-server

Say I have blog post comments. On insert they get the current UTC date/time as their creation time (via a SYSUTCDATETIME() default) and they get an ID (via an integer IDENTITY column that is also the PK).
Now I want to sort the comments descending by their age. Is it safe to just do an ORDER BY ID, or is it required to use the creation time? I'm thinking about "concurrent" commits and rollbacks of inserts under the read committed isolation level. Is it possible that the IDs sometimes do not represent the insert order?
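For reference, a minimal sketch of the setup being described; the table and column names are hypothetical:
CREATE TABLE dbo.Comment
(
    ID           int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- clustered PK by default
    PostID       int NOT NULL,
    CreationTime datetime2(7) NOT NULL
        CONSTRAINT DF_Comment_CreationTime DEFAULT (SYSUTCDATETIME()),
    Body         nvarchar(max) NOT NULL
);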
I'm asking this because if sorting by IDs is safe then I could have the following benefits:
I don't need an index for the creation time.
Sorting by ID is probably faster.
I don't need a high precision on the datetime2 column because that would only be required for sorting anyway (in order to not have two rows with the same creation time).
This answer says it is possible when you don't have the creation time, but is it always safe?
This answer says it is not safe with an identity column. But when the identity column is also the PK, that answer gives an example of sorting by ID without mentioning whether this is safe.
Edit:
This answer suggests sorting by date and then by ID.

Yes, the IDs can be jumbled because ID generation is not part of the insert transaction. This is in order to not serialize all insert transactions on the table.
The most correct way to sort would be ORDER BY DateTime DESC, ID DESC with the ID being added as a tie breaker in case the same date was generated multiple times. Tie breakers in sorts are important to achieve deterministic results. You don't want different data to be shown for multiple refreshes of the page for example.
You can define a covering index on DateTime DESC, ID DESC and achieve the same performance as if you had ordered by the clustered index key (here: ID). There is no relevant physical difference between the clustered index and nonclustered indexes.
Since you mention the PK: the choice of the PK does not affect any of this. Only indexes do. The query processor never cares about PKs and unique keys.
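A minimal sketch of that index and query, reusing the hypothetical Comment table sketched above and assuming comments are listed per post:
-- Supports the per-post sort; add INCLUDE columns if the index should fully cover the query
CREATE NONCLUSTERED INDEX IX_Comment_PostID_CreationTime
    ON dbo.Comment (PostID, CreationTime DESC, ID DESC);

DECLARE @PostID int = 1;

SELECT TOP (50) ID, CreationTime, Body
FROM dbo.Comment
WHERE PostID = @PostID
ORDER BY CreationTime DESC, ID DESC;   -- ID as the deterministic tie breaker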

I would order by ID.
Technically you may get different results when sorting by ID vs sorting by time.
SYSUTCDATETIME() is evaluated when the insert statement runs, while the ID could be generated at a slightly different moment during the transaction. Also, the clock on any computer drifts. When the computer clock is synchronized with a time source, it may jump forward or backward. If you sync often, the jumps will be small, but they will happen.
From the practical point of view, if two comments were posted within, say, one second of each other, does it really matter which of these comments is shown first?
What I think does matter is the consistency of the display results. If the system somehow decides that comment A should go before comment B, then this order should be preserved everywhere across the system.
So, even with the highest-precision datetime2(7) column it is possible to have two comments with exactly the same timestamp, and if you order just by this timestamp, sometimes they will appear as A, B and sometimes as B, A.
If you order by ID (the primary key), you are guaranteed that it is unique, so the order will always be well defined.
I would order by ID.
On second thought, I would order by time and ID.
If you show the time of the comment to the user, it is important to show comments according to this time. To guarantee consistency, sort by both time and ID in case two comments have the same timestamp.

If you sort by ID in descending order and filter by user, your blog will automatically show the latest posts on top, which does the job for you, so you don't need to sort by date.

Related

DBMS for complex conditional search with repetition

The goal is to make querying as fast as possible.
A Postgres table contains 10,000,000 records, and each record has about 30 properties.
CREATE TABLE films (
code char(5) CONSTRAINT firstkey PRIMARY KEY,
title varchar(40) NOT NULL,
did integer NOT NULL,
date_prod date,
kind varchar(10),
len interval hour to minute
-- and ~25 more columns
);
Users are filtering data in very specific ways, but always based on a bunch of conditions.
For example, user A needs to paginate through these 10 million records filtered by the columns code, title, did and date_prod and ordered by the columns date_prod and title. He performs just a few more similar combinations, but he repeats each search many times a day. So the main point is: the conditions are complex, but the variety of combinations is small, usually just 3-5 per user.
Maybe this is also important: the user wants to see only some columns, not all of them, and the selected columns are related to the conditions he uses in the query.
Records in this table are updated many times a day, and the user should see the updated data each time, so caching will not work here.
This app is used by a small number of users (fewer than 10,000) and will never exceed this number.
What I need here is to make queries as fast as possible. It's okay if, the first time a user creates a new search (a bunch of query conditions plus a very specific set of columns), it takes seconds to give the results. But if the user saves this set of conditions and columns, I need to make all further repetitions of this search as fast as possible, despite the fact that the data is being updated all the time.
I doubt that indexing every column is a good idea. So, how do I do that? PostgreSQL with materialized views? Maybe MongoDB or another NoSQL solution would somehow work better here?
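One common option for a repeated search like user A's is a multicolumn index shaped around the saved filter. A minimal sketch, assuming an equality filter on did, a range filter on date_prod, and ordering by date_prod and title (column choices and values are illustrative):
-- One multicolumn index per saved search shape
CREATE INDEX films_by_did_dateprod_title
    ON films (did, date_prod, title);

-- With the equality on did, the index returns rows already ordered by (date_prod, title)
SELECT code, title, did, date_prod
FROM films
WHERE did = 42
  AND date_prod >= DATE '2024-01-01'
ORDER BY date_prod, title
LIMIT 50 OFFSET 0;
A materialized view per saved search is another possibility, but with data updated many times a day the refresh cost may become the bottleneck.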

Why is datekey in fact tables always INT?

I'm looking at the datekey column from the fact tables in AdventureWorksDW and they're all of type int.
Is there a reason for this and not of type date?
I understand that creating a clustered index composed of an INT would optimize query speed. But let's say I want to get data from this past week. I can subtract 6 from the date 20170704 and I'll get 20170698, which is not a valid date. So I have to cast everything to date, subtract, and then cast back to int.
Right now I have a foreign key constraint to make sure that something besides 'YYYYMMDD' isn't inserted. It wouldn't be necessary with a date type. Just now, I wanted to get some data between 6/28 and 7/4. I can't just subtract six from 20170703; I have to cast from int to date.
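A minimal sketch of that casting dance in T-SQL; the variable names are illustrative:
DECLARE @DateKey int = 20170704;

DECLARE @AsDate     date = CONVERT(date, CAST(@DateKey AS char(8)), 112);   -- int -> date
DECLARE @WeekAgo    date = DATEADD(day, -6, @AsDate);                       -- real date math
DECLARE @WeekAgoKey int  = CAST(CONVERT(char(8), @WeekAgo, 112) AS int);    -- date -> int: 20170628

SELECT @WeekAgoKey AS WeekAgoKey;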
It seems like a lot of hassle and not many benefits.
Thanks.
Yes, you could be using a Date data type and have that as your primary key in the Fact and the dimension and you're going to save yourself a byte in the process.
And then you're going to have to deal with a sale that is recorded where we didn't know the date. What then? In a "normal" dimensional model, you define Unknown surrogate values so that people know there is data and it might be useful, but it's incomplete. A common convention is to make it zero or something in the negative realm. Easy to do with integers.
Dates are a little weird in that we typically use smart keys - yyyymmdd. From a debugging perspective, it's easy to quickly identify what the date is without having to look up against your dimension.
You can't make an invalid date. Soooo what then? Everyone "knows" that 1899-12-31 is the "fake" date (or whatever tickles your fancy), and that's all well and good until someone fat-fingers a date and magically hits your sentinel date, and now you've got valid unknowns mixed with merely bad data entry.
If you're doing date calculations against a smart key, you're doing it wrong. You need to go to your date dimension to properly resolve the value and use methods that are aware of date logic, because it's ugly and nasty beyond just simple things like month lengths and leap year calculations.
Actually, that fact table has a relationship to a DimDate table, and if you join that table you get many more options for point-in-time searches than you would get by adding and removing days/months.
Say you need a list of all orders on the second Saturday of May, or all orders in the last week of December?
Also, some businesses define their fiscal year differently. Some start in June, some start in January.
In summary, DimDate is there to provide you with flexibility when you need to do complicated date searches without doing any calculations, using a simple index seek on DimDate instead.
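A minimal sketch of that kind of DimDate lookup, using AdventureWorksDW-style column names; the "last week of December" filter is a rough illustration:
SELECT f.SalesOrderNumber, f.SalesAmount, d.FullDateAlternateKey
FROM dbo.FactInternetSales AS f
JOIN dbo.DimDate AS d
    ON d.DateKey = f.OrderDateKey            -- both sides use the int yyyymmdd key
WHERE d.CalendarYear = 2013
  AND d.MonthNumberOfYear = 12
  AND d.DayNumberOfMonth >= 25;              -- roughly the last week of December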
It's a good question, but the answer depends on what kind of data warehouse you're aiming for. SSAS, for instance, covers both Tabular and Multidimensional models.
In multi-dimensional, you would never be querying the fact table itself through SQL, so the problem you note with e.g. subtracting 6 days from 20170704 would actually never arise. Because in MD SSAS you'd use MDX on the dimension itself to implement date logic (as suggested in #S4V1N's answer above). Calendar.Date.PrevMember(6). And for more complicated stuff, you can build all kinds of date hierarchies and get into MDX ParallelPeriod and FirstChild and that kind of thing.
For a data warehouse that you're intending to query with SQL, your question has more urgency. I think that in that case #S4V1N's answer still applies: restrict your date logic to the dimension side, both because that's where it's already implemented (possibly with pre-built calendar and fiscal hierarchies) and because your logic will operate on an order of magnitude fewer rows.
I'm perfectly happy to have fact tables keyed on an INT-style date: but that's because I use MD SSAS. It could be that AdventureWorksDW was originally built with MD SSAS in mind (where whether the key used in fact tables is amenable to SQL is irrelevant), even though MS's emphasis seems to have switched to Tabular SSAS recently. Or the use of INTs for date keys could have been a "developer-nudging" design decision, meant to discourage date operations on the fact tables themselves, as opposed to on the Date dimension.
The thread is pretty old, but my two cents.
At one of the clients I worked at, the design chosen was an int column. The reason given (by someone before I joined) was that there were imports from different sources - some that included time information and some that only provided the date information (both strings, to begin with).
By having an int key, we could retain the date/datetime information in a datetime column in the fact table, while at the same time having a second column with just the date portion (data type: date/datetime) and using that to join to the Dim table. This way (a) the aggregations/measures would be less involved, (b) we wouldn't prematurely discard time information, which may be of value at some point, and (c) at that point, if required, the Date dimension could be refactored to include time, or a new DateTime dimension could be created.
That said, this was the accepted trade-off there, but might not be a universal recommendation.
This thread is now very old, but:
For non-date columns, a sequential integer surrogate key is considered best practice, because it is fast and reasonably small. A natural key that encapsulates business logic could change over time, and it may also need some method of identifying which version of the dimension row it refers to, for a slowly changing dimension.
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/
Ideally, for consistency, a date dimension should also have a sequential integer key, so why is it different? After all, the debugging argument could also be applied to other (non-date) dimensions. From The Data Warehouse Toolkit, 3rd Edition, Kimball & Ross, page 49 (Calendar Date Dimension), comes this comment:
To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned surrogate key.
Although I think this refers to partitioning of a fact table, I would argue that the date key is an integer to allow for consistency with the other dimensions, but not a sequential key, to allow for easier table partitioning.
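A minimal sketch of what that partitioning convenience looks like in SQL Server; the boundary values are illustrative:
-- Yearly partitions on the integer yyyymmdd key
CREATE PARTITION FUNCTION pfDateKey (int)
    AS RANGE RIGHT FOR VALUES (20160101, 20170101, 20180101);

CREATE PARTITION SCHEME psDateKey
    AS PARTITION pfDateKey ALL TO ([PRIMARY]);

-- A fact table created ON psDateKey(OrderDateKey) is then split by the integer date key.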

Should I just use a single table?

I have an entity Order.
The order has information on date, client, associate who handled order etc.
Now the order also needs to store a state i.e. differentiate between won orders and lost orders.
The idea is that a customer may submit an order to the company, but could eventually back out.
(As domain info, the order is not of items. It is a services company that handles clients and makes offers on when it can deliver an order and at what price, etc. So the customer may find a better bargain and back out, stopping the ordering process with the company.)
The company wants data on both won orders and lost orders and the difference between a won order and a lost order is just a couple of more attributes e.g. ReasonLost which could be Price or Time.
My question is, what would be the best representation of the Order?
I was thinking of using a single table and, for the orders won, just having ReasonLost as null.
Does it make sense to create separate tables for WonOrder and LostOrder if the difference between these new entities is not significant?
What would be the best model for this case?
Use one table. Add an OrderState Field.
Caveat: If you are doing millions of transactions per day, then decisions like this need much more attention and analysis.
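A minimal sketch of the single-table option, with hypothetical (SQL Server flavored) names:
CREATE TABLE Orders (
    OrderID     int IDENTITY(1,1) PRIMARY KEY,
    ClientID    int NOT NULL,
    AssociateID int NOT NULL,
    OrderDate   date NOT NULL,
    OrderState  varchar(10) NOT NULL
        CONSTRAINT CK_Orders_OrderState CHECK (OrderState IN ('Open', 'Won', 'Lost')),
    ReasonLost  varchar(20) NULL      -- NULL unless OrderState = 'Lost' (e.g. 'Price', 'Time')
);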
There is another alternative design that you might consider. In this alternative you keep a second table for the order lost reason and relate it to your order table as an optional 1:1. Note that this is effectively an implementation of a supertype/subtype pattern where the lost order subtype has one additional attribute.
It looks like this:
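Roughly, in hypothetical SQL, reusing the Orders table sketched above but with ReasonLost moved into its own optional 1:1 table:
CREATE TABLE OrderLostReason (
    OrderID    int NOT NULL PRIMARY KEY
        REFERENCES Orders (OrderID),   -- shares the Orders key: at most one row per order
    ReasonLost varchar(20) NOT NULL    -- e.g. 'Price' or 'Time'
);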
This alternative might be attractive under any of the following circumstances:
You lose very few orders.
Your order table isn't wide enough to hold a long enough lost order reason.
Your lost order reason is very, very big (even BLOB).
You have an aesthetic objection to maintaining a lost order reason in your order table.

Calculated Measure aggregating on certain cells only

I'm trying to figure out how I can create a calculated measure that produces a count of only unique facts in my fact table. My fact table basically stores events from a historical perspective. But I need the measure to filter out redundant events.
Using sales as an example (since all material around OLAP always uses sales in its examples):
The fact table stores sales EVENTS. When a sale is first made it has a unique sales reference, which is a column in the fact table. A unique sale, however, can be amended (items added or returned) or completely canceled. The fact table stores these changes to a sale as different rows.
If I create a count measure using SSAS, I get a count of all sales events, which means a unique sale will be counted multiple times, once for every change made to it (which in some reports is desirable). However, I also want a measure that produces a count of unique sales rather than events, and not just based on counting unique sales references. If the user filters by date then they should see the unique sales that still exist on that date (if a sale was canceled by that date, it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need to have a count query work on a subset produced by a query that finds the latest change to each sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef = FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure that is based on the unique transaction ID (The sales number, or order number) then make that measure a 'DistinctCount' aggregate function in the properties window.
Now it should count distinct order numbers in whichever dimension slice it finds itself.
The posted query could probably be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be just to have a 'sales count' column in your fact view / DSV query that supplies a 1 for an 'initial' event, a zero for all subsequent revisions to the event, and a -1 if the event is cancelled. This 'journaling' approach plays nicely with incremental fact table loads.
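A minimal sketch of that column in the fact view, with hypothetical event names:
SELECT
    SalesRef,
    EventTimestamp,
    CASE
        WHEN Event = 'Created'   THEN 1    -- initial sale
        WHEN Event = 'Cancelled' THEN -1   -- cancellation reverses it
        ELSE 0                             -- subsequent amendments
    END AS SalesCount
FROM SalesFacts;
Summing SalesCount then yields the number of live sales under any slice.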
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that was the count of the members in that dimension non-empty over a given measure in your fact table. However for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large. This may be inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/

How to design data storage for partitioned tagging system?

How to design data storage for huge tagging system (like digg or delicious)?
There is already discussion about this, but it is about a centralized database. Since the data is expected to grow, we'll need to partition the data into multiple shards sooner or later. So the question becomes: how do we design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item (item_id, item_content)
Tag (tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for a given tag and finding all tags for a given item, as long as the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.
For table Item, we can partition its content by its key item_id. For table Tag, we can partition its content by its key tag_id. For example, say we want to partition table Tag into K databases. We can simply store a given tag in database number (tag_id % K).
But, how to partition table TagMapping?
The TagMapping table represents the many-to-many relationship. The only thing I can imagine is duplication. That is, the same TagMapping content gets two copies: one partitioned by item_id and the other partitioned by tag_id. To find the tags for a given item, we use the copy partitioned by item_id; to find the items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy, and the application layer has to keep all the copies consistent. That looks hard.
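A minimal sketch of that duplicated mapping, with hypothetical table names:
-- Copy 1: sharded by item_id (answers "tags for a given item" within one shard)
CREATE TABLE TagMappingByItem (
    map_id  bigint NOT NULL,
    item_id bigint NOT NULL,
    tag_id  bigint NOT NULL,
    PRIMARY KEY (item_id, tag_id)     -- also enforces per-item tag uniqueness
);

-- Copy 2: sharded by tag_id (answers "items for a given tag" within one shard)
CREATE TABLE TagMappingByTag (
    map_id  bigint NOT NULL,
    tag_id  bigint NOT NULL,
    item_id bigint NOT NULL,
    PRIMARY KEY (tag_id, item_id)
);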
Is there any better solution to solve this many-to-many partition problem?
I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!
Finding Tags for a Given Item
A1. You're going to display all of the tags for a given item at once
A2. You're going to ensure that all of an item's tags are unique
Finding Items for a Given Tag
B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)
B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags
B3. You're going to sort the items for a given tag (or tags) by some measure of popularity
Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records as needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display, and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
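A minimal sketch of that per-partition page fill, assuming a TagMapping copy sharded by tag_id with an index on tag_id, and LIMIT-style syntax:
-- Run against one shard at a time, in the chosen shard order,
-- until enough rows have been collected for the page.
SELECT item_id
FROM TagMappingByTag
WHERE tag_id = 42
ORDER BY item_id DESC        -- or whatever encodes the desired ranking
LIMIT 20;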
The rule is that you partition by the field you are going to query by; otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not; you'll also need to query by tag title. It's not so obvious for the Item table, but you'll probably also want to query by something like the URL, to find the item_id when another user assigns tags to that item.
But note that the Tag and Item tables have an immutable title and URL. That means you can use the following technique:
Choose the partition from the title (for Tag) or the URL (for Item).
Use that partition's sequence to generate the id.
Either use the partition-localID pair as the global identifier, or use non-overlapping number sets. Either way, you can now compute the partition from both the id and the title/URL fields. Don't know the number of partitions in advance, or worried it might change in the future? Create more of them and join them into groups, so that you can regroup them later.
Sure, you can't do the same for the TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, and by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. The difference is that you use different partitioning (by a different field) for each index. I see no reason to worry about it.
Most likely your queries are going to be related to a user or a topic, meaning that you should have all the info related to those in one place.
You're talking about distributing the DB; usually this is mostly an issue of synchronization. Reading, which is usually about 90% of the work, can be done on a replicated database. The issue is how to update one DB and remain consistent with all the others without killing performance. This depends on the details of your scenario.
The other possibility is to partition, like you asked, all the data without overlapping. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just tell you which dedicated DB holds the data; you can then query the correct one. Since you partition by ID, all the info related to that topic can live on that specialized database. You could also partition by language or country for an international website.
Last but not least, you'll probably end up mixing the two: some non-overlapping data and some overlapping (replicated) data. First find the usual operations, then work out how to perform them on one DB with the fewest possible queries.
PS: Don't forget about caching; it'll save you more than a distributed DB will.

Resources