Loading unfinished data to data warehouse - sql-server

The title might be confusing so I'd like to present my current problem.
Please image the following situation: System stores devices' issues, which should be fixed by qualified workers. I have table "issue" with:
id as PK
workerid FK
status which desribes whether the problem is solved or unsolved
estimated completion time
real completion time
and other columns. I have also a data warehouse which will store the "issues" and describe performance of those "workers" (working time mostly).
During the ETL process the biggest problem comes with "unsolved issues". I might have two possibilities:
a) process only solved "issues", leave unsolved until they are finished then wait until they are finished and process them. This task however will not include in my reports issues, that might take too long to finish, which might be crucial in business aspect.
b) process both solved and unsolved issues, the PK in Fact table could be issueId and status. But then i'll store almost identical issues which might be weird, and difficult to analize.
Is this common situation? Which of these two possibilities seems more reasonable? Or probably there is other, better way to do this?

It seems like there should be an issues dimension, and that dimension would hold the status column. There are a couple of issues with changing facts:
You are going to have to setup a scheduled process that updates the status column of the fact table every x minutes. I always try to avoid updating a fact table, as it makes cube processing more difficult, it can introduce blocking, and change tracking is difficult (when did the status change, who changed it, and why?). Additionally, if/when you upgrade to SQL 2012 and want to use column-store indexes (which have revolutionized star schema query performance), you won't be able to directly update the column.
Dimensions are sometimes expected to change. Facts are not. If the status is in the dimension, it's also easy to set up change tracking. Look into slowly changing dimensions.

Related

Paging of frequently changing data

I'm developing a web application which display a list of let's say "threads". The list can be sorted by the amount of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10x in a second. The application furthermore is distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list. And I can't transmit the whole sorted list by likes to a user at once.
As soon as an user would go to page 2 of this list, it likely changed and may contain threads already listed from page one
Solutions which don't work:
Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshot the list in temp database table (it's too frequent changing data and it need to be actual)
(If it matters I'm using MongoDB+c#)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-02 17:30. That makes your query super easy to implement. Then, when you pull down all the next elements, you just look for anything that occurred before the last element on the page.
The downfall of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
timestamp long (PK)
created datetime
last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
id long (PK)
identifier long
last_modified datetime
title string
body string
score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
timestamp long (FK -> Time.timestamp)
thread_id long (FK -> Threads.id)
thread_identifier long (matches Threads[thread_id].identifier)
(timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:
SELECT Thread.*
FROM Thread
JOIN ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
AND Thread.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Thread.*
FROM Thread
JOIN ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Thread.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
You also don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budge, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes I would use a two stage appraoch. For the frequently changing data I would use an in memory DB to keep up with the change rates and flush this peridically to the "real" db.
Once you have that the query for constantly chaning data is easy.
Query the db.
Query the in memory db.
Merge the frequently changed data from the in memory db with the "slow" db data .
Remember which results you already have displayed so pressing the next button will
not display an already dispalyed value twice because on different pages because its rank has changed.
If many people look at the same data it might help to cache the results of 3 in itself to reduce the load on the real db even further.
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all 'thread' results on the server when the user first time hits the database. Then return the first page of data to the user and for each subsequent next page calls I'd return cached results.
To minimize memory usage you can cache only records ids and fetch whole data when user requests it.
Cache can be evicted each time user exits current page. If it isn't a ton of data I would stick to this solution because user won't get annoyed of data constantly changing.

Does storing aggregated data go against database normalization?

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) using DB2 I used a MQT (Materialized Query Table) that works like a view only it's driven by a query and you can schedule how often you want it to refresh; e.g. every 5 min. Then that table stored the count values.
2) in the software package itself I set information like that as a system variable. So in Apache you can set a system wide variable and refresh it every 5 minutes. Then it's somewhat accurate but your only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it so it's been while but I think in PHP was was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.

Database design: columns with duplicate meaning

maybe the title of the question don't describe my problem very well but here it is :
let's say i have a table article that look like this:
+ title
+ author
.
.
.
+ status [choices : ('draft', ..., 'translate')]
And let's say in my business process i publish in my web page articles that have [status = 'translate']
Is it a good design decision to add another field :
+ read [bool]
to my table that's mean that the article is ready to be publish or is it a bad design because i can test on the status == 'translate' for that and the new field will be just a duplicate ??
i hope that my question is clear , and thanks in advance.
Here's a fundamental DB design concept (it's actually a part of making your table comply with 3NF): Non of your columns should depend on anything but the primary key of the table. That should answer your question.
Here's a good quote to remember that:
every non-key attribute
"must provide a fact about the key,
the whole key, and nothing but the key
so help me Codd".
(that's also from wiki)
The reason for that, is that breaking this law might compromise Data Integrity.
Bad design.
First, what you do have here is a field that is basically the current state of a state engine.
Second, status should be a separate table - do NOT put status texts in the same table. You can then add additional info for every possible status to the status table.
Duplicate. If you can manage without a column, don't use it.
Think about the overhead it adds to your database (and besides, a boolean column cannot be indexed, so it also won't increase yourperformance).
(And of course, replace the status strings with numeric values).
Good luck.
In order for there not to be a potential data integrity conflict, you could make the "ready" column a computed column, or you could make a view which provides this translation service.
However, for this particular design, I would put the states into a table and have an IsReady column in the state table. Then you could add different states which are all IsReady. I have used designs like this many times, where certain states are equivalent for some operations, but not for others. Each has a flag. In my particular case, many batches in different states were allowed to be counted as "successful" for average timing/performance purposes, but batches which had completely successfully but were later invalidated were not considered "successful", etc.
This case has a name in normalization theory. It's called "harmful redundancy". Here are some potential drawbacks to harmful redundancy: the database can contradict itself; too much space is wasted; too much time is wasted.
Contradictions in the database gets the most air time in tutorials on database design. You have to take some measures to prevent this situation from arising, or live with the consequences. You can rely on careful programming to keep contradictions out of the database, or you can declare constraints that will prevent any transaction from leaving the database in a contradictory state.
The waste of space is usually, but not always, a trivial cost. Wasted space can result in wasted time as a consequence.
The waste of time is the one that preoccupies programmers the most. But here, the issue gets to be more subtle. Sometimes "harmful redundancy" results in saving time, not wasting it. Most often, it results in extra time during updates, but time saving during retrieval. Often, the time savings or wastage is trivial, and therefore so is the design decision, from the point of view of speed.
In your case, the speed consequences should be minimal. Here's a hint: how often do you update rows? How often do your read them? how important is speed in updating or reading? If the time you gain during reading has more weight for yuo than the time you spend updating, then go for it.

Am i right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses single table Access database for its outbound cms, which I moved to a SQL server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes along with date, time, and agent id are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists sorted to give an even spread throughout their set). Note a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is not obviously the speed at which a new record could be added to the calls table live, but when the user app is closed/opened and loads the user's data set again I need to check which records have not been called today - I would need to run a stored proc on the server that picked the last most call from the calls table and check if its calldate didn't match today's date. I believe a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design, the main limitation is my inexperience. This is my first SQL server system. It's pretty critical, and I had to ensure it would work and I could easily dump data back to access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working Database may become the major flaw here. Rather try to optimize what you have got running currently instead if starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table and the Order header table has "total price" where that value was derived from the Order-Detail and related tables. Having total price on Order in this case still make sense, but it breaks normalisation.
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure, you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed precisely because speed is only determined by physical design choices, the prime desision factor in which is precisely speed and performance.

Limits on number of Rows in a SQL Server Table

Are there any hard limits on the number of rows in a table in a sql server table? I am under the impression the only limit is based around physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index. Are there any common practicies for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
In addition to all the above, which are great reccomendations I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a performance number as depending on the quality and number of your indexes the performance will differ. It is also dependent on what operations you want to optimize. Do you need to optimize inserts? or are you more concerned about query response?
If you are truly concerned about insert speed, partitioning, as well a VERY careful index consideration is going to be key.
The separate table reccomendation of Tom H is also a good idea.
With audit tables another approach is to archive the data once a month (or week depending on how much data you put in it) or so. That way if you need to recreate some recent changes with the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task I've found!). But you still have the data avialable in case you ever need to go back farther in time.

Resources