I have to have this query produce a monthly average availability for a selection of equipment. The query looks pretty ugly and rather bloated and I'm wondering if there's something I can do to make it better.
Right now I have an Expression column representing each month. In each month, take January for example, I need to check if the maintenance period occurred within the month, started before the month but ended during, started during the month but ended outside, completely encapsulates the month (if Jan 2012, then started maintenance Dec 2011 ended Feb 2012), or not count it because it was fully operational throughout the month.
So as you can see checking these 4 cases and doing calculations afterwards really bloats up one month, let alone 12! So I'm just looking for a suggestion if there's something more I can be doing. I'm not afraid of some RnD if someone has articles they can point me towards. If this approach is not overly bad I don't mind, I would just like to do better if its possible.
Thanks in advanced!
can't you just do a simple crosstab query in MS Access?
Related
I am currently working on a prototype for a future product. I am currently working on a concept for how highly flexible licences can best be stored and billed.
The following illustration explains the situation a little:
The billing should take place for each month and include the exact prices on a pro rata basis. In the example, January is to be billed, which is made up of 3 different licence periods. Each part should contain the exact price proportionate to its time in the month. So the concept is clear to me, but I am asking about the best technical implementation.
Should the periods be in an SQL database or a timeseries DB or something else entirely? How should I mark when periods have been paid, but extend into the next month?
SQL would be my first approach: Periods table: store periods (startDate, endDate, priceFactor, isBilled)
Problems:
How to bill ongoing periods, which extend over many months? (they have no endDate)
sql-queries could get complex
Thank you for your help!
Question: This might be a beginner developer questions but need some clarifications. What is best / mostly used / best practice to process data into database table from beginning of time (for example data from January 1900) ? Do people pull and process data from beginning of time and process it at once? Do people pull data based on window of time (example processing by one Month of Year at a time or one Year a time). Also when there is modification of data in June 2000, do people reprocess it from beginning of time or just reprocess for June 2000. I really appreciate your response.
tl;dr general question about handling database data and design:
Is it ever acceptable/are there any downsides to derive data from other data at some point in time, and then store that derived data into a separate table in order to keep a history of values at that certain time, OR, should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record peoples' vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
We could do this two ways I see:
A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time, and inserts them into a table like "YearEndHistories", which would have your EmployeeID, Year, and DaysRemaining at that time.
We don't keep a YearEndHistories table, but instead if we want to find out the amount of days possessed at a certain time, we loop through all vacations added and subtracted that exist UP TO that specific time.
I like the feeling of certainty that comes with #1 --- the recorded values would be reviewed by administration, and there would be no arguing or possibility about that number changing. With #2, I like the efficiency --- one less table to maintain, and there's no derived data present in the actual tables. But I have a weird fear about some unseen bug slipping by and peoples' historical value calculation start getting screwed up or something. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to the previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by rewarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or doesn't change very often (eg: A high score table in a game - it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation, the calculation seems relatively simple - a sum of employee vacation days granted - days taken, but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" when data is concerned - try to think about the data as a whole, as a set. Something like
SELECT StaffID, sum(Vacation)
from
(
SELECT StaffID, Sum(VacationAllocated) as Vacation
from Allocations
where AllocationDate<=convert(datetime,'2010-12-31' ,120)
group by StaffID
union
SELECT StaffID, -Count(distinct HolidayDate)
from HolidayTaken
where HolidayDate<=convert(datetime,'2010-12-31' ,120)
group by StaffID
) totals
group by StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you a better "auditability", without performance penalty.
We are designing a MySQL table to track the number of followers on a daily basis for 10,000s of Twitter accounts. We've been struggling to figure out the most efficient way to store this data. The two options we are consider are:
1) OPTION 1 - Table with rows: Twitter ID, Month, Day1, Day2, Day3, etc. where each day would contain the number of followers for that account for each day of the specified month
2) OPTION 2 - Table with rows: Twitter ID, Day, Followers
Option 1 would result in about 30x less rows than Option 2. What I'm not sure from a performance perspective is if it's preferable to have less columns or less rows.
In terms of the queries we will be using, we just want to be able to query the data to get the number of followers for a specific Twitter account for arbitrary time ranges.
I would appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I present please feel free to suggest it.
Thanks in advance for your help!
Option 2, no question.
Imagine trying to write a query using each option. Let's give the best case for option 1: We know we want the total for all 31 days of the month. THen with option 1 the query is:
select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
+day11+day12+day13+day14+day15+day16+day17+day18+day19+day20
+day21+day22+day23+day24+day15+day26+day27+day28+day29+day30
+day31 as total
from table1
where month='2010-12';
select twitterid, sum(day) as total
from table2
where date between '2010-12-01' and '2010-12-31'
group by twitterid;
The second looks way easier to me. If you don't think so, tell me if you immediately noticed the error in the option 1 version, and if you're confident that no programmer would ever make such an error.
Now imagine that the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: give a date range that describes that week. This could easily be done when building a query on the fly: JUst ask for start date and add 6 days to it for the end date. But with the first version, what are you going to do? You'd have to figure out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a giant pain.
As to performance: Sure, more rows take more time to retrieve. But longer rows also take more time to retrieve. Lesson 1 on database design: Don't throw out normalization to do a micro-optimization when you don't even have a good reason to believe there's a problem. Build a normalized database first. Then if it turns out that there are performance problems, tune it afterwards. Odds are that you can buy a faster hard drive for a whole lot less than the cost of one day of programmer's time taken finding a mistake in an unnecessarily complex query.
Offcourse it depends on what queries you are going to do - but unless every query requires the 31 days of that month, for your operational data, Use Option 2.
It's better from a logical perspective (say later on you don't want queries per "30 calender days", but "last X days")
It's better for writes, too (only
update 1 row with 2 fields instead of
overwriting all fields).
You can always optimize later (partitioning comes to mind)
Your data-warehouse can still be optimized for long-term aggregate statistics.
Use Option 2. Option 1 would be a nightmare for queries.
MySQL has good support for doing date ranges in queries, so it is easiest to just have row per day.
I would say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.
Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).
Option 1 will just be regrettable.
I'm still learning the ropes of OLAP, cubes, and SSAS, but I'm hitting a performance barrier and I'm not sure I understand what is happening.
So I have a simple cube, which defines two simple dimensions (type and area), a third Time dimension hierarchy (goes Year->Quarter->Month->Day->Hour->10-Minute), and one measure (sum on a field called Count). The database tracks events: when they occur, what type are, where they occurred. The fact table is a precalculated summary of events for each 10 minute interval.
So I set up my cube and I use the browser to view all my attributes at once: total counts per area per type over time, with drill down from Year down to the 10 Minute Interval. Reports are similar in performance to the browse.
For the most part, it's snappy enough. But as I get deeper into the drill-tree, it takes longer to view each level. Finally at the minute level it seems to take 20 minutes or so before it displays the mere 6 records. But then I realized that I could view the other minute-level drilldowns with no waiting, so it seems like the cube is calculating the entire table at that point, which is why it takes so long.
I don't understand. I would expect that going to Quarters or Years would take longest, since it has to aggregate all the data up. Going to the lowest metric, filtered down heavily to around 180 cells (6 intervals, 10 types, 3 areas), seems like it should be fastest. Why is the cube processing the entire dataset instead of just the visible sub-set? Why is the highest level of aggregation so fast and the lowest level so slow?
Most importantly, is there anything I can do by configuration or design to improve it?
Some additional details that I just thought of which may matter: This is SSAS 2005, running on SQL Server 2005, using Visual Studio 2005 for BI design. The Cube is set (as by default) to full MOLAP, but is not partitioned. The fact table has 1,838,304 rows, so this isn't a crazy enterprise database, but it's no simple test db either. There's no partitioning and all the SQL stuff runs on one server, which I access remotely from my work station.
When you are looking at the minute level - are you talking about all events from 12:00 to 12:10 regardless of day?
I would think if you need that to go faster (because obviously it would be scanning everything), you will need to make the two parts of your "time" dimension orthogonal - make a date dimension and a time dimension.
If you are getting 1/1/1900 12:00 to 1/1/1900 12:10, I'm not sure what it could be then...
Did you verify the aggregations of your cube to ensure they were correct? Any easy way to tell is that if you get the same amount of records no matter what drill-tree you go down.
Assuming this is not the case, what Cade suggests about making a Date dimension AND a Time dimension would be the most obvious approach but it is one bigger no-no's in SSAS. See this article for more information: http://www.sqlservercentral.com/articles/T-SQL/70167/
Hope this helps.
I would also check to ensure that you are running the latest sp for sql server 2005
The RTM version had some SSAS perf issues.
also check to ensure that you have correctly define attribute relationships on you time dimension and other dims as well.
Not having these relationships defined will the SSAS storage engine to scan more data then necessary
more info: http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html
as stated above, splitting out the date and time will significantly decrease the cardinality of your date dimension which should increase performance and allow a better analytic experience.