Question: This might be a beginner developer question, but I need some clarification. What is the best practice / most common approach for processing data into a database table from the beginning of time (for example, data from January 1900)? Do people pull and process all the data at once? Or do people pull data based on a window of time (for example, one month or one year at a time)? Also, when data from June 2000 is modified, do people reprocess everything from the beginning of time, or just reprocess June 2000? I really appreciate your response.
This isn't exactly a programming question, as I don't have an issue writing the code, but a database design question. I need to create an app that tracks sales goals vs. actual sales over time. The thing is that a person's goal can change (let's say daily at most).
Also, a location can have multiple agents with different goals that need to be added together for the location.
I've considered basically running a timed task to save daily goals per agent into a field. It seems that over the years that will be a lot of data, but it would let me simply query a date range and add up all the daily goals to get a goal for that date range.
Otherwise, I guess I could simply write changes (e.g. March 2nd - 15 sales/week; April 12th - 16 sales/week), which would be less data, but much more programming work to figure out goals based on a time query.
I'm assuming there is probably a best practice for this - anyone?
Put a date range on your goals. The start of the range is when you set that goal. The end of the range starts off as the maximum-collating date (often 9999-12-31, depending on your database).
Treat this as "until forever" or "until further notice".
When you want to know what goals were in effect as of a particular date, you would have something like this in your WHERE clause:
...
WHERE effective_date <= #AsOfDate
AND expiry_date > #AsOfDate
...
When you change a goal, you need two operations: first, update the existing record (if it exists) and set its expiry_date to the new as-of date; then insert a new record with an effective_date of the new as-of date and an expiry_date of forever (e.g. '9999-12-31').
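A minimal sketch of that change in SQL, reusing the #-placeholder style from the query above (agent_goal, #AgentId and #NewGoal are made-up names for illustration):

-- Step 1: close off the goal that is currently "until forever"
UPDATE agent_goal
SET expiry_date = #AsOfDate
WHERE agent_id = #AgentId
AND expiry_date = '9999-12-31';

-- Step 2: open the new goal, effective now, expiring "never"
INSERT INTO agent_goal (agent_id, weekly_goal, effective_date, expiry_date)
VALUES (#AgentId, #NewGoal, #AsOfDate, '9999-12-31');

Wrap the two statements in one transaction so a concurrent reader never sees a gap or an overlap in the ranges.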
This gives you the following benefits:
Minimum number of rows
No scheduled processes to take daily snapshots
Easy retrieval of effective records as of a point in time
Ready-made audit log of changes
TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of the month, I need to read all the rows written during the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. On the 1st day of each month, every client needs to receive a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats on the requests they've made, grouped by type of request.
So at the end of the month, a client receives a report summarizing their requests by type.
I've already come to two solutions, but I'm still not convinced by either of them.
1st solution: on the last day of each month, I increase the READ throughput capacity and then run a MapReduce job. When the job is done, I decrease the capacity back to its original value.
Cons: it's not fully automated, and there's a risk of the DynamoDB capacity not being available when the job starts.
2nd solution: I can break the generation of CSVs + statistics into small jobs on a daily or hourly schedule. I could store partial CSVs on S3, and on the 1st day of each month I could join those files and generate a new one. The statistics would be much easier to generate, just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself, the approach I used, and now recommend to you, is to process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable: the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server across lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
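To give a flavour of how small the report-time query becomes, here is a hypothetical sketch (hourly_totals and its columns are invented for illustration, not my actual schema); all the expensive per-row calculation was already done when the hourly rows were shipped:

-- Monthly report straight off the pre-aggregated hourly rows
SELECT client_id,
       request_type,
       SUM(request_count) AS total_requests,
       SUM(amount) AS total_amount
FROM hourly_totals
WHERE period_start >= '2014-06-01'
AND period_start < '2014-07-01'
GROUP BY client_id, request_type;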
I would use the Kinesis service to produce daily, almost real-time billing.
For this purpose I would create a separate DynamoDB table just for the calculated data (another option is to run it on flat files).
Then I would add a process that sends events to the Kinesis service just after you update the regular DynamoDB table.
That way, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already-calculated table.
I hope that helps.
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.
I'm facing a problem that perhaps someone around here can help me with.
I work in a business intelligence company and I'd like to simulate the whole usage cycle of our product the way our clients use it.
The short version is that our customers insert some 20 million records into their database on a daily basis, and our product crunches the new data at the end of the day.
I would like to automatically create around 20 million records and insert them into some database every day (probably MSSQL).
I should point out that the number of records should vary from day to day, between 15 and 25 million. Other than that, the data needs to be inserted into 6 tables linked by foreign keys.
I usually use Redgate's SQL Generator to create data, but as far as I can tell it's good for one-time data generation, as opposed to the ongoing data generation I'm looking for.
If anyone knows of methods/tools adequate to this situation, please let me know.
Thanks!
You could also write a small Java (or similar) program that gets the starting ID from the database, picks a random number of rows to insert, and then executes the data-generation tool as a child process.
For example, see Runtime.exec():
http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
You can then run your program as a scheduled task or cron job.
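Alternatively, if you would rather drive it from the database side instead of Java, here is a rough T-SQL sketch of the same idea (my_table and its columns are hypothetical, and real data spread across 6 foreign-keyed tables would need more care); SQL Server Agent could run it nightly:

-- Pick a random daily volume between 15 and 25 million rows
DECLARE @rows INT = 15000000 + ABS(CHECKSUM(NEWID())) % 10000001;

-- Generate the rows set-based (a row-by-row loop would be far too
-- slow at this volume); three cross joins give billions of candidates
INSERT INTO my_table (created_at, amount)
SELECT TOP (@rows)
       GETDATE(),
       ABS(CHECKSUM(NEWID())) % 1000
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
CROSS JOIN sys.all_objects c;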
There is a similar topic from before: Daylight saving time and time zone best practices
What I am trying to ask is related, but different.
What is the suggested practice for Date handling?
I am looking for a more 'logical' date: for example, the business date of our application, or the date of birth of a person.
Normally I store it as a Date (in Oracle), with a 0:00:00 time. This works fine IF all components of my application are in the same timezone, because that date in the DB means 0:00:00 in the DB's timezone. If I present my data to a user in another timezone, problems arise easily: for example, a Date of 2012-12-25 0:00 Hong Kong time is in fact 2012-12-24 16:00 London time.
I have thought of two ways to solve this, but both have their deficiencies.
The first is to store it as a String. The drawback is obvious: I need to do a lot of conversions in our app or queries, and I lose a lot of date arithmetic.
The second is to store it as a Date, but in a pre-defined timezone (e.g. UTC). When the application displays the date, it has to display it in the UTC timezone. However, I will need a lot of timezone manipulation in my application code.
What is the suggested way of handling Dates? Or do most people simply use one of the above three approaches (including the assume-to-be-same-timezone one)?
A date is a way of identifying a day, and a day is relative to the local time zone, that is, to the sun. A day is a 24-hour period (although, because of leap seconds and other sidereal corrections, that 24 hours is only a very close approximation). So the date of December 5 in London names a different 24-hour period from the date December 5 in New York. One consequence of this is that if you want to do arithmetic on dates between different time zones, you can only do so to an accuracy of +/- 1 day. As a data structure, this is a conventional date (say, year and day offset) plus a time zone identified by an hour offset from UTC (beware, there are some half-hour offsets out there).
It should be clear, then, that converting dates in one time zone to dates in another is not possible, because they represent different intervals. (Mostly. There can be exceptions for adjacent time zones, one on daylight savings time and one not.) Date arithmetic between different time zones can't ordinarily be done either, for the same reason: sometimes there's not enough data captured to get the perfect answer.
But the full answer to your concern behind the question you asked depends on what the dates mean. If they are, for example, legal dates for things like deadlines, then those dates are conventionally taken with respect to the location of an office building, say, where the deadline is clocked. In that case the day boundary follows a single time zone, and there would be no sense in storing it redundantly. Times would be converted to dates in that time zone when they are stored.
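For example, in Oracle (the filing table, column names, and the office's zone are made up for illustration), stamping the legal date at write time might look like:

INSERT INTO filing (filing_id, filed_date)
VALUES (:id,
        -- the deadline day is defined by the office's clock,
        -- no matter where the submission came from
        TRUNC(CAST(SYSTIMESTAMP AT TIME ZONE 'America/New_York' AS DATE)));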
Using UTC everywhere makes things easy and consistent. Keeping dates (as points in time) stored as UTC in the DB, doing math on them in UTC, doing explicit conversions to local time only in the view layer, and converting user-input dates to UTC will give you quite a stable base for any action or computation you need. You don't really need to show the dates to the user in UTC: actually, showing them in local time and hinting that you can show them in UTC gives more useful information.
If you need to keep only the date (like the birthday you mentioned in a comment), explicitly cut the time information away (for example with a conversion from DateTime to Date at the DB level, or by any code-level means).
This is a good example of normalization: the same thing you're doing when you use UTF over codepages, or keep to the same units in physical computations.
Using that approach, your code for diffing dates will be much simpler. As for displaying dates and converting between UTC and local time, many frameworks (or even the languages themselves) give you tools to deal with local time while working with UTC.
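As a small illustration of that split in Oracle (event_log, event_utc and the target zone are invented names, assuming event_utc is a plain TIMESTAMP column): the stored value stays UTC, and local time appears only at presentation:

SELECT event_id,
       event_utc,                                        -- stored as UTC
       FROM_TZ(event_utc, 'UTC')
           AT TIME ZONE 'Asia/Hong_Kong' AS event_local  -- view layer only
FROM event_log;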
I once designed and led the build of a real-time transaction system that handled customers in multiple timezones, but they were all invoiced on time periods of one timezone.
The solution that worked well for me was to store both the UTC time and the local time on each record. This came after using the system for a few years and realising there were really two separate uses for date columns, so the data was stored that way.
Although it used up a few more bytes on disk (big deal: disk is cheap), it made things so simple when querying: "casual" queries, e.g. the help desk searching for a customer transaction, used the local time column, and "formal" queries, e.g. accounting department invoice batch runs, used the UTC column.
It also dealt with the issue of the local time "reliving" an hour of transactions every time daylight saving went back one hour, which can make using just the local time a real pain if you're a 24-hour business.
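A sketch of what that looked like, with invented names (generic SQL; the real schema differed):

CREATE TABLE customer_txn (
    txn_id    INTEGER PRIMARY KEY,
    txn_utc   TIMESTAMP NOT NULL,  -- "formal" column: invoice runs, audits
    txn_local TIMESTAMP NOT NULL   -- "casual" column: help desk lookups
);

-- Help desk: what did the customer do yesterday afternoon, local time?
SELECT * FROM customer_txn
WHERE txn_local >= TIMESTAMP '2014-06-01 12:00:00'
AND txn_local < TIMESTAMP '2014-06-01 18:00:00';

-- Invoicing: an unambiguous UTC billing period, immune to DST
SELECT * FROM customer_txn
WHERE txn_utc >= TIMESTAMP '2014-06-01 00:00:00'
AND txn_utc < TIMESTAMP '2014-07-01 00:00:00';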
I was recently asked an interview question about a hypothetical web-based booking system and how I would design the database schema to minimize duplication and maximize flexibility.
The use case is that an admin would enter the availability of a property into the system. There could be multiple time periods set, for example 1st of April 2009 to 14th of April 2009 and 3rd of July 2009 to 21st of July 2009.
A user is then only able to place a booking within the periods made available, of equal or shorter length.
How would you store this information in a database?
Would you use something as simple (really simplified) as:
AVAILABILITY(property_id, start_date, end_date);
BOOKING(property_id, start_date, end_date);
Could you then easily construct a web page that showed a calendar of availability with booked periods blanked out? Would it be easy to build reports from this database schema? Is it as easy as it seems?
It might be easier to work with a single table for both availability and booking, with a granularity of 1 day:
property_date (property_id, date, status);
Column status would have (at least) the following 2 values:
Available
Booked
Entering a period of availability, e.g. 1st to 14th of April, would entail the application inserting 14 rows into property_date, each with a status of 'Available'. (To the user it should seem like a single action.)
Booking the property for the period 3rd to 11th April would entail checking that an 'Available' row existed for each day, and changing the status to 'Booked'.
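A minimal sketch of those two operations (property 42 and the dates are invented; the application checks the UPDATE's row count to confirm the whole period was available):

-- Make 1st-14th April available: one row per day
INSERT INTO property_date (property_id, date, status)
VALUES (42, DATE '2009-04-01', 'Available');
-- ...and so on, one INSERT per day through 2009-04-14

-- Book 3rd-11th April: flip only days that are still available
UPDATE property_date
SET status = 'Booked'
WHERE property_id = 42
AND date BETWEEN DATE '2009-04-03' AND DATE '2009-04-11'
AND status = 'Available';
-- expect 9 rows updated; anything less means the period
-- wasn't fully available, so roll back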
This model may seem a bit "verbose", but it has some advantages:
Checking availability for any date is easy
Adding a booking automatically updates the availability; there isn't a separate Availability table to keep in sync.
Showing availability in a web page would be very simple
It is easy to add new statuses to record different types of unavailability - e.g. closed for maintenance.
NB If "available" is the most common state of a property, it may be better to reverse the logic so that there is an 'Unavailable' status, and the absence of a row for a date means it is available.