Say I have a domain model where 3 objects interact, Reservation, Vehicle and Fleet. The Fleet has many Vehicles, and each Vehicle can have many Reservations. e.g.
Fleet -1--*- Vehicle -1--*- Reservation
If I want Fleet to have a method getMostPopularVehicle(), I could have it iterate each Vehicle and count the number of Reservations.
If I then want to introduce an ORM for persistence, should I (1) have getMostPopularVehicle() call a data layer method to populate the Fleet, Vehicles and Reservations before iterating it as before? Or should I (2) now just query the database directly to get the most popular vehicle in the data layer method?
My thinking is that (1) is correct, but a database query can be so efficient. Perhaps I am approaching this all wrong?
Both approaches are valid. It depends on what you want to achieve; if you can get performance by issueing a (HQL or JPQL or whatever your ORM supports) query, which uses your domain model as well, it is quite legal to do this.
If these statistics were expected to be readily available I'd probably go for one of 2 other options that would avoid executing a potentially heavy duty query as often (if at all):
Retrieve the data by querying the database directly (avoiding the domain model) using CQRS. The returned view models can be cached to avoid subsequent users causing the query to execute again.
Create a separate reporting database. Key the values on fleet id and vehicle id and maybe increment the value by 5 for every reservation and then decrement EVERY value by 1 once a week, never going any lower than 0. This way, you can keep fairly representative rolling statistics, whilst being able to quickly query the table and find the highest value row for a given fleet.
Related
I think the question in the title speaks it all and is general.
I can give a concrete example as well:
I have tagged articles and want to find similar articles with the tags associated with them.
The score function will look at two articles and count the number of tags in common.
Since the score is not stored anywhere, I'll have to calculate the score everytime I need to find similar articles given an article.
But this is too expensive.
What is the common work-around to this kind of problem in general?
Is there a better approach for my specific tag problem? (e.g. solr's moreLikeThis)
edit
I'm using postgres, if that matters.
I'm looking for a general solution that people used successfully, such as you should batch calculate the score and save it somewhere and etc...
The answer will vary wildly by database product and version. For example, in some database products, it may be the case that a view or an indexed view might be faster than the more common solution...
Typically the way to handle a situation like this is by precalculating the result. You can do that in a handful of ways:
a. You can use something like triggers (added in the SQL 99 standard) that update the counts as rows are added, updated or removed from the source table. In this solution, you are making a (presumably) small sacrifice on inserts, updates and deletes of the source table in order to make significant gains in retrieving the information.
b. You can use a data warehouse where you accept some level of latency of live data to reported data. That means you accept that the data queried from the data warehouse will be stale by some accepted number of minutes, hours, days, or weeks. The data warehouse works by periodically querying the live OLTP (Online Transaction Processing) data and updates the OLAP (Online Analytical Processing) database which contains the precalculated results. You then run your reports off the OLAP data or a combination of OLTP and OLAP data. A formal database warehouse isn't required to achieve the equivalent results. You could write a procedure which is executed on a timer that updates a table periodically with updated results.
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) using DB2 I used a MQT (Materialized Query Table) that works like a view only it's driven by a query and you can schedule how often you want it to refresh; e.g. every 5 min. Then that table stored the count values.
2) in the software package itself I set information like that as a system variable. So in Apache you can set a system wide variable and refresh it every 5 minutes. Then it's somewhat accurate but your only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it so it's been while but I think in PHP was was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
Plant data is real time data from plant process, such as, press, temperature, gas flow and so on. The data model of these data is typically like this:
(Point Name, Time stamps, value(float or integer), state(int))
We have thousands of points and longtime to store. And important, we want search them easy and quickly when we need.
A typically search request is like:
get data order by time stamp
from database
where Point name is P001_Press
between 2010-01-01 and 2010-01-02
A database similar to MySql is not suitable for us, because the records is too many and the query is too slowly.
So, how to store data (like above) and where to store them? Any NOSQL databases?? Thanks!
This data and query pattern actually fits pretty well into a flat table in a SQL database, which means implementing it with NoSQL will be significantly more work than fixing your query performance in SQL.
If your data is inserted in real time, you can remove the order by clause as the date will already be sorted by timestamp and there is no need to waste time resorting it. An index on point name and timestamp should get you good performance on the rest of the query.
If you are really getting to the limits of what a SQL table can hold (many millions of records) you have the option of sharding - a table for each data point may work fairly well.
I am working with google app engine and using the low leval java api to access Big Table. I'm building a SAAS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name: Price: Commission: Time estimate
Full Detail, Regular Size: 160 75 3.5 hours
$10 Off Full Detail Coupon: -10 0 0 hours
Premium Detail: 220 110 4.5 hours
Derived totals(not a line item): $370 $185 8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through resulting in the detailer had to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, i'd just make another appointment on the second day with a line item that said "Finish Up" and the cost would be $0. Or if I had one customer pay for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data in-congruency. Data in-congruency can be a serious problem especially for cases involving financial information such as the third exmaple where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below, is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data incongruity errors. With this structure, changes to the data can be done with one operation without having to worry about updating other tables. Reads, however, can require multiple reads coupled with in-memory organization of data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task que so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
start_time
etc...
Invoice
due_date
etc...
Payment
invoice_Key_List
amount_paid
etc...
Line_Item
appointment_Key_List
invoice_Key
name
price
etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment and weather or not payment as been received for each appointment. This would be a common query when loading the calendar for appointment scheduling or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" who's "start_time" field lies between the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" who's appointment_key_List field includes any of the returns appointments
Add each invoice_key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice ket set (this can be done in one asynchronous operation using app engine)
Add each key from the returned invoices into a List
QUERY for all "Payments" who's invoice_key_list field contains a key matching any of the returned invoices
Reorganize in memory so that each appointment reflects the line_items that are scheduled for it, the total price, total estimated time, and weather or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory will be pretty fast)
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive, some are more write intensive. Below, I describe a typical use-case and break down operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries customer for their information, I pictured this to be a succession of asynchronous reads as the manager enters each piece of information such as phone number, name, e-mail, address, etc... Or if necessary, perhaps one write at the end after the client application has gathered all of the information and it is then submitted.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read Manager loads the calendar
Read Manager loads the appointment for the customer he wants to call
Write Manager clicks "Call" button, a call is initiated and a new CallReacord entity is written
Read Call server responds to call request and reads CallRecord to find out how to handle the call
Write Call server writes updated information to the CallRecord
Write when call is closed, call server makes another request to the server to update the CallRecord resource (note: this request is not time-critical)
Accepted answer::
Both of the top two answers were very thoughtful and appreciated. I accepted the one with few votes in order to imperfectly equalize their exposure as much as possible.
You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of itself isn't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each of appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1 then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 are small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need to it to scale to huge numbers of users (it won't).
Suggestions for improvement:
Denormalization! Try storing the keys for Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListProperty are not indexed to avoid problems with exploding indices
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
So since you should optimize for reads, oftentimes your writes will grow to gigantic proportions. So gigantic that you can't fit it in the 30 second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task queue to pull it back out, generate the HTML to be rendered, and put it in there in the background. This might mean your model is immediately ready to display until the task has finished with it, so you'll need a graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number
Here are a few app-engine specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. for example, if you are filtering on an appt date being between July 1st and July 4th, you couldn't also filter by price > 200
Transactions on app engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".
Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistaken to create one giant table to store messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables which grows by application usage, you're describing using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table is trivial on a modern database. At that level, your only performance concerns are good indexes and how you do joins.
Billions of records?
Assuming you have constantly 1000 active users with 1 message per minute, this results in 1.5mio messages per day, and approx 500mio messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room could be performed, each database has limits over the number of tables that may be created, so given an infinite number of chat rooms, you are required to create an infinite number of tables, which is not going to work.
You can on the other hand store billions of rows of data, storage is not normally the issue given the space - retrieval of the information within a sensible time frame is however and requires careful planning.
You could partition the messages by a date range, and if planned out, you can use LUN migration to move older data onto slower storage, whilst leaving more recent data on the faster storage.
Strictly speaking, your design is right, a single table. fields with low entropy {e.g 'userid' - you want to link from ID tables, i.e following normal database normalization patterns}
you might want to think about range based partitioning. e.g 'copies' of your table with a year prefix. Or maybe even a just a 'current' and archive table
Both of these approaches mean that your query semantic is more complex {consider if someone did a multi-year search}, you would have to query multiple tables.
however, the upside is that your 'current' table will remain at a roughly constant size, and archiving is more straightforward. - {you can just drop table 2005_Chat when you want to archive 2005 data}
-Ace