Database design - Google App Engine

I am working with Google App Engine and using the low-level Java API to access Bigtable. I'm building a SaaS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name                             | Price | Commission | Time estimate
Full Detail, Regular Size        | 160   | 75         | 3.5 hours
$10 Off Full Detail Coupon       | -10   | 0          | 0 hours
Premium Detail                   | 220   | 110        | 4.5 hours
Derived totals (not a line item) | $370  | $185       | 8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
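As a minimal sketch of how the derived totals in the line item table above could be computed rather than stored, the following sums the three example line items. The `LineItem` class and `derive_totals` function are hypothetical names used only for illustration; they are not part of the original application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """Hypothetical in-memory shape of one line item (not the poster's schema)."""
    name: str
    price: float        # dollars; negative for a discount
    commission: float   # dollars
    hours: float        # time estimate


def derive_totals(items):
    """Return (price, commission, hours) summed over all line items."""
    return (
        sum(i.price for i in items),
        sum(i.commission for i in items),
        sum(i.hours for i in items),
    )


items = [
    LineItem("Full Detail, Regular Size", 160, 75, 3.5),
    LineItem("$10 Off Full Detail Coupon", -10, 0, 0),
    LineItem("Premium Detail", 220, 110, 4.5),
]
totals = derive_totals(items)  # (370, 185, 8.0), matching the table
```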
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through and the detailer had to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, I'd just make another appointment on the second day with a line item that said "Finish Up" and a cost of $0. Or if one customer paid for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data incongruity. Data incongruity can be a serious problem, especially in cases involving financial information, such as the third example, where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data incongruity errors. With this structure, changes to the data can be done with one operation without having to worry about updating other tables. Reads, however, can require multiple queries coupled with in-memory organization of data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task queue so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
    start_time
    etc...
Invoice
    due_date
    etc...
Payment
    invoice_Key_List
    amount_paid
    etc...
Line_Item
    appointment_Key_List
    invoice_Key
    name
    price
    etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment and whether or not payment has been received for each appointment. This would be a common query when loading the calendar for appointment scheduling or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" whose "start_time" field lies within the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" whose appointment_Key_List field includes any of the returned appointments.
Add each invoice_Key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice key set (this can be done in one asynchronous operation using App Engine).
Add each key from the returned invoices into a List.
QUERY for all "Payments" whose invoice_Key_List field contains a key matching any of the returned invoices.
Reorganize in memory so that each appointment reflects the line items that are scheduled for it, the total price, the total estimated time, and whether or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory will be pretty fast)
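The final in-memory step might look roughly like the sketch below. It uses plain dicts as stand-ins for datastore entities, and every field name (`appointment_keys`, `invoice_key`, `invoice_keys`, `hours`) is an assumption made for the example. It also makes a simplifying assumption that an invoice counts as paid as soon as any payment references it, which is why the invoice entities themselves are not needed here; real accounting would compare amounts.

```python
from collections import defaultdict


def summarize(appointments, line_items, payments):
    """Group line items under each appointment and flag paid appointments.

    Assumption for this sketch: an invoice is "paid" if at least one
    payment lists its key. Amounts are deliberately not reconciled.
    """
    paid_invoices = {key for p in payments for key in p["invoice_keys"]}

    items_by_appointment = defaultdict(list)
    for item in line_items:
        # A line item may belong to several appointments (e.g. a rain-out).
        for appointment_key in item["appointment_keys"]:
            items_by_appointment[appointment_key].append(item)

    summary = {}
    for appointment in appointments:
        items = items_by_appointment[appointment["key"]]
        summary[appointment["key"]] = {
            "line_items": [i["name"] for i in items],
            "total_price": sum(i["price"] for i in items),
            "total_hours": sum(i["hours"] for i in items),
            "paid": bool(items)
            and all(i["invoice_key"] in paid_invoices for i in items),
        }
    return summary


appointments = [{"key": "a1"}, {"key": "a2"}]
line_items = [
    {"name": "Full Detail", "price": 160, "hours": 3.5,
     "appointment_keys": ["a1"], "invoice_key": "inv1"},
    {"name": "Coupon", "price": -10, "hours": 0,
     "appointment_keys": ["a1"], "invoice_key": "inv1"},
    {"name": "Premium Detail", "price": 220, "hours": 4.5,
     "appointment_keys": ["a2"], "invoice_key": "inv2"},
]
payments = [{"invoice_keys": ["inv1"], "amount_paid": 150}]
summary = summarize(appointments, line_items, payments)
```

Each pass is linear in the number of entities returned, so for a calendar-sized result set the in-memory work should be small next to the datastore round trips.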
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive, some are more write intensive. Below, I describe a typical use-case and break down operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries customer for their information, I pictured this to be a succession of asynchronous reads as the manager enters each piece of information such as phone number, name, e-mail, address, etc... Or if necessary, perhaps one write at the end after the client application has gathered all of the information and it is then submitted.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read - Manager loads the calendar
Read - Manager loads the appointment for the customer he wants to call
Write - Manager clicks "Call" button, a call is initiated and a new CallRecord entity is written
Read - Call server responds to call request and reads CallRecord to find out how to handle the call
Write - Call server writes updated information to the CallRecord
Write - When the call is closed, the call server makes another request to the server to update the CallRecord resource (note: this request is not time-critical)
Accepted answer:
Both of the top two answers were very thoughtful and appreciated. I accepted the one with few votes in order to imperfectly equalize their exposure as much as possible.

You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of itself isn't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel, so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1 then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 is small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need it to scale to huge numbers of users (it won't).
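If you keep the IN queries, one possible workaround (not something the answer itself proposes) is to split the key list into batches no larger than the 30-value limit mentioned above and concatenate the results. In this sketch `run_in_query` is a hypothetical callable standing in for whatever actually executes one IN query.

```python
def chunked(values, size=30):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def query_in_batches(keys, run_in_query, limit=30):
    """Run one IN query per batch of keys and concatenate the results.

    `run_in_query` is a placeholder: it takes a list of at most `limit`
    keys and returns the matching entities.
    """
    results = []
    for batch in chunked(list(keys), limit):
        results.extend(run_in_query(batch))
    return results
```

This keeps each query within the limit but does not reduce the total number of sub-queries, so it only addresses the hard failure, not the cost.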
Suggestions for improvement:
Denormalization! Try storing the keys for Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListProperty fields are not indexed, to avoid problems with exploding indexes.
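A rough sketch of that idea, using an in-memory dict as a stand-in for the datastore (the `put`, `get_multi`, `add_line_item` and `load_appointment` helpers and the `line_item_keys` field are all illustrative names, not App Engine APIs): the key list is maintained at write time, so reading an appointment needs only gets by key.

```python
datastore = {}  # stand-in for the real datastore: key -> entity dict


def put(key, entity):
    datastore[key] = entity


def get_multi(keys):
    """Batch fetch by key; no query or index scan is involved."""
    return [datastore[k] for k in keys]


def add_line_item(appointment_key, item_key, item):
    """Write the line item and record its key on the appointment."""
    put(item_key, item)
    datastore[appointment_key]["line_item_keys"].append(item_key)


def load_appointment(appointment_key):
    """Return the appointment and its line items using gets by key only."""
    appointment = datastore[appointment_key]
    return appointment, get_multi(appointment["line_item_keys"])


put("a1", {"start_time": "2010-07-01T09:00", "line_item_keys": []})
add_line_item("a1", "li1", {"name": "Full Detail", "price": 160})
add_line_item("a1", "li2", {"name": "Coupon", "price": -10})
appointment, items = load_appointment("a1")
```

The same pattern would apply to invoice and payment keys. In a real datastore the two writes in `add_line_item` are not atomic unless they share a transaction, so that is something to design for.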
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you want this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
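The caching pattern described here can be sketched without any App Engine dependency. The `TTLCache` class below is a hypothetical in-process stand-in for Memcache, shown only to illustrate the "check the cache, otherwise compute and store" flow with an expiry time.

```python
import time


class TTLCache:
    """Tiny in-process cache with expiry; a stand-in for Memcache."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() and cache its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A caller would wrap the expensive query, for example `cache.get_or_compute("calendar:2010-07", load_calendar)`, where `load_calendar` is whatever function runs the datastore queries.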

As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
Because you should optimize for reads, your writes will often grow to gigantic proportions, so gigantic that they can't fit in the 30-second time limit for requests. That's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task to pull it back out, generate the HTML to be rendered, and store that in the background. This might mean your model is not immediately ready to display until the task has finished with it, so you'll need graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
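A minimal sketch of that write path, assuming an in-memory dict for the datastore and a `deque` in place of the App Engine task queue (all function names here are made up for the example): the write stores only the bare model, a background task fills in the rendered HTML, and reads fall back to rendering on the fly until the task has run.

```python
import html
from collections import deque

store = {}       # stand-in datastore: key -> {"model": ..., "html": ...}
tasks = deque()  # stand-in for the task queue


def render(model):
    """The 'slow way': build the display HTML from the raw model."""
    return "<li>{}: ${}</li>".format(html.escape(model["name"]), model["price"])


def save(key, model):
    """Fast write: store the bare necessities and enqueue the rendering."""
    store[key] = {"model": model, "html": None}
    tasks.append(key)


def run_pending_tasks():
    """Background worker: pre-render everything that was enqueued."""
    while tasks:
        record = store[tasks.popleft()]
        record["html"] = render(record["model"])


def read_html(key):
    """Serve pre-rendered HTML if present, otherwise degrade gracefully."""
    record = store[key]
    if record["html"] is not None:
        return record["html"]
    return render(record["model"])
```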
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number

Here are a few app-engine specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. For example, if you are filtering on an appointment date being between July 1st and July 4th, you couldn't also filter by price > 200.
Transactions on app engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".
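One common way to live with the single-inequality restriction is to let the datastore handle one inequality and apply the other in application code. The sketch below simulates both steps on a list of dicts; the field names are assumptions for the example.

```python
from datetime import date


def appointments_in_range_over_price(appointments, start, end, min_price):
    """Filter by date range first, then by price in application code."""
    # Step 1: the part a datastore query could do (inequality on one property).
    in_range = [a for a in appointments if start <= a["date"] <= end]
    # Step 2: the second inequality, applied in memory to the smaller result.
    return [a for a in in_range if a["price"] > min_price]


appointments = [
    {"key": "a1", "date": date(2010, 7, 2), "price": 370},
    {"key": "a2", "date": date(2010, 7, 3), "price": 150},
    {"key": "a3", "date": date(2010, 7, 9), "price": 400},
]
matches = appointments_in_range_over_price(
    appointments, date(2010, 7, 1), date(2010, 7, 4), 200
)
```

Which property gets the datastore-side inequality is a design choice; it usually pays to pick the one that narrows the result set the most.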

Related

database design for large streaming data with minimal latency

Following is the scenario:
Customer places an order.
Order has type: Physical / Downloadable.
Order is placed from: Web / App.
Order is placed from a Location: UK,AUS,etc.
Can have more dimensions in future.
Consider that all of the dimensions change frequently in every order. And the data is quite huge, approximately 1.3 million records per hour.
I want to design this so that reports can drill down by any requested dimension for each customer.
Example:
- Customer 'A' has placed how many orders of type 'Physical' from 'AUS'
- Customer 'A' has placed how many orders in all.
- Customer 'A' has placed how many orders of type 'Downloadable' from 'App'.
etc.
I need these reports in real time, so low-latency writes and reads are a must. Which NoSQL database would be a good fit? And how can this data be structured so that it can be sliced and diced by any required dimension, as well as by a combination of more than one dimension?
If you need high performance, then I would recommend ScyllaDB, which can handle over 1M ops/s per node (on good hardware). It shares its data model with Cassandra, so you can model and query your data using CQL. You can give it a free test drive with just a couple of clicks here.
Regarding modeling: a useful technique is to model around your queries. If you have a particular query, prepare a table that serves this query in the most efficient way. With this technique you duplicate data, creating as many tables holding the same data as you have different types of queries. Duplicating data comes at a price, so you need to trade off performance against cost depending on your needs. You can read more about it here.
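To make the "one table per query shape" idea concrete, here is a small in-memory sketch in which every order increments one counter for each combination of dimensions, so any drill-down becomes a single lookup. The dimension names and the `record_order` and `count_orders` helpers are assumptions for illustration; in Cassandra or ScyllaDB each combination would more likely be its own table.

```python
from collections import Counter
from itertools import combinations

DIMENSIONS = ("type", "source", "location")  # assumed dimension names
counts = Counter()


def record_order(customer, order):
    """Increment one counter per subset of dimensions (2**n per order)."""
    pairs = tuple((d, order[d]) for d in DIMENSIONS)
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            counts[(customer, subset)] += 1


def count_orders(customer, **filters):
    """Look up the pre-aggregated count for the given dimension filters."""
    subset = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return counts[(customer, subset)]


record_order("A", {"type": "Physical", "source": "Web", "location": "AUS"})
record_order("A", {"type": "Downloadable", "source": "App", "location": "UK"})
record_order("A", {"type": "Physical", "source": "App", "location": "AUS"})
```

The write amplification doubles with every added dimension, which is the "price" of duplication the answer refers to.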

Paging of frequently changing data

I'm developing a web application which displays a list of, let's say, "threads". The list can be sorted by the number of likes a thread has. There can be thousands of threads in one list.
The application needs to work in a scenario where the likes of a thread can change more than 10x in a second. The application furthermore is distributed over multiple servers.
I can't figure out an efficient way to enable paging for this sort of list. And I can't transmit the whole sorted list by likes to a user at once.
As soon as a user goes to page 2 of this list, the list has likely changed and may contain threads already shown on page one.
Solutions which don't work:
Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshotting the list in a temporary database table (the data changes too frequently and needs to be current)
(If it matters I'm using MongoDB+c#)
How would you solve this kind of problem?
Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.
So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-03 07:30 UTC". That makes your query super easy to implement. Then, when you pull down the next set of elements, you just look for anything that occurred before the last element on the page.
The downfall of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.
It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.
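A minimal sketch of this timestamp-cursor paging over an in-memory list (the `created` field name and the `page_before` function are assumptions; a real implementation would push the filter, sort and limit into the database query):

```python
def page_before(threads, before, page_size):
    """Return (page, next_cursor) for threads created strictly before `before`.

    The page is ordered newest first. `next_cursor` is the timestamp of the
    last item on the page, or None when nothing older remains.
    """
    older = sorted(
        (t for t in threads if t["created"] < before),
        key=lambda t: t["created"],
        reverse=True,
    )
    page = older[:page_size]
    next_cursor = page[-1]["created"] if page else None
    return page, next_cursor


threads = [{"id": i, "created": 1000 + i} for i in range(1, 6)]
page1, cursor1 = page_before(threads, before=2000, page_size=2)
page2, cursor2 = page_before(threads, before=cursor1, page_size=2)
```

Note that with a strict "before" comparison, two threads sharing the exact same timestamp across a page boundary could be skipped; adding a tie-breaker such as the id to the cursor avoids that.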
This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.
OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.
You start off with a "Time" table:
Time {
    timestamp long (PK)
    created datetime
    last_queried datetime
}
This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.
Now we can start talking about "Threads":
Threads {
    id long (PK)
    identifier long
    last_modified datetime
    title string
    body string
    score int
}
The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.
Now, how do we query this? You can't just directly query Threads. You need to include a star table:
ThreadStar {
    timestamp long (FK -> Time.timestamp)
    thread_id long (FK -> Threads.id)
    thread_identifier long (matches Threads[thread_id].identifier)
    (timestamp, thread_identifier should be unique)
}
This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
AND Threads.identifier = {thread_identifier}
That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:
SELECT Threads.*
FROM Threads
JOIN ThreadStar ON Threads.id = ThreadStar.thread_id
WHERE ThreadStar.timestamp = {timestamp}
ORDER BY Threads.score DESC
Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.
So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.
So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.
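The insert-then-copy step can be tried out with SQLite from Python's standard library. This is only a sketch of the scheme described above, with several columns (created, last_queried, body, last_modified) left out for brevity; the `write_thread` and `threads_at` helpers are names invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Time (timestamp INTEGER PRIMARY KEY);
CREATE TABLE Threads (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    identifier INTEGER NOT NULL,
    title TEXT,
    score INTEGER
);
CREATE TABLE ThreadStar (
    timestamp INTEGER REFERENCES Time(timestamp),
    thread_id INTEGER REFERENCES Threads(id),
    thread_identifier INTEGER,
    UNIQUE (timestamp, thread_identifier)
);
""")


def write_thread(new_ts, identifier, title, score):
    """Create or update a thread by inserting a row and a new snapshot."""
    prev_ts = db.execute("SELECT MAX(timestamp) FROM Time").fetchone()[0]
    thread_id = db.execute(
        "INSERT INTO Threads (identifier, title, score) VALUES (?, ?, ?)",
        (identifier, title, score),
    ).lastrowid
    db.execute("INSERT INTO Time (timestamp) VALUES (?)", (new_ts,))
    if prev_ts is not None:
        # Copy the previous snapshot, skipping the thread being replaced.
        db.execute(
            "INSERT INTO ThreadStar "
            "SELECT ?, thread_id, thread_identifier FROM ThreadStar "
            "WHERE timestamp = ? AND thread_identifier != ?",
            (new_ts, prev_ts, identifier),
        )
    db.execute(
        "INSERT INTO ThreadStar VALUES (?, ?, ?)",
        (new_ts, thread_id, identifier),
    )


def threads_at(ts):
    """Return (identifier, title, score) rows as of snapshot `ts`."""
    return db.execute(
        "SELECT Threads.identifier, Threads.title, Threads.score "
        "FROM Threads JOIN ThreadStar ON Threads.id = ThreadStar.thread_id "
        "WHERE ThreadStar.timestamp = ? ORDER BY Threads.score DESC",
        (ts,),
    ).fetchall()


write_thread(1, 100, "A", 5)
write_thread(2, 200, "B", 9)
write_thread(3, 100, "A", 20)  # an "update" of thread 100
```

After these three writes, a reader paging through snapshot 2 still sees thread 100 with its old score, while snapshot 3 shows the new score, which is the stable ordering the answer is after.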
You also don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.
Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budget, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).
Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.
This general mechanism also works if you have more nested data, but your queries get harder.
Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!
Yeah. Make sure your tests are solid.
But at least you aren't sensitive to re-ordered data mid-paging.
For constantly changing data such as likes, I would use a two-stage approach. For the frequently changing data I would use an in-memory DB to keep up with the change rates and flush this periodically to the "real" DB.
Once you have that, the query for constantly changing data is easy:
1. Query the DB.
2. Query the in-memory DB.
3. Merge the frequently changed data from the in-memory DB with the "slow" DB data.
4. Remember which results you have already displayed, so that pressing the next button will not display an already displayed value a second time on a different page because its rank has changed.
If many people look at the same data, it might help to cache the results of step 3 to reduce the load on the real DB even further.
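The merge-and-deduplicate steps could be sketched as follows, with plain dicts standing in for both the persistent DB and the in-memory store (`ranked_page` and its parameter names are hypothetical):

```python
def ranked_page(db_likes, pending_deltas, seen_ids, page_size):
    """Merge flushed like counts with pending deltas and return the next page.

    db_likes:       thread id -> like count already flushed to the real DB
    pending_deltas: thread id -> like changes still held in memory
    seen_ids:       set of thread ids already shown; updated in place
    """
    merged = dict(db_likes)
    for thread_id, delta in pending_deltas.items():
        merged[thread_id] = merged.get(thread_id, 0) + delta
    # Highest like count first; thread id breaks ties deterministically.
    ranked = sorted(merged, key=lambda tid: (-merged[tid], tid))
    page = [tid for tid in ranked if tid not in seen_ids][:page_size]
    seen_ids.update(page)
    return page


seen = set()
first = ranked_page({1: 10, 2: 8, 3: 5}, {3: 6}, seen, page_size=2)
second = ranked_page({1: 10, 2: 8, 3: 5}, {3: 6, 2: 5}, seen, page_size=2)
```

Here thread 2 overtakes the others between the two calls, yet the second page contains only thread 2 because threads 3 and 1 were already shown.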
Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.
I would cache all "thread" results on the server the first time the user hits the database. Then return the first page of data to the user, and for each subsequent next-page call return the cached results.
To minimize memory usage you can cache only the record ids and fetch the full data when the user requests it.
The cache can be evicted each time the user leaves the current page. If it isn't a ton of data I would stick to this solution, because the user won't get annoyed by the data constantly changing.

Large number of entries: how to calculate quickly the total?

I am writing a rather large application that allows people to send text messages and emails. I will charge 7c per SMS and 2c per email sent. I will allow people to "recharge" their account. So, the end result is likely to be a database table with a few small entries like +100 and many, many entries like -0.02 and -0.07.
I need to check a person's balance immediately when they are trying to send an email or a message.
The obvious answer is to have cached "total" somewhere, and update it whenever something is added or taken out. However, as always in programming, there is more to it: what about monthly statements, where the balance needs to be carried forward from the previous month? My "intuitive" solution is to have two levels of cache: one for the current month, and one entry for each month (or billing period) with three entries:
The total added
The total taken out
The balance to that point
Are there better, established ways to deal with this problem?
Largely depends on the RDBMS.
If it were SQL Server, one solution is to create an Indexed view (or views) to automatically incrementally calculate and hold the aggregated values.
Another solution is to use triggers to aggregate whenever a row is inserted at the finest granularity of detail.
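As an illustration of the trigger approach, the sketch below uses SQLite (via Python's standard library) rather than SQL Server, purely because it runs anywhere; the table, trigger and function names are made up for the example. Amounts are kept as integer cents to avoid floating-point drift.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE balance (
    account_id INTEGER PRIMARY KEY,
    cents INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE ledger (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    cents INTEGER NOT NULL
);
-- Keep the running balance in step with every ledger insert.
CREATE TRIGGER ledger_after_insert AFTER INSERT ON ledger
BEGIN
    INSERT OR IGNORE INTO balance (account_id, cents)
        VALUES (NEW.account_id, 0);
    UPDATE balance SET cents = cents + NEW.cents
        WHERE account_id = NEW.account_id;
END;
""")


def post(account_id, cents):
    """Record a recharge (positive) or a charge (negative) in cents."""
    db.execute(
        "INSERT INTO ledger (account_id, cents) VALUES (?, ?)",
        (account_id, cents),
    )


def balance(account_id):
    """Read the pre-aggregated balance with a single primary-key lookup."""
    row = db.execute(
        "SELECT cents FROM balance WHERE account_id = ?", (account_id,)
    ).fetchone()
    return row[0] if row else 0


post(1, 10000)  # recharge of $100
post(1, -7)     # one SMS
post(1, -2)     # one email
```

A per-month balance could be maintained the same way by adding a billing-period column to the aggregate table's key.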

hbase data modeling for activity feeds/news feeds/timeline

I decided to use HBase in a project to store users' activities in a social network. Despite the fact that HBase has a simple way to express data (column oriented), I'm having some difficulty deciding how to represent the data.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I thought of basically two approaches with an Activity HBase table:
The key could be the user reference + timestamp of activity creation, the value all the activity metadata (most of time fixed size)
The key is the user reference, and then each activity would be stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the second approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact in the way I access the data for these 2 approaches?
In general you are asking if your table should be wide or long. HBase works with both, up to a point. Wide tables should never have a row that exceeds region size (by default 256MB) -- so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all user activity in one row will allow you to get their full history with one get. However, you will be retrieving the full row, which could cause some slowdown for a lot of history (10s of seconds for > 100MB rows).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with the key = user id).
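One way such a row key could be built is sketched below: the user id followed by the timestamp subtracted from the largest signed 64-bit value, both packed big-endian so that byte-wise ordering matches numeric ordering. The fixed-width numeric user id and the `activity_row_key` name are assumptions for the example.

```python
import struct

MAX_LONG = 2**63 - 1  # largest signed 64-bit value


def activity_row_key(user_id, timestamp_ms):
    """Build a 16-byte key: user id, then reversed timestamp (newest first)."""
    reversed_ts = MAX_LONG - timestamp_ms
    # ">QQ" = two big-endian unsigned 64-bit integers.
    return struct.pack(">QQ", user_id, reversed_ts)


newer = activity_row_key(7, 2000)
older = activity_row_key(7, 1000)
```

Because keys are sorted lexicographically, a scan starting at the user's prefix would return that user's most recent activities first.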
Using timestamps as the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always be in the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
Another example to look at is OpenTSDB

Delivering activity feed items in a moderately scalable way

The application I'm working on has an activity feed where each user can see their friends' activity (much like Facebook). I'm looking for a moderately scalable way to show a given user's activity stream on the fly. I say 'moderately' because I'm looking to do this with just a database (PostgreSQL) and maybe memcached. For instance, I want this solution to scale to 200k users each with 100 friends.
Currently, there is a master activity table that stores the rendered html for the given activity (Jim added a friend, George installed an application, etc.). This master activity table keeps the source user, the html, and a timestamp.
Then, there's a separate ('join') table that simply keeps a pointer to the person who should see this activity in their friend feed, and a pointer to the object in the main activity table.
So, if I have 100 friends, and I do 3 activities, then the join table will then grow to 300 items.
Clearly this table will grow very quickly. It has the nice property, though, that fetching activity to show to a user takes a single (relatively) inexpensive query.
The other option is to just keep the main activity table and query it by saying something like:
select * from activity where source_user in (1, 2, 44, 2423, ... my friend list)
This has the disadvantage that you're querying for users who may never be active, and as your friend list grows, this query can get slower and slower.
I see the pros and the cons of both sides, but I'm wondering if some SO folks might help me weigh the options and suggest one way or the other. I'm also open to other solutions, though I'd like to keep it simple and not install something like CouchDB, etc.
Many thanks!
I'm leaning towards just having the master activity table. If you go with that, this is what I would consider implementing:
You can create several activity tables and do a UNION ALL when fetching the data from the database. For example, roll them over monthly - activity_2010_02, etc. Just going by your example - 200K users x 100 friends x 3 activities = 60 million rows. Not a concern performance-wise for PostgreSQL, but you might consider this purely for convenience now and eventually for effortless future expansion.
"This has the disadvantage that you're querying for users who may never be active, and as your friend list grows, this query can get slower and slower."
Are you going to display the entire activity feed, going back to the beginning of time? You haven't provided much detail in the original question but I'd hazard a guess that you'd be showing the last 10/20/100 items sorted by timestamp. A couple of indexes and the LIMIT clause should be enough to provide an instant response (as I've just tested on a table with about 20 million rows). It can be slower on a busy server, but that is something that should be worked out with hardware and caching solutions. Postgres is not going to be the bottleneck there.
Even if you do provide activity feeds going back to the dawn of time, paginate the output! The LIMIT clause will save you there. If the basic query with a LIMIT on it is not enough, or if your users have a long tail of friends that are no longer active, you could consider limiting the lookup to the last day/week/month first and then provide the list of friend ids:
select * from activity
where ts >= 123456789
and source_user in (1, 2, 44, 2423, ... my friend list)
If you've got a table spanning months or years back, the search for the friend ids will only be performed within the rows selected by the first condition of the WHERE clause.
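The combination of a time cutoff, a friend list and a LIMIT can be tried out with SQLite from Python's standard library, used here as a stand-in for PostgreSQL. The table layout, index and `recent_feed` function are assumptions for the sketch, not the poster's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE activity (
    id INTEGER PRIMARY KEY,
    source_user INTEGER,
    ts INTEGER,
    html TEXT
);
CREATE INDEX activity_user_ts ON activity (source_user, ts);
""")


def recent_feed(friend_ids, since_ts, limit):
    """Newest activities by the given friends, no older than since_ts."""
    if not friend_ids:
        return []
    placeholders = ",".join("?" for _ in friend_ids)
    sql = (
        "SELECT source_user, ts, html FROM activity "
        "WHERE ts >= ? AND source_user IN (%s) "
        "ORDER BY ts DESC LIMIT ?" % placeholders
    )
    return db.execute(sql, [since_ts, *friend_ids, limit]).fetchall()


db.executemany(
    "INSERT INTO activity (source_user, ts, html) VALUES (?, ?, ?)",
    [
        (1, 50, "old"),
        (1, 150, "a"),
        (2, 200, "b"),
        (3, 300, "not a friend"),
        (2, 120, "c"),
    ],
)
feed = recent_feed([1, 2], since_ts=100, limit=2)
```

Only the placeholders are interpolated into the SQL string; the ids themselves are passed as bound parameters.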
That's just if I choose between the two solutions you are considering now. I would also look at things like:
Reconsidering your denormalisation of the table. Is storing pre-generated HTML output really the best way? Will you be better off performance-wise by having a lookup table of activities instead and generating templated output on the fly? Pre-generated HTML can seem better at the outset, but consider things like disk storage, APIs, future layout changes and storing HTML may not be that attractive after all. The lookup table could contain your possible activities - added a friend, changed status, etc., and the activity log would reference that and the friend's id if another user is involved in the activity.
Pre-generating HTML, but not storing it in the database. Save the stuff on disk as pre-generated pages. This is not a silver bullet, however, and largely depends on the ratio of writes to reads on your site. For example, a typical discussion thread on a public forum could have a dozen messages but be viewed hundreds of times, which makes it a good candidate for caching. Whereas if your application is more tuned to immediate status updates and you'd have to regenerate the HTML page and save it again on disk after every couple of views, then there's little value in this approach.
Hope this helps.
