Azure CosmosDB Graph traversal performance issue - graph-databases

Imagine simple transactions on a bank account as a graph database. With events like this:
+$1000 cash input.
-$10 for buying a book by VISA.
+$100 cash for selling an old bike.
-$50 cash for buying groceries.
In a graph structure we could define the nodes as the transactions with the properties:
id - Transaction ID
time - Timestamp of when the transaction happened.
delta - Amount of money used (+/-) on the transaction
description - Reason for transaction.
The edges would then be pointing to the previous transaction. We could have other edges pointing to other accounts (for transfers between accounts), owners etc. but for simplicity we have this structure.
g.addV('transactions').property('id','1').property('time',0).property('delta',1000).property('description','cash input')
g.addV('transactions').property('id','2').property('time',1).property('delta,-10).property('description','for buying a book by VISA')
g.V('2').addE('previous').to(g.V('1'))
g.addV('transactions').property('id','3').property('time',2).property('delta',100).property('description','cash for selling an old bike.')
g.V('3').addE('previous').to(g.V('2'))
g.addV('transactions').property('id','4').property('time',3).property('delta',-50).property('description','cash for buying groceries')
g.V('4').addE('previous').to(g.V('3'))
Now to get the current inventory of this account we would just traverse the previous edge from the latest transaction, until specific date, or to the beginning, like this:
g.V('4').emit().repeat(out('previous')).until(has('time',0)).properties('delta').value().sum()
==>1040
Which is all good and fast for 4 transactions. BUT, when doing this for 100 000 transactions it takes around 8 minutes, and with more complex operations or more data it takes even longer.
In my test case i have set up an Azure Cosmos-DB with the Graph API and a throughput of 2000 RU/s.
Since I am fairly new to Graph databases and queries, I realize that there might be faster and better ways of doing this, and ways to optimize this, that I don't know about. Maybe even graph database is not the right tool for this job?
What I want to achieve here is a reasonable fast query into the transactions, that may fork into multiple accounts, and many more events.
How can I make this work better?

Why don't you just add a current property for each transaction vertex? This would still keep the history as you have it now, but also provide much faster access to the current inventory value (at any given point in time). Also, should the value of any transaction change afterwards, it would be easy to update all newer transactions accordingly (but this would only be a long running write query, reads would would still be blazing fast).
Keep in mind, that it's generally a bad idea to have so many hops in an OLTP query.

Related

Choosing proper database in AWS when all items must be read from the table

I have an AWS application where DynamoDB is used for most data storage and it works well for most cases. I would like to ask you about one particular case where I feel DynamoDB might not be the best option.
There is a simple table with customers. Each customer can collect virtual coins so each customer has a balance attribute. The balance is managed by 3rd party service keeping up-to-date value and the balance attribute in my table is just a cached version of it. The 3rd party service requires its own id of the customer as an input so customers table contains also this externalId attribute which is used to query the balance.
I need to run the following process once per day:
Update the balance attribute for all customers in a database.
Find all customers with the balance greater than some specified constant value. They need to be sorted by the balance.
Perform some processing for all of the customers - the processing must be performed in proper order - starting from the customer with the greatest balance in descending order (by balance).
Question: which database is the most suitable one for this use case?
My analysis:
In terms of costs it looks to be quite similar, i.e. paying for Compute Units in case of DynamoDB vs paying for hours of micro instances in case of RDS. Not sure though if micro RDS instance is enough for this purpose - I'm going to check it but I guess it should be enough.
In terms of performance - I'm not sure here. It's something I will need to check but wanted to ask you here beforehand. Some analysis from my side:
It involves two scan operations in the case of DynamoDB which
looks like something I really don't want to have. The first scan can be limited to externalId attribute, then balances are queried from 3rd party service and updated in the table. The second scan requires a range key defined for balance attribute to return customers sorted by the balance.
I'm not convinced that any kind of indexes can help here. Basically, there won't be too many read operations of the balance - sometimes it will need to be queried for a single customer using its primary key. The number of reads won't be much greater than number of writes so indexes may slow the process down.
Additional assumptions in case they matter:
There are ca. 500 000 customers in the database, the average size of a single customer is 200 bytes. So the total size of the customers in the database is 100 MB.
I need to repeat step 1 from the above procedure (update the balance of all customers) several times during the day (ca. 20-30 times per day) but the necessity to retrieve sorted data is only once per day.
There is only one application (and one instance of the application) performing the above procedure. Besides that, I need to handle simple CRUD which can read/update other attributes of the customers.
I think people are overly afraid of DynamoDB scan operations. They're bad if used for regular queries but for once-in-a-while bulk operations they're not so bad.
How much does it cost to scan a 100 MB table? That's 25,000 4KB blocks. If doing eventually consistent that's 12,250 read units. If we assume the cost is $0.25 per million (On Demand mode) that's 12,250/1,000,000*$0.25 = $0.003 per full table scan. Want to do it 30 times per day? Costs you less than a dime a day.
The thing to consider is the cost of updating every item in the database. That's 500,000 write units, which if in On Demand at $1.25 per million will be about $0.63 per full table update.
If you can go Provisioned for that duration it'll be cheaper.
Regarding performance, DynamoDB can scan a full table faster than any server-oriented database, because it's supported by potentially thousands of back-end servers operating in parallel. For example, you can do a parallel scan with up to a million segments, each with a client thread reading data in 1 MB chunks. If you write a single-threaded client doing a scan it won't be as fast. It's definitely possible to scan slowly, but it's also possible to scan at speeds that seem ludicrous.
If your table is 100 MB, was created in On Demand mode, has never hit a high water mark to auto-increase capacity (just the starter capacity), and you use a multi-threaded pull with 4+ segments, I predict you'll be done in low single digit seconds.

What is an appropriate database solution for high throughput updates?

Suppose I am a subscription service and I have a table with each row representing customer data.
I want to build a system that consumes a daily snapshot of customer data. This daily snapshot contains data of all currently existing customers (i.e., there will be rows for new customers and customers that unsubscribed will not appear in this data). I also need to keep track of the duration that each customer has subscribed using start and end times. If a customer re-subscribes, another entry of this start and stop time is updated to that customer. A sample record/schema is shown below.
{
"CustomerId": "12345",
"CustomerName": "Bob",
"MagazineName": "DatabaseBoys",
"Gender": "Male",
"Address": "{streetName: \"Sesame Street\", ...}",
"SubscriptionTimeRanges": [{start:12345678, end: 23456789}, {start:34567890, end: 45678901},...]
}
I will be processing >250,000 rows of data once a day, every day
I need to know whether any record in the snapshot doesn't currently exist in the database
The total size of the table will be >250,000
There are longer-term benefits that would come from having a relational database (e.g., joining to another table that contains Magazine information)
I would like to get records by either CustomerId or MagazineName
Writes should not block reads
To achieve this, I anticipate needing to scan the entire table, iterating over every record, and individually updating the SubscriptionTimeRanges array/JSON blob of each record
Latency for writes is not a hard requirement, but at the same time, I shouldn't be expecting to take over an hour to update all of these records (could this be done in a single transaction if it's an update...?)
Reads should also be quick
Concurrent processing is always nice, but that may introduce locking for ACID compliant dbs?
I know that DynamoDB would be quick at handling this kind of use case, and the record schema is right up the NoSQL alley. I can use global secondary indexes / local secondary indexes to resolve some of my issues. I have some experience in PostgreSQL when using Redshift, but I mostly dealt with bulk inserts with no need for data modification. Now I need the data modification aspect. I'm thinking RDS Postgres would be nice for this, but would love to hear your thoughts or opinions.
P.S. don't take the "subscription" system design too seriously, it's the best parallel example I could think of when setting an example for similar requirements.. :)
This is a subjective question, but objectively speaking, DynamoDB is not designed for scans. It can do them, but it requires making repeated requests in a loop, starting each request where the last one left off. This isn't quick for large data sets, so there's also parallel scan but you have to juggle the threads and you consume a lot of table throughput with this.
On the flip side, it is easy and inexpensive to prototype and test against DynamoDB using the SDKs.
But with the daily need to scan the data, and potential need for joins, I would be strongly inclined to go with a relational database.
250,000 rows of data processed daily probably does not justify use of Amazon Redshift. It has a sweetspot of millions to billions of rows and is typically used when you want to do queries throughout the day.
If an RDS database suits your needs, then go for it! If you wish to save cost, you could accumulate the records in Amazon S3 throughout the day and then just load and process the data once per day, turning off the database when it isn't required. (Or even terminate it and launch a new one the next day, since it seems that you don't need to access historical data.)
Amazon Athena might even suit your needs, reading the daily data from S3 and not even needing a persistent database.

Database / Application Design - When to Denormalise?

I am working on an application which needs to store financial transactions for an account.
This data will then need to be queried in a number of ways. I'll need to list individual transactions, monthly totals by category, for example. I'll also need to show a monthly summary with opening / closing balances.
As I see it, I could approach this in the following ways:
From the point of view of database consistency and normalisation, this could be modelled as a simple list of transactions. Balances may then be calculated in the application by adding up every transaction from the beginning of time to the date of the balance you wish to display.
A slight variation on this would be to model the data in the same way, but calculate the balances in a stored procedure on the database server. Appreciate that this isn't hugely different to #1 - both of these issues will perform slower as more data gets added to the system.
End of month balances could be calculated and stored in a separate table (possibly updated by triggers). I don't really like this approach from a data consistency point of view, but it should scale better.
I can't really decide which way to go with this. Should I start with the 'purest' data model and only worry about performance when it becomes an issue? Should I assume performance will become a problem and plan for it from day one? Is there another option which I haven't thought of which would solve the issue better?
I would look at it like this, the calculations are going to take longer and longer and that majority of the monthly numbers before the previous 2-3 months will not be changing. This is a performance problem that has a 100% chance of happening as financial date will grow every month. Therefore looking at a solution in the design phase is NOT premature optimization, it is smart design.
I personally am in favor of only calculating such totals when they need to be calculated rather than every time they are queried. Yes the totals should be updated based on triggers on the table which will add a slight overhead to inserts and deletes. They will make the queries for selects much faster. In my experience users tend to be more tolerant of a slightly longer action query than a much longer select query. Overall this is a better design for this kind of data than a purely normalized model as long as you do the triggers correctly. In the long run, only calculating numbers that have changed will take up far less server resources.
This model will maintain data integrity as long as all transactions go through the trigger. The biggest culprit in that is usually data imports which often bypass triggers. If you do those kinds of imports make sure they have code that mimics the trigger code. Also make sure the triggers are for insert/update and delete and that they are tested using multiple record transactions not just against single records.
The other model is to create a data warehouse that populates on a schedule such as nightly. This is fine if the data can be just slightly out of date. If the majority of queries of this consolidated data will be for reporting and will not involve the current month/day so much then this will work well and you can do it in an SSIS package.

Database design - google app engine

I am working with google app engine and using the low leval java api to access Big Table. I'm building a SAAS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name: Price: Commission: Time estimate
Full Detail, Regular Size: 160 75 3.5 hours
$10 Off Full Detail Coupon: -10 0 0 hours
Premium Detail: 220 110 4.5 hours
Derived totals(not a line item): $370 $185 8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through resulting in the detailer had to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, i'd just make another appointment on the second day with a line item that said "Finish Up" and the cost would be $0. Or if I had one customer pay for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data in-congruency. Data in-congruency can be a serious problem especially for cases involving financial information such as the third exmaple where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below, is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data incongruity errors. With this structure, changes to the data can be done with one operation without having to worry about updating other tables. Reads, however, can require multiple reads coupled with in-memory organization of data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task que so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
start_time
etc...
Invoice
due_date
etc...
Payment
invoice_Key_List
amount_paid
etc...
Line_Item
appointment_Key_List
invoice_Key
name
price
etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment and weather or not payment as been received for each appointment. This would be a common query when loading the calendar for appointment scheduling or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" who's "start_time" field lies between the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" who's appointment_key_List field includes any of the returns appointments
Add each invoice_key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice ket set (this can be done in one asynchronous operation using app engine)
Add each key from the returned invoices into a List
QUERY for all "Payments" who's invoice_key_list field contains a key matching any of the returned invoices
Reorganize in memory so that each appointment reflects the line_items that are scheduled for it, the total price, total estimated time, and weather or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory will be pretty fast)
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive, some are more write intensive. Below, I describe a typical use-case and break down operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries customer for their information, I pictured this to be a succession of asynchronous reads as the manager enters each piece of information such as phone number, name, e-mail, address, etc... Or if necessary, perhaps one write at the end after the client application has gathered all of the information and it is then submitted.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read Manager loads the calendar
Read Manager loads the appointment for the customer he wants to call
Write Manager clicks "Call" button, a call is initiated and a new CallReacord entity is written
Read Call server responds to call request and reads CallRecord to find out how to handle the call
Write Call server writes updated information to the CallRecord
Write when call is closed, call server makes another request to the server to update the CallRecord resource (note: this request is not time-critical)
Accepted answer::
Both of the top two answers were very thoughtful and appreciated. I accepted the one with few votes in order to imperfectly equalize their exposure as much as possible.
You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of itself isn't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each of appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1 then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 are small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need to it to scale to huge numbers of users (it won't).
Suggestions for improvement:
Denormalization! Try storing the keys for Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListProperty are not indexed to avoid problems with exploding indices
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
So since you should optimize for reads, oftentimes your writes will grow to gigantic proportions. So gigantic that you can't fit it in the 30 second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task queue to pull it back out, generate the HTML to be rendered, and put it in there in the background. This might mean your model is immediately ready to display until the task has finished with it, so you'll need a graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number
Here are a few app-engine specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. for example, if you are filtering on an appt date being between July 1st and July 4th, you couldn't also filter by price > 200
Transactions on app engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore 25.000 users which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and formost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straight forward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for chacheing, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.

Resources