Comment post scalability: Top n per user, 1 update, heavy read - database

Here's the situation. Multi-million user website. Each user's page has a message section. Anyone can visit a user's page, where they can leave a message or view the last 100 messages.
Messages are short pieces of text with some extra metadata. Every message has to be stored permanently, but the only things that must be real-time quick are posting and reading messages (people use it as chat). A count of messages will be read very often to check for changes. Periodically, it's OK to archive the old messages (those beyond the latest 100), but they must remain accessible.
Currently all in one big DB table, and contention between people reading the messages lists and sending more updates is becoming an issue.
If you had to re-architect the system, what storage mechanism / caching would you use? What kind of computer science learning can be used here (e.g. collections, list access)?

Some general thoughts, not particular to any specific technology:
Partition the data by user ID. The idea is that you can uniformly divide the user space into distinct partitions of roughly the same size. You can use an appropriate hashing function to divide users across partitions. Ultimately, each partition belongs on a separate machine. However, even splitting into different tables/databases on the same machine will eliminate some of the contention. Partitioning limits contention, and opens the door to scaling "linearly" in the future. This helps with load distribution and scale-out too.
When picking a hashing function to partition the records, look for one that minimizes the number of records that will have to be moved should partitions be added/removed.
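One common way to get that property is consistent hashing. The sketch below is only an illustration under assumptions: the partition names ("db0" and so on), the HashRing class, and the choice of 100 virtual nodes per partition are all hypothetical, and md5 is used purely to spread keys, not for security.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, partitions, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> partition name
        for partition in partitions:
            self.add(partition)

    def add(self, partition):
        """Place vnodes points for this partition on the ring."""
        for i in range(self._vnodes):
            point = _hash(f"{partition}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = partition

    def lookup(self, user_id):
        """Return the partition owning user_id: the first ring point after its hash."""
        idx = bisect.bisect(self._points, _hash(str(user_id))) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["db0", "db1", "db2", "db3"])
before = {user: ring.lookup(user) for user in range(10_000)}
ring.add("db4")
moved = sum(1 for user in before if ring.lookup(user) != before[user])
print(f"moved {moved} of {len(before)} users after adding a fifth partition")
```

With a plain "hash modulo number of partitions" scheme, going from four to five partitions would relocate most users; on the ring, only the users that now belong to the new partition move.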
Like many other applications, we could assume the use of the service follows a power law curve: a few of the user pages cause much of the traffic, followed by a long tail. A caching scheme can take advantage of that. The steeper the curve, the more effective caching will be. Given the short messages, if each page shows 100 messages, and each message is 100 bytes on average, a page takes about 10 KB, so you could fit about 100,000 top pages in 1 GB of RAM cache. Those cached pages could be written lazily to the database. Out of 10 million users, 100,000 is in the ballpark for making a difference.
Partition the web servers, possibly using the same hashing scheme. This lets you hold separate RAM caches without contention. The potential benefit is increasing the cache size as the number of users grows.
If appropriate for your environment, one approach for ensuring new messages are eventually written to the database is to place them in a persistent message queue, right after placing them in the RAM cache. The queue suffers no contention, and helps ensure messages are not lost upon machine failure.
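A minimal sketch of the cache-plus-queue idea, assuming a single process. The MessageCache class and its method names are hypothetical, and the standard-library queue.Queue is only a stand-in for a real persistent message queue (it does not survive a crash). The counter here only counts messages seen since the cache started; a real system would seed it from the database.

```python
from collections import defaultdict, deque
import queue

PAGE_SIZE = 100  # messages shown per user page, as in the question


class MessageCache:
    """In-memory view of the newest messages per page, with write-behind."""

    def __init__(self):
        self._recent = defaultdict(lambda: deque(maxlen=PAGE_SIZE))
        self._count = defaultdict(int)
        self.pending = queue.Queue()  # stand-in for a persistent queue

    def post(self, page_owner, message):
        self._recent[page_owner].append(message)  # oldest entry drops off at 100
        self._count[page_owner] += 1
        self.pending.put((page_owner, message))   # written to the database later

    def count(self, page_owner):
        """Cheap change check: no database access needed."""
        return self._count.get(page_owner, 0)

    def latest(self, page_owner):
        """Newest messages for a page, oldest first."""
        return list(self._recent.get(page_owner, ()))

    def flush(self, write_row):
        """Drain queued messages through write_row(page_owner, message)."""
        written = 0
        while True:
            try:
                item = self.pending.get_nowait()
            except queue.Empty:
                return written
            write_row(*item)
            written += 1
```

A bounded deque keeps both appends and the "last 100" read cheap, and the frequently polled count never touches the database.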

One simple solution could be to denormalize your data, and store pre-calculated aggregates in a separate table, e.g. a MESSAGE_COUNTS table which has a column for the user ID and a column for their message count. When the main messages table is updated, then re-calculate the aggregate.
It's just shifting the bottleneck from one place to another, but it might move it somewhere that's less of a burden.
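As a rough sketch of the counts-table idea, the example below uses Python's built-in sqlite3 module as a stand-in for whatever database is actually in use. The table and column names are assumptions, and instead of re-calculating the aggregate from scratch it increments the stored count in the same transaction as the insert, which is a cheaper variant of the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        body TEXT NOT NULL
    );
    CREATE TABLE message_counts (
        user_id INTEGER PRIMARY KEY,
        message_count INTEGER NOT NULL
    );
""")


def post_message(conn, user_id, body):
    """Insert a message and bump the per-user counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (user_id, body) VALUES (?, ?)", (user_id, body)
        )
        updated = conn.execute(
            "UPDATE message_counts SET message_count = message_count + 1 "
            "WHERE user_id = ?",
            (user_id,),
        ).rowcount
        if updated == 0:  # first message for this user
            conn.execute(
                "INSERT INTO message_counts (user_id, message_count) VALUES (?, 1)",
                (user_id,),
            )


def message_count(conn, user_id):
    """Read the pre-calculated count without touching the messages table."""
    row = conn.execute(
        "SELECT message_count FROM message_counts WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else 0
```

The frequent "has anything changed?" check then reads one small row by primary key rather than counting rows in the big table.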

Time to retrieve a single record via a SQL Server index in a large table

Short version of the question:
If you have a table with a large number of small rows and you want to retrieve a single record from this table via an index probably consisting of two columns, is this likely to be low cost and fast, or high cost and slow?
Longer version of question and background:
I am a consultant working with a software development company and I have an argument with them about the performance implications of a piece of functionality that I want to add to the application they are building (and I am designing).
At the moment, we write out a log record every time somebody retrieves a client record. I want to put the name of the last person who previously accessed that record, and the time of that access, onto the client page each time the record is retrieved.
They are saying that the performance implications of this will be high but based on my reasonable but not expert knowledge of how B trees work, this doesn't seem right even if the table is very large.
If you create an index on the GUID of the client record and the date/time of access (descending), then you ought to be able to retrieve the required record via an index seek that just needs to find the first entry for that GUID and then stop? And with a B-tree index, most of the index would be cached, so the number of physical disk accesses needed would be very small and the query time therefore significantly less than 1s?
Or have I got this completely wrong?
You will have problems with GUID index fragmentation, but because your rows do not increase in size (as you said in the comments) you will not have page-splitting problems from row growth. The random insert issue is fixable by periodically reorganizing or rebuilding the index.
Besides that, there is nothing wrong with your approach. If the table is larger than RAM you will likely have a single disk IO per access (the intermediate index levels will be cached). If your data fits in RAM you will pay about 0.2 to 0.5ms per query. If your data is on a magnetic disk a seek will likely require 8-12ms. On an SSD you are back to 0.2ms to 0.5ms (maybe 0.05ms more).
Why don't you just create some test data (by selecting a cross product from sys.objects to get 1M rows) and measure it? It takes little time and you will find out for sure.
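The "generate test data and measure" suggestion can be tried in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in, not SQL Server, so the absolute timings will differ; the table name access_log, the index name ix_access, and the data volumes are all assumptions made for illustration.

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log (client_guid TEXT, accessed_at INTEGER, accessed_by TEXT)"
)

# 1,000 clients with 100 accesses each; accessed_at is just an increasing tick.
clients = [str(uuid.uuid4()) for _ in range(1_000)]
rows = [(clients[i % len(clients)], i, f"user{i % 50}") for i in range(100_000)]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX ix_access ON access_log (client_guid, accessed_at DESC)")
conn.commit()


def last_access(conn, client_guid):
    """Return (accessed_by, accessed_at) of the most recent access, or None."""
    return conn.execute(
        "SELECT accessed_by, accessed_at FROM access_log "
        "WHERE client_guid = ? ORDER BY accessed_at DESC LIMIT 1",
        (client_guid,),
    ).fetchone()


start = time.perf_counter()
for guid in clients:
    last_access(conn, guid)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(clients) * 1e6:.1f} microseconds per lookup (in-memory SQLite)")
```

Because the index is ordered by GUID and then by access time descending, the query can stop at the first matching index entry instead of reading every access for that client.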
It should be low cost and fast since the columns are indexed; a B-tree index lookup is O(log n), not a scan of the whole table.
You say last person to access? You mean that for every read you will have a write?
And that write is going to change an indexed date time column?
Then I would be worried too.
Writing on each record read will cause you lots of extra disk writes. This will block reads and it might be bad for your caching too. You also need to update your index a lot, and since you change the indexed data your index will be very fragmented.
It depends.
A single retrieval will be low cost and fast
on a decent indexed table
running on decent hardware
over a decent network
On the other hand, it takes time nonetheless.
If we are talking about one retrieval per hour, don't sweat over it. If we are talking about thousands of retrievals per second (as opposed to currently none), it will start to add up to the point where it would be noticeable.
Some questions you need to address
Is my hardware up to spec
Does adding two fields result in a page split (unlikely)
How many extra pages need to be read for your regular result sets
How many retrievals/sec will be made
How many inserts/sec (triggering an index update) will be made
After you've addressed these questions, you should be able to make the determination yourself. As far as my gut feeling goes, I would be surprised if you noticed the performance difference.

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. E.g., could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since salesforce has upwards of a dozen server clusters, each with their own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But, in the past I've seen performance issues start to creep in around 2M records. One possible remedy is you can ask salesforce to index fields that you plan to filter on.

Any way to optimize multiple put calls to Google App Engine beyond batching?

I hold messages in a map for each user in the datastore. It's held as an unindexed serialized value keyed by a unique name. A user can message many users at once. Currently I execute a batch get for the (e.g.) 20 targets, update the serialized value in each, then execute a batch put. The serialized message size is small enough to be unimportant, around 1KB.
This is quick for the user, the real time shown in appstats is 90ms. However the cpu-time cost is 918ms. This causes warnings and may become expensive with high usage, or cause trouble if I wish to message 50 users. Is there any way to reduce this cpu-time cost, either with datastore tweaks, or an obvious change to the architecture I've missed? A task queue solution would remove the warnings but would really only redistribute the cost.
EDIT: The datastore key is the username of the receiver, and the value is the messages stored as a serialized Map where the key is the username of the sender and the Message is a simple object holding two ints. There are two types of request. The 'update' type is described above: the message map is retrieved, the new message is added to the map, and the map is stored. The 'get' type is the inbox owner reading the messages, which is a simple get based on key. My thinking was that even if this was split out into a multi-value relationship or similar, this might improve the fidelity (allowing two updates at once) but the amount of put work would still be the same provided it's a simple key-value approach.
It sounds like you're already doing things fairly efficiently. It's not likely you're going to be able to reduce this substantially. Less than 1000 cpu milliseconds per request is a fairly reasonable amount anyway.
There are two things you might gain by splitting entities up: if your lists are long, you're saving the CPU cost of reading and writing large entities when you only need to read or modify some small part of them, and you're saving on transaction collisions. That is, if several tasks need to add items to the queue simultaneously, you can do it without transaction retries, saving you CPU time.

Scaling a MS SQL Server 2008 database

I'm trying to work out the best way to scale my site, and I have a question on how MS SQL will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvarchar(256) - Used for lookup along with event_id
cache_event_id - int - Basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array that's basically a cached instance (compressed) of a page on my site.
The different ways I see of storing it are:
1) 1 large table, which would contain tens of millions of records and easily become several gigabytes in size.
2) Multiple tables to contain the data above, meaning each table would hold 200k to a million records.
The data from this table will be used to show web pages, so anything over 200ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to, what is it that slows down the SQL server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop becoming cost effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later when it contains a huge amount of data.
"So it boils down to, what is it that slows down the SQL server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop becoming cost effective to use multiple database servers?"
This is all a 'rule of thumb' view;
Load (and therefore to a considerable extent performance) of a DB is largely a factor of two issues: data volumes and transaction load, with IMHO the second generally being more relevant.
With regard to data volume, one can hold many gigabytes of data and get acceptable access times by way of normalising, indexing, partitioning, fast IO systems, appropriate buffer cache sizes, etc. Some of these, e.g. normalisation, are issues that one considers at DB design time; others are addressed during system tuning, e.g. more or fewer indexes, buffer cache size.
The transactional load is largely a factor of code design and total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it too far and have transactions that are too small to retain integrity, or so small that they themselves add load).
When scaling, I advise first scaling up (bigger, faster server) and then out (multiple servers). The admin issues of a multiple-server installation are significant and, I suggest, only worth considering for a site with OS, network and DBA skills and processes to match.
Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context only you can, but don't, provide.
The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.

When is the size of the database call more expensive than the frequency of calls?

Can someone give me a relative idea of when it makes more sense to hit the database many times for small query results vs caching a large number of rows and querying that?
For example, if I have a query returning 2,000 results. And then I have additional queries on those results that take maybe 10-20 items, would it be better to cache the 2000 results or hit the database every time for each set of 10 or 20 results?
Other answers here are correct -- the RDBMS and your data are key factors. However, another key factor is how much time it will take to sort and/or index your data in memory versus in the database. We have one application where, for performance, we added code to grab about 10,000 records into an in-memory DataSet and then do subqueries on that. As it turns out, keeping that data up to date and selecting out subsets is actually slower than just leaving all the data in the database.
So my advice is: do it the simplest possible way first, then profile it and see if you need to optimize for performance.
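A minimal sketch of what "profile it" could look like for the 2,000-row example from the question. It uses Python's built-in sqlite3 and timeit modules; the items table, its columns, and the two helper functions are hypothetical. Note that an in-memory SQLite database has no network round trip, which is often the dominant cost with a real database server, so the numbers only illustrate the method, not the answer for any particular setup.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items (category, name) VALUES (?, ?)",
    [(i % 100, f"item{i}") for i in range(2_000)],
)
conn.execute("CREATE INDEX ix_items_category ON items (category)")
conn.commit()


def from_database(category):
    """Ask the database for one small subset each time."""
    return conn.execute(
        "SELECT id, name FROM items WHERE category = ? ORDER BY id", (category,)
    ).fetchall()


# Pull all 2,000 rows once and keep them in application memory.
cached = conn.execute("SELECT id, category, name FROM items ORDER BY id").fetchall()


def from_cache(category):
    """Filter the cached rows in application code."""
    return [(row_id, name) for row_id, cat, name in cached if cat == category]


for fn in (from_database, from_cache):
    seconds = timeit.timeit(lambda: fn(42), number=1_000)
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms for 1,000 lookups")
```

Both functions return the same 20 rows; which one is faster depends on the database, the network, and how often the cached copy has to be refreshed, which is the point of measuring rather than guessing.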
It depends on a variety of things. I will list some points that come to mind:
If you have a .Net web app that is caching data in the client, you do not want to pull 2k rows.
If you have a web service, they are almost always better Chunky than Chatty because of the added overhead of XML on the transport.
In a fairly decently normalized and optimized database, there really should be very few times that you have to pull 2k rows out at a time unless you are doing reports.
If the underlying data is changing at a rapid pace, then you should be really careful about caching it on the middle tier or the presentation layer, because what you present will be out of date.
Reports (any DSS) will pull and chomp through much larger data sets, but since they are not interactive, we denormalize and let them have their fun.
In cases of cascading dropdowns and such, AJAX techniques will prove to be more efficient and effective.
I guess I'm not really giving you one answer to your question. "It depends" is the best I can do.
Unless there is a big performance problem (e.g. a highly latent db connection), I'd stick with leaving the data in the database and letting the db take care of things for you. A lot of things are done efficiently on the database level, for example
isolation levels (what happens if other transactions update the data you're caching)
fast access using indexes (the db may be quicker to access a few rows than you searching through your cached items, especially if that data already is in the db cache like in your scenario)
updates in your transaction to the cached data (do you want to deal with updating your cached data as well or do you "refresh" everything from the db)
There are a lot of potential issues you may run into if you do your own caching. You need to have a very good performance reason before starting to take care of all that complexity.
So, the short answer: It depends, but unless you have some good reasons, this smells like premature optimization to me.
In general, the latency of a network round trip is several orders of magnitude greater than the time a database needs to generate and feed data onto the network, or the time a client box needs to consume it from a network connection.
But look at the bandwidth of your network (bits/sec) and compare that to the average round trip time for a database call...
On 100BASE-T Ethernet, for example, you get about 12.5 MBytes/sec of data transfer. If your average round trip time is, say, 200 ms, then your network can deliver about 2.5 MBytes in the time of one 200 ms round trip.
If you're on gigabit Ethernet, that number jumps to about 25 MBytes per round trip...
So if you split up a request for data into two round trips, well, that's 400 ms, and each query would have to return over 2.5 MBytes (or 25 MBytes for gigabit) before that would be faster...
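The arithmetic behind those figures is just link bandwidth multiplied by round trip time. A small sketch, assuming the 200 ms round trip from the answer and ignoring protocol overhead (the function name is made up for illustration):

```python
def bytes_per_round_trip(link_bits_per_sec: float, round_trip_sec: float) -> float:
    """Bytes a link can carry in the time one round trip takes."""
    return link_bits_per_sec / 8 * round_trip_sec


ROUND_TRIP = 0.200  # seconds, the example figure used in the answer

for name, bits_per_sec in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]:
    megabytes = bytes_per_round_trip(bits_per_sec, ROUND_TRIP) / 1e6
    print(f"{name}: about {megabytes:.1f} MB per {ROUND_TRIP * 1000:.0f} ms round trip")
```

This works out to roughly 2.5 MB on 100 Mbit/s Ethernet and 25 MB on gigabit, so a result set of a few thousand small rows is tiny compared with what one extra round trip costs.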
This likely varies from RDBMS to RDBMS, but my experience has been that pulling in bulk is almost always better. After all, you're going to have to pull the 2000 records anyway, so you might as well do it all at once. And 2000 records isn't really a large amount, but that depends largely on what you're doing.
My advice is to profile and see what works best. RDBMSes can be tricky beasts performance-wise and caching can be just as tricky.
"I guess I'm not really giving you one answer to your question. "It depends" is the best I can do."
yes, "it depends". It depends on the volatility of the data that you are intending to cache, and it depends on the level of "accuracy" and reliability that you need for the responses that you generate from the data that you intend to cache.
If volatility on your "base" data is low, then any caching you do on those data has a higher probability of remaining valid and correct for a longer time.
If "caching-fault-tolerance" on the results you return to your users is zero percent, you have no option.
The type of data you're bringing back affects the decision as well. You don't want to be caching volatile data, or data that is a candidate for updates and may go stale.