How can I combine similar tasks to reduce total workload? - google-app-engine

I use App Engine, but the following problem could very well occur in any server application:
My application uses memcache to cache both large (~50 KB) and small (~0.5 KB) JSON documents that aggregate information that is expensive to refresh from the datastore. These JSON documents can change often, but the changes are sparse within each document (i.e., one item out of hundreds may change at a time). Currently, the application invalidates an entire document when something changes and lazily re-creates it the next time it is needed. However, I want to move to a more efficient design that updates the particular value that changed directly in the cached JSON document.
One particular concern is contention from multiple tasks / request handlers updating the same document, but I have ways to detect this issue and mitigate it. However, my main concern is that it's possible that there could be rapid changes to a set of documents within a small period of time coming from different request handlers, and I don't want to have to edit the JSON document in the cache separately for each one. For example, it's possible that 10 small changes affecting the same set of 20 documents of 50 KB each could be triggered in less than a minute.
So this is my problem: What would be an effective solution to combine these changes together? In my old solution, although it is expensive to re-create an entire document when a small item changes, the benefit at least is that it does it lazily when it needs it (which could be a while later). However, to update the JSON document with a small change seems to require that it be done immediately (not lazily). That is, unless I come up with a complex solution that lazily applies a set of changes to the document later on. I'm hoping for something efficient but not too complicated.
Thanks.

Pull queue. Everyone using GAE should watch this video:
http://www.youtube.com/watch?v=AM0ZPO7-lcE
When a call comes in, update memcache and do an async_add to your task pull queue. You could likely run a process that handles thousands of updates each minute without a lot of overhead (i.e. instance issues). You still have an issue if memcache gets purged before your updates are applied, but that is not too hard to work around. HTH. -stevep
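One possible reading of the pull-queue idea, as a minimal sketch rather than App Engine code: request handlers only record what changed, and a worker drains the queue periodically and rewrites each affected document once. Here a plain dict stands in for memcache and a list stands in for the pull queue; `enqueue_change` and `apply_pending` are hypothetical names, and a real worker would lease and delete tasks instead.

```python
import json
from collections import defaultdict

# Stand-ins (assumptions): a dict plays the role of memcache and a list
# plays the role of the pull queue.
cache = {}
pending = []


def enqueue_change(doc_keys, item_id, value):
    """Record one small change that affects several cached documents."""
    pending.append({"docs": list(doc_keys), "item": item_id, "value": value})


def apply_pending():
    """Drain the queue and rewrite each affected document only once."""
    batch = pending[:]
    del pending[:]
    # Group changes per document; a later change to the same item wins.
    per_doc = defaultdict(dict)
    for change in batch:
        for key in change["docs"]:
            per_doc[key][change["item"]] = change["value"]
    rewritten = 0
    for key, items in per_doc.items():
        raw = cache.get(key)
        if raw is None:
            continue  # evicted: leave it to the existing lazy rebuild
        doc = json.loads(raw)
        doc.update(items)
        cache[key] = json.dumps(doc)
        rewritten += 1
    return rewritten
```

With 10 changes that touch the same 20 documents, this rewrites 20 documents per drain instead of 200. Documents that were evicted in the meantime are skipped, so the lazy re-creation from the question still covers them.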

Related

Symfony 4 - Cache vs DB solution

I am currently using the cache for my current project, but I'm not sure if it is the right thing to do.
I need to retrieve a lot of data from a web API (nodes that can be a picture, node, folder, gallery, ...). Those nodes will change very often, so I need fast access (loading up to 300-400 elements at once). Currently I store them in the cache (the key is the md5 of node_id, so they are easy to retrieve and update).
It is working great so far, but if I clear the cache it takes up to 1 minute to create all the cache again.
Should I use a database to store those nodes? Will it be quicker / slower / the same?
Your question is very broad and thus hard to answer. Saving 300-400 elements under a cache key sounds problematic to me. The serialization on every store and the deserialization on every retrieval can cause problems of their own. Also, whenever your cache service is down your app will be practically unusable.
If you already run into problems when clearing/updating the cache, you might want to look for an alternative. This might be a database or Elasticsearch. Advanced cache features like tagged caching could also help, so that you don't have to clear the whole cache when only part of the information changes. You might also want to use something like the chain provider to store things in multiple caches, which prevents the aforementioned problem of an unreachable cache "breaking" your app. You could also look into a pattern that is common with CQRS called a read model.
There are a lot of variables that come into play. If you want to know which option will yield the best results, i.e. which one is quicker, you should run regular performance tests with realistic data using Symfony's debug toolbar and profiler or a third-party service like blackfire.io or Tideways. You might also want to run capacity tests with a tool like JMeter to ensure those results still hold true when there are multiple simultaneous users.
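The idea of chaining several caches can be illustrated with a short sketch. It is written in Python rather than PHP and is not Symfony's API: `ChainCache`, its dict-based layers, and the `loader` callback are all assumptions made for the example. Each node is stored under its own key, lookups try the fastest layer first, and a miss in a fast layer is back-filled from a slower one.

```python
class ChainCache:
    """Look up a key in several cache layers, fastest first."""

    def __init__(self, layers):
        # Hypothetical layers, e.g. [in_process_dict, shared_cache_dict].
        self.layers = layers

    def get(self, key, loader):
        for depth, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Back-fill the faster layers that missed.
                for faster in self.layers[:depth]:
                    faster[key] = value
                return value
        value = loader(key)  # every layer missed: hit the slow source
        for layer in self.layers:
            layer[key] = value
        return value
```

If the first layer is cleared or unreachable, the next one still answers, so only keys missing from every layer go back to the web API.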

Azure Search Updating Individual Documents

We are using Azure Search for various scenarios. We often need to update an individual document with changes that users make. We need to have these changes become visible in our indexes as soon as possible so that the stale time is as short as possible within reason.
What is the best strategy to handle this? We know that batch updates are the way to go, but we need more immediate reflection of the changes.
Once a document is updated, how long does it take for the index to reflect this change?
Many thanks
If the updates are not very frequent, you can simply update the Azure Search document immediately (that is, using batches of size 1). On the other hand, if the updates are extremely frequent and you notice a high rate of failures with single-document batches, you will need to build some sort of "collector" mechanism to batch up updates. My recommendation would be to do the simple thing first: try single-document batches, and add batching logic if necessary.
Updated or newly indexed documents are reflected in the search results after a short delay, usually ranging from single milliseconds to 1-2 seconds. The delay depends on the service topology and indexing load.
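A "collector" of the kind mentioned above could look roughly like this. It is a sketch under assumptions, not Azure Search client code: `send_batch` is a hypothetical callable that uploads a list of documents, and the size and age limits are placeholder values to tune.

```python
import time


class Collector:
    """Buffer document updates and send them in batches."""

    def __init__(self, send_batch, max_size=100, max_age=1.0,
                 clock=time.monotonic):
        self.send_batch = send_batch  # hypothetical upload callable
        self.max_size = max_size
        self.max_age = max_age        # seconds the oldest update may wait
        self.clock = clock
        self.buffer = {}              # doc id -> latest version of the doc
        self.oldest = None

    def add(self, doc_id, doc):
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer[doc_id] = doc     # a newer update replaces the older one
        self.flush_if_due()

    def flush_if_due(self):
        if not self.buffer:
            return
        too_big = len(self.buffer) >= self.max_size
        too_old = self.clock() - self.oldest >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        batch = list(self.buffer.values())
        self.buffer = {}
        self.oldest = None
        if batch:
            self.send_batch(batch)
```

Because `add` only checks the age when a new update arrives, something like a periodic timer would also need to call `flush_if_due()` so that a lone update is not held back indefinitely.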

Efficiently record and store page view counts in the database?

Sites like StackOverflow count, save, and display the view counts for pages. How do you do it efficiently? Let's take view counts for StackOverflow questions as the example. I see the following options.
Option 1. When the app receives a request for a question, increment the count in the question table. This is very inefficient! The vast majority of queries are read-only, but you're tacking on an update to each one.
Option 2. Maintain some kind of cache that maps new view counts to questionIds. When the app receives a question request, increment the cached view count for the question id. You’re caching the marginal increase in views. So far, so good. Now you need to periodically flush the counts in the cache. That's the second part of the problem. You could use a second thread or some kind of scheduling component. This really is a separate question and partly depends on your server platform (I'm using Java). Or, rather than using a separate thread, once a certain number of counts has accumulated in the cache, you could do the update within the request thread that reached the threshold. The update functionality could be encapsulated in the cache, giving the cache some IQ points.
I like the idea of a cache that gets flushed when a threshold is reached. I'm curious to know what others have done and if there is a better way.
I agree that writing each pageload to the database is inefficient. My approach is to cache the requests and then commit them once an hour. Each time I update the cache I compare the current time to LastWriteTime, and if more than an hour has passed, I write. It's an ASP.NET app, so I have a final commit in the Application's shutdown method. The latter is not guaranteed to run, but that is considered an acceptable loss.
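The time-based flush described here can be sketched as follows. This is Python rather than ASP.NET, and `ViewCounter` and `write_counts` are hypothetical names: `write_counts` stands for whatever adds a dict of increments to the stored totals in the database.

```python
import threading
import time
from collections import Counter


class ViewCounter:
    """Count views in memory; write the increments out periodically."""

    def __init__(self, write_counts, interval=3600.0, clock=time.monotonic):
        self.write_counts = write_counts  # hypothetical persistence callable
        self.interval = interval          # seconds between writes
        self.clock = clock
        self.pending = Counter()
        self.last_write = clock()
        self.lock = threading.Lock()

    def record(self, question_id):
        with self.lock:
            self.pending[question_id] += 1
            if self.clock() - self.last_write < self.interval:
                return
            to_write = self._take()
        self.write_counts(to_write)  # outside the lock: may be slow

    def flush(self):
        """Call on shutdown to commit whatever is still pending."""
        with self.lock:
            to_write = self._take()
        if to_write:
            self.write_counts(to_write)

    def _take(self):
        # Caller must hold the lock.
        to_write, self.pending = dict(self.pending), Counter()
        self.last_write = self.clock()
        return to_write
```

The values handed to `write_counts` are increments, not totals, so the database side would add them to the existing counts. Anything still in `pending` when the process dies without `flush()` is lost, which matches the acceptable loss described above.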

storing + evaluating performance data

since we suffer from creeping degradation in our web application we decided to monitor our application performance and measure individual actions.
for example we will measure the duration of each request, the duration of individual actions like editing a customer or creating an appointment, searching for a contract.
in most cases the database is the bottleneck for these actions.
i expect that the accumulated data will be quite large, since we will gather 1-5 individual actions per request.
of course it would be nonsense to insert each and every element into the database, since this would slow down every request even more.
what is a good strategy for storing and evaluating this per-request data?
i thought about having a global Queue object which is appended to and a separate thread that empties the queue and handles the persistent storage/file. but where to store such data? are there any prebuilt tools for such a visualisation?
we use java, spring, mixed hibernate+jdbc+pl/sql, oracle.
the question should be language-agnostic, though.
edit: the measurement will be taken in production over a large period of time.
It seems like your archive strategy will be at least partially dependent on the scope of your tests:
How long do you intend to collect performance data?
What are you trying to demonstrate? Performance improvements over time? Improvements associated with specific changes? (Like perf issues for a specific set of releases)
As for visualization tools, I've found Excel to be pretty useful for small to moderate amounts of data.
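The queue-plus-worker-thread idea from the question can be sketched in a few lines. Python is used here since the question asks for a language-agnostic answer; `measured`, `writer`, and `sink` are hypothetical names, and `sink` stands for whatever stores a batch of samples (a file append, a bulk insert, and so on).

```python
import queue
import threading
import time
from contextlib import contextmanager

metrics = queue.Queue()
_STOP = object()  # sentinel that tells the writer thread to finish


@contextmanager
def measured(action):
    """Time a block and enqueue the result; the request never waits on I/O."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics.put((action, time.perf_counter() - start))


def writer(sink, batch_size=100):
    """Drain the queue on a background thread and store samples in batches."""
    batch = []
    while True:
        item = metrics.get()
        if item is not _STOP:
            batch.append(item)
        if batch and (item is _STOP or len(batch) >= batch_size):
            sink(batch)
            batch = []
        if item is _STOP:
            return
```

A request handler would wrap each action in `with measured("edit_customer"):`, while one thread started with `threading.Thread(target=writer, args=(sink,))` does the persistent writes. Putting `_STOP` on the queue at shutdown flushes the last partial batch.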

Tips for optimising Database and POST request performance

I am developing an application which involves multiple user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translates to database reads and writes. The real time result returned from the server is used to update the client side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to club POST requests together and send them to the server in groups of 8-10, instead of firing individual requests. I am not currently using caching on the database side, and I don't really know much about what it is or whether it would be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
Can you reduce your communication with the server by caching some data client side?
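As a rough illustration of caching with an expiry time, here is a minimal in-process sketch. `TTLCache` and `loader` are hypothetical names; `loader` stands for the database read being avoided, and a shared cache such as memcached would play the same role across servers.

```python
import time


class TTLCache:
    """Cache values for a fixed time-to-live, reloading them after expiry."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl      # seconds; choose it by how rarely the data changes
        self.clock = clock
        self.entries = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self.entries.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)  # e.g. the database query being avoided
        self.entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry right after a write so readers see fresh data."""
        self.entries.pop(key, None)
```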
The purpose of GET is as its name implies - to GET information. It is intended to be used when you are reading information to display on the page. Browsers will cache the result from a GET request and if the same GET request is made again then they will display the cached result rather than rerunning the entire request. This is not a flaw in the browser processing but is deliberately designed to work that way so as to make GET calls more efficient when the calls are used for their intended purpose. A GET call is retrieving data to display in the page and data is not expected to be changed on the server by such a call and so re-requesting the same data should be expected to obtain the same result.
The POST method is intended to be used where you are updating information on the server. Such a call is expected to make changes to the data stored on the server and the results returned from two identical POST calls may very well be completely different from one another since the initial values before the second POST call will be different from the initial values before the first call because the first call will have updated at least some of those values. A POST call will therefore always obtain the response from the server rather than keeping a cached copy of the prior response.
Ref.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turn around time. Some things you can look into doing are:
- Prefetch GET requests that have high odds of being loaded by the user
- Use a caching layer in between as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache; it's quite common and there are libraries for just about everything
- Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins)
- Use delayed inserts to give priority to writes and let the database server optimize the batching
- Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries
- Look at retrieving data in more general queries and filtering the data when it makes it to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
- For writes, look at running asynchronous queries to the database if it's possible within your system. Data synchronization doesn't have to be instantaneous; it just needs to appear that way (most of the time)
- Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite sized scale).
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth or more cpu power or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0. develop a set of tests, preferably automated, that can determine whether your application works correctly
1. measure the 'speed' of your application
2. determine how fast it has to become
3. identify the source of the performance problems (typical problems are: network throughput, file i/o, latency, locking issues, insufficient memory, cpu)
4. fix the problem
5. make sure it is actually faster
6. make sure it is still working correctly (hence the tests above)
7. return to 1
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either
- you've got one or two really long running bits or
- you're running a shitton of queries because of an n+1 error or the like
When you find what's going wrong, fix it. If you don't know how, post again. ;-)
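Timing every step does not need much machinery. The sketch below is one way to do it in Python; `step`, `timings`, and `report` are hypothetical names, and a real profiler for your platform would give more detail.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # step name -> list of durations in seconds


@contextmanager
def step(name):
    """Record how long the wrapped block takes under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)


def report():
    """Return (name, call count, total seconds), slowest total first."""
    rows = [(name, len(d), sum(d)) for name, d in timings.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Wrap each stage of a request in `with step("query"):` and look at `report()`. A single step with a large total points at a long-running bit, while a step with a very high call count per request is the typical signature of an n+1 query pattern.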
