How expensive is access to a database? How often do we access it?

I'm about to write an application for Android, and it will use MySQL.
I know that database access is expensive in terms of time, and I would like to know how often applications such as instant messaging or online games access their databases.
For example, in a game we would like to save a player's position in the world while he is moving all the time.
Is database access actually not that expensive, so that one can stay connected all the time and just issue cheap requests?
Or is it really expensive anyway, and are there techniques to access it only every X interval of time, saving data locally in the meantime?
I know that my question is really general, and that it always depends on what we need and want.
My question came up because I made a really simple login application that connects and makes one request to the database, and it takes 1 second (a lot!!) to get the result. So how can online applications be so fast?
Thank you

Before answering this, I would recommend simulating the process as closely as possible and benchmarking it, so you can work towards the best solution for your use case.
E.g. if I have an application submitting data to a database, I simulate the submission so I can easily run multiple submissions at the same time, see what the bottleneck is, and see how it compares when I use caching, replication, indexes, etc.
Also, reading company blogs can be helpful, as they often share success stories that support the use of a particular approach.
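As a rough illustration of the benchmarking advice, here is a minimal timing harness. It is only a sketch: it assumes Python with the standard-library sqlite3 module standing in for the real database, and the `benchmark` helper and the `submissions` table are made-up names for the example.

```python
import sqlite3
import statistics
import time


def benchmark(fn, repeats=50):
    """Run fn() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# In-memory SQLite is an assumption here; point this at your real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO submissions (payload) VALUES (?)",
    [(f"item-{i}",) for i in range(10_000)],
)
conn.commit()

# Compare a trivial round trip with a query that has to scan the table.
trivial = benchmark(lambda: conn.execute("SELECT 1").fetchone())
scan = benchmark(
    lambda: conn.execute(
        "SELECT COUNT(*) FROM submissions WHERE payload LIKE '%9'"
    ).fetchone()
)
print(f"SELECT 1: {trivial:.6f}s, filtered scan: {scan:.6f}s")
```

The same helper can wrap any candidate approach (with or without a cache, with or without an index), so the comparison uses numbers from your own setup rather than guesses.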
How expensive is access to database?
Accessing a database can be a pretty quick operation:
SELECT 1; -- 0.005 secs :D
However, there are situations that can lead to poor performance (slow reads, writes and updates), but there are some relatively simple ways to combat this:
Indexes
The best way to improve the performance of SELECT operations is to
create indexes on one or more of the columns that are tested in the
query. The index entries act like pointers to the table rows, allowing
the query to quickly determine which rows match a condition in the
WHERE clause, and retrieve the other column values for those rows.
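To see the effect described in the quote, you can ask the database for its query plan before and after creating an index. The sketch below assumes SQLite (via Python's sqlite3 module) rather than MySQL, and the `players` table and index name are hypothetical; MySQL has its own EXPLAIN statement for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO players (name, score) VALUES (?, ?)",
    [(f"player{i}", i % 100) for i in range(5000)],
)


def query_plan(sql):
    """Return SQLite's textual plan for a statement (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


sql = "SELECT * FROM players WHERE name = 'player42'"

plan_before = query_plan(sql)  # no index on name yet, so the table is scanned
conn.execute("CREATE INDEX idx_players_name ON players (name)")
plan_after = query_plan(sql)   # the planner can now search the index

print(plan_before)
print(plan_after)
```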
Replication
spreading the load among multiple slaves to improve performance. In
this environment, all writes and updates must take place on the master
server. Reads, however, may take place on one or more slaves. This
model can improve the performance of writes (since the master is
dedicated to updates), while dramatically increasing read speed across
an increasing number of slaves.
How often do we access to it?
If you are solely using a database, you will access it every time you need to update a player's position and every time you need to find out their position.
This is where you would explore options to avoid accessing the database:
Memory caches such as Redis or Memcached
Replication - only read from slaves
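The memory-cache option could look roughly like the following. This is a sketch only: a plain dict stands in for Redis or Memcached, SQLite stands in for MySQL, and `PositionStore` and the `positions` table are invented for the example.

```python
import sqlite3


class PositionStore:
    """Keeps recently used player positions in memory in front of the table."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # stand-in for Redis or Memcached
        self.db_reads = 0  # counts reads that actually reach the database

    def get(self, player_id):
        if player_id in self.cache:
            return self.cache[player_id]
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT x, y FROM positions WHERE player_id = ?", (player_id,)
        ).fetchone()
        if row is not None:
            self.cache[player_id] = row
        return row

    def set(self, player_id, x, y):
        # Write through: update the database, then refresh the cached copy.
        self.conn.execute(
            "INSERT OR REPLACE INTO positions (player_id, x, y) VALUES (?, ?, ?)",
            (player_id, x, y),
        )
        self.conn.commit()
        self.cache[player_id] = (x, y)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (player_id INTEGER PRIMARY KEY, x REAL, y REAL)")
store = PositionStore(conn)
```

Repeated reads of the same player are then served from memory, and only a cache miss costs a database round trip.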

It depends on your design and requirements.
1) Most applications manage connection pools to minimize connection initialization time.
2) Most ORM frameworks have an external cache to improve read performance. So if your application does heavy data reading, don't worry about storing the data locally; the cache will be effective in this case.
3) Storing data locally in a file (or some other format) will also add extra delay.
4) If you keep the data in primary memory, then obviously game performance will be better. That's why gamers prefer high-end graphics cards and lots of RAM.
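For point 1, a connection pool can be sketched in a few lines. This is an illustration under assumptions, not a production pool: it uses Python's sqlite3 and queue modules, `ConnectionPool` is a hypothetical class, and real drivers and frameworks usually ship their own pooling.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and hands them out on demand."""

    def __init__(self, database, size=4):
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used from
            # whichever thread borrows it. Note that each ":memory:" connection
            # is its own separate database; a real pool targets one server.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it

    def idle_count(self):
        return self._idle.qsize()


pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

The connection cost is paid once at startup; each request afterwards only borrows and returns an already open connection.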

Most databases offer batch insertion. Even a small per-connection overhead accumulates if you open too many connections over time, and performing single insertions has more overhead than inserting in batches. The only question is how often. You should test how often you want to insert and how much information you should store locally before doing a batch insertion.
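A minimal sketch of local buffering followed by a batch insert, assuming Python's sqlite3 module as a stand-in for your database; `BatchWriter`, the `events` table and the batch size of 3 are made up for the example and would need tuning by measurement.

```python
import sqlite3


class BatchWriter:
    """Buffers rows locally and writes them with one executemany() call."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def add(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.conn:  # one transaction, committed once for the whole batch
            self.conn.executemany(
                "INSERT INTO events (player_id, action) VALUES (?, ?)",
                self.pending,
            )
        self.pending = []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (player_id INTEGER, action TEXT)")
writer = BatchWriter(conn, batch_size=3)
```

Anything still sitting in `pending` is lost if the process dies, so call `flush()` on shutdown and pick a batch size that matches how much data you can afford to lose.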

Related

AWS ElastiCache vs RDS ReadReplica

My app currently connects to a RDS Multi-AZ database. I also have a Single-AZ Read Replica used to serve my analytics portal.
Recently there has been an increasing load on my master database, and I am thinking of how to resolve this without having to scale up my database again. The two options I have in mind are:
Move all the read queries from my app to the read-replica, and just scale up the read-replica, if necessary.
Implement ElastiCache Memcached.
To me these two options seem to achieve the same outcome, which is to reduce load on my master database, but I think I may have misunderstood some fundamentals, because Google doesn't seem to return any results comparing them.
In terms of load, they have the same goal, but they differ in other areas:
Up-to-dateness of data:
A read replica will continuously sync from the master. So your results will probably lag 0 - 3s (depending on the load) behind the master.
A cache takes the query result at a specific point in time and stores it for a certain amount of time. The longer your queries are being cached, the more lag you'll have; but your master database will experience less load. It's a trade-off you'll need to choose wisely depending on your application.
Performance / query features:
A cache can only return results for queries it has already seen. So if you run the same queries over and over again, it's a good match. Note that queries must not contain changing parts like NOW(), but must be equal in terms of the actual data to be fetched.
If you have many different, frequently changing, or dynamic (NOW(),...) queries, a read replica will be a better match.
ElastiCache should be much faster, since it's returning values directly from RAM. However, this also limits the number of results you can store.
So you'll first need to evaluate how outdated your data can be and how cacheable your queries are. If you're using ElastiCache, you might be able to cache more than queries — like caching whole sections of a website instead of the underlying queries only, which should improve the overall load of your application.
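The staleness trade-off of a query cache can be illustrated with a small time-to-live cache. This is a sketch under assumptions: a dict stands in for ElastiCache, SQLite stands in for RDS, and `QueryCache`, the `orders` table and the injected clock are invented for the example.

```python
import sqlite3
import time


class QueryCache:
    """Caches query results for `ttl` seconds, keyed by SQL text and parameters."""

    def __init__(self, conn, ttl, clock=time.monotonic):
        self.conn = conn
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # (sql, params) -> (expires_at, rows)
        self.misses = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # possibly stale, but no database work
        self.misses += 1
        rows = self.conn.execute(sql, params).fetchall()
        self.entries[key] = (now + self.ttl, rows)
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (10.0)")

fake_now = [0.0]  # controllable clock so the example needs no sleeping
cache = QueryCache(conn, ttl=30, clock=lambda: fake_now[0])
```

Because the key is the exact SQL text plus parameters, a query whose text or arguments change on every call never hits the cache, which is the case where a read replica fits better.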
PS: Have you tuned your indexes? If your main problem is writes, that won't help. But if you are fighting slow reads, indexes are the #1 thing to check, and they do make a huge difference.

Should I keep this "GlobalConnection" or create connection for every query?

I have inherited a legacy Delphi application that uses ADO to connect to SQL Server.
The application has a notion of a "Global Connection" -- that is a single connection that it opens at the start, and then keeps open all throughout the running of the application (which can be days, weeks, or longer....)
So my question is this: Should I keep this way of doing things or should I switch to a "connect-query-disconnect" mode of doing things? Does it matter?
Switching would be a non-trivial task, but I'll do it if it means better performance, data management, etc.
Well, it depends on what you're expecting to get out of it, and what kind of application it is.
There's nothing in particular wrong with using a single long-running connection, as long as the application can gracefully handle disconnections and recover or log/notify when it can't reconnect.
The problem with a connect-query-disconnect setup is that you're adding the overhead of connecting and disconnecting on every query. That's going to slow things down, and in an interactive GUI application users may notice the additional overhead. You also have to make sure that authorization is transparently handled if it isn't already.
At the same time, there may be interactive performance gains to be had if you can push all the queries off onto background threads and asynchronously update the GUI. If contention appears because the queries are serialized, you can migrate to a connection-pool system fairly readily as well and improve things even more. This has a fairly high complexity cost, though, so now you're balancing the gains against the work involved.
Right now, my ultimate response is "if it ain't broke, don't fix it." Changes along the lines you propose are a lot of work -- how much do the users of this application stand to gain? Are there other problems to solve that might benefit them more?
Edit: Okay, so it's broke. Well, slow at least, which is all the same to me. If you've ruled out problems with the SQL Server itself, and the queries are performing as fast as they can (i.e. DB schema is sane, the right indexes are available, queries aren't completely braindead, server has enough RAM and fast enough I/O, network isn't flaky, etc.), then yes, it's time to find ways to improve the performance of the app itself.
Simply moving to a connect-query-disconnect is going to make things worse, and the more queries you're issuing the bigger the drop off is going to be. It sounds like you're going to need to rearchitect the app so that you can run fewer queries, run them in the background, cache more aggressively on the client, or some combination of all 3.
Don't forget that making the clients perform better means that server-side performance gets more important, since the server is probably going to be handling a higher load if clients start making multiple connections and issuing multiple queries in parallel.
As Mr. Frazier said before, a single global connection is not bad per se.
If you intend to change it, first detect WHAT the problem is. Let's look at some scenarios:
1
Some screens (in other words, a set of 1..n forms that operate on a business entity) are slow. Possible causes:
Insufficient filtering, resulting in a plethora of records being pulled from the database unnecessarily.
The number of records is OK, but rendering them takes too long. Solution: faster controls or intelligent rendering (e.g. virtual list views).
Too many queries each time you open a screen. Possible solutions: use TClientDatasets (or any in-memory dataset) to hold infrequently modified lookup tables. A more sophisticated cache for larger tables, or opening those datasets in other threads, can improve response times.
Scrolling on datasets with bound controls can be slow (just a reminder, because those little details are easily forgotten).
2
The whole app simply slows down. Checklist:
Are the network cards OK? A few malfunctioning network cards can wreak havoc even on well-structured networks, as they create unnecessary noise on the line.
[MSSQL DBA HAT ON] Next in the line of attack is SQL Server. Ask the DBA to trace blocks and deadlocks. Log slow queries and work on speeding them up. This relates directly to #1.1 and #1.3.
Detect whether some naive developer has done SELECTs inside transactions. In read committed isolation, it's just overhead, as it creates more network traffic. Open the query, retrieve the data and close the dataset.
Review the database schema, if you can.
Are any data-bound operations on a bulk of records (let's say, repricing some/most/all products) being done in the app? Make an SP or refactor the operation into a query; it'll be much faster and will reduce the load on the entire server.
Extensive operations on a group of records? Learn how to do those operations at once on the server instead of record by record. See the examination of the most used alternatives in MSSQL MVP Erland Sommarskog's article on arrays and lists in MSSQL.
Beware of queries with a WHERE like: WHERE SomeFunction(table1.blabla) = #SomeParam. Most of the time, those will not use an index, causing the entire table to be read to select the desired data. If it is a big table.... Indexing on a persisted computed column can work miracles...[MSSQL HAT OFF]
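The last point can be checked by looking at the query plan. The sketch below is an analogue, not the SQL Server feature itself: it assumes SQLite (3.9 or later) through Python's sqlite3 module, where an index on an expression plays a similar role to an index on a persisted computed column. The `users` table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"User{i}@Example.com", f"user {i}") for i in range(2000)],
)
conn.execute("CREATE INDEX idx_users_email ON users (email)")


def query_plan(sql, params=()):
    """Return SQLite's textual plan for a statement (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# The function call on the column keeps the plain index on email from being searched.
sql = "SELECT * FROM users WHERE lower(email) = ?"
arg = ("user42@example.com",)

plan_function = query_plan(sql, arg)
# Index the expression itself so the predicate can be answered by an index search.
conn.execute("CREATE INDEX idx_users_email_lower ON users (lower(email))")
plan_indexed = query_plan(sql, arg)

print(plan_function)
print(plan_indexed)
```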
That's what I can think of without a little more detail... ;-)
Database connections are time-consuming resources to create, and the rule of thumb is to create as few as possible and reuse them as much as possible. That's why some other technologies have database connection pools, which are typically established at application/service startup, kept as long as possible and shared among threads.
From your comment, the application has performance issues, but it's difficult to make any recommendation without more details.
You should try to nail down what is slow: are all queries slow, or just some specific ones?
If just some specific ones, is there some correlation?
My 2 cents.

Not able to find the right technique to increase the performance of database retrieving

I have 12 tables, and 2 of them have millions of records. Retrieving data from these two tables takes a long time. I have heard about indexing, but I don't think indexing is the right approach here, because each time I need to fetch the whole record instead of 2-3 columns of it. I also tried indexing, but it took more execution time than without indexing, because I fetched the whole record.
So, what would be the right approach here?
I'm basing my arguments on Oracle, but similar principles probably apply to other RDBMSs. Please tag your question with the system used.
For indexing the number of columns is mostly irrelevant. More important is the number of rows. But I guess you need all or mostly all of those as well. Indexing won't help in this case, since it would just add another step in the process without reducing the amount of work getting done.
So what you seem to be doing is large table scans. Those are normally not cached, because they would basically flush all the other useful data out of the cache. So every time you select this kind of data, you have to fetch it from disc, probably sending it over a wire as well. This is bound to take some time.
From what you describe, probably the best approach is to cut down on disc reads and network traffic by caching the data as near to the application as possible. Try to set up a cache on the machine of your application, possibly as part of your application. Read the data once, put it in the cache and read it from there afterwards. An in-memory database would allow you to keep your SQL-based access path, if this is of any value to you.
Possibly try to fill the cache in the background before anybody is trying to use it.
Of course this will eat up quite some memory and you have to judge if this is feasible.
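As one possible shape of the in-memory database idea, the sketch below copies a database into memory once and then queries the local copy. It assumes SQLite through Python's sqlite3 module (its `Connection.backup` method needs Python 3.7+) rather than Oracle; the `remote` connection merely stands in for the real server, and `load_into_memory` and the `records` table are hypothetical.

```python
import sqlite3


def load_into_memory(source):
    """Copy every table from `source` into a fresh in-memory SQLite database."""
    local = sqlite3.connect(":memory:")
    source.backup(local)  # full copy; reads afterwards never touch `source`
    return local


# Stand-in for the remote database holding the large tables.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
remote.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    [(f"row {i}",) for i in range(1000)],
)
remote.commit()

# This call could run in a background thread at startup to warm the cache.
local = load_into_memory(remote)
print(local.execute("SELECT COUNT(*) FROM records").fetchone()[0])
```

The copy does not follow later changes in the source, so it needs a refresh strategy if the underlying tables are updated.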
Second approach would be to tune the caching settings to make the database cache those tables in memory. But be warned that this will affect the performance of the database as a whole and not in a positive way.
Third option might be to move your processing logic into the database. It won't reduce the amount of disc I/O, but at least you would take the network out of the loop (assuming this is part of the issue)
There are a few things you can try:
Enable the database's query cache or increase its size.
memcached at the application level will increase your performance (for sure).
Tweak your queries to get the best performance, and configure the best-working indexes.
Hope it helps. I have tested all three with a MySQL database and a Django (Python) application, and they show good results.

What are the pros and cons of a distributed second level cache versus focusing on tuning database

We have a website that uses NHibernate and its 2nd level cache. We are having a debate, as one person wants to turn off the second level cache because we are moving to a multi-webserver environment (with a load balancer in front).
One argument is to get rid of the second level cache and focus on optimizing and tuning the DB. The other argument is to roll out a distributed cache as the second level cache.
I am curious to hear folks' pros and cons of DB tuning versus a distributed cache (factoring in effort involved, cost, complexity, etc.).
In a load-balancing scenario you have to use a distributed cache provider to get the best performance and consistency; that has nothing to do with optimizing your database. In any scenario you should optimize your database.
Both. You should have a distributed cache to prevent unnecessary calls to the database, and a tuned database so the initial calls are quickly returned. As an example, Facebook required a significant amount of caching to scale, but I'm sure it wouldn't do much good if the initial queries took 10 minutes. :)
Two words: measure it.
Since you already have the cache implemented, you can probably measure what the impact of turning it off would be, for benchmark purposes.
I would think that a multi-web server and a distributed second level cache can -and probably should- coexist.
First of all, if we take memcached as an example, it supports distributed object storage, so if you're not using that, you could switch to it. It works.
Secondly, I'm guessing that you're introducing the web-server farm to respond to increasing web requests, which will in turn mean increasing requests for data. If you kill your caching, it won't matter how much you optimize your database; you're going to thrash it with queries. You may improve the execution time of each query, but you will still be waiting for the database to return your data.
This is especially true for the case that web-node 1 requests dataset A and web-node 2 requests dataset A --> you are going to do the same query twice while with second level caching you only do it once.
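The two-node case can be shown with a toy model. This is only a sketch: the `Database` and `WebNode` classes are invented, and a shared dict stands in for a distributed second-level cache such as memcached.

```python
class Database:
    """Counts how many queries actually reach the database."""

    def __init__(self):
        self.queries = 0

    def load(self, key):
        self.queries += 1
        return f"dataset {key}"


class WebNode:
    def __init__(self, db, cache):
        self.db = db
        self.cache = cache  # dict standing in for a second-level cache

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.load(key)
        return self.cache[key]


def queries_for(shared):
    """Have two nodes request dataset A and report how often the DB was hit."""
    db = Database()
    common = {}
    node1 = WebNode(db, common if shared else {})
    node2 = WebNode(db, common if shared else {})
    node1.get("A")
    node2.get("A")
    return db.queries


print(queries_for(shared=False), queries_for(shared=True))
```

With per-node caches each node pays for its own first query; with the shared cache the second node reuses what the first one loaded.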
So my recommendation is:
Don't kill your second level cache. You have already spent resources to implement it and by disabling it you are NOT going to improve your application's performance. Even a single node of memcached is going to be faster than having none at all.
Do optimize your database operations. This means both from the database side (indexes, views, SPs, functions, perhaps a cluster with read-only and write-only nodes) and the application side (optimize your queries, profile lazy/eager loading, don't fetch data you don't need, combine multiple queries into single round trips via Future, MultiQuery, MultiCriteria).
Do optimize your second-level cache implementation. There are datasets that have an infinite expiration date, and thus you query the db for them only once, and there are datasets that have short expiration dates, and thus probably expensive queries are executed more frequently. By optimizing your queries and your db you are going to improve the performance for the queries but the second-level cache is going to save your skin on peak load where short-expiration date datasets will be fetched by the cache more frequently.
If textual queries are an everyday operation, use the database's full-text capabilities or, even better, use an independent service like Lucene.NET (which can be integrated with NHibernate via NHibernate.Search).
That's a very difficult topic. In either case you need proficiency. Either a very proficient DBA, or a very proficient NHibernate / Cache administrator.
Personally, I prefer having full control over my SQL and tuning the database. Since you only have multiple webservers (and not necessarily multiple database instances), you might be better off that way, too. Modern databases have very efficient caches, so usually you create more harm with badly configured second-level caches in the application, rather than just letting the database cache sql statements, cursors, data, buffers, etc. I have experienced this to work very well for around 15 weblogic servers and only one database with lots of memory.
Since you do have NHibernate already, though, moving away from it, back to SQL (maybe with LINQ?) might be quite a costly task, that's not worth the effort.
We use NHibernate's 2nd level cache in our multi-server environment using Microsoft AppFabric distributed cache framework (NHibernate Velocity Provider) with great success.
Having said that, using 2nd level cache requires deeper understanding of the framework to prevent unexpected results. In addition, before using distributed caches, it is important to measure their overhead.
So my answer is basically - before using 2nd-level cache, you should really test and see whether it is really needed.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be more bandwidth-limited than writing to a DB that then writes to disk, precisely because the DB can, theoretically, write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table. If your application framework supports batched inserts, use them; it will speed things up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example, if you are storing the HTTP method of the request (GET/POST/etc.), this can only ever be one of a few values, so it is better to store it as an int and get improved I/O and query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out-of-band process (i.e. separate from the webserver) suck these files up and BCP them into the database. This comes at the expense of more moving parts and, potentially, a greater delay between receiving requests and them finding their way into the database.
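The hopper-table option might be sketched like this. It is an illustration under assumptions: SQLite via Python's sqlite3 module stands in for the real RDBMS, a Python function replaces the stored procedure, and the table names (`hopper`, `http_methods`, `requests`) and `merge_hopper` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hopper (url TEXT, method TEXT);  -- unindexed staging table
    CREATE TABLE http_methods (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE requests (id INTEGER PRIMARY KEY, url TEXT, method_id INTEGER);
    CREATE INDEX idx_requests_method ON requests (method_id);
""")


def merge_hopper(conn):
    """Move staged rows into the indexed master table in one transaction."""
    with conn:
        # Normalize the low-entropy column into a small lookup table.
        conn.execute(
            "INSERT OR IGNORE INTO http_methods (name) "
            "SELECT DISTINCT method FROM hopper"
        )
        conn.execute(
            "INSERT INTO requests (url, method_id) "
            "SELECT h.url, m.id FROM hopper h "
            "JOIN http_methods m ON m.name = h.method"
        )
        conn.execute("DELETE FROM hopper")
```

Request handlers only ever insert into `hopper`; `merge_hopper` would be called every x requests or from a scheduled job.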
Hope this helps, Ace
When working with an RDBMS, the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistent storage (disk drives) to complete each transaction, which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts, sending them in bulk within a single transaction will lead to more efficient write behavior on disk by reducing the number of flush operations.
My recommendation is to queue the messages and periodically (say, every 15 seconds or so) start a transaction, send all queued inserts, and commit the transaction.
If your database supports sending multiple log entries in a single request/command, doing so can have a noticeable effect on performance when there is some network latency between the application and the RDBMS, by reducing the number of round trips.
Some systems support bulk operations (BCP), providing a very efficient method for bulk loading data which can be faster than using "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such case where the person logged to a file in reverse order, so that the most recent entries came first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded PHP's memory limit and, as such, bottlenecked the entire project.
If you do it properly, however, the filesystem reads/writes will go directly into the system cache and will only be flushed to disk every 10 or more seconds (depending on FS/OS settings), which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have a batch job that processes this data. The advantage of the database approach is that you can query/report on the data, and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to a text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering and then batch writing. You can buffer at different places: in memory on your server, then batch insert into the DB or batch write to a file every X requests and/or every X seconds.
If you use MySQL, there are several different options/techniques to efficiently load a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (i.e. per day or per week)
using multiple db connections
using multiple db servers
having good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, I can give more specific advice.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.
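A minimal sketch of the asynchronous approach using an in-process queue and one background thread. The assumptions: Python's queue, threading and sqlite3 modules stand in for a real message queue and database, and `writer_loop`, the `log` table and the temporary database file are invented for the example.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the worker to finish


def writer_loop(db_path, jobs):
    """Drain the queue on a background thread and insert each entry."""
    conn = sqlite3.connect(db_path)  # the connection lives on this thread only
    conn.execute("CREATE TABLE IF NOT EXISTS log (message TEXT)")
    while True:
        item = jobs.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO log (message) VALUES (?)", (item,))
        conn.commit()
    conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "log.db")
jobs = queue.Queue()
worker = threading.Thread(target=writer_loop, args=(db_path, jobs))
worker.start()

for i in range(5):
    jobs.put(f"request {i}")  # returns immediately; no waiting on the database

jobs.put(STOP)
worker.join()
```

The request path only pays for `jobs.put()`. The trade-off is that queued entries are lost if the process dies before the worker writes them, which a durable message queue would avoid.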

Resources