Not able to find the right technique to increase the performance of database retrieving - database

I have an 2 tables from 12 tables and these 2 tables having millions of records , and when I retrieve the data from these tables it takes much more time . I have heard about indexing , but I think indexing is not a right approach which can be used here . Because each time , I need to fetch whole record instead of 2-3 columns of a record. I also applied indexing , but it took more execution time than without indexing because I fetched whole record.
So , what should be the right approach can be used here?

I'm basing my arguments on Oracle, but similar principles probably apply to other RDBMSs. Please tag your question with the system used.
For indexing the number of columns is mostly irrelevant. More important is the number of rows. But I guess you need all or mostly all of those as well. Indexing won't help in this case, since it would just add another step in the process without reducing the amount of work getting done.
So what you seem to do are large table scans. Those are normally not cached, because they would basically flush the whole cache from all the other useful data being stored there. So every time you select this kind of data you have to scratch in from disc, probably sending it over a wire also. This is bound to take some time.
From what you describe probably the best approach is to cut down on disc reads and network traffic by caching the data as near as possible to the application as possible. Try to setup a cache on the machine of your application possibly as part of your application. Read the data once, put it in the cache and read it from their afterwards. An in memory database would allow you to keep your SQL based access path if this is of any value for you.
Possibly try to fill the cache in the background before anybody is trying to use it.
Of course this will eat up quite some memory and you have to judge if this is feasible.
Second approach would be to tune the caching settings to make the database cache those tables in memory. But be warned that this will affect the performance of the database as a whole and not in a positive way.
Third option might be to move your processing logic into the database. It won't reduce the amount of disc I/O, but at least you would take the network out of the loop (assuming this is part of the issue)

There are few ways you can try things out :-
enable/increase query cache size of the database.
memcached at application level will increase your performance (for sure).
tweak your queries to get the best performance, and configure the best working indexes.
Hope it helps. I had tested all three for MySQL database - django(python) applicaton and they show good results.

Related

How expensive is access to database? How often do we access to it?

I'm about to write an application for Android, and it will use Mysql.
I know that access to DB is really expensive in terms of time, and would like to know how often do applications like instant messaging, online gaming access to databases?
For example in a game, we would like to save the positions of a player in the world, when he's moving all the time.
Is the database access actually not expensive, and there is a way to be connected to it all the time and just do request that are actually not expensive?
Or is IT really expensive in anyway, and there are techniques to access to it for example every X interval of time, and saving it locally in the meantime?
I Know that my question is really general, and it depends always on what we need and want.
My question came out because i made a really simple login application that connects and does 1 request to database, and it takes 1 second (a lot!!) to get the result, so how online applications can be so fast?
Thank you
Before answering this I would recommend simulating the process as much as possible, benchmarking and you can work towards the best solution for your use case.
e.g. If I have an application submitting data to a database simulate the submission so I can easily run multiple submissions at the same time and see what the bottle neck is...and see how it compares when I using caching, replication, indexes, etc.
Also reading company blogs can be helpful as they often share success stories that support the usage of a particular approach
How expensive is access to database?
Accessing a database can be a pretty quick operation
SELECT 1; // 0.005 Secs :D
However there are situations that can lead to poor performance (slow reads, writes and updates) but there are some relatively simple ways to combat this
Indexes
The best way to improve the performance of SELECT operations is to
create indexes on one or more of the columns that are tested in the
query. The index entries act like pointers to the table rows, allowing
the query to quickly determine which rows match a condition in the
WHERE clause, and retrieve the other column values for those rows.
Replication
spreading the load among multiple slaves to improve performance. In
this environment, all writes and updates must take place on the master
server. Reads, however, may take place on one or more slaves. This
model can improve the performance of writes (since the master is
dedicated to updates), while dramatically increasing read speed across
an increasing number of slaves.
How often do we access to it?
If you are solely using a database you will access it every time you n position and every time you need to find out their position.
This is where you would explore options to prevent accessing the database.
Memory caches such as redis or memcache
Replication - Only read from slaves
It depends on your design and requirement.
1) Most of the applications manage Connection Pools to minimize the initialization time.
2) Most of the ORM frameworks have external Cache to improve the reading performance. So if you do heavy data reading in your application then don't worry about storing it in locally. The Cache will be effective in this case.
3) When you store locally either in File (or) some format, then it will also add extra performance delay.
4) If you keep the data in primary memory, then obviously Game performance would be better. That's why Gamers prefer high end graphics card, and huge RAM.
For most databases there is the option of batch insertions. Obviously even a small overhead will accumulate if you have to many connections over time. And performing single insertions will have a greater overhead than on batch. The only issue is how often?.... And you should test how often you wan't to insert and how much information you should store locally before doing a batch insertion.

AWS ElastiCache vs RDS ReadReplica

My app currently connects to a RDS Multi-AZ database. I also have a Single-AZ Read Replica used to serve my analytics portal.
Recently there have been an increasing load on my master database, and I am thinking of how to resolve this situation without having to scale up my database again. The two ways I have in mind are
Move all the read queries from my app to the read-replica, and just scale up the read-replica, if necessary.
Implement ElastiCache Memcached.
To me these two options seem to achieve the same outcome for me - which is to reduce load on my master database, but I am thinking I may have understood some fundamentals wrongly because Google doesnt seem to return any results on a comparison between them.
In terms of load, they have the same goal, but they differ in other areas:
Up-to-dateness of data:
A read replica will continuously sync from the master. So your results will probably lag 0 - 3s (depending on the load) behind the master.
A cache takes the query result at a specific point in time and stores it for a certain amount of time. The longer your queries are being cached, the more lag you'll have; but your master database will experience less load. It's a trade-off you'll need to choose wisely depending on your application.
Performance / query features:
A cache can only return results for queries it has already seen. So if you run the same queries over and over again, it's a good match. Note that queries must not contain changing parts like NOW(), but must be equal in terms of the actual data to be fetched.
If you have many different, frequently changing, or dynamic (NOW(),...) queries, a read replica will be a better match.
ElastiCache should be much faster, since it's returning values directly from RAM. However, this also limits the number of results you can store.
So you'll first need to evaluate how outdated your data can be and how cacheable your queries are. If you're using ElastiCache, you might be able to cache more than queries — like caching whole sections of a website instead of the underlying queries only, which should improve the overall load of your application.
PS: Have you tuned your indexes? If your main problems are writes that won't help. But if you are fighting reads, indexes are the #1 thing to check and they do make a huge difference.

Should I keep this "GlobalConnection" or create connection for every query?

I have inherited a legacy Delphi application that uses ADO to connect to SQL Server.
The application has a notion of a "Global Connection" -- that is a single connection that it opens at the start, and then keeps open all throughout the running of the application (which can be days, weeks, or longer....)
So my question is this: Should I keep this way of doing things or should I switch to a "connect-query-disconnect" mode of doing things? Does it matter?
Switching would be a non-trivial task, but I'll do it if it means better performance, data management, etc.
Well, it depends on what you're expecting to get out of it, and what kind of application it is.
There's nothing in particular wrong with using a single long-running connection, as long as the application can gracefully handle disconnections and recover or log/notify when it can't reconnect.
The problem with a connect-query-disconnect setup is that you're adding the overhead of connecting and disconnecting on every query. That's going to slow things down, and in an interactive GUI application users may notice the additional overhead. You also have to make sure that authorization is transparently handled if it isn't already.
At the same time, there may be interactive performance gains to be had if you can push all the queries off onto background threads and asynchronously update the GUI. If contention appears because the queries are serialized, you can migrate to a connection-pool system fairly readily as well and improve things even more. This has a fairly high complexity cost to it though, so now you're looking to balancing what the gains are compared to the work involved.
Right now, my ultimate response is "if it ain't broke, don't fix it." Changes along the lines you propose are a lot of work -- how much do the users of this application stand to gain? Are there other problems to solve that might benefit them more?
Edit: Okay, so it's broke. Well, slow at least, which is all the same to me. If you've ruled out problems with the SQL Server itself, and the queries are performing as fast as they can (i.e. DB schema is sane, the right indexes are available, queries aren't completely braindead, server has enough RAM and fast enough I/O, network isn't flaky, etc.), then yes, it's time to find ways to improve the performance of the app itself.
Simply moving to a connect-query-disconnect is going to make things worse, and the more queries you're issuing the bigger the drop off is going to be. It sounds like you're going to need to rearchitect the app so that you can run fewer queries, run them in the background, cache more aggressively on the client, or some combination of all 3.
Don't forget the making the clients perform better means that server side performance gets more important since it's probably going to be handling a higher load if clients start making multiple connections and issuing multiple queries in parallel.
As mr Frazier told before - the one global connection is not bad per se.
If you intend to change, first detect WHAT is the problem. Let's see some scenarios:
1
Some screens(IOW: an set of 1..n forms to operate in a business entity) are slow. Possible causes:
insuficient filtering resulting in a pletora of records being pulled from database without necessity.
the number of records are ok, but takes too much to render it. Solution: faster controls or intelligent rendering (ex.: Virtual list views)
too much queries each time you open an screen. Possible solutions: use TClientDatasets (or any in-memory dataset) to hold infrequently modified lookup tables. An more sophisticated cache for more extensive tables or opening those datasets in other threads can improve response times.
Scrolling on datasets with controls bound can be slow (just to remember, because those little details can be easily forgotten).
2
Whole app simply slows down. Checklist:
Network cards are ok? An few net cards mal-functioning can wreak havoc even on good structured networks as they create unnecessary noise on the line.
[MSSQL DBA HAT ON] The next on the line of attack is SQL Server. Ask the DBA to trace blocks and deadlocks. Register slow queries and work on them speed up. This relate directly to #1.1 and #1.3
Detect if some naive developer have done SELECT inside transactions. In read committed isolation, it's just overhead, as it'll create more network traffic. Open the query, retrieve the data and close the dataset.
Review the database schema, if you can.
Are any data-bound operations on a bulk of records (let's say, remarking the price of some/majority/all products) being done on the app? Make an SP or refactor the operation on an query, it'll be much faster and will reduce the load of the entire server.
Extensive operations on a group of records? Learn how to do that operations at once on the server instead of one-by-one record. See an examination of most used alternatives on the MSSQL MVP Erland Sommarskog's article on array and list on MSSQL.
Beware of queries with WHERE like : WHERE SomeFunction(table1.blabla) = #SomeParam . Most of time, that ones will not use an index causing to read the entire table to select the desired data. If is a big table.... Indexing on a persisted computed columns can make miracles...[MSSQL HAT OFF]
That's what I can think of without a little more detail... ;-)
Database connections are time consuming resources to create and the rule of thumb should be create as little as possible and reuse as much as possible. That's why some other technologies have database connection pools, which are typically established at application/service startup and then kept as long as possible and shared among threads.
From your comment, the application has performances issues, but it's difficult without more details to make any recommendation.
Should try to nail down what is slow - are all queries slow or just some specific ones?
If just some specific ones is there some correlation.
My 2 cents.

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

I'm scraping a website (scripting responsibly by throttling my scraping and with permission) and I'm going to be gathering statistics on 300,000 users.
I plan on storing this data in a SQL Database, and I plan on scraping this data once a week. My question is, how often should I be doing inserts on the database as results come in from the scraper?
Is it best practice to wait till all results are in (keeping them all in memory), and insert them all when the scraping is finished? Or is it better to do an insert on every single result that comes in (coming in at a decent rate)? Or something in between?
If someone could point me in the right direction on how often/when I should be doing this I would appreciate it.
Also, would the answer change if I was storing these results in a flat file vs a database?
Thank you for your time!
You might get a performance increase by batching up several hundred, if your database supports inserting multiple rows per query (both MySQL and PostgreSQL do). You'll also probably get more performance by batching multiple inserts per transaction (except with transactionless databases, such as MySQL with MyISAM).
The benefits of the batching will rapidly fall as the batch size increases; you've already reduced the query/commit overhead 99% by the time you're doing 100 at a time. As you get larger, you will run into various limits (example: longest allowed query).
You also run into another large tradeoff: If your program dies, you'll lose any work you haven't yet save to the database. Losing 100 isn't so bad; you can probably redo that work in a minute or two. Losing 300,000 would take quite a while to redo.
Summary: Personally, I'd start with one value/one query, as it'll be the easiest to implement. If I found insert time was a bottleneck (very much doubt it, scrape will be so much slower), I'd move to 100 values/query.
PS: Since the site admin has given you permission, have you asked if you can just get a database dump of the relevant data? Would save a lot of work!
My preference is to write bulk data to the database every 1,000 rows, when I have to do it the way you're describing. It seems like a good volume. Not too much re-work if I do have a crash and need to re-generate some data (re-scraping in your case). But it's a good healthy bite that can reduce overhead.
As #derobert points out, wrapping a bunch of inserts in a transaction also helps reduce overhead. But don't put everything in a single transaction - some vendors of RDBMS like Oracle maintain a "redo log" during a transaction, so if you do too much work this can cause congestion. Breaking up the work into large, but not too large, chunks is best. I.e. 1,000 rows.
Some SQL implementations support multi-row INSERT (#derobert also mentions this) but some don't.
You're right that flushing raw data to a flat file and batch-loading it later is probably worthwhile. Each SQL vendor supports this kind of bulk-load differently, for instance LOAD DATA INFILE in MySQL or ".import" in SQLite, etc. You'll have to tell us what brand of SQL database you're using to get more specific instructions, but in my experience this kind of mechanism can be 10-20x the performance of INSERT even after improvements like using transactions and multi-row insert.
Re your comment, you might want to take a look at BULK INSERT in Microsoft SQL Server. I don't usually use Microsoft, so I don't have first-hand experience with it, but I assume it's a useful tool in your scenario.

Tips for optimising Database and POST request performance

I am developing an application which involves multiple user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translates to database reads and writes. The real time result returned from the server is used to update the client side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to club POST requests and send them out to the server as a group 8-10, instead of firing individual requests. I am not currently using caching in the database side, and don't really have too much knowledge on what it is, and whether it will be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
Can you reduce your communication with the server by caching some data client side?
The purpose of GET is as its name
implies - to GET information. It is
intended to be used when you are
reading information to display on the
page. Browsers will cache the result
from a GET request and if the same GET
request is made again then they will
display the cached result rather than
rerunning the entire request. This is
not a flaw in the browser processing
but is deliberately designed to work
that way so as to make GET calls more
efficient when the calls are used for
their intended purpose. A GET call is
retrieving data to display in the page
and data is not expected to be changed
on the server by such a call and so
re-requesting the same data should be
expected to obtain the same result.
The POST method is intended to be used
where you are updating information on
the server. Such a call is expected to
make changes to the data stored on the
server and the results returned from
two identical POST calls may very well
be completely different from one
another since the initial values
before the second POST call will be
differentfrom the initial values
before the first call because the
first call will have updated at least
some of those values. A POST call will
therefore always obtain the response
from the server rather than keeping a
cached copy of the prior response.
Ref.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turn around time. Some things you can look into doing are:
Prefetch GET requests that have high odds of being loaded by the user
Use a caching layer in between as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache, it's quite common and there are libraries for just about everything
Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins)
Use delayed inserts to give priority to writes and let the database server optimize the batching
Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries
Look at retrieving data in more general queries and filtering the data when it makes to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
For writes, look at running asynchronous queries to the database if it's possible within your system. Data synchronization doesn't have to be instantaneous, it just needs to appear that way (most of the time)
Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite sized scale).
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth or more cpu power or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0 develop a set of tests, preferable automatic that can determine, if your application works correct.
1 measure the 'speed' of your application.
2 determine how fast it has to become
3 identify the source of the performane problems:
typical problems are: network throughput, file i/o, latency, locking issues, insufficient memory, cpu
4 fix the problem
5 make sure it is actually faster
6 make sure it is still working correct (hence the tests above)
7 return to 1
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either
you've got one or two really long running bits or
you're running a shitton of queries because of an n+1 error or the like
When you find what's going wrong, fix it. If you don't know how, post again. ;-)

Resources