Heavy vs light queries - database

When an application is constantly querying a database for information, is it better, in terms of performance and database usage, to have one big query that pulls in all the data at once, or a bunch of smaller queries that pull in the data one piece at a time? Does it matter?
I'm trying to figure out whether I should query my entire class once any value in the class is requested, or only query individual values as they are needed.

If others are paying for your usage, I would recommend pulling as much data as you can in a single connection and working with it locally. However, I don't know how 'usage' is defined; it could be connections, bandwidth, operations, etc.
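For illustration, here is a rough sketch of the trade-off in Python, with sqlite3 standing in for whatever database layer is in play and an assumed account table: the per-field version pays a round trip for every value, while the single query pays it once and the rest of the work happens locally.

```python
import sqlite3

conn = sqlite3.connect("app.db")

# Many light queries: one round trip per value requested.
# (Assumes the row with id = 1 exists; table and column names are made up.)
name = conn.execute("SELECT name FROM account WHERE id = ?", (1,)).fetchone()[0]
email = conn.execute("SELECT email FROM account WHERE id = ?", (1,)).fetchone()[0]
plan = conn.execute("SELECT plan FROM account WHERE id = ?", (1,)).fetchone()[0]

# One heavier query: a single round trip, then everything is served from memory.
name, email, plan = conn.execute(
    "SELECT name, email, plan FROM account WHERE id = ?", (1,)
).fetchone()
```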

Related

Which one is better, iterate and sort data in backend or let the database handle it?

I'm trying to design a database schema for a Django REST Framework web application.
At some point, I have two choices:
1. Choose a schema in which, in one or several APIs, I have to get a queryset from the database and then iterate over and order it with Python. (For example, I could store some data in an array-typed column, fetch it from the database and sort it with Python.)
2. Store the data in another table and insert a fairly large number of rows with each insert. This way, I can get the data in my preferred format with far fewer lines of ORM code.
I tried some basic tests and benchmarking to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means to set up a more realistic situation, so here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests from other APIs and clients each second?
Is the database (and ORM) usually faster and more reliable than the backend?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database is running on a server, often on a distributed system and so it has access to more resources.
Databases are designed to handle large amounts of data, so they are not limited by the memory available to a single application process.
When this question comes up, often more data needs to be passed back to the application than is strictly needed. Consider a problem such as getting the top 10 of something.
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.
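To make the "top 10" point concrete, here is a hedged sketch (sqlite3 as a stand-in, with an assumed scores table): letting the database sort and limit means only ten rows ever leave the server, while sorting in the application forces every row across the wire first.

```python
import sqlite3

conn = sqlite3.connect("app.db")

# Done in the database: ORDER BY + LIMIT, only ten rows are transferred.
top_ten = conn.execute(
    "SELECT user_id, points FROM scores ORDER BY points DESC LIMIT 10"
).fetchall()

# Done in the application: the whole table is transferred, then sorted in memory.
rows = conn.execute("SELECT user_id, points FROM scores").fetchall()
top_ten = sorted(rows, key=lambda r: r[1], reverse=True)[:10]
```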

Using application's internal cache while working with Cassandra

As I've been working with traditional relational databases for a long time, moving to NoSQL, especially Cassandra, is a big change. I usually design my applications so that everything in the database is loaded into the application's internal caches on startup, and if there is any update to a database table, its corresponding cache is updated as well. For example, if I have a table Student, then on startup all data in that table is loaded into StudentCache, and when I want to insert/update/delete, I call a service which updates both of them at the same time. The aim of this design is to avoid selecting directly from the database.
In Cassandra, as the idea is to build tables containing all the needed data so that joins are unnecessary, I wonder if my favorite design is still useful, or whether it is more effective to query data directly from the database (i.e. from one table) when required.
Based on the use case you describe, I'd say that querying data as you need it avoids storing data you don't need. Besides, what if your dataset is 5 GB? Are you still going to load the entire dataset?
Maybe consider a design where you don't load all the data on startup, but load it as needed, then store it and check this store before querying again, which is exactly what a cache does!
Cassandra is built to scale; your design can't, because you'll eventually reach a point where your dataset is too large. Based on that, you should think about the trade-off: lots of on-the-fly querying versus storing everything in the client. I would advise direct queries, but keep the data when you do carry out a query; don't discard it and then carry out the same query again!
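A minimal sketch of that load-as-needed store in Python, where fetch_student and write_student stand in for whatever driver calls actually hit Cassandra: check the local store first, query only on a miss, and keep what you fetched so the same query is not repeated.

```python
# In-process store keyed by student id; everything here is illustrative.
_student_cache = {}

def get_student(student_id, fetch_student):
    """fetch_student(student_id) runs the real database query on a cache miss."""
    if student_id not in _student_cache:
        _student_cache[student_id] = fetch_student(student_id)
    return _student_cache[student_id]

def update_student(student_id, data, write_student):
    """Write through: update the database, then keep the cache in step."""
    write_student(student_id, data)
    _student_cache[student_id] = data
```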
I would suggest querying the data directly, as keeping all the data in the application makes the application's performance dependent on the size of the input. That might be fine if you know the amount of data will never exceed your target machine's memory.
Should you decide later that this limit needs to grow, however, you will be faced with a problem. Keeping everything in memory is fast when it comes to searching (assuming you sort the data at startup), but it pretty much kills maintainability.
Your former favorite approach is still useful, though, should you choose to accept that constraint.

Strategy for generating statistics with user data

I need a general piece of advice, but for the record I use JPA.
I need to generate usage statistics, e.g. a breakdown of user purchases per product. I see three possible strategies: 1) generate the stats on the fly each time they are viewed, 2) create a dedicated stats table that I update each time there is a change, 3) do offline processing at regular intervals.
All of them have trade-offs, e.g. cost versus out-of-date data, and I was wondering if anyone with experience in this field could provide some advice. I am aware the question is pretty broad; I can refine my use case if needed.
I've done a lot of reporting, and the first thing I always want to know is whether the stakeholder needs the data in real time or not. That definitely shifts how you think and how you'll design a reporting system.
Based on the size of your data, I think it's possible to do real time reporting. If you had data in the millions, then maybe you'd need to do some pre-processing or data warehousing (your options 2/3).
Some general recommendations:
If you want to do real time reporting, think about making a copy of the database so you aren't running reports against production data. Some reports can use queries that are heavy, so it's worth looking into replicating production data to some other server where you can run reports.
Use intermediate structures a lot for reports. Write views, stored procedures, etc. so every report isn't just some huge complex query.
If the reports start to get too complex for doing at the database level, make sure you move the report logic into the application layer. I've been bitten by this many times. I start writing a report with queries purely from the database and eventually it gets too complex and I have to jump through hoops to make it work.
Shoot for real time and then go to stale data if necessary. Databases are capable of doing a lot more than you'd think. Quite often you can make changes to your database structures that will give you a big yield in performance.
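To make options 1 and 2 from the question concrete, here is a hedged sketch using sqlite3 and assumed purchase/product_stats tables (the original stack is JPA, so read this purely as pseudocode for the idea): option 1 aggregates on the fly each time the stats are viewed, option 2 bumps a small rollup table in the same transaction as the purchase so the report becomes a cheap single-table read.

```python
import sqlite3

conn = sqlite3.connect("shop.db")

# Option 1: compute the breakdown on the fly each time the report is viewed.
def purchases_per_product_live():
    return conn.execute(
        "SELECT product_id, COUNT(*), SUM(amount) FROM purchase GROUP BY product_id"
    ).fetchall()

# Option 2: maintain a rollup table alongside the fact row, in one transaction.
# (product_id is assumed to be the primary key of product_stats.)
def record_purchase(product_id, amount):
    with conn:
        conn.execute(
            "INSERT INTO purchase (product_id, amount) VALUES (?, ?)",
            (product_id, amount),
        )
        conn.execute(
            "INSERT INTO product_stats (product_id, purchases, revenue) "
            "VALUES (?, 1, ?) "
            "ON CONFLICT(product_id) DO UPDATE SET "
            "purchases = purchases + 1, revenue = revenue + excluded.revenue",
            (product_id, amount),
        )
```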

One big query vs. many small ones?

I'd like to know which option is the most expensive in terms of bandwidth and overall efficiency.
Let's say I have a class Client in my application and a table client in my database.
Is it better to have one static function Client.getById that retrieves the whole client record or many (Client.getNameById, Client.getMobileNumberById, etc.) that retrieve individual fields?
If a single record has a lot of fields and I end up using one or two in the current script, is it still better to retrieve everything and decide inside the application what to do with all the data?
I'm using PHP and MySQL by the way.
Is it better to have one static function Client.getById that retrieves the whole client record or many (Client.getNameById, Client.getMobileNumberById, etc.) that retrieve individual fields?
Yes, it is better to retrieve the whole record.
Network latency and the overhead of establishing a connection mean that making as few database calls as possible is the best way to keep the database from saturating.
If the size of the data is really so large that you see an effect, you can consider retrieving only the specific fields you need, still in one single query (tailor the queries to the data).
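A sketch of what the single getById might look like, with Python and sqlite3 standing in for PHP/MySQL (the client table and the name/mobile_number columns come from the question; everything else is assumed): one query fetches the whole record, and every later field access is just an attribute read.

```python
import sqlite3

class Client:
    def __init__(self, row):
        self.id, self.name, self.mobile_number = row

    @staticmethod
    def get_by_id(conn, client_id):
        # One round trip pulls the whole record; returns None if it doesn't exist.
        row = conn.execute(
            "SELECT id, name, mobile_number FROM client WHERE id = ?",
            (client_id,),
        ).fetchone()
        return Client(row) if row else None

# Usage: a single query, then plain attribute access.
# client = Client.get_by_id(sqlite3.connect("app.db"), 42)
# print(client.name, client.mobile_number)
```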

Optimize database for web usage (lots more reading than writing)

I am trying to lay out the tables for use in a new public-facing website. Seeing as there will be a lot more reading than writing of data (I'm guessing more than 85% reads), I would like to optimize the database for reading.
Whenever we list members we are planning on showing summary information about the members. Something akin to the reputation points and badges that stackoverflow uses. Instead of doing a subquery to find the information each time we do a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: is this an appropriate approach to optimizing the database? Or are the subqueries fast enough that performance would not suffer?
There are two parts:
Caching
Tuned query, which can take one of two forms:
Indexed views (AKA materialized views)
Tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against them is faster than querying an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (and materialized views in general) are very constrained in what they support: they can't contain non-deterministic functions (e.g. GETDATE()), aggregate support is extremely limited, etc.
If what you need can't be handled by an indexed view, a table into which the data is dumped and refreshed via a SQL Server Job is the next alternative. Like the indexed view, indexes would be applied to make fetching the data faster. But data changes mean maintaining those indexes to keep the query running as well as it can, and that maintenance takes time.
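As a hedged sketch of that dump-and-refresh idea, here translated to Python with sqlite3 rather than a SQL Server Job, with assumed member, point_event, badge_award and member_summary tables: a periodic job rebuilds the summary table in one transaction, and reads then hit the prebuilt (and indexable) copy instead of running subqueries.

```python
import sqlite3

conn = sqlite3.connect("site.db")

# Hypothetical periodic refresh: rebuild the summary inside one transaction so
# readers see either the old snapshot or the new one, never a half-built table.
def refresh_member_summary():
    with conn:
        conn.execute("DELETE FROM member_summary")
        conn.execute(
            "INSERT INTO member_summary (member_id, points, badges) "
            "SELECT m.id, "
            "  COALESCE((SELECT SUM(points) FROM point_event WHERE member_id = m.id), 0), "
            "  (SELECT COUNT(*) FROM badge_award WHERE member_id = m.id) "
            "FROM member m"
        )
```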
The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance cache technology (example: memcached) to store query results in your application can be a lot better strategy than trying to trick out the database to be highly scalable.
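A minimal sketch of that idea in Python, assuming a run_query callable that does the real database work: results are held in the application for a short TTL so repeated page views never reach the database. memcached offers the same get/set shape over the network; a plain dict is used here only to keep the example self-contained.

```python
import time

_cache = {}

def member_summary(member_id, run_query, ttl=60):
    """run_query(member_id) is whatever actually hits the database."""
    entry = _cache.get(member_id)
    if entry and time.time() - entry[0] < ttl:
        return entry[1]                      # served from memory, no query issued
    result = run_query(member_id)
    _cache[member_id] = (time.time(), result)
    return result
```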
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say it's definitely premature to optimize.
You might want to redesign your database a few days later; you might find that things work pretty fast without any clever hacks; or you might find they are slow, but in a different way than you expected. In any of those cases you would be wasting your time if you started optimizing now.
The approach you describe is generally fine; you could keep some pre-computed values, either using triggers/stored procedures to preserve data consistency, or running a job to update these values from time to time.
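As a sketch of the trigger-or-job variant just mentioned (sqlite3 once more as a stand-in; the member and point_event tables are assumptions): re-derive the cached total from the underlying events whenever points change, or call the same function from a scheduled job, so the listing query never needs the subquery and an out-of-sync value can always be repaired by re-running it.

```python
import sqlite3

conn = sqlite3.connect("site.db")

# Recompute one member's cached points from the event table; safe to re-run
# at any time if the denormalized value drifts out of sync.
def refresh_member_points(member_id):
    with conn:
        conn.execute(
            "UPDATE member SET points = "
            "  (SELECT COALESCE(SUM(points), 0) FROM point_event WHERE member_id = ?) "
            "WHERE id = ?",
            (member_id, member_id),
        )
```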
All databases are more than 85% reads! Usually high nineties, too.
Tune it when you need to and not before.
