I have an interesting delimma. I have a very expensive query that involves doing several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the original entire dataset, and apply group by's and joins etc to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
This works great, and really speeds things up on zero load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short, am I crazy? Or is this a good idea.
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury to edit the application. I think its a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, cause it better reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution, is that many time you have multiple applications interacting with the DB which can put you in a bind if you have no application messaging bus (or something like memcached), cause it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner, and not need to churn through ALL the data just to get page N. But sometimes its not possible. Keep in mind that streaming data out of the db can be cheaper than streaming data out of the db back into the same db. You could introduce a new service that is responsible for executing these long queries and then have your main application talk to the db via the service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.
Related
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) using DB2 I used a MQT (Materialized Query Table) that works like a view only it's driven by a query and you can schedule how often you want it to refresh; e.g. every 5 min. Then that table stored the count values.
2) in the software package itself I set information like that as a system variable. So in Apache you can set a system wide variable and refresh it every 5 minutes. Then it's somewhat accurate but your only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it so it's been while but I think in PHP was was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
I'm new to Ehcache and am searching on how to do this but now quite sure if this is a normal use case. I am working on an application that isn't a traditional web app, its something that is only used by a few people at a time and is for retrieving data from a very large dataset so rather than making a call to the DB each time I want to use caching to cache this large table. However, there is a chance that a new entry could be added to this table and I need this reflected in the cache but I don't want to reload the entire cache each time as its quite large. Any advice on how to approach this / further resources is appreciated.
You should learn about Hibernate query cache. In simple words: it works on top of second level cache (L2) and stores results of queries. But it only stores ids of the records that should be returned by the query rather than a whole list. This means that you need to have L2 working and fine tuned.
In your scenario suppose you have 1M records in table T and a query that returns 1K by average. The first time you run this query it will miss the query cache and:
run the SQL
fetch 1K records
put all of them in L2
put 1K ids in query cache
The next time you execute the query it will hit the query cache and lookup all the result from L2. The interesting part comes when you modify table T. Hibernate will figure out that the results in query cache might be stale and it will invalidate the whole cache but not the L2. It will basically repeat points 1-4 but refreshing only query cache (most of entities from table T are already in L2).
In some scenarios it works great, in others it introduces N+1 problems in unpredictable moments. This is just a tip of an iceberg, you should be really careful as this mechanism is very fragile and requires great understanding.
Here's the issue:
The database is highly normalized, and one particular query relies on the multiple relationships in the database. The query is designed to join all the tables, construct the entire object, and then return a list of those objects.
In other words this particular query does a lot of work.
Now, the query does only return X number of items as it supports pagination, but we also need to know the total count of items that are there.
Currently these two tasks are independent, but highly similar queries in our Domain Service. Ideally what I'd like to do is combine these two queries so that the call to the server happens once, rather than twice, and that the joins happen only once.
Output/Reference parameters don't work, and since the function is designed to return an IQueryable of items, I'm stuck on how to return this list of items as well as the total count.
I'm sure someone's come across this before - any thoughts?
A count of item joined tables is not the same thing as returning a subset of those records. They just happen to share a certain amount of SQL code (specifically to join the tables). RIA does the actual paging server-side so you are actually getting a slightly different query for every paging call.
A count operation would also operate much faster than the record query as SQL counts can often be performed using database indexes only (although Linq may well optimise this for you to the same end result... Clever Linq coders!).
As you would only be requesting the total count once (on page load I assume), then you begin paging through multiple queries on different portions of the data, you are hitting different parts of the database with every call.
You are better off treating them as two distinct functions (as you were) and wear the slight overhead of an additional server call. There is always somewhere else you could make bigger gains (caching etc).
When in doubt: Do not overcomplicate any process for the sake of only a very small gain.
If the problem is with the client server communication, you can put the count result on the header of the result response.
All,
Looking for some guidance on an Oracle design decision I am currently trying to evaluate:
The problem
I have data in three separate schemas on the same oracle db server. I am looking to build an application that will show data from all three schemas, however the data that is shown will be based on real time sorting and prioritisation rules that is applied to the data globally (i.e.: based on the priority weightings applied I may pull back data from any one of the three schemas).
Tentative Solution
Create a VIEW in the DB which maintains logical links to the relevant columns in the three schemas, write a stored procedure which accepts parameterised priority weightings. The application subsequently calls the stored procedure to select the ‘prioritised’ row from the view and then queries the associated schema directly for additional data based on the row returned.
I have concerns over performance where the data is being sorted/ prioritised upon each query being performed but cannot see a way around this as the prioritisation rules will change often. We are talking of data sets in the region of 2-3 million rows per schema.
Does anyone have alternative suggestions on how to provide an aggregated and sorted view over the data?
Querying from multiple schemas (or even multiple databases) is not really a big deal, even inside the same query. Just prepend the table name with the schema you are interested in, as in
SELECT SOMETHING
FROM
SCHEMA1.SOME_TABLE ST1, SCHEMA2.SOME_TABLE ST2
WHERE ST1.PK_FIELD = ST2.PK_FIELD
If performance becomes a problem, then that is a big topic... optimal query plans, indexes, and your method of database connection can all come into play. One thing that comes to mind is that if it does not have to be realtime, then you could use materialized views (aka "snapshots") to cache the data in a single place. Then you could query that with reasonable performance.
Just set the snapshots to refresh at an interval appropriate to your needs.
It doesn't matter that the data is from 3 schemas, really. What's important to know is how frequently the data will change, how often the criteria will change, and how frequently it will be queried.
If there is a finite set of criteria (that is, the data will be viewed in a limited number of ways) which only change every few days and it will be queried like crazy, you should probably look at materialized views.
If the criteria is nearly infinite, then there's no point making materialized views since they won't likely be reused. The same holds true if the criteria itself changes extremely frequently, the data in a materialized view wouldn't help in this case either.
The other question that's unanswered is how often the source data is updated, and how important is it to have the newest information. Frequently updated source day can either mean a materialized view will get "stale" for some duration or you may be spending a lot of time refreshing the materialized views unnecessarily to keep the data "fresh".
Honestly, 2-3 million records isn't a lot for Oracle anymore, given sufficient hardware. I would probably benchmark simple dynamic queries first before attempting fancy (materialized) view.
As others have said, querying a couple of million rows in Oracle is not really a problem, but then that depends on how often you are doing it - every tenth of a second may cause some load on the db server!
Without more details of your business requirements and a good model of your data its always difficult to provide good performance ideas. It usually comes down to coming up with a theory, then trying it against your database and accessing if it is "fast enough".
It may also be worth you taking a step back and asking yourself how accurate the results need to be. Does the business really need exact values for this query or are good estimates acceptable
Tom Kyte (of Ask Tom fame) always has some interesting ideas (and actual facts) in these areas. This article describes generating a proper dynamic search query - but Tom points out that when you query Google it never tries to get the exact number of hits for a query - it gives you a guess. If you can apply a good estimate then you can really improve query performance times
I am writting an application which needs to periodically (each week for example) loop through several million records ina database and execute code on the results of each row.
Since the table is so big, I suspect that when I call SomeObject.FindAll() it is reading all 1.4million rows and trying to return all the rows in a SomeObject[].
Is there a way I can execute a SomeObject.FindAll() expression, but load the values in a more DBMS friendly way?
Not with FindAll() - which, as you've surmised, will try to load all the instances of the specified type at one time (and, depending on how you've got NHibernate set up may issue a stupendous number of SQL queries to do it).
Lazy loading works only on properties of objects, so for example if you had a persisted type SomeObjectContainer which had as a property a list of SomeObject mapped in such a way that it should match all SomeObjects and with lazy="true", then did a foreach on that list property, you'd get what you want, sort-of; by default, NHibernate would issue a query for each element in the list, loading only one at a time. Of course, the read cache would grow ginormous, so you'd probably need to flush a lot.
What you can do is issue an HQL (or even embedded SQL) query to retrieve all the IDs for all SomeObjects and then loop through the IDs one at a time fetching the relevant object with FindByPrimaryKey. Again, it's not particularly elegant.
To be honest, in a situation like that I'd probably turn this into a scheduled maintenance job in a stored proc - unless you really have to run code on the object rather than manipulate the data somehow. It might annoy object purists, but sometimes a stored proc is the right way to go, especially in this kind of batch job scenario.