note: this Q is looking for a comparison of Hibernate named queries and ordinary session queries. Hibernate Criteria is of no concern within the context of this Q.
from what i know, named queries are those parsed once when the system starts up, and can be used from everywhere throughout the application. so - w/named queries, the query isn't parsed from scratch for each caller of that query and this is the major gain in named queries.
but then -
is there a difference between how Hibernate operates its caches for named- and ordinary-queries? if so- what is this?
is there any loss in turning ordinary Hibernate queries into named-queries?
i've had a discussion w/a colleague. he thinks that, before i should go turning ordinary queries into named queries, i should device some metrics and write tests just to prove how named-queries is performing better.
i think this-- generating metrics and writing tests just for the sake of measuring how/whether named queries perform better than ordinary queries is nothing but burning time into something useless. that's been shown already-- the reason of existence of named queries is just getting the query parsed. what data it's pulling/changing in DB is immaterial. and, Hibernate named queries is being used by many developers.
my Q is -
am i missing something in named queries that is relevant to this discussion?
opinions on how to handle this situation? the options i'm looking at are i.) drop doing anything at all-- let queries as is, ii.) just change named queries-- reverting if disliked wont have burned too much of my time iii.) do those tests-- if i would consider this as an option.
TIA.
Short answer - Use it if you can. But if you already have queries that are working fine with tests that cover its functionality, I wouldn't recommend that you go converting them.
Another SO post addressing this can be found here:
Advantages of Named queries in hibernate?
When you have an application that is constantly querying a database for information, in terms of performance, and database usage, is it better to have one big query that pulls in all the data at once or is it better to have a bunch of smaller query's that pull in the data one at a time. Does it matter?
Im trying to figure out if I should query my entire class once any value in the class is requested. Or to only query individual values as they are needed.
If others are paying for your usage, I would recommend pulling as much data as you can in a single connection and work with it locally. However, I don't know what 'usage' is defined as. It could be connections, bandwidth, operations, etc.
I am trying to layout the tables for use in new public-facing website. Seeing how there will lots more reading than writing data (guessing >85% reading) I would like to optimize the database for reading.
Whenever we list members we are planning on showing summary information about the members. Something akin to the reputation points and badges that stackoverflow uses. Instead of doing a subquery to find the information each time we do a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: Is this an appropriate approach to optimizing the database? Or are the subqueries fast enough where the performance would not suffer.
There are two parts:
Caching
Tuned Query
Indexed Views (AKA Materialized views)
Tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against is faster than an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (materialized views in general) are very constrained to what they support - they can't have non-deterministic functions (IE: GETDATE()), extremely limited aggregate support, etc.
If what you need can't be handled by an indexed view, a table where the data is dumped & refreshed via a SQL Server Job is the next alternative. Like the indexed view, indexes would be applied to make fetching data faster. But data change means cleaning up the indexes to ensure the query is running as best it can, and this maintenance can take time.
The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance cache technology (example: memcached) to store query results in your application can be a lot better strategy than trying to trick out the database to be highly scalable.
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say, it's definitely premature to optimize.
You might want to redesign your database a few days later, you might find out that things work pretty fast without any clever hacks, you might find out they work slow, but in a different way than you expected. In either case you would waste your time, if you start optimizing now.
The approach you describe is generally fine; you could get some pre-computed values, either using triggers/SPs to preserve data consistency, or running a job to update these values time-to-time.
All databases are more than 85% read only! Usually high nineties too.
Tune it when you need to and not before.
I've just started my first development job for a reasonably sized company that has to manage a lot of data. An average database is 6gb (from what I've seen so far). One of the jobs is reporting. How it's done currently is -
Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.
When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.
I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.
So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.
The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.
A stored procedure will not run any faster than the same query, executed as a batch of T-SQL. It is the execution plans re-use that result in a performance improvement. The query cost will be the same for the actual T-SQL.
Offloading data to a reporting database is a typical pursuit however you may need to review your indexing strategy on the reporting database as it will likely need to be quite different from that of your OLTP platform for example.
You may also wish to consider using SQL Server Analysis Services in order to service your reporting requirements as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look as building a data warehouse.
I hope this helps but please feel free to request further details.
Cheers, John
In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.
In this case your immediate performance benefits are :
Firstly because if the the SP involves any processing beyond a simple select the processing of the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
You do acquire some benefits in that the SP is stored compiled. This is usually marginal compared to 1. if processing large volumes
However, and in my mind often more important than performance, is the fact that with corporate databases encapsulating the logic inside the database itself provides major management and maintenance benefits:-
Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
SPs can provide a security layer, although this can be overused and overrelied on.
You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Attwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion where you are dealing with a database serving a single application, but is 100% wrong with a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.
The reason for this is if you don't such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of data (a task I regularly used to get dumped with as a junior programmer), or worse will cause applications to crash due to invalid inputs, or even worse not cause them to crash but deliver incorrect business information to the end-users. And if your end user is your finance director then that's your job on the line :-)
It seems to me that there is an additional step in there that, based on your description, appears unneccessary. Here is what I am referring to -
When a report is requested, a stored
procedure is invoked which gathers the
data into a format required for a
report, and forwarded to another
stored procedure which transforms the
data into a view, and forwards THAT
off to a PHP framework for display.
A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format in which it is in after the first sproc? If not, that stage seems unneccessary to me.
I'm assuming that your reports database is a data warehouse and that data is ETL'ed and stored within in a format for the purposes of reporting. Where I currently work, this is common practice.
As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.
I found that stored procedures help with large data sets because they eliminate a ton of network traffic, which can be a huge performance bottleneck depending on how large the data set actually is.
When processing large numbers of rows, where indexes are available and the SQL is relatively tuned, the database engine performing set-based operations directly on the data - through SQL, say - will almost always outperform row-by-row processing (even on the same server) in a client tool. The data is not crossing any physical or logical boudaries to leave the database server processes or to leave the database server and go out across the network. Even performing RBAR (row by agonizing row) on the server will be faster than performing it in a client tool, if only a limited amount of data really needs to ever leave the server, because...
When you start to pull more data across networks, then the process will slow down and limiting the number of rows at each stage becomes the next optimization.
All of this really has nothing to do with stored procedures. Stored procedures (in SQL Server) no longer provide much performance advantages over batch SQL. Stored procedures do provide a large number of other benefits like modularization, encapsulation, security management, design by contract, version management. Performance, however is no longer an advantage.
Generally speaking stored procedures have a number of advantages over direct queries. I can't comment on your complete end to end process, however, SPs will probably perform faster. For a start a direct query needs to be compiled and an execution plan worked out every time you do a direct query - SPs don't.
There are other reasons, why you would want to use stored procedure - centralisation of logic, security etc.
The end to end process does look a little complicated but there may be good reasons for it simply due to the data volume - it might well be that if you run the reports on the main database, the queries are slowing down the rest of the system so much that you'll cause problems for the rest of the users.
Regarding the stored procedures, their main advantage in a scenario like this is that they are pre-compiled and the database has already worked out what it considers to be the optimal query plan. Especially with the data volumes you are talking about, this might well result in a very noticeable performance improvement.
And yes, depending on the complexity of the report, a query like this can take half an hour or longer...
This reporting solution seems to have been designed by people that think the database is the centre of the world. This is a common and valid view – however I don’t always hold to it.
When moving data between tables/databases, it can be a lot quicker to use stored procs, as the data does not need to travel between the database and the application. However in most cases, I would rather not use stored proc as they make development more complex, I am in the ORM camp myself. You can sometimes get great speedups by loading lots into RAM and processing it there, however that is a totally different way of coding and will not allow the reuse of the logic that is already in the stored procs. Sorry I think you are stack with stored proc while in that job.
Giving the amount of data being moved about, if using SQL server I would look at using SSIS or DTS – oracle will have something along the same line. SSIS will do the data transformations on many threads while taking care of a lot of the details for you.
Remember the design of software has more to do with the history of the software and the people working it in, than it has to do with the “right way of doing it”. Come back in 100 years and we may know how to write software, at present it is mostly a case of the blind leading the blind. Just like when the first bridges were build and a lot of them fell down, no one could tell you in advance witch bridge would keep standing and why.
Unlike autogenerated code from an ORM product, stored procs can be performance tuned. This is critical in large production environment. There are many ways to tweak performance that are not available when using an ORM. Also there are many many tasks performed by a large database which have nothing to do with the user interface and thus should not be run from code produced from there.
Stored procs are also required if you want to control rights so that the users can only do the procedures specified in the proc and nothing else. Otherwise, users can much more easily make unauthorized changes to the databases and commit fraud. This is one reason why database people who work with large business critical systems, do not allow any access except through stored procs.
If you are moving large amounts of data to other servers though, I would consider using DTS (if using SQL Server 2000) or SSIS. This may speed up your processes still further, but it will depend greatly on what you are doing and how.
The fact that sps may be faster in this case doesn't preclude that indexing may be wrong or statistics out of date, but generally dbas who manage large sets of data tend to be pretty on top of this stuff.
It is true the process you describe seems a bit convoluted, but without seeing the structure of what is happening and understanding the database and environment, I can't say if maybe this is the best process.
I can tell you that new employees who come in and want to change working stuff to fit their own personal predjudices tend to be taken less than seriously and then you will have little credibility when you do need to suggest a valid change. This is particularly true when your past experience is not with databases of the same size or type of processing. If you were an expert in large systems, you might be taken more seriously from the start, but, face it, you are not and thus your opinion is not likely to sway anybody until you have been there awhile and they have a measure of your real capabilities. Plus if you learn the system as it is and work with it as it is, you will be in a better position in six months or so to suggest improvements rather than changes.
I could perhaps come up with more, but a few points.
Assuming a modern DB, stored procedures probably won't actually be noticeably faster than normal procedures due to caching and the like.
The security benefits of Stored procedures are somewhat overrated.
Change is evil. Consistency is king.
I'd say #3 trumps all other concerns unless stored procedures are causing a legitimate problem.
The faster way for reporting is to just read all data into memory (64 bit OS required) and just walk the objects. This is of course limited to ram size (affordable 32 GB) and reports where you hit a large part of the db. No need to make the effort for small reports.
In the old days I could run a report querying over 8 million objects in 1.5 seconds. That was in about a gigabyte of ram on a 3GHz pentium 4. 64 bit should be about twice as slow, but that is compensated by faster processors.
How much database performance overhead is involved with using C# and LINQ compared to custom optimized queries loaded with mostly low-level C, both with a SQL Server 2008 backend?
I'm specifically thinking here of a case where you have a fairly data-intensive program and will be doing a data refresh or update at least once per screen and will have 50-100 simultaneous users.
In my experience the overhead is minimal, provided that the person writing the queries knows what he/she is doing, and take the usual precautions to ensure the generated queries are optimal, that the necessary indexes are in place etc etc. In other words, the database impact should be the same; there is a minimal but usually negligible overhead on the app side.
That said... there is one exception to this; if a single query generates multiple aggregates the L2S provider translates it to a large query with one sub-query per aggregate. For a large table this can have a significant I/O impact as the db I/O cost for the query grows by magnitudes for each new aggregate in the query.
The workaround for that is of course to move the aggregates to stored proc or view. Matt Warren has some sample code for an alternative query provider that translate that kind of queries in a more efficient way.
Resources:
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=334211
http://blogs.msdn.com/mattwar/archive/2008/07/08/linq-building-an-iqueryable-provider-part-x.aspx
Thanks Stu. Bottom line seems to be that LINQ to SQL probably doesn't have a significant database performance overhead with the newer versions if you are able to use a compiled select, and the slower functions of updating are likely to be faster unless you have a REALLY sharp expert doing most of the coding.