Hello, I am creating a Windows application that will be installed on 10 computers, all accessing the same database through Entity Framework.
I was wondering what's better:
Spread the queries into packets (i.e. load a contact, then attach the included navigation properties - DataContext.Contacts.Include("Phone")).
Load everything in one query rather than splitting it out into individual queries.
You name it.
BTW, I have a query whose trace string produced over 500 lines of SQL. I'm having doubts; maybe I should sacrifice some user experience for performance, since performance is also a part of the user experience.
You could put your SQL in stored procedures and write your Entity Framework logic to use the procedures instead of generating the SQL and sending it over the wire.
As with everything database related, it depends. Things like the connection type (LAN vs WAN), how you handle caching, database load level, type of database load (writes vs reads), etc., can all make a difference.
But in general, whenever you can reduce the number of round trips to the database that's a good thing. And remember: you can have more than one result set after executing a single SqlCommand.
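For illustration, here is a rough JDBC sketch of the same idea (the answer above refers to ADO.NET's SqlCommand, but the pattern is the same): one round trip that returns the contact rows and the phone rows together. The stored procedure name is hypothetical.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OneRoundTrip {
    // Assumes a stored procedure that returns two result sets in one call.
    public static void loadContact(Connection con, long contactId) throws SQLException {
        try (CallableStatement stmt = con.prepareCall("{call load_contact_with_phones(?)}")) {
            stmt.setLong(1, contactId);
            try (ResultSet contacts = stmt.executeQuery()) {
                while (contacts.next()) {
                    // map the contact columns
                }
            }
            if (stmt.getMoreResults()) {           // second result set, same round trip
                try (ResultSet phones = stmt.getResultSet()) {
                    while (phones.next()) {
                        // map the phone columns
                    }
                }
            }
        }
    }
}
```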
Load everything in one query rather than splitting it out into individual queries.
This will normally be superior. You're usually better off writing chunkier queries than chatty ones. Fewer calls have less overhead: you need to obtain fewer connections, deal with less latency, etc.
Does the database server have to support other applications? For most business software applications, SQL Server won't even break a sweat servicing ten clients, particularly when performing basic entity lookups. It won't even really know you're there unless it's installed on a 486SX.
It's obvious that an HQL query is slower than a native one. There is a project in the works that will need to perform a huge number of small transactions. So the question is, which will perform better:
a native query in JPA
a native query through JDBC
How big is the difference? Because of JPA's mapping capabilities and prepared statements I would prefer it, but given the performance requirements it could be that Hibernate will not be fast enough...
EDIT
replaced "Huge amount of data to process" to "huge amount of small transactions to perform"
You have given way too little info.
Huge amount of data to process
How huge is that? Like a few tens of millions of records, or a few hundred gigabytes?
How do you plan to process that data? If you pull it over to the Java side via Hibernate (or a native query, or JDBC), then you are probably on the wrong track. You should keep the data in the database and process it there with the tools the database offers for that purpose; that is light-years more performant than any client-side processing. Consider database-side processing (Oracle's PL/SQL, MS SQL Server's Transact-SQL).
What are you planning to optimize? Data insertion? Data retrieval? Are you planning to use select statements which dig through lots of data? Then consider an OLAP solution over classic OLTP. OLAP solutions are built for business intelligence and for analyzing huge amounts of data with a few clever tricks. Google it (OLAP, decision cubes).
Can you use any of the capabilities of the underlying SQL engine? For example, if you are using Oracle, you have 1000 times more features available than you can actually use through Hibernate. For example, you simply cannot issue an Oracle Text query from Hibernate, and there are really a lot of things you cannot do.
I would sum this up by saying that the real performance difference is not between native SQL and HQL. Instead:
Hibernate is powerful at what it was built for: handling small numbers of records and optimistically caching them locally on the Java side, as a groundwork for data-processing systems built on databases which are not capable of server-side processing (but only for simple selects, inserts, updates and deletes).
Once you really have to move a huge amount of data, Java-side processing is not an option. Programmers in the '80s invented stored procedures for exactly this reason. Pick a database which supports database-side processing, so you can cut out all the network round trips and the imperative for-loop processing in your Java code. Prefer as much declarative SQL as you can, and run the processing in your database (a small sketch follows this list).
Once you are about to start using the features of your database, Hibernate will pretty much be in your way. It is really built as an ORM wrapper; however, data processing problems are not always CRUD, ORM-able problems. For example, Hibernate is not too useful for an OLAP use case.
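To make the point concrete, here is a minimal sketch of pushing the work to the database instead of looping over rows in Java. The package and procedure name (payroll_pkg.close_month) are hypothetical and assume a server-side routine already exists:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public class ServerSideProcessing {
    // One round trip: the database does the set-based work, and no
    // row-by-row traffic crosses the network to the Java side.
    public static void closeMonth(Connection con, int fiscalMonth) throws SQLException {
        try (CallableStatement stmt = con.prepareCall("{call payroll_pkg.close_month(?)}")) {
            stmt.setInt(1, fiscalMonth);
            stmt.execute();
        }
    }
}
```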
Huge amount of small transactions
In the case of having lots of small transactions (inserting/updating data in the database), Hibernate does not have any performance advantage (since no Java-side caching can be utilized), regardless of the database in use. However, you may still prefer Hibernate, since it is a nice tool for converting Java objects to SQL statements.
But with Hibernate, you are not at all in control of what is happening. For example, Hibernate + Oracle, inserting new entities into the database: this is the worst performance nightmare you can imagine.
This is what Hibernate does:
select one new id from a sequence
execute one insert
repeat a zillion times
This performs very poorly. (The sequence reference should be part of the insert statement, and the inserts should use JDBC batching.)
I found that a JDBC-based approach runs about 1000 times quicker than Hibernate in this particular use case (prefetch the next 100 sequence values, use JDBC batch mode for Oracle, bind variables to the batch, send the records down in batches of 100, and use asynchronous commit; yet again something you cannot control in Hibernate).
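A minimal sketch of that kind of JDBC batching (table, sequence, and column names are made up; the sequence prefetch and asynchronous commit are Oracle-side settings and are not shown here):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertExample {
    // The sequence reference lives inside the insert statement, so there is
    // no separate "select seq.nextval" round trip per row.
    private static final String SQL =
        "INSERT INTO orders (id, customer_id, amount) " +
        "VALUES (orders_seq.NEXTVAL, ?, ?)";

    public static void insertAll(Connection con, List<Order> orders) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            int inBatch = 0;
            for (Order o : orders) {
                ps.setLong(1, o.customerId);
                ps.setBigDecimal(2, o.amount);
                ps.addBatch();
                if (++inBatch == 100) {        // send records down in batches of 100
                    ps.executeBatch();
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                ps.executeBatch();             // flush the final partial batch
            }
            con.commit();
        }
    }

    // Minimal value holder used by the sketch above.
    public static class Order {
        long customerId;
        java.math.BigDecimal amount;
    }
}
```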
I found that if I want to squeeze the most out of my tools, I need to learn them in depth. Unfortunately you can find lots of opinion-based comments, especially in the Hibernate vs. non-Hibernate wars on the net. Most of them are written by Java developers who have literally no idea about what happens behind the scenes. So don't believe - measure :)
I am currently administering/developing an Access 2010 frontend/SQL backend database. We are trying to improve frontend performance, and one solution that has been suggested is pushing a lot of the VBA that is running the front end down into stored procedures on the server. I'm fairly proficient in VBA, but very new to SQL and network architecture. Everything I've turned up on google has been information about splitting the database, which is already done, rather than information about network loads resulting from running stored procedures vs running VBA.
What is the difference in network traffic between the current setup and pushing this action down to a stored procedure?
As a specific example, if I'm populating a form in the current setup, there are a few queries run to provide data to different elements on the form. With the current architecture, does Access retrieve the queried tables from the backend, query them client-side and then populate the data? How would that be different in terms of network traffic from, say, executing a SP when the form loads, and only transferring the data necessary for displaying the form?
The end goal is to reduce the chattiness between Access and SQL, and I'm mostly trying to figure out exactly what is happening where.
As a general rule, if you open a form with a where clause that restricts it to one record, then a bound form versus a stored procedure will NOT result in any difference or reduction in network traffic.
Any local Access query based on a table will simply request the one record. There is no “local” processing in this regard, EVEN with a linked table. Note the word “table”, singular, here.
Access does not and will not pull down a whole table unless you have forms and queries without any “where” clause to restrict the data pulled.
In other words, if you have a poorly designed form and you dump and change that design to one in which you ONLY pull down the one record, then of course the new setup will result in reduced network traffic.
However, the above reduction is NOT DUE to adopting the stored procedure but ONLY to adopting a design in which you restrict the records requested by the form.
So doing something poorly and then improving that process is NOT a justification to adopt stored procedures.
Thus in the case of pulling records into a form, using a stored procedure will NOT improve performance. Worse, binding a form to a stored procedure results in a form that is READ ONLY anyway!
So stored procedures don’t necessarily increase performance or reduce network traffic when talking about loading a record into a form, in terms of response time or performance.
If you have to do large amounts of recordset processing, then of course adopting a stored procedure can save network bandwidth. So in place of some VBA code to process 100,000 payroll records, then yes, moving such code server side will help. However, processing 100,000 payroll records is NOT a common task and is NOT a user interface issue in most cases anyway. In other words, you don’t have a slow-loading form or slow response time when loading such forms; such types of processing are NOT done interactively by users waiting for a form to load.
SQL server is indeed a high performance system, and also a system that can scale to many users.
If you write your application in C++, or VB, or, in your case, with MS Access, in GENERAL the performance of all of these tools will BE THE SAME.
In other words...sql server is rather nice, and is a standard system used in the IT industry.
However, SQL Server will NOT solve your performance issues without effort on your part. And it turns out that MOST of those same efforts also make your non-SQL Server Access applications run better.
In fact, we see many posts that mention that moving the back-end data to SQL Server actually slowed things down. (And in fact, on a single machine, Access JET (now called ACE) is actually FASTER THAN SQL Server; when there is a single user on the same machine, Access is faster than SQL Server in most cases.)
A few things:
Having a table with 75k records is quite small. Let’s assume you have 12 users. With just a 100% file-based system (JET) and no SQL Server, the performance of that system should really scream.
I have some applications out there with 50, or 60 HIGHLY related tables. With 5 to 10 users on a network, response time is instant. I don't think any form load takes more than one second. Many of those 60+ tables are highly relational and in the 50 to 75k records range.
So, with my 5 users I see no reason why I can’t scale to 15 users with such small tables in the 75,000 record range. And this is without SQL server.
If the application did not perform with such small tables of only 75k records, then upsizing to SQL Server will do absolutely nothing to fix performance issues. In fact, in the SQL Server newsgroups you see weekly posts by people who find that upgrading to SQL Server actually slowed things down.
I have even seen some very cool numbers showing that some queries were actually MORE EFFICIENT in terms of network use with JET than with SQL Server.
My point here is that technology will NOT solve performance problems. However, good designs that make careful use of limited bandwidth resources are the key here. So, if the application was not written with good performance in mind, then you are kind of stuck with a poor design!
I mean, when using a JET file share, if you grab an invoice from the 75k record table, only the one record is transferred down the network (and SQL Server will also only transfer one record). So, at this point, you really will NOT notice any performance difference by upgrading to SQL Server. There is no magic here. And adopting a SQL stored procedure would be an even GREATER waste of time!
And adopting a stored procedure in place of the above will NOT gain you performance either!
SQL Server is a more robust and more scalable product than JET. And security, backup, and a host of other reasons make SQL Server a good choice. However, SQL Server will NOT solve a performance problem when dealing with such small tables as 75k records.
Of course, when efforts are made to utilize sql server, then significant advances in performance can be realized.
I will give a few tips... these apply when using MS Access as a file share (without a server), or even ODBC to SQL Server:
** Ask the user what they need before you load a form!
The above is so simple, but so often I see the above concept ignored. For example, when you walk up to an instant teller machine, does it download every account number and THEN ASK YOU what you want to do?
In Access, it is downright silly to open up a form attached to a table WITHOUT FIRST asking the user what they want! So, if it is a customer invoice, get the invoice number, and then load up the form with the ONE record. How can one record be slow? When you are done editing, the record and the form are closed, and you are back at the prompt, ready to do battle with the next customer.
You can read up on how this "flow" of a good user interface works here (and this applies to both JET, and sql server applications):
http://www.kallal.ca/Search/index.html
My only point here is: restrict the form to only the ONE record the user needs. You don't need, nor gain by, using a stored procedure to accomplish this task. I am always dismayed by how often a developer builds a nice form, attaches it to a large table, then opens it, throws this form attached to some huge table at the users, and tells them to go have at it and have fun. Don't we have any kind of concern for those poor users? Often, the user will not even know how to search for something!
So prompting and asking the user also makes a HUGE leap forward in usability. And the big bonus is reduced network traffic too! Gosh, better and faster, and less network traffic! What more do we want!
** USE CAUTION with queries that require more than one linked table
JET has a really difficult time joining ODBC tables together. Often the Access data engine (JET/ACE) does a good job, but often such joins are slow. However, most forms for editing data are NOT based on a multi-table query (so again, a stored procedure will not speed up form load for editing of data).
The simple solution for such multiple joins (for both forms and reports) is to build the query on the SQL Server side as a view, and then link to that view.
This view approach is MUCH less work than a stored procedure and results in the joins occurring server side. And the resulting views are updatable, as opposed to READ ONLY when you adopt stored procedures. And the performance of such views will again equal that of a stored procedure in THIS context.
So once again, adopting stored procedures DOES NOT help and is more expensive in developer cost than simply using a view. Really, this just amounts to people suggesting that you rack up bills and use developer time to create something that yields nothing over a view except more billable hours.
I don't think it needs pointing out that if the query in question already runs well, then the above can be ignored, but just keep in mind that local queries over more than one table linked to SQL Server can often run slowly. So, just be aware of the above.
This view trick also applies well to combo boxes.
So one can continue to use bound forms against a linked table; simply restrict the form to the ONE RECORD needed.
You can safely open a single invoice form, etc., but simply ENSURE you open such forms (OpenForm) with the records restricted via the "where" clause. No view or stored procedure is required here.
Bound forms are way less work than unbound forms, and performance is generally just as good anyway when done right.
Avoid loading large combo boxes. A combo box is good for about 100 entries. After that you are torturing the user (what, they have to look through 100s of entries?). So, keep things like combo boxes down to a minimum size. This is both faster and, MORE importantly, kinder to your users.
After all, at the end of the day what we really want is to treat users well. It seems that treating the users well, and reducing the bandwidth (amount of data) goes hand in hand.
So, better applications treat the users well and run faster! (this is good news!)
So, #1 tip is to reduce the data that you transfer into a form.
Using stored procedures is not required in the vast majority of cases and will not reduce bandwidth requirements any more than adopting where clauses and views will.
I'm wondering if, under the circumstances that
You get lots more reads than writes
Your SQL server of choice is cheap/free and offers a fast mirroring/replication service
Your database isn't insanely large
rather than having separate SQL servers it would be better to have an instance of SQL Server on each machine, getting instant updates from the master. This way there would be no network latency when doing all the read queries, but there would be a per-box performance hit as the SQL instance has to execute on the same box. Would this be better overall for performance? Are there any other pros/cons that might come up?
Your SQL Server should always be on a different box to the webserver, of that there is no question.
How many DB servers and webservers you have, and how they mirror (or otherwise) is up to how you scale your application.
You have SQL Server on a different machine because it needs (and deserves) a lot of RAM.
It's quite a common architectural pattern to have read-only replicas of a database. We accept some degree of staleness in them; perhaps they are even only updated once a day.
The general rule is that multiple copies introduce complexity in terms of operations and management, and tend to introduce the possibility of data inconsistency; almost inevitably the copies will not be perfectly in step (or the cost of making them so will be too high).
An example: what happens if your replication processing breaks a bit, so that some, but not all, copies become stale? Now your users start to see radically different views of the world. How much might that matter to you? If it's a site with low-value data (e.g. celebrity sightings in London suburbs) then perhaps that's fine. If it's on-hand inventory, and being out of date means that your customers can't place orders, then maybe you care rather more.
My advice: things that sound simple at a boxes-on-paper sort of level don't always work out that way when you're sitting in an operations room at 3 AM. Be very sure that you can easily operate your solution.
How would your SQL Server be cheap/free? I would have said the licensing costs for this setup would be crippling. At retail prices you're looking at $6,000 per server. See also Jeff's comments about costs. Scale out the web servers by all means, but not your SQL Server until it's pretty much on its knees.
You might instead want to think about a distributed cache like Velocity or NCache.
Either way, run your site first with one SQL server and see how it copes with the load, then think about mirroring/replication across servers, otherwise you're just optimising prematurely. Measure first!
An immediate con is that there is no distributed lock co-ordinator in SQL Server so you can get merge conflicts as updates can change the same row on two different servers at the same time.
Depending on the size of the database and the disks in the web servers, you will find your network latency is smaller than the disk latency you will start suffering, as the web server disks will not usually be as performant as the disk array you give to the database. If you wanted that kind of performance, you would be buying it per web server.
Replication performance is not without latency either; the distribution of the transactions isn't 'free', and careful maintenance of the transaction log has to be planned to ensure you do not get log fragmentation (too many VLFs within the transaction log), which kills replication performance.
It seems like the goal of a lot of ORM tools and custom data access layers (DAO pattern, etc.) is to abstract the database to the point where you could supposedly swap out the entire database system with minimal work.
Following the common DAL patterns is usually a good idea in code, but it seems like it would never be minimal work to swap out a database. (Cost, training, data migration, etc.)
Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code? Is it worth it to worry about abstracting the actual database from your code?
Question 1: Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code?
Yes, we tried it. Our customer is using a large MS Access-based Delphi client-server application. After about five years we considered switching to SQL Server. We analyzed the problem and concluded that swapping the database would be very costly and provide only a few advantages. The customer decided not to swap the database. The application is still running fine and the customer is still happy.
Note that:
MS Access is only being used for data storage and report generation.
The server application ensures that MS Access is only being accessed on the server. Normal multi-user MS Access applications will transfer large chunks of the Access database over the network, resulting in slow and unreliable database functionality. This is not the case for this application. Client <> Server <> MS Access. Only the server application communicates with the MS Access database. Actually, the server has exclusive access to the MS Access database; no other computer can open the MS Access database. Conclusion: MS Access is being used as a true RDBMS, a Relational DataBase Management System - please no flaming about MS Access being inferior and unstable - it has been running fine for more than 10 years.
The most important issues you will have to consider:
SQL statements: (SELECT, UPDATE, DELETE, INSERT, CREATE TABLE) and make sure they would be compatible with the SQL database. It's amazing how much all the RDBMS differ in the details (date formats, number formats, search formats, string formats, join syntax, create table syntax, stored procedures, user defined functions, (auto) primary keys, etc.)
Report generation: Depending on your database you might be using a different reporting tool. Our customer has over 200 complex reports. Converting all these reports is very time consuming.
Performance: all RDBMSs perform differently in different environments. Normally, performance optimizations are very much RDBMS dependent.
Costs: the costs of tools, developers, server and user licenses varies greatly. It ranges from free to very expensive. Free does not mean cheap and expensive does not always equate to good. A cost/value comparison will have to be made.
Experience: making the best use of your RDBMS requires experience. If you have to develop for an "unknown" RDBMS your productivity will suffer.
Question 2: Is it worth it to worry about abstracting the actual database from your code?
Yes. In an ideal world, swapping a database would just be adjusting the data connection string. In the real world this is not possible because all databases are different. They all have tables and SQL support but the differences are in the details. If you can keep the differences of the databases shielded through abstraction - please do so. Make a list of the databases you need to support. Check the selected database systems for the differences. Provide centralized code to handle the differences. Support one RDBMS and provide stubs for future support of other RDBMS.
I disagree that the purpose is to be able to swap out databases, and I think you are correct in showing some suspicion about ORMs leading towards that goal.
However, I would still use an ORM, as it abstracts away the details of data access. Isn't this the goal of object oriented programming? Keep your concerns separated.
I think the primary use case for database abstraction (via ORM tools) is to be able to ship a product that works with multiple database brands. I believe it's a rarer occurrence for a company to switch between database vendors, but that's still one of the use cases.
I've worked jobs where we started out using MySQL for monetary reasons (think a startup) and, once we started making money, wanted to switch to Oracle. We didn't end up making the switch, but it was nice to have the option.
Still, ORM tools are not completely leak-free abstractions, and I know our migration would still have been painful and costly. It totally depends on what you are building, but it has been my experience that -- for performance reasons, usually -- you end up either working around your ORM solution or exploiting vendor-specific features at some point.
The only time I've seen a database switch was from HSQL during early development to Oracle as the project progressed. The ORM made this easy.
I often use the DAO pattern to swap out data services (from a database to web service or to swap a web service to a test stub).
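A rough sketch of the idea in Java (the names are made up for illustration): the rest of the code depends only on the interface, so the backing implementation (database, web service, or test stub) can be swapped.

```java
// Minimal value object for the sketch.
class Customer {
    final long id;
    final String name;
    Customer(long id, String name) { this.id = id; this.name = name; }
}

// The rest of the application only talks to this interface.
interface CustomerDao {
    Customer findById(long id);
    void save(Customer customer);
}

// One implementation hits the real database (JDBC or an ORM)...
class JdbcCustomerDao implements CustomerDao {
    public Customer findById(long id) {
        // ... JDBC / ORM lookup would go here ...
        throw new UnsupportedOperationException("sketch only");
    }
    public void save(Customer customer) {
        // ... JDBC / ORM insert or update would go here ...
    }
}

// ...another is a test stub; a web-service-backed version would look
// exactly the same from the caller's point of view.
class StubCustomerDao implements CustomerDao {
    public Customer findById(long id) { return new Customer(id, "Test Customer"); }
    public void save(Customer customer) { /* no-op for tests */ }
}
```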
For ORM, I don't think the goal is to enable you to switch databases; it is to hide the complexities of different database implementations from you and remove the need to worry about the fine details of translating between relational and object representations of your data.
By having someone smart write an ORM that handles caching, only updates fields that have changed, groups updates, etc., I don't need to. Although in the cases where I need something special, I can still revert to SQL if I want.
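For example, here is a hedged sketch of dropping down to plain SQL through JPA when the ORM gets in the way (table and column names are hypothetical):

```java
import javax.persistence.EntityManager;
import javax.persistence.Query;
import java.util.List;

public class NativeQueryFallback {
    // Most access goes through the ORM, but for the odd case you can
    // still hand the database a native SQL statement.
    @SuppressWarnings("unchecked")
    public static List<Object[]> topCustomers(EntityManager em) {
        Query q = em.createNativeQuery(
            "SELECT c.id, c.name, SUM(o.amount) AS total " +
            "FROM customers c JOIN orders o ON o.customer_id = c.id " +
            "GROUP BY c.id, c.name ORDER BY total DESC");
        q.setMaxResults(10);
        return q.getResultList();
    }
}
```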
Is it possible to configure multiple database servers (all hosting the same database) to execute a single query simultaneously?
I'm not asking about executing queries using multiple CPUs simultaneously - I know that is possible.
UPDATE
What I mean is something like this:
There are two servers: Server1 and Server2
Both servers host database Foo, and both instances of Foo are identical
I connect to Server1 and submit a complicated (lots of joins, many calculations) query
Server1 decides that some calculations should be made on Server2 and some data should be read from that server, too - appropriate parts of the query are sent to Server2
Both servers read data and perform necessary calculations
Finally, results from Server1 and Server2 are merged and returned to the client
All this should happen automatically, without need to explicitly reference Server1 or Server2. I mean such parallel query execution - is it possible?
UPDATE 2
Thanks for the tips, John and wuputah.
I am researching alternatives for increasing both the availability and the capacity of a MOSS database backend. So what I'm looking for is some kind of out-of-the-box SQL Server load-balancing solution that would be transparent to the application, because I cannot modify the application in any way. I guess SQL Server has no such feature (and Oracle, as far as I understand it, does: it is the RAC mentioned by wuputah).
UPDATE 3
A quote from the Top Tips for SQL Server Clustering article:
Let's start by debunking a common misconception. You use MSCS clustering for high availability, not for load balancing. Also, SQL Server does not have any built-in, automatic load-balancing capability. You have to load balance through your application's physical design.
What you're really talking about is a clustering solution. It looks like SQL Server and Oracle have solutions to this, but I don't know anything about them. I can guess they would be very costly to buy and implement.
Possible alternate suggestions would be as follows:
Use master-slave replication, and do your complex read queries from the slave. All writes must go to the master and are then sent to the slave, so things stay in sync. This helps things go faster because the slave only has to worry about the writes coming from the master, which are already predetermined on behalf of the slave (no deadlocks, etc.). If you're looking to utilize multiple servers, this is the first place I would start.
Use master-master replication. This means that all writes from both servers go to each other, so they stay in sync (at least theoretically). This has some of the same benefits as master-slave, but you don't have to worry about writes going to one server instead of the other. The more common use of master-master replication is for failover support; master-slave is really better suited to performance.
Use the feature John Sansom talked about. I don't know much about it, but it seems its basis is splitting your database into tables on different servers, which will have some benefits as well as drawbacks. The big issue is that since the two systems can't share memory, they will have to share a lot of data over the network to compute complex joins.
Hope this helps!
RE Update 1:
If you can't modify the application, there is hope, but it might be a bit complicated. If you were to set up master-slave replication, you could then set up a proxy to send read queries to the slave(s) and write queries to the master(s). I've seen this done with MySQL, but not SQL Server, which is a bit of a problem unless you want to write the proxy yourself.
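To illustrate the decision such a proxy (or a modifiable data-access layer) has to make, here is a very rough sketch in Java; the names and the naive "starts with SELECT" rule are made up, and a real proxy also has to handle transactions, replication lag, and failover:

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Minimal read/write splitting: reads go to a replica, everything else to the master.
public class ReadWriteRouter {
    private final DataSource master;   // all writes go here
    private final DataSource replica;  // read-only queries can go here

    public ReadWriteRouter(DataSource master, DataSource replica) {
        this.master = master;
        this.replica = replica;
    }

    public Connection connectionFor(String sql) throws SQLException {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        return isRead ? replica.getConnection() : master.getConnection();
    }
}
```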
This has been discussed on SO previously, so you can find more information there.
RE Update 2:
Microsoft's clustering might not be designed for performance, but that's Microsoft's fault. That's still the level of complexity you're talking about here. If they say it won't help, then your options are limited to those above and to what you do with your application (like sharding, splitting into multiple databases, etc.).
Yes I believe it is possible, well sort of, let me explain.
You need to look into and research the use of Distributed Queries. A distributed query runs across multiple servers and is typically used to reference data that is not stored locally.
http://msdn.microsoft.com/en-us/library/ms191440.aspx
For example, Server A may hold my Customers table and Server B holds my Orders table. It is possible using distributed queries to run a query that references both Server A and Server B, with each server managing the processing of its local data (which could incorporate the use of parallelism).
Now, in theory you could store the exact same data on each server and design your queries specifically so that only certain tables are referenced on certain servers, thereby distributing the query load. This is not true parallel processing, however, in terms of CPU.
If your intended goal is to distribute the processing load of your application then the typical approach with SQL Server is to use Replication to distribute data processing across multiple servers. This method is also not to be confused with parallel processing.
http://databases.about.com/cs/sqlserver/a/aa041303a.htm
I hope this helps but of course please feel free to pose any questions you may have.
Interesting question, but I'm struggling to get my head around this being beneficial for a multi-user system.
If I'm the only user having half my query done on Server1 and the other half on Server2 sounds cool :)
If there are two concurrent users (let's say with queries of identical difficulty) then I'm struggling to see that this helps :(
I could have identical data on both servers and load balancing - so I get Server1, my mate gets Server2 - or I could have half the data on Server1 and the other half on Server2, and each will be optimised, and cache, just their own data - spreading the load. But whenever you have to do a merge to complete a query the limiting factor becomes the pipe-size between them.
Which is basically Federated Database Servers. Instead of having all my Customers on one server and all my Orders on the other I could, say, have my USA customers and their orders on one, and my European customers/orders on the other, and only if my query spans both is there any need for a merge step.