I'm currently researching ways to speed up and scale up a long running matching job which is currently running as a stored procedure in MSSQL 2005. The matching is involves multiple fields with many inexact cases. While I'd like to ultimately scale it up to large scale data sets outside of the database I need to consider some shorter term solutions also.
Given that I don't know much about the internal implementation of how they are run I'm wondering if it were possible to split the process up into parallel procedures by dividing the data set with a master procedure, which then kicks off subprocs which work on smaller data sets.
Would this yield any performance gains with a clustered database? Will MSSQL distribute the subprocs across the cluster nodes automatically and sensibly?
Perhaps it's better to have the master process in java and call worker procedures through jdbc which would presumably use cluster load balancing effectively? Aside from any arguments about maintainability could this be faster?
You have a fundamental misunderstanding of what clustering means for SQL Server. Clustering does not allow a single instance of SQL Server to share the resources of multiple boxes. Clustering is a high availability solution that allows the functionality of one box to shift over to another standby box in case of a failure.
Related
I have a sql server running on my machine.It contains 10 data base file.
say
a
b
....
z
so my question is 10 or more database or 1 single database is best for sql server .Does more database cause more performance issue on single server machine? what is recommended?
You may think like:
"Using multiple databases helps like they are outer index and it can be helpfull for search times.
Think like that, when searching begins, your database server takes your query it will go the firstly to your table and it will execute query on that table and which helps for querying time because datas on other tables will not be looked only your table index will be looked at tables table. :)
In same manner when you group your tables on different dbs query will begin to look just table index of that table on tables table and because there will be less table in that table finding your tables table id will going to complete in less time. :) "
But that is not correct! If you dont have millions of tables it will not going to impact because datastructures used on dbs mostly acces data in O(log(n)) and that means that if(Big if) accesing in 1,000,000 input takes 6 step complete then 100,000 will take 5 step and 1,000 will take 3. As you can see it not makes difference.
On the other hand using 2 db guarantees that it has to be at least 2 connections and connections are expensive things and that is why connection pools are exist.
Mostly Common issue is for poor database design
The causes for performance problems can be various, but the most common are a poorly designed database, incorrectly configured system, insufficient disk space or other system resources, excessive query compilation and recompilation, bad execution plans due to missing or outdated statistics, and queries or stored procedures that have long execution times due to improper design
Memory bottlenecks are caused by limitations in available memory and memory pressure caused by SQL Server, system, or other application activity. Poor indexing requires table scans which in case of large tables means that a large number of rows is read from disk and handled in memory
Network bottlenecks are caused by overload on a server or network, so the data cannot flow as expected
I/O issues can be caused by slow hardware used, bad storage solution design, and configuration. Besides hardware components, such as disk types, disk array type, and RAID configuration that affect I/O performance, unnecessary requests made by a database also affect I/O traffic. Frequent index scans, inefficient queries, and out of date statistics can also cause I/O workload and bottlenecks
- See more at: http://www.sqlshack.com/dba-guide-sql-server-performance-troubleshooting-part-1-problems-performance-metrics/#sthash.QrzEyKbz.dpuf
Multiple Database is not a problem for performance.
you can see these links. I think it will help you about understanding performance tuning :D
I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initial looking at using the MEAN stack with MongoDB and migrating some data to MySql for reporting purposes. I am also wondering about the use of some sort of queue-ing before insert into db or caching like memcached
I didn't manage to find much help on this elsewhere. I did see this post but it is now close to 5 years old, feels a bit outdated and don't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for the heavy insert workload, I suggest to understand the main overheads involved in inserting data in a database. Once the various overheads are understood, all kings of optimizations will come to you naturally. The bonus is that you will both have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQl, Oracle, etc.).
I'm first making a non-exhaustive list of insertion overheads and then show simple solutions to avoid such overheads.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modification to the database ahead of time and require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row in a database page.
Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done you can re-created them. For example, it is significantly faster to create an index from scratch rather than populate it through individual insertions.
In practice
Now let's see how we can apply these concepts to your particular design. The main issues I see in your case is that the insert requests are sent by many distributed clients so there is little chance for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I dont think memcached is good for implementing such a caching layer -- memcached is typically used to cache query results not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source but not all features are free so I'm not sure if you need to pay for a license or not. If you cannot use VoltDB you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the result as a new set of tables/materialized for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.
I have a very large (100+ gigs) SQL Server 2005 database that receives a large number of inserts and updates, with less frequent selects. The selects require a lot of indexes to keep them functioning well, but it appears the number of indexes is effecting the efficiency of the inserts and updates.
Question: Is there a method for keeping two copies of a database where one is used for the inserts and updates while the second is used for the selects? The second copy wouldn't need to be real-time updated, but shouldn't be more than an hour old. Is it possible to do this kind of replication while keeping different indexes on each database copy? Perhaps you have other solutions?
Your looking to setup a master/child database topology using replication. With SQL server you'll need to setup replication between two databases (preferrably on separate hardware). The Master DB you should use for inserts and updates. The Child will service all your select queries. You'll want to also optimize both database configuration settings for the type of work they will be performing. If you have heavy select queries on the child database you may also want to setup view's that will make the queries perform better than complex joins on tables.
Some reference material on replication:
http://technet.microsoft.com/en-us/library/ms151198.aspx
Just google it and you'll find plenty of information on how to setup and configure:
http://search.aim.com/search/search?&query=sql+server+2005+replication&invocationType=tb50fftrab
Transactional replication can do this as the subscriber can have a number of aditional indexes compared with the publisher. But you have to bear in mind a simple fact: all inserts/updates/deletes are going to be replicated at the reporting copy (the subscriber) and the aditional indexes will... slow down replication. It is actually possible to slow down the replication to a rate at wich is unable to keep up, causing a swell of the distribution DB. But this is only when you have a constant high rate of updates. If the problems only occur durink spikes, then the distribution DB will act as a queue that absorbes the spikes and levels them off during off-peak hours.
I would not take this endevour without absolute, 100% proof evidence that it is the additional indexes that are slowing down the insert/updates/deletes, and w/o testing that the insert/updates/deletes are actually performing significantly better without the extra indexes. Specifically , ensure that the culprit is not the other usual suspect: lock contention.
Generally, all set-based operations (including updating indexes) are faster than non set-based ones
1,000 inserts will most probably be slower than one insert of 1,000 records.
You can batch the updates to the second database. This will, first, make the index updating more fast, and, second, smooth the peaks.
You could task schedule a bcp script to copy the data to the other DB.
You could also try transaction log shipping to update the read only db.
Don't forget to adjust the fill factor when you create your two databases. It should be low(er) on the database with frequent updates, and 100 on your "data warehouse"/read only database.
Hello I am creating a windows application that will be installed in 10 computers that will access the same database thru Entity Framework.
I was wondering what's better:
Spread the queries into packets (i.e. load contact then attach the included navigation properties - [DataContext.Contacts.Include("Phone"]).
Load everything in one query rather then splitting it out in individual queries.
You name it.
BTW I have a query that its trace string produced over 500 lines of sql, im doubting, maybe i should waive user-exprience for performance since performance is also a part of u.e.
You could put your SQL in stored procedures and write your Entity Framework logic to use the procedures instead of generating the SQL and sending it over the wire.
As with everything database related, it depends. Things like the connection type (LAN vs WAN), how you handle caching, database load level, type of database load (writes vs reads) etc, can all make a difference.
But in general, whenever you can reduce the number of round trips to the database that's a good thing. And remember: you can have more than one result set after executing a single SqlCommand.
Load everything in one query rather
then splitting it out in individual
queries.
This will normally be superior. You're usually better off writing chunkier queries than chatty ones. Fewer calls have less overhead - you need to obtain fewer connections, deal with less latency, etc..
Does the database server have to support other applications? For most business software applications, SQL server won't even break a sweat servicing ten clients - particularly performing basic entity lookups. It won't even really know you're there unless it's installed on a 486SX.
Is it possible to configure multiple database servers (all hosting the same database) to execute a single query simultaneously?
I'm not asking about executing queries using multiple CPUs simultaneously - I know this it possible.
UPDATE
What I mean is something like this:
There are two 2 servers: Server1 and Server2
Both server host database Foo and both instances of Foo are identical
I connect to Server1 and submit a complicated (lots of joins, many calculations) query
Server1 decides that some calculations should be made on Server2 and some data should be read from that server, too - appropriate parts of the query are sent to Server2
Both servers read data and perform necessary calculations
Finally, results from Server1 and Server2 are merged and returned to the client
All this should happen automatically, without need to explicitly reference Server1 or Server2. I mean such parallel query execution - is it possible?
UPDATE 2
Thanks for the tips, John and wuputah.
I am researching alternatives of increasing both availability and capacity of MOSS database backend. So what I'm looking for is some kind out-of-the-box SQL Server load balancing solution that would be transparent to the application, because I cannot modify the application in any way. I guess SQL Server has no such feature (and Oracle, as far as I understand it, does - it is RAC mentioned by wuputah).
UPDATE 3
A quote from the Top Tips for SQL Server Clustering article:
Let's start by debunking a common
misconception. You use MSCS clustering
for high availability, not for load
balancing. Also, SQL Server does not
have any built-in, automatic
load-balancing capability. You have to
load balance through your
application's physical design.
What you're really talking about is a clustering solution. It looks like SQL Server and Oracle have solutions to this, but I don't know anything about them. I can guess they would be very costly to buy and implement.
Possible alternate suggestions would be as follows:
Use master-slave replication, and do your complex read queries from the slave. All writes must go to the master, which are then sent to the slave, so things stay in sync. This helps things go faster because the slave only has to worry about the writes coming from the master, which are already predetermined on behalf of the slave (no deadlocks etc). If you're looking to utilize multiple servers, this is the first place I would start.
Use master-master replication. This means that all writes from both servers go to each other, so they stay in sync (at least theoretically). This has some of the benefits as master-slave but you don't have to worry about writes going to one server instead of the other. The more common use of master-master replication is for failover support; master-slave is really better suited to performance.
Use the feature John Sansom talked about. I don't know much about it, but it seems its basis is splitting your database into tables on different servers, which will have some benefits as well as drawbacks. The big issue is that since the two systems can't share memory, they will have to share a lot of data over the network to compute complex joins.
Hope this helps!
RE Update 1:
If you can't modify the application, there is hope, but it might be a bit complicated. If you were to set up master-slave replication, you can then set up a proxy to send read queries to the slave(s) and write queries to the master(s). I've seen this done with MySQL, but not SQLServer. That's a bit of a problem unless you want to write the proxy yourself.
This has been discussed on SO previously, so you can find more information there.
RE Update 2:
Microsoft's clustering might not be designed for performance, but that's Microsoft fault. That's still the level of complexity you're talking about here. If they say it won't help, then your options are limited to those above and by what you do with your application (like sharding, splitting into multiple databases, etc).
Yes I believe it is possible, well sort of, let me explain.
You need to look into and research the use of Distributed Queries. A distributed query runs across multiple servers and is typically used to reference data that is not stored locally.
http://msdn.microsoft.com/en-us/library/ms191440.aspx
For example, Server A may hold my Customers table and Server B holds my Orders table. It is possible using distributed queries to run a query that references both Server A and Server B, with each server managing the processing of its local data (which could incorporate the use of parallelism).
Now in theory you could store the exact same data on each server and design your queries specifically so that only certain table were referenced on certain servers, thereby distributing the query load. This is not true parallel processing however, in terms of CPU.
If your intended goal is to distribute the processing load of your application then the typical approach with SQL Server is to use Replication to distribute data processing across multiple servers. This method is also not to be confused with parallel processing.
http://databases.about.com/cs/sqlserver/a/aa041303a.htm
I hope this helps but of course please feel free to pose any questions you may have.
Interesting question, but I'm struggling to get my head around this being beneficial for a multi-user system.
If I'm the only user having half my query done on Server1 and the other half on Server2 sounds cool :)
If there are two concurrent users (lets say with queries of identical difficulty) then I'm struggling to see that this helps :(
I could have identical data on both servers and load balancing - so I get Server1, my mate gets Server2 - or I could have half the data on Server1 and the other half on Server2, and each will be optimised, and cache, just their own data - spreading the load. But whenever you have to do a merge to complete a query the limiting factor becomes the pipe-size between them.
Which is basically Federated Database Servers. Instead of having all my Customers on one server and all my Orders on the other I could, say, have my USA customers and their orders on one, and my European customers/orders on the other, and only if my query spans both is there any need for a merge step.