I have a web-based application that allows users to create their own complicated queries using a simplified scripting language and GUI. The problem is, sometimes my users are, well, not so bright. Often, they'll create a query that does massive joins or employs pointless comparisons over large datasets, which quickly consumes most of the available resources on the machine. In effect, a small number of folks are ruining the party for everyone else. Training or banning these "special" users isn't an option.
So here's my question: are there any databases (NoSQL or SQL, or anything really) that support resource constraints on a per-query basis?
Limiting CPU utilization would be the bare minimum, but other constraints like execution time, memory usage, and rows-returned limits would be nice too. It'd be especially handy if I could programmatically specify limits so I could target my problem users.
EDIT: Extra points for open-source and/or free products.
EDIT2: Found some related questions that make it clear Oracle supports some sort of resource-limiting scheme, but are there any other products that do, or is it just Oracle and SQL Server?
https://serverfault.com/questions/124158/throttle-or-limit-resources-used-by-a-user-in-a-database
Is there a way to throttle or limit resources used by a user in Oracle?
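(For reference, the Oracle mechanism those questions point at appears to be resource profiles, with finer-grained control available through the Database Resource Manager. A minimal, hedged sketch of the profile approach; the limit values, profile name, and user name are all hypothetical:)

    -- Profile limits are only enforced when RESOURCE_LIMIT is enabled.
    ALTER SYSTEM SET RESOURCE_LIMIT = TRUE;

    -- Cap per-statement CPU and logical reads for the problem accounts.
    -- CPU_PER_CALL is in hundredths of a second; values are illustrative only.
    CREATE PROFILE limited_query_users LIMIT
        CPU_PER_CALL           30000
        LOGICAL_READS_PER_CALL 1000000
        SESSIONS_PER_USER      2;

    -- Attach the profile to a specific "special" user.
    ALTER USER problem_user PROFILE limited_query_users;

A statement that exceeds one of the per-call limits is aborted with an error rather than being allowed to keep consuming resources.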
SQL Server 2008 supports a resource governor:
Resource Governor is a new technology in SQL Server 2008 that enables you to manage SQL Server workload and resources by specifying limits on resource consumption by incoming requests. In the Resource Governor context, workload is a set of similarly sized queries or requests that can, and should be, treated as a single entity. This is not a requirement, but the more uniform the resource usage pattern of a workload is, the more benefit you are likely to derive from Resource Governor. Resource limits can be reconfigured in real time with minimal impact on workloads that are executing.
Resource Governor provides:
The ability to classify incoming connections and route their workloads to a specific group.
The ability to monitor resource usage for each workload in a group.
The ability to pool resources and set pool-specific limits on CPU usage and memory allocation. This prevents or minimizes the probability of run-away queries.
The ability to associate grouped workloads with a specific pool of resources.
The ability to identify and set priorities for workloads.
Ref.
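To make that concrete, here is a hedged sketch of how a throttled pool and classifier might look in T-SQL; the pool, group, login name, and percentages are all hypothetical and would need tuning for a real workload:

    -- Pool that caps CPU and memory for ad-hoc query users (run in master)
    CREATE RESOURCE POOL AdHocPool
        WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 25);

    -- Workload group bound to that pool, with extra per-request limits
    CREATE WORKLOAD GROUP AdHocGroup
        WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25, MAX_DOP = 2)
        USING AdHocPool;
    GO

    -- Classifier function: route the problem login into the throttled group
    CREATE FUNCTION dbo.rgClassifier() RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        IF SUSER_SNAME() = N'adhoc_query_user'   -- hypothetical login
            RETURN N'AdHocGroup';
        RETURN N'default';
    END;
    GO

    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rgClassifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;

Note this caps resources per workload group rather than per individual query, which is usually close enough: classify the "special" users into a constrained group and the rest of the workload keeps its resources.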
Resource constraints may ease your problem, but I think the real problem behind the situation is the unpredictable usage of resources.
When a database executes queries, it needs to load data into memory and lock resources to maintain a consistent state.
Whatever a constraint system can do, this unpredictable behavior inside the database's internal mechanisms is the riskiest thing you should be concerned about.
If I were facing this kind of situation, I would try to figure out what the users really need and provide more precise queries (against specific tables, or with tighter conditions on the data) for them.
If nothing else can be done, however, I would try to clone (replicate) the database for the heavy users.
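One low-tech way to do that on SQL Server (assuming that's the platform in play, since it comes up above; the names and paths below are hypothetical) is a poor man's log-shipped copy: restore a backup on a second server in STANDBY mode, keep applying log backups, and point the heavy users at the read-only copy:

    -- On the reporting server: restore the clone, leaving it readable ("standby")
    RESTORE DATABASE AppDb_ReadOnly
        FROM DISK = 'D:\Backups\AppDb_full.bak'
        WITH MOVE 'AppDb_Data' TO 'D:\Data\AppDb_ReadOnly.mdf',
             MOVE 'AppDb_Log'  TO 'D:\Data\AppDb_ReadOnly.ldf',
             STANDBY = 'D:\Data\AppDb_ReadOnly_undo.dat';

    -- Periodically (e.g. every few minutes) apply the latest log backup
    RESTORE LOG AppDb_ReadOnly
        FROM DISK = 'D:\Backups\AppDb_log_latest.trn'
        WITH STANDBY = 'D:\Data\AppDb_ReadOnly_undo.dat';

The copy lags by the log-backup interval and readers are briefly disconnected while each log restore runs, but the heavy users' queries no longer touch the primary server at all.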
Related
I work for a ticket re-sale eCommerce site, and one of the problems that we have is that during on-sale periods our database is bombarded with thousands of requests.
The table that holds the tickets is constantly updated and read from, and this is a major bottleneck for the site.
We considered reading from replicated databases, but these replicated servers are sometimes hours out of sync.
One idea was to put triggers on the tickets table that, on Update, Insert, and Delete actions, populate a denormalized table, and then do the reads against this denormalized table. This might make queries a bit faster (a rough sketch of that idea follows).
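To make that concrete, here is a rough sketch of the trigger idea against a hypothetical tickets schema (TicketId, Price, Status); it is only an illustration of the question, not a recommendation:

    -- dbo.Tickets is the hot OLTP table, dbo.TicketsRead the denormalized copy
    CREATE TRIGGER trg_Tickets_SyncRead
    ON dbo.Tickets
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Upsert inserted/updated rows into the read copy
        MERGE dbo.TicketsRead AS tgt
        USING inserted AS src
            ON tgt.TicketId = src.TicketId
        WHEN MATCHED THEN
            UPDATE SET tgt.Price = src.Price, tgt.Status = src.Status
        WHEN NOT MATCHED THEN
            INSERT (TicketId, Price, Status)
            VALUES (src.TicketId, src.Price, src.Status);

        -- Remove rows that were actually deleted (not merely updated)
        DELETE tgt
        FROM dbo.TicketsRead AS tgt
        JOIN deleted AS d ON d.TicketId = tgt.TicketId
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.TicketId = d.TicketId);
    END;

Note the trigger runs synchronously inside the writing transaction, so it adds latency to the hot write path rather than removing it.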
We have considered CQRS but due to the nature of our site, and the following reasons we think that it would not be a good fit:
each ticket is unique, since it is uploaded by a seller, and multiple buyers will be competing for the same tickets concurrently.
we experience bursty traffic when popular events are listed, and tickets are sold in a request-response manner.
Are there any other techniques that we can use to distribute some of the load?
Can you tell us more about which version of SQL Server you are using (2008R2, 2012, edition, etc.) and what Isolation Level are you running? As far as triggers go, they are rarely synonymous with "performance improvement" =) Have you been able to identify the specific waits in your DB? Are reads waiting on a long-winded update transaction or delete of some sort? Or are you experiencing memory pressure on the db server? Do you have auto update statistics on? Are your writes to the table also bursty? If your stats are out of date, you might be picking up inefficient query plans along the way. If you are not already using it, I'd highly recommend sp_Blitz from Brent Ozar to give you some more insight.
Once you know more about those items, you'll probably have a better idea of whether or not you NEED to actually distribute load vs. just do some tuning.
As far as load distribution, SQL Server AlwaysOn Availability Groups are potentially an answer, though they take some finessing. A readable secondary can be created that is asynchronously replicated which, in my experience at least, generally maintains fairly low latency. A synchronous replica can also be spun up, but that could compound wait issues...you'd have to do a fair amount of testing on that one.
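If you're not sure where the waits actually are, a quick and non-intrusive starting point is the aggregate wait-statistics DMV; this sketch just lists the top wait types accumulated since the last service restart (the excluded "idle" wait types are an illustrative, not exhaustive, list):

    -- Top wait types by total wait time (cumulative since instance start)
    SELECT TOP (15)
           wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                            N'BROKER_TASK_STOP', N'WAITFOR')
    ORDER BY wait_time_ms DESC;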
You are basically building another eBay, with the same scaling issues they have.
There are some descriptions of their architecture: http://www.quora.com/What-is-eBays-architecture, http://highscalability.com/ebay-architecture, and many others on Google.
Basically though, it comes down to using asynchronous processing whenever possible (learn about queues), offloading as much as possible from your main database server, having a good real-time search server (which is not your database server), and scaling horizontally by moving as much logic as possible into the app layer.
This will require that you give up ACID principles and embrace eventual consistency. Eventual doesn't mean hours, though; as you learn about queues, you will realize that allowing for a 0.5-second delay enables MUCH greater scalability.
So, from a back-of-the-napkin architecture, I would suggest you move your search to some fairly real-time search engine (like elasticsearch), offload most of your metadata to some NoSQL platform (like MongoDB or Cassandra), and reserve your database for processing bids against tickets. These bids shouldn't go straight to the database, but should be put in a queue, which will enforce ordering and allow another process to execute them against the database.
Any one of these architectural changes will help with your load, but the asynchronous updating will make the biggest difference.
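To make the "bids go into a queue" idea concrete without adding new infrastructure, one option is a plain table used as a queue inside SQL Server itself (a dedicated queueing system or Service Broker is usually better at scale); the table and column names here are hypothetical:

    -- Writers just INSERT bids; a worker process drains them in order.
    CREATE TABLE dbo.BidQueue (
        BidId    BIGINT IDENTITY PRIMARY KEY,
        TicketId INT   NOT NULL,
        BuyerId  INT   NOT NULL,
        Amount   MONEY NOT NULL
    );
    GO

    -- Worker: atomically claim the oldest bid; READPAST lets competing
    -- workers skip locked rows instead of blocking behind each other.
    ;WITH nextBid AS (
        SELECT TOP (1) BidId, TicketId, BuyerId, Amount
        FROM dbo.BidQueue WITH (ROWLOCK, READPAST, UPDLOCK)
        ORDER BY BidId
    )
    DELETE FROM nextBid
    OUTPUT deleted.BidId, deleted.TicketId, deleted.BuyerId, deleted.Amount;

The dequeued bid can then be applied to the tickets table in the same transaction, which serializes competing buyers without all of them hammering the hot table directly.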
The application we have been building is starting to solidify, in that the majority of the functionality is now in place. This has given us some breathing room, and we are starting to evaluate our persistence model and the management of it. I guess you could say the big elephant in the room is RavenDB. While we have not experienced any functional issues with it yet, we are not comfortable with managing it. Simple tasks such as executing a query, truncating a collection, etc., are challenging for us, as we are new to the platform and to document-based NoSQL solutions in general. Of course we are capable of learning it, but I think it comes down to confidence, time, and leveraging our existing SQL Server skill sets.
For example, we pumped millions of events through the system over the course of a few weeks, and the successfully processed messages were routed to our Audit queue in MSMQ. We also had ServiceInsight installed, and it processed the messages in the Audit queue, which chewed up all the disk space on the server. We did not know how to fix this and literally had to delete the data file that we found for RavenDB. Let me just say, doing that caused all kinds of headaches.
So with that in mind, I have been charged with evaluating the feasibility and benefits of potentially leveraging SQL Server for the Transport and/or Persistence for our Service Endpoints. In addition, I could use some guidance on configuring ServiceControl and ServiceInsight to leverage SQL Server. Any information you might be able to provide on configuring these, and on any drawbacks or architectural issues that we should consider, would be greatly appreciated.
Thank you, Jeffrey
Using SQL persistence requires very little configuration (it's an implementation detail). However, using the SQL transport is more of an architectural decision than an infrastructure one, as you are changing to a broker-style architecture, which has implications you need to consider before going down that route.
ServiceControl and ServiceInsight persistence:
Although ServiceControl monitors MSMQ as the default transport, you can configure it to support other transports such as RabbitMQ and SQL Server as well. Here you can find the details of how to do that.
At the moment ServiceControl relies on RavenDB for its persistence, and it is not possible to change that to SQL, as ServiceControl relies on Raven-specific features (AFAIK).
There is an open issue for expiring ServiceControl's data; see this issue on GitHub.
HTH
Regarding ServiceControl's usage of RavenDB (this is the underlying service that serves the data to the ServiceInsight UI):
As Sean Farmar mentioned (above), in the post-beta releases we will be including message expiration, and on-demand audited message deletion commands so that you can have full control of the capacity utilization of SC.
You can also change the drive/path of the ServiceControl database location to allow it to use a larger drive.
Note that ServiceControl (and ServiceInsight / ServicePulse, which use it) is intended for analysis, debugging, and operational monitoring. It's intended to store a limited amount of audited data (based on your throughput and capacity needs, this may vary significantly when counted as a number of messages, but the database storage capacity can be up to 16TB).
If you need a long term storage for audited data, you can hook into ServiceControl HTTP API and transfer the messages' data into various long-term / unlimited-size / low-cost storage solutions (e.g. http://aws.amazon.com/glacier).
Please let us know if this answers your needs and whether you have additional questions
Danny.
We have an enterprise LOB application for managing millions of bibliographic (lots of text) records using SQL Server 2008. The database is very normalized (a complete record might easily be made up of ten joined tables plus nested collections). Write transactions are fine, and we have a very responsive search solution for now, which makes generous use of full-text indexing and indexed views.
The issue is that in reality, much of what the research users need could be better served by a read-only warehouse-type copy of the data, but it would need to be continually copied near real-time (latency of a few minutes is fine).
Our search is optimized by several calculated columns or composite tables already, and we would like to add more. Indexed views cannot cover all needs because of their constraints (such as no outer joins). There are dozens of 'aspects' to this data, much like a read-only data warehouse might provide, involving permissions, geography, category, quality, and counts of associated documents. We also compose complex xml representations of the records that are fairly static and could be composed and stored once.
The total amount of denormalization, calculation and search optimization provokes an unacceptable delay if done completely via triggers, and is also prone to lock conflicts.
I've researched some of Microsoft's SQL Server suggestions, and I would like to know if anyone with experience of similar requirements can offer recommendations from the following three (or other suggestions that use the SQL Server/.NET stack):
Transactional replication to a read-only copy - but it is unclear from the documentation how much one can change the schema on the subscriber side and add triggers, calculated columns or composite tables;
Table partitioning - not to alter the data, but perhaps to segment large areas of data that currently are recalculated constantly, such as permissions, record type (60), geographical region, etc...would that allow triggers on the transactional side to run with fewer locks?
Offline batch processing - Microsoft uses that phrase often, but does not give great examples, except for 'checking for signs of credit card fraud' on the subscriber side of transaction replication...which would be a great sample, but how is that done exactly in practice? SSIS jobs that run every 5 minutes? Service Broker? External executables that poll continually? We want to avoid the 'run a long process at night' solution, and we also want to avoid locking up the transactional side of things by running an update-intensive aggregating/compositing routine every 5 minutes on the transactional server.
Update to #3: after posting, I found this SO answer with a link to Real Time Data Integration using Change Tracking, Service Broker, SSIS and triggers - looks promising - would that be a recommended path?
Another Update: which, in turn, has helped me find rusanu.com - all things Service Broker by SO user Remus Rusanu. The asynchronous messaging solutions seem to match our scenario much better than the replication scenarios...
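(For what it's worth, the Change Tracking half of that approach is cheap to enable; a minimal sketch with a hypothetical Biblio database and dbo.Record table whose primary key is RecordId:)

    -- Enable change tracking at the database and table level
    ALTER DATABASE Biblio
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Record
        ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);
    GO

    -- The aggregation job remembers the version it last synced to and asks
    -- only for what changed since then.
    DECLARE @last_sync_version BIGINT = 0;   -- persisted by the job in practice

    SELECT ct.RecordId, ct.SYS_CHANGE_OPERATION
    FROM CHANGETABLE(CHANGES dbo.Record, @last_sync_version) AS ct;

    SELECT CHANGE_TRACKING_CURRENT_VERSION() AS version_to_store_for_next_run;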
Service Broker technology is a good fit for your task, although there may be potential drawbacks depending on your particular system configuration. The most valuable feature IMO is the ability to decouple the two kinds of processing: writing and aggregation. You will be able to do this even across different databases/SQL Server instances/physical servers in a very reliable way. Of course, you need to spend some time designing the message exchange process (specifying message formats, planning conversations, etc.), because this has a huge influence on how satisfying the resulting system will be.
I've used Service Broker for a task that was more or less similar: near-real-time creation of an analytic data warehouse fed by the regular data flow.
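For anyone following along, here is a minimal, hedged sketch of the Service Broker plumbing for this kind of decoupling; the message type, contract, queue, and service names are all hypothetical, and the actual aggregation logic lives in whatever reads the target queue:

    -- One-way contract: the OLTP side notifies, the aggregation side consumes
    CREATE MESSAGE TYPE [//Biblio/RecordChanged] VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT [//Biblio/RecordChangedContract]
        ([//Biblio/RecordChanged] SENT BY INITIATOR);

    CREATE QUEUE dbo.RecordChangedSendQueue;
    CREATE QUEUE dbo.RecordChangedTargetQueue;

    CREATE SERVICE [//Biblio/WriterService]
        ON QUEUE dbo.RecordChangedSendQueue ([//Biblio/RecordChangedContract]);
    CREATE SERVICE [//Biblio/AggregatorService]
        ON QUEUE dbo.RecordChangedTargetQueue ([//Biblio/RecordChangedContract]);
    GO

    -- On the write path (e.g. from a slim trigger or the saving procedure):
    DECLARE @h UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE [//Biblio/WriterService]
        TO SERVICE   N'//Biblio/AggregatorService'
        ON CONTRACT  [//Biblio/RecordChangedContract]
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @h
        MESSAGE TYPE [//Biblio/RecordChanged] (N'<RecordChanged Id="42" />');
    GO

    -- The aggregation worker (activation procedure or scheduled job):
    DECLARE @handle UNIQUEIDENTIFIER, @body XML;
    WAITFOR (
        RECEIVE TOP (1) @handle = conversation_handle,
                        @body   = CAST(message_body AS XML)
        FROM dbo.RecordChangedTargetQueue
    ), TIMEOUT 5000;
    -- ...recalculate the denormalized/aggregate rows for the record in @body...

Because the aggregation side drains its own queue, the write path only pays for a SEND, and the heavy recomputation happens asynchronously, possibly on another instance.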
What are the top issues and in which order of importance to look into while optimizing (performance tuning, troubleshooting) an existing (but unknown to you) database?
Which actions/measures in your previous optimizations gave the most effect (with possibly the minimum of work)?
I'd like to partition this question into following categories (in order of interest to me):
one needs to show a performance boost (improvement) in the shortest time, i.e. the most cost-effective methods/actions;
non-intrusive or least-troublesome yet most effective methods (without changing existing schemas, etc.);
intrusive methods.
Update:
Suppose I have a copy of the database on a dev machine without access to the production environment to observe stats, most-used queries, performance counters, etc. in real use.
This is a development-related question, not a DBA-related one.
Update2:
Suppose the database was developed by others and was given to me for optimization (review) before it was delivered to production.
It is quite usual to have outsourced development detached from end-users.
Besides, there is a database design paradigm which holds that a database, in contrast to mere application data storage, should be a value in itself, independent of the specific applications that use it or the context of its use.
Update3: Thanks to all answerers! You all pushed me to open a sub-question:
How do you stress load dev database (server) locally?
Create a performance Baseline (non-intrusive, use performance counters)
Identify the most expensive queries (non-intrusive, use SQL Profiler)
Identify the most frequently run queries (non-intrusive, use SQL Profiler)
Identify any overly complex queries, or those using slowly performing constructs or patterns. (non-intrusive to identify, use SQL Profiler and/or code inspections; possibly intrusive if changed, may require substantial re-testing)
Assess your hardware
Identify Indexes that would benefit the measured workload (non-intrusive, use SQL Profiler)
Measure and compare to your baseline.
If you have very large databases, or extreme operating conditions (such as 24/7 or ultra high query loads), look at the high end features offered by your RDBMS, such as table/index partitioning.
This may be of interest: How Can I Log and Find the Most Expensive Queries?
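As a non-intrusive alternative to a Profiler trace for the "most expensive" and "most frequent" steps, the plan-cache statistics DMV can be queried directly; a rough sketch (the numbers are cumulative and reset when plans leave the cache or the instance restarts):

    -- Top statements by total CPU; swap the ORDER BY for execution_count
    -- to find the most frequently run statements instead.
    SELECT TOP (20)
           qs.execution_count,
           qs.total_worker_time  / 1000 AS total_cpu_ms,
           qs.total_elapsed_time / NULLIF(qs.execution_count, 0) / 1000 AS avg_elapsed_ms,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                           WHEN -1 THEN DATALENGTH(st.text)
                           ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;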
If the database is unknown to you and you're under pressure, then you may not have time for Mitch's checklist, which is good best practice for monitoring server health.
You also need access to production to gather real info from assorted queries you can run. Without this, you're doomed. The server load pattern is important: you can't reproduce many issues yourself on a development server because you won't use the system like an end user.
Also, focus on "biggest bang for the buck". An expensive query running once daily at 3am can be ignored. A not-so-expensive one running every second is well worth optimising. However, you may not know this without knowing server load pattern.
So, basic steps..
Assuming you're firefighting:
server logs
SQL Server logs
sys.sysprocesses, e.g. ASYNC_NETWORK_IO waits
Slow response:
profiler, with a duration filter. What runs often and is lengthy
most expensive query, weighted for how often used
open transaction with plan
weighted missing index (see the sketch below)
Things you should have:
Backups
Tested restore of aforementioned backups
Regular index and statistic maintenance
Regular DBCC and integrity checks
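The "weighted missing index" check above can be approximated from the missing-index DMVs; treat the output as hints to investigate, not as indexes to create blindly:

    -- Missing-index suggestions weighted by how often and how much they'd help
    SELECT TOP (20)
           migs.avg_total_user_cost * migs.avg_user_impact
               * (migs.user_seeks + migs.user_scans) AS estimated_improvement,
           mid.statement          AS table_name,
           mid.equality_columns,
           mid.inequality_columns,
           mid.included_columns
    FROM sys.dm_db_missing_index_groups      AS mig
    JOIN sys.dm_db_missing_index_group_stats AS migs
         ON migs.group_handle = mig.index_group_handle
    JOIN sys.dm_db_missing_index_details     AS mid
         ON mid.index_handle = mig.index_handle
    ORDER BY estimated_improvement DESC;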
Edit: After your update
Static analysis is best practices only: you can't optimise for usage. This is all you can do. This is marc_s' answer.
You can guess what the most common query may be, but you can't guess how much data will be written or how badly a query scales with more data
In many shops, developers provide some support, either directly or as "3rd line".
If you've been given a DB for review by another team that you hand over to another team to deploy: that's odd.
If you're not interested in the runtime behavior of the database, e.g. what are the most frequently executed queries and those that consume the most time, you can only do a "static" analysis of the database structure itself. That has a lot less value, really, since you can only check for a number of key indicators of bad design - but you cannot really tell much about the "dynamics" of the system being used.
Things I would check for in a database that I get as a .bak file - without the ability to collect live and actual runtime performance statistics - would be:
normalization - is the table structure normalized to third normal form? (at least most of the time - there might be some exceptions)
do all tables have a primary key? ("if it doesn't have a primary key, it's not a table", after all)
For SQL Server: do all the tables have a good clustering index? A unique, narrow, static, and preferably ever-increasing clustered key - ideally an INT IDENTITY, and most definitely not a large compound index of many fields, no GUIDs and no large VARCHAR fields (see Kimberly Tripp's excellent blog posts on the topic for details)
are there any check and default constraints on the database tables?
are all the foreign key fields backed up by a non-clustered index to speed up JOIN queries? (see the sketch after this list)
are there any other obvious "deadly sins" in the database, e.g. overly complicated views or really badly designed tables?
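A rough way to script that foreign-key check is the catalog query below; it only looks at whether the FK column leads some index, so treat misses as candidates to review rather than definite problems:

    -- Foreign key columns that do not lead any index on the referencing table
    SELECT fk.name                          AS foreign_key,
           OBJECT_NAME(fk.parent_object_id) AS referencing_table,
           c.name                           AS fk_column
    FROM sys.foreign_keys        AS fk
    JOIN sys.foreign_key_columns AS fkc
         ON fkc.constraint_object_id = fk.object_id
    JOIN sys.columns             AS c
         ON c.object_id = fkc.parent_object_id
        AND c.column_id = fkc.parent_column_id
    WHERE NOT EXISTS (
        SELECT 1
        FROM sys.index_columns AS ic
        WHERE ic.object_id   = fkc.parent_object_id
          AND ic.column_id   = fkc.parent_column_id
          AND ic.key_ordinal = 1   -- column is the leading key of some index
    );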
But again: without actual runtime statistics, you're quite limited in what you can do from a "static analysis" point of view. The real optimization can only really happen when you have a workload from a regular day of operation, to see what queries are used frequently and put the most stress on your database --> use Mitch's checklist to check those points.
The most important thing to do is collect up-to-date statistics. Performance of a database depends on:
the schema;
the data in the database; and
the queries being executed.
Looking at any of those in isolation is far less useful than the whole.
Once you have collected the statistics, then you start identifying operations that are sub-par.
For what it's worth, the vast majority of the performance problems we've fixed have been solved by adding indexes, by adding extra columns and triggers to move the cost of calculations away from the select to the insert/update, or by tactfully informing the users that their queries are, shall we say, less than optimal :-)
They're usually pleased that we can just give them an equivalent query that runs much faster.
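The "move the cost from the select to the insert/update" trick often boils down to a persisted computed column plus an index; a small sketch with made-up table and column names:

    -- Pre-compute the value once per write instead of once per read...
    ALTER TABLE dbo.Orders
        ADD TotalWithTax AS (Amount * (1 + TaxRate)) PERSISTED;

    -- ...and let searches on it seek instead of scan.
    CREATE NONCLUSTERED INDEX IX_Orders_TotalWithTax
        ON dbo.Orders (TotalWithTax);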
I'm not sure if my question was at all clear or not, so let me just dive in and give you the long version:
Someone said recently, when discussing high-volume web applications, that "disk is the new tape". Website administrators use huge clusters of memcache servers to make disk I/O-free round trips between the client and the server. In order to accomplish this, application developers are having to treat RDBMS's like generic data stores, tossing out valuable features like foreign key constraints, check constraints, and cascading UPDATES and DELETES.
But what if you could put your memcache cluster on the other side of your db interface, outside the realm of the application software (PHP, Ruby, Python...), and inside the RDBMS? Think of it as a large, distributed, in-memory database cache. To make it clear, I'm talking about having the type of memcache cluster that can store pretty much an entire database in memory, guaranteeing 100% cache hit rate when reading from the database, regardless of the parameters of your SELECT statement. Then, when writing application code, you can not only forget about memory caching, you can start dealing with the data store as if it were a Relational Database Management System again, and use normalization and JOINS the way RDBMS's traditionally encourage.
The performance gains, while still maintaining/restoring the data-integrity benefits of an RDBMS, might be worth looking into.
Does anyone know of a project where someone has done this already, perhaps as an open-source project wherein they modify an open-source RDBMS such as PostgreSQL or MySQL? I haven't found anything yet, and I have no idea how these programs are structured or whether it would even be possible to implement such a storage engine.
You should review the CAP theorem:
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Keep in mind, if one of those memcache servers rebooted, you would be a bit miffed. Even if you used all memcache servers you would still be limited by the network speed, and then possibly network congestion would become an issue.
There are other devices, such as Fusion-io cards and other pure-RAM hard disks (such as hypersystem's RAM disks), that also address this issue. Ooohh, I wish my company could afford a dozen of these...
-daniel