How do you measure all the queries that flow through your software?

In one of his blog articles, the proprietor of this website posed this question to the reader: "You're automatically measuring all the queries that flow through your software, right?"
How do you do this? Is every line of code that makes a query against the database followed by a line of code that increments a counter? Or are there tools that sit between your app and the DB that do this for you?

SQL Server Profiler is my tool of choice, but only for the DB end obviously.
It should be noted that this is for optimizing queries and performance, and for debugging. It is not a tool to be left running all the time, as it can be resource intensive.

I don't know exactly what Jeff was trying to say, but I would guess that he expects you to use whatever query performance monitoring facility you have for your database.
Another approach is to use wrappers for database connections in your code. For example, in Java, assuming you have a DataSource that all of your classes use, you can write your own implementation of DataSource that uses an underlying DataSource to create Connection objects. Your DataSource should wrap those connections in your own Connection objects, which can keep track of the data that flows through them.
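To make that concrete, here is a minimal Java sketch of the idea, assuming a hypothetical CountingDataSource (all names are illustrative). It counts each statement created or prepared through the wrapped connections; a fuller version would also proxy the returned Statement objects so you count executions and capture timings rather than just preparations.

```java
import java.io.PrintWriter;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;
import javax.sql.DataSource;

/**
 * Hypothetical sketch: a DataSource wrapper that counts every statement
 * created or prepared through it. Everything else passes straight through
 * to the real DataSource / Connection.
 */
public class CountingDataSource implements DataSource {
    private final DataSource delegate;
    private final AtomicLong queryCount = new AtomicLong();

    public CountingDataSource(DataSource delegate) { this.delegate = delegate; }

    public long getQueryCount() { return queryCount.get(); }

    @Override
    public Connection getConnection() throws SQLException {
        return wrap(delegate.getConnection());
    }

    @Override
    public Connection getConnection(String user, String password) throws SQLException {
        return wrap(delegate.getConnection(user, password));
    }

    // Wrap the Connection in a dynamic proxy so we don't have to hand-write
    // every Connection method; we only care about the statement factories.
    private Connection wrap(Connection real) {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                (proxy, method, args) -> {
                    String name = method.getName();
                    if (name.equals("createStatement") || name.startsWith("prepare")) {
                        queryCount.incrementAndGet();   // one more query on its way to the DB
                    }
                    try {
                        return method.invoke(real, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause();             // rethrow the original SQLException
                    }
                });
    }

    // Boilerplate: everything else just delegates.
    @Override public PrintWriter getLogWriter() throws SQLException { return delegate.getLogWriter(); }
    @Override public void setLogWriter(PrintWriter out) throws SQLException { delegate.setLogWriter(out); }
    @Override public void setLoginTimeout(int seconds) throws SQLException { delegate.setLoginTimeout(seconds); }
    @Override public int getLoginTimeout() throws SQLException { return delegate.getLoginTimeout(); }
    @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException { return delegate.getParentLogger(); }
    @Override public <T> T unwrap(Class<T> iface) throws SQLException { return delegate.unwrap(iface); }
    @Override public boolean isWrapperFor(Class<?> iface) throws SQLException { return delegate.isWrapperFor(iface); }
}
```

You would then register CountingDataSource wherever your code currently looks up its DataSource (JNDI, DI container, etc.) and expose getQueryCount() through logging or JMX.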

I have a C++ wrapper that I use for all my database work. In debug mode, that wrapper basically does an EXPLAIN QUERY PLAN on every statement it runs. If the response indicates that an index is not being used, it ASSERTs. It's a great way to make sure indexes are used (but only in debug mode).
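For anyone who wants to borrow the idea without the C++ wrapper, a rough Java/JDBC equivalent against SQLite might look like the following. The "detail" column and the "SCAN" prefix are SQLite-specific, the heuristic is deliberately crude, and the database file, table and query are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Rough sketch: before running a query, ask SQLite for its plan and fail
 * loudly (assert) if the plan looks like a full table scan.
 */
public class IndexCheck {

    // SQLite's EXPLAIN QUERY PLAN reports "SCAN <table>" for full scans and
    // "SEARCH <table> USING INDEX ..." when an index is used.
    static void assertUsesIndex(Connection conn, String sql) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("EXPLAIN QUERY PLAN " + sql)) {
            while (rs.next()) {
                String detail = rs.getString("detail");
                assert !detail.startsWith("SCAN") : "No index used for: " + sql + " (" + detail + ")";
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumes the sqlite-jdbc driver is on the classpath; run with -ea to enable asserts.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:sample.db")) {
            assertUsesIndex(conn, "SELECT * FROM orders WHERE customer_id = 42");
        }
    }
}
```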

We just bought a software product called dynaTrace to do this. It uses bytecode instrumentation (MSIL in our case, since we use .NET, but it does Java as well): it instruments the methods we choose, plus various framework methods, to capture the time it takes each method to execute.
Regarding database calls, it keeps track of each call made (through ADO.NET) and the parameters in the call, along with the execution time. You can then go from the DB call and walk through the execution path the program took to get there. It will show every method call (that you have instrumented) in the path. It is quite badass.
You might use this in a number of different ways but typically this would be used in some kind of load testing scenario with some other product providing load through the various paths of your system. Then you get a list of your DB calls under load and can look at them.
You can also evaluate not just the execution time of a single call but also how many times it is made, to prevent death by a thousand cuts.

For Perl, DBI::Profile.

If your architecture is well designed, it should be fairly easy to intercept all data access calls and measure query execution time. A fairly easy way of doing this is to use an aspect around DB calls (if your language/framework supports aspect-oriented programming). Another way is to use a special driver that intercepts all calls, redirects them to the real driver, and measures query execution time.
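As a sketch of the aspect approach in Java (AspectJ annotation style): the com.example.dao pointcut is a placeholder for wherever your data-access calls live, and you would still need compile-time or load-time weaving, or Spring AOP if those classes are Spring beans.

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

/**
 * Sketch: time every public method in the (hypothetical) data-access layer
 * and log anything slower than a threshold.
 */
@Aspect
public class QueryTimingAspect {

    private static final long SLOW_MS = 100;

    // Matches every public method in the com.example.dao package (placeholder).
    @Around("execution(public * com.example.dao..*.*(..))")
    public Object time(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
            return pjp.proceed();               // run the actual data-access call
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs >= SLOW_MS) {
                System.err.printf("SLOW QUERY PATH: %s took %d ms%n",
                        pjp.getSignature().toShortString(), elapsedMs);
            }
        }
    }
}
```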

Related

How can we write a high performance and SAFE SQLCLR assembly to create and put a json object in MSMQ?

There is a requirement in one of the projects I am working on that I need to create and pass some event messages in JSON format from SQL Server to MSMQ.
I found that SQL CLR could be the best way to implement this, but some of my colleagues say that it is expensive in terms of performance and memory utilization.
I am looking at a benchmark of at most 20 msg/sec.
The number of messages depends on certain events occurring in the database.
The implementation is required to run for ~10 hrs/day.
Kindly suggest how to achieve this; a code snippet or steps would be a great help.
Any other option to implement the functionality is always welcome.
Thanks.
I found that SQL CLR could be the best way to implement this, but some of my colleagues say that it is expensive in terms of performance and memory utilization.
There is a lot of "it depends" in this conversation. And, most of the performance / memory / security concerns you hear are based on misinformation or simple lack of information.
SQLCLR code can be inefficient, but it can also be more efficient / faster than T-SQL in some cases. It all depends on what you are trying to accomplish and how you approach the problem, both in terms of overall structure and in terms of how it is coded. Things that can be done in straight T-SQL are nearly always faster in straight T-SQL. But if you place that T-SQL code in a Scalar Function / UDF, then it is no longer fast ;-). Inline T-SQL is the fastest, IF you can actually do the thing you are trying to do.
So, if you can communicate with MSMQ via T-SQL, then do that. But if you can't, then yes, SQLCLR could be efficient enough to handle this.
HOWEVER #1, regarding the need for JSON:
I do not think that any of the supported .NET Framework libraries include JSON support (but I need to check again). While it is tempting to try to load Json.NET, certain coding practices in that code are not wise to use in SQLCLR, namely the use of static class variables. SQLCLR runs in a shared App Domain, so all sessions running the same piece of code share the same memory space. So the Json.NET code shouldn't be used as-is; it would need to be modified (if possible). Some people go the easy route and just set the Assembly to UNSAFE to get past the errors about static class variables that are not marked as readonly, but that can lead to odd / unexpected behavior, so I wouldn't recommend it. There might also be references to unsupported .NET Framework libraries that would need to be loaded into SQL Server as UNSAFE. So, if you want to do JSON, you might have to construct it manually.
HOWEVER #2, the question title is (emphasis added):
How can we write a high performance and SAFE SQLCLR assembly to create and put a json object in MSMQ?
You cannot interact with anything outside of SQL Server in an Assembly marked as SAFE. You will need to mark the Assembly as EXTERNAL_ACCESS in order to communicate with MSMQ, but that shouldn't pose any inherent problem.

Feasible way to do automated performance testing on various database technologies?

A lot of guys on this site state that "optimizing something for performance is the root of all evil". My problem is that I have a lot of complex SQL queries, many of them utilizing user-created functions in PL/pgSQL or PL/Python, and that I do not have any performance profiling tool to show me which functions actually make the queries slow. My current method is to exclude the various functions one at a time and time the query with each one removed. I know that I could use EXPLAIN ANALYZE as well, but I do not think it will provide me with information about user-created functions.
My current method is quite tedious, especially since there is no query progress indicator in PostgreSQL, so I sometimes have to wait 60 seconds for a query to finish if I choose to run it on too much data.
Therefore, I am wondering whether it would be a good idea to create a tool which automatically profiles SQL queries by modifying the query and taking the actual processing time of various versions of it. Each version would be a simplified one, maybe containing just a single user-created function. I know that I am not describing how to do this clearly, and I can think of a lot of complicating factors, but I can also see that there are workarounds for many of them. I basically need your gut feeling on whether such a method is feasible.
Another similar idea is to run the query with the server setting work_mem set to various values and show how this impacts performance.
Such a tool could be written using JDBC so it could be modified to work across all major databases. In this case it might be a viable commercial product.
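The work_mem part of that idea is straightforward over JDBC. Here is a bare-bones sketch: the connection details, table names and query are placeholders, and ideally each run would be repeated a few times, since the first execution warms the cache.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Bare-bones sketch: run the same query once per work_mem setting and report
 * the wall-clock time of each run.
 */
public class WorkMemProbe {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost/mydb";
        String query = "SELECT count(*) FROM big_table t JOIN other_table o ON o.id = t.other_id";
        String[] workMemSettings = { "4MB", "64MB", "256MB" };

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            for (String workMem : workMemSettings) {
                st.execute("SET work_mem = '" + workMem + "'");   // session-level setting
                long start = System.nanoTime();
                try (ResultSet rs = st.executeQuery(query)) {
                    while (rs.next()) { /* drain the result set so the query fully executes */ }
                }
                System.out.printf("work_mem=%s -> %d ms%n",
                        workMem, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }
}
```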
Apache JMeter can be used to load test and monitor the performance of SQL queries (using JDBC). It will, however, not modify your SQL.
Actually, I don't think any tool out there could simplify and then re-run your SQL. How would that "simplifying" work?

Need help designing big database update process

We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to spend a week of work setting up an SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research CREATE ASSEMBLY, moving your client code across the network to reside on your SQL box. This avoids network latency, but could destabilize your SQL Server.
Bad news: you have many options
Use flat-file transformations: extract all the data into flat files, manipulate them using grep, awk, sed, C, or Perl into the required insert/update statements, and execute those against the target database.
PRO: Fast. CON: Extremely ugly ... a nightmare for maintenance; don't do this if you need it for longer than a week or for more than a couple dozen executions.
Use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within the other, so one of the fastest ways to do this is to write it as a collection of insert / update / merge statements fed with select statements.
PRO: Fast, one technology only. CON: Requires a direct connection between the databases, and you might reach the limit of SQL (or of the available SQL knowledge) pretty fast, depending on the kind of transformation.
Use T-SQL, or whatever iterative language the database provides; everything else is similar to the pure SQL approach.
PRO: Pretty fast, since you don't leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL it is not the nicest language for complex transformations.
Use an ETL tool: there are special tools for extracting, transforming and loading data. They often support various databases and have many strategies readily available for deciding whether an update or an insert is in order.
PRO: Sorry, you'll have to ask somebody else for that; so far I have nothing but bad experience with those tools.
CON: A highly specialized tool that you need to master. In my personal experience: slower in implementation and execution of the transformation than handwritten SQL. A nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if anything.
Use a high-level language (Java, C#, VB ...): you would load your data into proper business objects, manipulate those, and store them in the target database. Pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate (a rough sketch of this approach follows the list).
PRO: Even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tools like good IDEs, testing frameworks and CI systems to support you while developing the transformation.
CON: It adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling the objects back into the target database). I'd go this way if it is a process that is going to be around for a long time.
Building on the last option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am going beyond the scope of your question.
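If you do go the high-level-language route (whatever the language), the key is to batch the writes rather than open a connection per property. A rough Java/JDBC sketch of the shape, since Java is one of the languages named above; all table, column and connection names are placeholders, and the same batching pattern applies in ADO.NET via batched commands or SqlBulkCopy.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Sketch of the "high-level language" option with batching. A real version
 * would cover all 15 tables and 40 properties and use a proper upsert.
 */
public class BulkTransfer {
    private static final int BATCH_SIZE = 2000;

    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:sqlserver://source;databaseName=SrcDb", "user", "pw");
             Connection dst = DriverManager.getConnection("jdbc:sqlserver://target;databaseName=DstDb", "user", "pw")) {
            dst.setAutoCommit(false);                     // commit once per batch, not per row

            String select = "SELECT object_id, some_property FROM PropertyTableA";
            String upsert = "UPDATE PropertyTableA SET some_property = ? WHERE object_id = ?";
            // (A real upsert would be MERGE or insert-if-missing; kept simple here.)

            try (Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery(select);
                 PreparedStatement write = dst.prepareStatement(upsert)) {
                int pending = 0;
                while (rs.next()) {
                    String transformed = transform(rs.getString("some_property"));
                    write.setString(1, transformed);
                    write.setLong(2, rs.getLong("object_id"));
                    write.addBatch();
                    if (++pending % BATCH_SIZE == 0) {
                        write.executeBatch();             // one round trip for 2000 rows
                        dst.commit();
                    }
                }
                write.executeBatch();                     // flush the final partial batch
                dst.commit();
            }
        }
    }

    private static String transform(String value) {
        return value == null ? null : value.trim().toUpperCase();   // stand-in for the real transform
    }
}
```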
I'm with John: SSIS is the way to go for any repeatable process that imports large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need to do a hybrid of set-based and looping code to run in batches (of say 2000 records at a time) rather than lock up the table for the whole time a large insert would take.

Most compatible way to listen for database changes?

I have a process that reads raw data and writes this to a database every few seconds.
What is the best way to tell if the database has been written to? I know that Oracle and MS-SQL can use triggers or something to communicate with other services, but I was hoping there would be a technique that works with more types of SQL databases (SQLite, MySQL, PostgreSQL).
Your question is lacking specifics needed for a good answer but I'll give it a try. Triggers are good for targeting tables but if you are interested in system-wide writes then you'll need a better method that is easier to maintain. For system-wide writes I'd investigate methods that detect changes in the transaction log. Unfortunately, each vendor implements this part differently, so one method that works for all vendors is not likely. That is, a method that works within the database server is unlikely. But there may be more elegant ways outside of the server at the OS level. For instance, if the transaction log is a file on disk then a simple script of some sort that detects changes in the file would indicate the DB was written to.
Keep in mind you have asked only to detect a db write. If you need to know what type of write it was then you'll need to get into the transaction log to see what is there. And that will definitely be vendor specific.
It depends on what you wish to do. If it is something external to the database that needs to be kicked off then a simple poll of the database would do the trick, otherwise a db specific trigger is probably best.
If you want to be database independent, polling can work. It's not very efficient or elegant, but it also works if you are cursed with using a database that doesn't support triggers. A workaround that we've used in the past is a script, timed via something like cron, that does a SELECT MAX(primary_key_id) FROM saidTable. (I am assuming that your primary key is a sequential integer and is indexed.)
And then compare that to the value you obtained the last time you ran the script. If they match, tell the script to exit or sleep. If not, do your thing.
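A sketch of that polling idea in Java/JDBC: the JDBC URL, credentials and poll interval are placeholders, and the cron-based version described above would persist the last-seen value between invocations instead of looping in-process.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Sketch: watch MAX(primary_key_id) and react when it grows. Assumes a
 * sequential, indexed integer primary key, as in the answer above.
 */
public class TablePoller {
    public static void main(String[] args) throws Exception {
        long lastSeen = -1;
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            while (true) {
                try (ResultSet rs = st.executeQuery("SELECT MAX(primary_key_id) FROM saidTable")) {
                    long current = rs.next() ? rs.getLong(1) : -1;
                    if (lastSeen >= 0 && current > lastSeen) {
                        System.out.println("New rows detected, do your thing...");
                    }
                    lastSeen = current;
                }
                Thread.sleep(5000);   // poll every few seconds
            }
        }
    }
}
```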
There are other issues with this approach (e.g. backlogs if the script takes too long, concurrency issues, etc.). And of course performance can become an issue too!

Database design for physics hardware

I have to develop a database for a unique environment. I don't have experience with database design and could use everybody's wisdom.
My group is designing a database for a piece of physics hardware and a data acquisition system. We need a system that will store all the hardware configuration parameters and track the changes to these parameters as they are changed by the user.
The setup:
We have nearly 200 detectors and roughly 40 parameters associated with each detector. Of these 40 parameters, we expect only a few to change during the course of the experiment. Most parameters associated with a single detector are static.
We collect data for this experiment in timed runs. During these runs, the parameters loaded into the hardware must not change, although we should be able to edit the database at any time to prepare for the next run. The current plan:
The database will provide the difference between the current parameters and the parameters used during the last run.
At the start of a new run, the most recent database changes will be loaded into the hardware.
The settings used for the upcoming run must be tagged with a run number and the current date and time. This is essential. I need a run-by-run history of the experimental setup.
There will be several different clients that both read and write to the database. Although changes to the database will be infrequent, I cannot guarantee that the changes won't happen concurrently.
Must be robust and non-corruptible. The configuration of the experimental system depends on the hardware. Any breakdown of the database would prevent data acquisition, and our time is expensive. Database backups?
My current plan is to implement the above requirements using an SQLite database, although I am unsure whether it can support all my requirements. Is there any other technology I should look into? Has anybody done something similar? I am willing to learn any technology, as long as it's mature.
Tips and advice are welcome.
Thank you,
Sean
Update 1:
Database access:
There are three lightweight applications that can write to and read from the database and one application that can only read.
The applications with write access are responsible for setting a non-overlapping subset of the hardware parameters. To be specific, we have one application (of which there may be multiple copies) which sets the high voltage, one application which sets the remainder of the hardware parameters which may change during the experiment, and one GUI which sets the remainder of the parameters which are nearly static and are only essential for the proper reconstruction of the data.
The program with read access only is our data analysis software. It needs access to nearly all of the parameters in the database to properly format the incoming data into something we can analyze properly. The number of connections to the database should be >10.
Backups:
Another setup at our lab dumps an XML file every run. Even though I don't think XML is appropriate, I was planning to back up the system every run, just in case.
Some basic things about the design: make sure that you don't delete data from any tables. Keep track of the most recent data (probably best with a last-updated datetime), but when a value changes, don't delete the old data. When a run is initiated, tag every table used with the run ID (in another column); this way you maintain a full historical record of every setting and can pin down exactly what state was used for a given run.
Ask around among your colleagues.
You don't say what kind of physics you're doing, or how big the working group is, but in my discipline (particle physics) there is a deep repository of experience putting up and running just this type of system (we call it "slow controls" and similar). There is a pretty good chance that someone you work with has either done this or knows someone who has. There may be a detailed description of the last time out in someone's thesis.
I don't personally have much to do with this, but I do know this: one common feature is a no-delete, no-overwrite design. You can only add data, never remove it. This preserves your chances of figuring out what really happened in case of trouble.
Perhaps I should explain a little more. While this is an important task and has to be done right, it is not really related to physics, so you can't look it up on SPIRES or on arXiv.org. No one writes papers on the design and implementation of medium-sized slow-controls databases, but they do sometimes put it in their dissertations. The easiest way to find a pointer really is to ask a bunch of people around the lab.
This is not a particularly large database by the sounds of things. So you might be able to get away with using Oracle's free database which will give you all kinds of great flexibility with journaling (not sure if that is an actual word) and administration.
Your mention of 'non-corruptible' right after you say "There will be several different clients that both read and write to the database" raises a red flag for me. Are you planning on creating some sort of application that has an interface for this? Or were you planning on direct access to the DB via a tool like TOAD?
In order to preserve your data integrity you will need to get really strict with your permissions. I would only allow one person (plus a backup) to have admin rights with the ability to do data manipulation outside the GUI (which will make your life easier).
Backups? Yes, absolutely! Not only should you do daily, weekly and monthly backups, you should do both full and incremental backups. Also, test your backup images often to confirm they actually work.
As for the data structure, I would need much greater detail on what you are trying to store and how you would access it. But from what you have put here I would say you need the following tables (to begin with):
Detectors
Parameters
Detector_Parameters
Some additional notes:
Since you will be making so many changes, I recommend using a version control system like SVN to keep track of all your DDL scripts etc. I would also recommend using something like Bugzilla for bug tracking (if needed) and Google Docs for team document management.
Hope that helps.