ECL - HPCC: Testing a Roxie query

I'm trying to write a Roxie query using ECL language. Is there a way to write and test the code without constantly publishing the query?

I am assuming that you are only seeking to avoid the extra UI-oriented steps in publishing a query (switching back and forth between ECL Watch and your dev environment, for instance). You can make testing on roxie relatively painless with some build scripting and REST calls.
In HPCC, roxie and hthor are similar from an execution and runtime environment viewpoint. Their tactical execution strategies are different (roxie handles queries in OS threads, hthor handles them by forking child processes), but the rule of thumb is that if you can get code to work well in hthor then it will probably work well in roxie.
You can leverage that similarity during development. Rather than publishing the query to roxie, testing, tearing down the query, and repeating all that, you can simply submit the job to hthor (much like you would do for a thor job). You would have to hardcode some test values that would normally be parameters for the roxie query, but that is simple enough.
An additional bonus to using hthor is that it is the only engine that supports any kind of step-by-step debugging. That can be a hit-or-miss proposition, though, depending on the version of HPCC you are executing against (which you did not mention). Even if you don't use the debugger, hthor's execution graphs at least show details of a specific query's data flows, such as record counts at each step (roxie shows the graph, but there is no detailed information for individual queries).
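If you want to script that loop rather than click through ECL Watch, the HPCC client tools include an "ecl" command-line utility that can submit a .ecl file straight to hthor. Below is a minimal sketch in Python that wraps it with subprocess; the server address, file name, and flag spellings are assumptions (check "ecl help run" for your version), not a definitive recipe.

    # Minimal sketch: drive the edit -> run-on-hthor loop from a script using the
    # HPCC "ecl" client tool. Server address, credentials, file name and flag
    # spellings are placeholders -- verify against `ecl help run` for your version.
    import subprocess

    def run_on_hthor(ecl_file: str, server: str = "127.0.0.1") -> str:
        """Submit an ECL file to hthor and return the workunit's textual output."""
        cmd = [
            "ecl", "run", "hthor", ecl_file,   # run on hthor instead of publishing to roxie
            f"--server={server}",              # ESP / ECL Watch host (assumed flag)
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        # Hardcode the would-be roxie parameters inside test_query.ecl for now,
        # then diff this output against an expected-results file.
        print(run_on_hthor("test_query.ecl"))

Wrap that in whatever diffing or assertion logic you like and the hthor round trip becomes a one-command test run.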

Related

Automating jobs for service layer regression testing (Powershell/MSBuild)

In a .NET development environment, I'm looking to implement some regression-testing scripts to do some end-to-end (black-box) testing on a fully set-up, server-based application, which will end up being fairly complex.
My initial thought was to roll my own PowerShell scripts plus an XML configuration of the steps. But I wanted to do some analysis to see whether there is anything out there I could reuse, and what anyone else did that might prove to be a best practice (which I haven't found as of yet).
I realised I could potentially just use an MSBuild project, along with the MSBuildExtensions and community tasks, but I've found these scripts to be harder to modify/maintain in the long run.
An example of some of the job steps I'd be coding for one of the applications:
Copy files to certain directories and trigger a service to load
Wait for the service to load files (checking sql tables for job completion)
Truncating tables (etc, on sql databases)
Comparing sql table output with expected results
Parsing log files
and so on
Some pretty simple PowerShell would be able to cater for most of these. I'd be interested in opinions: what do you use if you have some regression-style end-to-end testing? Do you roll your own in order to have a fairly simple and specific implementation, or use a third-party tool (like MSBuild, or something else)?
Choosing the right tool for the job is often driven by personal preference but it really should be driven by effectiveness vs. maintainability.
MSBuild excels in task reuse and dependency chain resolution.
PowerShell shines in compressing complex processes into a few elegant commands.
In your scenario I'd probably use PowerShell for the integration-oriented work (job queuing, database access, IO) and keep MSBuild for producing the build artifacts.
There is no need for a third-party tool unless it's the top dog in its field and the price is right (i.e. open source or already purchased by your company).
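Whichever of the two you lean toward, a rolled-your-own runner tends to reduce to the same shape: an ordered list of named steps, fail-fast execution, and some logging and timing. Here is a minimal sketch of that shape in Python (the step bodies are placeholders for the copy/poll/truncate/compare work listed in the question):

    # Minimal sketch of a roll-your-own regression runner: named steps run in
    # order, fail fast, and report timing. The step bodies below are placeholders
    # for the real work (copy files, poll a job table, compare to expected output).
    import time
    from typing import Callable, List, Tuple

    Step = Tuple[str, Callable[[], None]]

    def run_steps(steps: List[Step]) -> bool:
        for name, action in steps:
            start = time.time()
            try:
                action()
            except Exception as exc:              # fail fast, but say which step broke
                print(f"FAIL  {name}: {exc}")
                return False
            print(f"OK    {name} ({time.time() - start:.1f}s)")
        return True

    if __name__ == "__main__":
        steps: List[Step] = [
            ("copy input files",          lambda: None),  # shutil.copy(...) in practice
            ("wait for service load",     lambda: None),  # poll the job-status table with a timeout
            ("compare table to expected", lambda: None),  # query SQL, diff against a baseline
        ]
        raise SystemExit(0 if run_steps(steps) else 1)

The same skeleton translates almost line-for-line into PowerShell or into MSBuild targets; the maintainability question is mostly about which of those you would rather read in a year.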

SpecFlow Integration Testing with Database Patterns

I'm attempting to set up SpecFlow for integration/acceptance testing. Our product has a backing database (not a huge one, though) in SQLite.
This is actually proving to be a slightly sticky point though; how do I model the database for the tests?
I would like to know what patterns others out there use for doing integration/acceptance testing with backing databases.
I can think of the following approaches:
Compile a database into the assembly with the tests, then shadow-copy it for each test. Seems slow though.
I could create the database in memory and populate it with pre-determined data.
I could create the database in memory and somehow have Givens populate the database. This seems like it would bloat the tests horribly, but might give them more control and make the tests less fragile.
I could abstract every database interaction and use mocks. Not in love with this idea since I'd like to use this to test the database interactions as well.
Compile the database into the tests and rely on clean-up code to return it to the base state (this one seems dodgy to me). I don't want to do it with transactions, since some tests involve multiple interactions (e.g. write an item, then attempt to read it back with different privileges).
Before considering the How to test, I think you might find it valuable to look at What you want to test.
Starting with the data: I find that it really helps to take a single element, or a small number of them, and imagine a set of events around them in order to give you the right test data to run your tests with. For example:
If you were working on a healthcare system, you might define a person "Bob" and then produce his life events. Bob was born 37 years ago today, fell off his bike as a child and broke his arm, got married, and has two children.
If you are working on a financial trading system, you might look at a day between opening and closing for a couple of stocks, e.g. "MSFT" and "APPL". On this day you might see one starting low and climbing, the other starting high and falling. A piece of news comes out that reverses their fortunes.
Now that you have the what, you can evaluate which of your scenarios actually work for your data. For example, "MSFT" and "APPL" could have thousands of price changes throughout the day, so generating the Givens and mocks would be very time consuming; that data lends itself to being pre-captured. On the other hand, the "Bob" data works particularly well with generated data, because the data can always change so that it is his birthday today.
One thing your question doesn't seem to consider is updating your data. For example, you might want a set of tests that work at various stages of your entity's life cycle, e.g. some tests deal with "Baby Bob", others with "10yr old Bob" or "Married Bob", etc. If your DB is read-only then this isn't a problem, as long as you can write your tests so that they just don't see the other data, but sometimes you want to build a story through your tests. If your tests do change the data, then you will have to ensure either that your tests run in order (see MSTest OrderedTest or mbUnit DependsOn), or that you can separate your tests so they each deal with an isolated data entity (this is fine if your entity can be described in a single row, but gets harder when you have to read many tables to get it).
You also might want to consider what code you are testing; you can vary the approach inside your different test sets. I currently work on a multi-tier application that has UI views, view models, client models, multiple communication systems, and server models, and I have different sets of tests for these. Some tests work in a single tier, mocking out other tiers to keep the tests small. Other tests fire up a local server and a local client and wire the two up directly. Finally, I have some tests that launch a full server process, communicate via EMS, and run some simple client-side operations using everything but the UI views.
So, to actually answer your question:
Shadow copy your database - Yes, I've done this once with SQL Server Developer and had an xxx.mdb that got copied in before running the tests. However, some modern testing frameworks (e.g. NCrunch) will run tests in parallel, so this approach just breaks.
Create the database and pre-populate - I haven't done this one, but my concern would be what happens when a test changes the database to an unexpected state: other tests will then fail even though they have done nothing wrong.
Create the database and use Givens - I've done this with NUnit via [SetUpFixture] on top of a LINQ to SQL DB. You still have concerns about parallel test runs, you have to balance the granularity of your Givens (see the StackOverflow question "When do BDD scenarios become too specific"), and you have the data-update ordering/data-isolation problem, but this can work really well for working through your data stories and growing the data throughout your tests. On the other hand, should one test fail and leave the data in a bad state, you can end up with lots of failures, although at least you only need to look at the one that fails first. This kind of testing also doesn't play very nicely for developers on their workstations, as they can't just run a single test, particularly with tools such as NCrunch, which can run just the tests whose code has changed.
Mock the database - This is how I choose to do things now. The trick is that if you are following a reasonably strict TDD process where you only test the method you are working on, then you end up with some tiers that test the database interaction, e.g. [Test] DALLayerTests.ShouldReadARowAndCreatePOCO(), but most others use mocked data to test what actually happens, e.g. [Test] BusinessObjectPersonTests.ShouldGetBirthdayCongratulations().
Use clean up code - Never tried it, it sounds dodgy :-)
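To make the "create the database in memory and have Givens populate it" option concrete: each scenario gets a fresh in-memory database and the Given steps insert only the rows that scenario cares about. The sketch below uses Python's stdlib sqlite3 purely so it is self-contained; in SpecFlow the same shape would live in a [BeforeScenario] hook plus Given step bindings, and the schema and rows are invented for illustration.

    # Sketch of "in-memory DB populated by the Givens": a fresh :memory: SQLite
    # database per scenario, with Given-style helpers inserting only the rows the
    # scenario needs. Schema and data are invented for illustration.
    import sqlite3

    def fresh_db() -> sqlite3.Connection:
        conn = sqlite3.connect(":memory:")   # isolated per scenario, nothing to clean up
        conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT)")
        return conn

    def given_a_person(conn: sqlite3.Connection, name: str, birthdate: str) -> int:
        cur = conn.execute("INSERT INTO person (name, birthdate) VALUES (?, ?)", (name, birthdate))
        conn.commit()
        return cur.lastrowid

    def test_person_can_be_read_back():
        conn = fresh_db()
        bob = given_a_person(conn, "Bob", "1974-06-01")
        row = conn.execute("SELECT name FROM person WHERE id = ?", (bob,)).fetchone()
        assert row[0] == "Bob"

Because every scenario builds its own database, the parallel-run and bad-state concerns above largely disappear; the trade-off is keeping the Givens small enough that they don't bloat the scenarios.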

stored procedures and testing -- still a problem even today. Why?

Right now this is just a theory, so if I'm way off, feel free to comment and give some ideas (I'm running out of them). I guess it's more of an update to this question, and as I look at the "related questions" list there are a lot of entries with 0 answers. This tells me there's a real gap.
We have multiple problems with our SQL setups in general, the majority of which stem from stored procedures that have grown into monsters from hell, plus some other user functions scattered about the db. My biggest concern is that they're completely untested -- when something goes wrong, no one can say with 100% certainty "yes, I know for a fact this works". That makes debugging a recurring nightmare.
This afternoon, I got this crazy idea we could start writing some assemblies (CLR-ing yo!) for SQL and test them. I ran into the constraints (static methods only, safe/external/unsafe, etc) and overall, that didn't go all that well. At least not as well as I'd hoped and didn't help me move toward my goal.
I've also tried setting up data in a test by hand (they tried it here too before I showed up). Even using an ORM to seed the data -- this also becomes rather difficult very quickly and a maintenance hassle. Of course, most of this pain is in the data setup and not the actual test.
So what's out there now in 2011 that helps fix/curb this problem, or have we (as devs) abandoned the idea of testing stored procedures because of the heavy cost?
You can actually make stored procedure tests as a project. Our DBEs at work do that - here's a link you might like: Database Unit Testing with Visual Studio
We've had a lot of success with DbFit.
Yes, there is a cost to setting up test data (there is no way to avoid this cost, IMHO), but the Fitnesse platform (on which DbFit is based) enables you to reuse data-population scripts by including them within multiple tests.
Corporate culture rules the day. Some places test extensively. Other places, well, not so much.
I did a short-term contract with a Fortune 500 a few years ago. Design, build, and deploy internally. My database had to interface with two legacy systems. It was clear early on that I was going to have to spend more time testing than usual. (Some days, a query of historical data would return 35 rows. Other days the identical query would return 20,000 rows.)
I built a tool in Microsoft Access that stored and executed SQL statements. (Access was the only tool I was allowed to use.) I could build a current version of the database, populate it with test data, and run all the tests I'd built--several hundred of them--in about 20 minutes.
It helped a lot to be able to go into meetings armed with a one-page printout that said my code was working exactly like it was when they signed off on it. But it wasn't easily automated--most of the SQL was hand-coded.
Can DBUnit help you?
Not used it much myself but you should be able to set the database to a known state, execute the procedure and then verify the data has changed as expected.
EDIT: After looking into this more, it would seem you need something like SQLUnit rather than DBUnit. SQLUnit is described as:
"SQLUnit is a regression and unit testing harness for testing database stored procedures. An SQLUnit test suite would be written as an XML file. The SQLUnit harness, which is written in Java, uses the JUnit unit testing framework to convert the XML test specifications to JDBC calls and compare the results generated from the calls with the specified results."
There are downsides: it's Java-based, which might not be your preference, and more importantly there doesn't seem to have been much activity on the project since June '06 :(
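Whichever harness you end up with, the body of a stored-procedure test is the same three moves described above for DBUnit/SQLUnit: put the database into a known state, execute the procedure, and verify the resulting data. Here is a hedged sketch of that shape in Python over pyodbc against SQL Server; the connection string, table, procedure, and expected values are all placeholders.

    # Sketch of the known-state -> execute -> verify pattern for a stored
    # procedure, written as a plain unittest over pyodbc/SQL Server. The
    # connection string, table, proc and expected counts are placeholders.
    import unittest
    import pyodbc

    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=localhost;DATABASE=TestDb;Trusted_Connection=yes")

    class ArchiveShippedOrdersTests(unittest.TestCase):
        def setUp(self):
            self.conn = pyodbc.connect(CONN_STR, autocommit=True)
            cur = self.conn.cursor()
            cur.execute("DELETE FROM dbo.OrdersArchive")   # known state
            cur.execute("DELETE FROM dbo.Orders")
            cur.execute("INSERT INTO dbo.Orders (Id, Status) VALUES (1, 'shipped'), (2, 'open')")

        def test_archives_only_shipped_orders(self):
            cur = self.conn.cursor()
            cur.execute("EXEC dbo.ArchiveShippedOrders")   # the proc under test
            count = cur.execute("SELECT COUNT(*) FROM dbo.OrdersArchive").fetchone()[0]
            self.assertEqual(count, 1)

        def tearDown(self):
            self.conn.close()

    if __name__ == "__main__":
        unittest.main()

Most of the maintenance pain described in the question lives in the setUp step, which is exactly the part DbFit/SQLUnit let you write once and reuse across tests.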

Feasible way to do automated performance testing on various database technologies?

A lot of guys on this site state that "Optimizing something for performance is the root of all evil". My problem now is that I have a lot of complex SQL queries, many of them using user-created functions in PL/pgSQL or PL/Python, and I do not have any performance-profiling tool to show me which functions actually make the queries slow. My current method is to exclude the various functions one at a time and time the query for each. I know that I could use EXPLAIN ANALYZE as well, but I do not think it will provide me with information about user-created functions.
My current method is quite tedious, especially since there is no query-progress indicator in PostgreSQL, so I sometimes have to wait 60 seconds for a query to finish if I choose to run it on too much data.
Therefore, I am wondering whether it would be a good idea to create a tool which automatically profiles SQL queries by modifying the query and measuring the actual processing time of the various versions. Each version would be a simplified one, perhaps containing just a single user-created function. I know that I am not describing how to do this very clearly, and I can think of a lot of complicating factors, but I can also see that there are workarounds for many of them. I basically need your gut feeling on whether such a method is feasible.
Another, similar idea is to run the query with the server setting work_mem set to various values and show how this impacts performance.
Such a tool could be written using JDBC so it could be modified to work across all major databases. In this case it might be a viable commercial product.
Apache JMeter can be used to load test and monitor the performance of SQL queries (using JDBC). It will, however, not modify your SQL.
Actually, I don't think any tool out there could simplify and then re-run your SQL. How would that "simplifying" work?
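JMeter won't rewrite the SQL for you, but the second idea in the question (sweeping work_mem and timing the query) is small enough to script directly. Here is a rough sketch with psycopg2; the DSN and the query are placeholders, and set_config() with is_local = false changes the setting for the current session only.

    # Rough sketch of "sweep work_mem and time the query" using psycopg2.
    # The DSN and query are placeholders for the real workload.
    import time
    import psycopg2

    DSN = "dbname=mydb user=me"                          # placeholder connection string
    QUERY = "SELECT my_plpgsql_func(x) FROM big_table"   # placeholder query under test

    def time_query(work_mem: str) -> float:
        with psycopg2.connect(DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT set_config('work_mem', %s, false)", (work_mem,))
                start = time.perf_counter()
                cur.execute(QUERY)
                cur.fetchall()                           # force full execution
                return time.perf_counter() - start

    if __name__ == "__main__":
        for setting in ("4MB", "64MB", "256MB"):
            print(f"work_mem={setting}: {time_query(setting):.2f}s")

The same loop could strip out one user-defined function per run instead of changing work_mem, which is essentially the manual exclusion process from the question, just automated.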

How do you measure all the queries that flow through your software?

In one of his blog articles, the proprietor of this web site posed this question to the reader, "You're automatically measuring all the queries that flow through your software, right?"
How do you do this? Every line of code that makes a query against the database is followed by a line of code that increments a counter? Or, are there tools that sit between your app and the DB and do this for you?
SQL Server Profiler is my tool of choice, but only for the DB end obviously.
It should be noted, this is for optimizing queries and performance, and debugging. This is not a tool to be left running all the time, as it can be resource intensive.
I don't know exactly what Jeff was trying to say, but I would guess that he expects you to use whatever query performance monitoring facility you have for your database.
Another approach is to use wrappers for database connections in your code. For example, in Java, assuming you have a DataSource that all of your classes use, you can write your own implementation of DataSource that uses an underlying DataSource to create Connection objects. Your DataSource should wrap those connections in your own Connection objects, which can keep track of the data that flows through them.
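The same wrapper idea in scripting terms: wrap whatever connection object your code already passes around and time every execute(). Here is a sketch in Python against the DB-API, with sqlite3 used only so the example is self-contained; any DB-API connection can be wrapped the same way.

    # Sketch of "wrap the connection, measure every query". sqlite3 keeps the
    # example self-contained; any DB-API connection/cursor works the same way.
    import sqlite3
    import time

    class MeasuringCursor:
        def __init__(self, cursor, stats):
            self._cursor, self._stats = cursor, stats

        def execute(self, sql, params=()):
            start = time.perf_counter()
            try:
                return self._cursor.execute(sql, params)
            finally:
                self._stats.append((sql, time.perf_counter() - start))

        def __getattr__(self, name):                 # delegate fetchone/fetchall/etc.
            return getattr(self._cursor, name)

    class MeasuringConnection:
        def __init__(self, conn):
            self._conn = conn
            self.stats = []                          # list of (sql, seconds)

        def cursor(self):
            return MeasuringCursor(self._conn.cursor(), self.stats)

        def __getattr__(self, name):
            return getattr(self._conn, name)

    if __name__ == "__main__":
        conn = MeasuringConnection(sqlite3.connect(":memory:"))
        cur = conn.cursor()
        cur.execute("CREATE TABLE t (x INTEGER)")
        cur.execute("SELECT * FROM t")
        for sql, seconds in conn.stats:
            print(f"{seconds * 1000:.2f} ms  {sql}")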
I have a C++ wrapper that I use for all my database work. That wrapper (in debug mode) basically does an EXPLAIN QUERY PLAN on every statement that it runs. If it gets back a response indicating that an index is not being used, it ASSERTs. It's a great way to make sure indexes are used (but only in debug mode).
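That trick is easy to reproduce from a script against SQLite: run EXPLAIN QUERY PLAN for the statement and flag any step that is a bare table scan. A sketch in Python; the table, index, and query are invented, and the exact wording of the plan's detail column varies a little between SQLite versions.

    # Sketch: assert that a query's EXPLAIN QUERY PLAN contains no bare table
    # scan (a "SCAN ..." step with no index mentioned). Table, index and query
    # are invented; the detail wording varies slightly across SQLite versions.
    import sqlite3

    def assert_uses_index(conn: sqlite3.Connection, sql: str, params=()):
        plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
        for row in plan:
            detail = row[-1]                     # the human-readable detail column
            if detail.startswith("SCAN") and "INDEX" not in detail:
                raise AssertionError(f"Full table scan in plan: {detail}")

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE INDEX idx_person_name ON person(name)")
        assert_uses_index(conn, "SELECT id FROM person WHERE name = ?", ("Bob",))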
We just bought a software product called dynaTrace to do this. It uses bytecode instrumentation (MSIL in our case, since we use .NET, but it does Java as well). It basically instruments around the methods we choose, and around various framework methods, to capture the time it takes each method to execute.
In regards to database calls, it keeps track of each call made (through ADO.Net) and the parameters in the call, along with the execution time. You can then go from the DB call and walk through the execution path the program took to get there. It will show every method call (that you have instrumented) in the path. It is quite badass.
You might use this in a number of different ways but typically this would be used in some kind of load testing scenario with some other product providing load through the various paths of your system. Then you get a list of your DB calls under load and can look at them.
You can also evaluate not just the execution of one call but the count of them to prevent the death of a thousand cuts.
For Perl, DBI::Profile.
If your architecture is well designed, it should be fairly easy to intercept all data-access calls and measure query execution time. A fairly easy way of doing this is with an aspect around DB calls (if your language/framework supports aspect-oriented programming). Another way is to use a special driver that intercepts all calls, redirects them to a real driver, and measures query execution time.
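In a language without AOP support, the same interception can be a plain decorator around the data-access functions. A small sketch in Python; the function name and logging setup are illustrative only.

    # Sketch of the "aspect around DB calls" idea as a decorator: every decorated
    # data-access function is timed and logged. Names here are illustrative.
    import functools
    import logging
    import time

    log = logging.getLogger("query_timing")

    def measured(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info("%s took %.1f ms", func.__name__, elapsed_ms)
        return wrapper

    @measured
    def load_orders_for_customer(conn, customer_id):
        # placeholder for the real DB-API call
        return conn.execute("SELECT * FROM orders WHERE customer_id = ?", (customer_id,)).fetchall()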
