SQLite's test code to production code ratio - c

SQLite claims to have 679 times more test code than production code.
http://www.sqlite.org/testing.html
Does anyone know how this is possible? Do they generate any test code automatically? What are the major parts of these "45678.3 KSLOC" of test code?

"Does anyone know how this is possible?"
"It is possible" to have 679 times as much test code because a single feature can be used in many different ways. Consider just a single function that takes two parameters. I can generate a lot of test code for that one function that tests boundary conditions and many other combinations of conditions. When you consider setup/teardown of the tests, there is additional code there. Depending on their testing framework, this overhead may add significantly to the amount of test code.
What it really boils down to is the fact that a piece of software can be used in so many different ways, which means that you have many different scenarios to test for. This is the beauty of elegant software, in that a simple program can be applied to numerous scenarios, but that is the same thing that makes verifying and testing software so challenging.

It's presumably possible if the developers spent 679 times as much time writing test code as they spent writing production code. Just think: if they'd opted instead for 339 times as much test code, they could have had two entire database engines, each still with a ludicrous amount of test coverage.
I once watched a fellow developer trying to placate a furious customer about slipped deadlines by informing them that he had written 5 times as much test code as production code. The customer was not placated, if you can imagine. At least I don't think 5X coverage is extreme anymore.

It uses Tcl to power the test framework so it's much easier to write tests than it is to write the implementation. This encourages thorough testing, which is what you want in a database, yes? Moreover, a fair fraction of those tests are proprietary, aimed at testing in embedded environments; I imagine some corporate user (or users) paid for that sort of thing. It's also quite possible that the same feature is tested multiple times.

Looking at section 3.1 (OOM):
"OOM testing is accomplished by simulating OOM errors. SQLite allows an application to substitute an alternative malloc() implementation using the sqlite3_config(SQLITE_CONFIG_MALLOC,...) interface. The TCL and TH3 test harnesses are both capable of inserting a modified version of malloc() that can be rigged to fail after a certain number of allocations. These instrumented mallocs can be set to fail only once and then start working again, or to continue failing after the first failure. OOM tests are done in a loop. On the first iteration of the loop, the instrumented malloc is rigged to fail on the first allocation. Then some SQLite operation is carried out and checks are done to make sure SQLite handled the OOM error correctly. Then the time-to-failure counter on the instrumented malloc is increased by one and the test is repeated. The loop continues until the entire operation runs to completion without ever encountering a simulated OOM failure. Tests like this are run twice, once with the instrumented malloc set to fail only once, and again with the instrumented malloc set to fail continuously after the first failure."
Note that section 7 explicitly states 100% core coverage as determined by gcov. I agree with Donal Fellows that the test framework is largely responsible for the test coverage beyond what a call graph would suggest. It's a very different thing to see malloc() entered nn times and write a test for it than it is to write dozens of tests geared to simulate environments where malloc() is likely to fail.
Yes, the resulting coverage is an artifact of diligence, but so is the selection of a test framework that enables that kind of diligence.
Finally, reiterating the obvious, malloc() takes only a single size argument and returns a void pointer. This suggests that the tests written around it are by deliberate design, not automatically generated.
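To make the quoted loop concrete, here is a minimal sketch of the same idea in Python. It is not SQLite's harness: FailingAllocator, SimulatedOOM and the three-allocation operation() are hypothetical stand-ins invented for illustration. The point is only the shape of the loop, which moves the failure point forward by one allocation per iteration until the operation completes without a simulated failure.

```python
class SimulatedOOM(Exception):
    """Raised by the fake allocator in place of a real out-of-memory error."""


class FailingAllocator:
    """Hypothetical stand-in for an instrumented malloc()."""

    def __init__(self, fail_at, persistent):
        self.fail_at = fail_at        # 1-based index of the first failing allocation
        self.persistent = persistent  # keep failing after the first failure?
        self.count = 0
        self.failures = 0

    def alloc(self, size):
        self.count += 1
        if self.persistent:
            should_fail = self.count >= self.fail_at
        else:
            should_fail = self.count == self.fail_at
        if should_fail:
            self.failures += 1
            raise SimulatedOOM(f"allocation #{self.count} failed")
        return bytearray(size)


def operation(allocator):
    """Toy operation under test: it needs three allocations to finish."""
    buffers = []
    try:
        for size in (16, 32, 64):
            buffers.append(allocator.alloc(size))
    except SimulatedOOM:
        buffers.clear()   # "handle" the error: release everything, report failure
        return None
    return sum(len(b) for b in buffers)


def oom_loop(persistent):
    """Raise the time-to-failure counter until the operation completes cleanly."""
    iterations = 0
    fail_at = 1
    while True:
        iterations += 1
        allocator = FailingAllocator(fail_at, persistent)
        result = operation(allocator)
        if allocator.failures == 0:
            assert result == 112   # ran to completion with no simulated OOM
            return iterations
        assert result is None      # the failure was handled, not propagated
        fail_at += 1
```

Calling oom_loop(False) and then oom_loop(True) mirrors the "run twice" part of the quote. Even this toy shows how a three-allocation operation turns into several test executions, which is one way the test volume grows much faster than the code under test.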

Related

How to run tSQLt tests in a specific order and make them dependent?

Is there any way to run tests within tSQLt test class in a specific order?
To my knowledge, every test is run within a transaction (meaning at the end all data will be rolled back). Can we make dependent tests run in a particular order so that the rollback happens only after all tests are performed?
I need to do this because each piece of my data depends on the one before it.
Why do you write tests? I assume to achieve a goal. But what is that goal?
That goal is in the end (I assume again) to spend less time hunting down hard to find defects and more time delivering value to the customers/stakeholders.
Now, writing tests has an overhead. We need to keep that overhead as small as possible, otherwise we end up going from wasting time on defects to wasting time on tests and as a result we still are slow delivering value.
One of the ways to keep that overhead small is to pay attention to writing tests that are easy to read, understand, and maintain. And one of the most important parts of that is to keep each test independent of all others and as much as possible independent of the outside world.
That means that each test should set up the environment which includes creating the data it needs and cleaning up afterward. tSQLt handles the cleanup part for you courtesy of the transaction each test is executed in.
On the other hand, writing tests that depend on each other creates a maintenance nightmare and in addition can lead to ambiguous test results. (Did this test fail because code in the test or code outside of the test did not work?)
So, to answer your question, tSQLt does not allow you to specify test order and for the reasons above you should not try to circumvent that.
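As an illustration of what "independent" means in practice, here is a minimal sketch using Python's built-in sqlite3 module rather than tSQLt itself. The orders table, the two check functions and the run_isolated helper are all made up for the example. Each check creates the rows it needs, and the runner rolls the transaction back afterwards, so the checks give the same result in any order.

```python
import sqlite3


def run_isolated(conn, check):
    """Run one check inside a transaction and always roll it back afterwards."""
    conn.execute("BEGIN")
    try:
        check(conn)
        return "pass"
    except AssertionError:
        return "fail"
    finally:
        conn.execute("ROLLBACK")


def check_new_order_is_counted(conn):
    # Arrange: the check creates exactly the data it needs.
    conn.execute("INSERT INTO orders (customer, total) VALUES ('alice', 10)")
    # Act and assert.
    (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
    assert count == 1


def check_totals_are_summed(conn):
    conn.execute("INSERT INTO orders (customer, total) VALUES ('bob', 5)")
    conn.execute("INSERT INTO orders (customer, total) VALUES ('bob', 7)")
    (total,) = conn.execute("SELECT SUM(total) FROM orders").fetchone()
    assert total == 12


def make_db():
    # isolation_level=None means autocommit, so the explicit BEGIN/ROLLBACK
    # in run_isolated are the only transaction control in play.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE orders (customer TEXT, total INTEGER)")
    return conn
```

If the first check left its row behind, the second would see three orders and a different total. The rollback is what lets each check assume an empty table.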

Relying on db transaction rollback in sunshine scenario

In a financial system I am currently maintaining, we are relying on the rollback mechanism of our database to simulate the results of running some large batch jobs - rolling back the transaction or committing it at the end, depending on whether we were doing a test run.
I really cannot decide what my opinion is. In a way I think this is great, because then there is no difference between the simulation and a live run - on the other hand it just feels kind of icky, like e.g. depending on code throwing exceptions to carry out your business logic.
What is your opinion on relying on database transactions to carry out your business logic?
EXAMPLE
Consider an administration system having 1000 mortgage deeds in a database. Now the user wants to run a batch job that creates the next term invoice on each deed, using a couple of advanced search criteria that decides which deeds are to be invoiced.
Before she actually does this, she does a test run (implemented by doing the actual run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit.
Each invoice will be stamped with a batch number, allowing us to revert the changes later if it is needed, so it's not "dangerous" to do the batch run. The users just feel that it's better UX to be able to simulate the results first.
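For readers who want to see the pattern, here is a minimal sketch of a batch job that ends in either a rollback or a commit. It uses Python's sqlite3 module; the deeds and invoices tables, the min_balance criterion and the invoice amount formula are invented for illustration and are not taken from the real system.

```python
import sqlite3


def create_invoices(conn, min_balance, dry_run):
    """Invoice every matching deed; commit only when dry_run is False."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO invoices (deed_id, amount) "
            "SELECT id, balance / 100 FROM deeds WHERE balance >= ?",
            (min_balance,),
        )
        # The report is built from the rows the job itself just wrote.
        report = conn.execute(
            "SELECT deed_id, amount FROM invoices ORDER BY deed_id"
        ).fetchall()
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("ROLLBACK" if dry_run else "COMMIT")
    return report


def make_db():
    # Autocommit mode: transactions are opened and closed explicitly above.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE deeds (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE invoices (deed_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO deeds VALUES (?, ?)",
        [(1, 50000), (2, 900), (3, 120000)],
    )
    return conn
```

The appeal is that the dry run and the live run share every line of code except the last statement. The cost is that the transaction holds its locks for as long as the job takes to run.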
CLARIFICATION
It is NOT about testing. We do have test and staging environments for that. It's about a regular user using our system wanting to simulate the results of a large operation, that may seem "uncontrollable" or "irreversible" even though it isn't.
CONCLUSION
Doesn't seem like anyone has any real good arguments against our solution. As always, context means everything, so in the context of complex functional requirements exceeding performance requirements, using db rollback to implement batch job simulations seems a viable solution.
As there is no real answer to this question, I am not choosing an answer - instead I upvoted those who actually put forth an argument.
I think it's an acceptable approach, as long as it doesn't interfere with regular processing.
The alternative would be to build a query that displays the consequences for review, but we all have had the experience of taking such an approach and not quite getting it right; or finding that the context changed between query and execution.
At the scale of 1000 rows, it's unlikely the system load is burdensome.
"Before she actually does this, she does a test run (implemented by doing the actual run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit."
That's wrong, prone to failure, and must be hell on your database logs. Unless you wrap your simulation and the actual run in a single transaction (which, judging by the timeline necessary to inspect 1000 deeds, would lead to a lot of blocked users) then there's no guaranteed consistency between test run and real run. If somebody changed data, added rows, etc. then you could end up with a different result - defeating the entire purpose of the test run.
A better pattern to do this would be for the test run to tag the records, and the real run to pick up the tagged records and process them. Or, if you have a thick client app, you can pull down the records to the client, show the report, and - if approved - push them back up.
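A minimal sketch of the tagging variant, again with Python's sqlite3 module and an invented schema (the pending_batch column and both function names are assumptions for the example). The preview commits only the tags; the real run processes exactly the tagged deeds, so rows added or changed in between cannot alter the result.

```python
import sqlite3


def preview_batch(conn, batch_no, min_balance):
    """Test run: tag the matching deeds with a batch number and report them."""
    with conn:  # commits the tags on success, rolls back on error
        conn.execute(
            "UPDATE deeds SET pending_batch = ? "
            "WHERE balance >= ? AND pending_batch IS NULL",
            (batch_no, min_balance),
        )
    rows = conn.execute(
        "SELECT id FROM deeds WHERE pending_batch = ? ORDER BY id", (batch_no,)
    )
    return [row[0] for row in rows]


def run_batch(conn, batch_no):
    """Real run: invoice exactly the tagged deeds, then clear the tags."""
    with conn:
        conn.execute(
            "INSERT INTO invoices (deed_id, batch_no) "
            "SELECT id, ? FROM deeds WHERE pending_batch = ?",
            (batch_no, batch_no),
        )
        conn.execute(
            "UPDATE deeds SET pending_batch = NULL WHERE pending_batch = ?",
            (batch_no,),
        )


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deeds "
        "(id INTEGER PRIMARY KEY, balance INTEGER, pending_batch INTEGER)"
    )
    conn.execute("CREATE TABLE invoices (deed_id INTEGER, batch_no INTEGER)")
    with conn:
        conn.executemany(
            "INSERT INTO deeds (id, balance) VALUES (?, ?)",
            [(1, 50000), (2, 900), (3, 120000)],
        )
    return conn
```

The trade-off is that the tags are real, committed state, so an abandoned preview needs some way to be cleared. That cleanup is not shown here.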
We can see what the user needs to do, quite a reasonable thing. I mean how often do we get a regexp right first time? Refining a query till it does exactly what you want is not unusual.
The business consequences of not catching errors may be quite high, so doing a trial run makes sense.
Given a blank sheet of paper I'm sure we can devise a clean implementation expressed in formal behaviours of the system rather than this somewhat back-door approach.
How much effort would I put into fixing that now? Depends on whether the current approach is actually hurting. We can imagine that in a heavily used system it could lead to contention in the database.
What I wrote about the PRO FORMA environment in that bank I worked in was also entirely a user thing.
I'm not sure exactly what you're asking here. Taking you literally:
"What is your opinion on relying on database transactions to carry out your business logic?"
Well, that's why we have transactions. We do rely on them. We hit an error and abort a transaction and rely on work done in that transaction scope to be rolled back. So exploiting the transactional behaviours of our systems is a jolly good thing, and we'd need to hand-roll the same thing ourselves if we didn't.
But I think your question is about testing in a live system and relying on not committing in order to do no damage. In an ideal world we have a live system and a test system and we don't mess with live systems. Such ideals are rarely seen. Far more common is "patch the live system. testing? what do you mean testing?" So in fact you're ahead of the game compared with some.
An alternative is to have dummy data in the live system, so that some actions can actually run all the way through. Again, error prone.
A surprisingly high proportion of system outages are due to finger trouble; it's the humans who foul up.
It works - as you say. I'd worry about the concurrency of the system since the transaction will hold locks, possibly many locks. It means that your tests will hold up any live action on the system (and any live action operations will hold up your tests). It is generally better to use a test system for testing. I don't like it much, but if the risks from running a test but forgetting to roll it back are not a problem, and neither is the interference, then it is an effective way of attaining a 'what if' type calculation. But still not very clean.
When I was working in a company that was part of the "financial system", there was a project team that had decided to use the production environment to test their conversion procedure (and just rollback instead of commit at the end).
They almost got shot for it. In hindsight, it's a pity they weren't.
Your test environments that you claim you have are for the IT people's use. Get a similar "PRO-FORMA" environment of which you can guarantee your users that it is for THEIR use exclusively.
When I worked in that bank, creating such a PRO-FORMA environment was standard procedure at every year closure.
"But it's not about testing, it's about a user wanting to simulate the results of a large job."
Paraphrased : "it's not about testing, it's about simulation".
Sometimes I wish Fabian Pascal was still in business.
(Oh, and in case anyone doesn't understand : both are about "not perpetuating the results".)

Should you test an external system prior to using it?

Note: This is not for unit testing or integration testing. This is for when the application is running.
I am working on a system which communicates to multiple back end systems, which can be grouped into three types
Relational database
SOAP or WCF service
File system (network share)
Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(
The thinking is to have a small bit of test code which runs before the actual code. If there is a problem, then persist the request and poll the target system until it is available. Tests could possibly be rerun within the code to check it is still available at logical points. The ultimate goal is to have a very stable system, regardless of the stability (or lack thereof) of the systems it communicates with.
My questions around this design are:
Are there major issues with it? (small things like the fact it may fail between the test completing and the code running are understandable)
Are there better ways to implement this sort of design?
Would using traditional exception handling and/or transactions be better?
Updates
The system needs to talk to the back end systems in a coordinated way.
The system is very async in nature so using things like queuing technologies is fine.
The system must run even if one or more backend systems are down as others may be up and processing of some information is possible.
You will be needing that traditional exception handling no matter what, since as you point out there's always the chance that things'll fail between your last check and the actual request. So I really think any solution you find should try to interact smoothly with this.
You are not stating if these flaky resources need to interact in some kind of coordinated manner, which would indicate that you should probably be using a transaction manager of some sort to do this. I do not believe you want to get into the footwork of transaction management in application code for most needs.
Sometimes I have also seen people use AOP to encapsulate retry logic to back-end systems that fail (for instance due to time-out issues). Used sparingly this may be a decent solution.
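The same idea, retry logic kept out of the business code, can be sketched in Python with a decorator instead of AOP. This is an illustration only: the retry decorator, the FlakyBackend class and the choice of which exceptions count as transient are assumptions made for the example.

```python
import functools
import time


def retry(attempts, delay_seconds, retry_on=(ConnectionError, TimeoutError)):
    """Retry the wrapped call on transient errors; re-raise after the last try."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts:
                        raise
                    time.sleep(delay_seconds * attempt)  # simple linear backoff
        return wrapper
    return decorator


class FlakyBackend:
    """Hypothetical back end that fails a fixed number of times, then answers."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    @retry(attempts=3, delay_seconds=0)
    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("back end unavailable")
        return "payload"
```

With two failures queued up, fetch() succeeds on the third call. With three, the ConnectionError reaches the caller, which is where the ordinary exception handling mentioned above still has to take over.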
In some cases you can also use message queuing technology to alleviate unstable back-ends. You could for instance commit to a message queue as part of a transaction, and only pop off the queue when successful. But this design is normally only possible when you're able to live with an asynchronous process.
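A minimal sketch of "only pop off the queue when successful", using an in-memory deque as a stand-in for a real, persistent message queue. BackendDown and drain are hypothetical names; a production version would keep the queue in durable storage.

```python
from collections import deque


class BackendDown(Exception):
    """Hypothetical error raised when the target system is unavailable."""


def drain(queue, send):
    """Send queued requests in order; stop at the first one that cannot be sent."""
    delivered = []
    while queue:
        request = queue[0]      # peek, do not remove yet
        try:
            send(request)
        except BackendDown:
            break               # leave it queued and retry on the next poll
        queue.popleft()         # remove only after the send succeeded
        delivered.append(request)
    return delivered
```

Calling drain again on a later poll picks up where the previous call stopped, so requests are neither lost nor reordered while the back end is down.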
And as always, real stability can only be achieved by attacking the root cause of the problem. I had a 25-year-old bug in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.
The Microsoft Smartclient framework provides a ConnectionMonitor class. Should be easy to use or duplicate.
Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was thick client so we could run the test every time the app started. This sanity test would go out and check things like database availability, and external network (extranet) access, and it could have been extended to do webservices as well.
If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unwieldy as so many were being created, but we then set up filters, so we knew when something really bad was happening. Overall the approach worked pretty well; our biggest win was being able to tell users that the system was down before they had entered data and got part way through a long-winded process. They absolutely loved it.
At a technical level the sanity tester was written in C#; it used exception handling in a conventional way not to find the problems it was looking for. The sanity program became a mini app in its own right, and it was standalone from the main app. If I were doing it again I'd use a logging framework to capture issues, which is more flexible than our hard-coded approach.
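The original tester was written in C# and is not shown, so the following Python sketch only illustrates the general shape of such a start-up check. All names in it are invented: each check is a small function that raises on failure, and the runner collects the outcome per dependency instead of stopping at the first problem.

```python
import os
import sqlite3


def check_database(path):
    """Open the database and run a trivial query."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


def check_share(path):
    """Confirm a directory (for example a mounted network share) is reachable."""
    if not os.path.isdir(path):
        raise FileNotFoundError(path)


def run_sanity_checks(checks):
    """Run every check and return {name: error message, or None if healthy}."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = None
        except Exception as exc:   # report the problem, keep checking the rest
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

The returned dictionary can drive both the message shown to the user and whatever notification goes to the support team.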

If you were asked if a system could sustain double growth, what 3 things would you do to answer?

Let's say at your job your boss says,
That system over there, which has lost all institutional knowledge but seems to run pretty good right now, could we dump double the data in it and survive?
You're completely unfamiliar with the system.
It's in SQL Server 2000 (primarily a database app).
There's no test environment.
You might be able to hijack it on the weekends if you needed to run a benchmark.
What are the 3 things you'd do to convince yourself, and then your manager, that you could take on that extra load? And if you couldn't do it on the same hardware, how much extra hardware (measured in dollars) would it take to satisfy that request?
To address the response from doofledorfer, your assumptions are almost all 180 degrees off. But that's my fault for an ambiguous question.
One of the main servers runs 7x24 at 70% base and spikes from there and no one knows what it is doing.
This isn't an issue of buy-in or whining... Our company may not have much of a choice in the matter.
Because this is being externally mandated, delays in implementation could result in huge fines. So large meetings to assess risk are almost impossible. There is one risk: that dumping double the data would take the system down for the existing customers.
I was hoping someone would say something like, see if you take the system off line Sunday night at midnight and run SQLIO tests to see how close the storage subsystem is to saturation. Things like that.
Set up a test environment, even if I have to do it on my laptop.
Enable some kind of logging on the production system to get an idea of the volume of transactions in addition to the volume of data.
Read the source code as I run stress tests on my laptop with increasing amounts of data.
Having said that, I sympathize with this assignment, because it's unfair. It's like asking someone in a boat if the boat can float with twice the cargo -- but you can't get out of the boat or take it out of its regular service.
You've just described a typical Agile project. Your answer should be:
I don't know, and I won't be able to tell without testing.
In addition to data volume, there might be issues with usage patterns, application interactions, database and server tuning, etc.
So let's work through a basic list of risk factors, and how we might resolve them.
Once we've done that, let's work through them in inverse order of risk; and make a stop/continue decision as we develop the results.
etc.
Without management buy-in and participation at least at that level, any other answer you might give is high-risk wishing, and "3 most important" is a non sequitur.
I'd be optimistic unless your current system is substantially loaded already. Most servers should run at less than 50% capacity on all resources, or else be on life-support.
And I expect you wouldn't be having the conversation if the existing server were already dealing with load issues; although "seems to run pretty good right now" is imprecise enough to be worrisome.
It mostly depends on its current level. If doubling is going from 2GB to 4GB, just do it. If it's going from 1TB to 2TB, you've got some planning to do.
I'd collect some info using Performance Monitor and provide it to help make an educated decision.
It depends what you mean by "double the data".
If that is going to affect one table only (say the product table) then you are probably safe, as most queries that refer to that one table will most likely double their execution time (that assumes that you do not reference the same table twice in a query).
The problem will arise if you double the amount of data in all the tables as the execution time may grow in exponential fashion then and it can lead to some serious issues.
But in general I would support the answer by doofledorfer
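A back-of-the-envelope way to see the difference between doubling one table and doubling all of them. This assumes the worst case of a naive nested-loop join with no usable index, which a real query optimizer will often avoid, so treat it as an upper-bound sketch rather than a prediction. Both function names are made up for the example.

```python
def nested_loop_comparisons(left_rows, right_rows):
    """Row pairs examined by a naive nested-loop join with no usable index."""
    return left_rows * right_rows


def growth_factor(left_rows, right_rows, left_scale, right_scale):
    """How much more work the join does after scaling each table."""
    before = nested_loop_comparisons(left_rows, right_rows)
    after = nested_loop_comparisons(left_rows * left_scale, right_rows * right_scale)
    return after / before
```

Doubling one side of the join doubles the work, while doubling both sides (or doubling a table that is joined to itself) quadruples it. That is worse than linear, although for a two-table join it is quadratic rather than literally exponential.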

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. The "modules" developed on top of this framework can behave quite differently, but they are all pretty resource-intensive, some more than others. They all receive data as input, analyse and/or transform it, and send it further.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is to simply start sequentially each module with a well chosen bunch of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? Do you care about the absolute worst case, or that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
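To make those goals measurable, here is a minimal Python sketch that records per-call latency and overall throughput, then reads off a percentile. The measure and percentile helpers are illustrative names, and the nearest-rank percentile definition is one common choice among several.

```python
import math
import time


def measure(operation, inputs):
    """Return per-call latencies in seconds and throughput in calls per second."""
    latencies = []
    started = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        operation(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    throughput = len(latencies) / elapsed if elapsed > 0 else float("inf")
    return latencies, throughput


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting percentile(latencies, 95) next to the mean shows whether a small share of slow requests is hiding behind a good average. This single-threaded loop measures service time only; testing at a prescribed arrival rate needs a load generator on top of it.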
Third, you do not say if these modules are going to be CPU bound, I/O bound, if they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put realistic load into the system. Meaning, you record data from a production environment, and put the new hardware solution through this data. Getting this done is harder than it sounds. Often, this means adding all kinds of measurement points in the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
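The comparison can be reduced to a small calculation. This sketch assumes each offer is described by a purchase cost and a measured sustained throughput; the dictionary keys, the function names and the service-life parameter are illustrative choices, not a standard method.

```python
def cost_per_transaction(hardware_cost, transactions_per_second, service_years):
    """Hardware cost divided by the transactions it could serve over its life."""
    seconds = service_years * 365 * 24 * 3600
    return hardware_cost / (transactions_per_second * seconds)


def cheapest_adequate(offers, required_tps):
    """Among offers meeting the throughput requirement, pick the cheapest."""
    adequate = [offer for offer in offers if offer["tps"] >= required_tps]
    if not adequate:
        return None
    return min(adequate, key=lambda offer: offer["cost"])["name"]
```

The first function answers "what does each transaction cost on this box", the second answers "which box is enough". The two can disagree, which is exactly the trade-off described above.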
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization, that doesn't cost too much, that gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some operations users (or processes) are doing with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
1. Which functions are most often used?
2. How much data is transferred?
3. Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
4. Create a top ten for 1+2 and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, microbenchmarks, where you try to evaluate a piece of code in isolation or how a system deals with a narrowly defined workload: compare two sorting algorithms written in Java, or compare how fast two web browsers perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload: compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and such like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, CPU caches are getting filled with the necessary data. The same goes for disk caches. During the next few runs the VM continues to warm up, meaning the JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation. Plot a chart, look at it and see how much the results vary. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, another process starting a background task (like a virus scan), and the OS deciding to move the process to a different CPU core. On a NUMA architecture, the variation would be even more marked.
In case of microbenchmarks, all of this is a problem. Kill what processes you can before you begin. Use a benchmarking library that can do some of it for you. Like https://github.com/google/caliper and such like.
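A minimal sketch of the warm-up-then-measure routine in Python, using only the standard library. The benchmark function and its parameters are illustrative. On a JIT-compiled VM such as the JVM the warm-up matters far more than it does here, but the structure is the same.

```python
import statistics
import time


def benchmark(func, warmup_runs, measured_runs):
    """Time func repeatedly, discard the warm-up runs, summarise the rest."""
    for _ in range(warmup_runs):
        func()                  # fill caches; these timings are thrown away
    samples = []
    for _ in range(measured_runs):
        t0 = time.perf_counter()
        func()
        samples.append(time.perf_counter() - t0)
    return {
        "runs": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }
```

A large gap between mean and median, or a standard deviation close to the mean, is a hint that something else on the machine interfered and the run should be repeated.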
System benchmarking
In case of benchmarking a system under a realistic workload, these details do not really interest you and your problem is "only" to know what a realistic workload is, how to generate it and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long did a web page render) and these are I/O bound so the code gathering data does not slow down the system. (The page needs to be shipped to the user over the network, it does not matter if we also log a few numbers in the process).
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you the absolute time spent doing something; profiling gives you the relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (a common technique is to stop the world every few hundred ms and save a stack trace) and the instrumentation slows everything down significantly.

Resources