I have an automated test framework for testing hardware widgets. Right now only the pass/fail results of test cases are stored in a relational database using Hibernate. I'd like to change this so that various characteristics of the test are stored in the database (e.g. how many gerbils are running inside the widget, the inputs to various assertions in the tests, etc.).
Each test case is represented as a Java class, so the first thing I thought of was using hibernate to just create a table for each test case. However, we have lots and lots of test cases so I don't think that having a table for each test case is necessarily the best idea.
The amount and type of data for specific test cases will not change on different executions of the test case, but the data needed for each test case will be dramatically different. To use a silly example: for the gerbil-gnawing test we always want to record the age and color of the gerbils gnawing at the wires, but for the smash test we only need to record how many rocks were thrown at the widget.
Ideally we would be able to query this information from the database using SQL so the data can't be stored as binary blobs or other un-queryable entities.
Any ideas on how to structure the database to meet these requirements? Am I totally off-base on not wanting a large number of tables?
I'd say you have two major options:
Make your TestCase classes subclasses of a common superclass and then use one of the inheritance mapping strategies (http://hibernate.org/hib_docs/reference/en/html/inheritance.html). Don't worry too much about the number of tables/columns, but of course you should ensure that you do not hit a limit of your database engine.
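For the first option, here is a minimal sketch of what the mapped classes might look like with the single-table strategy (the class and column names are invented for illustration, loosely based on the gerbil/smash examples in the question; TABLE_PER_CLASS or JOINED would look the same from the Java side):

```java
import javax.persistence.*;

// Hypothetical sketch: one common superclass for all test results, mapped with
// the SINGLE_TABLE inheritance strategy and a discriminator column.
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(name = "test_type")
public abstract class TestResult {
    @Id @GeneratedValue
    private Long id;
    private boolean passed;          // the pass/fail flag you already store
    // common getters/setters omitted
}

@Entity
@DiscriminatorValue("GERBIL_GNAW")
class GerbilGnawTestResult extends TestResult {
    private int gerbilAgeMonths;     // test-specific columns live on the subclass
    private String gerbilColor;
}

@Entity
@DiscriminatorValue("SMASH")
class SmashTestResult extends TestResult {
    private int rocksThrown;
}
```

With SINGLE_TABLE, every test type shares one wide table and the unused columns are simply null; with JOINED, each subclass gets its own small table joined to the common one, which keeps the column count per table down at the cost of a join.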
Or use an EAV model (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model).
It is extremely flexible, but be aware that it causes extremely complex queries for simple questions.
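As a rough illustration of the EAV option (all names here are invented, not from the original framework), Hibernate's element-collection map is one way to get a generic attribute table without hand-writing the EAV plumbing:

```java
import java.util.HashMap;
import java.util.Map;
import javax.persistence.*;

// Hypothetical EAV-style sketch: one row per test execution plus a generic
// attribute table holding name/value pairs, so every test case can store
// whatever characteristics it needs without its own table.
@Entity
public class TestExecution {
    @Id @GeneratedValue
    private Long id;
    private String testCaseName;
    private boolean passed;

    @ElementCollection
    @CollectionTable(name = "test_execution_attribute",
                     joinColumns = @JoinColumn(name = "execution_id"))
    @MapKeyColumn(name = "attr_name")
    @Column(name = "attr_value")
    private Map<String, String> attributes = new HashMap<>();

    public void putAttribute(String name, String value) {
        attributes.put(name, value);
    }
}
```

Note that everything ends up stored as strings, so a question like "which widgets had gerbils older than six months" means joining on the attribute table and casting values, which is exactly where the complex queries mentioned above come from.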
Make sure that you run tests with realistic volumes before committing to one or the other.
I would not worry about a lot of tables. If that is the correct schema, then that's what it is. Databases are designed to handle it.
Here is my situation,
I have an application in which I need to store information about the results of different tests made on blood samples. I am currently using ASP.NET Core for the web application and SQL Server for the database. (I might switch to Postgres, as I will surely host on Linux and SQL Server for Linux is not fully available yet.)
All the tests have some information in common: who performed it, at what time, and any other related information for tracking purposes. But each of them also has specific variables that I need to save for reporting/further calculations.
As of now I have about 20 different types of tests we perform on the samples we receive. The question I have is what would be the best way to save that data?
The two options I see are the following:
Have 20 different tables, all containing the general sample tracking info plus the test-specific variables. This way, when I need to fetch the info, everything for a specific type of test is easily accessible. But then I need to join all of these tables whenever I want to generate a report or modify sample results information (all the test results/variables entry forms are on a single page). There are very few moments where I need to query only a specific type of test; most of the time I need to retrieve them all at once, which means that I will always (mostly) query the 20+ tables every time I need to access sample data.
Have one big table containing all the results for the different tests performed and serialize (in JSON format) only the test-specific variables. That way I would have all the tracking information available (queryable, searchable, etc.), but the variables and results of each test would be in a single serialized column.
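To make the second option concrete (the question uses ASP.NET Core/SQL Server, but keeping this thread's examples in one language, here is roughly the shape of that design as a Java/JPA entity with Jackson; every name is invented for illustration):

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Lob;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical sketch of option 2: the shared tracking columns stay as normal,
// queryable columns, while the test-specific variables are serialized to JSON
// into a single column.
@Entity
public class SampleTestResult {

    @Id @GeneratedValue
    private Long id;

    private String sampleId;                 // common tracking info
    private String testType;                 // e.g. "GLUCOSE"
    private String performedBy;
    private java.sql.Timestamp performedAt;

    @Lob
    @Column(name = "variables_json")
    private String variablesJson;            // serialized test-specific variables

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public void setVariables(Object variables) throws JsonProcessingException {
        this.variablesJson = MAPPER.writeValueAsString(variables);
    }

    public <T> T getVariables(Class<T> type) throws java.io.IOException {
        return MAPPER.readValue(variablesJson, type);
    }
}
```

If the "never queried directly" assumption stops holding later, SQL Server 2016's JSON functions (or Postgres's jsonb type) would still let you reach into that column, at a performance cost compared to real columns.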
It is important to know that the variables/results won't be queried directly; I don't need to filter by them or anything like that (yet, at the very least).
Now I wonder what would give me the best performance in the long term: the multiple tables with join queries, or the serialization/deserialization that needs to take place whenever I access the data.
Also, I am aware that by serializing the test results/variables, I am losing the ability to query by the information they contain (except that SQL Server 2016 now includes a way to query JSON data, if I'm not mistaken...).
I also try to follow best practices by normalizing the database, but I'm not a pro and I don't know what the best approach would be between my two options (or any other option if there is a better alternative; I'm totally open to better ideas).
So what would be the best approach and why?
Usage estimate
There might be around 15 to 30 million tests performed every year, of which I would say two thirds would be of 5 different blood tests and the other third would be all the other tests performed.
A different table for each type of test is a good idea to work with.
Reason 1: If only 10 of the tests are performed on a sample, the rest of the columns in a single wide table would just waste DB space.
Reason 2: Creating reports per sample will be easier in the future.
Reason 3: Filtering the data will be easier.
Reason 4: Maintenance will be easier.
If all the tests are mandatory for every sample, go with one table.
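If you do go the separate-tables route, one way to avoid repeating the tracking columns in 20 entity definitions is to pull them into a shared base class. A rough Java/JPA sketch of the table layout (the question's stack is ASP.NET Core, so this is only to illustrate the idea; all names are invented):

```java
import javax.persistence.*;

// Hypothetical sketch of the per-test-type layout: the shared tracking fields
// live in a @MappedSuperclass, so each concrete test maps to its own table that
// repeats only those common columns plus its specific variables.
@MappedSuperclass
public abstract class BloodTestBase {
    @Id @GeneratedValue
    protected Long id;
    protected String sampleId;
    protected String performedBy;
    protected java.sql.Timestamp performedAt;
}

@Entity
@Table(name = "glucose_test")
class GlucoseTest extends BloodTestBase {
    private double glucoseMgPerDl;           // test-specific result
}

@Entity
@Table(name = "coagulation_test")
class CoagulationTest extends BloodTestBase {
    private double prothrombinTimeSeconds;
    private double inr;
}
```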
For what seems like forever, I've read that when testing you should use a mock database object or repository. No reason to test someone else's DB code, right? No need to have your code actually mess with data in a database, right?
Now lately I see tests which set up a database (possibly in-memory) and seed it with test data, just for running tests against.
Is one approach better than the other? If tests with seeded data are worth running, should one even bother with mock database connections? If so, why?
There are a lot of ways to test code that interacts with a database.
The repository pattern is one method of creating a facade over the data access code. It makes it easy to stub/mock out the repository during tests. This is useful when a piece of business logic needs to be tested in isolation and dummy values can help test different branches of the code.
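As a small illustration, here is a sketch of stubbing a repository with Mockito in a JUnit test; the repository, service, and values are all made up for the example:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Arrays;
import org.junit.Test;

// Hypothetical types: a repository facade over the data access code and a
// service containing the business logic under test.
interface OrderRepository {
    java.util.List<Double> findOrderTotals(long customerId);
}

class BillingService {
    private final OrderRepository orders;
    BillingService(OrderRepository orders) { this.orders = orders; }

    double totalSpent(long customerId) {
        return orders.findOrderTotals(customerId)
                     .stream().mapToDouble(Double::doubleValue).sum();
    }
}

public class BillingServiceTest {
    @Test
    public void sumsTheCustomersOrders() {
        // Stub the repository so no database is touched.
        OrderRepository repo = mock(OrderRepository.class);
        when(repo.findOrderTotals(42L)).thenReturn(Arrays.asList(10.0, 2.5));

        assertEquals(12.5, new BillingService(repo).totalSpent(42L), 0.0001);
    }
}
```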
Fake databases (in-memory or local files) are less common because there needs to be some "middle-ware" that knows how to read data from a real database and a fake database. It usually just makes sense to have a repository over the whole thing and mock out the repository. This approach is more feasible in some older systems where there is an existing infrastructure. For instance, you use a real database and then switch over to a fake database for test performance reasons.
Another option is using an actual database, populating it with bogus data. This approach is slower and requires writing a lot of scripts. However, this approach is fairly common as part of integration testing. I used to write a lot of "transactional" tests where I used a database transaction to rollback changes after running my tests. I'd write one large test that collectively performed all of my CRUD operations on a particular table.
The last approach makes sense when you are testing the code that converts SQL results into your objects. Your SQL could be invalid (or you use the wrong stored procedure name). It is also easy to forget to check for nulls, perform an invalid cast, etc. when mapping to objects. This code should be tested at some point. An ORM can help alleviate a lot of this testing.
I am typically pretty lazy these days. I use repositories. Most of my data layer code is touched when performing actual integration tests (hitting a real database with dummy data), so I don't bother testing individual database calls (no more transactional tests). I also use ORMs for doing most of my SELECT statements. I think a lot of the industry is moving towards this more lazy approach.
You should use both.
The business services should rely on DAOs, and be tested by mocking the DAOs. This allows for fast, easy to implement, easy to maintain tests.
The DAOs' unique responsibility is to contain database access code (queries, etc.), and they should also be tested. So you should use a test database, with test data, and check that their queries return/save what they're supposed to return/save.
I'm not a big fan of using an in-memory database, different from the one used in production. The behavior of some queries, constraints, etc. will be different from database to database, and you'd better be sure that the code will work on the production database, and not in an in-memory database used only by tests.
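For the DAO side, here is a sketch of what such a test could look like with plain JDBC and JUnit, rolling the transaction back afterwards. The connection URL, table, and DAO are assumptions for illustration; the point is that it runs against a dedicated test database on the same engine as production:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical DAO with real SQL in it: this is the code the mocks never cover.
class CustomerDao {
    private final Connection connection;
    CustomerDao(Connection connection) { this.connection = connection; }

    void save(String name) throws Exception {
        try (PreparedStatement ps =
                connection.prepareStatement("INSERT INTO customer(name) VALUES (?)")) {
            ps.setString(1, name);
            ps.executeUpdate();
        }
    }

    int countByName(String name) throws Exception {
        try (PreparedStatement ps =
                connection.prepareStatement("SELECT COUNT(*) FROM customer WHERE name = ?")) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}

public class CustomerDaoIT {
    private Connection connection;

    @Before
    public void openTransaction() throws Exception {
        // A dedicated test database on the same engine as production (URL is an assumption).
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost/testdb", "test", "test");
        connection.setAutoCommit(false); // keep all test changes in one transaction
    }

    @Test
    public void savedCustomerCanBeQueriedBack() throws Exception {
        CustomerDao dao = new CustomerDao(connection);
        dao.save("Ada");
        assertEquals(1, dao.countByName("Ada"));
    }

    @After
    public void rollBack() throws Exception {
        connection.rollback(); // undo everything the test inserted
        connection.close();
    }
}
```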
I'm wondering if there are any benefits to be gained by taking advantage of the possibility of altering the database's table structures on the fly and querying the DB engine for the current structure before making queries (or just using the fields that are sure to be in the table). Are there any examples of systems that use such an approach?
Check out http://blog.mongodb.org/post/119945109/why-schemaless.
From my personal experience I've found the following benefits:
Easier to change
Some conceptual structures are just too much of a pain to use in a normalized set of tables. For example, at my job we have a custom internal CMS system. To represent one "article" we have around 35 tables. When we have to run queries to generate one full article it is very painful.
When that same article is represented as a document, all the information is still there, it's just considerably easier to change the object in code and then serialize and de-serialize instead of writing queries with 20 joins.
Versioning
It makes versioning so much easier. You might start with one schema for your system and later decide you need to add/remove fields. In a traditional RDBMS this can be painful to deploy. With a document database, you just insert new documents with a different schema and everything is fine. (Keep in mind document data stores give you tools to handle this!)
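For example, with the MongoDB Java driver, old and new document shapes can coexist in the same collection without any migration (the database and collection names here are made up for illustration):

```java
import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Hypothetical sketch: two "article" documents with different shapes living in
// the same collection, which is what makes incremental schema changes painless.
public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> articles =
                    client.getDatabase("cms").getCollection("articles");

            // An "old" article written before the schema grew new fields.
            articles.insertOne(new Document("title", "Launch post")
                    .append("body", "Hello world"));

            // A "new" article with extra fields; no migration needed.
            articles.insertOne(new Document("title", "Follow-up")
                    .append("body", "More details")
                    .append("tags", Arrays.asList("news", "update"))
                    .append("schemaVersion", 2));
        }
    }
}
```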
Faster Development
This flows out of things being easier to change. The gain is significant from what I've seen. No more spending time making tables and making sure the data types are perfect and everything is very normalized.
Spreading test data across multiple small data sets seems to me to create a maintenance headache whenever the schema is tweaked. Does anybody see a problem with creating a single larger test data set? By "larger" I'm still only talking about a couple hundred records in total.
I would not use a single large dataset (you want to avoid any overhead you don't need) and would instead follow DbUnit's best-practice recommendations:
Use multiple small datasets

Most of your tests do not require the entire database to be re-initialized. So, instead of putting your entire database data in one large dataset, try to break it into many smaller chunks.

These chunks could roughly correspond to logical units, or components. This reduces the overhead caused by initializing your database for each test. It also facilitates team development, since many developers working on different components can modify datasets independently.

For integrated testing, you can still use the CompositeDataSet class to logically combine multiple datasets into a larger one at run time.
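Here is a sketch of what that looks like in code with DbUnit; the dataset file names and the connection setup are assumptions:

```java
import java.io.File;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.CompositeDataSet;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;

// Hypothetical sketch: keep one small flat-XML dataset per component and only
// combine them for the integration-style tests that really need everything.
public class DataSetLoader {

    // For most unit tests: load just the small dataset the test needs.
    public static void loadUsersOnly(IDatabaseConnection connection) throws Exception {
        IDataSet users = new FlatXmlDataSetBuilder().build(new File("datasets/users.xml"));
        DatabaseOperation.CLEAN_INSERT.execute(connection, users);
    }

    // For integrated tests: stitch the small chunks together at run time.
    public static void loadAll(IDatabaseConnection connection) throws Exception {
        IDataSet users  = new FlatXmlDataSetBuilder().build(new File("datasets/users.xml"));
        IDataSet orders = new FlatXmlDataSetBuilder().build(new File("datasets/orders.xml"));
        IDataSet combined = new CompositeDataSet(new IDataSet[] { users, orders });
        DatabaseOperation.CLEAN_INSERT.execute(connection, combined);
    }
}
```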
Some more feedback from the Unitils folks:
Automatic test database maintenance
When writing database tests, keep in mind following guidelines:
Use small sets of test data, containing as little data as possible. In your data files, only specify the columns that are used in joins or in the where clause of the tested query.
Make data sets test-class specific. Don't reuse data sets between different test classes; for example, do not use one big domain data set for all your test classes. Doing so will make it very difficult to change the test data for one test without breaking anything for another test. You are writing a unit test, and such a test should be independent of other tests.
Don't use too many data sets. The more data sets you use, the more maintenance is needed. Try to reuse the test class data set for all tests in that test class. Only use method-level data sets if it makes your tests more understandable and clear.
Limit the use of expected result data sets. If you do use them, only include the tables and columns that are important for the test and leave out the rest.
Use a database schema per developer. This allows developers to insert test data and run tests without interfering with each other.
Disable all foreign key and not-null constraints on the test databases. This way, the data files need to contain no more data than absolutely necessary.
Using small datasets with just enough data has worked decently for us in the past. Sure, there is some maintenance if you tweak the database but this is manageable with some organization.
I would like to ask about your suggestions concerning unit testing against large databases.
I want to write unit tests for an application which is mostly implemented in T-SQL so mocking the database is not an option.
The database is quite large (approx. 10GB) so restoring the database after a test run is also practically impossible.
The application's purpose is to manage the handling of applications for credit agreements. There are users in specific roles that change the state of agreement objects and my job is to test part of this process.
I'm considering two approaches:
First Approach
Create agreements that meet specific conditions and then test changes of agreement state (e.g. the transition from "waiting in some office" to "handled in this specific office"). The agreements will be created in the application itself and they will be my test cases. All the tests will run in transactions that are rolled back after the tests are performed.
Advantages
The advantage of this approach is that the tests are quite straightforward. The expected data can easily be described, because I know exactly what the object should look like after the transition.
Disadvantages
The disadvantage is that the database cannot be allowed to change in a way that would break the tests. The users and agreements used in the test cases must always look the same, and if the database ever needs to change, the preparation process will have to be repeated.
Second Approach
Create agreements in the unit tests. Programmatically create agreements that meet specific conditions. The data used for creating each agreement will be chosen randomly, and the users that change the agreement state will also be created randomly.
Advantages
The advantage of this approach is the ease of making changes to the objects and the ability to run the tests on databases with different data.
Disadvantages
Both objects (agreement and user) have lots of fields and related data, and I'm afraid it would take some time to implement the creation of these objects. (I'm also afraid that the created objects may contain errors, because the creation method will be quite hard to implement without mistakes.)
What do you think about these two approaches?
Do any Stack Overflow readers think it is worth the effort to create objects as described in the second approach?
Does anyone here have any experience creating such tests?
I'm not sure I entirely agree with your assumption that you cannot restore the database after a test run. While I definitely agree that some tests should be run on a full-size, multi-TB database, I don't see why you can't run most of your tests on a much, much smaller test database. Are there constraints that need to be tested like "Cannot be more than a billion identical rows?"
My recommendation would actually be to use a smaller test database for most of your functional specs, and to create-drop all of its tables with each test, with as little sample data as is necessary to test your functionality.
For creating fixture data for tests, you have a few choices:
(a) Create a script that creates an empty database, and then adds a small number of records as the fixture data. This data can be hand-constructed, or a few records from the real database. This is the Rails approach, and pretty common in the Java world.
(b) It's also common to use a "factory" to create this data (some sort of application code). There is an initial investment in building these classes, but once they are built they can be re-used for all your tests. This is now very popular in Ruby/Rails code. (This is your Second Approach above.)
(c) Of course you can use a copy of the "production" data and try to test against that. But this is probably the hardest approach, as you will always be competing against the real world changing the data. And it also tends to be orders of magnitude slower than a small set of fixture data.
There's definitely a cost to getting from state (c) to state (a) or (b), but it is an investment in the future. It won't take that long; even if it takes a whole day, the speed-up in running the tests will make up for it quickly.
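As a sketch of the factory approach described in (b), here is roughly what a small test-data builder could look like in Java; the Agreement class and its fields are invented for illustration:

```java
import java.util.Random;

// A hypothetical, minimal domain class standing in for the real agreement object.
class Agreement {
    final long customerId;
    final String officeCode;
    String state;

    Agreement(long customerId, String officeCode, String state) {
        this.customerId = customerId;
        this.officeCode = officeCode;
        this.state = state;
    }
}

// Hypothetical test-data factory for option (b): it builds a valid agreement with
// sensible defaults so each test only overrides the fields it actually cares about.
public class AgreementFactory {

    private static final Random RANDOM = new Random();

    private String officeCode = "HQ";
    private String state = "WAITING";
    private long customerId = 1000 + RANDOM.nextInt(9000); // random but valid-looking

    public AgreementFactory inOffice(String officeCode) {
        this.officeCode = officeCode;
        return this;
    }

    public AgreementFactory withState(String state) {
        this.state = state;
        return this;
    }

    public Agreement build() {
        // The knowledge of "what makes a valid agreement" lives here, in one
        // place, instead of being repeated in every test.
        return new Agreement(customerId, officeCode, state);
    }

    // Usage in a test:
    //   Agreement waiting = new AgreementFactory().inOffice("BRANCH-7").build();
}
```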
There's a somewhat independent issue. After you get your data into the database and then run your tests, you need to restore it. There are a few common approaches:
(1) Roll back the transaction. This is a great way to go, if practical. Sometimes, though, you actually need to confirm that transactions completed, so this doesn't work.
(2) Just re-load a fresh set of fixture data. If your fixture data is small this is workable, though a little slower than (1).
(3) Manually undo what the tests have done. This is the most error-prone and difficult approach, but it is possible.
Recommendation?
It sounds like your application is complicated. I'd recommend hand-crafting a small set of data for your tests (approach (a)). Keep it separate from your main database so that it's easier to keep track of and reload. Try to roll back transactions, but if that doesn't work for you, you can reload from a script before each test (remember, the data is small).
The other piece of the puzzle is database migrations, if you don't already have that nailed down. These are scripts that you use to evolve your database. If you have them organized and automated, you can apply them to your test/fixture data as well as your production data.
How about running everything in a transaction and then rolling it back? E.g.:
BEGIN TRANSACTION;
EXEC dbo.DoThings;      -- exercise the code under test
EXEC dbo.VerifyResult;  -- assert on the resulting state
ROLLBACK TRANSACTION;   -- undo everything the test changed