Techniques for populating large databases when testing

We are running 3 Oracle databases with 100+ tables in each. In order to test our code we are looking into alternatives for populating them with test data. The tools we have found thus far are DBSetup and DBUnit.
The problem we are having with these tools is all the manual work needed to specify data. For example, if I am to test a table D, I am also required to populate tables A, B, and C with data. I don't care what that data is; I only care about the data in table D. The reason I also have to populate A, B, and C is the consistency checks on the foreign keys in table D.
My question is: how is this type of problem usually handled? Is this a sign of a badly designed database from a testability point of view?

If the database is strictly for testing purposes, I don't see anything stopping you from dropping the consistency checks (FKs etc.), populating the data, running your tests, truncating the tables, and re-adding the constraints afterwards.
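On Oracle, for example, that cycle might look roughly like this (disabling rather than dropping the constraint achieves the same effect with less rework; the table and constraint names are made up for illustration):

    -- Switch the check off, load and test, clean up, switch it back on
    ALTER TABLE d DISABLE CONSTRAINT d_a_fk;
    -- ... populate table D and run the tests ...
    TRUNCATE TABLE d;
    ALTER TABLE d ENABLE CONSTRAINT d_a_fk;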
Other alternatives I can think of:
1. Copy the table structure (columns etc.) and do your testing there.
2. Alter all FKs to DEFERRABLE INITIALLY DEFERRED, which postpones the consistency checking until you commit the transaction (see the sketch below).
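A minimal sketch of option 2 on Oracle (table, column and constraint names are my own invention; note that Oracle does not let you change deferrability in place, so the FK has to be dropped and re-created):

    ALTER TABLE d DROP CONSTRAINT d_a_fk;
    ALTER TABLE d ADD CONSTRAINT d_a_fk
        FOREIGN KEY (a_id) REFERENCES a (id)
        DEFERRABLE INITIALLY DEFERRED;

    -- Inside a test transaction the check is now postponed until COMMIT,
    -- so rows can go into D without their parents ever existing
    INSERT INTO d (id, a_id, payload) VALUES (1, 999, 'test row');
    -- ... run assertions against D ...
    ROLLBACK;  -- nothing is committed, so the deferred check never fires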

We stopped using DbUnit and started to populate the DB programmatically. This way you can create one factory method that creates your entity with all its dependencies, and then you can call this one method as many times as you need.
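A hedged sketch of that idea in PL/SQL, using the A/B/C/D tables from the question (all column names are invented): one helper creates the parent rows with throwaway defaults so each test only has to care about the row in D:

    CREATE OR REPLACE PROCEDURE make_test_d(p_id IN NUMBER) AS
    BEGIN
        -- Dummy parents so D's foreign keys are satisfied
        INSERT INTO a (id, name) VALUES (p_id, 'dummy');
        INSERT INTO b (id, name) VALUES (p_id, 'dummy');
        INSERT INTO c (id, name) VALUES (p_id, 'dummy');
        -- The row the test actually cares about
        INSERT INTO d (id, a_id, b_id, c_id, payload)
        VALUES (p_id, p_id, p_id, p_id, 'test');
    END;
    /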
I would avoid removing constraints during testing. If the code you test creates or modifies data, you may get false positive results without the constraints in place. The same goes for performance.
If you have to populate 100 tables just to insert anything into table X, then yes, I would say there might be a problem with the design; think about whether you can divide your domain into smaller aggregate roots. But if you only need a few other tables, I would say there is nothing wrong with it.

How to store order history in database, especially old picture paths?

(table: order_items)
I'm not sure if this is the correct way to implement an order history table in my database. Normally I try to reduce redundancy, but because the user can change the data in his/her offer, I need to save the minimum information about the order.
Goal: Buyer can see his/her old orders with correct title/pictures/origin path/allergens (long story...)
What speaks against my approach?
The only "fear" is that the table is going to be bloated with a lot of redundancy information.
This started out as a comment but it's getting too long, so...
What database are you working with?
SQL Server, for instance, introduced the concept of temporal tables in the 2016 version. Basically you have two tables identical in structure: one is the main table, where you can use DML just as you would with a normal table, and the other is a read-only table storing the historical data. When you update a record in the main table, what actually happens is that the old version of the record gets copied into the history table first, and the main record is then updated.
Something similar might exist in other databases as well, and it can also be implemented quite easily using triggers in case your database does not provide it out of the box.
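For reference, a SQL Server 2016+ system-versioned table might look like this (table and column names are invented for the order-items example):

    CREATE TABLE dbo.OrderItems (
        OrderItemId INT IDENTITY PRIMARY KEY,
        Title       NVARCHAR(200) NOT NULL,
        PicturePath NVARCHAR(400) NULL,
        ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.OrderItemsHistory));

    -- Point-in-time query: what did the item look like when the order was placed?
    SELECT Title, PicturePath
    FROM dbo.OrderItems
    FOR SYSTEM_TIME AS OF '2023-01-15T10:00:00'
    WHERE OrderItemId = 42;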
Of course, you could use the technique called "soft delete": instead of actually deleting data you simply mark it as deleted, and instead of updating data you create a new record with the updated values and change the status of the existing record to Inactive.
The major advantage of this approach over temporal tables is that you still only have one table per entity instead of two. On the other hand, the advantage of temporal tables is that the active data is kept separate from the historical data, so the active data sits in a relatively small table and, as a result, all CRUD operations are more efficient.
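A minimal sketch of the soft-delete/versioning variant (again with invented names): one table, a status flag, and a new row per change instead of an in-place UPDATE:

    CREATE TABLE dbo.OrderItemVersions (
        VersionId   INT IDENTITY PRIMARY KEY,
        OrderItemId INT NOT NULL,                          -- logical item key
        Title       NVARCHAR(200) NOT NULL,
        PicturePath NVARCHAR(400) NULL,
        Status      VARCHAR(10) NOT NULL DEFAULT 'active', -- 'active' | 'inactive'
        CreatedAt   DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- "Updating" an item: retire the current version, then insert the new one
    UPDATE dbo.OrderItemVersions SET Status = 'inactive'
    WHERE OrderItemId = 42 AND Status = 'active';
    INSERT INTO dbo.OrderItemVersions (OrderItemId, Title, PicturePath)
    VALUES (42, N'New title', N'/img/new.jpg');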
The "fear" of having a bloated table in this day and age when memory and storage are so cheep seems a bit strange to me.

Database design: one large table versus several smaller tables

I have to create a database to store information being sent to and received from a 3rd party web service portal. There are about 150 fields of information to be sent, though I can remove about 50 of those fields by normalising (there are three sets of addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins, but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down into more manageable chunks, but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value, and I cannot normalise these columns any further.
My question is: in this situation, is it acceptable to have a table with c. 100 columns in it, or should I try to break it down into several tables for presentation's sake?
Please note: the table structure will not change over the course of its usage; a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't, for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer; the comments posted after it helped me make up my mind, and I decided to go with option 1. As the data is accessed in full, having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row, then I'd see about breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view: first, there is nothing against having 100 columns in a table if they are related. The point is that if, after normalizing, you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (it avoids data duplication and the associated update/delete anomalies, has a smaller data footprint, etc.).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for a performance benefit. The key here is measurable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
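For the address part the asker mentions, the extraction could look roughly like this (all names are hypothetical); the same pattern applies to any other repeating group of columns:

    CREATE TABLE dbo.Address (
        AddressId INT IDENTITY PRIMARY KEY,
        Line1     NVARCHAR(100) NOT NULL,
        City      NVARCHAR(60)  NOT NULL,
        PostCode  NVARCHAR(20)  NOT NULL
    );

    CREATE TABLE dbo.PortalRequest (
        RequestId         INT IDENTITY PRIMARY KEY,
        BillingAddressId  INT NOT NULL REFERENCES dbo.Address (AddressId),
        DeliveryAddressId INT NOT NULL REFERENCES dbo.Address (AddressId),
        ContactAddressId  INT NOT NULL REFERENCES dbo.Address (AddressId)
        -- ... plus the remaining request columns ...
    );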
From MSDN: Maximum Capacity Specifications for SQL Server:
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is OK in your case. You may also want to note (from the same link):
Columns per primary key: 16
Of course, this only applies if you need the data purely as a log for the service.
If, after reading from the service, you need to maintain the data, then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.

How to deal with extremely wide tables in SQL server?

I have encountered the following dilemma several times and would be interested to hear how others have addressed this issue or if there is a canonical way that the situation can be addressed.
In some domains, one is naturally led to consider very wide tables. Take, for instance, time series surveys that evolve over many years. Such surveys can have hundreds, if not thousands, of variables, yet typically only a few thousand or tens of thousands of rows. It is absolutely natural to consider such a result set as a table where each variable corresponds to a column; however, in SQL Server at least, one is limited to 1,024 (non-sparse) columns.
The obvious workarounds are to
Distribute each record over multiple tables
Stuff the data into a single table with columns of say, ResponseId, VariableName, ResponseValue
Number 2, I think, is very bad for a number of reasons (difficult to query, suboptimal storage, etc.), so really the first choice is the only viable option I see. It could perhaps be improved by grouping columns that are likely to be queried together into the same table, but one can't really know that until the database is actually in use.
So, my basic question is: Are there better ways to handle this situation?
You might want to put a view in front of the tables to make them appear as if they were a single table. The upside is that you can rearrange the storage later without queries needing to change. The downside is that a modification through the view can only touch one of the underlying base tables at a time. If necessary, you could mitigate this with stored procedures for frequently used modifications. Based on your use case of time series surveys, it sounds like inserts and selects are far more frequent than updates or deletes, so this might be a viable way to stay flexible without forcing your clients to update if you need to rearrange things later.
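A hedged sketch of that arrangement in T-SQL (table, column and view names are all invented):

    CREATE TABLE dbo.Response_Core (
        ResponseId   INT PRIMARY KEY,
        RespondentId INT NOT NULL,
        SurveyYear   SMALLINT NOT NULL
    );
    CREATE TABLE dbo.Response_Demographics (
        ResponseId INT PRIMARY KEY REFERENCES dbo.Response_Core (ResponseId),
        Age        INT NULL,
        Region     NVARCHAR(50) NULL
        -- ... more variables; further blocks of columns go in further tables ...
    );
    GO

    -- Queries keep seeing one logical "table"
    CREATE VIEW dbo.Response AS
    SELECT c.ResponseId, c.RespondentId, c.SurveyYear, d.Age, d.Region
    FROM dbo.Response_Core AS c
    LEFT JOIN dbo.Response_Demographics AS d ON d.ResponseId = c.ResponseId;
    GO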
Hmmm, it really depends on what you do with it. If you want to keep the table as wide as it is (possibly this is for OLAP or a data warehouse), I would just use proper indexes. Based on the columns that are selected most often, I could also use covering indexes, and based on the rows that are searched most often, filtered indexes. If there are, let's say, billions of records in the table, you could partition the table as well. If you do want to store the data over multiple tables, definitely use proper normalization techniques, probably up to 3NF or 3.5NF, to divide the big table into smaller tables. I would use the first of your methods, normalization, to store the data for the big table, just because it seems to make better sense to me that way.
This is an old topic, but something we are currently working on resolving. Neither of the above answers really gives as many benefits as the solution we feel we have found.
We previously believed that having wide tables wasn't really a problem. Having spent time analysing this, we have seen the light and realised that the cost of inserts/updates is indeed getting out of control.
As John states above, the solution is really to create a VIEW to provide your application with a consistent schema. One of the challenges in any redesign may be, as in our case, that you have thousands or millions of lines of code referencing an old wide table and you may want to provide backwards compatibility.
Views can also be used for UPDATEs and INSERTs, as John alludes to. However, a problem we found initially was this: take the example of myWideTable, which may have hundreds of columns, and split it into myWideTable_a with columns a, b and c and myWideTable_b with columns x, y and z. An insert through the view which only sets column a will only insert a record into myWideTable_a.
This causes a problem when you later want to update your record and set myWideTable.z, as this will fail: there is no row in myWideTable_b to update.
The solution we're adopting, and performance testing, is to have an INSTEAD OF trigger on the view's INSERT that always inserts into each of the split tables, so that we can continue to update or read through the view with impunity.
The question as to whether this trigger on inserts adds more overhead than a wide table is still open, but it is clear that it will improve subsequent writes to columns in each split table.
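A sketch of that trigger, assuming the view keeps the old name myWideTable and both split tables share an id key (the key column is my assumption):

    -- INSTEAD OF INSERT on the backwards-compatible view: always create a row
    -- in BOTH split tables so later updates through the view cannot fail
    CREATE TRIGGER dbo.trg_myWideTable_insert
    ON dbo.myWideTable            -- the view, not a base table
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.myWideTable_a (id, a, b, c)
        SELECT id, a, b, c FROM inserted;

        INSERT INTO dbo.myWideTable_b (id, x, y, z)
        SELECT id, x, y, z FROM inserted;  -- columns may be NULL, but the row exists
    END;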

Trying to design a column that should sum values from another table

Sorry if the title is not very clear; I will try to explain:
I have two tables: table A and table B. The relation between them is one (for table A) to many (for table B), so it's something like a master-detail situation.
I have a column 'Amount' in table B, which is obviously a decimal, and a column 'TotalAmount' in table A.
I'm trying to figure out how to keep the value in table A up to date. My suggestion is to make a view based on table A with an aggregate query summing the Amounts from table B. With the proper indexes, of course...
But my team-mate suggests updating the value in table A every time we change something in table B from our application.
I wonder what the best solution would be here. Maybe a third variant?
Some clarification: we expect these tables to be the fastest growing in our database, and table B will grow much, much faster than table A. The most frequent operation on table B will be inserts, and almost nothing else. The most frequent operation on table A will be selects, but not only selects.
I see a number of options:
Use an insert trigger on table B and do the update of table A there, as your friend suggests. This will keep table A as up to date as possible.
Have a scheduled job that updates table A every x minutes (x = whatever makes sense for your application).
When updating table B, do an update on table A in your application logic. This may not work out if you update table B in many places.
If you have a single place in your app where you insert new rows into table B, then the simplest solution is to send an UPDATE A SET TotalAmount = TotalAmount + ? WHERE ID = ? and pass the same values you just used for the insert into B. Make sure you wrap both queries (the insert and the update) in a transaction so that either both happen or neither does.
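A minimal sketch of that, assuming tables A(ID, TotalAmount) and B(ID, A_ID, Amount), with column names guessed from the question:

    BEGIN TRANSACTION;  -- exact syntax varies: BEGIN / START TRANSACTION / handled by your driver
    INSERT INTO B (A_ID, Amount) VALUES (42, 19.99);
    UPDATE A SET TotalAmount = TotalAmount + 19.99 WHERE ID = 42;
    COMMIT;             -- both statements take effect together, or neither does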
If that's not simple, then your next option is a database trigger. Read the docs for your database on how to create one. Basically, a trigger is a small piece of code that gets executed when something happens in the DB (in your case, when someone inserts data into table B).
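For example, a hypothetical insert trigger in MySQL-flavoured syntax (other databases differ in the details):

    CREATE TRIGGER trg_b_after_insert
    AFTER INSERT ON B
    FOR EACH ROW
        UPDATE A
        SET TotalAmount = TotalAmount + NEW.Amount
        WHERE ID = NEW.A_ID;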
The view is another option, but it may cause performance problems during selects which you'll find hard to resolve. Try a "materialized view" or a "computed column" instead (but these can cause performance problems when you insert/remove rows).
If this value is going to change a lot, you're better off using a view: it is definitely the safer implementation. But even better would be to use triggers (if your database supports them).
I would guess your mate suggests updating the value on each insert because he thinks that you will need the value quite often and that recalculating it each time might lead to a slow-down. If that is the case:
Your database should take care of the caching, so this probably won't be an issue.
If it is, nonetheless, you can add that feature at a later stage - this way you can make sure your application works otherwise and will have a much easier time debugging that cache column.
I would definitely recommend using a trigger over using application logic, as that ensures that the database keeps the value up-to-date, rather than relying on all callers. However, from a design point of view, I would be wary of storing generated data in the same table as non-generated data -- I believe it's important to maintain a clear separation, so people don't get confused between what data they should be maintaining and what will be maintained for them.
However, in general, prefer views to triggers -- that way you don't have to worry about maintaining the value at all. Profile to determine whether performance is an issue. In Postgres, I believe you could even create an index on the computed values, so the database wouldn't have to look at the detail table.
The third way, recalculating periodically, will be much slower than triggers and probably slower than a view. That it's not appropriate for your use anyway is the icing on the cake :).

Pros and Cons of massive table that controls all data flow with stored procs

Our DBA (with only 2 years of Google for training) has created a massive data management table (108 columns and growing) containing all the necessary attributes for any data flow in the system. We'll call this table BFT for short.
Of these columns:
10 are for meta-data references.
15 are for data source and temporal tracking.
1 instance of new/curr columns for textual data.
10 instances of new/current/delta/ratio/range columns for multi-value numeric updates, totaling 50 columns.
Multi-valued numeric updates usually only need 2-5 of the update groups.
Batches of 15K-1500K records are loaded into the BFT and processed by stored procs with logic to validate those records and shuffle them off to permanent storage in about 30 other tables.
In most of the record loads, 50-70 of the columns are empty throughout the entire process.
I am no database expert, but this model and process seem to smell a little; I don't know enough to say why, though, and I don't want to complain without being able to offer an alternative.
Given this very small insight into the data processing model, does anyone have thoughts or suggestions? Can the database (SQL Server) be trusted to handle records with mostly empty columns efficiently, or does processing in this manner waste lots of cycles/memory, etc.?
Sounds like he reinvented BizTalk.
I typically have multiple staging tables corresponding to the input loads. These may or may not correspond to the destination tables, but we don't do what you're talking about. If he doesn't like to have a lot of what are basically temporary work tables, they could be put into their own schema or even a separate database.
As for the columns which are empty: if they aren't referenced in the particular query which is processing the BFT, it doesn't matter. HOWEVER, what happens is that the indexing becomes much more crucial, and the index chosen needs to be a non-clustered covering index. When your BFT is used and a table scan or clustered index scan is chosen, the unused columns have to be read and ignored or skipped, and this definitely seems to affect processing in my experience. With a non-clustered index scan or seek, fewer columns are read, and hopefully these don't include (m)any of the unused columns.
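For example, a hypothetical covering non-clustered index (column names invented): the key covers what the load procs filter on, and INCLUDE carries only the handful of columns they actually read, so the 50-70 unused columns never have to be touched:

    CREATE NONCLUSTERED INDEX IX_BFT_BatchProcessing
    ON dbo.BFT (BatchId, LoadStatus)
    INCLUDE (SourceSystem, LoadedAt, NewValue1, CurrValue1);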
Normalization is the keyword here. If you have so many NULL values, chances are high that you're wasting a lot of space. Normalizing the table should also make data integrity in this table easier to enforce.
One thing that might make things a little more flexible (other than normalizing) could be to create one or more views or table functions to present the data. Particularly if the table is outside your control, these would enable you to filter the spurious crap out and grab only what you need from the table.
However, if you're going to be one of the people who will be working with (and frowning every time you have to crack open) that massive table, you might want to trump the DBA's "design" and normalize that beast, and maybe give the DBA the task of creating some views and/or table functions to help you out.
I currently work with a similar but not so huge table which has been around on our system for years and has had new fields and indices and constraints rather hastily tacked on Frankenstein-style. Unfortunately some other workgroups rely on the structure as gospel, so we've created such views and functions to enable us to "shape" the data the way we need it.
