How to create a 'sanitized' copy of our SQL Server database? - sql-server

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this: manually create each table with the sanitized table and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time-consuming and tedious because of the manual SSIS column mapping and the manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make much sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists for the first round of analysis (there will be many rounds).

You basically want to scrub the data and objects, correct? Here is what I would do.
Restore a backup of the db.
Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.)
Create a mapping table with two columns and populate it so that each row has the original table name and the new table name.
Write a script that iterates through that table, row by row, and renames your tables. Better yet, put the data into Excel and create a third column that builds the T-SQL you want, then cut/paste and execute it in SSMS.
Repeat step 4, but for all columns. Best to query sys.columns to get all the objects you need, put the results into Excel, and build your T-SQL (a sketch of this generation follows after these steps).
Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
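A minimal sketch of that script generation for steps 4 and 5, assuming you are happy to take the Table1/Table2 and Column1/Column2 numbering straight from the catalog views; review the output, and run the column renames before the table renames (or regenerate the column script in between):
-- Table renames (Table1, Table2, ...) generated from sys.tables.
SELECT 'EXEC sp_rename ''' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name)
     + ''', ''Table' + CAST(ROW_NUMBER() OVER (ORDER BY t.object_id) AS varchar(10)) + ''';'
FROM sys.tables AS t;
-- Column renames (Column1, Column2, ... per table) generated from sys.columns.
SELECT 'EXEC sp_rename ''' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name)
     + '.' + QUOTENAME(c.name) + ''', ''Column'
     + CAST(ROW_NUMBER() OVER (PARTITION BY t.object_id ORDER BY c.column_id) AS varchar(10))
     + ''', ''COLUMN'';'
FROM sys.tables AS t
JOIN sys.columns AS c ON c.object_id = t.object_id;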

They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without a FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement. Consider a FK column customerID versus a materialID: the patterns have widely different meanings and analyses. I would correlate a quality measure with materialID or shiftID, but not with a customerID.
Oh look there is correlation between tableA.colB and tableX.colY. Well yes that customer is college team and they use aluminum bats.
On top of that you strip indexes (on tables with 2B+ rows) so the analysis they run will be slow. What does that accomplish?
As for the question as stated, do a backup and restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget to drop the triggers and constraints - they may disclose some trade secret. Then rename the columns, and then the tables.
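A sketch of the system-tables part for the foreign keys - generate the statements, eyeball them, then run them; the same generate-and-run pattern works against sys.triggers, sys.indexes and sys.check_constraints:
SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id)) + '.'
     + QUOTENAME(OBJECT_NAME(fk.parent_object_id))
     + ' DROP CONSTRAINT ' + QUOTENAME(fk.name) + ';'
FROM sys.foreign_keys AS fk;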

Related

Speed up table scans using a middle-man data dump table

I'm a dev (not dba) working on a project where we have been managing "updates" to data via bulk insert. However we first insert to a non-indexed "pre-staging" table. This is due to the fact that we need to normalize a lot of denormalized data and make sure it's properly split up into our schema.
Naturally this makes the update and insert processes slow since we have to check if the information exists for each specific table with non-indexed codes or identifiers.
Since the "pre-staging" table is truncated we didn't include auto-generated IDs either.
I'm looking into ways to speed up table scans on that particular table in our stored procedures. What would be the best approach to do so? Indexes? Auto-generated IDs as clustered indexes? This last one is tricky because we cannot establish relationships with our "staging" data since it is truncated per data dump.

How to get a diff between two states in Oracle database?

I'm looking for a way to get a diff of two states (S1, S2) in a database (Oracle), to compare and see what has changed between these two states. Best would be to see what statements I would have to apply to the database in state one (S1) to transform it to state two (S2).
The two states are from the same database (schema) at different points in time (some small amount of time, not weeks).
I was thinking about doing something like a snapshot and compare - but how do I make the snapshots, and how do I compare them in the best way?
Edit: I'm looking for changes in the data (primarily) and if possible objects.
This is one of those questions which are easy to state, and it seems the solution should be equally simple. Alas it is not.
The starting point is the data dictionary. From ALL_TABLES you can generate a set of statements like this:
select * from t1@dbstate2
minus
select * from t1@dbstate1
This will give you the set of rows that have been added or amended in dbstate2. You also need:
select * from t1@dbstate1
minus
select * from t1@dbstate2
This will give you the set of rows that have been deleted or amended in dbstate2. Obviously the amended ones will be included in the first set, it's the delta you need, which gives the deleted rows.
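Rather than writing those by hand, you can generate the statement pairs for every table from the data dictionary; a sketch, assuming database links named dbstate1 and dbstate2 and a placeholder schema APP:
select 'select * from ' || table_name || '@dbstate2 minus select * from ' || table_name || '@dbstate1;'
    || chr(10)
    || 'select * from ' || table_name || '@dbstate1 minus select * from ' || table_name || '@dbstate2;'
from all_tables
where owner = 'APP';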
Except it's not that simple because:
When a table has a surrogate primary key (populated by a sequence), the primary key for the same record might have a different value in each database. So you should exclude such primary keys from the sets, which means you need to generate tailored projections for each table using ALL_TAB_COLS and ALL_CONSTRAINTS, and you may have to use your skill and judgement to figure out which queries need to exclude the primary key.
Also, resolving foreign keys is problematic. If the foreign key is a surrogate key (or even if it isn't) you need to look up the referenced table to compare the meaning / description columns in the two databases. But of course, the reference data could have different state in the two databases, so you have to resolve that first.
Once you have a set of queries which identify the differences, you are ready for the next stage: generating the statements to apply them. There are two choices here: generating a set of INSERT, UPDATE and DELETE statements, or generating a set of MERGE statements. MERGE has the advantage of idempotency but is a gnarly thing to generate (a hand-written example follows below). Probably go for the easier option.
Remember:
For INSERT and UPDATE statements exclude columns which are populated by triggers or are generated (identity, virtual columns).
For INSERT and UPDATE statements you will need to join to referenced tables for populating foreign keys on the basis of description columns (unless you have already synchronised the primary key columns of all foreign key tables).
So this means you need to apply changes in the order dictated by foreign key dependencies.
For DELETE statements you need to cascade foreign key deletions.
You may consider dropping foreign keys and maybe other constraints, but then you may be in a right pickle when you come to re-apply them, only to discover you have constraint violations.
Use DML Error Logging to track errors in bulk operations.
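To make the MERGE option concrete, here is a hand-written example of the kind of statement the generator would have to emit for each table; natural_key, col_a and col_b are hypothetical columns, dbstate2 is a database link to the source state, and the err$_t1 table would first be created with DBMS_ERRLOG.CREATE_ERROR_LOG:
merge into t1 tgt
using (select natural_key, col_a, col_b from t1@dbstate2) src
on (tgt.natural_key = src.natural_key)
when matched then update set tgt.col_a = src.col_a, tgt.col_b = src.col_b
when not matched then insert (natural_key, col_a, col_b)
values (src.natural_key, src.col_a, src.col_b)
log errors into err$_t1 ('sync t1') reject limit unlimited;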
Do you need to manage changes to schema objects too? Oh boy. You need to align the data structures first, before you can even start the data comparison task. This is simpler than comparing the contents, because it just requires interrogating the data dictionary and generating DDL statements. Even so, you need to run MINUS queries on ALL_TABLES (perhaps even ALL_OBJECTS) to see whether there are tables added to or dropped from the target database. For tables which are present in both you need to query ALL_TAB_COLS to verify the columns - names, datatype, length and precision, and probably whether they're mandatory too.
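A sketch of that structural comparison, again assuming a database link named dbstate1 and a placeholder schema APP; run it in both directions to catch columns missing on either side:
select table_name, column_name, data_type, data_length, data_precision, nullable
from all_tab_cols
where owner = 'APP'
minus
select table_name, column_name, data_type, data_length, data_precision, nullable
from all_tab_cols@dbstate1
where owner = 'APP';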
Just synchronising schema structures is sufficiently complex that Oracle sells the capability as a chargeable extra to the Enterprise Edition license, the Change Management Pack.
So, to confess. The above is a thought experiment. I have never done this. I doubt whether anybody ever has done this. For all but the most trivial of schemas generating DML to synchronise state is a monstrous exercise, which could take months to deliver (during which time the states of the two databases continue to diverge).
The straightforward solution? For a one-off exercise, Data Pump Export from S2, then Data Pump Import into S1 using the table_exists_action=REPLACE option.
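A sketch of that one-off exercise from the command line, assuming the application schema is APP, a DATA_PUMP_DIR directory object exists on both servers, and the dump file gets copied to the S1 server in between (all names are placeholders):
expdp system@S2 schemas=APP directory=DATA_PUMP_DIR dumpfile=app_s2.dmp logfile=app_s2_exp.log
impdp system@S1 schemas=APP directory=DATA_PUMP_DIR dumpfile=app_s2.dmp table_exists_action=replace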
For ongoing data synchronisation Oracle offers a variety of replication solutions. Their recommended approach is GoldenGate, but that's a separately licensed product, so of course they recommend it :) Replication with Streams is deprecated in 12c but it's still there.
The solution for synchronising schema structure is simply not to need it: store all the DDL scripts in a source control repository and always deploy from there.

What is the best way to implement soft deletion in a large relational Database? [duplicate]

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
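A minimal sketch of that view (table and column names are hypothetical):
-- Point application SELECTs at the view so the filter lives in one place.
CREATE VIEW active_orders AS
SELECT *
FROM orders
WHERE is_deleted = '0';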
Having is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
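A sketch of such a list-partitioned table in Oracle (the columns are hypothetical):
CREATE TABLE orders (
    order_id   NUMBER PRIMARY KEY,
    customer   VARCHAR2(100),
    is_deleted NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_live    VALUES (0),   -- active rows
    PARTITION p_deleted VALUES (1)    -- soft-deleted rows
);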
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like time of deletion, who deleted the record, etc. (a sketch follows below)
That way you don't have to add another column to your primary table.
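A minimal sketch of that approach, assuming SQL Server and hypothetical table and column names; the OUTPUT clause moves the row into the history table in the same statement as the delete:
-- History table; DeletedAt and DeletedBy fill in automatically.
CREATE TABLE dbo.OrdersDeleted (
    OrderId    int           NOT NULL,
    CustomerId int           NOT NULL,
    Total      decimal(10,2) NOT NULL,
    DeletedAt  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
    DeletedBy  sysname       NOT NULL DEFAULT SUSER_SNAME()
);
-- 'Soft delete' = move the row out of the live table.
DELETE FROM dbo.Orders
OUTPUT DELETED.OrderId, DELETED.CustomerId, DELETED.Total
INTO dbo.OrdersDeleted (OrderId, CustomerId, Total)
WHERE OrderId = 42;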
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers). Option 2 is sketched below.
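A sketch of option 2, plus the filtered-index variant that SQL Server and PostgreSQL support (table and column names are hypothetical):
-- Option 2: widen the unique key so soft-deleted rows no longer collide.
ALTER TABLE users ADD CONSTRAINT uq_users_login_deleted_at UNIQUE (login, deleted_at);
-- Alternative: enforce uniqueness over live rows only (filtered / partial unique index).
CREATE UNIQUE INDEX ux_users_login_live ON users (login) WHERE deleted_at IS NULL;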
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having a record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly and hinder the use of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention which product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
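For example, a sketch of such a filtered index in SQL Server (the column names are hypothetical):
-- Covers live-row queries only; deleted rows never enter the index.
CREATE NONCLUSTERED INDEX ix_orders_live
    ON dbo.orders (customer_id, ordered_on)
    INCLUDE (total)
    WHERE is_deleted = 0;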
Something that I use on projects is a statusInd tinyint not null default 0 column
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications (a sketch follows below). If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
It scales well and is data centric, keeping the data footprint pretty small - key for 350gb+ DBs with real-time concerns. Using alternatives such as extra tables and triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
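A rough sketch of the bitmask idea, with hypothetical bit assignments (1 = deleted, 2 = archived, 4 = replicated) and a hypothetical table:
-- Live rows are those where the 'deleted' bit is not set.
SELECT *
FROM dbo.widgets
WHERE statusInd & 1 = 0;
-- Soft delete: set the 'deleted' bit without disturbing the other flags.
UPDATE dbo.widgets
SET statusInd = statusInd | 1
WHERE widget_id = 42;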
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, i.e. published, private, deleted, needsApproval...
Create another schema and grant it all on your data schema.
Implement VPD on your new schema so that each and every query automatically has a predicate appended that selects only the non-deleted rows.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete

A single huge table or many small ones

I have a database which logs modification records into a table. This modification table contains foreign keys to other tables (the modification table only contains references to objects modified).
Objects in this modification table can be grouped into different populations. When a user access the service he only requests the database for object on his population.
I will have about 2 to 10 new populations each week.
This table is queried by smartphones very, very often and will contain about 500,000 to 1,000,000 records.
If I split the modification table into many tables, there is no table join needed to answer user requests.
If I change this single table into many tables, I guess it will speed up the response time.
But on the other hand, each insert into the modification table will first require the name of the target table (which implies another query). To handle that, I plan to have a varchar column in the "population" table holding the name of the target modification table.
My question is a design-pattern / architecture one: should I go for a single very huge table with 3 "where" clauses for each request, or should I give many light tables with no "where" at all a try?
The cleanest thing would be to use one table and partition it on the populations. Partitions are made for this.
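A sketch of what that could look like, assuming SQL Server (which only offers range partitioning, so the population boundary values are placeholders) and hypothetical column names:
CREATE PARTITION FUNCTION pf_population (int)
    AS RANGE RIGHT FOR VALUES (100, 200, 300);
CREATE PARTITION SCHEME ps_population
    AS PARTITION pf_population ALL TO ([PRIMARY]);
CREATE TABLE dbo.modification (
    modification_id bigint    NOT NULL,
    population_id   int       NOT NULL,   -- partitioning key
    object_ref      bigint    NOT NULL,   -- reference to the modified object
    modified_on     datetime2 NOT NULL,
    CONSTRAINT pk_modification PRIMARY KEY (population_id, modification_id)
) ON ps_population (population_id);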
500K - 1M records isn't trivial - but it certainly isn't huge either. What's your database platform? Most mainstream professional platforms (SQL Server, Oracle, MySQL, etc.) are more than capable of handling this.
If the table in question is narrow (has few columns) then it's less likely to be an issue than "wide" tables with lots of columns.
Having lots of joins could be an issue (I just can't speak from experience). Depending on how you manage things (and how good your application code is), do you really need the foreign-key constraints?

Database design

Our database is part of a (specialized) desktop application.
The primary goal is to keep data about certain events.
Events happen every few minutes.
The data collected about events changes frequently with new data groups being added in and old ones swapped out almost monthly (the data comes in definite groups).
I have to put together a database to track the events. A first stab at that might be to simply have a single big table where each row is an event - that is basically what our data looks like - but this seems undesirable because of our constantly changing groups of data (i.e. the number of columns would either keep growing perpetually, or we would constantly have this month's database incompatible with last month's database - ugh!). Because of this I am leaning toward the following, even though it creates circular references. (But maybe this is a stupid idea.)
Create tables like
Table Events
Table Group of the Month 1
Table Group of the Month 2
...
Table Events has:
A primary key whose deletion cascades to delete rows with foreign keys referencing it
A nullable foreign key for each data group table
Each data group table has:
A primary key, whose deletion cascades to null out foreign keys referencing it
Columns for the data in that group
A non-nullable foreign key back to the event
This still leaves you with a growing, changing Event Table (as you need to add new foreign key columns for each new data group), just much less drastically. However it seems more modular to me than one giant table. Is this a good solution to this situation? If not, what is?
Any suggestions?
P.S. We are using SQL Express or SQL Compact (we are currently experimenting with which one suits us best)
Why not use basically the single table approach and store the changing event data as XML in an XML column? You can even use XSD schemas to account for the changing data types, and you can add indexes on XML data if fast query performance on some XML data is required.
A permanently changing DB schema wouldn't really be an option if I were to implement such a database.
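A minimal sketch of that single-table-plus-XML shape, assuming SQL Server Express (the names and the sample XPath are hypothetical):
CREATE TABLE dbo.Events (
    EventId    int IDENTITY(1,1) PRIMARY KEY,
    OccurredOn datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    EventData  xml NOT NULL   -- this month's data groups live in here
);
-- Shred a value back out when querying.
SELECT EventId,
       EventData.value('(/Event/Temperature)[1]', 'decimal(9,3)') AS Temperature
FROM dbo.Events;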
You should not have foreign keys for each data group table in the event table.
The event table already has an event_id which is in each data group table. So you can get from event to the child tables. Furthermore there will be old rows in the event table that are not in the latest data group table. So you really can't have a foreign key.
That said, I would wonder whether there is additional structure in the data group tables that can be used to clean up your design. Without knowing anything about what they look like I can't say. But if there is, consider taking advantage of it! (A schema that changes every month is a pretty bad code smell.)
Store your data at as granular a level as possible. It might be as simple as:
EventSource int FK
EventType int FK
Duration int
OccurredOn datetime
Get the data right and as simple as possible in the first place, and then
Aggregate via views or queries. Your instincts are correct about the ever-changing nature of the columns - better to control that in T-SQL than in DDL (a sketch follows below).
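A sketch of that granular table plus a reporting view on top (names beyond the columns listed above are hypothetical):
CREATE TABLE dbo.Event (
    EventId     int IDENTITY(1,1) PRIMARY KEY,
    EventSource int      NOT NULL,   -- FK to a source lookup table
    EventType   int      NOT NULL,   -- FK to an event-type lookup table
    Duration    int      NOT NULL,
    OccurredOn  datetime NOT NULL
);
GO
-- Aggregate in a view rather than widening the base table.
CREATE VIEW dbo.EventSummaryByType AS
SELECT EventSource, EventType, COUNT(*) AS EventCount, AVG(Duration) AS AvgDuration
FROM dbo.Event
GROUP BY EventSource, EventType;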
I faced this problem a number of years ago with log files for massive armies of media players, and where I ultimately ended up was taking this data and creating an OLAP cube out of it. OLAP is another approach to database design where the important thing is optimizing it for reporting and "sliceability". It sounds like you're on that track, where it would be very useful to be able to look at a quick month's view of data, then a quarter's, and then back down to a week's. This is what OLAP is for.
Microsoft's technology for this is Analysis Services, which comes as part of SQL Server. If you didn't want to take the entire plunge (OLAP has a pretty steep learning curve), you could also look at doing a selectively denormalized database that you populate each night with ETL from your source database.
HTH.
