Obfuscate a SQL Server Db schema

When posting example code or filing bug reports based on a real production app, it would be helpful to have some way to change the table and column names to not potentially give away information about the internals of the app. Doing it by hand without breaking things is time consuming. Does anything automatic exist? Ideally it would use real English words so they are more easily referred to than random text strings.

As long as you don't use real data, I don't see what the issue is. Most apps are fairly obvious based on the requirements, e.g. CRM system = (customer name, address, etc.) or (customer name, addressid, etc., with some address table holding the parts of the address). Knowing your schema tells me nothing about how you implement your app. Generally, without the stored procedures/program code it would be hard to steal any intellectual property. Even if you were the NSA or something (InternetIP, PacketHeadingID, PacketDetailID, TimeStampID), the structure of the tables would still give me no information on how your system for logging all the internet traffic actually works. I also wouldn't know anything that is logged.
I don't know of anything offhand that does what you are requesting, but I would think it is fairly easy to write a script to do it on your own. Look at the table columns and datatypes and call text columns "TextColumn1", int columns "IntColumn2", etc., build a table of substitutions, then perform the substitutions globally in the script file. I would think this is a fairly easy Python/Perl/PowerShell/Ruby/VBScript program.
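A minimal sketch of that substitution approach in Python. The column list is hard-coded here for illustration; in practice it could be read from INFORMATION_SCHEMA.COLUMNS. The function names, the type-to-prefix table and the sample query are all made up for this example, and table names would need the same treatment.

```python
import re

# Assumed mapping from SQL data types to generic name prefixes.
PREFIXES = {"varchar": "Text", "nvarchar": "Text", "int": "Int",
            "datetime": "Date", "bit": "Flag"}


def build_substitutions(columns):
    """Map each real column name to a generic name based on its data type.

    `columns` is a list of (column_name, sql_type) pairs.
    """
    mapping = {}
    for index, (name, sql_type) in enumerate(columns, start=1):
        prefix = PREFIXES.get(sql_type.lower(), "Other")
        mapping[name] = f"{prefix}Column{index}"
    return mapping


def obfuscate(script, mapping):
    """Replace whole-word occurrences of each real name, case-insensitively."""
    if not mapping:
        return script
    # Longest names first, so a name containing another name wins the match.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): new for name, new in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], script)


# Hypothetical schema and query, purely for illustration.
columns = [("CustomerName", "nvarchar"), ("CreditLimit", "int")]
mapping = build_substitutions(columns)
result = obfuscate(
    "SELECT CustomerName FROM Orders WHERE CreditLimit > 100", mapping
)
print(result)
```

Note that a plain text substitution like this also rewrites matching words inside string literals and comments, so the output still needs a quick review.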

I agree that there's no real need to do so, but if you feel that way, take a look at anonymizers, usually used to protect the data and not the schemas, but you could easily apply those approaches to schemas as well.
See this paper (which is the description of this framework), especially page 8 and onwards, for different anonymization methods, although replacing column names with static strings would probably be good enough anyway.

Related

Reusing tables for purposes other than the intended

We need to implement some new functionality for some clients. The functionality is essentially an EULA accept interface for the users. Users will open our app, will be presented with the corresponding EULA (varies from client to client). It needs to be able to store different versions of the EULA for the same client, and it also needs to store which users have accepted which version of the EULA. If a new version is saved, it will be presented to the users the next time they log in.
I've written a document suggesting to add two tables, EULAs and UserAcceptedEULA. That will allow us to store different EULAs and keep track of the accepted ones, current and previous ones.
My problem comes with how some people at the company want to do the implementation. They suggest to use a table ConstantGroups (which contains ConstantGroupID, Timestamp, ClientID and Name) that we use for grouping constants with their values that are stored in another table, e.g.: ConstantGroup would be Quality, and the values would be High, Medium, Low.
To me this is a horrible, incredibly wrong way to do it. They're suggesting it because we already have an endpoint where you pass the ClientID and you get back a string, so it "does what we need".
I wrote the document explaining the whole solution, covering DB changes, APIs needed and UI modifications, but they still don't want to accept it because they think their way will save us time.
How do I make them understand how horribly wrong they are?

This depends somewhat on your assumptions about "good" design.
Many software folk have adopted the SOLID principles as being "good" (I am one of them). While the original thinking is about object oriented design, I think many apply to databases too.
The first element of that is "Single responsibility". A table should do one thing, and one thing only. Your colleagues are trying to get a single entity to manage different concepts; the Constants table suddenly is responsible for "constants" and "EULA acceptance".
That leads to "Open to extension, closed to change" - if you need to change the implementation of "constants" or "EULAs", you have to untangle the other. So any time you (might) save now will cost you later.
The second principle I like (especially in database design) is the Principle of Least Astonishment. Imagine a new developer joining the team, and having to figure out how EULAs work. They would naturally look for some kind of "EULA" and "Acceptance" tables, and would be astonished to learn that actually, this is managed in a thing called "constants". Any time you save now will be paid back when onboarding new people (or indeed when reminding yourself in two years' time, when you have to fix a bug).
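For comparison, the dedicated two-table design from the question could look roughly like the sketch below. It uses Python's sqlite3 module as a stand-in for SQL Server so that it can be run as-is. Only the table names EULAs and UserAcceptedEULA come from the question; the columns, the pending_eula helper and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EULAs (
    EULAID   INTEGER PRIMARY KEY,
    ClientID INTEGER NOT NULL,
    Version  INTEGER NOT NULL,
    Body     TEXT    NOT NULL,
    UNIQUE (ClientID, Version)
);
CREATE TABLE UserAcceptedEULA (
    UserID     INTEGER NOT NULL,
    EULAID     INTEGER NOT NULL REFERENCES EULAs (EULAID),
    AcceptedAt TEXT    NOT NULL,
    PRIMARY KEY (UserID, EULAID)
);
""")


def pending_eula(conn, client_id, user_id):
    """Return (EULAID, Version) of the client's newest EULA if the user
    has not accepted it yet, otherwise None."""
    latest = conn.execute(
        "SELECT EULAID, Version FROM EULAs "
        "WHERE ClientID = ? ORDER BY Version DESC LIMIT 1",
        (client_id,),
    ).fetchone()
    if latest is None:
        return None
    accepted = conn.execute(
        "SELECT 1 FROM UserAcceptedEULA WHERE UserID = ? AND EULAID = ?",
        (user_id, latest[0]),
    ).fetchone()
    return None if accepted else latest


# Version 1 is published, user 7 accepts it, then version 2 is published.
conn.execute("INSERT INTO EULAs (ClientID, Version, Body) VALUES (1, 1, 'First text')")
first = pending_eula(conn, client_id=1, user_id=7)
conn.execute(
    "INSERT INTO UserAcceptedEULA (UserID, EULAID, AcceptedAt) VALUES (7, ?, '2024-01-01')",
    (first[0],),
)
after_accept = pending_eula(conn, client_id=1, user_id=7)
conn.execute("INSERT INTO EULAs (ClientID, Version, Body) VALUES (1, 2, 'Revised text')")
after_new_version = pending_eula(conn, client_id=1, user_id=7)
```

Both requirements from the question (several versions per client, and a record of who accepted which version) map directly onto a table each, which is the kind of structure a new developer would expect to find.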

SQL Best Practices for Identity value hard coding

First, I know this is a rather subjective question but I need some kind of formal documentation to help me educate my client.
Background - a large enterprise application with hundreds of tables and SP's, all neatly designed with normalized tables and foreign keys using identity columns.
Our client has a few employees writing complex reports in Crystal enterprise using a replicated copy of our production Db.
We have tables that store what I would classify as 'system' base information, such as a list of office locations, or departments within the company, standard set of roles for users, statuses of other objects (open/closed etc), basically data that doesn't change often.
The issue - the report designers and financial analysts are writing queries with hardcoded identity values inside of them. Something like this
SELECT xxx FROM OFFICE WHERE OFFICE_ID = 6
I'm greatly simplifying here, but basically they're using these hard coded int values inside their procedures all over the place.
For SQL developers, seeing this will obviously make you facepalm, as it's just a built-in instinct not to do this.
However, surprisingly I can't find any documentation or even best practices articles as to why this shouldn't be done.
They would argue it's fine to do this since the values never change, and they're right, within that single system those values won't change, however across multiple environments (staging/QA/Dev) those values can and are absolutely different, making their reporting design approach non-portable and only able to function in 1 isolated server environment.
Do any of the SQL gurus out there have any more in-depth information/articles etc. that I can use to help educate my client on why they should avoid this approach?

Seems to me the strongest argument to your report writers is your second to last sentence "...those values can and are absolutely different [between environments]". That would be pretty much the gist of my response to them.
Of course there's always gray area to any question. Identity columns are essentially magic numbers. They have the benefit to the database of being...
Small
Sequential
Fast to seek and join on, sort by and create
...but have the downside of being completely meaningless and, in effect, randomly assigned (sort the inserts into that table one way and you get a different identity per row than if you sorted them the other way). As such, in cases where you have to look up something specific like that, it's common to also include a "business/natural/alternate" key (e.g., in a completely made-up example, [CategoryName], where CategoryName is something short, unique and human readable, while [CategoryId] is an identity, but not something intended to be sought on).
If you have a website with, say, a dropdown menu, usually the natural key gets put into the visible part of the drop down, and the surrogate/identity key gets passed around on the back end, invisible to the end user.
This gets a little trickier when you have people writing queries directly against the database. If they're owners of the data, they may know things about the larger data structure which they can take advantage of in *cough* "clever" ways. If you know the keys won't change and you know what those values are, there might be a case to be made for just referencing those. But again, not if they're going to be different when you query a different server.
Of course the flip side is, if you don't want them to use the identity values, you'll have to give them an alternative. And if your tables don't already include a business/natural/alternate key, you're going to have to add one wherever one doesn't already exist.
Also, there's nothing wrong with that alternate key being an integer too (maybe you already have company-wide identifiers for your offices of 1, 2, 3 etc), but the point is that it's deterministic no matter where you run your query.
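A small runnable illustration of the portability problem, using Python's sqlite3 module in place of SQL Server. The OFFICE table and OFFICE_ID come from the question; the OFFICE_CODE alternate key, the OFFICE_NAME column, the sample offices and the make_environment helper are invented for this sketch.

```python
import sqlite3


def make_environment(offices):
    """Build an in-memory database whose identity values depend on insert order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE OFFICE (
            OFFICE_ID   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            OFFICE_CODE TEXT NOT NULL UNIQUE,               -- alternate key
            OFFICE_NAME TEXT NOT NULL
        )""")
    conn.executemany(
        "INSERT INTO OFFICE (OFFICE_CODE, OFFICE_NAME) VALUES (?, ?)", offices
    )
    return conn


offices = [("NYC", "New York"), ("LON", "London")]
production = make_environment(offices)
staging = make_environment(list(reversed(offices)))  # same data, loaded in another order

by_id = "SELECT OFFICE_NAME FROM OFFICE WHERE OFFICE_ID = 1"
by_code = "SELECT OFFICE_NAME FROM OFFICE WHERE OFFICE_CODE = 'NYC'"

for name, conn in (("production", production), ("staging", staging)):
    print(name, conn.execute(by_id).fetchone()[0], conn.execute(by_code).fetchone()[0])
```

The query on the identity returns a different office in each environment, while the query on the alternate key returns the same office in both.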

What type of database for storing ML experiments

So I'm thinking of writing a small piece of software that runs ML experiments on a cluster or an arbitrary abstracted executor and then saves them so that I can view them efficiently in real time. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code (or an archive of code) that can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command line arguments. The experiment will also include a YAML schema defining the command line arguments. The code itself will specify what gets logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns: first an int iteration, second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When one submits an experiment, one will need to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
The easiest way for me to implement this is with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and a folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be a delimited CSV file, with a special schema file describing what types of values are stored there so I can load them. The final parameters can also be copied into the folder.
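To make the flat-file idea concrete, here is a rough Python sketch of one channel stored as a CSV file next to a small schema file. The file naming, the JSON schema format, the type names and the helper functions are all assumptions for illustration, not part of any existing tool.

```python
import csv
import json
import tempfile
from pathlib import Path

# Assumed type names allowed in a channel schema.
CASTS = {"int": int, "float": float, "str": str}


def create_channel(experiment_dir, name, columns):
    """Create <name>.csv plus <name>.schema.json for a list of (column, type) pairs."""
    experiment_dir.mkdir(parents=True, exist_ok=True)
    (experiment_dir / f"{name}.schema.json").write_text(json.dumps(columns))
    with open(experiment_dir / f"{name}.csv", "w", newline="") as handle:
        csv.writer(handle).writerow([column for column, _ in columns])


def log_row(experiment_dir, name, values):
    """Append one row of values to the channel file."""
    with open(experiment_dir / f"{name}.csv", "a", newline="") as handle:
        csv.writer(handle).writerow(values)


def load_channel(experiment_dir, name):
    """Read the channel back, casting each value according to the schema file."""
    columns = json.loads((experiment_dir / f"{name}.schema.json").read_text())
    with open(experiment_dir / f"{name}.csv", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return [
            tuple(CASTS[type_name](value)
                  for (_, type_name), value in zip(columns, row))
            for row in reader
        ]


# Hypothetical project and experiment names.
root = Path(tempfile.mkdtemp())
experiment = root / "my_project" / "experiment_0001"
create_channel(experiment, "train_error", [("iteration", "int"), ("error", "float")])
log_row(experiment, "train_error", [1, 0.5])
log_row(experiment, "train_error", [2, 0.25])
rows = load_channel(experiment, "train_error")
```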
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Additionally, maybe I'm overlooking something very obvious, or maybe not; if you have any experience with this, any suggestions or advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?

The data for ML would primarily be unstructured data. That kind of data will not naturally fit into an RDBMS. Essentially, a document database like MongoDB is far better suited to such cases.

SQL Server Normalisation/Best Practices: Single Data Table

I have inherited the maintenance of a database from a former employee in another department and I believe their database development skills are not really up to snuff.
I have been asked to support or redevelop it.
It appears that the data for every record is stored in one single table (yes, I know), which has hundreds of thousands of rows with empty fields.
TableData:
> RowID
> FieldID
> DateData
> NumberData
> TextData
> YesNoData
Only one field (dependent on the datatype required) appears to be populated in this instance for each row - the rest are empty.
There are two other tables which identify details of the Record (Created by etc) and the Field (Updated On, Field datatype)
Looking through the Access front-end code, it appears that the data for each record and field is retrieved by searching on record and field and then returning the appropriate column with the data.
My question: What purpose does this achieve, or is this type of development considered the work of an inexperienced database developer?

My best guess is that a table like this is used to store arbitrary data (inferred from the other supporting tables) that won't require schema changes to store information that is "unplanned" or not yet implemented in the business logic of the application.
The questions I would start asking (yourself, any programmers, DBAs, project managers, etc.):
Were the requirements so abstract at the time that it was impossible to create a formal schema with data relationships? (Bad, bad, BAD)
Was the database designer lazy or inexperienced?
Was the programmer lazy or inexperienced? (Better yet, was the programmer the DBA?)
Is the reliability/availability of the data so sensitive that making formal schema changes is hard to do on a regular basis?
Has the project gone through plenty of people before you that simply inherited the problems, and this is a hack solution? (While maybe the original programmer knew where it was intended to go eventually...)
I think what you're really trying to get at here is "does this work, or should I change it?". I'd be shocked if any of the read/search queries are optimized at all, as there couldn't be any useful indexes for such arbitrary data storage. If the application is simply logging information, it probably isn't as big of a deal, as the originator probably just didn't know yet how the data would be used later on, and writing a one-time applet to loop through and create formal objects out of the data would be better than trying to assume everything at the beginning.
Getting a little more targeted, are you running into any bottlenecks in your process because of this particular table, or are you concerned just out of surprise? If the former, I'd figure out how to change it right away. If the latter, I'd take my time figuring out the long-term requirements of the application first.
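As a rough idea of what such a one-time conversion could look like, the Python sketch below loops over the generic table and rebuilds one dictionary per record. It uses sqlite3 so that it runs on its own. The TableData columns come from the question; the Field table layout, the type names, the sample rows and the assumption that RowID identifies the record are guesses made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.executescript("""
CREATE TABLE Field (FieldID INTEGER PRIMARY KEY, FieldName TEXT, DataType TEXT);
CREATE TABLE TableData (
    RowID INTEGER, FieldID INTEGER,
    DateData TEXT, NumberData REAL, TextData TEXT, YesNoData INTEGER
);
INSERT INTO Field VALUES (1, 'Surname', 'Text'), (2, 'Age', 'Number');
INSERT INTO TableData VALUES
    (10, 1, NULL, NULL, 'Smith', NULL),
    (10, 2, NULL, 42,   NULL,    NULL),
    (11, 1, NULL, NULL, 'Jones', NULL);
""")

# Which column holds the value for each (assumed) field data type.
COLUMN_FOR_TYPE = {"Date": "DateData", "Number": "NumberData",
                   "Text": "TextData", "YesNo": "YesNoData"}


def load_records(conn):
    """Pivot the one-value-per-row table into {RowID: {FieldName: value}}."""
    records = {}
    query = """
        SELECT d.*, f.FieldName, f.DataType
        FROM TableData AS d
        JOIN Field AS f ON f.FieldID = d.FieldID
    """
    for row in conn.execute(query):
        value = row[COLUMN_FOR_TYPE[row["DataType"]]]
        records.setdefault(row["RowID"], {})[row["FieldName"]] = value
    return records


records = load_records(conn)
print(records)
```

Once the records are in this shape, the distinct field names per record type give a starting point for a formal table definition.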

Database design help with varying schemas

I work for a billing service that uses some complicated mainframe-based billing software for its core services. We have all kinds of codes we set up that are used for tracking things: payment codes, provider codes, write-off codes, etc. Each type of code has a completely different set of data items that control what the code does and how it behaves.
I am tasked with building a new system for tracking changes made to these codes. We want to know who requested what code, who/when it was reviewed, approved, and implemented, and what the exact setup looked like for that code. The current process only tracks two of the different types of code. This project will add immediate support for a third, with the goal of also making it easy to add additional code types into the same process at a later date. My design conundrum is that each code type has a different set of data that needs to be configured with it, of varying complexity. So I have a few choices available:
I could give each code type its own table(s) and build them independently. Considering we only have three codes I'm concerned about at the moment, this would be simplest. However, this concept has already failed or I wouldn't be building a new system in the first place. It's also weak in that writing generic source code at the presentation level to display request data for any code type (even those not yet implemented) is not trivial.
Build a db schema capable of storing the data points associated with each code type: not only values, but what type they are and how they should be displayed (e.g. a dropdown list from an enum of some kind). I have a decent db schema for this started, but it just feels wrong: overly complicated to query and maintain, and it ultimately requires a custom query to view the full data in a nice tabular form for each code type anyway.
Storing the data points for each code request as xml. This greatly simplifies the database design and will hopefully make it easier to build the interface: just set up a schema for each code type. Then have code that validates requests to their schema, transforms a schema into display widgets and maps an actual request item onto the display. What this item lacks is how to handle changes to the schema.
My questions are: how would you do it? Am I missing any big design options? Any other pros/cons to those choices?
My current inclination is to go with the xml option. Given the schema updates are expected but extremely infrequent (probably less than one per code type per 18 months), should I just build it to assume the schema never changes, but so that I can easily add support for a changing schema later? What would that look like in SQL Server 2000 (we're moving to SQL Server 2005, but that won't be ready until after this project is supposed to be completed)?
[Update]:
One reason I'm thinking xml is that some of the data will be complex: nested/conditional data, enumerated drop down lists, etc. But I really don't need to query any of it. So I was thinking it would be easier to define this data in xml schemas.
However, le dorfier's point about introducing a whole new technology hit very close to home. We currently use very little xml anywhere. That's slowly changing, but at the moment this would look a little out of place.
I'm also not entirely sure how to build an input form from a schema, and then merge a record that matches that schema into the form in an elegant way. It will be very common to only store a partially-completed record and so I don't want to build the form from the record itself. That's a topic for a different question, though.
Based on all the comments so far, XML is still the leading candidate. Separate tables may be as good or better, but I have the feeling that my manager would see that as not different or generic enough compared to what we're currently doing.

There is no simple, generic solution to a complex, meticulous problem. You can't have both simple storage and simple app logic at the same time. Either the database structure must be complex, or else your app must be complex as it interprets the data.
I outline five solutions to this general problem in "product table, many kind of product, each product have many parameters."
For your situation, I would lean toward Concrete Table Inheritance or Serialized LOB (the XML solution).
The reason that XML might be a good solution is that:
You don't need to use SQL to pick out individual fields; you're always going to display the whole form.
Your XML can annotate fields for data type, user interface control, etc.
But of course you need to add code to parse and validate the XML. You should use an XML schema to help with this. In which case you're just replacing one technology for enforcing data organization (RDBMS) with another (XML schema).
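A minimal sketch of the Serialized LOB idea, in Python with sqlite3 standing in for SQL Server. The CodeRequest table, the XML layout and the attribute names are invented for illustration. The sketch only parses the document; the standard library cannot validate against an XML schema, so that step would need a separate library.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CodeRequest (
        RequestID INTEGER PRIMARY KEY,
        CodeType  TEXT NOT NULL,   -- e.g. payment, provider, write-off
        Payload   TEXT NOT NULL    -- the whole form, stored as one XML document
    )""")

# Hypothetical request: each field is annotated with a data type and a UI control.
payload = """<request>
  <field name="TradeCredit" type="CalendarInterval" control="dropdown">Net-30</field>
  <field name="Description" type="Text" control="textbox">Example payment code</field>
</request>"""
conn.execute(
    "INSERT INTO CodeRequest (CodeType, Payload) VALUES (?, ?)",
    ("payment", payload),
)


def load_form(conn, request_id):
    """Return (name, type, control, value) for every field of one stored request."""
    (xml_text,) = conn.execute(
        "SELECT Payload FROM CodeRequest WHERE RequestID = ?", (request_id,)
    ).fetchone()
    root = ET.fromstring(xml_text)
    return [
        (field.get("name"), field.get("type"), field.get("control"), field.text)
        for field in root.findall("field")
    ]


form = load_form(conn, 1)
for name, data_type, control, value in form:
    print(f"{name}: {value} ({data_type}, shown as {control})")
```

The database only sees an opaque text column, so all knowledge about individual fields lives in the application code that interprets the XML.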
You could also use an RDF solution instead of an RDBMS. In RDF, metadata is queryable and extensible, and you can model entities with "facts" about them. For example:
Payment code XYZ contains attribute TradeCredit (Net-30, Net-60, etc.)
Attribute TradeCredit is of type CalendarInterval
Type CalendarInterval is displayed as a drop-down
.. and so on
Re your comments: Yeah, I am wary of any solution that uses XML. To paraphrase Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use XML." Now they have two problems.
Another solution would be to invent a little Domain-Specific Language to describe your forms. Use that to generate the user-interface. Then use the database only to store the values for form data instances.

Why do you say "this concept has already failed or I wouldn't be building a new system in the first place"? Is it because you suspect there must be a scheme for handling them in common?
Else I'd say to continue the existing philosophy, and establish additional tables. At least it would be sharing an existing pattern and maintaining some consistency in that respect.

Do a web search on "generalized specialized relational modeling". You'll find articles on how to set up tables that store the attributes of each kind of code, and the attributes common to all codes.
If you’re interested in object modeling, just search on “generalized specialized object modeling”.
