Approaches to finding / controlling illegal data [closed] - sql-server

Closed 10 years ago.
Search and destroy / capturing illegal data...
The Environment:
I manage a few very "open" databases. The type of access is usually full select/insert/update/delete. The mechanism for accessing the data is usually through linked tables (to SQL Server) in custom-built MS Access databases.
The Rules
No Social Security numbers, etc. (e.g., think FERPA/HIPAA).
The Problem
Users enter/hide the illegal data in creative ways (e.g., an SSN in the middle-name field); administrative/disciplinary control is weak/ineffective. The general attitude (even from most of the bosses) is that security is a hassle, and if you find a way around it, then good for you. I need a (better) way to find the data after it has been entered.
What I've Tried
Initially, I made modifications to the various custom-built user interfaces folks had (that I was aware of), all the way down to the table structures that they were linking to on our database server. The SSNs, for example, no longer had a field of their own, etc. And yet...I continue to find them buried in other data fields.
After a secret audit some folks at my institution did, where they found this buried data, I wrote some SQL that (literally) checks every character in every field in every table of the database, looking for anything that matches an SSN pattern. It takes a long time to run, and the users are finding ways around my pattern definitions.
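A simplified sketch of such a brute-force scan, assuming SQL Server, a 123-45-6789 style pattern, and illustrative output column names (it builds one UNION ALL query over every character column listed in INFORMATION_SCHEMA.COLUMNS, so the table and column names are discovered at run time):

DECLARE @sql nvarchar(max) = N'';

-- Build one big UNION ALL query over every character column in the database.
SELECT @sql = @sql + N' UNION ALL SELECT ''' +
       TABLE_SCHEMA + N'.' + TABLE_NAME + N'.' + COLUMN_NAME + N''' AS [location], CAST(' +
       QUOTENAME(COLUMN_NAME) + N' AS nvarchar(4000)) AS [value] FROM ' +
       QUOTENAME(TABLE_SCHEMA) + N'.' + QUOTENAME(TABLE_NAME) +
       N' WHERE ' + QUOTENAME(COLUMN_NAME) +
       N' LIKE ''%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%'''
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN (N'char', N'varchar', N'nchar', N'nvarchar');

SET @sql = STUFF(@sql, 1, 11, N'');   -- drop the leading ' UNION ALL '
EXEC sys.sp_executesql @sql;

Dynamic SQL like this is slow on a large database and only as good as its pattern list, which is exactly the limitation described above.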
My Question
Of course, a real solution would entail policy enforcement. That has to be addressed (way) above my head, however; it is beyond the scope and authority of my position.
Are you aware of, or do you use, any (free or commercial) tools that have been targeted at auditing for FERPA & HIPAA data? (Or, if not those policies specifically, then just data patterns in general?)
I'd like to find something that I can run on a schedule and that stays updated with new pattern definitions.

I would monitor the users, in two ways.
The same users are likely to be entering the same data, so track who is getting around the roadblocks, and identify them. Ensure that they are documented as fouling the system, so that they can be disciplined appropriately. Their efforts create risk (monetary and legal, which becomes monetary) for the entire organization.
Look at the queries that users issue. If they are successful in searching for the information, then it is somehow stored in the repository.
If you are unable to track users, begin instituting passwords.
In the long-run, though, your organization needs to upgrade its users.

In the end you are fighting an impossible battle unless you have support from management. If it's illegal to store an SSN in your DB, then this rule must have explicit support from the top. @Iterator is right: record who is entering this data and document their actions; implement an audit trail.
Search across the audit trail, not the database itself. This should be quicker; you only have one day (or one hour, or ...) of data to search. Record and publish each violation.
You could tighten up some validation. No numeric field, I guess, needs to be as long as an SSN. No name field needs numbers in it. No address field needs more than 5 or 6 numbers in it (how many houses are there on Route 66?). Hmmm, could a phone number be used to represent an SSN? The trouble is you can't stop someone entering 'acaaabdf' etc. (encoding 131126, etc.); there's always a way to defeat your checks.
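For what it's worth, that kind of validation can be pushed into SQL Server itself as CHECK constraints. A minimal sketch, assuming a hypothetical dbo.Person table (NULL values pass a CHECK, so this only screens columns that are actually filled in):

ALTER TABLE dbo.Person ADD CONSTRAINT CK_Person_NoDigitsInNames
    CHECK (FirstName NOT LIKE '%[0-9]%'
       AND MiddleName NOT LIKE '%[0-9]%'
       AND LastName NOT LIKE '%[0-9]%');

-- Reject anything shaped like an SSN in a free-text field,
-- with or without the dashes.
ALTER TABLE dbo.Person ADD CONSTRAINT CK_Person_NoSsnInNotes
    CHECK (Notes NOT LIKE '%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%'
       AND Notes NOT LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%');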
You'll never achieve perfection, but you can at least catch the accidental offender.

One other suggestion: you can post a new question asking about machine learning plugins (essentially statistical pattern recognition) for your database of choice (MS Access). By flagging some of the database updates as good/bad, you may be able to leverage an automated tool to find the bad stuff and bring it to your attention.
This is akin to spam filters that find the bad stuff and remove it from your attention. However, to get good answers on this, you may need to provide a bit more detail in the question, such as the number of samples you have (if it's not many, then an ML plugin would not be useful), your programming skills (for what's known as feature extraction), and so on.
Despite this suggestion, I believe it's better to target the user behavior than to build a smarter mousetrap.

Related

How to design a user permission handling database? [closed]

Closed 5 years ago.
We have a little problem in one of our projects, where two investors are architects and... as it usually is in life, they don't really get along with some of the ideas. Both have different experiences with previous projects, and it seems they look down upon the ideas of the other one. Yep, I'm one of them.
We have an argument over how to define user permission handling in one of our projects.
One idea is to have a table with permissions, roles which gather sets of permissions, and then users who have a role defined.
User
user_id
role_id
Role
role_id
permission_id
Permission
permission_id
The other side would like to propose to do it using a table with columns defining permissions:
User
user_id
role_id
Role
role_id
can_do_something
can_do_something_else
can_do_something_even_different
My take on the first option is that it's far cheaper to maintain:
adding a single permission means it's just one insert + handling of the permission in the code.
In the case of the other, (to me) it means that you have to alter the database, alter the code handling the database, and, on top of that, add code to handle the permission.
But maybe I'm just wrong, and I don't see some possible benefits of the other solution.
I always thought the former was the standard way to handle it, but I'm told that it's subjective and that making a change in the database is a matter of just running a script (whereas for me it means that the script has to be added to the deployment, has to be run on every database, and in case of migration has to be "remembered", etc.)
I know the question could be opinion-based, but I'm kind of hoping this really is a matter of standards and good practice, rather than subjective opinion.
I posted some other questions as comments to your original question.
Even if you had a completely flat role setup, I cannot think of a reason to go for the second proposal. As you argue, changing something will require modifying both the code and the data structure.
What your colleague is proposing is a sort of denormalization which is only defensible in case you need to optimize for speed in handling large quantities of data. Which is not usually the case when dealing with roles.
(As an example, LDAP or other general-purpose single-sign-on models adopt something closer to your first solution, because even in a large organization the number of USERS is always larger than the number of ROLES by at least one order of magnitude).
Even if you were designing a Facebook replacement (where you may have billions of users) it is really improbable that you will need more than a handful of roles so this would be a case of premature optimization (and - most probably - made worse by optimizing the wrong part).
In a more general sense, I strongly suggest reading the RBAC Wikipedia article for what is considered the standard approach to this kind of problem.
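For illustration, a minimal DDL sketch of the first (normalized) design, with the Role-to-Permission link modelled as a separate junction table and with illustrative key and name columns (an assumption; the listing above folds permission_id directly into Role):

CREATE TABLE [Permission] (
    permission_id int PRIMARY KEY,
    name          nvarchar(100) NOT NULL
);

CREATE TABLE [Role] (
    role_id int PRIMARY KEY,
    name    nvarchar(100) NOT NULL
);

CREATE TABLE RolePermission (
    role_id       int NOT NULL REFERENCES [Role](role_id),
    permission_id int NOT NULL REFERENCES [Permission](permission_id),
    PRIMARY KEY (role_id, permission_id)
);

CREATE TABLE [User] (
    user_id int PRIMARY KEY,
    role_id int NOT NULL REFERENCES [Role](role_id)
);

-- Granting a role a new permission is then a single insert, no schema change:
INSERT INTO RolePermission (role_id, permission_id) VALUES (1, 42);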

Database design for a company: single or multi database? [closed]

Closed 9 years ago.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which are then synchronized through an application layer. EmployeeTable is owned by HRD; if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web service.
What is the best practice? What are the pros and cons? We want it to have high availability and to be reasonably reliable. Does Option 1 necessitate clustering-and-all for this? Do big companies and universities (like Toyota, Samsung, Stanford, MIT, ...) always opt for Option 1?
I was looking in many DB textbooks but I could not find a sufficient explanation on this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I've done this type of work for 20 years. Enterprise architecture is one term used to describe this. If you are asking this question in a real enterprise scenario, I'm going to recommend you get advice. If it's a uni question, there are so many things to consider:
budget
politics
timeframes
legacy systems or green field,
Scope of Build
In house or Hosted
Complete Outsource of some or all of the functionality (SaaS)
....
Entire Methodologies are written to support projects that do this.
You can end up with many answers to the variables.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question you could write a book on.
It's like a two-paragraph question where I have seen 10 people spend a month putting a business case together to do X. That's just costing and planning the various options, without selecting the final approach.
So I have not directly answered your question... that my friend is a
serious research project, not really a StackOverflow question.
There is no single answer. It depends on many other factors, such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (a single database) and change it based on need.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason/need.
In my opinion, it would be more appropriate to have the database normalized and have several databases across the company based on departments. This would allow you to manage data more effectively in terms of storing, retrieving, and updating information, and to provide access to users based on department type or user type. You can also provide different views of the database. It will be a lot easier to manage data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data, as soon as you have multiple customer lists, multiple lists of items, multiple email addresses, you are soon into a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So in other words, the single-database approach has data integrity on its side, and you really do not want to work in a company that does not have data integrity. As a Business Intelligence practitioner, I can tell you that it is a horrible place.
On the other hand, you are going to have practical situations in which you simply must have separate systems due to application vendor constraints etc, and in that case the problem becomes one of keeping the data as tightly coupled as possible, and of Metadata Management (ugh) in which the company agrees what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as "Should I go down the ERP path or the SaaS path?" I think it is telling that right now most systems are tending towards SaaS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SaaS path). On the other hand, having one DB to connect to, one authorisation system, one place to look for details, one place to back up, etc. appears to decrease complexity in the technical space. But it then does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved, trying to find a single time when every department agrees to an upgrade can be hell. Having a degree of abstraction, so that you only have to get one department to align before updating its part of the stack, has real advantages in the coming years. And if your web services are robust and don't change with each release, this can be a much easier path.
Don't forget you can have views of data in other DBs.
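For example (a sketch, assuming the databases from the question sit on the same SQL Server instance and that HRD_DB owns the authoritative EmployeeTable; the column names are illustrative):

USE FinanceDB;
GO
-- Finance reads the HR-owned employee list in place instead of copying it.
CREATE VIEW dbo.vEmployee
AS
SELECT employee_id, full_name, department
FROM HRD_DB.dbo.EmployeeTable;
GO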
And as for your question of how most big companies work: generally by a mish-mash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or, even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (Created, Retrieved, Updated, and Deleted) and numerous places where they can be read.
And really, this design decision has little or nothing to do with availability and reliability. Those tend to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.), and spending money. Not from having one or multiple systems.

Is microsoft access a good stepping stone to learning real database management? [closed]

Closed 10 years ago.
My sister is going to start taking classes to try to learn how to become a web developer. She's sent me the class lists for a couple of candidate schools for me to help guide her decision.
One of the schools mentions Microsoft Access as the primary tool used in the database classes including relational algebra, SQL, database management, etc.
I'm wondering - if you learn Microsoft Access will you be able to easily pick up another more socially-acceptable database technology later like MySQL, Postgres, etc? My experience with Access was not pleasant and I picked up a whole lot of bad practices when I played around with it during my schooling years.
Basically: Does Microsoft Access use standards-compliant SQL? Do you learn the necessary skills for other databases by knowing how Microsoft Access works?
Access, I would say, has a lot more peculiarities than "actual" database software. Access can be used as a front end for SQL databases easily, and that's part of the program.
Let's assume the class is using databases built in Access. Then let's break it down into the parts of a database:
Tables
Access uses a simplified model for variables. Basically you can have typical number fields, text fields, etc. You can fix the number of decimals, for instance, like you could with SQL. You won't see types like varchar(x), though; you will just pick a text field and set the field size to "8", etc. However, like a real database, it will enforce the limits you've put in. Access will support OLE objects, but it quickly becomes a mess. An Access database is just stored as a single file and can become incredibly large and bloat quickly. Therefore, if you use it for more than storing address books, text databases, or linking to external sources via code, you have to be careful about how much information you store, just because the file will get too big to use.
Queries
Access implements a lot of things along the lines of SQL. I'm not aware that it is SQL compliant. I believe you can just export your Access database into something SQL Server can use. In code, you interact with the database via DAO, ADO, ADODB, and the Jet or ACE engines (some are outdated but work on older databases). However, once you get to just making queries, many things are similar. Typical commands--select, from, where, order, group, having, etc.--are normal and work as you'd see them work elsewhere. The peculiar things happen when you get into using calculated expressions and complicated joins (Access does not implement some kinds of joins, but you will see arguably the most important--inner join/union). For instance, the behavior of distinct is different in Access than in other database architectures. You are also limited in the way you use aggregate functions (sum/max/min/avg). In essence, Access works for a lot of tasks, but it is incredibly picky, and you will have to write queries just to work around problems that you wouldn't have in a real database.
Forms/Reports
I think the key feature of Access is that it is much more approachable to users that are not computer experts. You can easily navigate the tables and drag and drop to create forms and reports. So even though it's not a database in my book officially, it can be very useful...particularly if few people will be using the database and they highly prefer ease of use/light setup versus a more 'enterprise level' solution. You don't need crystal reports or someone to code a lot of stuff to make an Access database give results and allow users to add data as needed.
Why Access isn't a database
It's not meant to handle lots of concurrent connections. One person can hold the lock and there's no negotiating about it--if one person is editing certain parts of the database, it will lock all other users out, or at least limit them to read-only. Also, if you try to use Access with a lot of users or send it many requests via code, it will break after about 10-20 concurrent connections. It's just not meant for the kinds of things Oracle and MySQL are built for. It's meant for the "everyman" computer user, if you will, but has a lot of useful things programmers can exploit to make the user experience much better.
So will this be useful for you to learn more about?
I don't see how it would be a bad thing. It's an environment that you can more easily see the relational algebra and understand how to organize your data appropriately. It's a similar argument to colleges that teach Java, C++, or Python and why each has its merits. Since you can immediately move from Access to Access being the front-end (you load links to the tables) for accessing a SQL database, I'm sure you could teach a very good class with it.
MS Access is a good sand-pit in which to build databases and learn the basic (elementary) design and structure of a database.
MS Access's SQL implementation is just about equivalent to SQL 1.x syntax. Again, Access is a great app for learning the interaction between queries, tables, and views.
Make sure she doesn't get used to the macros available in Access, as their structure doesn't translate to mainstream RDBMSs. The best equivalent is stored procedures (sprocs) in a professional RDBMS, but sprocs have a thousand-fold more utility and functionality than any Access macro could provide.
Have her play with MS Access to get a look and feel for a DBMS, but once she gets comfortable with database design, have her migrate to SQL Server Express or MySQL, or both. SQL Server Express is as close to the real thing as you can get without paying for the Standard edition. MySQL is good for LAMP web infrastructures.

Are bad data issues that common? [closed]

Closed 10 years ago.
I've worked for clients that had a large number of distinct, small to mid-sized projects, each interacting with each other via properly defined interfaces to share data, but not reading and writing to the same database. Each had their own separate database, their own cache, their own file servers/system that they had dedicated access to, and so they never caused any problems. One of these clients is a mobile content vendor, so they're lucky in a way that they do not have to face the same problems that everyday business applications do. They can create all those separate compartments where their components happily live in isolation of the others.
However, for many business applications, this is not possible. I've worked with a few clients, one of whose applications I am doing the production support for, where there are "bad data issues" on an hourly basis. Yeah, it's that crazy. Some data records from one of the instances (lower than production, of course) would have been run a couple of weeks ago, and caused some other user's data to get corrupted. And then, a data script will have to be written to fix this issue. And I've seen this happening so much with this client that I have to ask.
I've seen this happening at a moderate rate with other clients, but this one just seems to be out of order.
If you're working with business applications that share a large amount of data by reading and writing to/from the same database, are "bad data issues" that common in your environment?
Bad data issues occur all the time. The only reasonably effective defense is a properly designed, normalized database, preferably one interacting with the outside world only through stored procedures.
This is why it is important to put the required data rules at the database level and not the application. (Of course, it seems that many systems don't bother at the application level either.)
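As a small illustration of putting rules at the database level, here is a sketch assuming a hypothetical dbo.Orders table; unlike application-side checks, no code path that writes to the table can skip these:

ALTER TABLE dbo.Orders ADD CONSTRAINT CK_Orders_PositiveQuantity
    CHECK (Quantity > 0);

ALTER TABLE dbo.Orders ADD CONSTRAINT CK_Orders_ValidStatus
    CHECK (Status IN ('NEW', 'SHIPPED', 'CANCELLED'));

-- Foreign keys are data rules too: an order cannot reference a missing customer.
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Customer
    FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId);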
It also seems that a lot of people who design data imports don't bother to clean the data before putting it into their system. Of course it's hard to find all the possible ways to mess up the data; I've done imports for years and I still get surprised sometimes. My favorite was the company where the data entry people obviously didn't care about the field names and the application just went to the next field when the first field was full. I got names like "McDonald, Ja" in the last-name field and "mes" in the first-name field.
I do data imports from many, many clients and vendors. Out of hundreds of different imports I've developed, I can think of only one or two where the data was clean. For some reason the email field seems to be particularly bad and is often used for notes instead of emails. It's really hard to send an email to "His secretary is the hot blonde."
Yes, very common. Getting the customer to understand the extent of the problem is another matter. At one customer I had to resort to writing an application which analyzed their database and beeped every time it encountered a record which didn't match their own published data format. I took the laptop with their DB installed to a meeting and ran the program, then watched all the heads at the table swivel around to stare at their DBA while my machine beeped crazily in the background. There's nothing quite like grinding the customer's nose in his own problems to gain attention.
I don't think you are talking about bad data (but it would only be polite of you to answer the various questions raised in comments) but invalid data. For example, '9A!' stored in a field that is supposed to contain a 3-character ISO currency code is probably invalid data, and should have been caught at data entry time. Bad data is usually taken to mean corruption caused by disk errors etc. The former is quite common, depending on the quality of the data input applications, while the latter is pretty rare.
I assume that by "bad data issues" you mean "issues of data that does not satisfy all applicable business constraints".
They can only be a consequence of two things: bad database design by the database designer (that is, either unintentional or--even worse--intentional omission of integrity constraints in the database definition), or else the inability of the DBMS to support the more complex types of database constraint, combined with a flawed program written by the programmer to enforce the DBMS-unsupported integrity constraint.
Given how poor SQL databases are at integrity constraints, and given the poor level of knowledge of data management among the average "modern programmer", yes such issues are everywhere.
If the data gets corrupted because users shut down their application in the middle of complex database updates, then transactions are your friend. This way you don't get an entry in the Invoice table but no entries in the InvoiceItems table. Unless committed at the end of the process, all changes made are rolled back.
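A minimal sketch of that pattern in T-SQL, assuming hypothetical Invoice and InvoiceItems tables; either both inserts persist or neither does:

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.Invoice (InvoiceId, CustomerId, InvoiceDate)
    VALUES (1001, 42, GETDATE());

    INSERT INTO dbo.InvoiceItems (InvoiceId, PartNumber, Quantity)
    VALUES (1001, 'ABC-123', 3);

    COMMIT TRANSACTION;   -- both rows become visible together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;   -- undo the partial work if any statement failed
    THROW;
END CATCH;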

Creating test data in a database [closed]

Closed 9 years ago.
I'm aware of some of the test data generators out there, but most seem to just fill name and address style databases [feel free to correct me].
We have a large, integrated, and normalised application - e.g. invoices have part numbers linked to stocking tables, customer numbers linked to customer tables, change logs linked to audit information, etc. - which are obviously difficult to fill randomly. Currently we obfuscate real-life data to get test data (but not very well).
What tools\methods do you use to create large volumes of data to test with?
Where I work we use RedGate Data Generator to generate test data.
Since we work in the banking domain, when we have to work with nominative data (credit card numbers, personal IDs, phone numbers), we developed an application that can mask these database fields so we can work with them as if they were real data.
I can say that with Redgate you can get close to what your real data looks like on a production server, since you can customize every field of every table in your DB.
You can generate data plans with VSTS Database Edition (with the latest 2008 Power tools).
It includes a Data Generation Wizard which allows automated data generation by pointing to an existing database, so you get something that is realistic but contains entirely different data.
I've rolled my own data generator that generates random data conforming to regular expressions. The basic idea is to use validation rules twice. First you use them to generate valid random data and then you use them to validate new input in production.
I've started a rewrite of the utility as it seems like a nice learning project. It's available on Google Code.
I just completed a project creating 3,500,000+ health insurance claim lines. Due to HIPAA and PHI restrictions, using even scrubbed real data is a PITA. I used a tool called Datatect for this (http://www.datatect.com/).
Some of the things I like about this tool:
Uses ODBC so you can generate data into any ODBC data source. I've used this for Oracle, SQL and MS Access databases, flat files, and Excel spreadsheets.
Extensible via VBScript. You can write hooks at various parts of the data generation workflow to extend the abilities of the tool. I used this feature to "sync up" dependent columns in the database, and to control the frequency distribution of values to align with real world observed frequencies.
Referentially aware. When populating foreign key columns, pulls valid keys from parent table.
The Red Gate product is good...but not perfect.
I found that I did better when I wrote my own tools to generate the data. I use it when I want to generate, say, customers, but it's not great if you want to simulate the randomness customers might engage in, like creating orders: some with one item, some with multiple items.
Homegrown tools will provide the most 'realistic' data I think.
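As a rough illustration of that homegrown approach, here is a T-SQL sketch assuming hypothetical Customer, Parts, Orders, and OrderItems tables (with Orders using an identity key): each generated order belongs to a real customer and gets one to five random items, so foreign keys stay valid and order sizes vary the way real ones do:

DECLARE @i int = 1, @customerId int, @orderId int;

WHILE @i <= 1000
BEGIN
    -- pick a random existing customer so the foreign key holds
    SELECT TOP (1) @customerId = CustomerId
    FROM dbo.Customer
    ORDER BY NEWID();

    INSERT INTO dbo.Orders (CustomerId, OrderDate)
    VALUES (@customerId, DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), GETDATE()));

    SET @orderId = SCOPE_IDENTITY();

    -- one to five random items per order, quantities 1-10
    INSERT INTO dbo.OrderItems (OrderId, PartNumber, Quantity)
    SELECT TOP (1 + ABS(CHECKSUM(NEWID())) % 5)
           @orderId, PartNumber, 1 + ABS(CHECKSUM(NEWID())) % 10
    FROM dbo.Parts
    ORDER BY NEWID();

    SET @i += 1;
END;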
Joel also mentioned RedGate in podcast #11
