Creating test data in a database [closed] - database

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I'm aware of some of the test data generators out there, but most seem to just fill name and address style databases [feel free to correct me].
We have a large integrated and normalised application - e.g. invoices have part numbers linked to stocking tables, customer numbers linked to customer tables, change logs linked to audit information, etc which are obviously difficult to fill randomly. Currently we obfuscate real life data to get test data (but not very well).
What tools/methods do you use to create large volumes of data to test with?

Where I work we use RedGate Data Generator to generate test data.
Since we work in the banking domain, when we have to work with nominative data (credit card numbers, personal IDs, phone numbers) we use an application we developed that can mask these database fields, so we can work with them as real data.
I can say that with Redgate you can get close to what your real data looks like on a production server, since you can customize every field of every table in your DB.

You can generate data plans with VSTS Database Edition (with the latest 2008 Power Tools).
It includes a Data Generation Wizard, which allows automated data generation by pointing at an existing database, so you get something that is realistic but contains entirely different data.

I've rolled my own data generator that generates random data conforming to regular expressions. The basic idea is to use validation rules twice. First you use them to generate valid random data and then you use them to validate new input in production.
I've started a rewrite of the utility, as it seems like a nice learning project. It's available on Google Code.
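A minimal sketch of that generate-and-validate idea in Python. The rule names and the tiny regex subset handled here are my own illustration; a real implementation would cover the full regex grammar. The key point is that the same rule is used twice: once to generate valid random data, once to validate input in production.

```python
import random
import re
import string

# Hypothetical validation rules, written as regular expressions.
RULES = {
    "part_number": r"[A-Z]{2}-\d{4}",
    "phone": r"\d{3}-\d{3}-\d{4}",
}

def generate(pattern):
    """Generate a random string for a limited regex subset:
    literal characters, [A-Z], \\d, and {n} repetition."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i:i + 5] == "[A-Z]":
            pool, i = string.ascii_uppercase, i + 5
        elif pattern[i:i + 2] == r"\d":
            pool, i = string.digits, i + 2
        else:
            pool, i = pattern[i], i + 1
        reps = 1
        m = re.match(r"\{(\d+)\}", pattern[i:])
        if m:
            reps, i = int(m.group(1)), i + m.end()
        out.append("".join(random.choice(pool) for _ in range(reps)))
    return "".join(out)

# Use each rule twice: generate valid random data, then validate it.
for field, rule in RULES.items():
    value = generate(rule)
    assert re.fullmatch(rule, value), (field, value)
```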

I just completed a project creating 3,500,000+ health insurance claim lines. Due to HIPAA and PHI restrictions, using even scrubbed real data is a PITA. I used a tool called Datatect for this (http://www.datatect.com/).
Some of the things I like about this tool:
Uses ODBC, so you can generate data into any ODBC data source. I've used this for Oracle, SQL Server, and MS Access databases, flat files, and Excel spreadsheets.
Extensible via VBScript. You can write hooks at various parts of the data generation workflow to extend the abilities of the tool. I used this feature to "sync up" dependent columns in the database, and to control the frequency distribution of values to align with real world observed frequencies.
Referentially aware. When populating foreign key columns, it pulls valid keys from the parent table.
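That referential awareness is easy to reproduce by hand if you roll your own generator. A minimal sketch using SQLite and hypothetical customer/invoice tables (my own illustration, not Datatect's mechanism): sample foreign key values from the rows that actually exist in the parent table.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id)
    );
""")
conn.executemany("INSERT INTO customer (name) VALUES (?)",
                 [("Acme",), ("Globex",), ("Initech",)])

# Referentially aware: pull the set of valid keys from the parent
# table, then sample from it when filling the child table.
valid_ids = [row[0] for row in conn.execute("SELECT id FROM customer")]
conn.executemany(
    "INSERT INTO invoice (customer_id) VALUES (?)",
    [(random.choice(valid_ids),) for _ in range(100)],
)

# Every generated invoice now joins back to a real customer.
orphans = conn.execute("""
    SELECT COUNT(*) FROM invoice i
    LEFT JOIN customer c ON c.id = i.customer_id
    WHERE c.id IS NULL
""").fetchone()[0]
assert orphans == 0
```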

The Red Gate product is good...but not perfect.
I found that I did better when I wrote my own tools to generate the data. I use it when I want to generate, say, customers, but it's not great if you want to simulate the randomness customers engage in, like creating orders: some with one item, some with multiple items.
Homegrown tools will provide the most 'realistic' data I think.

Joel also mentioned RedGate in podcast #11


Can't we omit database triggers? [closed]

Closed 9 years ago.
As Wikipedia says:
Database triggers are commonly used to:
- audit changes (e.g. keep a log of the users and roles involved in changes)
- enhance changes (e.g. ensure that every change to a record is time-stamped by the server's clock)
- enforce business rules (e.g. require that every invoice have at least one line item)
etc.
ref: database triggers - wikipedia
But we can do these things inside the Business Layer using a common programming language (especially with OOP) easily. So what is the necessity of database triggers in modern software architecture? Why do we really need them?
It might work, if all data is changed by your application only. But there are other cases which I have seen very frequently:
There are other applications (like batch jobs doing imports, etc.) which do not use the business layer
You cannot easily use plain SQL scripts as a means for hotfixes
Apart from that, in some cases you can even combine both worlds: define a trigger in the database, and use Java to implement it. PostgreSQL, for example, supports triggers written in Java. As for Oracle, you can call a Java method from a PL/SQL trigger. And you can define CLR-based triggers in MS SQL Server.
This way not every programmer needs to learn PL/SQL, and data integrity is enforced by the database.
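As a minimal illustration of the point (using SQLite from Python rather than PL/SQL or CLR, purely so it runs anywhere; the table names are my own), an audit trigger lives in the database and fires no matter which client performs the write:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE audit_log (
        account_id INTEGER,
        old_balance INTEGER,
        new_balance INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- The audit rule lives in the database, so it fires no matter
    -- which application (or ad-hoc script) performs the update.
    CREATE TRIGGER account_audit AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")

conn.execute("INSERT INTO account (balance) VALUES (100)")
conn.execute("UPDATE account SET balance = 150 WHERE id = 1")

row = conn.execute(
    "SELECT old_balance, new_balance FROM audit_log").fetchone()
assert row == (100, 150)
```

No application code was involved in writing the audit row; a direct UPDATE from any tool would be logged the same way.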
Think about the performance. If this is all to be done from the application, there are most likely a lot of extra SQL*Net round trips, slowing down the application. Having those actions defined in the database makes sure they are always enforced, not only when the application is used to access the data.
When the database is in control, you have your rules defined on the central location, the database, instead of in many locations in the application.
Yes, you can completely omit database triggers.
However, if you can't guarantee that your database will only ever be accessed from the application layer (and you can't), then you need them. Yes, you can perform all your database logic in the application layer, but if you have a table that needs X done to it when you're updating it, then the only way to guarantee that is a trigger. Otherwise, people accessing your database directly, outside your application, will break your application.
There is nothing else you can do. If you need a trigger, use one. Do not assume that all connections to your database will be through your application...

Is microsoft access a good stepping stone to learning real database management? [closed]

Closed 10 years ago.
My sister is going to start taking classes to try to learn how to become a web developer. She's sent me the class lists for a couple of candidate schools for me to help guide her decision.
One of the schools mentions Microsoft Access as the primary tool used in the database classes including relational algebra, SQL, database management, etc.
I'm wondering - if you learn Microsoft Access will you be able to easily pick up another more socially-acceptable database technology later like MySQL, Postgres, etc? My experience with Access was not pleasant and I picked up a whole lot of bad practices when I played around with it during my schooling years.
Basically: Does Microsoft Access use standards-compliant SQL? Do you learn the necessary skills for other databases by knowing how Microsoft Access works?
Access, I would say, has a lot more peculiarities than 'actual' database software. That said, Access can easily be used as a front end for SQL databases; that's part of the program.
Let's assume the class is using databases built in Access. Then let's break it down into the parts of a database:
Tables
Access uses a simplified model for variables. Basically you can have typical number fields, text fields, etc. You can fix the number of decimals, for instance, like you could with SQL. You won't see types like varchar(x), though; you just pick a text field and set the field size to "8", etc. However, like a real database, it will enforce the limits you've put in. Access will support OLE objects, but it quickly becomes a mess. An Access database is stored as a single file and can become incredibly large and bloat quickly, so if you use it for more than storing address books, text databases, or linking to external sources via code, you have to be careful about how much information you store, just because the file will get too big to use.
Queries
Access implements a lot of things along the lines of SQL. I'm not aware that it is SQL-compliant; I believe you can just export your Access database into something SQL Server can use. In code, you interact with a SQL database via DAO, ADO, or ADODB and the Jet or Ace engines (some are outdated, but they work on older databases). However, once you get to just writing queries, many things are similar. Typical commands--select, from, where, order, group, having, etc.--are normal and work as you'd see them work elsewhere. The peculiar things happen when you get into calculated expressions and complicated joins (Access does not implement some kinds of joins, but you will see arguably the most important: inner join/union). For instance, the behavior of DISTINCT is different in Access than in other database architectures. You are also limited in the way you use aggregate functions (sum/max/min/avg). In essence, Access works for a lot of tasks, but it is incredibly picky, and you will have to write queries just to work around problems that you wouldn't have in a real database.
Forms/Reports
I think the key feature of Access is that it is much more approachable to users that are not computer experts. You can easily navigate the tables and drag and drop to create forms and reports. So even though it's not a database in my book officially, it can be very useful...particularly if few people will be using the database and they highly prefer ease of use/light setup versus a more 'enterprise level' solution. You don't need crystal reports or someone to code a lot of stuff to make an Access database give results and allow users to add data as needed.
Why Access isn't a database
It's not meant to handle lots of concurrent connections. One person can hold the lock and there's no negotiating about it: if one person is editing certain parts of the database, it will lock all other users out, or at least limit them to read-only. Also, if you try to use Access with a lot of users or send it many requests via code, it will break after about 10-20 concurrent connections. It's just not meant for the kinds of things Oracle and MySQL are built for. It's meant for the 'everyman' computer user, if you will, but has a lot of useful things programmers can exploit to make the user experience much better.
So will this be useful for you to learn more about?
I don't see how it would be a bad thing. It's an environment that you can more easily see the relational algebra and understand how to organize your data appropriately. It's a similar argument to colleges that teach Java, C++, or Python and why each has its merits. Since you can immediately move from Access to Access being the front-end (you load links to the tables) for accessing a SQL database, I'm sure you could teach a very good class with it.
MS Access is a good sandpit in which to build databases and learn the elementary design and structure of a database.
MS Access's SQL implementation is just about equivalent to SQL-1.x syntax. Again, Access is a great app for learning the interaction between queries, tables, and views.
Make sure she doesn't get used to the macros available in Access, as their structure doesn't translate to mainstream RDBMSs. The closest equivalent is stored procedures (sprocs) in a professional RDBMS, but sprocs have a thousandfold more utility and functionality than any Access macro could provide.
Have her play with MS Access to get a look and feel for a DBMS, but once she gets comfortable with database design, have her migrate to either MS SQL Express or MySQL, or both. SQL Express is as close to the real thing as you can get without paying for MS SQL Standard. MySQL is good for LAMP web infrastructures.

Approaches to finding / controlling illegal data [closed]

Closed 10 years ago.
Search and destroy / capturing illegal data...
The Environment:
I manage a few very "open" databases. The type of access is usually full select/insert/update/delete. The mechanism for accessing the data is usually through linked tables (to SQL Server) in custom-built MS Access databases.
The Rules
No social security numbers, etc. (e.g., think FERPA/HIPAA).
The Problem
Users enter / hide the illegal data in creative ways (e.g., ssn in the middle name field, etc.); administrative/disciplinary control is weak/ineffective. The general attitude (even from most of the bosses) is that security is a hassle, if you find a way around it then good for you, etc. I need a (better) way to find the data after it has been entered.
What I've Tried
Initially, I made modifications to the various custom-built user interfaces folks had (that I was aware of), all the way down to the table structures they were linking to on our database server. The SSNs, for example, no longer had a field of their own, etc. And yet...I continue to find them buried in other data fields.
After a secret audit some folks at my institution did, where they found this buried data, I wrote some SQL that (literally) checks every character in every field in every table of the database, looking for anything that matches an SSN pattern. It takes a long time to run, and the users are finding ways around my pattern definitions.
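A sketch of that kind of scan, here in Python against SQLite rather than the T-SQL the poster actually wrote (the patterns and the schema are illustrative): walk every table, every text column, and flag any value matching an SSN shape.

```python
import re
import sqlite3

# Shapes users smuggle SSNs in with: dashed, and 9 bare digits.
SSN_PATTERNS = [re.compile(p) for p in (r"\b\d{3}-\d{2}-\d{4}\b",
                                        r"\b\d{9}\b")]

def scan_for_ssns(conn):
    """Check every text value in every column of every table."""
    hits = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
        for row in conn.execute(f"SELECT rowid, * FROM {table}"):
            rowid, values = row[0], row[1:]
            for col, value in zip(cols, values):
                if isinstance(value, str) and any(
                        p.search(value) for p in SSN_PATTERNS):
                    hits.append((table, rowid, col, value))
    return hits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first TEXT, middle TEXT, last TEXT)")
conn.execute("INSERT INTO person VALUES ('Jo', '123-45-6789', 'Smith')")
assert scan_for_ssns(conn) == [("person", 1, "middle", "123-45-6789")]
```

A scheduled job could run a scan like this and publish the hits; as the poster notes, the hard part is keeping the pattern list ahead of creative users.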
My Question
Of course, a real solution would entail policy enforcement. That has to be addressed (way) above my head, however, it is beyond the scope and authority of my position.
Are you aware of, or do you use, any (free or commercial) tools targeted at auditing for FERPA & HIPAA data? (Or, if not those policies specifically, then just data patterns in general?)
I'd like to find something that I can run on a schedule and that stays updated with new pattern definitions.
I would monitor the users, in two ways.
The same users are likely to be entering the same data, so track who is getting around the roadblocks, and identify them. Ensure that they are documented as fouling the system, so that they are disciplined appropriately. Their efforts create risk (monetary, and legal, which becomes monetary) for the entire organization.
Look at the queries that users issue. If they are successful in searching for the information, then it is somehow stored in the repository.
If you are unable to track users, begin instituting passwords.
In the long-run, though, your organization needs to upgrade its users.
In the end you are fighting an impossible battle unless you have support from management. If it's illegal to store an SSN in your DB, then this rule must have explicit support from the top. #Iterator is right, record who is entering this data and document their actions: implement an audit trail.
Search across the audit trail rather than the database itself. This should be quicker; you only have one day (or one hour, or ...) of data to search. Record and publish each violation.
You could tighten up some validation. No numeric field, I guess, needs to be as long as an SSN. No name field needs numbers in it. No address field needs more than 5 or 6 numbers in it (how many houses are there on Route 66?). Hmmm, could a phone number be used to represent an SSN? Trouble is, you can't stop someone entering 'acaaabdf' etc. (encoding 131126, etc.); there's always a way to defeat your checks.
You'll never achieve perfection, but you can at least catch the accidental offender.
One other suggestion: you can post a new question asking about machine learning plugins (essentially statistical pattern recognition) for your database of choice (MS Access). By flagging some of the database updates as good/bad, you may be able to leverage an automated tool to find the bad stuff and bring it to your attention.
This is akin to spam filters that find the bad stuff and remove it from your attention. However, to get good answers on this, you may need to provide a bit more details in the question, such as the # of samples you have (if it's not many, then a ML plugin would not be useful), your programming skills (for what's known as feature extraction), and so on.
Despite this suggestion, I believe it's better to target the user behavior than to build a smarter mousetrap.

Recommendation for an in-memory database [closed]

Closed 10 years ago.
I would like to remove sql dependency of small chunks of data that I load on (almost) each request on a web application. Most of the data is key-value/document structured, but a relational solution is not excluded. The data is not too big so I want to keep it in memory for higher availability.
What solution would you recommend?
The simplest and most widely used in-memory key-value store is Memcached. The introduction page reiterates what you are asking for:
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
The client list is impressive. It's been around for a long time. Good documentation. It has APIs for almost every programming language. Horizontal scaling is pretty simple. In my experience, Memcached is good.
You may also want to look into MemBase.
Redis is perfect for this kind of data. It also supports some fundamental datastructures and provides operations on them.
I recently converted my Django forum app to use it for all real-time/tracking data - it's so good to no longer have the icky feeling you get when you do this kind of stuff (SET views = views + 1 and other writes on every page view) with a relational database.
Here's an example of using Redis to store data required for user activity tracking, including keeping an ordered set of last seen users up to date, in Python:
    import datetime
    import time

    from django.utils.html import escape

    def seen_user(user, doing, item=None):
        """
        Stores what a User was doing when they were last seen and updates
        their last seen time in the active users sorted set.
        """
        last_seen = int(time.mktime(datetime.datetime.now().timetuple()))
        redis.zadd(ACTIVE_USERS, user.pk, last_seen)
        redis.setnx(USER_USERNAME % user.pk, user.username)
        redis.set(USER_LAST_SEEN % user.pk, last_seen)
        if item:
            doing = '%s <a href="%s">%s</a>' % (
                doing, item.get_absolute_url(), escape(str(item)))
        redis.set(USER_DOING % user.pk, doing)
If you don't mind the sql but want to keep the db in memory, you might want to check out sqlite (see http://www.sqlite.org/inmemorydb.html).
If you don't want the sql and you really only have key-value pairs, why not just store them in a map / hash / associative array and be done with it?
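For example, Python's built-in sqlite3 module exposes SQLite's in-memory mode directly; the ":memory:" name gives you a private database that lives only in the current process:

```python
import sqlite3

# ":memory:" creates a database that exists only in RAM, in-process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")

row = conn.execute(
    "SELECT value FROM settings WHERE key = 'theme'").fetchone()
assert row == ("dark",)
```

Note that such a database vanishes when the connection closes, so it suits caches and per-request data, not durable storage.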
If you end up needing an in-memory database, H2 is a very good option.
One more database to consider: Berkeley DB. Berkeley DB allows you to configure the database to be in-memory, on-disk or both. It supports both a key-value (NoSQL) and a SQL API. Berkeley DB is often used in combination with web applications because it's embedded, easily deployed (it deploys with your application), highly configurable and very reliable. There are several e-Retail web sites that rely on Berkeley DB for their e-Commerce applications, including Amazon.com.
I'm not sure this is what you are looking for, but you should look into a caching framework (something that may be included in the tools you are using now). With a repository pattern, you ask for the data, then check whether you have it in the cache by key. If you don't, you fetch it from the database; if you do, you fetch it from the cache.
It will depend on what kind of data you are handling, so it's up to you to decide how long to keep data in the cache. Perhaps a sliding timeout is best, as you'll keep the data as long as the key keeps being requested. That means that if the cache holds data for a user, once the user goes away, the data will expire from the cache.
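A minimal sketch of that cache-aside repository with a sliding timeout (the class and names are my own illustration, standing in for whatever caching framework your stack provides): every hit renews the entry's expiry, so data stays cached only while it keeps being requested.

```python
import time

class SlidingCache:
    """Cache-aside with a sliding timeout: each hit renews the entry,
    so data stays cached only while it keeps being requested."""

    def __init__(self, load, ttl_seconds):
        self._load = load          # fallback: fetch from the database
        self._ttl = ttl_seconds
        self._store = {}           # key -> (value, expires_at)

    def get(self, key):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            value = entry[0]         # cache hit
        else:
            value = self._load(key)  # cache miss: hit the database
        self._store[key] = (value, now + self._ttl)  # renew the timeout
        return value

# Stand-in "database" that records each load it serves.
db_calls = []
cache = SlidingCache(load=lambda k: db_calls.append(k) or f"row:{k}",
                     ttl_seconds=60)
assert cache.get("user:1") == "row:user:1"   # miss: loads from the db
assert cache.get("user:1") == "row:user:1"   # hit: no second load
assert db_calls == ["user:1"]
```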
Can you shard this data? Is the data access pattern simple and stable (i.e., it does not change with changing business requirements)? How critical is this data (session context, for example, is not too hard to restore, whereas some preferences a user has entered on a settings page should not be lost)?
Typically, provided you can shard and your data access patterns are simple and do not mutate too much, you choose Redis. If you look for something more reliable and supporting more advanced data access patterns, Tarantool is a good option.
Please do check out this:
http://www.mongodb.org/
It's a really good NoSQL database with drivers and support for all major languages.

What are the use cases of Graph-based Databases (http://neo4j.org/)? [closed]

Closed 10 years ago.
I have used Relational DB's a lot and decided to venture out on other types available.
This particular product looks good and promising: http://neo4j.org/
Has anyone used graph-based databases? What are the pros and cons from a usability perspective?
Have you used these in a production environment? What was the requirement that prompted you to use them?
I used a graph database in a previous job. We weren't using neo4j, it was an in-house thing built on top of Berkeley DB, but it was similar. It was used in production (it still is).
The reason we used a graph database was that the data being stored by the system and the operations the system was doing with the data were exactly the weak spot of relational databases and were exactly the strong spot of graph databases. The system needed to store collections of objects that lack a fixed schema and are linked together by relationships. To reason about the data, the system needed to do a lot of operations that would be a couple of traversals in a graph database, but that would be quite complex queries in SQL.
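As an illustration of that difference (my own toy example, not the in-house engine described above): a schema-less node store plus a breadth-first traversal is only a few lines, where the SQL equivalent would be a chain of joins over rigid tables.

```python
from collections import deque

# A tiny property graph: nodes carry free-form attributes,
# edges carry a relationship name.
nodes = {
    "inv1": {"type": "invoice"},
    "p34":  {"type": "part", "name": "P34 relay"},
    "cust": {"type": "customer"},
}
edges = {
    "inv1": [("CONTAINS", "p34"), ("BILLED_TO", "cust")],
    "p34":  [],
    "cust": [],
}

def traverse(start, max_depth):
    """Breadth-first traversal: the graph-database equivalent of a
    multi-way join, without knowing the schema up front."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth < max_depth:
            for _rel, target in edges[node]:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))

assert list(traverse("inv1", 1)) == ["inv1", "p34", "cust"]
```

Adding a new node attribute or relationship type changes nothing in this code, which is the flexibility the answer describes.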
The main advantages of the graph model were rapid development time and flexibility. We could quickly add new functionality without impacting existing deployments. If a potential customer wanted to import some of their own data and graft it on top of our model, it could usually be done on site by the sales rep. Flexibility also helped when we were designing a new feature, saving us from trying to squeeze new data into a rigid data model.
Having a weird database let us build a lot of our other weird technologies, giving us lots of secret-sauce to distinguish our product from those of our competitors.
The main disadvantage was that we weren't using the standard relational database technology, which can be a problem when your customers are enterprisey. Our customers would ask why we couldn't just host our data on their giant Oracle clusters (our customers usually had large datacenters). One of the team actually rewrote the database layer to use Oracle (or PostgreSQL, or MySQL), but it was slightly slower than the original. At least one large enterprise even had an Oracle-only policy, but luckily Oracle bought Berkeley DB. We also had to write a lot of extra tools - we couldn't just use Crystal Reports for example.
The other disadvantage of our graph database was that we built it ourselves, which meant when we hit a problem (usually with scalability) we had to solve it ourselves. If we'd used a relational database, the vendor would have already solved the problem ten years ago.
If you're building a product for enterprisey customers and your data fits into the relational model, use a relational database if you can. If your application doesn't fit the relational model but it does fit the graph model, use a graph database. If it only fits something else, use that.
If your application doesn't need to fit into the current blub architecture, use a graph database, or CouchDB, or BigTable, or whatever fits your app and you think is cool. It might give you an advantage, and it's fun to try new things.
Whatever you choose, try not to build the database engine yourself unless you really like building database engines.
We've been working with the Neo team for over a year now and have been very happy. We model scholarly artifacts and their relationships, which is spot on for a graph db, and run recommendation algorithms over the network.
If you are already working in Java, I think that modeling using Neo4j is very straight forward and it has the flattest / fastest performance for R/W of any other solutions we tried.
To be honest, I have a hard time not thinking in terms of a Graph/Network because it's so much easier than designing convoluted table structures to hold object properties and relationships.
That being said, we do store some information in MySQL simply because it's easier for the Business side to run quick SQL queries against. To perform the same functions with Neo we would need to write code that we simply don't have the bandwidth for right now. As soon as we do though, I'm moving all that data to Neo!
Good luck.
Two points:
First, on the data I've been working with for the past 5 years in SQL Server, I've recently hit the scalability wall with SQL for the type of queries we need to run (nested relationships...you know...graphs). I've been playing around with neo4j, and my lookup times are several orders of magnitude faster when I need this kind of lookup.
Second, to the point that graph databases are outdated: um...no. Early on, as people were trying to figure out how to store and look up data efficiently, they created and played with graph- and network-style database models. These were designed so the physical model reflected the logical model, so their efficiency wasn't that great. This type of data structure was good for semi-structured data, but not as good for structured, dense data. So, this IBM dude named Codd was researching efficient ways to arrange and store structured data and came up with the idea for the relational database model. And it was good, and people were happy.
What do we have here? Two tools for two different purposes. Graph database models are very good for representing semi-structured data and the relationships between entities (that may or may not exist). Relational databases are good for structured data that has a very static schema, and where join depths do not go very deep. One is good for one kind of data, the other is good for other kinds of data.
To coin a phrase, there is no silver bullet. It's very short-sighted to say that graph database models are out of date and that using one gives up 40 years of progress. That's like saying using C gives up all the technological progress we've gone through to get things like Java and C#. That's not true, though. C is a tool that is needed for certain tasks, and Java is a tool for other tasks.
I've been using MySQL for years to manage engineering data, and it worked well, but one of the problems we had (but didn't realise we had) was that we always had to plan the schema up-front. Another problem we knew we had was mapping the data up to domain objects and back.
Now we've just started trying out neo4j and it looks like it is solving both problems for us. The ability to add different properties to each node (and relation) has allowed us to re-think our entire approach to data. It is like dynamic versus static languages (Ruby versus Java), but for databases. Building the data model in the database can be done in a much more agile and dynamic way, and that is dramatically simplifying our code.
And since the object model in code is generally a graph structure, mapping from the database is also simpler, with less code and consequently fewer bugs.
And as an additional bonus, our initial prototype code for loading our data into neo4j is actually performing faster than the previous MySQL version. I have no solid numbers on this (yet), but it was a nice extra.
But at the end of the day, the choice probably should be based mostly on the nature of your domain model. Does it map better to tables or graphs? Decide by doing some prototypes, load the data and play with it. Use neoclipse to look at different views of the data. Once you've done that, hopefully you know if you're on to a good thing or not.
Here is a good article that talks about the needs that non-relational databases fill: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
It does a good job of pointing out (the name aside) that relational databases aren't flawed or wrong; it's just that these days people are processing more and more data in mainstream software and web sites, and relational databases just won't scale for those needs.
I am building an intranet at my company.
I am interested in understanding how to load data that was stored in tables (Oracle, MySQL, SQL Server, Excel, Access, various random lists) into Neo4j, or some other graph database. Specifically, what happens when common data overlaps data already in the system?
Yes, I know some data is best modeled in RDBMS, but I have this idea itching me, that when you need to superimpose several distinct tables, the graph model is better than the table structure.
For instance, I work in a manufacturing environment. There is a major project we are working on, and because of its complexity, each department has created a separate Excel spreadsheet that has a BOM (Bill Of Materials) hierarchy in a column on the left and then several columns of notes and checks made by the individuals who made these sheets.
So one of the problems is merging all these notes together into one "view" so that someone can see all the issues that need to be addressed in any particular part.
The second problem is that an Excel spreadsheet sucks at representing a hierarchical BOM when a common component is used in more than one subassembly. Meaning that, if someone writes a note about the P34 relay in the ignition subassembly, the same comment should be associated with the P34 relays used in the motor driver subassembly. This won't happen in the Excel spreadsheet.
For the company intranet, I want to be able to search for anything easily. Such as data related to a part number, a BOM structure, a phone number, an email address, a company policy, or procedure. I want to even extend this to manage computer hardware assets, and installed software.
I envision that once the information network starts to get populated you can start doing cool traversals such as "I want to write an email to everyone working on the XYZ project". People will have been associated with the project because they will be tagged as creating and modifying the data within the XYZ project. So by using the XYZ project as a search key, a huge set with everything related to the XYZ project will be created. Including links to people who built the XYZ project. The people links will connect to their email addresses. So by their involvement in the XYZ project, they will be included in my email. This is in stark contrast to some secretary trying to maintain a list of people work on the project. We generate a lot of lists. We spend a lot of time maintaining lists and making sure they are up to date. And most of it doesn't add any value to our products.
Another cool traversal could report all the computers that have a certain piece of software installed, by version. That report could be used to generate tasks to remove extra copies of old software and to update people who need to have the latest copy. It would also be useful for license tracking.
Might be a bit late, but there is a growing number of projects using Neo4j; the better-known ones are listed on the Neo4j site. Also, Neo Technology, the company behind Neo4j, has some references on their customers page.
Note: I am part of the Neo4j team
