Normalizing a Table with Low Integrity

I've been handed a table with about 18,000 rows. Each record describes the location of one customer. The issue is that when the person created the table, they did not add a field for "Company Name," only "Location Name," and one company can have many locations.
For example, here are some records that describe the same customer:
Location Table
ID Location_Name
1 TownShop#1
2 Town Shop - Loc 2
3 The Town Shop
4 TTS - Someplace
5 Town Shop,the 3
6 Toen Shop4
My goal is to make it look like:
Location Table
ID Company_ID Location_Name
1 1 Town Shop#1
2 1 Town Shop - Loc 2
3 1 The Town Shop
4 1 TTS - Someplace
5 1 Town Shop,the 3
6 1 Toen Shop4
Company Table
Company_ID Company_Name
1 The Town Shop
There is no "Company" table, I will have to generate the Company Name list from the most descriptive or best Location Name that represents the multiple locations.
Currently I am thinking I need to generate a list of Location Names that are similar, and then and go through that list by hand.
Any suggestions on how I can approach this is appreciated.
#Neall, thank you for your statement, but unfortunately each location name is distinct; there are no duplicate location names, only similar ones. So in the results from your statement, "repcount" is 1 in each row.
#yukondude, Your step 4 is the heart of my question.

Please update the question: do you have a list of Company Names available to you? I ask because you may be able to use the Levenshtein algorithm to find a relationship between your list of Company Names and Location Names.
Update
There is no list of Company Names; I will have to generate the company name from the most descriptive or best Location Name that represents the multiple locations.
Okay... try this:
1. Build a list of candidate Company Names by finding Location Names made up of mostly or all alphabetic characters. You can use regular expressions for this. Store this list in a separate table.
2. Sort that list alphabetically and (manually) determine which entries should be Company Names.
3. Compare each Company Name to each Location Name and compute a match score (use Levenshtein or some other string-matching algorithm). Store the result in a separate table.
4. Set a threshold score such that any MatchScore < Threshold will not be considered a match for a given Company Name.
5. Manually vet the Location Names by CompanyName | LocationName | MatchScore and figure out which ones actually match. Ordering by MatchScore should make the process less painful.
The whole purpose of the above actions is to automate parts and limit the scope of your problem. It's far from perfect, but will hopefully save you the trouble of going through 18K records by hand.
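The match-scoring step above can be sketched in a few lines of Python. Here `difflib.SequenceMatcher` from the standard library stands in for Levenshtein distance; the names come from the question's example, and the threshold value is illustrative, something you'd tune by eye:

```python
import difflib

# Candidate company names (step 2's manually vetted list) and the
# raw location names from the question's example.
company_names = ["The Town Shop"]
location_names = ["TownShop#1", "Town Shop - Loc 2", "The Town Shop",
                  "TTS - Someplace", "Town Shop,the 3", "Toen Shop4"]

def match_score(a, b):
    """Similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.5  # tune this by eyeballing the scored output

# Step 3 + 4: score every pair, keep only those above the threshold.
matches = [(c, l, round(match_score(c, l), 2))
           for c in company_names
           for l in location_names
           if match_score(c, l) >= THRESHOLD]

# Step 5: print CompanyName | LocationName | MatchScore, best first,
# for manual vetting.
for company, location, score in sorted(matches, key=lambda m: -m[2]):
    print(company, "|", location, "|", score)
```

Note that "TTS - Someplace" falls below the threshold here, which illustrates why the manual vetting pass is still needed for abbreviations.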

I've had to do this before. The only real way to do it is to manually match up the various locations. Use your database's console interface and grouping select statements. First, add your "Company Name" field. Then:
SELECT count(*) AS repcount, "Location Name" FROM mytable
WHERE "Company Name" IS NULL
GROUP BY "Location Name"
ORDER BY repcount DESC
LIMIT 5;
Figure out what company the location at the top of the list belongs to and then update your company name field with an UPDATE ... WHERE "Location Name" = "The Location" statement.
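The group-then-update loop can be demonstrated end to end with Python's stdlib `sqlite3` module; the table and column names follow the question, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE mytable (
    id INTEGER PRIMARY KEY,
    location_name TEXT,
    company_name TEXT)""")
conn.executemany(
    "INSERT INTO mytable (location_name) VALUES (?)",
    [("TownShop#1",), ("Town Shop - Loc 2",), ("The Town Shop",)])

# Step 1: list the unmatched location names, most frequent first.
rows = conn.execute("""
    SELECT COUNT(*) AS repcount, location_name
    FROM mytable
    WHERE company_name IS NULL
    GROUP BY location_name
    ORDER BY repcount DESC
    LIMIT 5""").fetchall()

# Step 2: once you've identified the company by eye, assign it.
conn.execute("UPDATE mytable SET company_name = ? WHERE location_name = ?",
             ("The Town Shop", "TownShop#1"))
remaining = conn.execute(
    "SELECT COUNT(*) FROM mytable WHERE company_name IS NULL").fetchone()[0]
print(remaining)
```

Each pass shrinks the NULL set, so the top-5 query keeps surfacing whatever is left.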
P.S. - You should really break your company names and location names out into separate tables and refer to them by their primary keys.
Update: - Wow - no duplicates? How many records do you have?

I was going to recommend a complicated token-matching algorithm, but it's really tricky to get right, and if your data does not have a lot of correlation (typos, etc.) then it's not going to give very good results.
I would recommend you submit a job to the Amazon Mechanical Turk and let a human sort it out.

Ideally, you'd probably want a separate table named Company and then a company_id column in this "Location" table that is a foreign key to the Company table's primary key, likely called id. That would avoid a fair bit of text duplication in this table (over 18,000 rows, an integer foreign key would save quite a bit of space over a varchar column).
But you're still faced with a method for loading that Company table and then properly associating it with the rows in Location. There's no general solution, but you could do something along these lines:
1. Create the Company table, with an id column that auto-increments (depends on your RDBMS).
2. Find all of the unique company names and insert them into Company.
3. Add a column, company_id, to Location that accepts NULLs (for now) and that is a foreign key to the Company.id column.
4. For each row in Location, determine the corresponding company and UPDATE that row's company_id column with that company's id. This is likely the most challenging step. If your data is like what you show in the example, you'll likely have to take many runs at this with various string-matching approaches.
5. Once all rows in Location have a company_id value, you can ALTER the Location table to add a NOT NULL constraint to the company_id column (assuming that every location must have a company, which seems reasonable).
If you can make a copy of your Location table, you can gradually build up a series of SQL statements to populate the company_id foreign key. If you make a mistake, you can just start over and rerun the script up to the point of failure.
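As a minimal sketch of the steps above (SQLite syntax via Python's `sqlite3`; the matching rule used for step 4 is a deliberately naive exact-match placeholder that you'd replace with real string matching):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Location (id INTEGER PRIMARY KEY, location_name TEXT);
    INSERT INTO Location (location_name) VALUES
        ('The Town Shop'), ('Town Shop - Loc 2');

    -- Step 1: Company with an auto-incrementing id.
    CREATE TABLE Company (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        company_name TEXT UNIQUE);
""")
# Step 2: seed Company (here hand-picked; in practice, your vetted list).
conn.execute("INSERT INTO Company (company_name) VALUES ('The Town Shop')")

# Step 3: a nullable foreign-key column on Location.
conn.execute("ALTER TABLE Location ADD COLUMN company_id INTEGER "
             "REFERENCES Company(id)")

# Step 4: associate rows -- the hard part. Exact match only, as a stub.
conn.execute("""
    UPDATE Location SET company_id =
        (SELECT id FROM Company
         WHERE company_name = Location.location_name)""")

matched = conn.execute(
    "SELECT COUNT(*) FROM Location WHERE company_id IS NOT NULL"
).fetchone()[0]
```

Because everything runs from one script against a copy of the table, rerunning after a mistake is cheap, which is exactly the workflow suggested above.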

Yes, that step 4 from my previous post is a doozy.
No matter what, you're probably going to have to do some of this by hand, but you may be able to automate the bulk of it. For the example locations you gave, a query like the following would set the appropriate company_id value:
UPDATE Location
SET Company_ID = 1
WHERE (LOWER(Location_Name) LIKE '%to_n shop%'
OR LOWER(Location_Name) LIKE '%tts%')
AND Company_ID IS NULL;
I believe that would match your examples (I added the IS NULL part to not overwrite previously set Company_ID values), but of course in 18,000 rows you're going to have to be pretty inventive to handle the various combinations.
Something else that might help would be to use the names in Company to generate queries like the one above. You could do something like the following (in MySQL):
SELECT CONCAT('UPDATE Location SET Company_ID = ',
Company_ID, ' WHERE LOWER(Location_Name) LIKE ''%',
LOWER(REPLACE(Company_Name, ' ', '%')), '%'' AND Company_ID IS NULL;')
FROM Company;
Then just run the statements that it produces. That could do a lot of the grunge work for you.

Related

Getting records structured the same way only partially

While surfing 9gag.com, an idea (problem) came to mind. Let's say that I want to create a website where users can add different kinds of entries. Each entry is a different type and needs different/additional columns.
Let's say that we can add:
a youtube video
a citation, which requires the author's first and last name
a flash game, which requires additional fields: game category, description, genre, etc.
an image, which requires the link
Now, all of the above are entries and have some columns in common (like id, add_date, adding_user_id, etc.) and some different/additional ones (for example, only a flash game needs a description, and only an image needs a plus_18 column). The question is how I should organize the DB/code to handle all of the above as entries together. I might want to order them, or search entries by add_date, etc.
The ideas that came up to my mind:
Add a "type" column which specifies what entry it is and add all the possible columns where NULL is allowed for not related to this particular type columns. But this is mega nasty. There is no data integration.
Add some column with serialized data for the additional data but it makes any filtration a total hell.
Create a master (parent) table for an entry and separate tables for concrete entry types (their additional columns / info). But here I don't even know how I'm supposed to select data properly and is just nasty as well.
So what's the best way to solve this problem?
The parent table seems like the best option.
// This is the parent table
Entry
ID PK
Common fields
Video
ID PK
EntryID FK
Unique fields
Game
ID PK
EntryID FK
Unique fields
...
What the queries will look like will largely depend on the type of query. To, for example, get all games ordered by a certain date, the query will look something like:
SELECT *
FROM Game
JOIN Entry ON Game.EntryID = Entry.ID
ORDER BY Entry.AddDate
Getting all content ordered by date will be somewhat messy. For example:
SELECT *
FROM Entry
LEFT JOIN Game ON Game.EntryID = Entry.ID
LEFT JOIN Video ON Video.EntryID = Entry.ID
...
ORDER BY Entry.AddDate
If you want to run queries like the one above, I suggest you give unique names to your primary key fields (i.e. VideoID and GameID) so you can easily identify which type of entry you're dealing with (by checking GameID IS NOT NULL for example).
Or you could add a Type field in Entry.

Database Design Questions

I have 2 questions regarding a project. Would appreciate if I get clarifications on that.
I have decomposed Address into individual entities by breaking it down to the smallest unit. But addresses are repeated in a few tables; for example, address fields appear in the Client table as well as the Employee table. Should we separate the Address into its own table with just a linking field?
For Example
Create an ADDRESS Table with the following attributes :
Entity_ID ( It could be a employee ID(Home Address) or a client ID(Office Address) )
Unit
Building
Street
Locality
City
State
Country
Zipcode
Remove all the address fields from the Employee table and the Client table
We can obtain the address by taking the employee ID and looking it up in the ADDRESS table.
Which approach is better: having the address fields in all tables, or separating them as shown above? Any thoughts on which design is better?
Yes, separating the address out is definitely better, because people can have multiple addresses, so duplicating address fields in each table increases data redundancy.
In my view, you can design the database for this problem in two ways.
A. Using one table
Table name --- ADDRESS
Column Names
Serial No. (unique id or primary key)
Client / Employee ID
Address.
B. Using Two tables
Table name --- CLIENT_ADDRESS
Column Names
Serial No. (unique id or primary key)
Client ID (foreign key to client table)
Address.
Table name --- EMPLOYEE_ADDRESS
Column Names
Serial No. (unique id or primary key)
Employee ID (foreign key to employee table)
Address.
You can certainly use as many columns as you like instead of a single address column, such as the Unit, Building, Street, etc. that you mentioned.
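Option B above, sketched with Python's `sqlite3` and with the address broken into the smaller fields the question mentions (the column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CLIENT   (client_id   INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE EMPLOYEE (employee_id INTEGER PRIMARY KEY, name TEXT);

    CREATE TABLE CLIENT_ADDRESS (
        serial_no INTEGER PRIMARY KEY,
        client_id INTEGER REFERENCES CLIENT(client_id),
        street TEXT, city TEXT, state TEXT, country TEXT, zipcode TEXT);

    CREATE TABLE EMPLOYEE_ADDRESS (
        serial_no INTEGER PRIMARY KEY,
        employee_id INTEGER REFERENCES EMPLOYEE(employee_id),
        street TEXT, city TEXT, state TEXT, country TEXT, zipcode TEXT);

    INSERT INTO EMPLOYEE VALUES (1, 'Alice');
    -- one employee, two addresses: EMPLOYEE itself stays free of
    -- repeated address columns
    INSERT INTO EMPLOYEE_ADDRESS VALUES
        (1, 1, '1 Main St', 'Springfield', 'IL', 'US', '62701'),
        (2, 1, '9 Elm St',  'Springfield', 'IL', 'US', '62702');
""")
addresses = conn.execute("""
    SELECT street FROM EMPLOYEE_ADDRESS WHERE employee_id = 1
    ORDER BY serial_no""").fetchall()
```

The one-to-many link is what makes multiple addresses per person possible without widening the EMPLOYEE or CLIENT tables.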
Also, here is one suggestion from my experience: add these five columns to each and every table.
CREATED_BY (which application user created this row)
CREATED_ON (the date and time the row was created)
MODIFIED_BY (which application user last modified this row)
MODIFIED_ON (the date and time the row was last modified)
DELETE_FLAG (0 -- deleted, 1 -- active)
The reason for this, from the point of view of most developers, is that your client can at any time demand records from any time period. If you are really deleting rows, that will put you in a serious situation. So every time an application user deletes a record from the GUI, you set the flag to 0 instead of actually deleting the row. The default value is 1, which means the row is still active.
At retrieval time you can select with a WHERE condition like this:
SELECT * FROM EMPLOYEE_TABLE WHERE DELETE_FLAG = 1;
Note: this is a suggestion from my experience; I am not at all forcing you to adopt it, so apply it according to your requirements.
Also, tables which don't serve any significant purpose don't need this.
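The DELETE_FLAG pattern as a small `sqlite3` sketch; the "delete" is just an UPDATE that flips the flag (names and sample data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_TABLE (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        CREATED_BY  TEXT,
        CREATED_ON  TEXT DEFAULT (datetime('now')),
        MODIFIED_BY TEXT,
        MODIFIED_ON TEXT,
        DELETE_FLAG INTEGER DEFAULT 1);   -- 1 = active, 0 = deleted

    INSERT INTO EMPLOYEE_TABLE (employee_id, name, CREATED_BY)
    VALUES (1, 'Alice', 'admin'), (2, 'Bob', 'admin');
""")
# "Delete" Bob: flip the flag and stamp the audit columns
# instead of removing the row.
conn.execute("""
    UPDATE EMPLOYEE_TABLE
    SET DELETE_FLAG = 0, MODIFIED_BY = 'admin',
        MODIFIED_ON = datetime('now')
    WHERE employee_id = 2""")

active = conn.execute(
    "SELECT name FROM EMPLOYEE_TABLE WHERE DELETE_FLAG = 1").fetchall()
```

Bob's row still exists for historical reporting; only queries filtered on DELETE_FLAG = 1 stop seeing it.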
Separating the address into a separate table is the better design decision, as it means any DB-side validation logic etc. only needs to be maintained in one place.

Database 3rd Normal Form

Can somebody make sure my database is in the 3rd normal form, and if not, explain why not?
I need the DB to only have 3 tables. So here it is:
Customer No. (PK)    Store No. (PK)    Sale No. (PK)
Name                 Location          Customer No. (FK)
Telephone            Revenue           Store No. (FK)
Address                                Total
Purchases($)                           Paid
Store No.
Here are what your three tables should be: Table 1: Customer
Customer No. (PK)
Name
Telephone
Address
then Table 2: Store
Store no. (PK)
Location
then Table 3: Sale
Sale No. (PK)
Customer No (FK)
Store No (FK)
Total
Paid_yes_no
If you are trying to track partial payments, etc. through the Paid column, then it would be a separate table (and a much more complicated database). However, if your Paid column just indicates whether the bill has been paid or not, the above should work.
Perhaps you need a Date field there as well?
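The three tables above as runnable DDL (SQLite via Python's `sqlite3`; the column types are assumptions, and a sale_date column is included per the suggestion above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        customer_no INTEGER PRIMARY KEY,
        name      TEXT,
        telephone TEXT,
        address   TEXT);

    CREATE TABLE Store (
        store_no INTEGER PRIMARY KEY,
        location TEXT);

    CREATE TABLE Sale (
        sale_no     INTEGER PRIMARY KEY,
        customer_no INTEGER REFERENCES Customer(customer_no),
        store_no    INTEGER REFERENCES Store(store_no),
        total       REAL,
        paid        INTEGER,   -- 0/1: has the bill been paid?
        sale_date   TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Note that no derived values (like a running purchases total) are stored; those are computed from Sale when needed.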
There are a few problems, some of this may be as the specification is for homework rather than the real world.
Store No. in Customers is a duplicated column. It's reasonable to expect that a business with multiple stores will have customers who use multiple stores - unless your specification says otherwise, in which case the taxonomy (naming) should be expanded; you could consider First Store No. or Home Store No. instead. Also, it should be marked up as a foreign key if it is to remain.
Purchases($) in the Customer table is reliant on other data which will change. Since it is derived from other information, you should not store it.
Address is not a single column - it has multiple parts like Street, City, State, Country & ZIP Code, which may in turn require extra tables to fully satisfy 2nd Normal Form. Similarly, Telephone is unlikely to be just one number.
Each thing you need to know should appear once and once only. If you can calculate it from something else you should do that rather than store the answer. In the real world you might sometimes cache some information in a table or batch process it for performance, but those would be applied later and only if necessary.
A brief overview of database normalisation is at http://databases.about.com/od/specificproducts/a/normalization.htm which you should probably look through before reworking your project.

Is it bad practice to store multiple ids into one table column

I'm working on a project where the database has a few tables that contain a type_id field that stores id's from multiple tables
for instance:
id | table_type | table_id
==========================
1  | ADDRESS    | 1
2  | ADDRESS    | 2
3  | CITY       | 1
4  | CITY       | 2
4  | ADDRESS    | 3
5  | COUNTRY    | 1
The table_id field holds either an id from the Addresses table, the Cities table, or the Countries table.
I'm just wondering if this is good design, or should I avoid this whenever possible?
This table is used to grab all locations that a user has entered.
The answer is:
It Depends.
If the example table you gave was named Location and you're using it to achieve type inheritance, where Address, City, and Country are specific types of Location, then this design can work. In this case, your primary key will be in the Location table, and each of the other tables will have a foreign key to Location. If that's not how you're using it, then this is not a properly normalized database design.
This seems like a confusing design. If your tables Address, City, and Country each have their own id field, then tables which reference them should use column names like Address_id, City_id, and Country_id, respectively.
Your current design is trying to be too generic. It's bound to cause you trouble later on.
See if you can make keys from the values you have.
In this case you could make the combination of table_type and table_id the primary/unique key.
Though if table_type can only be one of a few values, maybe save the types as an enum?
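The composite-key idea, sketched in `sqlite3` (SQLite has no ENUM type, so a CHECK constraint stands in for one here; the table name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE location_ref (
        table_type TEXT CHECK (table_type IN ('ADDRESS','CITY','COUNTRY')),
        table_id   INTEGER,
        PRIMARY KEY (table_type, table_id))""")

conn.execute("INSERT INTO location_ref VALUES ('ADDRESS', 1)")
conn.execute("INSERT INTO location_ref VALUES ('CITY', 1)")  # ok: new type

# The composite key rejects a duplicate (table_type, table_id) pair.
try:
    conn.execute("INSERT INTO location_ref VALUES ('ADDRESS', 1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The CHECK constraint also rejects any table_type outside the allowed set, giving you the enum-like safety mentioned above.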

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables, a table called 'athletes' which holds the details of all my runners (potentially around 600 people/records) which contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2,000 records in it already, and this will potentially quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events, etc.; with 600 athletes in my first table, that equates to a large number of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order peoples results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
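The indexing advice might look like this in SQLite (via Python's `sqlite3`), with extra indexes on the columns the ranking queries would filter and join on; the index names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE athletes (
        mid INTEGER PRIMARY KEY,
        firstname TEXT, lastname TEXT, gender TEXT,
        birthday TEXT, nationality TEXT);

    CREATE TABLE results (
        eid INTEGER PRIMARY KEY,
        mid INTEGER REFERENCES athletes(mid),
        eventdate TEXT, eventcategory TEXT, eventdescription TEXT,
        hours INTEGER, minutes INTEGER, seconds REAL,
        distance REAL, points INTEGER, location TEXT);

    -- used by the join back to athletes
    CREATE INDEX idx_results_mid ON results (mid);
    -- used by WHERE eventcategory = ? AND eventdescription = ?
    CREATE INDEX idx_results_event
        ON results (eventcategory, eventdescription);
""")
indexes = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='index' ORDER BY name")]
```

Keeping everything in one results table and indexing the filter columns gives you the easy querying recommended below without a table per event type.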
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.
