What would be a better database structure for this? - database

I'm building a huge database of IP addresses with their geographic location attached (country, city, ect).
Right now, I'm using this simple database structure:
id || ip_addr || country || city ||
I've already starting building it, and I've got almost 1 million records, already. The thing is, lots of addresses have the same country attached and fetching from the database is becoming really slow.
I was thinking, if I do this:
countryTable:
countryID || countryName ||
cityTable:
cityID || cityName || countryID (for what country the city is in) ||
and then, ipTable:
id || ip_addr || countryID || cityID
Would it make fetching any faster?
Is this method more efficient (does it have any other benefits)? Or should I just stick to what I have already?

Yes, moving countries and cities to a separate table is actually a normalization and is a very good step. I would go even further with normalization: a city is located in a country, which means knowing a city you also always know the country. Thus try this:
id || ip_addr || cityID
cityTable:
id || cityName || countryID
countryTable:
countryID || countryName
An extra reference to the country in IP table is unnecessary. Note that this design is not problematic when several cities have the same name like Warsaw (Poland), Warsaw (Indiana, US) and a dozen others - there are duplicated names in the database but ids are different - and you identify cities by id - happening to point to the same name (but in different country).
However I don't understand why you have a separate id column when unique ip_addr exists (providing that a single IP has only one address attached)?
ip_addr (ID) || cityID
Remember that IP address can and should be represented as a number (some databases have built-in database for that), so such a key is as good as artificial one.
Finally, typically continuous ranges of IPs are assigned to the same area/city/district. You will save a lot of space by assigning a range of IPs to location rather than each and every IP.

Yes, normalization typically improves performance. Although the primary reason for normalization is usually data consistency. However in some cases denormalization actually improves performance. This is done in data warehouses and reporting to reduce the number of joins required to filter and compose the result of a query.
One important part here is that the database gets much smaller and more data fits into RAM.
Another key point to performance is having indexes supporting your typical queries.
If you search by city name you should have an index on cityTable.cityName, etc. This way the database can find your data using an efficient search, just reading a few records, instead of scanning the whole database.

Related

Reusing a database table for many other entities? Is this possible?

Say for example, I have an ADDRESS table, that will store similar attributes of other entities like address, city, zip, country, etc. The entities are USER, COMPANY, BANK, BRANCH, etc. I would like to use this one table ADDRESS to store the addresses of the other entities rather than creating other tables for each entity to store the ADDRESS like so, USER_ADDRESS, COMPANY_ADDRESS, BANK_ADDRESS, BRANCH_ADDRESS.
Is this possible? Am i breaking any laws or conventions? What are the consequences, if any?
Each entity (USER, COMPANY, etc.) should contain a reference to an entry in the ADDRESS table.
There are a few issues:
If 2 users have the same address, they should reference the same address id.
You will need to normalise addresses so that you're not duplicating information (e.g. if you know the city, then you automatically know the zip and country).
Of course, you may not want a well-normalised database. Saving the entire address as a string will improve read performance by reducing the number of join operations.
A lot of things depend on the exact use of the database.
It is fine to use a single ADDRESS table for that purpose and have an ADDRESS_ID in each of the other entities. Depends on the use case and the way you prefer to implement it. I most probably wouldn't do it. I also wouldn't do the other solution you're suggesting (an address table per entity).
So, let's say you want to implement a function to search for all the addresses, where it doesn't matter what type of entity is connected to it. You will have to search the ADDRESS table. If you get results, then you have to search the other four tables to see which record is connected to that address.
You could add a field ENTITY_TYPE in the ADDRESS table where you specify which type of entity it is connected to, so you don't have to search the four tables, but I don't recommend this since you can have consistency errors (USER 17 points to ADDRESS 14, but ADDRESS 14 has ENTITY_TYPE = BANK).
Now, with your other solution (having four separate tables to store the addresses of the four different entities) you're just going to have to search those four tables and then search the corresponding entity table to get the entity you're looking for.
My solution in most cases is adding the address fields to the entities tables themselves. Having ADDRESS, ZIP_CODE and COUNTRY_CODE (always use proper country codes, not country names https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) will make it simple. When you present a list of items (users, banks, companies, offices, whatever), it's really common to show the name and the address at the same time in a table. Having no JOINS makes it faster and easier to process. If you want to update an address, it's on the table itself. No lookups!
Of course, like most things in programming, it depends on what your needs are.
Also, please, don't try to split the ADDRESS in more fields. I've seen ADDRESS_TYPE (street, road, avenue, square, ...), STREET_NAME, STREET_NUMBER, BLOCK_NUMBER, BLOCK_FLOOR, BLOCK_LETTER. I'm pretty sure you're never going to need something like SELECT * FROM USER WHERE STREET_NUMBER = 74.

Database Design Questions

I have 2 questions regarding a project. Would appreciate if I get clarifications on that.
I have decomposed Address into individual entities by breaking down to the smallest
unit. Bur addresses are repeated in a few tables. Like Address fields are there in the
Client table as well as Employee table. Should we separate the Address into a separate table with just a linking field
For Example
Create an ADDRESS Table with the following attributes :
Entity_ID ( It could be a employee ID(Home Address) or a client ID(Office Address) )
Unit
Building
Street
Locality
City
State
Country
Zipcode
Remove all the address fields from the Employee table and the Client table
We can obtain the address by getting the employee ID and referring the ADDRESS table for the address
Which approach is better ? Having the address fields in all tables or separate as shown above. Any thoughts on which design in better ?
Ya definitely separating address is better Because people can have multiple addresses so it will be increasing data redundancy.
You can design the database for this problem in two ways according to me.
A. Using one table
Table name --- ADDRESS
Column Names
Serial No. (unique id or primary key)
Client / Employee ID
Address.
B. Using Two tables
Table name --- CLIENT_ADDRESS
Column Names
Serial No. (unique id or primary key)
Client ID (foreign key to client table)
Address.
Table name --- EMPLOYEE_ADDRESS
Column Names
Serial No. (unique id or primary key)
Client ID (foreign key to employee table)
Address.
Definitely you can use as many number of columns instead of address like what you mentioned Unit,Building, Street e.t.c
Also there is one suggestion from my experience
Please add this five Columns in your each and every table.
CREATED_BY (Who has created this row means an user of the application)
CREATED_ON (At what time and date table row was created)
MODIFIED_ON (Who has modified this row means an user of the application)
MODIFIED_BY (At what time and date table row was modified)
DELETE_FLAG (0 -- deleted and 1 -- Active)
The reason for this from point of view of most of the developers is, Your client can any time demand records of any time period. So If you are deleting in reality then it will be a serious situation for you. So every time when a application user deleted an record from gui you have to set the flag as 0 instead of practically deleting it. The default value is 1 which means the row is still active.
At time of retrieval you can select with where condition like this
select * from EMPOLOYEE_TABLE where DELETE_FLAG = 1;
Note : This is an suggestion from my experience. I am not at all enforcing you to adopt this. So please add it according to your requirement.
ALSO tables which don't have any significant purpose doesn't need this.
Separating address into a seperate table is a better design decision as it means any db-side validation logic etc. only needs to be maintained in one place.

Simple database design question about foreign keys

I have a simple question about database desing...
Let's say we have Table Customer with some fields:
(PK) Id,
Firstname,
Lastname,
Address,
City,
(FK) Sex_Id...
So...
Would it be a good idea to have an additional table Table Sex where data about Sex ('M', 'W') would be saved?
Sex_Id,
Value
or should Sex values ('M' or 'W') be saved directly into table Customer? What about query speed etc.?
Thanks in advance,
best Regards.
Or, one could use an existing standard. ISO 5218 covers four codes:
0 = Not Known
1 = Male
2 = Female
9 = Not applicable (lawful person such as corporation, organization etc)
ISO 5218 is a legal encoding and does not apply for medical/biological aspect.
Obviously, a reference table containing those codes should use the natural key (as per above list), and not a syntetic key.
Joe Celko's Data Measurements And Standards in SQL is a great (albeit boring) read.
You could try a multivalued attribute, but I prefer to do this: If there are only 2 values, you could consider using a BOOL type for that attribute in your DB and making 0 = Male and 1 = Female (commenting, of course, to avoid confusion). When data is entered in the external program (given there is one), you could just do a quick mapping where if they check "male", the attribute is 0 in the DB, and if they check "female", the attribute value is 1 in the DB.
How many different values are you planning on having for Sex? If you aren't going to be adding more possible values for that column, it doesn't make sense to use a foreign key.
You can use a character for the column, storing "M" or "W", and also use a foreign key into a table (primary key of a character) if you need to store any more details about that thing; You get the benefit of easy to write/read queries (no join required) for basic stuff, but still have the possibility of adding more data later on.
That said, unless you actually do have more columns in your Sex table, you could probably not create it at all now and add it later when you actually do have a need for it.
in your example, the extra table does not buy you anything.
#marc_s has the right idea here to add a good CHECK CONSTRAINT to make sure the local values are in the proper subset.
now if your example contained additional attributes on the related object, like a 'name' or'description' or further links to other objects like 'alias' or some kind of date range - then absolutely yes, create another table.

creating a address database

I am re-creating a part of my company’s database because it does not meet future needs.
Currently we have mainly a flat file and some disjoined tables that were never fully realized.
My way of thinking is we have a table for each category except maybe the zips table, which may serve as a connect it all together table.
Please refer to image below:
Database Diagram http://www.freeimagehosting.net/uploads/248cc7e884.jpg
One thing I am thinking of is removing the zip table and just putting the zip code in the zipstocities table since the zip code is almost unique and then indexing the table on the zip code. The only downside is zip code has to be a varchar to take care of zip codes with leading zeros. Just want to know if there is a flaw in my logic.
I don't know the US ZIPcode and territorial devision system well, but I assume it's somewhat like the German one.
A state has many counties.
A county has many cities.
A city has many zip codes.
Hence I would use the following schema.
ZipCodes CityZipCodes
------------ ---------------- Cities
ZipCode (PK) <─── ZipCode (PK)(FK) -----------
City (PK)(FK) ───> CityId (PK)
Name
County (FK) ───┐
│
│
Counties │
------------- │
States CountyId (PK) <───┘
----------------- Name
StateId (PK) <─── State (FK)
Name
Abbreviation
Fixed for multiple cities per ZIP code.
One thing you should be aware of is that not all cities are in counties. In Virginia you are in either a city or county but never both.
Looking at the diagram you have, the state table is the only one of the 4 outside tables that is really necessary. Lookup tables with just an ID and a single value aren't worth the effort. These relationships are designed to make a single value in the main table (ziptocities) refer to a set of related data in the lookup table (states).
You'll need to ask yourself why you care about counties. In many states in the US, they have little importance beyond tradition and maps.
The other question will be how important will it be that the address be accurate? How many deaths will there be if important letters are not delivered in a timely manner (possibly many if the letter is about prescription drug recalls!)
You probably want to think about using data from the Postal Service, possibly using a product that corrects addresses. That way, when you get a good address, you'll be certain the mail can be delivered there - because the Postal Service will have said so!
There seem to be flaws in both your process and your logic.
I suggest that you stop thinking about tables and relationships for a moment. Instead, think about facts. Make a list of valid addresses that your database needs to support. Many surprises await you.
Don't confuse an address with a mailing label. They're not at all the same thing. Consider modeling carriers, too. In the US, whether an address is valid depends on the carrier. For example, my PO box is a valid address when the carrier is the USPS, but not when the carrier is UPS.
To save time, you might try browsing some international address formats on bitboost.
Will your logic work if two countries happen to have the same zip code? These two would be pointing to different cities in that case. here are some points to consider
Do you want to use zipcode as a kind
of primary key into address? (at
lease the city, state and country
fields). In that case, you can have
zipcode, city,state,country in one
table. Create indexes on city, state
etc.. (you have a functional
dependency of the form
zipcode->country,state,city . This
as i said may not be true across
countries.
If auto populating is
your only concern, create a
materialized view and use it.
I would recommend reading 'Data Model patterns' by David C. Hay.
But not every person who has a valid medical claim is required by law to remain in the US until the claim is settled. People move.
San Francisco is a city in California; it's not a city in Alabama. Does your design prevent nonsense entries like "San Francisco, AL"?

Normalizing a Table with Low Integrity

I've been handed a table with about 18000 rows. Each record describes the location of one customer. The issue is, that when the person created the table, they did not add a field for "Company Name", only "Location Name," and one company can have many locations.
For example, here are some records that describe the same customer:
Location Table
ID Location_Name
1 TownShop#1
2 Town Shop - Loc 2
3 The Town Shop
4 TTS - Someplace
5 Town Shop,the 3
6 Toen Shop4
My goal is to make it look like:
Location Table
ID Company_ID Location_Name
1 1 Town Shop#1
2 1 Town Shop - Loc 2
3 1 The Town Shop
4 1 TTS - Someplace
5 1 Town Shop,the 3
6 1 Toen Shop4
Company Table
Company_ID Company_Name
1 The Town Shop
There is no "Company" table, I will have to generate the Company Name list from the most descriptive or best Location Name that represents the multiple locations.
Currently I am thinking I need to generate a list of Location Names that are similar, and then and go through that list by hand.
Any suggestions on how I can approach this is appreciated.
#Neall, Thank you for your statement, but unfortunately, each location name is distinct, there are no duplicate location names, only similar. So in the results from your statement "repcount" is 1 in each row.
#yukondude, Your step 4 is the heart of my question.
Please update the question, do you have a list of CompanyNames available to you? I ask because you maybe able to use Levenshtein algo to find a relationship between your list of CompanyNames and LocationNames.
Update
There is not a list of Company Names, I will have to generate the company name from the most descriptive or best Location Name that represents the multiple locations.
Okay... try this:
Build a list of candidate CompanyNames by finding LocationNames made up of mostly or all alphabetic characters. You can use regular expressions for this. Store this list in a separate table.
Sort that list alphabetically and (manually) determine which entries should be CompanyNames.
Compare each CompanyName to each LocationName and come up with a match score (use Levenshtein or some other string matching algo). Store the result in a separate table.
Set a threshold score such that any MatchScore < Threshold will not be considered a match for a given CompanyName.
Manually vet through the LocationNames by CompanyName | LocationName | MatchScore, and figure out which ones actually match. Ordering by MatchScore should make the process less painful.
The whole purpose of the above actions is to automate parts and limit the scope of your problem. It's far from perfect, but will hopefully save you the trouble of going through 18K records by hand.
I've had to do this before. The only real way to do it is to manually match up the various locations. Use your database's console interface and grouping select statements. First, add your "Company Name" field. Then:
SELECT count(*) AS repcount, "Location Name" FROM mytable
WHERE "Company Name" IS NULL
GROUP BY "Location Name"
ORDER BY repcount DESC
LIMIT 5;
Figure out what company the location at the top of the list belongs to and then update your company name field with an UPDATE ... WHERE "Location Name" = "The Location" statement.
P.S. - You should really break your company names and location names out into separate tables and refer to them by their primary keys.
Update: - Wow - no duplicates? How many records do you have?
I was going to recommend some complicated token matching algorithm but it's really tricky to get right and if you're data does not have a lot of correlation (typos, etc) then it's not going to give very good results.
I would recommend you submit a job to the Amazon Mechanical Turk and let a human sort it out.
Ideally, you'd probably want a separate table named Company and then a company_id column in this "Location" table that is a foreign key to the Company table's primary key, likely called id. That would avoid a fair bit of text duplication in this table (over 18,000 rows, an integer foreign key would save quite a bit of space over a varchar column).
But you're still faced with a method for loading that Company table and then properly associating it with the rows in Location. There's no general solution, but you could do something along these lines:
Create the Company table, with an id column that auto-increments (depends on your RDBMS).
Find all of the unique company names and insert them into Company.
Add a column, company_id, to Location that accepts NULLs (for now) and that is a foreign key of the Company.id column.
For each row in Location, determine the corresponding company, and UPDATE that row's company_id column with that company's id. This is likely the most challenging step. If your data is like what you show in the example, you'll likely have to take many runs at this with various string matching approaches.
Once all rows in Location have a company_id value, then you can ALTER the Company table to add a NOT NULL constraint to the company_id column (assuming that every location must have a company, which seems reasonable).
If you can make a copy of your Location table, you can gradually build up a series of SQL statements to populate the company_id foreign key. If you make a mistake, you can just start over and rerun the script up to the point of failure.
Yes, that step 4 from my previous post is a doozy.
No matter what, you're probably going to have to do some of this by hand, but you may be able to automate the bulk of it. For the example locations you gave, a query like the following would set the appropriate company_id value:
UPDATE Location
SET Company_ID = 1
WHERE (LOWER(Location_Name) LIKE '%to_n shop%'
OR LOWER(Location_Name) LIKE '%tts%')
AND Company_ID IS NULL;
I believe that would match your examples (I added the IS NULL part to not overwrite previously set Company_ID values), but of course in 18,000 rows you're going to have to be pretty inventive to handle the various combinations.
Something else that might help would be to use the names in Company to generate queries like the one above. You could do something like the following (in MySQL):
SELECT CONCAT('UPDATE Location SET Company_ID = ',
Company_ID, ' WHERE LOWER(Location_Name) LIKE ',
LOWER(REPLACE(Company_Name), ' ', '%'), ' AND Company_ID IS NULL;')
FROM Company;
Then just run the statements that it produces. That could do a lot of the grunge work for you.

Resources