I have a database of US zip codes and their corresponding states, cities and counties. It was supplied as a flat file and I'm trying to normalize the data and figure out exactly which entities depend on which others.
One problem I've come across is that some cities seem to exist in more than one county. I was under the impression that in the US, there is a hierarchy of State -> County -> City -> Zip.
However, this data seems to show otherwise for some cities:
Is my data set incorrect or is this actually a feature of US geography?
I am working on this same topic. I have learned that Virginia has cities that are not within a county. Such a city functions as both a city and a county but is not within any county boundary. Also, Alaska has no counties. Their equivalent is boroughs, but the whole state is not divided into boroughs. Any area not within a borough is referred to as the "unorganized borough".
No, there isn't a clean hierarchy like that.
You're also liable to find cities that straddle state borders (cities in two states), and ZIP codes that take in more than one city. Not long ago, there were ZIP codes that straddled state borders, too. (ZIP codes are more about the route followed to deliver mail than about geography.) There might still be some.
As far as I know, no county is split between two states. But if there happened to be one, it wouldn't surprise me.
Depending on your application, you might discover even weirder things. I used to have to deal with addresses in the mountains that were "in" one county geographically, but were "in" a second county for emergency services (fire, police), and "in" yet a third county for non-emergency services (water, sewer, garbage collection). It depended on where the address was in relation to mountain ridges and roads.
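One way to see which dependencies actually hold in a flat file like this is to test each candidate against the data before normalizing. Below is a minimal sketch, assuming the rows have been loaded as dictionaries; the column names and sample rows are made up for illustration and are not from the asker's file.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: set_of_rhs_values} for every lhs value that
    maps to more than one rhs value, i.e. where lhs -> rhs fails."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        seen[key].add(tuple(row[col] for col in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample rows: one city name that appears in two counties.
rows = [
    {"zip": "11111", "city": "Springfield", "county": "Alpha", "state": "XX"},
    {"zip": "11112", "city": "Springfield", "county": "Beta", "state": "XX"},
    {"zip": "22222", "city": "Shelbyville", "county": "Alpha", "state": "XX"},
]

# Does (state, city) determine county? Does zip determine city?
city_county = fd_violations(rows, lhs=("state", "city"), rhs=("county",))
zip_city = fd_violations(rows, lhs=("zip",), rhs=("city",))

print(city_county)  # non-empty: (state, city) -> county does not hold here
print(zip_city)     # empty: zip -> city holds in this sample
```

An empty result only says the dependency holds in the rows you have, not that it holds in general, so treat it as a hint rather than proof.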
Let's say you have the following:
---------------------------
attribute constraints
| -------- -----------
| id (PK)
| location_name **
| street
| zipcode
| city
** realistically Unique, but not going to use for future proofing?
Would this violate BCNF, as zipcode can be used to find the city? Although cities can share zipcodes and vice versa, a city can't be in two separate zipcodes where another city is part of that zipcode?
(zipcode1 --> city1) and (zipcode2 --> city1 and city2)
(Note that zipcode and city are not a composite superkey, as multiple locations can be associated with the same zipcode and city.) Is BCNF suggesting that you should have a completely separate table JUST for pairing cities and zipcodes?
States are omitted because this database is for a single state. Although in that case, would you have to have 3 tables, since a zipcode cannot be in multiple states (edit: apparently some can, but assume they can't)? That seems too dumb to be true, and way too many unions would be needed.
I honestly don't understand much of anything regarding key terms and have just been left confused (if you could answer in layman's terms and/or technically, that's highly appreciated). I tried searching for an answer because I figured it would be common, but couldn't find anything. Given my inability to organize and process mathematical logic, I'm starting to wonder if I picked the wrong field to enter.
What does a five-digit zipcode actually determine? As I understand it, it determines a post office. This is enough to route every piece of mail from wherever it is to a destination post office. That post office then delivers it locally.
Figuring out what the dependencies are between zip code and state or zip code and city, or zip code and street plus number or apartment number can be the devil's own business.
The area served by a post office is generally part of some community that the Post Office is in, like a town. But there are quirks.
The residents of Magalloway, ME are served by the post office in nearby Errol, NH. They therefore use zipcode 03579, the same as the residents of downtown Errol. The letters get forwarded to the Errol post office, then delivered to them in Maine. This may seem very strange, but it works out well in terms of driving miles.
map of 03579
I'm trying to better understand designing a database schema. After reviewing the solution for a problem that I'm working on, I don't understand why the solution chooses to use an aggregation for the attributes "address" and "phone number" for a given "musician". Here are the specifications, I'm only interested in bullet point 1:
• Each musician that records at Notown has an SSN, a name, an address, and a phone number. Poorly paid musicians often share the same address, and no address has more than one phone.
• Each instrument used in songs recorded at Notown has a name (e.g., guitar, synthesizer, flute) and a musical key (e.g., C, B-flat, E-flat).
• Each album recorded on the Notown label has a title, a copyright date, a format (e.g., CD or MC), and an album identifier.
• Each song recorded at Notown has a title and an author.
• Each musician may play several instruments, and a given instrument may be played by several musicians.
• Each album has a number of songs on it, but no song may appear on more than one album.
• Each song is performed by one or more musicians, and a musician may perform a number of songs.
• Each album has exactly one musician who acts as its producer. A musician may produce several albums, of course.
Here is a solution that I found:
The ER Diagram I created looks almost exactly the same, except for the fact that I made "address" and "phone number" attributes of "musician" instead of giving each of them an entity set of their own, creating a relationship, and turning it into an aggregation. I don't understand why this would be done in this situation. Can anyone explain?? Thank you!
I'm not able to see the image you linked to, but anyway...
no address has more than one phone
This means we should make the phone number an attribute of the address - unless we want to allow for multiple phones per address in the future.
So it would not be completely wrong to make phones a table. But then, we know little about the future. Would there be multiple musicians sharing the same address and the same phones? (I.e. the phone number would be linked to an address.) Or would there be multiple musicians sharing the same address, but each would have their own phone? (I.e. the phone number would be linked to a musician. To use a phone table and link the phones to musicians, however, would only be necessary if a musician could have multiple phone numbers. Otherwise we'd still not make a phone table, but rather make the phone a musician's attribute.)
poorly paid musicians often share the same address
This means we make the address a table of its own. Thus there is only one row to change in case the phone number or some other attribute changes. If we made the address a musician's attribute instead, we'd store the address redundantly and could get inconsistent data (e.g. same address, but different phone numbers).
A possible data model:
address (address_id, street, city, phone, ...)
musician (musician_id, ssn, name, address_id, ...)
This is a 1:n relation. A musician has one address; an address can belong to multiple musicians.
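The model above could be tried out roughly like this. It is only a sketch using SQLite from Python: the column types, the NOT NULL and UNIQUE constraints, and the sample rows are my assumptions, not part of the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city       TEXT NOT NULL,
    phone      TEXT              -- one column, so at most one phone per address
);
CREATE TABLE musician (
    musician_id INTEGER PRIMARY KEY,
    ssn         TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    address_id  INTEGER NOT NULL REFERENCES address(address_id)
);
""")

# Made-up sample data: two musicians sharing one address.
conn.execute("INSERT INTO address VALUES (1, '1 Example St', 'Exampleville', '555-0100')")
conn.executemany(
    "INSERT INTO musician VALUES (?, ?, ?, ?)",
    [(1, "000-00-0001", "Ann", 1), (2, "000-00-0002", "Bob", 1)],
)

# The phone changes: exactly one row is updated.
conn.execute("UPDATE address SET phone = '555-0199' WHERE address_id = 1")

phones = conn.execute(
    "SELECT m.name, a.phone FROM musician m "
    "JOIN address a ON a.address_id = m.address_id ORDER BY m.name"
).fetchall()
print(phones)  # both musicians see the new number
```

Because the phone lives in the address row, the single UPDATE is seen by every musician at that address.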
The primary purpose of database normalization is to make it more difficult for anomalous data to get into the database. Reading the first bullet point, we see that each address may have zero or one phone numbers associated with it. In other words, the phone number is an attribute of/identified by the address. Which normalization level does this violate?
To illustrate how not normalizing the address fields (including phone number) increases the chances of anomalous data, let's say you have four students staying at that address. This means you have four rows where the address data exists. Suppose the phone number changes. You have to make sure you change all four versions of the data. I said there were four students, but suppose there are actually five and I just missed one? Or suppose you found only three when you went to make the change? An address may have at most one phone number; however, now you have several copies of the same address but with different phone numbers. This is anomalous data.
If this data is normalized, you would have only one copy to change. Since this data is referenced by all the students who live there, no matter how many, this change is "propagated" to all of them. The integrity of the data is maintained.
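If you are stuck with the unnormalized layout for a while, a query along these lines can show whether the anomaly has already crept in. This is a rough sketch using SQLite from Python; the table name musician_flat and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE musician_flat (name TEXT, address TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO musician_flat VALUES (?, ?, ?)",
    [
        ("Ann", "1 Example St", "555-0199"),
        ("Bob", "1 Example St", "555-0199"),
        ("Cy", "1 Example St", "555-0100"),   # the row that was missed in the update
        ("Dee", "2 Sample Ave", "555-0142"),
    ],
)

# Addresses that now carry more than one distinct phone number.
conflicts = conn.execute("""
    SELECT address, COUNT(DISTINCT phone) AS phone_count
    FROM musician_flat
    GROUP BY address
    HAVING COUNT(DISTINCT phone) > 1
""").fetchall()
print(conflicts)
```

In the normalized design this query has nothing to find, because each address stores its phone exactly once.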
For my database class homework, I've drawn out a diagram.
I feel like something is incorrect/wrong about it though.
We've been instructed to create an ERD diagram based on this information:
• Every Expo is clearly identified by its exhibition year and its place of event. Every Expo has its own logo and slogan. Statistical data such as the number of participating nations and planned events for each Expo are recorded.
• An Expo contains several pavilions (also called stalls), which are all clearly identified by their ID numbers. There are two types of pavilions: a) theme pavilions and b) national pavilions.
• Every pavilion has a name, an exhibition zone (e.g. Zone A1, Zone A5, Zone B5 etc.) and one of several exhibition categories (e.g. Open-air, Stage, Booth). Moreover, pavilions have different sizes recorded in square meters.
• Every country is uniquely identified and has a name and a capital city.
• Every country can only be part of one national pavilion. A country can present itself alone in a national pavilion or can work together with another partner country to present themselves together in one national pavilion. Not every country in the world will have a national pavilion in the exhibition.
• An event plan describes when each event takes place, which country/countries it is organised by, and in which pavilion it is happening. Events have an optional name. Every event is organised by at least one country. At one point of time, there is at most one event held in a pavilion.
It's also been requested for the diagram to be in third normal form.
Here is what I've done so far. What may be wrong about my diagram?
1) A separate "category" entity is not needed; it can simply be an attribute of the pavilion entity, since each pavilion has only one category. Similarly, the StatisticalData and Expo entities can be merged, and you also need to specify that planned events is a multivalued attribute.
2) For the country entity, the country identity itself is a sufficient primary key.
3) For pavilion, I would suggest using the specialization and generalization principle (it's like inheritance in classes in programming languages): https://creately.com/diagram/example/io43l9n82/Specialization+and+Generalization+-Entity+Relationship+Example . That means the "country pavilion" entity is a goner. Instead, use the above principle to create two specialized pavilions: national and theme.
4) Once you are done with the above, you have to remove PartneredCountryId from the country relationship; what you really want is simply a one-to-many relationship between Country and nationalPavilion. Specifying one-to-many makes sure that one national pavilion can have multiple countries organising it (all are partners to one another). You then also need to specify that a country's participation is optional in this relation, as "Not every country in the world will have a national pavilion in the exhibition".
5) Then you need to remove CountryId and PavilionId from Event, because you simply don't need them: the two relationships (hasOrganised/isOrganisedBy and organised/isHeldAt) take care of this when you create the actual database tables.
I am working on a medical PHP application which will be implemented at national level.
It will be used by multiple hospitals and the patient records will be centralized, i.e. every hospital will be accessing and adding patient records in the same database.
I want there to be only one record per patient, without any duplication. Simply put, no hospital should be able to enter a second record for the same patient. To make that possible, I need to know which criteria to use that will remain fixed throughout the entire lifetime of a patient. I can only think of two: name and date of birth.
What other criteria can there be? I don't want to use mobile numbers, phone numbers, etc. Moreover, infants can't have them. I need criteria which will exist for every patient and be unique.
Please give me your suggestions or any other better way to implement this functionality?
I'll take a shot because I've been involved in some data matching and validation, although not specifically in the medical industry. You haven't specified a particular country, just mentioned Asia, so I'll use an example from my home country of Australia just because I'm familiar with the rules and I believe the same would apply to many Asian countries:
We have a unique Medicare number used for health care, but it's not mandatory. While the free / discounted care means I expect 99%+ of people would have one, you can't rely on it.
There is also a tax file number, likewise not mandatory even if you work, and people who have never had a job wouldn't normally have one.
You might be dealing with foreign people that aren't residents.
Drivers licenses are of course not mandatory to get healthcare.
It's perfectly legal to have "no fixed address". Plus some people will lie to get treatments and repeats of drugs etc. Not to mention many people move often.
Changing name is common in case of marriage / divorce and unless done for illegal purposes someone can change their name just because they don't like their original. Not to mention people use common substitutions for various things like Jim versus James.
Typing mistakes will be very common over a large dataset.
In short I think the 'perfect' scheme you are asking for is impossible. The best you can do is apply a weighting rule to find likely duplicates. Same name / date of birth / place of birth for example is an unlikely but possible event so show a warning to the data entry operator it's a likely duplicate and let them see the details of the likely duplicate. Even things like a drivers license number that should be unique may indicate that the original entry just had a data entry error, not a new duplicate.
From my experience the best thing is a report that lists likely duplicates that must be reviewed by someone higher up the chain, and give them an easy option to merge the duplicates. Then you can start to use more vague regex expressions that throw a few false positives that can be dismissed when a human reviews them. You can also refine the model over time to get the best match results.
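To make the weighting idea a bit more concrete, here is a minimal sketch of how a score for likely duplicates could be computed. The field names, the weights and the review threshold are all assumptions made up for illustration; you would have to tune them against real data that a human has reviewed. Fuzzy string similarity comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, not taken from any real system.
WEIGHTS = {"name": 0.4, "date_of_birth": 0.4, "place_of_birth": 0.2}
REVIEW_THRESHOLD = 0.8


def similarity(a, b):
    """Fuzzy similarity in [0, 1], tolerant of typos and letter case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_score(p, q):
    """Weighted sum of per-field similarities between two patient records."""
    return sum(w * similarity(p[field], q[field]) for field, w in WEIGHTS.items())


def likely_duplicates(new_patient, existing):
    """Existing records that score at or above the threshold, best match first."""
    scored = [(duplicate_score(new_patient, p), p) for p in existing]
    hits = [(score, p) for score, p in scored if score >= REVIEW_THRESHOLD]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Made-up records for illustration.
existing = [
    {"name": "James Smith", "date_of_birth": "1980-02-03", "place_of_birth": "Perth"},
    {"name": "Mary Jones", "date_of_birth": "1975-11-30", "place_of_birth": "Hobart"},
]
new_patient = {"name": "Jim Smith", "date_of_birth": "1980-02-03", "place_of_birth": "Perth"}

matches = likely_duplicates(new_patient, existing)
for score, record in matches:
    print(f"possible duplicate ({score:.2f}): {record['name']}")
```

The point is that the result is a warning for the operator or a reviewer, not an automatic rejection. A real system would probably compare dates field by field rather than as strings and would add the name substitutions mentioned above (Jim versus James).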
Combination of name, date of birth, blood group, place of birth etc., can be tried.
You need to use some nationwide ID, like a passport ID or health insurance number.
Social Insurance Number with country.
I am re-creating a part of my company’s database because it does not meet future needs.
Currently we have mainly a flat file and some disjoined tables that were never fully realized.
My way of thinking is we have a table for each category except maybe the zips table, which may serve as a connect it all together table.
Please refer to image below:
Database Diagram http://www.freeimagehosting.net/uploads/248cc7e884.jpg
One thing I am thinking of is removing the zip table and just putting the zip code in the zipstocities table since the zip code is almost unique and then indexing the table on the zip code. The only downside is zip code has to be a varchar to take care of zip codes with leading zeros. Just want to know if there is a flaw in my logic.
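On the leading-zero point, a tiny illustration of why a text column is the safer choice. The variable names are made up and 03579 is just a sample value.

```python
zip_text = "03579"

zip_number = int(zip_text)        # numeric storage silently drops the leading zero
padded = f"{zip_number:05d}"      # so every read would have to re-pad to five digits

print(zip_number)  # 3579
print(padded)      # 03579
```

Stored as text, the value round-trips unchanged and no re-padding step is needed.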
I don't know the US ZIP code and territorial division system well, but I assume it's somewhat like the German one.
A state has many counties.
A county has many cities.
A city has many zip codes.
Hence I would use the following schema.
ZipCodes            CityZipCodes
------------        ----------------          Cities
ZipCode (PK) <───   ZipCode (PK)(FK)          -----------
                    City    (PK)(FK)   ───>   CityId (PK)
                                              Name
                                              County (FK) ───┐
                                                             │
                                                             │
                    Counties                                 │
                    -------------                            │
States              CountyId (PK) <──────────────────────────┘
-----------------   Name
StateId (PK) <───   State (FK)
Name
Abbreviation
Fixed for multiple cities per ZIP code.
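As a rough sketch, the schema above could be written as follows, using SQLite from Python. The table and column names follow the diagram; the column types, the NOT NULL constraints and the sample rows are my assumptions. Note that County is mandatory here, as in the diagram, so this version assumes every city sits inside exactly one county.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE States (
    StateId      INTEGER PRIMARY KEY,
    Name         TEXT NOT NULL,
    Abbreviation TEXT NOT NULL UNIQUE
);
CREATE TABLE Counties (
    CountyId INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    State    INTEGER NOT NULL REFERENCES States(StateId)
);
CREATE TABLE Cities (
    CityId INTEGER PRIMARY KEY,
    Name   TEXT NOT NULL,
    County INTEGER NOT NULL REFERENCES Counties(CountyId)
);
CREATE TABLE ZipCodes (
    ZipCode TEXT PRIMARY KEY          -- text, so leading zeros survive
);
CREATE TABLE CityZipCodes (
    ZipCode TEXT    NOT NULL REFERENCES ZipCodes(ZipCode),
    City    INTEGER NOT NULL REFERENCES Cities(CityId),
    PRIMARY KEY (ZipCode, City)       -- many-to-many between ZIP codes and cities
);
""")

# Made-up sample data: one ZIP code shared by two cities.
conn.execute("INSERT INTO States VALUES (1, 'Example State', 'EX')")
conn.execute("INSERT INTO Counties VALUES (1, 'Example County', 1)")
conn.executemany("INSERT INTO Cities VALUES (?, ?, ?)",
                 [(1, "Easttown", 1), (2, "Westtown", 1)])
conn.execute("INSERT INTO ZipCodes VALUES ('01234')")
conn.executemany("INSERT INTO CityZipCodes VALUES (?, ?)",
                 [("01234", 1), ("01234", 2)])

cities_for_zip = [row[0] for row in conn.execute(
    "SELECT c.Name FROM CityZipCodes cz "
    "JOIN Cities c ON c.CityId = cz.City "
    "WHERE cz.ZipCode = ? ORDER BY c.Name", ("01234",))]
print(cities_for_zip)
```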
One thing you should be aware of is that not all cities are in counties. In Virginia you are in either a city or county but never both.
Looking at the diagram you have, the state table is the only one of the 4 outside tables that is really necessary. Lookup tables with just an ID and a single value aren't worth the effort. These relationships are designed to make a single value in the main table (ziptocities) refer to a set of related data in the lookup table (states).
You'll need to ask yourself why you care about counties. In many states in the US, they have little importance beyond tradition and maps.
The other question will be how important will it be that the address be accurate? How many deaths will there be if important letters are not delivered in a timely manner (possibly many if the letter is about prescription drug recalls!)
You probably want to think about using data from the Postal Service, possibly using a product that corrects addresses. That way, when you get a good address, you'll be certain the mail can be delivered there - because the Postal Service will have said so!
There seem to be flaws in both your process and your logic.
I suggest that you stop thinking about tables and relationships for a moment. Instead, think about facts. Make a list of valid addresses that your database needs to support. Many surprises await you.
Don't confuse an address with a mailing label. They're not at all the same thing. Consider modeling carriers, too. In the US, whether an address is valid depends on the carrier. For example, my PO box is a valid address when the carrier is the USPS, but not when the carrier is UPS.
To save time, you might try browsing some international address formats on bitboost.
Will your logic work if two countries happen to have the same zip code? These two would be pointing to different cities in that case. Here are some points to consider:
Do you want to use zipcode as a kind of primary key into address (at least for the city, state and country fields)? In that case, you can have zipcode, city, state, country in one table. Create indexes on city, state, etc. You have a functional dependency of the form zipcode -> country, state, city. This, as I said, may not be true across countries.
If auto-populating is your only concern, create a materialized view and use it.
I would recommend reading 'Data Model patterns' by David C. Hay.
But not every person who has a valid medical claim is required by law to remain in the US until the claim is settled. People move.
San Francisco is a city in California; it's not a city in Alabama. Does your design prevent nonsense entries like "San Francisco, AL"?
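One way a design can reject such entries is a composite foreign key from the address to a table of known (city, state) pairs. Below is a minimal sketch using SQLite from Python; the table names, column names and the try_insert helper are hypothetical, and it assumes you maintain a reference list of valid city/state combinations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE city (
    city_name  TEXT NOT NULL,
    state_abbr TEXT NOT NULL,
    PRIMARY KEY (city_name, state_abbr)
);
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city_name  TEXT NOT NULL,
    state_abbr TEXT NOT NULL,
    -- The pair must exist together in city, not just each value on its own.
    FOREIGN KEY (city_name, state_abbr) REFERENCES city (city_name, state_abbr)
);
""")
conn.execute("INSERT INTO city VALUES ('San Francisco', 'CA')")


def try_insert(city_name, state_abbr):
    """Hypothetical helper: True if the address was accepted, False if rejected."""
    try:
        conn.execute(
            "INSERT INTO address (street, city_name, state_abbr) "
            "VALUES ('1 Example St', ?, ?)",
            (city_name, state_abbr),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("San Francisco", "CA"))  # accepted
print(try_insert("San Francisco", "AL"))  # rejected by the foreign key
```

Two separate lookups, one for city and one for state, would not catch this, because 'San Francisco' and 'AL' are each valid on their own.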