how to handle address dimension with 75M locations - data-modeling

I am creating an address dimension for a Snowflake Schema Data Warehouse. I have 75M locations in a source system that I want to convert to this schema. I know how to handle the Zip->City->County->State dimensions, but if I add street addresses to the location dimension, I would have as many dimension rows as fact rows.
What I need to know is where the street addresses (e.g. 123 Anywhere St.) should go. Should they go in the fact table? How do I handle street addresses?
Thanks.

The street address itself should go in a Fact. If it's a Real Estate app I'd imagine there'd be some kind of "Sale Contract Fact" or "Rental Contract Fact" or something similar - the street address would be an attribute of that fact.
In your case, each occurrence of the address is tied to a single transaction. As you said, the same street address could appear multiple times, but it would be on different Sales Contracts and thus on different Fact instances.
Other elements of the address (zip code, city, state etc.) would be dimensionalised, as it makes sense to group them for classification.
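As a rough sketch of that layout (Python with the standard sqlite3 module; the table and column names such as dim_geography and fact_sale_contract are hypothetical, and the Zip->City->County->State snowflake is flattened into one dimension for brevity), the street address sits on the fact row while the groupable parts live in the dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names; the snowflaked Zip->City->County->State chain
-- is flattened into one dimension here to keep the sketch short.
CREATE TABLE dim_geography (
    geography_key INTEGER PRIMARY KEY,
    zip_code      TEXT NOT NULL,
    city          TEXT NOT NULL,
    county        TEXT NOT NULL,
    state         TEXT NOT NULL
);
CREATE TABLE fact_sale_contract (
    contract_id    INTEGER PRIMARY KEY,
    geography_key  INTEGER NOT NULL REFERENCES dim_geography (geography_key),
    street_address TEXT NOT NULL,   -- stored on the fact, not in a dimension
    sale_price     REAL NOT NULL
);
""")
conn.execute(
    "INSERT INTO dim_geography VALUES (1, '12345', 'Springfield', 'Example County', 'XX')"
)
conn.executemany(
    "INSERT INTO fact_sale_contract VALUES (?, ?, ?, ?)",
    [
        (1, 1, "123 Anywhere St.", 100000.0),
        (2, 1, "123 Anywhere St.", 120000.0),  # same street address, different contract
        (3, 1, "9 Other Rd.", 50000.0),
    ],
)

# Grouping happens on dimension attributes; the street address is only carried along.
sales_by_city = conn.execute("""
    SELECT g.city, COUNT(*), SUM(f.sale_price)
    FROM fact_sale_contract AS f
    JOIN dim_geography AS g ON g.geography_key = f.geography_key
    GROUP BY g.city
""").fetchall()
print(sales_by_city)
```

The dimension stays small (one row per zip code here) no matter how many distinct street addresses the facts carry.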

Related

Why would the specifications for this database use an aggregation instead of attributes on an entity?

I'm trying to better understand designing a database schema. After reviewing the solution for a problem that I'm working on, I don't understand why the solution chooses to use an aggregation for the attributes "address" and "phone number" for a given "musician". Here are the specifications, I'm only interested in bullet point 1:
Each musician that records at Notown has an SSN, a name, an address, and a phone number. Poorly paid musicians often share the same address, and no address has more than one phone.
Each instrument used in songs recorded at Notown has a name (e.g., guitar, synthesizer, flute) and a musical key (e.g., C, B-flat, E-flat).
Each album recorded on the Notown label has a title, a copyright date, a format (e.g., CD or MC), and an album identifier.
Each song recorded at Notown has a title and an author.
Each musician may play several instruments, and a given instrument may be played by several musicians.
Each album has a number of songs on it, but no song may appear on more than one album.
Each song is performed by one or more musicians, and a musician may perform a number of songs.
Each album has exactly one musician who acts as its producer. A musician may produce several albums, of course.
Here is a solution that I found:
The ER Diagram I created looks almost exactly the same, except for the fact that I made "address" and "phone number" attributes of "musician" instead of giving each of them an entity set of their own, creating a relationship, and turning it into an aggregation. I don't understand why this would be done in this situation. Can anyone explain?? Thank you!
I'm not able to see the image you linked to, but anyway...
no address has more than one phone
This means we should make the phone number an attribute of the address - unless we want to allow for multiple phones per address in the future.
So it would not be completely wrong to make phones a table. But then, we know little about the future. Would there be multiple musicians sharing the same address and the same phones? (I.e. the phone number would be linked to an address.) Or would there be multiple musicians sharing the same address, but each would have their own phone? (I.e. the phone number would be linked to a musician. To use a phone table and link the phones to musicians, however, would only be necessary if a musician could have multiple phone numbers. Otherwise we'd still not make a phone table, but rather make the phone a musician's attribute.)
poorly paid musicians often share the same address
This means we make the address a table of its own. Thus there is only one row to change in case the phone number or some other attribute changes. If we made the address a musician's attribute instead, we'd store the address redundantly and could get inconsistent data (e.g. same address, but different phone numbers).
A possible data model:
address (address_id, street, city, phone, ...)
musician (musician_id, ssn, name, address_id, ...)
This is a 1:n relation. A musician has one address; an address can belong to multiple musicians.
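A minimal sketch of this model, using Python's built-in sqlite3 module (the column types and the sample rows are assumptions added for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city       TEXT NOT NULL,
    phone      TEXT               -- a single column: at most one phone per address
);
CREATE TABLE musician (
    musician_id INTEGER PRIMARY KEY,
    ssn         TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    address_id  INTEGER NOT NULL REFERENCES address (address_id)
);
""")
conn.execute("INSERT INTO address VALUES (1, '1 Main St', 'Notown', '555-0100')")
conn.executemany(
    "INSERT INTO musician (ssn, name, address_id) VALUES (?, ?, ?)",
    [("000-00-0001", "Musician A", 1), ("000-00-0002", "Musician B", 1)],
)

# Two musicians, one shared address row (and therefore one shared phone).
housemates = conn.execute("""
    SELECT m.name, a.street, a.phone
    FROM musician AS m
    JOIN address AS a ON a.address_id = m.address_id
    ORDER BY m.name
""").fetchall()
print(housemates)
```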
The primary purpose of database normalization is to make it more difficult for anomalous data to get into the database. Reading the first bullet point, we see that each address may have zero or one phone numbers associated with it. In other words, the phone number is an attribute of/identified by the address. Which normalization level does this violate?
To illustrate how not normalizing the address fields (including phone number) increases the chances of anomalous data, let's say you have four students staying at that address. This means you have four rows where the address data exists. Suppose the phone number changes. You have to make sure you change all four versions of the data. I said there were four students, but suppose there are actually five and I just missed one? Or suppose you found only three when you went to make the change? An address may have at most one phone number; however, you now have several copies of the same address with different phone numbers. This is anomalous data.
If this data is normalized, you would have only one copy to change. Since this data is referenced by all the students who live there, no matter how many, this change is "propagated" to all of them. The integrity of the data is maintained.
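To make the anomaly from the student example concrete, here is a small sketch with Python's sqlite3 module. The student_flat table and its contents are hypothetical: five students share an address in a non-normalized table, and an update reaches only four of the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Non-normalized: the address and phone are repeated on every student row.
conn.execute("""
    CREATE TABLE student_flat (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        street     TEXT NOT NULL,
        phone      TEXT
    )
""")
conn.executemany(
    "INSERT INTO student_flat (name, street, phone) VALUES (?, ?, ?)",
    [(f"student {i}", "1 Main St", "555-0100") for i in range(1, 6)],
)

# The phone number changes, but the update misses the fifth student.
conn.execute("UPDATE student_flat SET phone = '555-0199' WHERE student_id <= 4")

# Addresses that now carry more than one phone number violate the rule
# "no address has more than one phone".
conflicts = conn.execute("""
    SELECT street, COUNT(DISTINCT phone)
    FROM student_flat
    GROUP BY street
    HAVING COUNT(DISTINCT phone) > 1
""").fetchall()
print(conflicts)
```

With a separate address table there is only one phone value per address, so this query could never return a row.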

Relation Between Pharmacist and Patient in access 2016

I am creating a Pharmacy Database in Access 2016. It is my school Project and first Database Project.
My first problem is that we know that a Pharmacist can have many Patients, so it means that the relationship between Pharmacist and Patient is one-to-many. So in order to create a one-to-many relation, I made Pharmacist_ID as Primary Key.
Now the problem is that we know that the relation of Address and Patient is one-to-one, so how can I accomplish this task?
Another problem is that I already have the address, the city and nationality which are linked with the Pharmacist_ID. Can I link these tables with Patient_ID?
I am confused because the data-type of Pharmacist_ID is Auto-Number. The Patient_ID of the first Patient will be 1 and then the Pharmacist_ID of the first Pharmacist will also be 1, so what will happen?
Again, I am on MS-Access 2016.
This is the Picture of The RelationShip and you can see the Details of my Tables
Regards,
Arslan Iftikhar
This is for Thomas G check it out Thomas do you think I am doing right or wrong
I would make the following changes to the Address table:
I would prefer one common Address table that also holds City and Nationality (for simplicity; otherwise link them as in Image 2 below).
Add a field PID (Number) in which you save the Pharmacist or Patient ID.
Add a field Ptype (Number) in which you save 1 for a Pharmacist and 2 for a Patient, so the two can easily be told apart.
Image 1
Image 2
A few design mistakes in your approach.
I will list a few things I think about, and try to train you to raise the right questions.
My first problem is that we know that a Pharmacist can have many
Patients, so it means that the relationship between Pharmacist and
Patient is one-to-many
The first part is only partially correct, which makes the second part incorrect and might lead to a big design failure.
In a normal world:
a Pharmacist can have many Patients
a Patient can have many Pharmacists
Isn't that so?
Thus you have an m-to-m relationship. How do you solve this? With an intermediate table storing the relation between patients and pharmacists.
The only exception is if you build your software for a single Pharmacist. Then your 1-to-m approach will work, but then I don't see any reason to have a Pharmacist table :)
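A minimal sketch of such an intermediate table, written with Python's sqlite3 module rather than Access (the table name pharmacist_patient and the sample names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pharmacist (pharmacist_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE patient    (patient_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One row per (pharmacist, patient) pair resolves the m-to-m relationship.
CREATE TABLE pharmacist_patient (
    pharmacist_id INTEGER NOT NULL REFERENCES pharmacist (pharmacist_id),
    patient_id    INTEGER NOT NULL REFERENCES patient (patient_id),
    PRIMARY KEY (pharmacist_id, patient_id)
);
""")
# Both tables start numbering at 1; the IDs live in different columns,
# so pharmacist 1 and patient 1 never collide.
conn.executemany("INSERT INTO pharmacist VALUES (?, ?)",
                 [(1, "Pharmacist X"), (2, "Pharmacist Y")])
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, "Patient A"), (2, "Patient B")])
conn.executemany("INSERT INTO pharmacist_patient VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

patients_of_x = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM patient AS p
    JOIN pharmacist_patient AS pp ON pp.patient_id = p.patient_id
    WHERE pp.pharmacist_id = 1
    ORDER BY p.name
""")]
pharmacists_of_a = [row[0] for row in conn.execute("""
    SELECT ph.name
    FROM pharmacist AS ph
    JOIN pharmacist_patient AS pp ON pp.pharmacist_id = ph.pharmacist_id
    WHERE pp.patient_id = 1
    ORDER BY ph.name
""")]
print(patients_of_x, pharmacists_of_a)
```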
Now the problem is that we know that the relation of Address and
Patient is one-to-one, so how can I accomplish this task?
Are you really sure about this? What happens in cases like these:
Patients who are members of the same family (living in the same house).
Patients whose billing address is not the same as their home address.
Patients who are also Pharmacists.
Patients owning several houses.
Those are very common cases. If you go 1-1, you'll end up with A LOT of duplicates in your address table.
The REAL reason we almost always put addresses in a separate table is that addresses are rarely one-to-one in information systems. If they were one-to-one, there would be no real reason to store them in an additional table.
I am confused because the data-type of Pharmacist_ID is Auto-Number.
The Patient_ID of the first Patient will be 1 and then the Pharmacist_ID
of the first Pharmacist will also be 1, so what will happen?
And that is a good question, which should have led you to the design mistake above. You should not have a 1-1 between address and something else (Patient or Pharmacist).
In your Patient AND Pharmacist tables, you should have an AddressID referring to the ID in the address table. If you want to give pharmacists the option to store:
Home Address
Billing Address
Holiday Address
Whatever additional address
You should either:
Create an AddressID field in your Patient (and possibly Pharmacist) table for each of the address types.
Or, if you really have many address types, create an intermediate table handling the m-to-m between Address and Patient/Pharmacist, with at least a Type column in it.
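The second option could look roughly like the sketch below (Python with sqlite3; the patient_address table, its address_type values and the one-address-per-type rule are assumptions made for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (address_id INTEGER PRIMARY KEY, street TEXT NOT NULL, city TEXT NOT NULL);
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Intermediate table with a Type column; the primary key assumes
-- at most one address per type for each patient.
CREATE TABLE patient_address (
    patient_id   INTEGER NOT NULL REFERENCES patient (patient_id),
    address_id   INTEGER NOT NULL REFERENCES address (address_id),
    address_type TEXT NOT NULL CHECK (address_type IN ('home', 'billing', 'holiday')),
    PRIMARY KEY (patient_id, address_type)
);
""")
conn.executemany("INSERT INTO address VALUES (?, ?, ?)",
                 [(1, "1 Main St", "Sometown"), (2, "2 Office Rd", "Sometown")])
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, "Patient A"), (2, "Patient B")])
conn.executemany("INSERT INTO patient_address VALUES (?, ?, ?)", [
    (1, 1, "home"),
    (1, 2, "billing"),   # billing address differs from home address
    (2, 1, "home"),      # same household as Patient A, same address row
])

household = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM patient AS p
    JOIN patient_address AS pa ON pa.patient_id = p.patient_id
    WHERE pa.address_id = 1 AND pa.address_type = 'home'
    ORDER BY p.name
""")]
addresses_of_a = conn.execute("""
    SELECT pa.address_type, a.street
    FROM patient_address AS pa
    JOIN address AS a ON a.address_id = pa.address_id
    WHERE pa.patient_id = 1
    ORDER BY pa.address_type
""").fetchall()
print(household, addresses_of_a)
```

A matching table for pharmacists would follow the same pattern.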
Edit. Reactions on your new model.
I guess the point of your CITY table is to have a big list of cities that you can use for comboboxes, for instance? If you go that way, you might do the same for countries, or states/regions. That's fine, BUT: the city_id (and possibly State_ID and Country_ID) should be part of the ADDRESS table. It doesn't make sense to have an address with only a street, house number and PO box. That's incomplete; an address should also contain a zip code, city and country.

Reusing a database table for many other entities? Is this possible?

Say for example, I have an ADDRESS table, that will store similar attributes of other entities like address, city, zip, country, etc. The entities are USER, COMPANY, BANK, BRANCH, etc. I would like to use this one table ADDRESS to store the addresses of the other entities rather than creating other tables for each entity to store the ADDRESS like so, USER_ADDRESS, COMPANY_ADDRESS, BANK_ADDRESS, BRANCH_ADDRESS.
Is this possible? Am i breaking any laws or conventions? What are the consequences, if any?
Each entity (USER, COMPANY, etc.) should contain a reference to an entry in the ADDRESS table.
There are a few issues:
If 2 users have the same address, they should reference the same address id.
You will need to normalise addresses so that you're not duplicating information (e.g. if you know the zip code, then in many countries you already know the city and country).
Of course, you may not want a well-normalised database. Saving the entire address as a string will improve read performance by reducing the number of join operations.
A lot of things depend on the exact use of the database.
It is fine to use a single ADDRESS table for that purpose and have an ADDRESS_ID in each of the other entities. Depends on the use case and the way you prefer to implement it. I most probably wouldn't do it. I also wouldn't do the other solution you're suggesting (an address table per entity).
So, let's say you want to implement a function to search for all the addresses, where it doesn't matter what type of entity is connected to it. You will have to search the ADDRESS table. If you get results, then you have to search the other four tables to see which record is connected to that address.
You could add a field ENTITY_TYPE in the ADDRESS table where you specify which type of entity it is connected to, so you don't have to search the four tables, but I don't recommend this since you can have consistency errors (USER 17 points to ADDRESS 14, but ADDRESS 14 has ENTITY_TYPE = BANK).
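For illustration, the lookup across entity tables without an ENTITY_TYPE column might be sketched as follows (Python with sqlite3; only three of the entities are shown, the table is named app_user instead of USER, and the IDs 17 and 14 are borrowed from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address  (address_id INTEGER PRIMARY KEY, line TEXT NOT NULL);
CREATE TABLE app_user (user_id    INTEGER PRIMARY KEY, address_id INTEGER REFERENCES address (address_id));
CREATE TABLE company  (company_id INTEGER PRIMARY KEY, address_id INTEGER REFERENCES address (address_id));
CREATE TABLE bank     (bank_id    INTEGER PRIMARY KEY, address_id INTEGER REFERENCES address (address_id));
""")
conn.execute("INSERT INTO address VALUES (14, '1 Main St')")
conn.execute("INSERT INTO app_user VALUES (17, 14)")
conn.execute("INSERT INTO bank VALUES (3, 14)")


def owners_of(address_id):
    """Return (entity_type, entity_id) pairs for every row pointing at the address."""
    return conn.execute("""
        SELECT 'USER' AS entity_type, user_id AS entity_id
        FROM app_user WHERE address_id = :a
        UNION ALL
        SELECT 'COMPANY', company_id FROM company WHERE address_id = :a
        UNION ALL
        SELECT 'BANK', bank_id FROM bank WHERE address_id = :a
        ORDER BY entity_type, entity_id
    """, {"a": address_id}).fetchall()


print(owners_of(14))
```

The entity type is derived from where the reference actually lives, so it cannot contradict the data the way a stored ENTITY_TYPE value can.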
Now, with your other solution (having four separate tables to store the addresses of the four different entities) you're just going to have to search those four tables and then search the corresponding entity table to get the entity you're looking for.
My solution in most cases is adding the address fields to the entity tables themselves. Having ADDRESS, ZIP_CODE and COUNTRY_CODE (always use proper country codes, not country names: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) will make it simple. When you present a list of items (users, banks, companies, offices, whatever), it's really common to show the name and the address at the same time in a table. Having no JOINs makes it faster and easier to process. If you want to update an address, it's on the table itself. No lookups!
Of course, like most things in programming, it depends on what your needs are.
Also, please, don't try to split the ADDRESS into more fields. I've seen ADDRESS_TYPE (street, road, avenue, square, ...), STREET_NAME, STREET_NUMBER, BLOCK_NUMBER, BLOCK_FLOOR, BLOCK_LETTER. I'm pretty sure you're never going to need something like SELECT * FROM USER WHERE STREET_NUMBER = 74.

customer-address, property-address and company-address

I am modelling a loan database for a friend.
A Customer can have 0 to N Addresses (a street address or a POBox address, or even more than one street address and more than one POBox address). A Property must have only one Address. A Company (employment info) must have only one Address.
It will be better to have a separate Addresses table for the Customers table. The address for Property and Company can go with Properties and Companies table.
But since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
When we think about the relationship between entities, should we look at a single point in time (a static view?) or at a certain range of time (a dynamic view?) to analyze their relationship? For example, a company can only have ONE address at a certain point in time, but that company may have moved from one place to another recently. Then a company may have more than one address for a certain range of time.
Customer would be better with a 1 to N than a 0 to N relationship: since you are making loans, you will want to know their address.
A Company (employment info) must have only one address.
Then a company may have more than one address for a certain range of
time.
You are contradicting yourself a bit. Why would you need the two addresses? I think the company will officially have just one address until everything has moved to the new one, at which point you can update your DB to the new address.
But since we have an Addresses table here, do you think it is a good
idea or not to share that Addresses table for Companies and Properties
tables as well?
Yes
And here a nice link with some ideas on modelling:
http://www.databaseanswers.org/data_models/
A Company (employment info) must have only one Address.
Not necessarily. A Company can have a mailing address and a physical address.
Since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
Yes, it's a good idea to put addresses in the Addresses table. Your Properties table would have an address row foreign key, and your Companies table would have 2 foreign keys, one for a mailing address and one for a physical address. The mailing address would be an optional (nullable) foreign key.
You would need a CustomerAddress table to maintain the 0 to N relationship between Customer and Address. If you want, you can also have a 0 to N relationship between Address and Customer.
The table would look like this.
CustomerAddress
---------------
CustomerAddress ID
Customer ID
Address ID
The CustomerAddress ID is the primary (clustering) index. It is an ascending integer or long, or some other unique ID.
You would have a unique index on (Customer ID, Address ID).
If you want to look up the customers associated with an address, you would have another unique index on (Address ID, Customer ID).
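As a sketch, the CustomerAddress table and its two indexes could be declared like this with Python's sqlite3 module (the snake_case names and index names are my own; the Customer and Address tables are left out to keep it short):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address (
    customer_address_id INTEGER PRIMARY KEY,   -- ascending surrogate key
    customer_id         INTEGER NOT NULL,
    address_id          INTEGER NOT NULL
);
-- Customer -> addresses lookups; also prevents linking the same pair twice.
CREATE UNIQUE INDEX ux_customer_address ON customer_address (customer_id, address_id);
-- Address -> customers lookups.
CREATE UNIQUE INDEX ux_address_customer ON customer_address (address_id, customer_id);
""")
conn.executemany(
    "INSERT INTO customer_address (customer_id, address_id) VALUES (?, ?)",
    [(1, 10), (1, 11)],   # customer 1 has two addresses
)

# Linking customer 1 to address 10 a second time violates the unique index.
try:
    conn.execute("INSERT INTO customer_address (customer_id, address_id) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM customer_address").fetchone()[0]
print(duplicate_rejected, link_count)
```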
A company can only have ONE address at a certain point in time, but that company may have moved from one place to another recently. Then a company may have more than one address for a certain range of time.
If this information is important, then you have to include a date written column in your CompanyAddress table. You would create a unique index on (Company ID, Date written descending). This way, the first row you retrieve for a company from the CompanyAddress table would be its most current address.
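A small sketch of that idea, again with sqlite3 from Python (the column names, the ISO date strings and the helper current_address_id are assumptions for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_address (
    company_id   INTEGER NOT NULL,
    address_id   INTEGER NOT NULL,
    date_written TEXT NOT NULL      -- ISO 8601 dates, so text order equals date order
);
CREATE UNIQUE INDEX ux_company_address_date
    ON company_address (company_id, date_written DESC);
""")
conn.executemany("INSERT INTO company_address VALUES (?, ?, ?)", [
    (1, 10, "2019-01-01"),   # old location
    (1, 11, "2023-06-15"),   # company moved here later
])


def current_address_id(company_id):
    """Return the most recently written address ID for a company, or None."""
    row = conn.execute("""
        SELECT address_id
        FROM company_address
        WHERE company_id = ?
        ORDER BY date_written DESC
        LIMIT 1
    """, (company_id,)).fetchone()
    return row[0] if row else None


print(current_address_id(1))
```

The older rows stay in the table, so the history of moves remains available.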
It seems like a very popular idea to put all Addresses in their own table. Developers love to seek out repetition and eliminate it. But in this case I would hesitate to dignify addresses with Entity status by putting them in their own dedicated table, because if, like most applications, you don't treat addresses as full-fledged entities, this gets overcomplicated.
If you treated addresses as real entities then if two companies somehow shared the same address, or one inhabited a location for a while, then another one inhabited that same location, then those companies would reference the same address. Because when your application was accepting an address as input it would go see if there was an existing address and reference it rather than just slam some garbage into the address table. Which one do you intend to do? I expect it's the slam one, which is fine, because like most business applications you totally don't care if the new address you're putting in is the same as some other address already in the database, you have no interest in tracking the addresses as individual things. And that's the difference between entities and cat food.
So with the consolidation we have to introduce an intersection table, and index it, and all our entities that have addresses have to join to it, we have to think about whether to get the address eagerly or use lazy loading. We chucked all the addresses into one bucket and have to work to make sure everybody can get to their own address quickly. For real entities this makes some sense because different things need to link to the same entity, but we established above that we don't care about that, nobody is sharing these entries.
Where's the repetition we're eliminating by consolidating addresses into one table? The addresses are going to end up in the database somewhere regardless, with the same fields, we're not saving space. The only repetition is in the DDL used to generate the schema, which we can manage by making a reusable component (where "component" is the Hibernate term) for the address (which addresses redundancy in the application code) and using the ORM tool to generate the schema. Or, worst case, just ignore it, addresses don't change that much, it's not your biggest problem.
These requirements you are describing sound suspiciously enterprise-y for a project you're doing for a friend. Possibly your friend's brain has been poisoned by overexposure to elaborate requirements concocted by committees who don't know what they're doing. It's bad enough we have to put up with this junk at work, but for personal projects? Try to talk him down.
But maybe your friend is outsourcing his enterprise-y work to you and you're stuck with 0-N addresses per customer. If so, contain the damage: make a table exclusively for customer addresses, so you don't need the intersection table, and put the other entities' addresses inline. Making these entities that have only one address go get their address from another table doesn't buy you anything but more joins. If you need history, write it to a separate history table where it's out of the way.

How to store "same as" data?

I've got one model with 3 addresses: pickup, dropoff, and billing. I figure the billing address will usually be either the pickup or drop-off address, so from a UI perspective, I should have a "same as" option. But from a DB perspective, should I save the "same as" field, or should I duplicate the data?
You should have the same Id of a row from an Address table in two different columns, PickUp and DropOff. This way, you do not duplicate the address, do not use some sentinel address, and can easily query to see if the PickUp address is the same as the DropOff. If one of these changes in the future, you can always modify the Id value stored in its respective column to a new address.
You could create a table called 'Address' and make Pickup, Dropoff, Billing FKs to that Address table.
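A sketch of what that might look like, using Python's sqlite3 module (the shipment table and the column names are hypothetical stand-ins for the asker's model). The "same as" information is not stored; it is derived by comparing the keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (address_id INTEGER PRIMARY KEY, line TEXT NOT NULL);
CREATE TABLE shipment (
    shipment_id        INTEGER PRIMARY KEY,
    pickup_address_id  INTEGER NOT NULL REFERENCES address (address_id),
    dropoff_address_id INTEGER NOT NULL REFERENCES address (address_id),
    billing_address_id INTEGER NOT NULL REFERENCES address (address_id)
);
""")
conn.executemany("INSERT INTO address VALUES (?, ?)",
                 [(1, "1 Pickup Rd"), (2, "2 Dropoff Ave")])
# Billing reuses the pickup address row instead of copying its fields.
conn.execute("INSERT INTO shipment VALUES (1, 1, 2, 1)")

# The UI's "same as" option can be computed from the keys when needed.
billing_same_as = conn.execute("""
    SELECT CASE
        WHEN billing_address_id = pickup_address_id  THEN 'pickup'
        WHEN billing_address_id = dropoff_address_id THEN 'dropoff'
        ELSE 'other'
    END
    FROM shipment
    WHERE shipment_id = 1
""").fetchone()[0]
print(billing_same_as)
```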
Just because an address is the same physical address doesn't mean it's the same conceptual address. Really, John Doe's address may be "123 Elm St.", but conceptually his address is "John Doe's mailing address".
In particular, I would say addresses can and should be duplicated within a database because of this simple case: consider two people who live at the same address. Now one of them moves. If you only stored the address once, updating the mover's address would then update the original roommate's address as well.
But in general, consider how the data is tied to other data. If multiple things can relate to it, make sure that a change for one should impact them all.
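If addresses are shared through a key anyway, one way to avoid the roommate side effect is to give the mover a new address row and repoint only their reference, rather than editing the shared row. A minimal sketch with Python's sqlite3 module (the person table and the sample addresses are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (address_id INTEGER PRIMARY KEY, line TEXT NOT NULL);
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address_id INTEGER NOT NULL REFERENCES address (address_id)
);
""")
conn.execute("INSERT INTO address VALUES (1, '123 Elm St.')")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Mover", 1), (2, "Roommate", 1)])

# The mover gets a new address row; the shared row is left untouched.
new_address_id = conn.execute(
    "INSERT INTO address (line) VALUES ('9 Oak Ave.')"
).lastrowid
conn.execute("UPDATE person SET address_id = ? WHERE person_id = 1", (new_address_id,))

address_by_person = dict(conn.execute("""
    SELECT p.name, a.line
    FROM person AS p
    JOIN address AS a ON a.address_id = p.address_id
"""))
print(address_by_person)
```

Editing the line of address 1 in place would instead have changed the address of both people, which is only right when the change really applies to everyone referencing that row.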

Resources