Why are these associations many-to-many? - database

I have the entities EMPLOYEE, ADDRESS and STUDIES, associated as in the picture below. An employee can have more than one address, and could have studied at more than one college.
Why are the relationships below (Has_address and Graduated) many-to-many?
Shouldn't they be one-to-many (because, for example, an address belongs to just ONE employee)?

Other employees could have lived at that same address (at the same or different times; married couples often share an address).
Also, more than one employee might have gone to the same college, and you don't necessarily want to copy the college's data for each employee that went there.
It depends on how you want your structure - you can say there is a single object 'address', and it is really 'a property' of the employee, so it would be 1:n (only allowing for moves of the employee). Or you argue that addresses are objects of their own (a location exists independent of your employees), and 'address' is a relation between an employee and a location; then it would be n:m.
The core point is whether you want to handle locations as separate objects or not. Neither is right or wrong; it is a design decision that you have to make about the limits of your model.
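As a rough illustration of the n:m reading, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (employee, location, has_address) and the sample rows are assumptions made up for this example, not something taken from the question's diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    street      TEXT NOT NULL,
    city        TEXT NOT NULL
);
-- Has_address as an n:m relation: one row per (employee, location) pair.
CREATE TABLE has_address (
    employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    PRIMARY KEY (employee_id, location_id)
);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?)",
    [(1, "1 Main St", "Springfield"), (2, "9 Oak Ave", "Shelbyville")],
)
# Alice and Bob share location 1; Alice also has location 2.
conn.executemany("INSERT INTO has_address VALUES (?, ?)", [(1, 1), (2, 1), (1, 2)])


def residents(location_id):
    """Names of all employees linked to one location."""
    rows = conn.execute(
        "SELECT e.name FROM employee e "
        "JOIN has_address h ON h.employee_id = e.employee_id "
        "WHERE h.location_id = ? ORDER BY e.name",
        (location_id,),
    )
    return [name for (name,) in rows]


def location_count(employee_id):
    """Number of locations linked to one employee."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM has_address WHERE employee_id = ?", (employee_id,)
    ).fetchone()
    return count


print(residents(1))        # ['Alice', 'Bob']
print(location_count(1))   # 2
```

If you decide instead that an address is just a property of one employee, the junction table disappears and the address table simply carries an employee_id column, which gives the 1:n shape.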

Related

Why would the specifications for this database use an aggregation instead of attributes on an entity?

I'm trying to better understand designing a database schema. After reviewing the solution for a problem that I'm working on, I don't understand why the solution chooses to use an aggregation for the attributes "address" and "phone number" for a given "musician". Here are the specifications, I'm only interested in bullet point 1:
Each musician that records at Notown has an SSN, a name, an address, and a phone number. Poorly paid musicians often share the same address, and no address has more than one phone.
Each instrument used in songs recorded at Notown has a name (e.g., guitar, synthesizer, flute) and a musical key (e.g., C, B-flat, E-flat).
Each album recorded on the Notown label has a title, a copyright date, a format (e.g., CD or MC), and an album identifier.
Each song recorded at Notown has a title and an author.
Each musician may play several instruments, and a given instrument may be played by several musicians.
Each album has a number of songs on it, but no song may appear on more than one album.
Each song is performed by one or more musicians, and a musician may perform a number of songs.
Each album has exactly one musician who acts as its producer. A musician may produce several albums, of course.
Here is a solution that I found:
The ER Diagram I created looks almost exactly the same, except for the fact that I made "address" and "phone number" attributes of "musician" instead of giving each of them an entity set of their own, creating a relationship, and turning it into an aggregation. I don't understand why this would be done in this situation. Can anyone explain?? Thank you!
I'm not able to see the image you linked to, but anyway...
no address has more than one phone
This means we should make the phone number an attribute of the address - unless we want to allow for multiple phones per address in the future.
So it would not be completely wrong to make phones a table. But then, we know little about the future. Would there be multiple musicians sharing the same address and the same phones? (I.e. the phone number would be linked to an address.) Or would there be multiple musicians sharing the same address, but each would have their own phone? (I.e. the phone number would be linked to a musician. To use a phone table and link the phones to musicians, however, would only be necessary if a musician could have multiple phone numbers. Otherwise we'd still not make a phone table, but rather make the phone a musician's attribute.)
poorly paid musicians often share the same address
This means we make the address a table of its own. Thus there is only one row to change in case the phone number or some other attribute changes. If we made the address a musician's attribute instead, we'd store the address redundantly and could get inconsistent data (e.g. same address, but different phone numbers).
A possible data model:
address (address_id, street, city, phone, ...)
musician (musician_id, ssn, name, address_id, ...)
This is a 1:n relation. A musician has one address; an address can belong to multiple musicians.
The primary purpose of database normalization is to make it more difficult for anomalous data to get into the database. Reading the first bullet point, we see that each address may have zero or one phone numbers associated with it. In other words, the phone number is an attribute of/identified by the address. Which normalization level does this violate?
To illustrate how not normalizing the address fields (including phone number) increases the chances of anomalous data, let's say you have four students staying at that address. This means you have four rows where the address data exists. Suppose the phone number changes. You have to make sure you change all four versions of the data. I said there were four students, but suppose there are actually five and I just missed one? Or suppose you found only three when you went to make the change? An address may have at most one phone number; however, you now have several copies of the same address with different phone numbers. This is anomalous data.
If this data is normalized, you would have only one copy to change. Since this data is referenced by all the students who live there, no matter how many, this change is "propagated" to all of them. The integrity of the data is maintained.
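The single-copy update described here can be sketched with Python's built-in sqlite3 module. The two tables follow the address/musician model given in the earlier answer; the sample rows, the SSNs and the phone numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city       TEXT NOT NULL,
    phone      TEXT              -- at most one phone per address
);
CREATE TABLE musician (
    musician_id INTEGER PRIMARY KEY,
    ssn         TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    address_id  INTEGER NOT NULL REFERENCES address(address_id)
);
""")
conn.execute("INSERT INTO address VALUES (1, '12 Blues Alley', 'Notown', '555-0100')")
conn.executemany(
    "INSERT INTO musician VALUES (?, ?, ?, ?)",
    [
        (1, "111-11-1111", "Ann", 1),
        (2, "222-22-2222", "Ben", 1),
        (3, "333-33-3333", "Cal", 1),
    ],
)

# The phone number changes: exactly one row is updated ...
conn.execute("UPDATE address SET phone = '555-0199' WHERE address_id = 1")

# ... and every musician at that address sees the new number.
phones = conn.execute(
    "SELECT m.name, a.phone FROM musician m "
    "JOIN address a ON a.address_id = m.address_id "
    "ORDER BY m.name"
).fetchall()
print(phones)
```

Because the phone lives in one address row, it does not matter whether three, four or five people share the address: none of them can be missed by the update.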

Reusing a database table for many other entities? Is this possible?

Say for example, I have an ADDRESS table, that will store similar attributes of other entities like address, city, zip, country, etc. The entities are USER, COMPANY, BANK, BRANCH, etc. I would like to use this one table ADDRESS to store the addresses of the other entities rather than creating other tables for each entity to store the ADDRESS like so, USER_ADDRESS, COMPANY_ADDRESS, BANK_ADDRESS, BRANCH_ADDRESS.
Is this possible? Am I breaking any laws or conventions? What are the consequences, if any?
Each entity (USER, COMPANY, etc.) should contain a reference to an entry in the ADDRESS table.
There are a few issues:
If 2 users have the same address, they should reference the same address id.
You will need to normalise addresses so that you're not duplicating information (e.g. if you know the city, then you automatically know the zip and country).
Of course, you may not want a well-normalised database. Saving the entire address as a string will improve read performance by reducing the number of join operations.
A lot of things depend on the exact use of the database.
It is fine to use a single ADDRESS table for that purpose and have an ADDRESS_ID in each of the other entities. Depends on the use case and the way you prefer to implement it. I most probably wouldn't do it. I also wouldn't do the other solution you're suggesting (an address table per entity).
So, let's say you want to implement a function to search for all the addresses, where it doesn't matter what type of entity is connected to it. You will have to search the ADDRESS table. If you get results, then you have to search the other four tables to see which record is connected to that address.
You could add a field ENTITY_TYPE in the ADDRESS table where you specify which type of entity it is connected to, so you don't have to search the four tables, but I don't recommend this since you can have consistency errors (USER 17 points to ADDRESS 14, but ADDRESS 14 has ENTITY_TYPE = BANK).
Now, with your other solution (having four separate tables to store the addresses of the four different entities) you're just going to have to search those four tables and then search the corresponding entity table to get the entity you're looking for.
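To make the lookup cost of the shared ADDRESS table concrete, here is a small sketch using Python's built-in sqlite3 module. Only two of the four entity tables are modelled, and the table names (users, company), the columns and the sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id   INTEGER PRIMARY KEY,
    address      TEXT NOT NULL,
    zip_code     TEXT NOT NULL,
    country_code TEXT NOT NULL
);
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address_id INTEGER REFERENCES address(address_id)
);
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address_id INTEGER REFERENCES address(address_id)
);
INSERT INTO address VALUES (14, '5 High Street', '12345', 'DE');
INSERT INTO users   VALUES (17, 'Dana', 14);
INSERT INTO company VALUES (3, 'Acme', 14);
""")


def owners_of(address_id):
    """Find every entity that points at one address.

    Each entity table has to be probed, because the address row itself
    does not know who references it.
    """
    return conn.execute(
        "SELECT 'USER', name FROM users WHERE address_id = ? "
        "UNION ALL "
        "SELECT 'COMPANY', name FROM company WHERE address_id = ? "
        "ORDER BY 1, 2",
        (address_id, address_id),
    ).fetchall()


print(owners_of(14))   # [('COMPANY', 'Acme'), ('USER', 'Dana')]
```

The entity type is derived from which table matched rather than stored on the address row, so the kind of mismatch described above (a USER pointing at an address marked as BANK) cannot occur in this layout.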
My solution in most cases is adding the address fields to the entities tables themselves. Having ADDRESS, ZIP_CODE and COUNTRY_CODE (always use proper country codes, not country names https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) will make it simple. When you present a list of items (users, banks, companies, offices, whatever), it's really common to show the name and the address at the same time in a table. Having no JOINS makes it faster and easier to process. If you want to update an address, it's on the table itself. No lookups!
Of course, like most things in programming, it depends on what your needs are.
Also, please, don't try to split the ADDRESS in more fields. I've seen ADDRESS_TYPE (street, road, avenue, square, ...), STREET_NAME, STREET_NUMBER, BLOCK_NUMBER, BLOCK_FLOOR, BLOCK_LETTER. I'm pretty sure you're never going to need something like SELECT * FROM USER WHERE STREET_NUMBER = 74.

Is a Value Object that is used many times an entity?

The question might not be very clear in the title, let me explain:
In my model, I have a Person, that has an Address. However, many Persons can share the same Address.
As I was defining my model, I assumed that Person is an Entity, but Address a Value-Object since if you change a single property of the Address, well it's not the same Address anymore.
Since multiple Persons can share an Address, if I jump right into the database implementation and naively assume that person has some address_xxxx fields, wouldn't it generate too many duplicates in the database? Isn't it better that person has an address_id field, related to an address table? If so, then Address is an Entity, right?
Is a Value Object that is used many times an entity?
No, but it depends...
It is often the case that a value object is actually a proxy identifier for an entity, that you may not have explicitly realized in your model.
For example:
1600 Pennsylvania Ave NW
Washington, DC
20500
If you look at that carefully, you'll see embedded in it
The name of a street
The name of a city
If those are references to a street/city entities in your model, then "address" is the representation of the current state of some entity (ex: "The White House").
Complicating things further - you want suitable abstractions for your model.
Consider money:
{USD:100}
That's a value type, we can replace any USD:100 with a "different" USD:100
{USD:100, SerialNumber:KB46279860I}
That's still a value (it's state), but it is the state of a specific bill that exists in circulation (somewhere). What we have here is an information resource that is describing an entity out in the real world, somewhere.
You also need to be careful about coincident properties. For example: if the name of the street changes, should the value of the address change? If the model cares about the current identifier of a location, then perhaps it should. If the model is tracking what information you put on an envelope two months ago, then it certainly shouldn't. (In other words, when we changed the label for the street entity, the label already printed on the envelope entity didn't change).
It's an important question, but the answer changes depending on what you are modeling at the time.
In my model, I have a Person, that has an Address. However, many Persons can share the same Address.
Isn't it better that person has an address_id field, related to an address table? If so, then Address is an Entity, right?
You have to recognize that there are two distinct models, a domain model and a persistence model and both may not agree on whether a concept is an entity or a value.
The first thing you have to do is ask yourself what an address is from the domain perspective. Is your domain interested in the lifecycle of addresses, or are they just immutable values? For instance, what happens if there is a typo in an address? Do you simply discard the incorrect one and replace it, or would you rather modify the original address details to track its continuity? These questions will help you to determine whether an address is an entity or a value from the domain perspective.
Now, a concept may be a value in the domain while being an entity in the persistence model. For instance, let's say that you aren't interested in the lifecycle of addresses in the domain, but you are very concerned about optimizing the storage space. In that case, you could give identifiers to unique addresses in the DB and use that for relationships rather than copying the same address details multiple times.
However, doing so would introduce additional tensions between your models, so you must be sure that there are real benefits to doing so.
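The split between the two models can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names (Address, Person, AddressTable) and the in-memory "table" are hypothetical stand-ins, not part of any particular framework or ORM.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Address:
    """Domain value object: immutable and compared by its fields."""
    street: str
    city: str
    postal_code: str


@dataclass(eq=False)
class Person:
    """Domain entity: compared by identity (person_id), not by its fields."""
    person_id: int
    name: str
    address: Address

    def __eq__(self, other):
        return isinstance(other, Person) and self.person_id == other.person_id

    def __hash__(self):
        return hash(self.person_id)


class AddressTable:
    """Toy persistence model: one surrogate id per distinct address value."""

    def __init__(self):
        self._ids = {}

    def id_for(self, address):
        # Reuse the existing id if this exact value was stored before.
        return self._ids.setdefault(address, len(self._ids) + 1)


home = Address("1 Main St", "Springfield", "12345")
alice = Person(1, "Alice", home)
bob = Person(2, "Bob", Address("1 Main St", "Springfield", "12345"))

table = AddressTable()
same_row = table.id_for(alice.address) == table.id_for(bob.address)

# Fixing a typo produces a new value; the old one is simply discarded.
corrected = replace(home, street="1 Main Street")
alice.address = corrected

print(same_row)            # True
print(corrected == home)   # False
```

In the domain, the two people merely hold equal values. Only the storage layer hands out an id so that the shared details are written once, which is the tension between the models mentioned above.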

How can I get rid of an "indirect" foreign key? Should I?

Here's an example schema to illustrate what I'm talking about:
Let's say I'm storing information about some activities (seminars, trainings, whatever) that are being hosted in a certain set of locations, identified by type (hackerspace, swimming pool, etc) and city. Each activity happens at all of the locations of a suitable type at once (e.g. any programming seminar happens at all of the hackerspaces at once), so any person may choose to attend an activity in any of the suitable locations. Therefore, any activity is associated only with some location type, while an attendance record is associated with some activity (and therefore implicitly with some location type) and the city where this particular user attended the activity.
The most common query in the system by far is generating a report of all activities attended by a given person.
Am I right in feeling that this is ugly? Should I try to redesign this, and if so, how?
P.S. I'd rather not reveal the actual data I'm storing in a database where I had to employ a similar design, so I hope that this analogy makes some sense.
It sounds like you need a LocationTypes table with a list of location types. Then, Location can have a foreign key relationship to LocationTypes.
But I don't like assuming that the set of locations doesn't change over time; that is overly simplistic. So I would have another entity, something like LocationSets, which would list the locations for a given activity over time. The LocationSets would contain the "type" which can be used. The locations associated with a location set would be in another table, a junction table connecting the location sets and the locations.
Then Activities would have a LocationSetId. And Attendance would have a LocationId. You might want to enforce that at any given time, the Attendance location is consistent with the locations in the Activity's LocationSets. This could be done at the application layer, through a trigger in the database, or through a mechanism such as a function-based constraint (if your database supports those).
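One way the trigger option might look, sketched with Python's built-in sqlite3 module. The exact columns, the trigger name and the sample data are assumptions for this example. The time dimension of the location sets is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_type (location_type_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (
    location_id      INTEGER PRIMARY KEY,
    location_type_id INTEGER NOT NULL REFERENCES location_type(location_type_id),
    city             TEXT NOT NULL
);
CREATE TABLE location_set (
    location_set_id  INTEGER PRIMARY KEY,
    location_type_id INTEGER NOT NULL REFERENCES location_type(location_type_id)
);
-- Junction table connecting location sets and locations.
CREATE TABLE location_set_member (
    location_set_id INTEGER NOT NULL REFERENCES location_set(location_set_id),
    location_id     INTEGER NOT NULL REFERENCES location(location_id),
    PRIMARY KEY (location_set_id, location_id)
);
CREATE TABLE activity (
    activity_id     INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    location_set_id INTEGER NOT NULL REFERENCES location_set(location_set_id)
);
CREATE TABLE attendance (
    person_id   INTEGER NOT NULL,
    activity_id INTEGER NOT NULL REFERENCES activity(activity_id),
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    PRIMARY KEY (person_id, activity_id)
);
-- Reject attendance at a location outside the location set of the activity.
CREATE TRIGGER attendance_location_check
BEFORE INSERT ON attendance
BEGIN
    SELECT RAISE(ABORT, 'location is not in the location set of the activity')
    WHERE NOT EXISTS (
        SELECT 1
        FROM activity a
        JOIN location_set_member m ON m.location_set_id = a.location_set_id
        WHERE a.activity_id = NEW.activity_id
          AND m.location_id = NEW.location_id
    );
END;

INSERT INTO location_type VALUES (1, 'hackerspace'), (2, 'swimming pool');
INSERT INTO location VALUES (10, 1, 'Berlin'), (11, 1, 'Paris'), (20, 2, 'Berlin');
INSERT INTO location_set VALUES (100, 1);
INSERT INTO location_set_member VALUES (100, 10), (100, 11);
INSERT INTO activity VALUES (1000, 'programming seminar', 100);
""")


def try_attend(person_id, activity_id, location_id):
    """Return True if the row was accepted, False if the trigger or a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO attendance VALUES (?, ?, ?)",
            (person_id, activity_id, location_id),
        )
        return True
    except sqlite3.DatabaseError:
        return False


def activities_attended(person_id):
    """The report from the question: all activities attended by one person."""
    return conn.execute(
        "SELECT a.name, l.city FROM attendance t "
        "JOIN activity a ON a.activity_id = t.activity_id "
        "JOIN location l ON l.location_id = t.location_id "
        "WHERE t.person_id = ? ORDER BY a.name",
        (person_id,),
    ).fetchall()


print(try_attend(1, 1000, 11))   # True: Paris hackerspace is in the set
print(try_attend(2, 1000, 20))   # False: the swimming pool is not
print(activities_attended(1))    # [('programming seminar', 'Paris')]
```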

customer-address, property-address and company-address

I am modelling a loan database for a friend.
A Customer can have 0 to N Addresses (a street address or a PO Box address, or even more than one street address and more than one PO Box address). A Property must have only one Address. A Company (employment info) must have only one Address.
It will be better to have a separate Addresses table for the Customers table. The address for Property and Company can go with Properties and Companies table.
But since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
When we think about the relationship between entities, should we look at a single point in time (a static view?) or at a certain range of time (a dynamic view?) to analyze their relationship? For example, a company can only have ONE address at a certain point in time, but that company may have moved from one place to another recently. Then a company may have more than one address for a certain range of time.
Customer would be better with a 1 to N than a 0 to N relationship: since you are making loans, you might want to know their address.
A Company (employment info) must have only one address.
Then a company may have more than one address for a certain range of time.
You are contradicting yourself a bit: why would you need the two addresses? I think the company will have just one official address until everything has moved to the new address, at which point you can update your DB to the new one.
But since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
Yes
And here is a nice link with some ideas on modelling:
http://www.databaseanswers.org/data_models/
A Company (employment info) must have only one Address.
Not necessarily. A Company can have a mailing address and a physical address.
Since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
Yes, it's a good idea to put addresses in the Addresses table. Your Properties table would have an address row foreign key, and your Companies table would have 2 foreign keys, one for a mailing address and one for a physical address. The mailing address would be an optional (nullable) foreign key.
You would need a CustomerAddress table to maintain the 0 to N relationship between Customer and Address. If you want, you can also have a 0 to N relationship between Address and Customer.
The table would look like this.
CustomerAddress
---------------
CustomerAddress ID
Customer ID
Address ID
The CustomerAddress ID is the primary (clustering) index. It is an ascending integer or long, or some other unique ID.
You would have a unique index on (Customer ID, Address ID).
If you want to associate addresses with customers, you would have another unique index on (Address ID, Customer ID).
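A minimal sketch of that junction table and its two unique indexes, using Python's built-in sqlite3 module. The snake_case names, the index names and the sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address  (address_id  INTEGER PRIMARY KEY, line TEXT NOT NULL);
CREATE TABLE customer_address (
    customer_address_id INTEGER PRIMARY KEY,   -- ascending integer id
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    address_id  INTEGER NOT NULL REFERENCES address(address_id)
);
-- Customer -> addresses lookups; also forbids linking the same pair twice.
CREATE UNIQUE INDEX ux_customer_address ON customer_address (customer_id, address_id);
-- Address -> customers lookups.
CREATE UNIQUE INDEX ux_address_customer ON customer_address (address_id, customer_id);

INSERT INTO customer VALUES (1, 'Carol'), (2, 'Dave');
INSERT INTO address  VALUES (1, '1 Main St'), (2, 'PO Box 42');
""")


def link(customer_id, address_id):
    """Link a customer to an address; return False if the pair already exists."""
    try:
        conn.execute(
            "INSERT INTO customer_address (customer_id, address_id) VALUES (?, ?)",
            (customer_id, address_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(link(1, 1))   # True
print(link(1, 2))   # True: a second address for the same customer
print(link(1, 1))   # False: duplicate pair rejected by the unique index
```

A customer with no rows in customer_address is the "0" case of the 0 to N relationship.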
A company can only have ONE address at a certain point in time, but that company may have moved from one place to another recently. Then a company may have more than one address for a certain range of time.
If this information is important, then you have to include a date written column in your CompanyAddress table. You would create a unique index on (Company ID, Date written descending). This way, the first row you retrieve from the CompanyAddress table for a company would point to its most current address.
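The "most current address first" idea could look roughly like this, again with Python's built-in sqlite3 module. The column names, the ISO date strings and the sample company are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (address_id INTEGER PRIMARY KEY, line TEXT NOT NULL);
CREATE TABLE company_address (
    company_id   INTEGER NOT NULL,
    address_id   INTEGER NOT NULL REFERENCES address(address_id),
    date_written TEXT NOT NULL   -- ISO 8601 date, so text order equals date order
);
CREATE UNIQUE INDEX ux_company_date
    ON company_address (company_id, date_written DESC);

INSERT INTO address VALUES (1, '1 Old Road'), (2, '2 New Street');
INSERT INTO company_address VALUES (7, 1, '2019-03-01'), (7, 2, '2023-06-15');
""")


def current_address(company_id):
    """Return the most recently written address of a company, or None."""
    row = conn.execute(
        "SELECT a.line FROM company_address ca "
        "JOIN address a ON a.address_id = ca.address_id "
        "WHERE ca.company_id = ? "
        "ORDER BY ca.date_written DESC LIMIT 1",
        (company_id,),
    ).fetchone()
    return row[0] if row else None


print(current_address(7))   # 2 New Street
```

The older rows stay in the table, so the address history remains available if it is ever needed.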
It seems like a very popular idea to put all Addresses in their own table. Developers love to seek out repetition and eliminate it. But in this case I would hesitate to dignify addresses with Entity status by putting them in their own dedicated table, because if, like most applications, you don't treat addresses as full-fledged entities, this gets overcomplicated.
If you treated addresses as real entities then if two companies somehow shared the same address, or one inhabited a location for a while, then another one inhabited that same location, then those companies would reference the same address. Because when your application was accepting an address as input it would go see if there was an existing address and reference it rather than just slam some garbage into the address table. Which one do you intend to do? I expect it's the slam one, which is fine, because like most business applications you totally don't care if the new address you're putting in is the same as some other address already in the database, you have no interest in tracking the addresses as individual things. And that's the difference between entities and cat food.
So with the consolidation we have to introduce an intersection table, and index it, and all our entities that have addresses have to join to it, we have to think about whether to get the address eagerly or use lazy loading. We chucked all the addresses into one bucket and have to work to make sure everybody can get to their own address quickly. For real entities this makes some sense because different things need to link to the same entity, but we established above that we don't care about that, nobody is sharing these entries.
Where's the repetition we're eliminating by consolidating addresses into one table? The addresses are going to end up in the database somewhere regardless, with the same fields, we're not saving space. The only repetition is in the DDL used to generate the schema, which we can manage by making a reusable component (where "component" is the Hibernate term) for the address (which addresses redundancy in the application code) and using the ORM tool to generate the schema. Or, worst case, just ignore it, addresses don't change that much, it's not your biggest problem.
These requirements you are describing sound suspiciously enterprise-y for a project you're doing for a friend. Possibly your friend's brain has been poisoned by overexposure to elaborate requirements concocted by committees who don't know what they're doing. It's bad enough we have to put up with this junk at work, but for personal projects? Try to talk him down.
But maybe your friend is outsourcing his enterprise-y work to you and you're stuck with 0-N addresses per customer. If so, contain the damage: make a table exclusively for customer addresses, so you don't need the intersection table, and put the other entities' addresses inline. Making these entities that have only one address go get their address from another table doesn't buy you anything but more joins. If you need history, write it to a separate history table where it's out of the way.
