How to store "same as" data? - database

I've got one model with 3 addresses: pickup, dropoff, and billing. I figure the billing address will usually be either the pickup or drop-off address, so from a UI perspective, I should have a "same as" option. But from a DB perspective, should I save the "same as" field, or should I duplicate the data?

You should have the same Id of a row from an Address table in two different columns, PickUp and DropOff. This way, you do not duplicate the address, do not use some sentinel address, and can easily query to see if the PickUp address is the same as the DropOff. If one of these changes in the future, you can always modify the Id value stored in its respective column to a new address.

You could create a table called 'Address' and make Pickup, Dropoff, Billing FKs to that Address table.

Just because an address is the same physical address doesn't mean it's the same conceptual address. Really, John Doe's address may be "123 Elm St.", but conceptually his address is "John Doe's mailing address".
In particular, for addresses I would say can and should be duplicated within a database because of this simple case: consider two people who live at the same address. Now one of them moves. If you only stored the address once, updating the "mover"s address would then update the original roommate's address as well.
But in general, consider how the data is tied to other data. If multiple things can relate to it, make sure that a change for one should impact them all.

Related

Reusing a database table for many other entities? Is this possible?

Say for example, I have an ADDRESS table, that will store similar attributes of other entities like address, city, zip, country, etc. The entities are USER, COMPANY, BANK, BRANCH, etc. I would like to use this one table ADDRESS to store the addresses of the other entities rather than creating other tables for each entity to store the ADDRESS like so, USER_ADDRESS, COMPANY_ADDRESS, BANK_ADDRESS, BRANCH_ADDRESS.
Is this possible? Am i breaking any laws or conventions? What are the consequences, if any?
Each entity (USER, COMPANY, etc.) should contain a reference to an entry in the ADDRESS table.
There are a few issues:
If 2 users have the same address, they should reference the same address id.
You will need to normalise addresses so that you're not duplicating information (e.g. if you know the city, then you automatically know the zip and country).
Of course, you may not want a well-normalised database. Saving the entire address as a string will improve read performance by reducing the number of join operations.
A lot of things depend on the exact use of the database.
It is fine to use a single ADDRESS table for that purpose and have an ADDRESS_ID in each of the other entities. Depends on the use case and the way you prefer to implement it. I most probably wouldn't do it. I also wouldn't do the other solution you're suggesting (an address table per entity).
So, let's say you want to implement a function to search for all the addresses, where it doesn't matter what type of entity is connected to it. You will have to search the ADDRESS table. If you get results, then you have to search the other four tables to see which record is connected to that address.
You could add a field ENTITY_TYPE in the ADDRESS table where you specify which type of entity it is connected to, so you don't have to search the four tables, but I don't recommend this since you can have consistency errors (USER 17 points to ADDRESS 14, but ADDRESS 14 has ENTITY_TYPE = BANK).
Now, with your other solution (having four separate tables to store the addresses of the four different entities) you're just going to have to search those four tables and then search the corresponding entity table to get the entity you're looking for.
My solution in most cases is adding the address fields to the entities tables themselves. Having ADDRESS, ZIP_CODE and COUNTRY_CODE (always use proper country codes, not country names https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) will make it simple. When you present a list of items (users, banks, companies, offices, whatever), it's really common to show the name and the address at the same time in a table. Having no JOINS makes it faster and easier to process. If you want to update an address, it's on the table itself. No lookups!
Of course, like most things in programming, it depends on what your needs are.
Also, please, don't try to split the ADDRESS in more fields. I've seen ADDRESS_TYPE (street, road, avenue, square, ...), STREET_NAME, STREET_NUMBER, BLOCK_NUMBER, BLOCK_FLOOR, BLOCK_LETTER. I'm pretty sure you're never going to need something like SELECT * FROM USER WHERE STREET_NUMBER = 74.

Is a Value Object that is used many times an entity?

The question might not be very clear in the title, let me explain:
In my model, I have an Person, that has an Address. However, many Persons can share the same Address.
As I was defining my model, I assumed that Person is an Entity, but Address a Value-Object since if you change a single property of the Address, well it's not the same Address anymore.
Since multiple Persons can share an Address, if I jump right into the database implementation, and naively assume that person has some address_xxxx fields, wouldn't it generate too many duplicates in the database ? Isn't it better that person has an address_id field, related to an address table ? If so, then Address is an Entity right ?
Is a Value Object that is used many times an entity?
No, but it depends...
It is often the case that a value object is actually a proxy identifier for an entity, that you may not have explicitly realized in your model.
For example:
1600 Pennsylvania Ave NW
Washington, DC
20500
If you look at that carefully, you'll see embedded in it
The name of a street
The name of a city
If those are references to a street/city entities in your model, then "address" is the representation of the current state of some entity (ex: "The White House").
Complicating things further - you want suitable abstractions for your model.
Consider money:
{USD:100}
That's a value type, we can replace any USD:100 with a "different" USD:100
{USD:100, SerialNumber:KB46279860I}
That's still a value (it's state), but it the state of a specific bill that exists in circulation (somewhere). What we have here is an information resource that is describing an entity out in the real world, somewhere.
You also need to be careful about coincident properties. For example; the name of the street changes -- should the value of address change? If the model cares about the current identifier of a location, then perhaps it should. If the model is tracking what information you put on an envelope two months ago, then it certainly shouldn't. (In other words, when we changed the label for the street entity, the label already printed on the envelope entity didn't change).
It's an important question, but the answer changes depending on what you are modeling at the time.
In my model, I have an Person, that has an Address. However, many
Persons can share the same Address.
Isn't it better that person has an address_id field, related to an
address table ? If so, then Address is an Entity right?
You have to recognize that there are two distinct models, a domain model and a persistence model and both may not agree on whether a concept is an entity or a value.
The first thing you have to do is ask yourself what is an address from the domain perspective? Is your domain interested in the lifecycle of addresses or they are just immutable values? For instance, what happens if there is a typo in an address? Do you simply discard the incorrect one and replace it or would you rather modify the original address details to track it's continuity? These questions will help you to determine whether an address is an entity or a value from the domain perspective.
Now, a concept may be a value in the domain while being an entity in the persistence model. For instance, let's say that you aren't interested in the lifecycle of addresses in the domain, but you are very concerned about optimizing the storage space. In that case, you could give identifiers to unique addresses in the DB and use that for relationships rather than copying the same address details multiple times.
However, doing so would introduce additional tensions between your models, so you must be sure that there are real benefits to do so.

how to handle address dimension with 75M locations

I am creating an address dimension for a Snowflake Schema Data Warehouse. I have 75M locations on a source that I want to convert to said schema. I know how to handle Zip->City->County->State dimensions, but if I add street addresses to the location dimension I would have an equal number dimension rows as fact rows.
What I need to know, is where should the street addresses go (123 anywhere St.)? Should it go in the fact table? How do I handle street addresses?
Thanks.
The street address itself should go in a Fact. If it's a Real Estate app I'd imagine there'd be some kind of "Sale Contract Fact" or "Rental Contract Fact" or something similar - the street address would be an attribute of that fact.
In your instance the instance of the address is definitely tied to a single transaction. As you said, the same street address could appear multiple times, but it would be on different Sales Contracts and thus different Fact instances.
Other elements of the address (zipcode, city, state etc) would be dimensionalised as it makes sense to group them for classification.

should I move related columns in a new table?

I have a customer table that references an address - a 1 to 1 relationship, with several fields. No other table currently references an address. Does it therefore make more sense to store all the fields under one table, even though it can be encapsulated or should I just create a separate address table to store the address fields. What are the advantages/disadvantages?
In general, it depends on your requirements.
If some of the field values can be updated, but the entity should stay the same (should be referenced by same key), then you need a separate table for it. For example, it’s required to maintain a database of valid addresses, and customer should select a predefined address from a list instead of typing it by hand, or it’s required to provide some specific behavior based on address location and so on.
In your case doing so just adds absolutely unnecessary bit of complexity to the model.
Having single table and using OOP language you can still encapsulate customer address in separate object (so called Value Object pattern).

customer-address, property-address and company-address

I am modelling a loan database for a friend.
A Customer can have 0 to N Addresses (street address or POBox address or even more than 1 street addresses and more than on POBox addresses). A Property must have only one Address. A Company (employment info) must have only one Address.
It will be better to have a separate Addresses table for the Customers table. The address for Property and Company can go with Properties and Companies table.
But since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
When we think about the relationship between entities, we should cut off a time point (static way?) or we should view a certain range of the time (dynamic way?) to analyze their relationship? For example, a company can only have ONE address at certain time point but that company may moved from one place to another recently. Then a company may have more than one address for a certain range of time.
Customer would be better with a 1 to N than a 0 to N relationship, since you are making loans you might want to know where their address.
A Company (employment info) must have only one address.
Then a company may have more than one address for a certain range of
time.
You are contradicting yourself a bit, why would you need the two address? I think the company will have their official just one address till they get everything on the new address at which point you can update your DB to the new one.
But since we have an Addresses table here, do you think it is a good
idea or not to share that Addresses table for Companies and Properties
tables as well?
Yes
And here a nice link with some ideas on modelling:
http://www.databaseanswers.org/data_models/
A Company (employment info) must have only one Address.
Not necessarily. A Company can have a mailing address and a physical address.
Since we have an Addresses table here, do you think it is a good idea or not to share that Addresses table for Companies and Properties tables as well?
Yes, it's a good idea to put addresses in the Addresses table. Your Properties table would have an address row foreign key, and your Companies table would have 2 foreign keys, one for a mailing address and one for a physical address. The mailing address would be an optional (nullable) foreign key.
You would need a CuustomerAddress table to maintain the 0 to N relationship between Customer and Address. If you want, you can also have a 0 to N relationship between Address and Customer.
The table would look like this.
CustomerAddress
---------------
CustomerAddress ID
Customer ID
Address ID
The CustomerAddress ID is the primary (clustering) index. It is an ascending integer or long, or some other unique ID.
You would have a unique indexon (Customer ID, Address ID).
If you want to associate addresses with customers, you would have another unique index on (Address ID, Customer ID).
A company can only have ONE address at certain time point but that company may moved from one place to another recently. Then a company may have more than one address for a certain range of time.
If this information is important, then you have to include a date written column in your CompanyAddress table. You would create a unique index on (Company ID, Date written descending). This way, the first row you retrieve from the Address table would be the most current address.
It seems like a very popular idea to put all Addresses in their own table. Developers love to seek out repetition and eliminate it. But in this case I would hesitate to dignify addresses with Entity status by putting them in their own dedicated table, because if, like most applications, you don't treat addresses as full-fledged entities, this gets overcomplicated.
If you treated addresses as real entities then if two companies somehow shared the same address, or one inhabited a location for a while, then another one inhabited that same location, then those companies would reference the same address. Because when your application was accepting an address as input it would go see if there was an existing address and reference it rather than just slam some garbage into the address table. Which one do you intend to do? I expect it's the slam one, which is fine, because like most business applications you totally don't care if the new address you're putting in is the same as some other address already in the database, you have no interest in tracking the addresses as individual things. And that's the difference between entities and cat food.
So with the consolidation we have to introduce an intersection table, and index it, and all our entities that have addresses have to join to it, we have to think about whether to get the address eagerly or use lazy loading. We chucked all the addresses into one bucket and have to work to make sure everybody can get to their own address quickly. For real entities this makes some sense because different things need to link to the same entity, but we established above that we don't care about that, nobody is sharing these entries.
Where's the repetition we're eliminating by consolidating addresses into one table? The addresses are going to end up in the database somewhere regardless, with the same fields, we're not saving space. The only repetition is in the DDL used to generate the schema, which we can manage by making a reusable component (where "component" is the Hibernate term) for the address (which addresses redundancy in the application code) and using the ORM tool to generate the schema. Or, worst case, just ignore it, addresses don't change that much, it's not your biggest problem.
These requirements you are describing sound suspiciously enterprise-y for a project you're doing for a friend. Possibly your friend's brain has been poisoned by overexposure to elaborate requirements concocted by committees who don't know what they're doing. It's bad enough we have to put up with this junk at work, but for personal projects? Try to talk him down.
But maybe your friend is outsourcing his enterprise-y work to you and you're stuck with 0-N addresses per customer. If so, contain the damage: make a table exclusively for customer addresses, so you don't need the intersection table, and put the other entities' addresses inline. Making these entities that have only one address go get their address from another table doesn't buy you anything but more joins. If you need history, write it to a separate history table where it's out of the way.

Resources