Avoid Duplicate Data in Postgres with Lookup Table - database

If I have a table of installed equipment with make and model, where the make and model will be duplicated a lot, but with a variety of spellings, etc., what is the best way to avoid wasted space from data duplication?
CREATE TABLE equipment (
id integer NOT NULL,
make character varying(128),
model character varying(128),
lat double precision,
lon double precision,
created timestamp without time zone,
updated timestamp without time zone
);
This table has a lot more fields in reality and will have many millions of rows, and I have other tables in a similar situation totalling about 600 GB of data.
The source data needs to be kept the same (ie. "Panasonic" and "PANASONIC" can't be combined / corrected), and the scale and variety of the data makes that impractical anyway.
I'm envisioning a separate key:value table that stores the values and then the ID is just stored in the equipment table, with a function where I just pass the value and it returns the ID (whether it looks it up and returns the ID or inserts it and returns the new ID).
That would make the tables become:
CREATE TABLE equipment (
id integer NOT NULL,
make integer,
model integer,
lat double precision,
lon double precision,
created timestamp without time zone,
updated timestamp without time zone
);
CREATE TABLE lookup (
id integer NOT NULL,
value character varying(128),
updated timestamp without time zone
);
And interacting with the table would be:
SELECT
id,
lookup_value(make) AS make,
lookup_value(model) AS model,
lat,
lon,
created,
updated
FROM
equipment;
INSERT INTO
equipment (id, make, model, created)
VALUES
(nextval('equipment_id_seq'::regclass), lookup_value('Panasonic'), lookup_value('ABC123-G'), NOW());
The lookup table could be reused among a variety of fields and tables, with each string value only appearing once, and the key:value staying the same forever (changing from "Panasonic" and "PANASONIC" wouldn't change the key for "Panasonic", it would return the key for "PANASONIC" instead, inserting if needed).
What are the problems with this approach (aside from code complexity)?
Is there a better approach?

You would never want to have a generic lookup table like this. For one, it means you can't create a foreign key between the two "value" columns and the IDs, because there is no way of stopping an entry for Make ending up in Model.
As @a_horse_with_no_name said, you would be better off creating separate model and make tables, with an FK between them, and then doing as you say: only save a new model or make if it doesn't already exist.
I would also be tempted to add a third column so that, for all the possible spellings of PANASONIC for example, you have both the lookup row for what they entered and a reference to what they probably meant. That would assist in cleaning up data going forward. You could suggest in the UI "Did you mean Panasonic?" when they enter "Panasoonic", for example.
Coding is up to you, either in a single update, a stored proc, or app code.
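As a rough illustration of the "only save a new make or model if it doesn't already exist" idea, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for Postgres so that it runs anywhere; the table names make and model, the columns make_id and model_id, and the helper get_or_create_id are assumptions made for this example, not something from the question. In Postgres the same idea could be written with INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

# Hypothetical lookup tables; only these names may be passed to get_or_create_id.
LOOKUP_TABLES = ("make", "model")


def create_schema(conn):
    """Create one lookup table per attribute, plus the equipment table."""
    conn.executescript("""
        CREATE TABLE make (
            id    INTEGER PRIMARY KEY,
            value TEXT NOT NULL UNIQUE   -- case-sensitive by default in SQLite
        );
        CREATE TABLE model (
            id    INTEGER PRIMARY KEY,
            value TEXT NOT NULL UNIQUE
        );
        CREATE TABLE equipment (
            id       INTEGER PRIMARY KEY,
            make_id  INTEGER REFERENCES make(id),
            model_id INTEGER REFERENCES model(id)
        );
    """)


def get_or_create_id(conn, table, value):
    """Return the id for value in the given lookup table, inserting it if new."""
    if table not in LOOKUP_TABLES:
        raise ValueError(f"unknown lookup table: {table}")
    with conn:  # commit the insert (or roll back on error)
        # The UNIQUE constraint makes this a no-op when the value already exists.
        conn.execute(f"INSERT OR IGNORE INTO {table} (value) VALUES (?)", (value,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE value = ?", (value,)
    ).fetchone()
    return row[0]
```

Because the comparison is case-sensitive, "Panasonic" and "PANASONIC" get separate ids, which matches the requirement that source spellings are kept as entered.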

Related

Duplicate data vs Calculated data in database

I'm starting to track a host of variables around my life (QuantifiedSelf). I have a lot of input sources, and I'm working on sticking it all into a database. I plan on using this database with R to ask arbitrary questions about my life ("Which routes are the fastest to work", or "What foods affect my mood", etc)
The key question I'm trying to answer here is "Do I process the input before sticking it into the database?"
Examples of "process":
Some of my input is a list of moods (one for each day). As of right now, there are only 5 available moods (name with a rating between -2 and 2). Do I normalize this data and create two tables: A Mood table (with 5 items) and a DailyMood table?
If I process the data then I lose the original data. Perhaps I change a mood to have a different name. If I do this in a normalized database, then I lose the information that before the change, I had a mood "oldName"
If I don't process the data, then I have duplication of data
Another input is a list of GPS locations (lat, long). However, most of my day is spent in a single spot, or spent driving. Do I process this data to create two tables "Locations" and "Routes"?
If I don't process the data, then I have a whole bunch of duplicate locations (at different timestamps), which is difficult to query and get good data out of.
If I process the data, then I lose the original data. I end up with a nice set of Locations and Routes that is easy to query, but if those locations or routes are wrong, I would have to redownload the input source and rebuild the database.
However, I feel like I'm stuck between two opposing "ideals":
If I process the data, then I don't have the original data.
If I don't process the data, then I have duplicate, hard to use data.
I've considered storing both the original and the calculated. This feels like I'm getting the worst of both worlds: Some of my tables aren't original, and would need a full recalculation if they are wrong, while other tables are original but hard to use and have duplicate data.
To some of the points in the comments, I think which data you store depends on the needs of your application, and I would approach each set of data through a use case lens.
For the first use case, mood data, it sounds like there is value in being able to see this data over time (i.e. it appears that over the last month, my mood has been improving) as well as to pull up individual events (i.e. on date x, I ate a hamburger, how did this affect my mood in the subsequent mood entry after date x).
If it were me, I would create a Mood table, with two attributes:
Name
Id (pk)
This table would essentially serve as a definition table. Here you could add attributes specific to the mood (such as description).
I would then create a MoodHistory table with the following attributes:
- Timestamp
- MoodId
- IsCurrent (Boolean)
Before you enter a mood in your application, UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1, and then insert your new record with IsCurrent = 1. This structure is normalized and by indexing or partitioning by the IsCurrent column (and honestly even without any indexing/partitioning), even as your table grows quite large, you should always be able to query the current mood super quickly.
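The update-then-insert step described above can be sketched as follows. This is only an illustration using Python's standard-library sqlite3 module; the function names record_mood and current_mood are made up for the example, and the timestamp is stored as an ISO 8601 string, which is an assumption.

```python
import sqlite3


def create_schema(conn):
    """Create the Mood definition table and the MoodHistory table."""
    conn.executescript("""
        CREATE TABLE Mood (
            Id   INTEGER PRIMARY KEY,
            Name TEXT NOT NULL
        );
        CREATE TABLE MoodHistory (
            Timestamp TEXT    NOT NULL,
            MoodId    INTEGER NOT NULL REFERENCES Mood(Id),
            IsCurrent INTEGER NOT NULL DEFAULT 0
        );
    """)


def record_mood(conn, timestamp, mood_id):
    """Clear the old current flag, then insert the new entry as current."""
    with conn:  # both statements succeed or fail together
        conn.execute("UPDATE MoodHistory SET IsCurrent = 0 WHERE IsCurrent = 1")
        conn.execute(
            "INSERT INTO MoodHistory (Timestamp, MoodId, IsCurrent) VALUES (?, ?, 1)",
            (timestamp, mood_id),
        )


def current_mood(conn):
    """Return the name of the current mood, or None if nothing is recorded."""
    row = conn.execute("""
        SELECT m.Name
        FROM MoodHistory AS h
        JOIN Mood AS m ON m.Id = h.MoodId
        WHERE h.IsCurrent = 1
    """).fetchone()
    return row[0] if row else None
```

Running both statements in one transaction keeps the table from ever showing zero or two current rows to other readers.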
For your second use case, this is quite dependent not only on your planned usage, but where the data is coming from (particularly for routes). I'm not sure how you are planning on grouping locations into "routes" but if you clarify in the comments, I'm happy to add to my answer.
For locations, however, I'm assuming you're taking a location snapshot at some set time interval. I would create a LocationSnapshot table, structured similarly to the MoodHistory table, with the following attributes:
Timestamp
Latitude
Longitude
IsCurrent
By processing your IsCurrent data in a similar way to your MoodHistory data, it should be quite straightforward to grab the last entered location. You could also do some additional processing if you want to avoid duplicates. Essentially, before updating IsCurrent, query the row where IsCurrent = 1. Then compare that record's Latitude and Longitude to your new Latitude and Longitude before inserting the new record. If there is any change, proceed with the insert; otherwise, there is no need to insert a new record.
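A minimal sketch of that compare-before-insert step, again using Python's sqlite3 module purely for illustration. The function name record_location is hypothetical, and the comparison is an exact match on both coordinates; real GPS readings jitter, so a tolerance would probably be needed in practice.

```python
import sqlite3


def create_schema(conn):
    """Create the LocationSnapshot table described above."""
    conn.executescript("""
        CREATE TABLE LocationSnapshot (
            Timestamp TEXT    NOT NULL,
            Latitude  REAL    NOT NULL,
            Longitude REAL    NOT NULL,
            IsCurrent INTEGER NOT NULL DEFAULT 0
        );
    """)


def record_location(conn, timestamp, latitude, longitude):
    """Insert a snapshot only if the position changed. Return True if inserted."""
    with conn:
        current = conn.execute(
            "SELECT Latitude, Longitude FROM LocationSnapshot WHERE IsCurrent = 1"
        ).fetchone()
        # Assumption: exact equality counts as "no change".
        if current is not None and current == (latitude, longitude):
            return False
        conn.execute("UPDATE LocationSnapshot SET IsCurrent = 0 WHERE IsCurrent = 1")
        conn.execute(
            "INSERT INTO LocationSnapshot (Timestamp, Latitude, Longitude, IsCurrent) "
            "VALUES (?, ?, ?, 1)",
            (timestamp, latitude, longitude),
        )
    return True
```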
You could also create a table of known locations such as KnownLocation:
Latitude
Longitude
Name
Joining to this table ON Latitude and Longitude should tell you when you were spending time at a particular location, say "Home" vs "Work"

DynamoDB - Design 1 to Many relationship

I'm new at DynamoDB technologies but not at NoSQL (I've already done some project using Firebase).
Having read that a DynamoDB best practice is one table per application, I've been having a hard time working out how to design my 1 to N relationship.
I have this entity (pseudo-json):
{
machineId: 'HASH_ID'
machineConfig: /* a lot of fields */
}
A machineConfig is unique for each machine and changes rarely, and only by an administrator (no consistency issue here).
The issue is that I have to manage a log of data from the sensors of each machine. The log is described as:
{
machineId: 'HASH_ID',
sensorsData: [
/* Huge list of: */
{ timestamp: ..., data: /* lot of fields */ },
...
]
}
I want to keep my machineConfig in one place. The log list can't be inserted into the machine entity because it's a continuous stream of data taken over time.
Furthermore, I don't understand which could be the composite key, the partition key obviously is the machineId, but what about the order key?
How to design this relationship taking into account the potential dimensions of data?
You could do this with 1 table. The primary key could be (machineId, sortKey) where machineId is the partition key and sortKey is a string attribute that is going to be used to cover the 2 cases. You could probably come up with a better name.
To store the machineConfig you would insert an item with primary key (machineId, "CONFIG"). The sortKey attribute would have the constant value CONFIG.
To store the sensorsData you could use the timestamp as the sortKey value. You would insert a new item for each piece of sensor data. You would store the timestamp as a string (as time since the epoch, ISO8601, etc)
Then to query everything about a machine you would run a Dynamo query specifying just the machineId partition key - this would return many items including the machineConfig and the sensor data.
To query just the machineConfig you would run a Dynamo query specifying the machineId partition key and the constant CONFIG as the sortKey value
To query the sensor data you could specify an exact timestamp or a timestamp range for the sortKey. If you need to query the sensor data by other values then this design might not work as well.
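To make the key layout concrete, here is a small in-memory sketch in Python. It does not call DynamoDB at all: the table is a plain dict keyed by (machineId, sortKey), and put_item and query are hypothetical helpers that only mimic the three access patterns described above. Timestamps are assumed to be ISO 8601 strings so that string comparison matches chronological order.

```python
def put_item(table, machine_id, sort_key, attributes):
    """Store one item under the composite key (machineId, sortKey)."""
    table[(machine_id, sort_key)] = dict(attributes)


def query(table, machine_id, sort_key=None, between=None):
    """Return items for one partition key, optionally narrowed by sort key.

    sort_key: exact sort key match (e.g. "CONFIG" or one timestamp).
    between:  (low, high) inclusive range over the sort key.
    """
    items = []
    for (pk, sk), attributes in table.items():
        if pk != machine_id:
            continue
        if sort_key is not None and sk != sort_key:
            continue
        if between is not None and not (between[0] <= sk <= between[1]):
            continue
        items.append({"machineId": pk, "sortKey": sk, **attributes})
    # Items within a partition come back ordered by sort key.
    return sorted(items, key=lambda item: item["sortKey"])


table = {}
put_item(table, "machine-1", "CONFIG", {"machineConfig": {"maxRpm": 1200}})
put_item(table, "machine-1", "2024-01-01T00:00:00Z", {"data": {"temp": 20}})
put_item(table, "machine-1", "2024-01-02T00:00:00Z", {"data": {"temp": 22}})

everything = query(table, "machine-1")
config_only = query(table, "machine-1", sort_key="CONFIG")
first_day = query(
    table, "machine-1", between=("2024-01-01T00:00:00Z", "2024-01-01T23:59:59Z")
)
```

Note that the constant "CONFIG" sorts after any timestamp starting with a digit, so a timestamp range never picks up the config item by accident.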
Editing to answer follow up question:
You would have to resort to a scan with a filter to return all machines with their machineId and machineConfig. If you end up inserting a lot of sensor data then this will be a very expensive operation to perform as Dynamo will look at every item in the table. If you need to do this you have a couple of options.
If there are not a lot of machines you could insert an item with a primary key like ("MACHINES", "ALL") and a list of all the machineIds. You would query on that key to get the list of machineIds, then you would do a bunch of queries (or a batch get) to retrieve all the related machineConfigs. However since the max Dynamo item size is 400KB you might not be able to fit them all.
If there are too many machines to fit in one item you could alter the above approach a bit and have ("MACHINES", $machineIdSubstring) as a primary key and store chunks of machineIds under each sort key. For example, all machineIds that start with 0 go in ("MACHINES", "0"). Then you would query by each primary key 0-9, build a list of all machineIds and query each machine as above.
Alternatively, you don't have to put everything in 1 table - it is just a guideline that fits a lot of use cases. If there are too many machines to fit in less than 400KB but there aren't tens of thousands and you aren't trying to query all of them all the time, you could have a separate table of machineId and machineConfig that you resort to scanning when necessary.

Database / table structure - similar entries vs. too much normalization?

I have designed this relational database that is keeping track of various assets and their owners over time. One of the most important pieces of analysis I want to do is to track the value of those assets over time: expected original cost, actual original cost, actual cost, etc. So I have been putting data relative to a cost / value in a separate table called “Support_Value”. To complicate things, some of the assets I’m tracking are in countries with foreign currencies, so I’m collecting cost / value data in US Dollars but also in local currencies (“LC”), which ends up doubling the number of columns I have in this table. I also use this table as a way to keep track of the value of the asset owners themselves in a similar fashion.
- The columns of this table are the following:
My initial plan was to carve out separate tables to deal with (1) the various “qualities” of entries relative to cost and value (i.e. “planned”, “upper” bound, “lower” bound, “estimated” by analysts, and “actual”) and (2) currencies. But I realize this is likely to break, as it doesn’t allow an initial “planned” cost to be subsequently revised unless we make that explicit by creating new columns for the revisions, and there can be more than one revision. So it is still not perfect.
What I’m now envisaging is to create a different value table that would have the following columns:
ID (PK representing individual instances of cost / value estimates)
Currency (FK to my currency table)
Asset (FK to my assets table) - i.e. what this cost or value is referring to
Date (FK to my date table) - i.e. to track revisions actually
Type (i.e. “cost" or “value")
Quality (i.e. “planned”, “upper”, “lower”, “estimated”, “actual”)
Valuation - i.e. the actual absolute amount in the currency designated in the second column
What do you think of this approach? Is this an improvement?
Thanks for any suggestion you could have!
Both approaches are fine.
But, if you think you may need additional similar columns,
then the second approach is more extensible.
Your second approach does look overnormalized, though;
I suggest splitting the "Quality" column back into its parts.
Something like:
"ID"
"Currency"
"Asset"
"Date"
"Type"
"Planned"
"Lower"
"Upper"
"Estimated"
"Actual"
"Valuation"
Cheers.

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has its account settings, privacy settings and lots of other properties to set. The number of those properties started to grow and I could end up with 30 properties or so.
Till now, I used to keep it in "UserInfo" table having User and UserInfo related as One-To-Many (keeping a log of all changes). Putting it in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings and other "groups" of settings in separate tables and have 1-1 relations between UserInfo and each group of settings table is one solution, but would that be too slow (or much slower) when retrieving the data? I guess all data would not be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns every day may become time-consuming as you get lots of rows.
For an alternative, check out this blog post, which describes a nifty way of storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
user_id SERIAL PRIMARY KEY,
user_properties TEXT
);
This makes it hard to search for a given value in a given property. So in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: values of the given property, and a foreign key back to the main table where that particular value is found. Now you can index the column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
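The synchronization chore can be sketched like this. It is a minimal illustration in Python with sqlite3 standing in for MySQL and JSON as the serialization format; save_user and find_users_by_birthdate are hypothetical helper names, and only one auxiliary table is shown.

```python
import json
import sqlite3


def create_schema(conn):
    """Create the main table and one indexed auxiliary table."""
    conn.executescript("""
        CREATE TABLE Users (
            user_id         INTEGER PRIMARY KEY,
            user_properties TEXT NOT NULL      -- JSON-encoded properties
        );
        CREATE TABLE UserBirthdate (
            user_id   INTEGER PRIMARY KEY REFERENCES Users(user_id),
            birthdate TEXT NOT NULL
        );
        CREATE INDEX idx_userbirthdate_birthdate ON UserBirthdate (birthdate);
    """)


def save_user(conn, user_id, properties):
    """Write the properties blob and keep UserBirthdate in sync with it."""
    with conn:  # one transaction, so both tables change together
        conn.execute("DELETE FROM UserBirthdate WHERE user_id = ?", (user_id,))
        conn.execute(
            "INSERT OR REPLACE INTO Users (user_id, user_properties) VALUES (?, ?)",
            (user_id, json.dumps(properties)),
        )
        if "birthdate" in properties:
            conn.execute(
                "INSERT INTO UserBirthdate (user_id, birthdate) VALUES (?, ?)",
                (user_id, properties["birthdate"]),
            )


def find_users_by_birthdate(conn, birthdate):
    """Use the indexed auxiliary table to find users, then decode their blobs."""
    rows = conn.execute("""
        SELECT u.user_id, u.user_properties
        FROM Users AS u
        INNER JOIN UserBirthdate AS b ON b.user_id = u.user_id
        WHERE b.birthdate = ?
        ORDER BY u.user_id
    """, (birthdate,)).fetchall()
    return [(user_id, json.loads(blob)) for user_id, blob in rows]
```

Each additional searchable property would need its own table and its own branch in save_user, which is where the complexity mentioned above comes from.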

What is the best way to keep changes history to database fields?

For example I have a table which stores details about properties. Which could have owners, value etc.
Is there a good design to keep the history of every change to owner and value? I want to do this for many tables. Kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other meta data such as field_type, user_id, user_ip, action (update, delete, insert) etc.. can be useful.
The structure of such records will most likely need to be transformed to be used.
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database create a generalized table that has all the fields as the original record, plus a versioning field (additional meta data again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or have no indexes at all and no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, but of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity into INSERT activity, and to change DELETE activity into UPDATing the IS_DELETED boolean.
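A rough sketch of how reads and writes look against such a time-dimensioned table, written in Python with sqlite3 standing in for Oracle. The helper names write_version and as_of are hypothetical, dates are assumed to be ISO 8601 strings, and a delete is recorded here as a new version with is_deleted = 1, which is one possible reading of the paragraph above rather than the only one.

```python
import sqlite3


def create_schema(conn):
    """Create the time-dimensioned table: one row per version of a record."""
    conn.executescript("""
        CREATE TABLE my_table (
            my_table_id    INTEGER NOT NULL,
            attr1          TEXT    NOT NULL,
            attr2          INTEGER,
            effective_date TEXT    NOT NULL,
            is_deleted     INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (my_table_id, effective_date)
        );
    """)


def write_version(conn, row_id, attr1, attr2, effective_date, is_deleted=0):
    """An UPDATE (or DELETE) becomes an INSERT of a new version."""
    with conn:
        conn.execute(
            "INSERT INTO my_table "
            "(my_table_id, attr1, attr2, effective_date, is_deleted) "
            "VALUES (?, ?, ?, ?, ?)",
            (row_id, attr1, attr2, effective_date, is_deleted),
        )


def as_of(conn, row_id, date):
    """Return (attr1, attr2) as they were on the given date, or None."""
    row = conn.execute("""
        SELECT attr1, attr2, is_deleted
        FROM my_table
        WHERE my_table_id = ? AND effective_date <= ?
        ORDER BY effective_date DESC
        LIMIT 1
    """, (row_id, date)).fetchone()
    if row is None or row[2]:
        return None  # not yet created, or deleted as of that date
    return (row[0], row[1])
```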
Unreason:
You are correct that this solution is similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID is the id of the history record (not really required).
RecordID points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have latest value in properties and all the history(previous values) in properties_audit.
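The trigger variant of this scheme might look roughly like the following. The sketch uses Python's sqlite3 module and SQLite trigger syntax, which differs between databases; the trigger name properties_keep_history and the column name changed_at are assumptions made for the example.

```python
import sqlite3


def create_schema(conn):
    """Create properties, its audit table, and a trigger that copies old values."""
    conn.executescript("""
        CREATE TABLE properties (
            ID     INTEGER PRIMARY KEY,
            value1 TEXT,
            value2 TEXT
        );
        CREATE TABLE properties_audit (
            ID         INTEGER PRIMARY KEY,
            RecordID   INTEGER NOT NULL,
            changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            value1     TEXT,
            value2     TEXT
        );
        -- On every update, save the values the row had before the change.
        CREATE TRIGGER properties_keep_history
        AFTER UPDATE ON properties
        FOR EACH ROW
        BEGIN
            INSERT INTO properties_audit (RecordID, value1, value2)
            VALUES (OLD.ID, OLD.value1, OLD.value2);
        END;
    """)


def history(conn, record_id):
    """Return the previous (value1, value2) pairs for a record, oldest first."""
    return conn.execute(
        "SELECT value1, value2 FROM properties_audit "
        "WHERE RecordID = ? ORDER BY ID",
        (record_id,),
    ).fetchall()
```

With the trigger in place, application code keeps issuing ordinary UPDATE statements against properties and the history accumulates on its own.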
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save current and previous values in the audit tables. When you make a change to any of the fields you just have to add a row in the audit table with the changed value. This way you can always sort the audit table on time and know what was the previous value in the field prior to your change.
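Looking up the previous value by sorting on time could be sketched as below, using Python's sqlite3 module. The schema lists no column identifying which row was changed, so record_id is added here as an assumption; log_change and previous_value are hypothetical helper names, and times are assumed to be ISO 8601 strings.

```python
import sqlite3


def create_schema(conn):
    """Create a single audit table storing only the new value of each change."""
    conn.executescript("""
        CREATE TABLE audit (
            table_name TEXT    NOT NULL,
            record_id  INTEGER NOT NULL,   -- assumption: not in the schema above
            field_name TEXT    NOT NULL,
            value      TEXT,
            time       TEXT    NOT NULL,
            userId     INTEGER
        );
    """)


def log_change(conn, table_name, record_id, field_name, value, time, user_id):
    """Append one row per changed field."""
    with conn:
        conn.execute(
            "INSERT INTO audit (table_name, record_id, field_name, value, time, userId) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (table_name, record_id, field_name, value, time, user_id),
        )


def previous_value(conn, table_name, record_id, field_name, before_time):
    """Return the value the field held just before before_time, or None."""
    row = conn.execute("""
        SELECT value
        FROM audit
        WHERE table_name = ? AND record_id = ? AND field_name = ? AND time < ?
        ORDER BY time DESC
        LIMIT 1
    """, (table_name, record_id, field_name, before_time)).fetchone()
    return row[0] if row else None
```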
