PostgreSQL Replace Column Data with Unique Integers

I have a table with many columns in a PostgreSQL database. Some of these columns are of type text and hold values that recur across many rows. I would like to replace these text values with unique integer values.
This is my table column:
Country_Name
------------
USA
Japan
Mexico
USA
USA
Japan
England
and the new column I want is:
Country_Name
------------
1
2
3
1
1
2
4
Each country name is assigned (mapped) to a unique integer, and every recurrence of the text is replaced with this number. How can I do this?
Edit 1: I want to replace my column values on the fly if possible. I don't actually need another column to keep the names, but it would be nice to see the actual values too. Is it possible to:
Create a column country_id in the same table with the same values as the country_name column
Then, for country_id, replace each name with a unique integer via an update statement or procedure, without requiring a new table, dictionary, or map.
I don't know if this is possible, but it would speed things up because I have a total of 220 columns and millions of rows. Thank you.

Assuming the country_name column is in a table called country_data:
Create a new table and populate it with the unique country names:
-- valid in pg10 onwards
-- for earlier versions use SERIAL instead in the PK definition
CREATE TABLE countries (
    country_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    country_name TEXT
);
INSERT INTO countries (country_name)
SELECT DISTINCT country_name
FROM country_data;
Alter the table country_data and add a country_id column:
ALTER TABLE country_data ADD COLUMN country_id INT;
Join country_data to countries and populate the country_id column:
UPDATE country_data
SET country_id = countries.country_id
FROM countries
WHERE country_data.country_name = countries.country_name;
At this point country_id is available to query, but a few follow-up actions may be recommended depending on the use case (sketched below):
- set up country_data.country_id as a foreign key referencing countries.country_id
- drop the column country_data.country_name, as it is now redundant through the relationship with countries
- maybe create an index on country_data.country_id if you determine that it will speed up the queries you normally run on this table
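A minimal sketch of those follow-up steps, assuming the table and column names above (the constraint and index names are illustrative):
ALTER TABLE country_data
    ADD CONSTRAINT fk_country_data_countries
    FOREIGN KEY (country_id) REFERENCES countries (country_id);
ALTER TABLE country_data DROP COLUMN country_name;
CREATE INDEX idx_country_data_country_id ON country_data (country_id);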

Related

Bad design to compare to computed columns?

Using SQL Server, I have a table with a computed column. That column concatenates 60 columns:
CREATE TABLE foo
(
    Id INT NOT NULL,
    PartNumber NVARCHAR(100),
    field_1 INT NULL,
    field_2 INT NULL,
    -- ... and so forth up to field_60
    field_60 INT NULL
);
ALTER TABLE foo
ADD RecordKey AS CONCAT(field_1, '-', field_2, '-', -- and so on up to field_60
    field_60) PERSISTED;
CREATE INDEX ix_foo_RecordKey ON dbo.foo (RecordKey);
Why I used a persisted column:
Not having the need to index 60 columns
To test to see if a current record exists by checking just one column
This table will contain no fewer than 20 million records. Adds/inserts/updates happen a lot, some binaries do tens of thousands of inserts/updates/deletes per run, and we want these to be quick and live.
Currently we have C# code that manages records in table foo. It has a function that concatenates the same fields, in the same order, as the computed column. If a record with that same concatenated key already exists, we might not insert, or we might insert but call other functions that we normally would not.
Is this a bad design? The big danger I see is if the code for any reason doesn't match the concatenation order of the computed column (if one is edited but not the other).
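For illustration, the existence check the C# code relies on amounts to something like this in T-SQL (@RecordKey is a hypothetical variable holding the concatenated key):
DECLARE @RecordKey NVARCHAR(4000) = N'1-2-...'; -- hypothetical concatenated key
IF EXISTS (SELECT 1 FROM dbo.foo WHERE RecordKey = @RecordKey)
    PRINT 'record already exists'; -- take the alternate code path here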
Rules/Requirements
We want to show records in JQGrid. We already have C# that can do so if the records come from a single table or view
We need the ability to check two records to verify if they both have the same values for all of the 60 columns
A better table design would be:
parts table
-----------
id
partnumber
other_common_attributes_for_all_parts
attributes table
----------------
id
attribute_name
attribute_unit (if needed)
part_attributes table
---------------------
part_id (foreign key to parts)
attribute_id (foreign key to attributes)
attribute_value
It looks complicated, but with proper indexing this is super fast even if part_attributes contains billions of records!
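A minimal DDL sketch of this design, using the names outlined above and written for SQL Server to match the question (the column types are assumptions):
CREATE TABLE parts
(
    id INT IDENTITY PRIMARY KEY,
    partnumber NVARCHAR(100)
    -- other common attributes for all parts
);
CREATE TABLE attributes
(
    id INT IDENTITY PRIMARY KEY,
    attribute_name NVARCHAR(100),
    attribute_unit NVARCHAR(20) NULL -- if needed
);
CREATE TABLE part_attributes
(
    part_id INT NOT NULL REFERENCES parts (id),
    attribute_id INT NOT NULL REFERENCES attributes (id),
    attribute_value NVARCHAR(100),
    PRIMARY KEY (part_id, attribute_id)
);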

Postgres index performance with duplicated data

I have a table with the following columns:
users
id | name | city
--------+---------+-----------
As many users are in the same city, the same city can appear many times. By adding an index on it, do we get a performance benefit given that values are repeated? Or would it negatively affect performance?
Query I'm trying to run
SELECT DISTINCT city FROM users;
The real solution here, IMHO, is to normalize that table and create a new City table, like this:
User
idUser: Primary key
Name
City_idCity <-- foreign key to the City table
City
idCity: Primary key
Name
This way, there is no extra index to maintain (idCity is the primary key of table City anyway). To get a list of existing cities, just run:
SELECT Name FROM City;
Thus you do not need DISTINCT, and you avoid a full table scan.
You will also get the other benefits of normalized tables.
For example, is the city of New-York the same as New York? Or NewYork? Normalizing ensures users always have one single choice of city and cannot create duplicates the way they would in a free-text field.
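A minimal Postgres sketch of that normalized design (names taken from the outline above; User is quoted because it is a reserved word):
CREATE TABLE City (
    idCity INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    Name TEXT NOT NULL
);
CREATE TABLE "User" (
    idUser INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    Name TEXT,
    City_idCity INT REFERENCES City (idCity)
);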

SSIS Lookup Suggestion

I have a task to generate a Derived Column (RestrictionID) from a table and add it to a source table if 6 columns on both tables match. The 6 columns include Country, Department, Account, etc. I decided to go with the SSIS Lookup and generate the derived column when there's a match. This ID also has a time and amount limit. Once all the records have IDs, I'm supposed to calculate a running total based on the ID to enforce the limits, which is the easy part.
The only problem is that this lookup table changes almost daily, and any or all of the 6 columns can have NULLs. Even the completely null rows have an ID. Nulls mean the restriction is open. For example, if the Country column on one record in the lookup table is null, then the ID of that record can be assigned to records with any country in the source. If one row in the lookup has all null columns, then it is completely open and all records in the source qualify for that ID. The source table doesn't have NULLs.
Please assist if possible
Thanks
If NULL means 'match anything' for a lookup column, then use a stored procedure, pass your values in, and return the ID with a WHERE clause like this:
select lookup.ID
from lookup
where @Country = isnull(lookup.Country, @Country) -- if the lookup value is null it compares to itself, creating a 1=1 scenario
  and @Department = isnull(lookup.Department, @Department)
  and ...
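A minimal sketch of such a stored procedure (the procedure and parameter names are assumptions, and only two of the six columns are shown):
create procedure dbo.GetRestrictionID
    @Country nvarchar(100),
    @Department nvarchar(100)
as
begin
    set nocount on;
    select lookup.ID
    from lookup
    where @Country = isnull(lookup.Country, @Country)
      and @Department = isnull(lookup.Department, @Department);
end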

what is the correct way to design a 'table to row' relationship?

I am trying to model the following in a Postgres DB.
I have N 'datasets'. These datasets are things like survey results, national statistics, aggregated data, etc. They each have a name, a source institution, a method, and so on. This is the metadata of a dataset, and I have tables created for it, plus tables for codifying the research methods etc. The 'root' metadata table is called 'Datasets'. Each row represents one dataset.
I then need to store and access the actual data associated with each dataset, so I need to create a table that contains that data. How do I represent the relationship between this table and its corresponding row in the 'Datasets' table?
An example:
'hea' is a set of survey responses. It is unaggregated, so each row is one survey response. I create a table called 'HeaData' that contains this data.
'cso' is a set of aggregated employment data. Each row is an economic sector. I create a table called 'CsoData' that contains this data.
I create a row for each of these in the 'Datasets' table with the relevant metadata, and they have ids of 1 & 2 respectively.
What is the best way to relate 1 to the HeaData table and 2 to the CsoData table?
I will eventually be accessing this data with Scala Slick, so if the database design could just 'plug and play' with Slick, that would be ideal.
Add a column to the Datasets table which designates which type of dataset it represents. Then a 1 may mean HEA and 2 may mean CSO. A check constraint would limit the field to one of the two values. If new types of datasets are added later, the only change needed is to change the constraint. If it is defined as a foreign key to a "type of dataset" table, you just need to add the new type of dataset there.
Form a unique index on the PK and the new field.
Add the same field to each of the subtables, but with a check constraint limiting the value in the HEA table to only that value and the CSO table to only that value. Then make the ID field and the new field together a foreign key to the Datasets table.
This limits the ID value to only one of the subtables and it must be the one defined in the Datasets table. That is, if you define a HEA dataset entry with an ID value of 1000 and the HEA type value, the only subtable that can contain an ID value of 1000 is the HEA table.
create table Datasets(
    ID int generated by default as identity, -- identity/auto-generated, per your DBMS
    DSType char( 3 ) check( DSType in( 'HEA', 'CSO' ) ),
    -- everything else,
    constraint PK_Datasets primary key( ID ),
    constraint UQ_Dataset_Type unique( ID, DSType ) -- needed for references
);
create table HEA(
    ID int not null,
    DSType char( 3 ) check( DSType = 'HEA' ), -- making this a constant value
    -- other HEA data,
    constraint PK_HEA primary key( ID ),
    constraint FK_HEA_Dataset_PK foreign key( ID )
        references Datasets( ID ),
    constraint FK_HEA_Dataset_Type foreign key( ID, DSType )
        references Datasets( ID, DSType )
);
The same idea applies to the CSO subtable.
I would recommend an HEA and CSO view that would show the complete dataset rows, metadata and type-specific data, joined together. With triggers on those views, they can be the DML points for the application code. Then the apps don't have to keep track of how that data is laid out in the database, making it a lot easier to make improvements should the opportunity present itself.
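A minimal sketch of such a view, based on the tables above (the view name is illustrative, and the elided column lists are left as comments):
create view HEA_Full as
select d.ID, d.DSType -- plus the other Datasets metadata columns and the HEA-specific columns
from Datasets d
join HEA h on h.ID = d.ID;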

Access Relationship Design

I am fairly green when it comes to working with Access and databases in general.
I am asking for your help in figuring out how to set the correct relationships for three tables:
Table 1 contains:
(no unique ID)
SalesTripID
EmployeeName
StartDate
EndDate
*Each record on this table is related to 1 specific employee's 1 specific sales trip
Table 2 contains:
HotelName
HotelStart
HotelEnd
HotelTotal
*This table may contain multiple records that belong to only 1 record on table 1 (for instance, an employee would stay at 2 hotels during their sales trip)
Table 3 contains:
(no unique ID)
MealVendor
MealDate
MealTotal
*This table, similar to Table 2, may have multiple records in it that are tied to the 1 SalesTripID
How do I set something up to show me each SalesTripID, the multiple Table 2 records, and the multiple Table 3 records associated with it? Do I need to add a Primary Key to anything other than Table 1? Is writing a query involved to display the information? Because I am so green, any and all feedback is welcome.
The following is my recommendation:
Add a SalesTripId field to tables 2 and 3. This is called a foreign key.
If SalesTripId in Table 1 is not unique (i.e. each employee can have a trip with the same Id as another employee), add another field (Id) to Table 1. You can use Access' AutoNumber type for that field.
I recommend always having a primary key in your tables, but you can skip the Id fields in tables 2 and 3.
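Once the SalesTripId foreign-key fields exist on tables 2 and 3, a query along these lines shows each trip with its hotel and meal records (Access SQL; table and field names assumed from the question, with SalesTripId taken as unique in Table1):
SELECT t1.SalesTripID, t1.EmployeeName, t2.HotelName, t2.HotelTotal, t3.MealVendor, t3.MealTotal
FROM (Table1 AS t1 LEFT JOIN Table2 AS t2 ON t2.SalesTripId = t1.SalesTripID)
LEFT JOIN Table3 AS t3 ON t3.SalesTripId = t1.SalesTripID;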
