Postgres index performance with duplicated data


Having a table with the following columns:
users
id | name | city
--------+---------+-----------
Since many users live in the same city, the same city value can appear many times. If I add an index on city, do the repeated values give a performance benefit, or would the index hurt performance?
Query I'm trying to run:
SELECT DISTINCT city FROM users;

The real solution here, IMHO, is to normalize that table and create a new City table, like this:
User
idUser: Primary key
Name
City_idCity <-- foreign key to the City table
City
idCity: Primary key
Name
This way, there is no additional index to maintain (idCity is the primary key of the City table anyway). To get a list of existing cities, just run:
SELECT Name FROM City
Thus you do not need DISTINCT, and you avoid a full scan of the User table.
You will also get the other benefits of normalized tables.
For example, is the city of New-York the same as New York? Or NewYork? With a City table, users always pick from a single list of cities and cannot create duplicates the way they would in a free-text field.
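To make the normalized layout concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Postgres (an assumption made so the example runs anywhere); the sample rows are made up, and the table name is quoted because USER is a reserved word in Postgres.

```python
import sqlite3

# SQLite in-memory database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE City (
        idCity INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL UNIQUE      -- one row per city, no spelling variants
    );
    CREATE TABLE "User" (
        idUser      INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL,
        City_idCity INTEGER NOT NULL REFERENCES City(idCity)
    );
    -- Made-up sample data for illustration.
    INSERT INTO City (idCity, Name) VALUES (1, 'New York'), (2, 'Boston');
    INSERT INTO "User" (Name, City_idCity) VALUES
        ('Ann', 1), ('Bob', 1), ('Cid', 2);
""")

# The city list comes straight from the small City table; no DISTINCT needed.
cities = [name for (name,) in conn.execute("SELECT Name FROM City ORDER BY Name")]

# Per-city user counts still work through the foreign key.
users_per_city = dict(conn.execute("""
    SELECT c.Name, COUNT(*)
    FROM "User" AS u
    JOIN City AS c ON c.idCity = u.City_idCity
    GROUP BY c.Name
"""))
```

The UNIQUE constraint on City.Name is an addition of this sketch, not part of the schema above; it is what actually stops a second spelling of the same city from being inserted twice.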

Related

try to add a new column as foreign key in existing table with data and existing data manipulation

A very simple example. I have a web API with a table in the database:
Employees
---------
Id
Name
and, for example, I have 50 records.
Now I have to implement a feature to add extra info about the department. Because this is a one-to-many relationship, the new database schema adds a department id:
Employees        Department
---------        ----------
Id               Id
Name             Name
DepartmentId
For this, I run the following query (I use SQL Server):
alter table Employees add constraint fk_employees_departmentid
foreign key (DepartmentId) references Department(Id);
But now I have some issues to handle
1) Now I have the 50 existing records without a DepartmentId. Must I add this value manually? What is the best practice? For 50 records it is possible, but what about 2000 records or more?
2) When I add the DepartmentId column I allow null values (is that correct?), but as a foreign key, I don't want to allow null values. Can I change it, or how can I handle it?

1) Now I have the 50 existing records without a DepartmentId. Must I add this value manually? What is the best practice? For 50 records it is possible, but what about 2000 records or more?
It depends. You could set up a new department for "unassigned" and assign them all to that, or you could send a spreadsheet to HR saying "the following employees don't have an assigned department; what department are they in? PS: don't remove the EmployeeID column from the sheet before you send it back; I need it to update the DB". It's very much a business-context question, not a technical one. A few thousand records are easy to handle; it'll just take a bit of time to work through if you (or someone else) are doing it manually. This information is likely to be available somewhere else; you could perhaps send a list to all department heads saying "are any of these people yours? Please remove all the names that aren't in your team from this spreadsheet and send it back to me", then update the DB based on what you get back.
As this is a one-time operation you don't need anything particularly whizzy for it. You can just get your Excel sheet back and, in an empty column, put:
="UPDATE Employees SET DepartmentId = 5 WHERE Id = " & A1
Fill it down to generate a bunch of update statements, copy the text into your query tool and hit go. There's no need to get fancy by loading the sheet into a table, doing update joins, etc.; just sling together something hacky in Excel that writes the SQL for you, then copy/paste/run. If HR have sent back the sheet with a list of department names, put the department name and id somewhere else on the sheet and use VLOOKUP or XLOOKUP to turn the name into the department number, then compose your SQL based on that.
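If you'd rather script the same trick than do it in Excel, a rough sketch follows. The function name build_updates, the (employee id, department name) row shape and the name-to-id dictionary are all hypothetical; the dictionary plays the role of the VLOOKUP range.

```python
from typing import Dict, Iterable, List, Tuple


def build_updates(rows: Iterable[Tuple[int, str]],
                  dept_ids: Dict[str, int]) -> List[str]:
    """Turn (employee id, department name) rows into UPDATE statements.

    rows     -- what came back from HR, one tuple per employee (assumed shape)
    dept_ids -- department name -> Department.Id lookup (the VLOOKUP table)
    """
    statements = []
    for emp_id, dept_name in rows:
        # A KeyError here means HR used a department name that is not in the lookup.
        dept_id = dept_ids[dept_name.strip()]
        # int() guards against anything non-numeric ending up in the SQL text.
        statements.append(
            f"UPDATE Employees SET DepartmentId = {int(dept_id)} "
            f"WHERE Id = {int(emp_id)};"
        )
    return statements


# Usage with made-up data:
example = build_updates([(1, "Sales"), (2, " IT ")], {"Sales": 5, "IT": 7})
```

Paste the resulting statements into your query tool exactly as you would the Excel-generated ones.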
2) When I add the DepartmentId column I allow null values (is that correct?), but as a foreign key, I don't want to allow null values. Can I change it, or how can I handle it?
Foreign key columns are allowed to have NULL values. It isn't the FK that imposes a "no nulls" restriction; it's the nullability of the column (alter the column to DepartmentId INT NOT NULL) that imposes that. An FK references a primary key, and the primary key may not be null (or, in some DBs, at most one record can have a [partly] null PK), but you could just leave those DepartmentId values null. If you do alter the column to be NOT NULL, you'll need to correct the NULL values first or the change will fail.
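A small sketch of that behaviour, using Python's sqlite3 as a stand-in for SQL Server (an assumption; constraint syntax and ALTER support differ between engines). It shows that a NULL in the FK column is accepted, that a non-existent department id is rejected, and that backfilling with a hypothetical "Unassigned" department leaves no NULLs behind, which is the precondition for making the column NOT NULL afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE Department (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Employees (
        Id           INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        DepartmentId INTEGER REFERENCES Department(Id)  -- nullable FK column
    );
""")

# A NULL foreign key value is fine: the FK only checks non-NULL values.
conn.execute("INSERT INTO Employees (Name, DepartmentId) VALUES ('Ann', NULL)")

# A value that matches no Department row is rejected.
try:
    conn.execute("INSERT INTO Employees (Name, DepartmentId) VALUES ('Bob', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Backfill: park everyone without a department in "Unassigned".
unassigned_id = conn.execute(
    "INSERT INTO Department (Name) VALUES ('Unassigned')"
).lastrowid
conn.execute(
    "UPDATE Employees SET DepartmentId = ? WHERE DepartmentId IS NULL",
    (unassigned_id,),
)

# Only once this count is zero can the column safely be made NOT NULL.
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM Employees WHERE DepartmentId IS NULL"
).fetchone()[0]
```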

PostgreSQL Replace Column Data with Unique Integers

I have a table with many columns in a PostgreSQL database. Some of these columns are of type text, and their values recur across rows. I would like to replace these text values with unique integer values.
This is my table column:
Country_Name
------------
USA
Japan
Mexico
USA
USA
Japan
England
and the new column I want is:
Country_Name
------------
1
2
3
1
1
2
4
Each country name is mapped to a unique integer, and every recurrence of the text is replaced with that number. How can I do this?
Edit 1: I want to replace my column values in place if possible. I don't actually need another column to keep the names, but it would be nice to see the actual values too. Is it possible to:
Create a column country_id with the same values as the country_name column in the same table
And, for country_id, replace each name with a unique integer using an update statement or procedure, without requiring a new table, dictionary or map.
I don't know if this is possible, but it would speed things up because I have a total of 220 columns and millions of rows. Thank you.

Assuming the country_name column is in a table called country_data:
Create a new table and populate it with the unique country names
-- valid in pg10 onwards
-- for earlier versions use SERIAL instead in the PK definition
CREATE TABLE countries (
country_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
country_name TEXT);
INSERT INTO countries (country_name)
SELECT DISTINCT country_name
FROM country_data;
Alter the table country_data and add a column country_id
ALTER TABLE country_data ADD COLUMN country_id INT;
Join country_data to countries and populate the country_id column
UPDATE country_data
SET country_id = countries.country_id
FROM countries
WHERE country_data.country_name = countries.country_name;
At this point the country_id is available to query, but a few follow-up actions may be worthwhile depending on the use case:
set up country_data.country_id as a foreign key referring to countries.country_id
drop the column country_data.country_name, as it is redundant given the relationship with countries
maybe create an index on country_data.country_id if you determine that it will speed up the queries you normally run on this table.
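The steps above can be tried end to end with a small sketch. It runs against an in-memory SQLite database through Python's sqlite3 (an assumption, as a stand-in for Postgres), so the identity column becomes INTEGER PRIMARY KEY and the UPDATE uses a correlated subquery instead of UPDATE ... FROM to stay portable across SQLite versions. The sample rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country_data (country_name TEXT);
    INSERT INTO country_data (country_name) VALUES
        ('USA'), ('Japan'), ('Mexico'), ('USA'), ('USA'), ('Japan'), ('England');

    -- Lookup table: one row and one generated id per distinct name.
    CREATE TABLE countries (
        country_id   INTEGER PRIMARY KEY,
        country_name TEXT UNIQUE
    );
    INSERT INTO countries (country_name)
    SELECT DISTINCT country_name FROM country_data;

    ALTER TABLE country_data ADD COLUMN country_id INTEGER;

    -- Portable variant of the UPDATE ... FROM join.
    UPDATE country_data
    SET country_id = (
        SELECT c.country_id
        FROM countries AS c
        WHERE c.country_name = country_data.country_name
    );
""")

rows = conn.execute(
    "SELECT country_name, country_id FROM country_data ORDER BY rowid"
).fetchall()

# Every occurrence of a name should carry the same id.
name_to_id = {}
for name, country_id in rows:
    name_to_id.setdefault(name, country_id)
```

Which integer each country receives depends on the order in which SELECT DISTINCT returns the names, so don't rely on USA being 1; only the one-to-one mapping is guaranteed.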

Handling multi-select list in database design

I'm creating a clinic management system where I need to store Medical History for a patient. The user can select multiple history conditions for a single patient, however, each clinic has its own fixed set of Medical History fields.
For example:
Clinic 1:
DiseaseOne
DiseaseTwo
DiseaseThree
Clinic 2:
DiseaseFour
DiseaseFive
DiseaseSix
For a patient visit in a specific clinic, the user should be able to check 1 or more diseases for the patient's medical history, based on the clinic type.
I thought of two ways of storing the Medical History data:
First Option:
Add the fields to the corresponding clinic Patient Visit Record:
PatientClinic1VisitRecord:
PatientClinic1VisitRecordId
VisitDate
MedHist_DiseaseOne
MedHist_DiseaseTwo
MedHist_DiseaseThree
And fill up each MedHist field with the value "True/False" based on the user input.
Second Option:
Have a single MedicalHistory Table that holds all Clinics Medical History detail as well as another table to hold the Patient's medical history in its corresponding visit.
MedicalHistory
ClinicId
MedicalHistoryFieldId
MedicalHistoryFieldName
MedicalHistoryPatientClinicVisit
VisitId
MedicalHistoryFieldId
MedicalHistoryFieldValue
I'm not sure if these approaches are good practice. Is there a third approach that would be better to use?

If you are only interested in the diseases the person had, then storing the false / non-existent diseases is quite pointless. Without knowing all the details it is hard to give the best solution, but I would probably create something like this:
Person:
PersonID
Name
Address
Clinic:
ClinicID
Name
Address
Disease:
DiseaseID
Name
MedicalHistory:
HistoryID (identity, primary key)
PersonID
ClinicID
VisitDate (either date or datetime2 field depending what you need)
DiseaseID
Details, Notes etc
I designed the table this way on the assumption that people most likely have only one disease per visit; when there are occasionally several, more rows can be added instead of creating a separate table for the visit, which would make queries more complex.
If you also need to track the case where a disease was checked but the result was negative, a new status field is needed in the history table.
If you need to limit which diseases can be entered by which clinic, you'll need a separate table for that too.
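Here is a rough sketch of this layout, including the extra per-clinic table. It uses Python's sqlite3 as a stand-in for SQL Server (an assumption). The ClinicDisease table and its composite foreign key are hypothetical names for the "separate table" mentioned above, the Address and Notes columns are left out for brevity, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE Person  (PersonID  INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Clinic  (ClinicID  INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Disease (DiseaseID INTEGER PRIMARY KEY, Name TEXT NOT NULL);

    -- Hypothetical table: which diseases each clinic is allowed to record.
    CREATE TABLE ClinicDisease (
        ClinicID  INTEGER NOT NULL REFERENCES Clinic(ClinicID),
        DiseaseID INTEGER NOT NULL REFERENCES Disease(DiseaseID),
        PRIMARY KEY (ClinicID, DiseaseID)
    );

    -- One row per disease the person actually has; nothing stored for "false".
    CREATE TABLE MedicalHistory (
        HistoryID INTEGER PRIMARY KEY,
        PersonID  INTEGER NOT NULL REFERENCES Person(PersonID),
        ClinicID  INTEGER NOT NULL,
        VisitDate TEXT    NOT NULL,
        DiseaseID INTEGER NOT NULL,
        FOREIGN KEY (ClinicID, DiseaseID)
            REFERENCES ClinicDisease(ClinicID, DiseaseID)
    );

    INSERT INTO Person  VALUES (1, 'Pat');
    INSERT INTO Clinic  VALUES (1, 'Clinic 1'), (2, 'Clinic 2');
    INSERT INTO Disease VALUES (1, 'DiseaseOne'), (2, 'DiseaseTwo'), (4, 'DiseaseFour');
    INSERT INTO ClinicDisease VALUES (1, 1), (1, 2), (2, 4);
    INSERT INTO MedicalHistory (PersonID, ClinicID, VisitDate, DiseaseID) VALUES
        (1, 1, '2024-01-08', 1),
        (1, 1, '2024-01-08', 2);
""")

# DiseaseFour belongs to Clinic 2, so Clinic 1 cannot record it.
try:
    conn.execute(
        "INSERT INTO MedicalHistory (PersonID, ClinicID, VisitDate, DiseaseID) "
        "VALUES (1, 1, '2024-01-09', 4)"
    )
    wrong_clinic_rejected = False
except sqlite3.IntegrityError:
    wrong_clinic_rejected = True

history = [name for (name,) in conn.execute("""
    SELECT d.Name
    FROM MedicalHistory AS h
    JOIN Disease AS d ON d.DiseaseID = h.DiseaseID
    WHERE h.PersonID = 1
    ORDER BY d.Name
""")]
```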

Create a set of relational tables to get a robust and flexible system, enabling the clinics to add an arbitrary number of diseases, patients, and visits. Also, constructing queries for various group-by criteria will become easier for you.
Build a set of 4 tables plus a Many-to-Many (M2M) "linking" table as given below. The first 3 tables will be less-frequently updated tables. On each visit of a patient to a clinic, add 1 row to the [Visits] table, containing the full detail of the visit EXCEPT disease information. Add 1 row to the M2M [MedicalHistory] table for EACH disease for which the patient will be consulting on that visit.
On a side note - consider using Table-Valued Parameters for passing a number of rows (1 row per disease being consulted) from your front-end program to the SQL Server stored procedure.
Table [Clinics]
ClinicId Primary Key
ClinicName
- more columns -
Table [Diseases]
DiseaseId Primary Key
ClinicId Foreign Key into the [Clinics] table
DiseaseName
- more columns -
Table [Patients]
PatientId Primary Key
ClinicId Foreign Key into the [Clinics] table
PatientName
- more columns -
Table [Visits]
VisitId Primary Key
VisitDate
DoctorId Foreign Key into another table called [Doctor]
BillingAmount
- more columns -
And finally the M2M table: [MedicalHistory]. (Important - All the FK fields should be combined together to form the PK of this table.)
ClinicId Foreign Key into the [Clinics] table
DiseaseId Foreign Key into the [Diseases] table
PatientId Foreign Key into the [Patients] table
VisitId Foreign Key into the [Visits] table
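A minimal sketch of the linking table, using Python's sqlite3 as a stand-in for SQL Server (an assumption). The DoctorId, BillingAmount and "more columns" fields are omitted, and the sample rows are made up. It shows one visit with two diseases and how the combined primary key rejects the same disease being recorded twice for the same visit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE Clinics  (ClinicId  INTEGER PRIMARY KEY, ClinicName TEXT);
    CREATE TABLE Diseases (DiseaseId INTEGER PRIMARY KEY,
                           ClinicId  INTEGER REFERENCES Clinics(ClinicId),
                           DiseaseName TEXT);
    CREATE TABLE Patients (PatientId INTEGER PRIMARY KEY,
                           ClinicId  INTEGER REFERENCES Clinics(ClinicId),
                           PatientName TEXT);
    CREATE TABLE Visits   (VisitId   INTEGER PRIMARY KEY, VisitDate TEXT);

    -- M2M table: all four foreign keys together form the primary key.
    CREATE TABLE MedicalHistory (
        ClinicId  INTEGER NOT NULL REFERENCES Clinics(ClinicId),
        DiseaseId INTEGER NOT NULL REFERENCES Diseases(DiseaseId),
        PatientId INTEGER NOT NULL REFERENCES Patients(PatientId),
        VisitId   INTEGER NOT NULL REFERENCES Visits(VisitId),
        PRIMARY KEY (ClinicId, DiseaseId, PatientId, VisitId)
    );

    INSERT INTO Clinics  VALUES (1, 'Clinic 1');
    INSERT INTO Diseases VALUES (1, 1, 'DiseaseOne'), (2, 1, 'DiseaseTwo');
    INSERT INTO Patients VALUES (1, 1, 'Pat');
    INSERT INTO Visits   VALUES (1, '2024-01-08');
""")

# One visit, two checked diseases: one MedicalHistory row per disease.
checked = [(1, 1, 1, 1), (1, 2, 1, 1)]  # (ClinicId, DiseaseId, PatientId, VisitId)
conn.executemany(
    "INSERT INTO MedicalHistory (ClinicId, DiseaseId, PatientId, VisitId) "
    "VALUES (?, ?, ?, ?)",
    checked,
)

# Recording the same disease again for the same visit violates the composite PK.
try:
    conn.execute("INSERT INTO MedicalHistory VALUES (1, 1, 1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

diseases_for_visit = [name for (name,) in conn.execute("""
    SELECT d.DiseaseName
    FROM MedicalHistory AS m
    JOIN Diseases AS d ON d.DiseaseId = m.DiseaseId
    WHERE m.VisitId = 1
    ORDER BY d.DiseaseName
""")]
```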

Access Relationship Design

I am fairly green when it comes to working with Access and databases in general.
I am asking for your help in figuring out how to set the correct relationships for three tables:
Table 1 contains:
(no unique ID)
SalesTripID
EmployeeName
StartDate
EndDate
*Each record on this table is related to 1 specific employee's 1 specific sales trip
Table 2 contains:
HotelName
HotelStart
HotelEnd
HotelTotal
*This table may contain multiple records that belong to only 1 record on table 1 (for instance, an employee would stay at 2 hotels during their sales trip)
Table 3 contains:
(no unique ID)
MealVendor
MealDate
MealTotal
*This table, similar to Table 2, may have multiple records in it that are tied to the 1 SalesTripID
How do I set something up to show me each SalesTripID along with the multiple Table 2 and Table 3 records associated with it? Do I need to add a primary key to anything other than Table 1? Is writing a query involved to display the information? Because I am so green, any and all feedback is welcome.

The following is my recommendation:
Add a SalesTripId field to Tables 2 and 3. This is called a foreign key.
If SalesTripId in Table 1 is not unique (i.e. each employee can have a trip with the same Id as another employee), add another field (Id) to Table 1. You can use Access's AutoNumber type for that field.
I recommend always having a primary key in your tables, but you can skip the Id fields in Tables 2 and 3.
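To see how the pieces fit together, here is a small sketch. It uses Python's sqlite3 rather than Access (an assumption, so the example runs without Access installed); the table names SalesTrips, Hotels and Meals, the HotelID/MealID surrogate keys and all sample rows are hypothetical. It assumes SalesTripID is unique in Table 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE SalesTrips (
        SalesTripID  INTEGER PRIMARY KEY,
        EmployeeName TEXT, StartDate TEXT, EndDate TEXT
    );
    CREATE TABLE Hotels (
        HotelID     INTEGER PRIMARY KEY,   -- optional surrogate key
        SalesTripID INTEGER NOT NULL REFERENCES SalesTrips(SalesTripID),
        HotelName TEXT, HotelStart TEXT, HotelEnd TEXT, HotelTotal REAL
    );
    CREATE TABLE Meals (
        MealID      INTEGER PRIMARY KEY,   -- optional surrogate key
        SalesTripID INTEGER NOT NULL REFERENCES SalesTrips(SalesTripID),
        MealVendor TEXT, MealDate TEXT, MealTotal REAL
    );

    -- Made-up sample data: one trip, two hotels, one meal.
    INSERT INTO SalesTrips VALUES (1, 'Ann', '2024-01-08', '2024-01-12');
    INSERT INTO Hotels (SalesTripID, HotelName, HotelStart, HotelEnd, HotelTotal) VALUES
        (1, 'Hotel A', '2024-01-08', '2024-01-10', 300.0),
        (1, 'Hotel B', '2024-01-10', '2024-01-12', 280.0);
    INSERT INTO Meals (SalesTripID, MealVendor, MealDate, MealTotal) VALUES
        (1, 'Diner', '2024-01-09', 25.0);
""")


def trip_report(trip_id: int) -> dict:
    """Return the hotel and meal rows that belong to one sales trip."""
    hotels = conn.execute(
        "SELECT HotelName, HotelTotal FROM Hotels "
        "WHERE SalesTripID = ? ORDER BY HotelStart",
        (trip_id,),
    ).fetchall()
    meals = conn.execute(
        "SELECT MealVendor, MealTotal FROM Meals "
        "WHERE SalesTripID = ? ORDER BY MealDate",
        (trip_id,),
    ).fetchall()
    return {"hotels": hotels, "meals": meals}


report = trip_report(1)
```

Hotels and meals are fetched with two separate queries on purpose: joining both child tables to the trip in a single query would pair every hotel row with every meal row.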

SQL Server how to maintain GUID across tables in same DB

I want to create a DB where each table's PK is a GUID that is unique across the DB.
Example: my DB name is 'LOCATION', and I have 3 tables: 'CITY', 'STATE' and 'COUNTRY'.
I want all 3 tables to have the same kind of PK field, a GUID, whose value is unique across the DB.
How can I do this in SQL Server? I have never used SQL Server before, so a brief explanation would be helpful.

create table CITY (
ID uniqueidentifier not null primary key default newid(),
.
.
.
)
Repeat for the other tables.

What do you mean exactly?
Just create the tables, add an Id field to each one, set the data type of the Id field to 'uniqueidentifier', and you're good to go.
Next, add a primary key constraint on those columns, and make sure that, when inserting a new record, you assign a new GUID to that column (for instance, by using the newid() function).
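If the GUIDs are assigned by the application rather than by a newid() default, the idea looks roughly like the sketch below. It uses Python's uuid module and an in-memory SQLite database as a stand-in for SQL Server (an assumption, so the uniqueidentifier column becomes TEXT); the insert_named helper and the NAME column are hypothetical.

```python
import sqlite3
import uuid

TABLES = ("CITY", "STATE", "COUNTRY")

conn = sqlite3.connect(":memory:")
for table in TABLES:
    # TEXT stands in for SQL Server's uniqueidentifier type.
    conn.execute(f"CREATE TABLE {table} (ID TEXT NOT NULL PRIMARY KEY, NAME TEXT)")


def insert_named(table: str, name: str) -> str:
    """Insert a row with a freshly generated GUID and return that GUID."""
    if table not in TABLES:  # table names cannot be bound as parameters
        raise ValueError(f"unknown table: {table}")
    new_id = str(uuid.uuid4())  # random GUID, client-side counterpart of newid()
    conn.execute(f"INSERT INTO {table} (ID, NAME) VALUES (?, ?)", (new_id, name))
    return new_id


# Made-up sample rows, one per table.
ids = [
    insert_named("COUNTRY", "Example country"),
    insert_named("STATE", "Example state"),
    insert_named("CITY", "Example city"),
]
```

Nothing in the schema enforces uniqueness across the three tables; it holds in practice because random GUIDs are, for all practical purposes, never repeated.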

I can't think of any good reason to have a unique number shared by 3 tables, why not just give each table a unique index with a foreign key reference? Indexed fields are queried quicker than random numbers would be.
I would create a 'Location' table with foreign keys CityId, StateId & CountryId to link them logically.
edit:
If you are adding a unique id across the City, State and Country tables then why not just have them as fields in the same table? I would have thought that your reason for splitting them into 3 tables was to reduce duplication in the database.