Load data from CSV file into PostgreSQL database

I have an application that needs to load data from user-specified CSV files into PostgreSQL database tables.
The structure of the CSV file is simple:
name,email
John Doe,john@example.com
...
In the database I have three tables:
---------------
-- CAMPAIGNS --
---------------
CREATE TABLE "campaigns" (
"id" serial PRIMARY KEY,
"name" citext UNIQUE CHECK ("name" ~ '^[-a-z0-9_]+$'),
"title" text
);
----------------
-- RECIPIENTS --
----------------
CREATE TABLE "recipients" (
"id" serial PRIMARY KEY,
"email" citext UNIQUE CHECK (length("email") <= 254),
"name" text
);
-----------------
-- SUBMISSIONS --
-----------------
CREATE TYPE "enum_submissions_status" AS ENUM (
'WAITING',
'SENT',
'FAILED'
);
CREATE TABLE "submissions" (
"id" serial PRIMARY KEY,
"campaignId" integer REFERENCES "campaigns" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"recipientId" integer REFERENCES "recipients" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"status" "enum_submissions_status" DEFAULT 'WAITING',
"sentAt" timestamp with time zone
);
CREATE UNIQUE INDEX "submissions_unique" ON "submissions" ("campaignId", "recipientId");
CREATE INDEX "submissions_recipient_id_index" ON "submissions" ("recipientId");
I want to read all rows from the specified CSV file and make sure that corresponding records exist in the recipients and submissions tables.
What would be the most performance-efficient method to load the data into these tables?
This is primarily a conceptual question; I'm not asking for a concrete implementation.
First, I naively tried to read and parse the CSV file line by line and issue SELECT/INSERT queries for each email. Obviously, that was a very slow solution, loading ~4k records per minute, but the code was simple and straightforward.
Now I'm reading the CSV file line by line, but aggregating the emails into batches of 1,000 elements. All SELECT/INSERT queries are issued in batches, using SELECT "id", "email" FROM "recipients" WHERE "email" IN ('...', '...', '...', ...) constructs. This approach increased throughput to ~25k records per minute. However, it required fairly complex multi-step code to work.
Are there any better approaches to solve this problem and get even greater performance?
The key problem here is that I need to insert data into the recipients table first, and then use the generated id to create a corresponding record in the submissions table.
Also, I need to make sure that the inserted emails are unique. Right now, I'm using a simple array-based index in my application to prevent duplicate emails from being added to a batch.
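For reference, this whole two-step flow can be collapsed into a single statement. A minimal sketch, assuming PostgreSQL 9.5+ (for ON CONFLICT) and a placeholder campaign id of 1; the email values are just examples:
WITH "upserted" AS (
    -- Upsert the batch; DO UPDATE makes RETURNING yield ids for existing rows too
    INSERT INTO "recipients" ("email", "name")
    VALUES ('john@example.com', 'John Doe'), ('jane@example.com', 'Jane Roe')
    ON CONFLICT ("email") DO UPDATE SET "name" = EXCLUDED."name"
    RETURNING "id"
)
-- Create the submissions rows from the returned recipient ids
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 1, "id" FROM "upserted"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;
Note that ON CONFLICT DO UPDATE raises an error if the same email appears twice within a single statement, so the per-batch in-app deduplication is still needed.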
I'm writing my app using Node.js and Sequelize with Knex; however, the concrete technology doesn't matter much here.

pgAdmin has had a GUI for data import since 1.16. You have to create your table first, and then you can import data into it easily: just right-click on the table name and click Import.
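Imports like this boil down to PostgreSQL's COPY command. A minimal SQL-level sketch of the same idea, assuming a hypothetical staging table and a server-readable file path (use psql's \copy instead for a client-side file):
CREATE TEMP TABLE "recipients_staging" ("name" text, "email" citext);

-- Bulk-load the raw CSV rows
COPY "recipients_staging" ("name", "email")
FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER true);

-- Move them into the real table, skipping duplicates
INSERT INTO "recipients" ("email", "name")
SELECT "email", "name" FROM "recipients_staging"
ON CONFLICT ("email") DO NOTHING;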

Related

Cassandra DB structure suggestion (two tables vs one)

I am new to Cassandra, and I'm creating a database structure and wondering whether my choice is optimal for my requirements.
I need to get information on unique users, and each unique user will have multiple page views.
My two options are de-normalizing my data into one big table, or creating two different tables.
My most used queries would be:
Searching for all of the pages with a certain cookie_id.
Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
My two suggested schemas are these:
Two tables - One for users and one for visited pages.
This would allow me to save a new user on a different table, and keep every recurring request in another table.
CREATE TABLE user_tracker.pages (
cookie_id uuid,
created_uuid timeuuid,
data_type_3 text,
data_type_4 text,
PRIMARY KEY (cookie_id, created_uuid)
);
CREATE TABLE user_tracker.users (
cookie_id uuid,
client_id uuid,  -- 'id' is not a CQL type; uuid assumed here
data_type_1 text,
data_type_2 text,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
This data is normalized as I don't enter the user's data for every request.
One table - For all of the data saved, and the first request as the key. First request would have data_type_1 and data_type_2.
I could also save "data_type_1" and "data_type_2" as a hashmap, as they represent multiple columns and they will always be in a single data set (is it considered to be better?).
The column "first" would be a secondary index.
CREATE TABLE user_tracker.users_pages (
cookie_id uuid,
client_id uuid,  -- uuid assumed here as well
data_type_1 text,
data_type_2 text,
data_type_3 text,
data_type_4 text,
first boolean,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
In reality we have more than 4 columns; this was written briefly.
As far as I understand Cassandra's best practices, I'm leaning into option #2, but please do enlighten me :-)
Data modelling in Cassandra is driven by the type of queries you will be making: you should be able to query using the partition key.
Option #2 as you describe it is okay for the first two queries:
1. Searching for all of the pages with a certain cookie_id.
2. Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
As for the third query (find the first cookie_id with extra data, for example data_type_1 + data_type_2):
I'm not sure what you mean by the first cookie_id with extra data. In Cassandra, data is distributed by partition key, and the partitions themselves are not stored in sorted order; rows within a partition are ordered only by the clustering columns. So all your data will be stored using cookie_id as the partition key, and every future instance with the same cookie_id will keep adding rows to that partition.
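For concreteness, these are the two queries that the option #2 primary key serves directly (a sketch, using the table and column names from the question; ? marks a bound value):
-- All pages for a cookie: cookie_id is the partition key
SELECT * FROM user_tracker.users_pages WHERE cookie_id = ?;

-- Pages for a cookie and a client: client_id is the first clustering column
SELECT * FROM user_tracker.users_pages WHERE cookie_id = ? AND client_id = ?;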

Cassandra - how to perform a table query?

I am trying to perform a query using 2 tables:
CREATE TABLE users(
id_ UUID PRIMARY KEY,
username text,
email text
);
CREATE TABLE users_by_email(
id UUID,
email text PRIMARY KEY
)
In this case, how do I perform a query by email?
I am assuming in the case above you are specifically trying to retrieve the username by the email.
Short Answer:
There is no way in Cassandra that you are going to be able to get the username from the email in a single query using the table structure you have defined. You would need to query users_by_email to get the id and then query users to get the username. A better option would be to add the username column to the users_by_email table.
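For illustration, the two-step lookup against the original tables would be something like this (the values are just examples):
SELECT id FROM users_by_email WHERE email = 'mreynolds@serenity.com';
-- ...then, using the id returned by the first query:
SELECT username FROM users WHERE id_ = d8e57eb4-c837-4bd7-9fd7-855497861faf;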
Long Answer:
Due to the underlying mechanisms by which Cassandra stores data on disk, the only parameters you may use in a WHERE clause are those in the primary key. The primary key is made up of two different kinds of keys. First is the partition key, which is used to physically separate files on disk and distribute data between nodes in the cluster. Second are the clustering keys, which are used to organize data stored within a partition and aid in efficient retrieval. One other critical point: if you use a WHERE clause in your query, it must contain all of the partition keys for each call. This is to allow for efficient retrieval of the data. If you want more detailed information on how the WHERE clause works, take a look at this link:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
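As a quick illustration against the users table above (the uuid is just an example value):
-- Allowed: id_ is the partition key of users
SELECT username FROM users WHERE id_ = d8e57eb4-c837-4bd7-9fd7-855497861faf;

-- Rejected: email is not part of the primary key, so it cannot be used to filter
SELECT username FROM users WHERE email = 'mreynolds@serenity.com';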
Now that you know what the limitations of the WHERE clause are, the question is how to get around them. The first thing you need to know is that Cassandra is not an RDBMS, and you cannot perform JOINs against tables. This means we need to forget the rules we have learned over so many years about how to properly normalize data in a database and begin thinking differently about the problem. In general, Cassandra is designed for a table-per-query pattern. This means that for each data access pattern (i.e. query) you are going to run, there is an associated table that contains the data for that query and has the proper keys to allow the data to be filtered appropriately. I am not going to be able to go into all the details of how to properly model your data, but I suggest you take the free DataStax Academy data modeling course available here:
https://academy.datastax.com/courses/ds220-data-modeling
So, as I understand your particular need, I think you can modify your users table to look like this:
CREATE TABLE users_by_email(
email text,
username text,
id_ UUID,
PRIMARY KEY (email, username)
);
This table setup will allow you to select the username by email using a query like:
SELECT username FROM users_by_email WHERE email=XXXXX;
I am assuming that you also want username returned in the query. You cannot JOIN tables in Cassandra. So to do that, you will have to add that column to your users_by_email table:
CREATE TABLE users_by_email(
id UUID,
email text PRIMARY KEY,
username text
);
Then, simply query that table by email address.
> SELECT id, email, username FROM users_by_email WHERE email='mreynolds@serenity.com';
 id                                   | email                  | username
--------------------------------------+------------------------+----------
 d8e57eb4-c837-4bd7-9fd7-855497861faf | mreynolds@serenity.com | Mal
(1 rows)

Table design with joining tables or separate ID in main table

I'm designing a database that has a few tables: FAQs, Bulletins, and Attachments. Bulletins and FAQs could have an attachment associated with them, so my initial thought was to create a joining table with the two primary keys as a composite key:
Bulletin
--------
BulletinID
Subject
Description
Notes
Attachment
-----------
AttachmentID
FileName
FilePath
etc.
Joining table:
BulletinAttachments
-------------------
BulletinID
AttachmentID
As I design this, I also thought: what if other entities are introduced later (say Newsletter, Email, etc.) that need attachments as well? I would have to create a joining table for each of these entities. Not awful, but it made me think: what if I got rid of the joining tables, put an AttachmentType in the Attachment table, and assigned the type accordingly:
AttachmentType
--------------
AttachmentTypeID
AttachmentType
Description
The data in that table would be:
1-Bulletin
2-FAQ
3-Newsletter
4-Email
Then the Attachment table would hold the AttachmentTypeID to identify it:
Attachments
-----------
AttachmentID
AttachmentTypeID
FileName
FilePath
etc.
So my question is: performance-wise (using SQL Server 2008 R2), is there a better choice between the two? Is there a better way to design this? My concern with using individual joining tables is that more entities may come along, and to accommodate attachments we would have to create a joining table for each one and write front-end logic for it, whereas the AttachmentTypeID approach would allow the front-end to insert a new AttachmentType with no database changes needed.
Your second solution doesn't have a way to link the attachment to the item, just what kind of item it is.
Even if it did (i.e. an ItemID), what you would create would be a violation of fourth normal form: a multivalued dependency.
Stick with your first plan, but consider whether Bulletins are fundamentally different from Newsletters, Emails, FAQs, etc. in your application. If you do need a new table for Newsletters, add a new NewsletterAttachments table.
Also consider, are you going to share attachments between different items, or types of item?
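For reference, a sketch of the joining table from the first plan, with the composite key and foreign keys spelled out (T-SQL; the int key type is an assumption, since the question doesn't give one):
CREATE TABLE BulletinAttachments
(
    BulletinID int NOT NULL,
    AttachmentID int NOT NULL,
    PRIMARY KEY (BulletinID, AttachmentID),
    FOREIGN KEY (BulletinID) REFERENCES Bulletin (BulletinID),
    FOREIGN KEY (AttachmentID) REFERENCES Attachment (AttachmentID)
);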
I totally agree with podiluska. You need to create a separate joining table for each type of attachment; otherwise you can't map the item ID to its attachments, and you will have problems joining tables for the different attachment types. Separate joining tables per type will also perform faster.

Organizing database tables - large number of properties

I have a database that stores some users. Each user has account settings, privacy settings, and lots of other properties to set. The number of those properties has started to grow, and I could end up with 30 or so.
Until now, I kept them in a "UserInfo" table, with User and UserInfo related one-to-many (keeping a log of all changes). Putting everything in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings, and other "groups" of settings into separate tables, with 1-1 relations between UserInfo and each settings table, is one solution, but would that be too slow (or much slower) when retrieving the data? I guess not all data would be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping a log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns regularly may become time-consuming once you have lots of rows.
For an alternative, check out this blog post for a nifty way of storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store them in a single TEXT column. The format is semi-structured: your application can separate the properties when needed, but you can also add more at any time, or even have different properties per row. XML, YAML, and JSON are example formats, as is any object serialization format supported by your application's language.
CREATE TABLE Users (
user_id SERIAL PRIMARY KEY,
user_properties TEXT
);
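For example, with JSON as the serialization format (the property values are illustrative):
-- Store all dynamic properties as one serialized blob
INSERT INTO Users (user_properties)
VALUES ('{"birthdate": "2001-01-01", "privacy": {"profile": "friends"}}');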
This makes it hard to search for a given value in a given property. So, in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: the value of the given property, and a foreign key back to the main table where that particular value is found. Now you can index the value column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
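A sketch of that sync step for the birthdate table, assuming MySQL's ON DUPLICATE KEY UPDATE and an illustrative user id:
-- After writing the serialized blob to Users for user 123:
INSERT INTO UserBirthdate (user_id, birthdate)
VALUES (123, '2001-01-01')
ON DUPLICATE KEY UPDATE birthdate = VALUES(birthdate);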

What is the best practice in DB design when I want to store a value that is either selected from a dropdown list or user-entered?

I am trying to find the best way to design the database in order to allow the following scenario:
The user is presented with a dropdown list of Universities (for example)
The user selects his/her university from the list if it exists
If the university does not exist, he should enter his own university in a text box (sort of like Other: [___________])
How should I design the database to handle such a situation, given that I might want to sort using the university ID, for example (probably only for the built-in universities and not the ones entered by users)?
Thanks!
I just want to make it similar to how Facebook handles this situation. If the user selects his education (by typing in the combobox, which is not my concern) and chooses one of the returned values, what would Facebook do?
My guess is that it would insert the UserID and the EducationID into a many-to-many table. Now, what if what the user enters is not in the database at all? It is still stored in his profile, but where?
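For reference, a hypothetical many-to-many table matching that guess (names and types are illustrative):
CREATE TABLE user_education
(
    user_id smallint NOT NULL,
    education_id smallint NOT NULL,
    PRIMARY KEY (user_id, education_id)
);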
CREATE TABLE university
(
id smallint NOT NULL,
name text,
public smallint,
CONSTRAINT university_pk PRIMARY KEY (id)
);
CREATE TABLE person
(
id smallint NOT NULL,
university smallint,
-- more columns here...
CONSTRAINT person_pk PRIMARY KEY (id),
CONSTRAINT person_university_fk FOREIGN KEY (university)
REFERENCES university (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
public is set to 1 for the universities supplied with the system, and 0 for user-entered ones.
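With that flag, the dropdown and the user-entered case could look like this (a sketch; the id value is illustrative, since id generation isn't shown in the schema above):
-- Dropdown: only the built-in universities
SELECT id, name FROM university WHERE public = 1 ORDER BY name;

-- A university typed in by a user is inserted with public = 0
INSERT INTO university (id, name, public) VALUES (1001, 'My Local College', 0);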
You could cheat: if you're not worried about the referential integrity of this field (i.e. it's just there to show up in a user's profile and isn't required for strictly enforced business rules), store it as a simple VARCHAR column.
For your dropdown, use a query like:
SELECT DISTINCT University FROM Profiles
If you want to filter out typos or one-offs, try:
SELECT University FROM PROFILES
GROUP BY University
HAVING COUNT(University) > 10 -- where 10 is an arbitrary threshold you can tweak
We use this code in one of our databases for storing the trade descriptions of contractor companies; since this is informational only (there's a separate "Category" field for enforcing business rules) it's an acceptable solution.
Keep a flag for the rows entered through user input in the same table as you have your other data points. Then you can sort using the flag.
One way this was solved in a previous company I worked at:
Create two columns in your table:
1) a nullable id of the system-supplied string (stored in a separate table)
2) the user supplied string
Only one of these is populated. A constraint can enforce this (and additionally that at least one of these columns is populated if appropriate).
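A sketch of such a constraint, assuming hypothetical columns university_id and university_name on a table called profile:
-- Exactly one of the two columns must be populated
ALTER TABLE profile ADD CONSTRAINT university_exactly_one CHECK (
    (university_id IS NOT NULL AND university_name IS NULL) OR
    (university_id IS NULL AND university_name IS NOT NULL)
);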
It should be noted that the problem we were solving with this was a true "Other:" situation. It was a textual description of an item with some preset defaults. Your situation sounds like an actual entity that isn't in the list, such that more than one user might want to input the same university.
This isn't a database design issue. It's a UI issue.
The Drop down list of universities is based on rows in a table. That table must have a new row inserted when the user types in a new University to the text box.
If you want to separate the list you provided from the ones added by users, you can have a column in the University table with origin (or provenance) of the data.
I'm not sure the question is very clear here.
I've done this quite a few times at work, and I just choose between either the dropdown list or a text box. If the data is entered in the text box, I first insert it into the database and then use the identity value of that inserted row for further queries.
INSERT INTO MyTable (Name) VALUES ('myval'); SELECT SCOPE_IDENTITY();
This is against MS SQL 2008, though; I'm not sure whether SCOPE_IDENTITY() exists in other database systems, but I'm sure there are equivalents.
