Cassandra - how to perform a table query?

I am trying to perform a query using 2 tables:

CREATE TABLE users (
    id_ UUID PRIMARY KEY,
    username text,
    email text
);

CREATE TABLE users_by_email (
    id UUID,
    email text PRIMARY KEY
);

In this case, how do I perform a query by email?

I am assuming in the case above you are specifically trying to retrieve the username by the email.
Short Answer:
There is no way in Cassandra that you are going to be able to get the username from the email in a single query using the table structure you have defined. You would need to query users_by_email to get the id and then query users to get the username. A better option would be to add the username column to the users_by_email table.
Long Answer:
Due to the underlying mechanisms by which Cassandra stores data on disk, the only columns you may use in a WHERE clause are those in the Primary Key. The Primary Key is made up of 2 different types of keys. First is the partition key, which determines which node in the cluster (and which files on disk) a row lives on. Second are the clustering keys, which order the data stored within a partition and aid in efficient retrieval. One other critical point: a WHERE clause must restrict all of the partition key columns in every query, so that Cassandra can locate the partition directly. If you want more detailed information on the workings of the WHERE clause, take a look at this link:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
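To illustrate, here is a sketch of a table with a compound primary key and the WHERE clauses it does and does not support (example names, not from the question):

CREATE TABLE posts_by_user (
    user_id uuid,
    created_at timeuuid,
    content text,
    PRIMARY KEY (user_id, created_at)  -- user_id: partition key; created_at: clustering key
);

-- Valid: the full partition key is restricted
SELECT * FROM posts_by_user WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Valid: partition key plus a range over the clustering key
SELECT * FROM posts_by_user
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
AND created_at > maxTimeuuid('2016-01-01');

-- Invalid: no partition key, so Cassandra cannot locate the partition
-- SELECT * FROM posts_by_user WHERE created_at > maxTimeuuid('2016-01-01');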
Now that you know the limitations of the WHERE clause, the question is how to work around them. The first thing you need to know is that Cassandra is not an RDBMS and you cannot perform JOINs against tables. This means we need to forget the rules we have learned over many years about how to properly normalize data in a database and begin thinking differently about the problem. In general, Cassandra is designed for a table-per-query pattern: for each data access pattern (i.e. query) you are going to run, there is an associated table that contains the data for that query and has the proper keys to allow the data to be filtered appropriately. I am not going to be able to go into all the nitty-gritty details of how to model your data properly, but I suggest you take the free DataStax Academy data modeling course available here:
https://academy.datastax.com/courses/ds220-data-modeling
So, as I understand your particular need, I think you can modify your users_by_email table to look like this:

CREATE TABLE users_by_email (
    email text,
    username text,
    id_ UUID,
    PRIMARY KEY (email, username)
);
This table setup will allow you to select the username by email using a query like:
SELECT username FROM users_by_email WHERE email='XXXXX';

I am assuming that you also want username returned in the query. You cannot JOIN tables in Cassandra. So to do that, you will have to add that column to your users_by_email table:
CREATE TABLE users_by_email (
    id UUID,
    email text PRIMARY KEY,
    username text
);
Then, simply query that table by email address.
> SELECT id, email, username FROM users_by_email WHERE email='mreynolds@serenity.com';

 id                                   | email                  | username
--------------------------------------+------------------------+----------
 d8e57eb4-c837-4bd7-9fd7-855497861faf | mreynolds@serenity.com | Mal

(1 rows)

Related

Cassandra DB structure suggestion (two tables vs one)

I am new to Cassandra, and I'm creating a database structure. I wonder whether my choice is optimal for my requirements.
I need to get information on unique users, and each unique user will have multiple page views.
My two options are de-normalizing my data into one big table, or create two different tables.
My most used queries would be:

1. Searching for all of the pages with a certain cookie_id.
2. Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
3. Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
My two suggested schemas are these:
Two tables - One for users and one for visited pages.
This would allow me to save a new user on a different table, and keep every recurring request in another table.
CREATE TABLE user_tracker.pages (
    cookie_id uuid,
    created_uuid timeuuid,
    data_type_3 text,
    data_type_4 text,
    PRIMARY KEY (cookie_id, created_uuid)
);

CREATE TABLE user_tracker.users (
    cookie_id uuid,
    client_id int,
    data_type_1 text,
    data_type_2 text,
    created_uuid timeuuid,
    PRIMARY KEY (cookie_id, client_id, created_uuid)
);
This data is normalized as I don't enter the user's data for every request.
One table - for all of the data saved, with the first request as the key. The first request would have data_type_1 and data_type_2.
I could also save "data_type_1" and "data_type_2" as a map collection, as they represent multiple columns and they will always arrive in a single data set (is that considered better?).
The column "first" would be a secondary index.
CREATE TABLE user_tracker.users_pages (
    cookie_id uuid,
    client_id int,
    data_type_1 text,
    data_type_2 text,
    data_type_3 text,
    data_type_4 text,
    first boolean,
    created_uuid timeuuid,
    PRIMARY KEY (cookie_id, client_id, created_uuid)
);
In reality we have more than 4 columns; this is abbreviated.
As far as I understand Cassandra's best practices, I'm leaning toward option #2, but please do enlighten me :-)
Data modelling in Cassandra is driven by the queries you will run: you should always be able to query by partition key.

Your option 2 works fine for the first two queries:

1. Searching for all of the pages with a certain cookie_id.
2. Searching for all of the pages with a certain cookie_id and a client_id.

As for the third query, "Find the first cookie_id with extra data": I'm not sure what you mean by the first cookie_id with extra data. In Cassandra, data is distributed by partition key, and the partitions themselves are not stored in sorted order (only rows within a partition are ordered, by the clustering columns). All your data will be stored using cookie_id as the partition key, and every future insert with the same cookie_id will keep adding rows to that partition.
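For illustration, the first two lookups against the users_pages table from option 2 are straightforward (a sketch; the uuid is a placeholder):

SELECT * FROM user_tracker.users_pages
WHERE cookie_id = 62c36092-82a1-3a00-93d1-46196ee77204;

SELECT * FROM user_tracker.users_pages
WHERE cookie_id = 62c36092-82a1-3a00-93d1-46196ee77204
AND client_id = 0;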

Load data from CSV-file into PostgreSQL database

I have an application, that needs to load data from user-specified CSV-files into PostgreSQL database tables.
The structure of CSV-file is simple:
name,email
John Doe,john@example.com
...
In the database I have three tables:
---------------
-- CAMPAIGNS --
---------------
CREATE TABLE "campaigns" (
"id" serial PRIMARY KEY,
"name" citext UNIQUE CHECK ("name" ~ '^[-a-z0-9_]+$'),
"title" text
);
----------------
-- RECIPIENTS --
----------------
CREATE TABLE "recipients" (
"id" serial PRIMARY KEY,
"email" citext UNIQUE CHECK (length("email") <= 254),
"name" text
);
-----------------
-- SUBMISSIONS --
-----------------
CREATE TYPE "enum_submissions_status" AS ENUM (
'WAITING',
'SENT',
'FAILED'
);
CREATE TABLE "submissions" (
"id" serial PRIMARY KEY,
"campaignId" integer REFERENCES "campaigns" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"recipientId" integer REFERENCES "recipients" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"status" "enum_submissions_status" DEFAULT 'WAITING',
"sentAt" timestamp with time zone
);
CREATE UNIQUE INDEX "submissions_unique" ON "submissions" ("campaignId", "recipientId");
CREATE INDEX "submissions_recipient_id_index" ON "submissions" ("recipientId");
I want to read all rows from the specified CSV-file and make sure that corresponding records exist in the recipients and submissions tables.
What would be the most performance-efficient method to load data in these tables?
This is primarily a conceptual question, I'm not asking for a concrete implementation.
First of all, I naively tried to read and parse the CSV-file line-by-line and issue SELECT/INSERT queries for each e-mail. Obviously, it was a very slow solution, loading ~4k records per minute, but the code was pretty simple and straightforward.
Now I'm still reading the CSV-file line-by-line, but aggregating the e-mails into batches of 1,000 elements. All SELECT/INSERT queries are made in batches using SELECT id, email FROM recipients WHERE email IN ('...', '...', ...) constructs. This approach increased performance to ~25k records per minute, but it required pretty complex multi-step code.
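For illustration, one batch in that scheme boils down to something like this (a sketch against the schema above; the emails and ids are placeholders):

-- Look up which of the batch's emails already exist
SELECT id, email FROM recipients
WHERE email IN ('john@example.com', 'jane@example.com', 'carol@example.com');

-- Insert the missing ones and collect the generated ids in one round trip
INSERT INTO recipients (name, email)
VALUES ('Carol', 'carol@example.com')
RETURNING id, email;

-- Link each recipient to the campaign
INSERT INTO submissions ("campaignId", "recipientId") VALUES (1, 42);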
Are there any better approaches to solve this problem and get even greater performance?
The key problem here is that I need to insert data into the recipients table first, and then I need to use the generated id to create a corresponding record in the submissions table.
Also, I need to make sure that inserted E-Mails are unique. Right now, I'm using a simple array-based index in my application to prevent duplicate E-Mails from being added to the batch.
I'm writing my app using Node.js and Sequelize with Knex, however, the concrete technology doesn't matter here much.
pgAdmin has had a GUI for data import since 1.16. You have to create your table first, and then you can import data easily: just right-click on the table name and click Import.
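pgAdmin's import is a thin wrapper around PostgreSQL's COPY, so the same load can be scripted directly. A sketch, assuming a server-readable file path and PostgreSQL 9.5+ for ON CONFLICT:

CREATE TEMP TABLE staging (name text, email text);

-- Load the raw CSV (use psql's \copy instead for a client-side file)
COPY staging (name, email) FROM '/path/to/users.csv' WITH (FORMAT csv, HEADER);

-- De-duplicate against the existing recipients in one statement;
-- relies on the UNIQUE constraint on recipients.email
INSERT INTO recipients (name, email)
SELECT name, email FROM staging
ON CONFLICT (email) DO NOTHING;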

Database schema about social networking like Facebook

Like Facebook, I have posts, comments and user profiles.
I THINK THAT
Posts and comments do not need the details of user
ONLY user profiles need the details
So I separate the user information into main and detail
Here is the schema.
Question
Is it necessary to separate user data into main and details?
WHY not or WHY yes?
Thanks for replying!
I would recommend using separate tables because you may not need all that information at one time. You could do it either way, but I think of it as: do you need all of the data at once?
Table 1 (User Auth)
This table would hold only information for log-in and have three columns (user_name, hashed_password, UID)
So your query would select UID where user_name and hashed_password matched. I would also recommend never storing a readable password in a database table because that can become a security issue.
Table 2 (Basic Information)
This table would hold the least amount of information that you would get at signup to make a basic profile. The fields would consist of UID, name, DOB, zip, link_to_profile_photo, email and whatever basic information you would like. email is kind of special because if you require the user_name to be an email address there is no reason to have it twice.
Table 3 (Extended Information)
This table would hold any optional information that the user could enter like phone_number, bio or address assigned by UID.
Then after that you can add as many other tables as you would like: one for posts, one for comments, etc.
An example of a Posts table would be:
post_id, UID, the_post, date_of_post, likes, etc.
Then for Comments:
comment_id, for_post_id, UID, the_comment, date_of_comment, likes, etc.
Breaking it down into smaller sections will be more efficient in the long run.
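A sketch of that three-table layout in SQL (generic syntax; the table names and column types are my assumptions):

CREATE TABLE user_auth (
    uid INT PRIMARY KEY,
    user_name VARCHAR(254) UNIQUE,
    hashed_password VARCHAR(255)  -- never store a readable password
);

CREATE TABLE user_basic (
    uid INT PRIMARY KEY REFERENCES user_auth (uid),
    name VARCHAR(100),
    dob DATE,
    zip VARCHAR(10),
    link_to_profile_photo VARCHAR(255),
    email VARCHAR(254)  -- redundant if user_name is already the email address
);

CREATE TABLE user_extended (
    uid INT PRIMARY KEY REFERENCES user_auth (uid),
    phone_number VARCHAR(20),
    bio TEXT,
    address VARCHAR(255)
);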
Database performance is tied to disk seek time; disk seek time is a bottleneck of database performance. For a large table, you may need a long seek time to locate and read an entry. Since posts and comments do not need the user's details, just the main user info, you get reduced read time if they reference only the user's id. Joins with user_main_info will also be faster. Keep the smallest portion of data that you read most frequently in one table and the other detailed information in another table. But in a scenario where you will always need to read all the user information together, this won't give you any benefit.
1) The user information table:

create table fb_users (
    intuserid int primary key,
    username varchar(50),
    phoneno int,
    emailid varchar(max)
)

2) For sending a friend request, create a table called friends holding the friend requestor, who the friend was requested by, the status between the two of them, and an active flag:

create table fb_friends (
    intfriendid int primary key,
    intfriendrequestor int references fb_users (intuserid),
    intfriendrequestedby int references fb_users (intuserid),
    statusid int,  -- the status id from the lookup table below
    active bit
)

3) The status lookup table, with a status name, status description, and an active flag:

create table fb_status (
    intstatusid int primary key,
    statusname varchar(50),
    statusdesc varchar(max),
    active bit
)

The status could be:
pending
approved
deleted
...etc.

4) Similarly for groups, likes and comments: a table is created for each of them, with a foreign key linking back to intuserid in the users table.

Database design and large tables?

Are tables with lots of columns indicative of bad design? For example say I have the following table that stores user information and user settings:
[Users table]
userId
name
address
somesetting1
...
somesetting50
As the site requires more settings the table gets larger. In my mind this table is normalized, all the settings are dependent on the userId.
I have a thing against tables with lots of columns; it just seems wrong to me. But then I remembered that you can select which data to return from the table, so if the table is large I could still break it into several different objects in code. For example:
[User object]
[UserSetting object]
and return only the data to fill those objects.
Is the above common practice, or are there other techniques for dealing with tables with lots of columns that are more suitable?
I think you should use multiple tables like this:
[Users table]
userId
name
address
[Settings table]
settingId
userId
settingKey
settingValue
The tables are related by the userId column, which you can use to retrieve the settings for whichever user you need.
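For example, loading one user's settings is then a single query (the user id is a placeholder):

SELECT settingKey, settingValue
FROM Settings
WHERE userId = 42;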
I would say that it is bad table design. If a user doesn't have an entry for 47 of those 50 settings then you will have a large number of NULLs in the table, which isn't good practice and will also slow down performance (NULLs have to be handled in a special way).
Instead, have the following:
USER TABLE
Id,
FirstName
LastName
etc
SETTINGS
Id,
SettingName
USER SETTINGS
Id,
SettingId,
UserId,
SettingValue
You then have a many-to-many join, and eliminate the NULLs.
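A sketch of the lookup against that layout (I've written the table names as Settings and UserSettings; the user id is a placeholder):

SELECT s.SettingName, us.SettingValue
FROM UserSettings us
JOIN Settings s ON s.Id = us.SettingId
WHERE us.UserId = 42;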
First, don't put spaces in table names! All the [brackets] will be a real pain!
If you have 50 columns, how meaningful will all that data be for each user? Will there be lots of NULLs? Most data may not even apply to any given user. Think one-to-one tables, where you break down the "settings" into logical groups:
Users:  --main table where most values will be stored
    userId
    name
    address
    somesetting1  --please note that I'm using "somesetting1"; don't
    ...           --name the columns like this, use meaningful names!!
    somesetting5

UserWidgets  --all widget settings for the user
    userId
    somesetting6
    ...
    somesetting12

UserAccounting  --all accounting settings for the user
    userId
    somesetting13
    ...
    somesetting23

--etc.
You only need a Users row for each user, and then a row in each table where that data applies to the given user. If a user doesn't have any widget settings, then there's no row for that user. You can LEFT JOIN each table as necessary to get all the settings you need. Usually you only need to work on a subset of settings based on which part of the application is running, which means you won't need to join in all of the tables, just the one or two that you need at that time.
You could consider an attributes table. As long as your indexes are good, then you wouldn't have too much of a performance issue:
[AttributeDef]
AttributeDefId int (primary key)
GroupKey varchar(50)
ItemKey varchar(50)
...
[AttributeVal]
AttributeValId int (primary key)
AttributeDefId int (FK -> AttributeDef.AttributeDefId)
UserId int (probably FK to users table?)
Val varchar(255)
...
Basically you're "pivoting" your table with many columns into 2 tables with fewer columns. You can write views and table functions around this structure to give you the data for a group of related items or just a specific item, etc. You could also add other things to the attribute definition table to indicate required data elements, restrictions on the data elements, and so on.
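For instance, a view that pivots a couple of known attributes back into columns might look like this (a sketch; the attribute keys are hypothetical):

CREATE VIEW UserContactInfo AS
SELECT v.UserId,
       MAX(CASE WHEN d.ItemKey = 'phone' THEN v.Val END) AS Phone,
       MAX(CASE WHEN d.ItemKey = 'bio'   THEN v.Val END) AS Bio
FROM AttributeVal v
JOIN AttributeDef d ON d.AttributeDefId = v.AttributeDefId
GROUP BY v.UserId;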
What's your thought on this type of design?
Use several tables with matching indexes to get the best SELECT speed, and use those indexed columns to relate the information between tables in a JOIN.

What is the best practices in db design when I want to store a value that is either selected from a dropdown list or user-entered?

I am trying to find the best way to design the database in order to allow the following scenario:
The user is presented with a dropdown list of Universities (for example)
The user selects his/her university from the list if it exists
If the university does not exist, he should enter his own university in a text box (sort of like Other: [___________])
How should I design the database to handle such a situation, given that I might want to sort using the university ID (probably only for the built-in universities and not the ones entered by users)?
Thanks!
I just want to make it similar to how Facebook handles this situation. If the user selects his education (by actually typing in the combobox, which is not my concern) and chooses one of the returned values, what would Facebook do?
My guess is that it would insert the UserID and the EducationID into a many-to-many table. Now, what if what the user is entering is not in the database at all? It is still stored in his profile, but where?
CREATE TABLE university
(
    id smallint NOT NULL,
    name text,
    public smallint,
    CONSTRAINT university_pk PRIMARY KEY (id)
);

CREATE TABLE person
(
    id smallint NOT NULL,
    university smallint,
    -- more columns here...
    CONSTRAINT person_pk PRIMARY KEY (id),
    CONSTRAINT person_university_fk FOREIGN KEY (university)
        REFERENCES university (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
public is set to 1 for the unis in the system, and 0 for user-entered unis.
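Usage might then look like this (a sketch against the schema above; id generation is left to the application):

-- Populate the dropdown from the curated list only
SELECT id, name FROM university WHERE public = 1 ORDER BY name;

-- A user-entered university goes in with public = 0
INSERT INTO university (id, name, public) VALUES (1001, 'Some New University', 0);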
You could cheat: if you're not worried about the referential integrity of this field (i.e. it's just there to show up in a user's profile and isn't required for strictly enforced business rules), store it as a simple VARCHAR column.
For your dropdown, use a query like:
SELECT DISTINCT University FROM Profiles
If you want to filter out typos or one-offs, try:
SELECT University FROM PROFILES
GROUP BY University
HAVING COUNT(University) > 10 -- where 10 is an arbitrary threshold you can tweak
We use this code in one of our databases for storing the trade descriptions of contractor companies; since this is informational only (there's a separate "Category" field for enforcing business rules) it's an acceptable solution.
Keep the user-entered rows in the same table as your other data, with a flag marking their origin. Then you can sort (or filter) using the flag.
One way this was solved in a previous company I worked at:
Create two columns in your table:
1) a nullable id of the system-supplied string (stored in a separate table)
2) the user supplied string
Only one of these is populated. A constraint can enforce this (and additionally that at least one of these columns is populated if appropriate).
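A sketch of such a constraint (Postgres-style syntax; the table and column names are hypothetical):

CREATE TABLE profile_university (
    person_id int NOT NULL,
    university_id int NULL,       -- id of the system-supplied entry
    user_entered_name text NULL,  -- free-text fallback
    -- exactly one of the two must be populated
    CHECK ((university_id IS NULL) <> (user_entered_name IS NULL))
);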
It should be noted that the problem we were solving with this was a true "Other:" situation. It was a textual description of an item with some preset defaults. Your situation sounds like an actual entity that isn't in the list, such that more than one user might want to input the same university.
This isn't a database design issue. It's a UI issue.
The drop-down list of universities is based on rows in a table. That table must have a new row inserted when the user types a new university into the text box.
If you want to separate the list you provided from the ones added by users, you can have a column in the university table with the origin (or provenance) of the data.
I'm not sure if the question is very clear here.
I've done this quite a few times at work, selecting between either the drop-down list or a text box. If the data is entered in the text box, then I first insert it into the database and then use the identity value to get the unique identifier of that inserted row for further queries.
INSERT INTO MyTable (Name) VALUES ('myval'); SELECT SCOPE_IDENTITY();
This is against MS SQL 2008, though; I'm not sure whether SCOPE_IDENTITY() exists in other versions of SQL Server, but I'm sure there are equivalents.
