I am implementing web application similar to Twitter. I need to implement 'retweet' action, and one tweet can by retweeted by one person multiple times.
I have a basic 'tweets' table that have columns for:
Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id
(where tweet_id is primary key for tweets, tweet_text contains tweet text, tweet_date_created is the DateTime when tweet was created and tweet_user_id is the foreign key to users table and identifies user who has created the tweet)
Now I am wondering how should I implement the retweet action in my database.
Option 1
Should I create new join table, which would look like this:
Retweets: tweet_id | user_id | retweet_date_retweeted
(Where tweet_id is a foreign key to tweets table, user_id is a foreign key to users table and identifies user who has retweeted the tweet, retweet_date_retweeted is a DateTime which specifies when the retweet was done.)
pros: There will be no empty columns, when user process reteet, new line in retweets table will be created.
cons: Querying process will be more difficult, it will need to join two tables and somehow sort the tweets by two dates (when tweet is not retweet, sort it by tweet_date_created, when tweet is retweet, sort it by retweet_date_retweeted).
Option 2
Or should I implement it in the tweets table as parent_id, it will then look like this:
Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id | parent_id
(Where all the columns remains the same and parent_id is a foreign key to the same tweets table. When tweet is created, parent_id remains empty. When tweet is retweeted, parent_id contains origin tweet id, tweet_user_id contains user which processed the retweet action, tweet_date_created contains the DateTime when retweet was done, and tweet_text remains empty - becouse we will not let users change the original tweet when retweeting.)
pros: Querying process is much more elegant, as I do not have to join two tables.
cons: There will be empty cells every time tweet is retweeted. So if I have 1 000 tweets in my database and every of them is retweeted for 5 times, there will be 5 000 lines in my tweets table.
Which is the most efficient way? Is it better to have empty cells or to have querying process more clean?
IMO option #1 would be better. The query to join the tweet and retweet tables would not be at all complex and could be done via a left or inner join, depending on whether you want to show all tweets or only tweets which were retweeted. And the join query should be performant as the table is narrow, the columns being joined are ints, and they will each have indices due to the FK constraints.
Another recommendation is not to label all your columns with tweet or retweet, those can be inferred from the table in which the data is stored, for example:
tweet
id
user_id
text
created_at
retweet
tweet_id
user_id
created_at
And sample joins:
# Return all tweets which have been retweeted
SELECT
count(*),
t.id
FROM
tweet AS t
INNER JOIN retweet AS rt ON rt.tweet_id = t.id
GROUP BY
t.id
# Return tweet and possible retweet data for a specific tweet
SELECT
t.id
FROM
tweet AS t
LEFT OUTER JOIN retweet AS rt ON rt.tweet_id = t.id
WHERE
t.id = :tweetId
-- Update per request --
The following is demonstrative only, representing why I would opt for option #1, there are no foreign keys nor are there any indices, you will have to add these yourself. But the results should demonstrate that the joins won't be too painful.
CREATE TABLE `tweet` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`value` varchar(255) NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8
CREATE TABLE `retweet` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`tweet_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
# Sample Rows
mysql> select * from tweet;
+----+---------+----------------+---------------------+
| id | user_id | value | created_at |
+----+---------+----------------+---------------------+
| 1 | 1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
| 2 | 1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
| 3 | 2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
| 4 | 3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
| 5 | 1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
| 6 | 1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
| 7 | 1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
+----+---------+----------------+---------------------+
mysql> select * from retweet;
+----+----------+---------+---------------------+
| id | tweet_id | user_id | created_at |
+----+----------+---------+---------------------+
| 1 | 4 | 1 | 2012-07-27 00:06:37 |
| 2 | 3 | 1 | 2012-07-27 00:07:11 |
+----+----------+---------+---------------------+
# Query to pull all tweets for user_id = 1, including retweets and order from newest to oldest
select * from (
select t.* from tweet as t where user_id = 1
union
select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1))
a order by created_at desc;
mysql> select * from (select t.* from tweet as t where user_id = 1 union select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1)) a order by created_at desc;
+----+---------+----------------+---------------------+
| id | user_id | value | created_at |
+----+---------+----------------+---------------------+
| 7 | 1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
| 6 | 1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
| 5 | 1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
| 4 | 3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
| 3 | 2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
| 2 | 1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
| 1 | 1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+---------+----------------+---------------------+
Notice in the last set of results, that we were able to also include the retweets and display the retweet of #4 before the retweet of #3.
-- Update --
You can accomplish what you are asking for by changing the query a bit:
select * from (
select t.id, t.value, t.created_at from tweet as t where user_id = 1
union
select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1)
a order by created_at desc;
mysql> select * from (select t.id, t.value, t.created_at from tweet as t where user_id = 1 union select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1) a order by created_at desc;
+----+----------------+---------------------+
| id | value | created_at |
+----+----------------+---------------------+
| 3 | User2 | Tweet1 | 2012-07-27 00:07:11 |
| 7 | User1 | Tweet5 | 2012-07-27 00:06:54 |
| 6 | User1 | Tweet4 | 2012-07-27 00:06:50 |
| 5 | User1 | Tweet3 | 2012-07-27 00:06:47 |
| 4 | User3 | Tweet1 | 2012-07-27 00:06:37 |
| 2 | User1 | Tweet2 | 2012-07-27 00:04:35 |
| 1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+----------------+---------------------+
I would choose option 2 with slight modification. Column parent_id in tweets table should point to itself if it is not a retweet. Then, the querying will be extremely easy:
SELECT tm.Id, tm.UserId, tc.Text, tm.Created,
CASE WHEN tm.Id <> tc .Id THEN tm.UserId ELSE NULL END AS OriginalAsker
FROM tweet tm
LEFT JOIN tweet tc ON tm.ParentId = tc.Id
ORDER BY tm.Created DESC
(tc is parent table - the one with content.. it has tweet's text, original poster's Id, etc.)
The reason for introducing rule about pointing to itself if not retweet is that then it is easy to add more joins to original tweet. You just join a table with tc and don't care if it is retweet or not.
Not only the query is easy, but it will also perform much better than option 1, because sorting is done using only one physical column, which can be indexed.
The only drawback is that the DB will be a little bit larger.
Related
I have an Images, Orders and OrderItems table, I want to match for any images, if any has already been bought by the User passed as parameters by displaying true or false in an IsBought column.
Select Images.Id,
Images.Title,
Images.Description,
Images.Location,
Images.PriceIT,
Images.PostedAt,
CASE WHEN OrderItems.ImageId = Images.Id THEN CAST(1 AS BIT)
ELSE CAST(0 AS BIT) END
AS 'IsBought'
FROM Images
INNER JOIN Users as u on Images.UserId = u.Id
LEFT JOIN Orders on Orders.UserId = #userId
LEFT JOIN OrderItems on Orders.Id = OrderItems.OrderId and OrderItems.ImageId = Images.Id
Group By Images.Id,
Images.Title,
Images.Description,
Images.Location,
Images.PriceIT,
Images.PostedAt,
OrderItems.ImageId,
Orders.UserId
When I use this CASE WHEN I have duplicates when the item has been bought where IsBought is True and the duplicate is False.
In the case where the Item has never been bought, there is no duplicates, IsBought is just equal to False
----------------------------------
| User | type |
----------------------------------
| Id | nvarchar(450) |
----------------------------------
| .......|
----------------------------------
----------------------------------
| Orders | type |
----------------------------------
| Id | nvarchar(255) |
----------------------------------
| UserId | nvarchar(450) |
----------------------------------
| ........................... |
----------------------------------
----------------------------------
| OrderItems | type |
----------------------------------
| Id | nvarchar(255) |
----------------------------------
| OrderId | nvarchar(255) |
----------------------------------
| ImageId | int |
----------------------------------
----------------------------------
| Images | type |
----------------------------------
| Id | int |
----------------------------------
| UserId | nvarchar(450) |
----------------------------------
| Title | nvarchar(MAX) |
----------------------------------
| Description| nvarhar(MAX) |
----------------------------------
| ......................... |
----------------------------------
Any ideas on how I could just have one row per Images with IsBought set to true or false but not duplicates?
I would like something like this:
----------------------------------------------------------------------------
| Id | Title | Description | Location | PriceIT | Location | IsBought |
----------------------------------------------------------------------------
| 1 | Eiffel Tower | .... | ...... | 20.0 | Paris | true |
----------------------------------------------------------------------------
| 2 | Tore di Pisa | .... | ...... | 20.0 | Italia | false |
---------------------------------------------------------------------------
| etc ......
---------------------------------------------------------------------------
Your query logic looks suspicious. It is unusual to see a join that consists only of a comparison of a column from the unpreserved table to a parameter. I suspect that you don't need a join to users at all since you seem to be focused on things "bought" by a person and not things "created" (which is implied by the name "author") by that same person. And a group by clause with no aggregate is often a cover-up for a logically flawed query.
So start over. You want to see all images apparently. For each, you simply want to know if that image is associated with any order of a given person.
select img.*, -- you would, or course, only select the columns needed
(select count(*) from Sales.SalesOrderDetail as orddet
where orddet.ProductID = img.ProductID) as [Order Count],
(select count(*) from Sales.SalesOrderDetail as orddet
inner join Sales.SalesOrderHeader as ord
on orddet.SalesOrderID = ord.SalesOrderID
where orddet.ProductID = img.ProductID
and ord.CustomerID = 29620
) as [User Order Count],
case when exists(select * from Sales.SalesOrderDetail as orddet
inner join Sales.SalesOrderHeader as ord
on orddet.SalesOrderID = ord.SalesOrderID
where orddet.ProductID = img.ProductID
and ord.CustomerID = 29620) then 1 else 0 end as [Has Ordered]
from Production.ProductProductPhoto as img
where img.ProductID between 770 and 779
order by <something useful>;
Notice the aliases - it is much easier to read a long query when you use aliases that are shorter but still understandable (i.e., not single letters). I've included 3 different subqueries to help you understand correlation and how you can build your logic to achieve your goal and help debug any issues you find.
This is based on AdventureWorks sample database - which you should install and use as a learning tool (and to help facilitate discussions with others using a common data source). Note that I simply picked a random customer ID value - you would use your parameter. I filtered the query to a range of images to simplify debugging. Those are very simple but effective methods to help write and debug sql.
Please, can someone help me with what is possibly a simple query?
We have two tables with below structure.
Customer table:
+----+-----------+
| id | name |
+----+-----------+
| 1 | customer1 |
| 2 | customer2 |
| 3 | customer3 |
+----+-----------+
Customer role mapping table:
+-------------+-----------------+
| customer_id | customerRole_id |
+-------------+-----------------+
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
+-------------+-----------------+
I want to select customers with role id 1 only NOT with role id 1 AND 2.
So, in this case, it would be customer id 2,3,4 & 5. ignoring 1 as it has multiple roles.
Is there a simple query to do this?
Many thanks, for any help offered.
Hmmm, there are several ways to do this.
select c.*
from customers c
where exists (select 1 from mapping m where m.customerid = c.id and m.role = 1) and
not exists (select 1 from mapping m where m.customerid = c.id and m.role <> 1);
If you just want the customer id, a perhaps simpler version is:
select customerid
from mapping
group by customerid
having min(role) = 1 and max(role) = 1;
This solution assumes that role is never NULL.
I have a table "temp"
author | title | bibkey | Data
-----------------------------------
John | JsPaper | John2008 | 65
Kate | KsPaper | | 60
| | Data2015 | 80
From this I want to produce two tables, a 'sample_table' and a 'ref_table' like so:
sample_table:
sample_id|ref_id| data
--------------------------
1 | 1 | 65
2 | 2 | 60
3 | 3 | 80
ref_table:
ref_id | author | title | bibkey
--------------------------------------
1 | John | JsPaper | John2008
2 | Kate | KsPaper |
3 | | | Data2015
I've created both tables
CREATE TABLE ref_table ( CREATE TABLE sample_table (
ref_id serial PRIMARY KEY, sample_id serial PRIMARY KEY,
author text, ref_id integer REFERENCES ref_table(ref_id),
title text, data numeric
bibkey text );
);
And inserted the unique author,title,bibkey rows into the reference table as above. What I want to do now is do the join for the sample_table to get the ref_id's. For my insert statement i currently have:
INSERT INTO sample_table (
ref_id,data
)
SELECT ref.ref_id, t.data
FROM
temp t
LEFT JOIN
ref_table ref ON COALESCE(ref.author,'00000') = COALESCE(t.author,'00000')
AND COALESCE(ref.title,'00000') = COALESCE(t.title,'00000')
AND COALESCE(ref.bibkey,'00000') = COALESCE(t.bibkey,'00000');
However i really want to have a conditional statement in the join, rather than all 3 like I have:
IF a bibkey exists for that row, I know it is unique, and join only on that.
If bibkey is NULL, then join on both author and title for the unique pair, and not bibkey.
Is this possible?
I have table with Employees (tblEmployee):
| ID | Name |
| 1 | Smith |
| 2 | Black |
| 3 | Thompson |
And a table with Roles (tblRoles):
| ID | Name |
| 1 | Submitter |
| 2 | Receiver |
| 3 | Analyzer |
I have also a table with relations of Employees to their Roles with many to many relation type (tblEmployeeRoleRel):
| EmployeeID | RoleID |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 3 | 3 |
I need to select ID, Name from tblEmployee that have exaclty the same set of roles from tblEmployeeRoleRel as has the Employee with ID = 1. How can I do it?
Use a where clause to limit the roles you're looking at to those of employeeID of 1 and use a having clause to make sure that the employee's role count matches that of employee1.
SELECT A.EmployeeID
FROM tblEmployeeRoleRel A
WHERE Exists (SELECT 1
FROM tblEmployeeRoleRel B
WHERE B.EmployeeID = 1
and B.RoleID = A.RoleID)
GROUP BY A.EmployeeID
HAVING count(A.RoleID) = (SELECT count(C.RoleID)
FROM tblEmployeeRoleRel C
WHERE EmployeeID = 1)
This assumes that employeeID and roleID are unique in tblEmployeeRoleRel otherwise we may have to distinct the roleID fields above.
Declare #EmployeeID int = 1 -- change this to whatever employee ID you like, or perhaps you'd pass an Employee ID to it in a stored procedure.
Select Distinct e.EmployeeID -- normally distinct would incur extra overhead, but in this case you only want the employee IDs. not using Distinct when an employee has multiple roles will give you multiple employee IDs.
from tblEmployeeRoleRel as E
where E.EmployeeID not in
(Select EmployeeID from tblEmployeeRoleRel where RoleID not in (Select RoleID from tblEmployeeRoleRel where Employee_ID = #EmployeeID))
and exists (Select EmployeeID from tblEmployeeRoleRel where EmployeeID = e.EmployeeID) -- removes any "null" matches.
and E.Employee_ID <> #Employee_ID -- this keeps the employee ID itself from matching.
So I started off with this query:
SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
It took forever, so I ran an explain:
mysql> explain SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
| 1 | PRIMARY | TABLE1 | ALL | NULL | NULL | NULL | NULL | 2554388553 | Using where |
| 2 | DEPENDENT SUBQUERY | temptable | ALL | NULL | NULL | NULL | NULL | 1506 | Using where |
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
2 rows in set (0.01 sec)
It wasn't using an index. So, my second pass:
mysql> explain SELECT * FROM TABLE1 JOIN temptable ON TABLE1.hash=temptable.hash;
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
| 1 | SIMPLE | temptable | ALL | hash | NULL | NULL | NULL | 1506 | |
| 1 | SIMPLE | TABLE1 | ref | hash | hash | 5 | testdb.temptable.hash | 527 | Using where |
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
2 rows in set (0.00 sec)
Can I do any other optimization?
You can gain some more speed by using a covering index, at the cost of extra space consumption. A covering index is one which can satisfy all requested columns in a query without performing a further lookup into the clustered index.
First of all get rid of the SELECT * and explicitly select the fields that you require. Then you can add all the fields in the SELECT clause to the right hand side of your composite index. For example, if your query will look like this:
SELECT first_name, last_name, age
FROM table1
JOIN temptable ON table1.hash = temptable.hash;
Then you can have a covering index that looks like this:
CREATE INDEX ix_index ON table1 (hash, first_name, last_name, age);