Cassandra table definition / partitioning / modeling - database

Trying to define the right schema / table for our scenario:
We have a few hundred eCommerce sites; each one has a unique siteId.
Each site has its own end-users, up to 10M unique users per month. Each user has a unique userId.
Each end-user interacts with the site: viewing products, adding products to the cart, and purchasing products (we call these user events). I want to store the activities of the last 30 days (or 180 days if possible).
Things to consider:
Site sizes differ! We have some "heavy" sites with 10M end-users, but we also have "light" sites with a few hundred or a few thousand users.
Events don't have unique ids.
Users can have more than one event at a time; for example, they can view a page with more than one product (but we could live without that to simplify).
Rough estimate: 100 customers x 10M end-users x 100 interactions = 100,000,000,000 rows (per month)
Writes happen in real time (when the event arrives at the server). Reads happen much less often (about 1% of the events).
Events have some more metadata, and different events (view/purchase/..) have different metadata.
Whether to use a keyspace per site (managing a table for each site) vs. keeping all customers in one table.
How to define the key here?
+--------+---------+------------+-----------+-----------+-----------+
| siteId | userId  | timestamp  | eventType | productId | other ... |
+--------+---------+------------+-----------+-----------+-----------+
| 1      | Value 2 | 1501234567 | view      | abc       |           |
| 1      | cols    | 1501234568 | purchase  | abc       |           |
+--------+---------+------------+-----------+-----------+-----------+
My query is: get all events (and their metadata) of a specific user. As I assumed above, around 100 events.
Edit 2: I guess it wasn't clear, but user uniqueness is per site; two different users might have the same id if they are on different sites.

If you want to query by userid, then userid should be the first part of your compound primary key (this is the partition key). Use a compound primary key to create columns that you can query to return sorted results. I would suggest the following schema:
CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id bigint,
    PRIMARY KEY (userid, site_id, timestamp, product_id));
That should make queries like
SELECT * FROM user_events WHERE userid = 123 AND site_id = 456;
quite performant. By adding the timestamp to the PK you can also easily LIMIT your queries to get only the top (latest) 1000 events (or whatever you need) without running into performance issues caused by very active users (or bots) with a very long history.
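For illustration, a minimal sketch of such a query (note: with the default ascending clustering order, LIMIT returns the oldest events first; to get "latest-first" the table needs a CLUSTERING ORDER BY clause):
SELECT * FROM user_events
WHERE userid = 123 AND site_id = 456
LIMIT 1000;
-- To make LIMIT return the most recent events instead, declare the table
-- WITH CLUSTERING ORDER BY (site_id ASC, timestamp DESC).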
One thing to keep in mind: I would recommend having user_id, or a combination of user_id and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.
So an alternative design would look like this:
CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id bigint,
    PRIMARY KEY ((userid, site_id), timestamp, product_id));
The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?
To point out one thing: the partition key (also called the row key) identifies a row, and a row stays on a specific node. For this reason it is a good idea to have rows of more or less the same size. A row with a couple of thousand or tens of thousands of columns is not really a problem. You will get problems if some rows have millions of columns and other rows have only 10-20 columns; that will cause the cluster to become unbalanced. Furthermore, it makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key).
Does that make sense to you? Maybe the excellent answer to this post will give you some more insight: difference between partition-key, composite-key and clustering-key. Furthermore, a closer look at this part of the DataStax documentation offers some more details.
Hope that helps.

My query is: get all events (and their metadata) of a specific user. As I assumed above, around 100 events.
So, you want all the events of a given user. Since each user's id is unique per site, you can form the table using site_id and userid as the partition key and timestamp as the clustering key. Here is the table structure:
CREATE TABLE user_events_by_time (
    userid bigint,
    timestamp timestamp,
    event_type text,
    product_id bigint,
    site_id bigint,
    PRIMARY KEY ((site_id, userid), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
Now you can query all of a user's events in a given time range using the following query:
SELECT * FROM user_events_by_time WHERE site_id = <site_id> AND userid = <user_id> AND timestamp > <from_time> AND timestamp < <to_time>;
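And because the table clusters by timestamp DESC, fetching just the latest events (the ~100 per user assumed in the question) is a simple LIMIT:
SELECT * FROM user_events_by_time WHERE site_id = <site_id> AND userid = <user_id> LIMIT 100;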
Hope this solves your problem.

Related

How do I prevent duplicate ids from the imported table in MariaDB?

(First of all, I apologize for my bad English.)
I have a case like this:
I am currently having trouble with my web application. I made a web application for a certain company using CodeIgniter 3.
I built the database using MariaDB. For the id in each table, I use an auto-increment id. I usually deploy the web app to a cloud server (sometimes the company has their own dedicated server, but sometimes they don't). One day, a company decided they didn't want the app I made deployed to the cloud (for security purposes, they said).
This company wanted the app deployed on each employee's PC individually in the office, with the PCs not connected to each other (i.e., stand-alone PCs / personal computers / employees' laptops). They said that every 5 months they would collect all of the data from the employees' personal computers into the company's data center, and of course the data center is not connected to the internet. I told them that's not a good way to store their data (because the data will be duplicated when I try to merge it all into one, since the id column of every table is an auto-increment primary key). Unfortunately, the company still wants to keep the app that way, and I don't know how to solve this.
They have at least 10 employees who would use this web app. Accordingly, I have to deploy the app to 10 PCs individually.
Additional info: each employee has their own unique id, which they got from the company, and I use an auto-increment id for each employee, as in the table below:
id | employee_id | employee_name
 1 | 156901010   | emp1
 2 | 156901039   | emp2
 3 | 156901019   | emp3
 4 | 156901015   | emp4
 5 | 156901009   | emp5
 6 | 156901038   | emp6
The problem is that whenever they fill in a form in the application, some of the tables store not the employee's company id but the new auto-increment id.
For example, the electronic_parts table has the following columns:
| id | electronic_part_name | kind_of_electronic_part_id |
If emp1 fills in the form in the web app, the table's contents would look like this:
| id | electronic_part_name | kind_of_electronic_part_id |
| 1  | switch               | 1                          |
And if emp2 fills in the form in the web app, the table's contents would look like this:
| id | electronic_part_name | kind_of_electronic_part_id |
| 1  | duct tape            | 10                         |
When I try to merge the contents of these tables into the data center, everything falls apart because of the duplicate ids.
It gets worse when I think about the foreign keys in other tables, for example the customer_order table.
The customer_order table looks like the sample below (not the actual table, but similar):
| id | customer_name | electronic_parts_id | cashier (a.k.a. employee_id - the auto-increment id, not the id the employee got from the company as described above) |
| 1  | Henry         | 1                   | 10 |
| 2  | Julie         | 2                   | 9  |
Does anyone know how to solve this problem, or can someone suggest/recommend a good way to solve it?
NOTE: Each employee has their own database for the app, so the database is not centralized; it's a stand-alone database. That means I have to install the database on each employee's PC one by one.
This is an unconventional situation, and you can have an unconventional solution. I can suggest two methods to solve this issue.
Instead of using auto-increment for the primary key, generate a UUID and use it as the primary key. Regarding the probability of duplicates in random UUIDs: only after generating 1 billion UUIDs every second for the next 100 years would the probability of creating just one duplicate reach about 50%.
In CodeIgniter you could do this with the following code snippet:
$this->db->set('id', 'UUID()', FALSE);
This generates a 36-character hexadecimal key (with 4 dashes included):
ac689561-f7c9-4f7e-be94-33c6c0fb0672
As you can see, the string has dashes in it. Using the CodeIgniter DB function will insert it into the database with the dashes, and it will still work. If that does not look clean, you can remove them and convert the string to a 32-char key.
You can use the following function with the help of the [CodeIgniter UUID library][1]:
function uuid_key() {
    $this->load->library('uuid');
    // Output a v4 UUID
    $id = $this->uuid->v4();
    $id = str_replace('-', '', $id);
    $this->db->set('id', $id, FALSE);
}
Now we have a 32-character key:
ac689561f7c94f7ebe9433c6c0fb0672
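As a minimal sketch (table and column names taken from the question; the column sizes are my assumptions), the electronic_parts table keyed by such a UUID might look like this:
CREATE TABLE electronic_parts (
    id CHAR(32) NOT NULL PRIMARY KEY,  -- app-generated, dash-stripped v4 UUID
    electronic_part_name VARCHAR(100) NOT NULL,
    kind_of_electronic_part_id INT NOT NULL
);
With keys generated this way on each stand-alone PC, rows from different employees can be merged into the data center without id collisions.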
An alternative unconventional method to tackle the situation is to add a function that logs all INSERT, UPDATE, and DELETE queries processed by the site to a local file. This way, each local installation generates a log file with the actual list of queries that modified the DB over time, in the right sequential order. At any point in time, the state of the database is the result of the set of all those queries executed in the past up to that date.
So every 5 months, when you are ready to collect data from the employees' personal computers, instead of taking a data dump, take this file with the full query log. (Note: such a query log won't contain auto-increment ids, as those are created only at the moment a query is actually executed against a database.)
Use these files to import data into your data center. This will not conflict, because the data center will generate its own auto-increments in real time as the queries run. (Hopefully you will not have to link a local database to the data center at any point in the future.) A sketch of what such a log could look like follows below.
[1]: https://github.com/Repox/codeigniter-uuid
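To make the log idea concrete, such a file could simply contain replayable SQL statements with the auto-increment column omitted (a hypothetical example using the values from the question):
-- employee1_log.sql, collected from one stand-alone PC
INSERT INTO electronic_parts (electronic_part_name, kind_of_electronic_part_id)
VALUES ('switch', 1);
-- employee2_log.sql, collected from another PC
INSERT INTO electronic_parts (electronic_part_name, kind_of_electronic_part_id)
VALUES ('duct tape', 10);
Because no explicit id values appear, replaying each file at the data center (e.g. mysql company_db < employee1_log.sql) lets the central database assign fresh, non-conflicting ids.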
Is that id used in any other tables? It would probably be involved in a JOIN. If so, you have a big problem of unraveling the ids.
If the id is not used anywhere else, then the values are irrelevant, and the rows can be renumbered. This would be done (roughly speaking) by loading the data from the various sources into the same table, but not including the id in the load.
Or, if there is some other column (or combination of columns) that is UNIQUE, then make that the PRIMARY KEY and get rid of id.
Which case applies? We can pursue in more detail. Please provide SHOW CREATE TABLE for any table(s) that are relevant.
In my first case (where the id is used as an FK elsewhere), do something like this:
While inserting the rows into the table with the id, increment the values by enough to avoid colliding with the existing ids. Then do (in the same transaction):
UPDATE the_other_table SET fk_id = fk_id + same_increment;
Repeat for each other table and each id, as needed.
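A hypothetical sketch of that shift for the tables in the question (the offset of 100000, and the assumption that each PC holds fewer than 100000 rows, are mine):
-- Renumber one employee's rows before copying them to the data center.
SET FOREIGN_KEY_CHECKS = 0;  -- the PK and the FK move together
START TRANSACTION;
UPDATE electronic_parts SET id = id + 100000;
UPDATE customer_order SET electronic_parts_id = electronic_parts_id + 100000;
COMMIT;
SET FOREIGN_KEY_CHECKS = 1;
Use a different offset per PC (100000, 200000, ...) so no two machines' ids overlap.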
I think your problem comes from your database: it wasn't designed well for this.
It's a bug if you have one id for two different users.
If you just make your id field unique in your database, then two employees won't have the same id, so your problem is in your table design.
Just define your id field like this and your problem will be solved:
CREATE TABLE [YOUR TABLE NAME] (
    [ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    ....
Is it required for the id to be an integer? If not, maybe you can use a prefix on the id so the input from each employee will be unique in general; see the sketch below. That means you have to give up the auto-increment and just do a count on the table data (assuming you're not deleting any of the records).
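A rough sketch of the prefix idea (the separator and the use of the company-issued employee_id as the prefix are my assumptions; id would need to be a string column):
-- emp1 (company id 156901010) inserting a part on their own PC:
-- the id becomes '156901010-1', '156901010-2', ... and cannot
-- collide with ids generated on any other employee's machine.
INSERT INTO electronic_parts (id, electronic_part_name, kind_of_electronic_part_id)
SELECT CONCAT('156901010-', COUNT(*) + 1), 'switch', 1
FROM electronic_parts
WHERE id LIKE '156901010-%';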
You may need to write some code in PHP to handle this. If the other tables already follow a unique/primary key scheme, then it is fine.
You can also fix duplicates after the import, as described here:
Find duplicates in the same table in MySQL

What is a good database structure for tracking who has read which messages?

In a messaging forum such as Stack Overflow, what is an efficient way to store the data to track who has read which messages?
If there are m messages and n users, is it possible to have a worst case of less than m * n bits?
I will go with a classic READ_MESSAGES table.
-----FK---------FK------------------------------------
| msg_id | user_id | read_timestamp | blah...
------------------------------------------------------
\========PK==========/
This will work well up to a million rows or so; then insertions will become a pain. If we use something like MySQL, we will need an artificial auto-increment primary key.
-------------------FK---------FK----------------------------------
| autoinc_pk | msg_id | user_id | read_timestamp | blah...
------------------------------------------------------------------
\=====PK=====/ \===UNIQUE=NOT=NULL===/
This will capture our data fine, but may not be optimal for querying. We have two possible query types:
Given a message id, show which (or how many) users have read it: SELECT msg_id, COUNT(user_id) FROM read_messages WHERE msg_id = '123'
Given a user id, show which (or how many) messages they have read: SELECT user_id, COUNT(msg_id) FROM read_messages WHERE user_id = '456'
Of course the system will need to perform both types of queries, but if it runs one type far more often than the other, we can tweak the design to make those queries a little faster. This is done by altering the order of the columns in the UNIQUE NOT NULL key. The idea: of the two columns, put first the column whose value is given, in other words the column that appears in the WHERE clause.
So, if we find the system doing more Type-1 queries than Type-2 queries, we order the columns as {msg_id, user_id}; otherwise we order them as {user_id, msg_id}. Remember: when we run a WHERE query on a multicolumn key, the first column favors speed.
If we find our application favoring one type of query far more than the other, we can go further and partition/shard the table horizontally on the column in the WHERE clause. In databases like Cassandra or DynamoDB, that column could be the partition key.
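In MySQL terms, the two orderings of the UNIQUE key might look like this (a sketch; the types are my assumptions):
CREATE TABLE read_messages (
    autoinc_pk BIGINT AUTO_INCREMENT PRIMARY KEY,
    msg_id BIGINT NOT NULL,
    user_id BIGINT NOT NULL,
    read_timestamp DATETIME NOT NULL,
    -- favors Type-1 queries (WHERE msg_id = ...);
    -- declare it as (user_id, msg_id) instead to favor Type-2
    UNIQUE KEY uq_msg_user (msg_id, user_id)
);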
If you need a really scalable solution and for some reason a simple SQL table does not work for you, here is an alternative with DynamoDB:
Have a table with a primary (hash) key of user ID and a range key of message ID. Also create a Global Secondary Index with message ID as its primary key and user ID as its range key. Now you can easily perform any type of query you might want (e.g., get the messages read by user X, get the users who read message Y, or check whether user X read message Y). This solution is scalable and has constant, predictable speed. The disadvantage is that it will probably be more expensive than SQL.

Multi-user web app - Database design

I am going to be developing a multi-user web app where users will record every day whether or not they have completed various tasks. The tasks are repeated every day, e.g., "Every day XYZ needs to be done."
Not all users will have the same tasks to complete each day. There will likely be a database table containing all possible tasks. When a new user signs up on the web, they will select the tasks that apply to them by creating a profile for themselves.
Each day the user will then record whether or not they completed their respective tasks. Following that, there will be in-depth reporting and historical stats, not just on a user's own task history but also globally to look for trends.
I'm just looking for suggestions on how to design the database (in general terms). Would it be OK to have a task table that contains all the tasks? Then, when a new user creates their profile online, a brand-new table would be created with their profile information and the tasks they have selected. Each unique user-profile table would then contain an ongoing history of the tasks completed each day.
Or is there a better way to design this?
Edit: Or would a better idea be something like the below?
Task history table:
PersonID | Date    | Task1        | Task2    | Task3    | Task4
001      | 24Jan15 | Complete     | Complete |          |
002      | 24Jan15 |              | Complete | Complete | Not Complete
003      | 24Jan15 | Not Complete |          |          |
So there would be one table containing all the users (and the tasks they've chosen), another table containing all possible tasks, and lastly the above table recording the task history each day.
The only issue here is that not every task is applicable to every person, so there will be blanks. Not sure if that matters.
As you can no doubt tell, I'm a beginner, so any advice would be appreciated.
It is almost never a good idea to create new tables dynamically to hold subsets of the data. Data for different users should go in the same set of tables, with some field identifying the user. There is no good reason to have hundreds of tables that are all identical except that one is for some key value A, the next is for key value B, etc. Just add the key field to the table.
As a_horse_with_no_name says, numbered columns is a strong sign that you are doing it wrong. There are many reasons why this is a bad idea. Among them: If you have one column for each task, what happens when a new task is added? Instead of just adding a new record, now you have to add a new column to the table, and update all the existing records. Also, it makes queries very complicated. A query like "what tasks were done today" requires a separate test for every column, instead of one test on a single "task" column.
From what you've said, here's my first thought on how this should look:
Task table
(task_id, task_name)
This lists all the tasks of interest.
User table
(user_id, user_name)
This lists all the users.
Assigned_Task table
(user_id, task_id)
This relates users to tasks. There will be one record in this table for each task for each user. That is, if Alice is user 1 and she is supposed to do tasks 1, 2, and 3; and Bob is user 2 and he is supposed to do 2 and 4, then there will be records (1,1), (1,2), (1,3), (2, 2), and (2,4).
(Note: You might have an assigned_task_id field for this table to be the primary key, or the PK could be user_id + task_id, as that must be unique.)
Task_Status table
(user_id, task_id, task_date, completed)
This will have one record for each user/task combination, for each day. So after 30 days, if Alice has 3 tasks, there will be 3 x 30 = 90 records for her, 3 for each day times 30 days.
(You might have a task_status_id as the PK, or you might use user_id + task_id + task_date. Keys with more than 2 fields tend to be a pain so I'd probably create a task_status_id. Whatever.)
Any of these tables might have additional fields if there's other information you need. Like the User table might have employee number, phone number, department, etc.
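Putting the four tables into DDL form, a minimal MySQL sketch (the types and sizes are my assumptions):
CREATE TABLE task (
    task_id INT AUTO_INCREMENT PRIMARY KEY,
    task_name VARCHAR(100) NOT NULL
);
CREATE TABLE user (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    user_name VARCHAR(100) NOT NULL
);
CREATE TABLE assigned_task (
    user_id INT NOT NULL,
    task_id INT NOT NULL,
    PRIMARY KEY (user_id, task_id),  -- or an assigned_task_id surrogate
    FOREIGN KEY (user_id) REFERENCES user(user_id),
    FOREIGN KEY (task_id) REFERENCES task(task_id)
);
CREATE TABLE task_status (
    task_status_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    task_id INT NOT NULL,
    task_date DATE NOT NULL,
    completed TINYINT(1) NOT NULL DEFAULT 0,
    UNIQUE KEY uq_user_task_date (user_id, task_id, task_date)
);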
Then a question like, "What tasks were not completed yesterday?" is easily answered with this query:
select user.user_name, task.task_name
from task_status
join user on user.user_id = task_status.user_id
join task on task.task_id = task_status.task_id
where task_date = #date
and completed = 0
How many tasks were completed today?
select count(*)
from task_status
where task_date = #date and completed = 1
Etc.

How to do computed column values in SQL Server with Entity Framework

I'm used to using identity columns and having pk's generated for me. I'm wondering, is there any way to scope an identity column or otherwise compute it to be unique within a foreign key?
For example, using identity column a table might look like this:
| UserId | WidgetId |   <-- table Widget; WidgetId is the identity column & PK
----------------------
|   1    |    1     |
|   1    |    2     |
|   1    |    3     |
|   2    |    4     |
|   2    |    5     |
|   2    |    6     |
What if I want to achieve something more like the following:
| UserId | WidgetId |   <-- table Widget; UserId and WidgetId together compose the PK
----------------------
|   1    |    1     |
|   1    |    2     |
|   1    |    3     |
|   2    |    1     |
|   2    |    2     |
|   2    |    3     |
I know EF allows for database-generated column values, like identity and computed. My question is, how might one achieve a computed column like this?
I realize it can be done in the application, by finding the user's highest WidgetId and incrementing by one, but this doesn't feel like the right way.
I imagine it might also be possible to do this with a trigger, but adding a trigger to the db from the DbContext would place a dependency on the RDBMS, right?
Is there a better way, something that takes less of a dependency on which backend EF is using?
Update
To be clearer: in the model above there is an identifying relationship between Widget and User on purpose. UserId is the primary key of the User table, and by making the FK to User part of Widget's PK, Widgets are never shared between Users. When a User is deleted, all of its Widgets are deleted along with it. Querying Widgets by WidgetId alone makes no sense; to query for a particular Widget, both the UserId and WidgetId parameters are necessary.
I'm exploring this because I recall reading in Evans DDD that non-root entity id's in an aggregate only need to be unique within the aggregate. The above is an example of this case:
ENTITIES other than the root have local identity, but that identity
needs to be distinguishable only within the AGGREGATE, because no
outside object can ever see it out of the context of the root ENTITY.
(Domain-Driven Design, Evans)
Update 2
After Gorgan's answer, I have implemented something like the following in my widget factory class. Is this the correct approach to achieve the result desired above?
public class WidgetFactory : BaseFactory
{
    private object _sync = new object();

    internal WidgetFactory(IWriteEntities entityWriter) : base(entityWriter)
    {
        // my DbContext class implements the IWriteEntities interface
    }

    public Widget CreateOrUpdate(User user, int? widgetId, string prop1,
        bool prop2, string prop3)
    {
        Widget widget = null;
        if (!widgetId.HasValue)
        {
            // when widgetId is null do a create (construct & hydrate), then
            EntityWriter.Insert(widget); // does DbEntityEntry.State = Added
        }
        else
        {
            // otherwise query the widget from EntityWriter to update it, then
            EntityWriter.Update(widget); // does DbEntityEntry.State = Modified
        }

        // determine the appropriate WidgetId & save
        lock (_sync)
        {
            widget.WidgetId = widgetId.HasValue ? widget.WidgetId
                : EntityWriter.Widgets.Where(w => w.UserId == user.Id)
                    .Max(w => w.WidgetId) + 1;
            EntityWriter.SaveChanges(); // does DbContext.SaveChanges()
        }
        return widget;
    }
}
Widget
I suppose I should not have disguised this term. The actual entity is WorkPlace. I am building an app to keep track of where I work during the year, since I have to submit an itinerary with my municipal tax refund form every year. When it's done, I plan to publish it to the cloud so others can use it for free. But of course I want their WorkPlaces to be completely isolated from mine.
Users are automatically created for every visitor using anonymous identification / Request.AnonymousID (there will be a register feature later, but it is not necessary for trying out a demo). The web app will have RESTful URLs like /work-places/1, /work-places/2, etc. Because every request is guaranteed to have a user, the user id need not appear in the URL. I can identify users from Request.AnonymousID when nothing is present in User.Identity.Name.
If WorkPlaceId were an identity column, my controller actions would first need to check that the user owns the workplace before displaying it; otherwise I could hack the URL to see every WorkPlace that every user has set up in the system. By making WorkPlaceId unique only per user, I need not worry about it: the URL /work-places/1 will display entirely different data to two different users.
The purpose of an identity column is to have an ID in the table that is unique no matter what the other column values are, and SQL Server supports auto-generation of only one identity column per table (no multiple or composite auto-generated primary keys).
If every widget is different (even when the same widget is used by several users), then this primary key makes sense, but it cannot be computed by SQL Server itself (unless you use a stored procedure or other DB programmability for the insert). And you can only read an auto-generated column with EF (see How should I access a computed column in Entity Framework Code First?).
If, however, you have several widget types (in this example widget types 1, 2 and 3; EDIT: WidgetId could be an FK to a WidgetType table whose Id is auto-generated, so you would have a PK composed of two auto-generated IDs), you could make a WidgetType (or WidgetInstance, WidgetSomething, whatever entity describes the part of the widget not affected by the user) table with an auto-generated ID, and use that ID in this table as an FK and part of the primary key. That would allow you to have both UserId and WidgetId auto-generated and existing at the moment of insert into this table.
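If you do go the stored-procedure route mentioned above, the per-user WidgetId can be assigned inside the insert itself. A T-SQL sketch (table and column names are from the question; the locking hints are what makes max+1 safe under concurrent inserts):
-- Assigns the next WidgetId scoped to @UserId.
-- UPDLOCK/HOLDLOCK serialize concurrent inserts for the same user.
INSERT INTO Widget (UserId, WidgetId, Prop1)
SELECT @UserId,
       COALESCE(MAX(WidgetId), 0) + 1,
       @Prop1
FROM Widget WITH (UPDLOCK, HOLDLOCK)
WHERE UserId = @UserId;
Unlike the application-side lock in the factory class above, this works even when several app servers talk to the same database.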
I hate to oversimplify and apologize for doing so. Whenever this question comes up, it smells like a database design issue.
Widgets and users are both considered master data. The widget table should only contain attributes about the widget, and the user table should only contain attributes about the user. If the widget is related to a user, it would probably be OK to include the userid in the widget table, but the widgetid should be the identity and the userid an FK.
The linking of widgets and users should be done outside of these master tables. If they are related to a bill/invoice, it's common to use a combination of an invoice header table (invoiceid, userid, datetime, total cost, total price, ...) and an invoice line table (invoicelineid, invoiceid, widgetid, unit cost, unit price, quantity, ...).
As the data volume grows, a little bit of normalization will ensure that your app will scale. Additionally, these strong relationships will help with data quality when it's time to turn this data into business intelligence.

Bank transactions table - can this be done better?

I am wondering what the best way to design a bank transaction table is.
I know a user can have many accounts, so I added an AccountID instead of a UserID, but how do I name the other, foreign account? And how do I know whether a transaction is incoming or outgoing? I have an example here, but I think it can be done better, so I'm asking for your advice.
In my example I store all transactions in one table and add a bool isOutgoing. If it is set to true, I know the user sent money to ForeignAccount; if it's false, I know ForeignAccount sent money to the user.
My example
Please note that this is not for real bank, of course. I am just trying things out and figuring best practices.
My opinion:
Make the ID not null, Identity(1,1) and the primary key.
UserAccountID is fine. Don't forget to create the FK to the Accounts table.
You could make ForeignAccount an integer as well, if every transaction is between 2 accounts and both accounts are internal to the organization.
Do not create nvarchar fields unless necessary (they occupy twice as much space), and don't make the column 1024 characters. If you need more than 900 characters, use varchar(max); if the column is under 900, you can still create an index on it.
Create the datetime columns with a default of getdate(), unless transactions can be created on a date other than the actual date.
Amount should be numeric, not integer.
Usually, I think, you would see a column reflecting a DEBIT or CREDIT to the account, rather than "outgoing".
There would probably be several tables, something like these:
ACCOUNT
-------
account_id
account_no
account_type
OWNER
-------
owner_id
name
other_info
ACCOUNT_OWNER
--------------
account_id
owner_id
TRANSACTION
------------
transaction_id
account_id
transaction_type
amount
transaction_date
Here you would get 2 records per transfer - one showing a debit, and one showing a credit.
If you really wanted, you could link these two transactions in another table:
TRANSACTION_LINK
----------------
transaction_id1
transaction_id2
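To illustrate with assumed values: a 100.00 transfer from account 1 to account 2 becomes two TRANSACTION rows plus one linking row:
-- debit the sending account
INSERT INTO transaction (transaction_id, account_id, transaction_type, amount, transaction_date)
VALUES (1001, 1, 'DEBIT', 100.00, '2017-07-28');
-- credit the receiving account
INSERT INTO transaction (transaction_id, account_id, transaction_type, amount, transaction_date)
VALUES (1002, 2, 'CREDIT', 100.00, '2017-07-28');
-- link the two legs of the transfer
INSERT INTO transaction_link (transaction_id1, transaction_id2)
VALUES (1001, 1002);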
I'd agree with the comment about the isOutgoing flag - it's far too easy for an insert/update to set it incorrectly (although the name of the column is clear, as a column it could be overlooked and therefore left at a default value).
Another approach for a transaction table could be along the lines of:
TransactionID (unique key)
OwnerID
FromAccount
ToAccount
TransactionDate
Amount
Alternatively, you can have a "LocalAccount" and a "ForeignAccount" and let the sign of the Amount field represent the direction.
If you are doing transactions involving multiple currencies, then the following columns would be required/considered:
Currency
AmountInBaseCcy
FxRate
If multiple currencies are involved, you either want an FX rate per currency pair (or to a common currency) per date, or you store the rate per transaction - which one depends on how it would be calculated.
I think what you are looking for is how to handle a many-to-many relationship (accounts can have multiple owners, owners can have multiple accounts).
You do this through a joining table. So you have account with all the details needed for an account, you have user with all the details needed for a user, and then you have account_user, which contains just the ids from the other two tables.
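In DDL terms (a sketch; the table and column names are my assumptions):
CREATE TABLE account_user (
    account_id INT NOT NULL,
    user_id INT NOT NULL,
    PRIMARY KEY (account_id, user_id),
    FOREIGN KEY (account_id) REFERENCES account(account_id),
    FOREIGN KEY (user_id) REFERENCES user(user_id)
);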
