I'm trying to define the right schema / table for our scenario:
We have a few hundred eCommerce sites, each with a unique siteId.
Each site has its own end users, up to 10M unique users per month. Each user has a unique userId.
Each end user interacts with the site: viewing products, adding products to the cart, and purchasing products (we call these user events). I want to store the activity of the last 30 days (or 180 days if possible).
Things to consider:
Site sizes are different! We have some "heavy" sites with 10M end users but we also have "light" sites with a few hundreds/thousands of users.
Events don't have unique ids.
Users can have more than one event at a time; for example, they can view a page with more than one product (but we could live without that to simplify).
Rough estimation: 100 Customers x 10M EndUsers x 100 Interactions = 100,000,000,000 rows (per month)
Writes are done in real time (when the event arrives at the server). Reads are much less frequent (about 1% of the events).
Events have some more metadata and different events (view/purchase/..) have different metadata.
Should we use keyspaces to separate the sites and manage one table per site, or keep all customers in one table?
How to define the key here?
+--------+---------+------------+-----------+-----------+-----------+
| siteId | userId | timestamp | eventType | productId | other ... |
+--------+---------+------------+-----------+-----------+-----------+
| 1 | Value 2 | 1501234567 | view | abc | |
| 1 | cols | 1501234568 | purchase | abc | |
+--------+---------+------------+-----------+-----------+-----------+
My query is: Get all events (and their metadata) of specific user. As I assumed above, around 100 events.
Edit 2: I guess it wasn't clear, but the uniqueness of users is per site; two different users might have the same id if they are on different sites.
If you want to query by the userid, then the userid should be the first part of your compound primary key (this is the partition key). Use a compound primary key to add clustering columns that you can query to return sorted results. I would suggest the following schema:
CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id bigint,
    PRIMARY KEY (userid, site_id, timestamp, product_id));
That should make queries like
SELECT * FROM user_events WHERE userid = 123 and site_id = 456;
quite performant. By adding the timestamp to the PK you can also easily LIMIT your queries to get only the top (latest) 1000 (or however many you need) events, provided the rows are read newest-first (for example with ORDER BY timestamp DESC), without running into performance issues caused by highly active users (or bots) with a very long history.
One thing to keep in mind: I would recommend using userid, or a composite of userid and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.
So an alternative design would look like this:
CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id bigint,
    PRIMARY KEY ( (userid, site_id), timestamp, product_id));
The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?
To point out one thing: the partition key (also called the row key) identifies a row. A row stays on a specific node. For this reason it is a good idea to keep the rows more or less the same size. A row with a couple of thousand or tens of thousands of columns is not really a problem. You will get problems if you have some rows with millions of columns and other rows with only 10-20 columns. That will cause the cluster to become unbalanced. Furthermore, it makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key).
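To put rough numbers on that imbalance, here is a back-of-the-envelope sketch in Python using the figures from the question (10M users on a heavy site, about 100 events per user per month). The 500-user figure for a light site is an assumed stand-in for "a few hundreds", and the function name is made up for illustration.

```python
# Rough partition-size estimate using the numbers from the question.
EVENTS_PER_USER = 100          # ~100 interactions per user per month
HEAVY_SITE_USERS = 10_000_000  # "heavy" site from the question
LIGHT_SITE_USERS = 500         # assumed value for a "light" site


def rows_per_partition(users_in_partition: int) -> int:
    """Rows that end up in one partition holding this many users."""
    return users_in_partition * EVENTS_PER_USER


# Partition key = site_id: one partition per site.
heavy_by_site = rows_per_partition(HEAVY_SITE_USERS)
light_by_site = rows_per_partition(LIGHT_SITE_USERS)

# Partition key = (userid, site_id): one partition per user per site.
by_user = rows_per_partition(1)

print(heavy_by_site, light_by_site, by_user)
```

Under these assumptions a site_id-only key gives partitions that differ by several orders of magnitude, while a (userid, site_id) key keeps every partition at roughly the same small size.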
Does that make sense to you? Maybe the excellent answer to this post will give you some more insights: difference between partition-key, composite-key and clustering-key. Furthermore, a closer look at this part of the DataStax documentation offers some more details.
Hope that helps.
My query is: Get all events (and their metadata) of specific user. As I assumed above, around 100 events.
So, you want all the events of a given user. Since each user has a unique id within a site, you can build the table using site_id and userid as the partition key and timestamp as the clustering key. Here is the table structure:
CREATE TABLE user_events_by_time (
userid bigint,
timestamp timestamp,
event_type text,
product_id bigint,
site_id bigint,
PRIMARY KEY ((site_id,userid), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC) ;
Now you can query all of a user's events in a given time range using the following query:
SELECT * from user_events_by_time WHERE site_id= <site_id> and userid = <user_id> and timestamp > <from_time> and timestamp < <to_time>;
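As a rough mental model of why this key works (not a description of how Cassandra stores data internally), the table can be pictured as a dictionary keyed by (site_id, userid) whose values are kept sorted by timestamp. The Python sketch below is only a toy; the function names and sample values are made up for illustration.

```python
from bisect import insort
from collections import defaultdict

# Toy in-memory model of the table: the partition key (site_id, userid)
# selects a bucket, and rows inside it are kept ordered by timestamp.
partitions = defaultdict(list)


def insert_event(site_id, userid, timestamp, event_type, product_id):
    # Store the negated timestamp first so the list stays newest-first,
    # mirroring CLUSTERING ORDER BY (timestamp DESC).
    insort(partitions[(site_id, userid)], (-timestamp, event_type, product_id))


def events_between(site_id, userid, from_time, to_time):
    """Events with from_time < timestamp < to_time, newest first."""
    return [
        (-neg_ts, event_type, product_id)
        for neg_ts, event_type, product_id in partitions[(site_id, userid)]
        if from_time < -neg_ts < to_time
    ]


insert_event(1, 42, 1501234567, "view", "abc")
insert_event(1, 42, 1501234568, "purchase", "abc")
insert_event(2, 42, 1501234569, "view", "xyz")  # same userid, other site

result = events_between(1, 42, 1501234560, 1501234570)
print(result)
```

The event for userid 42 on site 2 lives in a different bucket, which matches the requirement that user ids are only unique per site.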
Hope this solves your problem.
(First of all, I apologize for my bad English.)
I have a case like this:
I am currently having trouble with my web application. I built it for a certain company using CodeIgniter 3.
I built the database using MariaDB, and every table uses an auto-increment id as its primary key. I usually deploy the web app to a cloud server (sometimes the company has its own dedicated server, sometimes not). One day, a company told me they don't want to deploy the app to the cloud (for security purposes, they said).
This company wanted to deploy the app on each employee's PC in the office, with the PCs not connected to each other (i.e. stand-alone PCs / employees' laptops). They said that every 5 months they would collect all of the data from the employees' computers into the company's data center, which of course is not connected to the internet. I told them that this is not a good way to store their data (the ids will collide when I try to merge all of the data into one database, since the id column of every table is an auto-increment primary key). Unfortunately, the company still wants to keep the app that way, and I don't know how to solve this.
They have at least 10 employees who will use this web app, so I have to deploy the app to 10 PCs individually.
Additional info: each employee has a unique id assigned by the company, and I also created an auto_increment id for each employee, as in the table below:
id | employee_id | employee_name |
1 | 156901010 | emp1
2 | 156901039 | emp2
3 | 156901019 | emp3
4 | 156901015 | emp4
5 | 156901009 | emp5
6 | 156901038 | emp6
The problem is that whenever they fill in a form in the application, some of the tables store not the employee's company id but a new id that comes from the auto-increment column.
Take the electronic_parts table, for example. It has the following columns:
| id | electronic_part_name | kind_of_electronic_part_id |
If emp1 fills in the form in the web app, the table's content would look like this:
| id | electronic_part_name | kind_of_electronic_part_id |
| 1 | switch | 1 |
And if emp2 fills in the form in the web app, the table's content would look like this:
| id | electronic_part_name | kind_of_electronic_part_id |
| 1 | duct tape | 10 |
When I try to merge the contents of the tables in the data center, it falls apart because of the duplicate ids.
It gets worse when I think about the foreign keys in other tables, for example the customer_order table.
The customer_order table looks like this (just a sample, not the actual table, but similar). The cashier column holds the employee's auto-increment id, not the id the employee got from the company as described above.
| id | customer_name | electronic_parts_id | cashier |
| 1 | Henry | 1 | 10 |
| 2 | Julie | 2 | 9 |
Does anyone know how to solve this problem, or can someone recommend a good way to handle it?
NOTE: Each employee has their own database for the app, so the database is not centralized. It is a stand-alone database, which means I have to install the database on each employee's PC one by one.
This is an unconventional situation and you can have an unconventional solution.
I can suggest you two methods to solve this issue.
Instead of using autoincrement for the primary key, generate a UUID and use it as the primary key. Regarding the probability of duplicates
in random UUIDs: only after generating 1 billion UUIDs every second
for the next 100 years would the probability of creating just one
duplicate be about 50%.
In CodeIgniter you could do this with the following code snippet, which lets the database's UUID() function generate the value.
$this->db->set('id', 'UUID()', FALSE);
This generates a 36-character key (32 hexadecimal digits plus 4
dashes).
ac689561-f7c9-4f7e-be94-33c6c0fb0672
As you can see, the string contains dashes. The CodeIgniter DB
function will insert it into the database with the dashes, and it will
still work. If that does not look clean, you could remove the dashes
and convert the string to a 32-character key.
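You can use the following function with the help of the [CodeIgniter
UUID library][1].
function uuid_key() {
    $this->load->library('uuid');
    // Generate a v4 (random) UUID
    $id = $this->uuid->v4();
    // Strip the dashes to get 32 hex characters
    $id = str_replace('-', '', $id);
    // Leave escaping on so the value is quoted as a string in the query
    $this->db->set('id', $id);
}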
Now we have a 32-character key,
ac689561f7c94f7ebe9433c6c0fb0672
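For comparison, the same idea can be sketched outside CodeIgniter with Python's standard uuid module. This is only an illustration of the two key formats, not part of the PHP solution; the helper name new_row_id is made up.

```python
import uuid


def new_row_id() -> str:
    """Return a random (version 4) UUID as 32 hex characters, no dashes."""
    return uuid.uuid4().hex


dashed = str(uuid.uuid4())   # 36 characters, 4 of them dashes
compact = new_row_id()       # 32 hexadecimal characters

print(dashed, len(dashed))
print(compact, len(compact))
```

Because each PC generates its ids independently and at random, rows created on different machines can be loaded into one table without renumbering.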
An alternative unconventional method to tackle the situation is to
add a function that logs all INSERT, UPDATE, and DELETE queries
processed in the site to a local file. This way, each local
installation generates a log file with the actual list of
queries that modified the DB over time, in the right sequential order.
At any point in time, the state of the database is the result of the
set of all those queries that happened in the past up to that date.
So every 5 months, when you are ready to collect data from the
employees' personal computers, instead of taking a data dump, take this
file with the full query log. (Note: such a query log won't contain
auto-increment ids, as they are created only at the moment the query
is executed against a database.)
Use these files to import the data into your data center. This will not
conflict, as the autoincrement values are generated in your data center
at import time. (Hopefully you will not have to link your local
installations to the data center at any point in the future.)
[1]: https://github.com/Repox/codeigniter-uuid
Is that id used in any other tables? It would probably be involved in a JOIN. If so, you have a big problem of unraveling the ids.
If the id is not used anywhere else, then the values are irrelevant, and the rows can be renumbered. This would be done (roughly speaking) by loading the data from the various sources into the same table, but not include the id in the load.
Or, if there is some other column (or combination of columns) that is UNIQUE, then make that the PRIMARY KEY and get rid of id.
Which case applies? We can pursue in more detail. Please provide SHOW CREATE TABLE for any table(s) that are relevant.
In my first case (where id is used as a FK elsewhere), do something like this:
While inserting the rows into the table with id, increment the values by enough to avoid colliding with the existing ids. Then do (in the same transaction):
UPDATE the_other_table SET fk_id = fk_id + same_increment.
Repeat for each other table and each id, as needed.
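The shift-and-update idea above can be sketched as follows. This is a minimal illustration using Python's sqlite3 module rather than MariaDB, with simplified versions of the question's electronic_parts and customer_order tables; the helper names and the choice of MAX(id) as the shift are assumptions.

```python
import sqlite3


def make_db(parts, orders):
    """Create an in-memory database with the two sample tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE electronic_parts (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE customer_order (
            id INTEGER PRIMARY KEY,
            customer_name TEXT,
            electronic_parts_id INTEGER REFERENCES electronic_parts (id)
        );
    """)
    db.executemany("INSERT INTO electronic_parts VALUES (?, ?)", parts)
    db.executemany("INSERT INTO customer_order VALUES (?, ?, ?)", orders)
    return db


def merge_into(central, source):
    """Copy source rows into central, shifting ids past the existing ones."""
    part_shift = central.execute(
        "SELECT COALESCE(MAX(id), 0) FROM electronic_parts").fetchone()[0]
    order_shift = central.execute(
        "SELECT COALESCE(MAX(id), 0) FROM customer_order").fetchone()[0]
    with central:  # one transaction: commit on success, roll back on error
        for pid, name in source.execute("SELECT id, name FROM electronic_parts"):
            central.execute("INSERT INTO electronic_parts VALUES (?, ?)",
                            (pid + part_shift, name))
        for oid, cust, pid in source.execute(
                "SELECT id, customer_name, electronic_parts_id FROM customer_order"):
            # The foreign key gets the same shift as the ids it points to.
            central.execute("INSERT INTO customer_order VALUES (?, ?, ?)",
                            (oid + order_shift, cust, pid + part_shift))


central = make_db([(1, "switch")], [(1, "Henry", 1)])
laptop = make_db([(1, "duct tape")], [(1, "Julie", 1)])
merge_into(central, laptop)

merged = central.execute("""
    SELECT o.id, o.customer_name, p.name
    FROM customer_order o JOIN electronic_parts p ON p.id = o.electronic_parts_id
    ORDER BY o.id
""").fetchall()
print(merged)
```

Both source databases used id 1 for different parts, yet after the merge each order still joins to the part it originally referenced.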
I think your problem comes from your database; you didn't design it well.
It's a bug if two different users have the same id.
If you just made the id field unique in your database, then two employees couldn't have the same id, so your problem is in your table design.
Just define your id field like this and your problem will be solved.
CREATE TABLE [YOUR TABLE NAME](
[ID] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
....
Is the id required to be an integer? If not, maybe you can use a prefix on the id so that each employee's input will be unique overall. That means you have to give up the auto increment and just do a count on the table data (assuming you're not deleting any of the records).
You may need to write some PHP code to handle this. If the other tables already rely on a unique/primary key, then that is fine.
You can also do it after import.
like this
Find duplicates in the same table in MySQL
My database tables hold a number of keys (sensitive information) which are encrypted. These keys are associated with users via an ID field. At any time I may need to invalidate a user by updating their ID field, making them no longer identifiable. However, I don't want to completely remove the row from the database. Instead I would like to keep it for audit purposes.
Is there a common convention I can follow for this, or is simply adding a string with some random content to the ID field being invalidated sufficient?
E.g
Table before invalidate request
| ID | KEY |
------------------------
| user123 | yiuy321ui |
Table after invalidate request
| ID | KEY |
--------------------------------------
| legacy_79878_user123 | yiuy321ui |
I would avoid changing any ID field of any table dynamically. Not only does it defy convention and best practices, but you will likely break associations with other tables which look up/join on that field. I suggest adding a simple boolean field to your table, and setting that field to true or false to track a user's validity.
Updating the user ID is not really a very good way of doing this. What you want is to be able to say 'this user is not active' anymore, so it would seem to make sense to have an Active bit field on your user table.
You may need to update your code where it validates your user to check for 'active' users only, but this will be easier in the long run (and also make it easier to re-enable a user if you need to).
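A minimal sketch of the active-flag approach, using Python's sqlite3 module for illustration. The table name user_keys, the column names enc_key and active, and the helper functions are all assumptions, not taken from the question.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE user_keys (
        id      TEXT PRIMARY KEY,
        enc_key TEXT NOT NULL,
        active  INTEGER NOT NULL DEFAULT 1  -- 1 = valid, 0 = invalidated
    )
""")
db.execute("INSERT INTO user_keys (id, enc_key) VALUES (?, ?)",
           ("user123", "yiuy321ui"))


def invalidate(user_id):
    """Mark the user inactive; the row and its ID stay intact for audits."""
    db.execute("UPDATE user_keys SET active = 0 WHERE id = ?", (user_id,))


def find_active(user_id):
    """Lookup used by normal application code: only active users match."""
    return db.execute(
        "SELECT id, enc_key FROM user_keys WHERE id = ? AND active = 1",
        (user_id,),
    ).fetchone()


before = find_active("user123")
invalidate("user123")
after = find_active("user123")
audit_rows = db.execute("SELECT id, enc_key, active FROM user_keys").fetchall()
print(before, after, audit_rows)
```

Re-enabling a user is then a single UPDATE that sets active back to 1, and joins on the ID keep working because the ID never changes.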
I'm used to using identity columns and having pk's generated for me. I'm wondering, is there any way to scope an identity column or otherwise compute it to be unique within a foreign key?
For example, using identity column a table might look like this:
| UserId | WidgetId | <-- Table Widget, WidgetId is identity column & pk
---------------------
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 4 |
| 2 | 5 |
| 2 | 6 |
What if I want to achieve something more like the following:
| UserId | WidgetId | <-- Table Widget, both UserId and WidgetId compose the pk
---------------------
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
I know EF allows for database-generated column values, like identity and computed. My question is, how might one achieve a computed column like this?
I realize it can be done in the application, by finding the highest WidgetId a user has and incrementing by one. But this doesn't feel like it's the right way.
I imagine it might also be possible to do this with a trigger, but adding a trigger to the db from the DbContext would place a dependency on the RDBMS, right?
Is there a better way, something that takes less of a dependency on which backend EF is using?
Update
To be more clear, in the model above, there is an identifying relationship between Widget and User on purpose. UserId is the primary key of the User table, and by having the FK to User as part of Widget's PK, this means that Widgets are never shared between Users. When a User is deleted, all of the Widgets get deleted along with it. Querying Widgets by WidgetId alone makes no sense; To query for a particular Widget, both the UserId and WidgetId parameters are necessary.
I'm exploring this because I recall reading in Evans DDD that non-root entity id's in an aggregate only need to be unique within the aggregate. The above is an example of this case:
ENTITIES other than the root have local identity, but that identity
needs to be distinguishable only within the AGGREGATE, because no
outside object can ever see it out of the context of the root ENTITY.
(Domain-Driven Design, Evans)
Update 2
After Gorgan's answer, I have implemented something like the following in my widget factory class. Is this the correct approach to achieve the result desired above?
public class WidgetFactory : BaseFactory
{
    private object _sync = new object();

    internal WidgetFactory(IWriteEntities entityWriter) : base(entityWriter)
    {
        // my DbContext class implements the IWriteEntities interface
    }

    public Widget CreateOrUpdate(User user, int? widgetId, string prop1,
        bool prop2, string prop3)
    {
        Widget widget = null;
        if (!widgetId.HasValue)
        {
            // when widgetId is null do a create (construct & hydrate), then
            EntityWriter.Insert(widget); // does DbEntityEntry.State = Added
        }
        else
        {
            // otherwise query the widget from EntityWriter to update it, then
            EntityWriter.Update(widget); // does DbEntityEntry.State = Modified
        }

        // determine the appropriate WidgetId & save
        lock (_sync)
        {
            widget.WidgetId = widgetId.HasValue ? widget.WidgetId
                : EntityWriter.Widgets.Where(w => w.UserId == user.Id)
                    .Max(w => w.WidgetId) + 1;
            EntityWriter.SaveChanges(); // does DbContext.SaveChanges()
        }
        return widget;
    }
}
Widget
I suppose I should not have disguised this term. The actual entity is WorkPlace. I am building an app to keep track of where I work during the year, as I have to submit an itinerary with my municipal tax refund form every year. When done, I plan to publish it to the cloud so others can use it for free. But of course I want their WorkPlaces to be completely isolated from mine.
Users are automatically created for every visitor using anonymous identification / Request.AnonymousID (there will be a register feature later, but it is not necessary to try out a demo). The web app will have restful url's like /work-places/1, /work-places/2, etc. Because every request is guaranteed to have a user, the user id need not be identified in the url. I can identify users from Request.AnonymousID when nothing is present in User.Identity.Name.
If WorkPlaceId was an identity column, my controller actions would first need to check to make sure the user owns the workplace before displaying it. Otherwise I could hack the URL to see every WorkPlace that every user has set up in the system. By making the WorkPlaceId unique only for the user, I need not worry about it. The URL /work-places/1 will display entirely different data to 2 different users.
The purpose of an identity column is to have some ID in a table when you need a column that will be unique no matter what the other column values are, and SQL Server supports auto-generation for only one identity column per table (no multiple or composite auto-generated primary keys).
If every widget is different (even when the same widget is used by more than one user), then this primary key makes sense, but it cannot be computed in SQL Server (unless you use a stored procedure or other db programmability for the insert). And you can only read an auto-generated column with EF (see How should I access a computed column in Entity Framework Code First?).
If, however, you have multiple widget types (in this example widget types 1, 2 and 3), you could create a WidgetType table (or WidgetInstance, WidgetSomething, whatever: an entity that describes the part of a widget not affected by the user) with an auto-generated ID, and use that ID in this table as an FK and as part of the primary key. (EDIT: WidgetId could be an FK to the WidgetType table where Id is auto-generated, so you would have a PK composed of 2 auto-generated IDs.) That would allow both UserId and WidgetId to be auto-generated and to already exist at the moment of insert into this table.
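For reference, the application-side numbering mentioned in the question (find the user's highest WidgetId and add one) can be sketched as below. This uses Python's sqlite3 module rather than EF and SQL Server, so it only illustrates the logic; the table layout and the add_widget helper are assumptions. The read and the insert share one write transaction, and the composite primary key rejects any duplicate that slips through.

```python
import sqlite3

# isolation_level=None: transactions are managed by hand with BEGIN/COMMIT.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("""
    CREATE TABLE widget (
        user_id   INTEGER NOT NULL,
        widget_id INTEGER NOT NULL,
        name      TEXT,
        PRIMARY KEY (user_id, widget_id)
    )
""")


def add_widget(user_id, name):
    """Insert a widget with the next widget_id for this user and return it."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading MAX
    try:
        next_id = db.execute(
            "SELECT COALESCE(MAX(widget_id), 0) + 1 FROM widget WHERE user_id = ?",
            (user_id,),
        ).fetchone()[0]
        db.execute("INSERT INTO widget VALUES (?, ?, ?)", (user_id, next_id, name))
        db.execute("COMMIT")
        return next_id
    except Exception:
        db.execute("ROLLBACK")
        raise


ids = [add_widget(1, "a"), add_widget(1, "b"), add_widget(2, "c"), add_widget(1, "d")]
print(ids)
```

COALESCE covers the case where a user has no widgets yet, which is where a bare MAX over an empty set would otherwise yield no usable value.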
I hate to oversimplify and apologize for doing so. Whenever this question comes up, it smells like a database design issue.
Widgets and users are both considered master data. The widget table should only contain attributes about the widget and the user table should only contain attributes about the user. If the widget is related to a user, it would probably be ok to include the userid in the widget table but the widgetid should be the identity and the userid would be an FK.
The linkage of the widget and user should be done outside of these master tables. If they are related to a bill/invoice, it's common to use a combination of an invoice header table (invoiceid, userid, datetime, total cost, total price...) and an invoice line table (invoicelineid, invoiceid, widgetid, unit cost, unit price, quantity...)
As the data volume grows, a little bit of normalization will ensure that your app will scale. Additionally, these strong relationships will help with data quality when it's time to turn this data into business intelligence.
What's the best method of storing a large number of booleans in a database table?
Should I create a column for each boolean value or is there a more optimal method?
Employee Table
IsHardWorking
IsEfficient
IsCrazy
IsOverworked
IsUnderpaid
...etc.
I don't see a problem with having a column for each boolean. But if you foresee any future expansion, and want to use the table only for booleans, then use a 2-column table with VARIABLE and VALUE columns, with a row for each bool.
If the majority of employees will have the same values across a large sample size, it can be more efficient to define a hierarchy, allowing you to establish default values representing the norm, and override them per employee if required.
Your employee table no longer stores these attributes. Instead I would create a definition table of attributes:
| ATTRIBUTE_ID | DESCRIPTION | DEFAULT |
| 1 | Is Hard Working | 1 |
| 2 | Is Overpaid | 0 |
Then a second table joining attributes to Employees:
| EMPLOYEE_ID | ATTRIBUTE_ID | OVERRIDE |
| 2 | 2 | 1 |
Given two employees, employee with ID 1 doesn't have an override entry, and thus inherits the default attribute values (is hard working, is not overpaid), however employee 2 has an override for attribute 2 - Is Overpaid, and is thus both hard working and overpaid.
For integrity you could place a unique constraint on the EMPLOYEE_ID and ATTRIBUTE_ID columns in the override table, enforcing you can only override an attribute once per employee.
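The default-plus-override lookup described above could be queried roughly like this. It is a sketch in Python with sqlite3; the table names attribute and employee_attribute and the columns default_value and override_value are illustrative renamings of the tables shown earlier.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE attribute (
        attribute_id  INTEGER PRIMARY KEY,
        description   TEXT NOT NULL,
        default_value INTEGER NOT NULL
    );
    CREATE TABLE employee_attribute (
        employee_id    INTEGER NOT NULL,
        attribute_id   INTEGER NOT NULL REFERENCES attribute (attribute_id),
        override_value INTEGER NOT NULL,
        UNIQUE (employee_id, attribute_id)  -- one override per attribute
    );
    INSERT INTO attribute VALUES (1, 'Is Hard Working', 1), (2, 'Is Overpaid', 0);
    INSERT INTO employee_attribute VALUES (2, 2, 1);
""")


def attributes_for(employee_id):
    """Effective value per attribute: the override if present, else the default."""
    rows = db.execute("""
        SELECT a.description, COALESCE(ea.override_value, a.default_value)
        FROM attribute a
        LEFT JOIN employee_attribute ea
               ON ea.attribute_id = a.attribute_id AND ea.employee_id = ?
        ORDER BY a.attribute_id
    """, (employee_id,)).fetchall()
    return {description: bool(value) for description, value in rows}


print(attributes_for(1))
print(attributes_for(2))
```

The LEFT JOIN keeps every attribute in the result even when the employee has no override row, and COALESCE then falls back to the default.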
Something to consider: how often will you be adding/changing/removing these booleans? If they're not likely to change then you'll probably like having them as individual columns. Many databases will probably pack them for you, especially if they're adjacent in the row, so they'll be stored efficiently.
If, on the other hand, you see yourself wanting to add/change/remove these booleans every once in a while you might be better served by something like (excuse PostgreSQL-isms and shoddy names):
CREATE TABLE employee_qualities (
id SERIAL8 PRIMARY KEY,
label TEXT UNIQUE
);
CREATE TABLE employee_employee_qualities (
employee_id INT8 REFERENCES employee (id),
quality_id INT8 REFERENCES employee_qualities (id),
UNIQUE (employee_id, quality_id)
);
A column for each is the best representation of your business requirements. You could combine a bunch of bools into a single int column and use bit masks to read the values, but this seems unnecessarily complex, and is something I would consider only if there was some high-end performance need for it.
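For readers unfamiliar with the bit-mask alternative mentioned above, here is a small Python sketch of what packing the booleans into one integer looks like. The flag names follow the question's column list; the bit assignments are arbitrary assumptions.

```python
from enum import IntFlag


class Quality(IntFlag):
    """One bit per boolean; the combined value fits in one integer column."""
    HARD_WORKING = 1
    EFFICIENT = 2
    CRAZY = 4
    OVERWORKED = 8
    UNDERPAID = 16


# Pack several booleans into the single integer that would be stored.
stored = int(Quality.HARD_WORKING | Quality.UNDERPAID)
initial = stored

# Read one flag back with a bitwise AND.
is_underpaid = bool(stored & Quality.UNDERPAID)
is_crazy = bool(stored & Quality.CRAZY)

# Set one flag and clear another, then convert back to a plain int.
stored = int(stored | Quality.CRAZY)
stored = int(stored & ~Quality.HARD_WORKING)

print(initial, stored, is_underpaid, is_crazy)
```

Every read and write needs this kind of bit arithmetic, and the database can no longer index or constrain an individual flag, which is the extra complexity the answer warns about.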
Also, if you are using SQL Server, up to 8 bit fields get combined internally into a single byte, so the performance thing is sort of done for you already. (I don't know if other dbs do this.)