Star Schema Design for Social Media - data-modeling

I am new to dimensional modeling and have read a lot of material (star schemas, dimension/fact tables, SCDs, Ralph Kimball's The Data Warehouse Toolkit, etc.), so I have a good conceptual understanding of dimensional modeling constructs, but I am finding it hard to apply them to a use case due to lack of experience and need some guidance.
Consider Twitter, for example. I want to design a dimensional model to calculate:
DAU (Daily active users) = number of users who logged in and accessed Twitter via the website or mobile app on a given day
MAU (Monthly active users) = number of users who logged in and accessed Twitter via the website or mobile app in the last 30 days, including the measurement date
User engagement = total(clicks + favorites + replies + retweets) on a tweet
Each of these metrics over a period (like a month) is the summation of the metric on each day in that period.
I want to write SQL to calculate these metrics for every quarter by region (e.g. US and rest of world) and to calculate year-over-year growth (or decline) in these metrics.
Eg: [example chart omitted]
Here are some details that I thought about -
Factless (transaction) fact table for user login activity, with a grain of 1 row per user per login: user_login_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key)
Factless (transaction) fact table for user activity, with a grain of 1 row per user per activity: user_activity_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key, post_key, activity_type_key)
Does this sound correct? What should my model look like? What other dimensions/facts can I add here?
I wonder if I should collapse these 2 tables into 1 and record logins with an activity_type of 'login', but there can be a huge number of logins without any other activity, so this would skew the data. Am I missing anything else?
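For illustration, a quarterly DAU-by-region query against user_login_fact_schema could look something like the sketch below (dimension attribute names such as calendar_date, year, quarter and region are assumptions, not part of the model yet):

WITH daily_dau AS (
    SELECT d.calendar_date,
           d.year,
           d.quarter,
           l.region,
           COUNT(DISTINCT f.user_dim_key) AS dau
    FROM   user_login_fact_schema f
    JOIN   date_dim          d ON d.date_dim_key          = f.date_dim_key
    JOIN   user_location_dim l ON l.user_location_dim_key = f.user_location_dim_key
    GROUP BY d.calendar_date, d.year, d.quarter, l.region
)
-- Quarterly DAU = sum of the daily numbers, per the definition above.
SELECT year, quarter, region, SUM(dau) AS quarterly_dau
FROM   daily_dau
GROUP BY year, quarter, region;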

Your model seems correct; it answers the questions in the graph you posted.
It could make sense to aggregate those two fact tables into one fact table joined with a "UserAction" dimension, mostly because a login can be interpreted as just another user action.
However, having separate fact tables, each focused on one metric (or process), may be preferable because it lets you introduce measures/metrics into the tables later, i.e. when your fact tables stop being factless. It also spares you a join with another dimension (UserAction), but that is becoming less relevant these days, as storage and DB processing power keep getting cheaper.
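A rough sketch of that combined alternative, with an assumed user_action dimension holding rows such as 'login', 'click', 'favorite', 'reply' and 'retweet' (all names illustrative):

CREATE TABLE user_action_dim (
    user_action_dim_key INT PRIMARY KEY,
    action_name         VARCHAR(50) NOT NULL  -- 'login', 'click', 'favorite', ...
);

CREATE TABLE user_action_fact (
    user_dim_key          INT NOT NULL,
    date_dim_key          INT NOT NULL,
    user_location_dim_key INT NOT NULL,
    access_method_dim_key INT NOT NULL,
    user_action_dim_key   INT NOT NULL,
    post_key              INT NULL             -- NULL for plain logins
);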

You should keep the data in different tables to make sure you don't mix different grains.
user_login_fact_schema can be a materialized view based on user_activity_fact_schema, filtering for activity type = login and including some logic to exclude duplicates (i.e. one login per user per day, if you are talking about daily active users).
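As a sketch (materialized-view syntax varies by engine; this is PostgreSQL-flavored, and activity_type_dim / activity_type_name are assumed names):

CREATE MATERIALIZED VIEW user_login_fact_schema AS
SELECT DISTINCT
       f.user_dim_key,
       f.date_dim_key,
       f.user_location_dim_key,
       f.access_method_dim_key
FROM   user_activity_fact_schema f
JOIN   activity_type_dim t ON t.activity_type_key = f.activity_type_key
WHERE  t.activity_type_name = 'login';
-- DISTINCT collapses repeat logins at this grain (user / day / location / access method).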

Related

Should I use nested data structures in SQL?

I have a fairly large database in SQL Server. To illustrate my use case, suppose I have a mobile game and I want to report on user activity.
To start with I have a table that looks like:
userId  date        # Sessions  Total Session Duration
1       2021-01-01  3           55
1       2021-01-02  9           22
2       2021-01-01  6           43
I am trying to "add" information about each session to this data. The options I'm considering are:
Add the session data as a new column containing a JSON array with the data for each session
Create a table with all session data indexed by userId & date - and query this table as needed.
Is this possible in SQL Server? (my experience is coming from GCP's BigQuery)
Your question boils down to whether it is better to use nested data or to figure out a system of tables where every column of every table has a simple domain (text string, number, date, etc.).
It turns out that this question was being pondered by Ed Codd fifty years ago when he was proposing the first database system based on the relational model. He decided that it was worthwhile restricting relational databases to Normal Form, later renamed First Normal Form. He proved to his own satisfaction that this restriction wouldn't reduce the expressive power of the relational model, and it would make it easier to build the first relational database manager.
Since then, just about every relational or SQL database has conformed to First Normal Form, although there are ways to get around the restriction by storing one of various forms of data structures in one column of a table. JSON is an example.
You'll gain the flexibility you get with JSON, but you will lose the ability to specify the data you want to retrieve using the various clauses of the SELECT statement, clauses like INNER JOIN or WHERE, among others. This loss could be a deal killer.
If it were me, I would go with the added-table approach, and analyze the session data down into one or more tables with simple columns. But you may find that JSON decoders are just as powerful, and that doing full table scans is worth the time taken.
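A minimal sketch of the two options, with assumed names (dbo.UserSession, dbo.UserActivity, sessionsJson); OPENJSON is available from SQL Server 2016 onward:

-- Option 2: a separate session table, one row per session
CREATE TABLE dbo.UserSession (
    userId          INT       NOT NULL,
    sessionDate     DATE      NOT NULL,
    sessionStart    DATETIME2 NOT NULL,
    durationSeconds INT       NOT NULL
);

-- The daily summary above then becomes a plain aggregate
SELECT userId, sessionDate,
       COUNT(*)             AS Sessions,
       SUM(durationSeconds) AS TotalSessionDuration
FROM   dbo.UserSession
GROUP BY userId, sessionDate;

-- Option 1: a JSON array column can still be queried, but every query must shred it
SELECT a.userId, a.[date], s.durationSeconds
FROM   dbo.UserActivity a
CROSS APPLY OPENJSON(a.sessionsJson)
     WITH (durationSeconds INT '$.durationSeconds') s;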

What storage mechanism can I use to store the data related to user interaction of my website for a day

I store information about which items were accessed. That's it initially. I will store the id and type of each item that was accessed. For example, in a relational table it would be:
id type view
1 dairy product 100
2 meat 88
Later on, at the end of the day, I will transfer this data to the actual products table.
products
id name view
1 Cheesy paradise 100
This is a website, and I don't want to update the table every time a user visits a product, because the products are in a relational database and it would be very unprofessional. I want to build a service in Node.js so that when a user visits a product, stays for 5 seconds, and scrolls the page to the bottom, I increment a counter in a high-speed store, and at the end of the day I update the related products in "one go".
I will handle only 300 visits across different products a day, but of course I want my system to grow, and it should handle keeping track of a thousand products per minute, for example. When I thought about this feature, Mongo came to mind, but I don't know; it seems like a lot for such a simple task. What technology fits this situation better?
I would recommend MongoDB, since you are mostly "dumping" data into a database. That also allows you to dump more information in the future than you do now, no matter what kind of documents you dump now. Mongo is totally fine for a "dump" database structure.
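Whichever store holds the daily counters, the end-of-day transfer into the relational products table can be a single statement. A sketch (PostgreSQL-style UPDATE ... FROM, with an assumed staging table daily_product_views):

UPDATE products p
SET    view = p.view + s.view_count
FROM   daily_product_views s   -- today's accumulated counters, loaded from the fast store
WHERE  s.product_id = p.id;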

Filtering Functionality Similar to Ebay SQL Count Issue

I am stuck on a database problem for a client, wondering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site and the database is quite large (500,000+ records). The scenario is this: a user goes to our website, types in 'laptop', and clicks search. This brings up the first page of several thousand results. What I want to do is then filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the database are like this - the primary key is an integer ID
ID Title Manufacturer Colour
The key part of the question is how to get the counts in various categories in an efficient manner. The only way I currently know how to do it is with separate queries. However, should we wish to filter by further categories then this will become very slow - especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is whether I can get the results as a single table using some kind of join or union, and whether this would be faster than my current method of issuing multiple queries with the COUNT(*) function. Thanks for your help; if you require any further information please ask.
PS: I am wondering how sites like eBay and Amazon manage to do this so fast. To understand my problem better, if you go onto eBay and type in laptop you will see a number of filters on the left; this is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g. to get functionality equivalent to eBay I would need about 10 queries, and I'm sure that will be slow. I was thinking of creating an intermediate table with all the counts, but the intermediate table would have to be continuously updated to reflect changes to the database, and that would be a problem if there are multiple updates per minute. Thanks.
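On the single-query part: SQL Server 2008 supports GROUPING SETS, so both facet counts can come back from one statement. A sketch (the commented-out predicate is just a placeholder for your existing search filter):

SELECT Manufacturer,
       Colour,
       COUNT(*)               AS ResultCount,
       GROUPING(Manufacturer) AS IsColourRow   -- 1 = this row is a Colour count
FROM   [ProductDB.Product]
-- WHERE Title LIKE '%laptop%'                 -- existing search filter would go here
GROUP BY GROUPING SETS ((Manufacturer), (Colour));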
The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog thru stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search... if so, how will you manage multiple keywords that result in an unexpected intersection of different categories?
One way is to save the keywords searched along with the counts and a time stamp. Then, the next time someone searches on the same keywords, check the intermediate table and if the time stamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new time stamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records.
In this case, you might get some quirky front-end count behavior, like it might say "10 results" in a particular category but then when the user drills down, they actually find 9 or 11. It's happened to me on different sites as a customer and it's really not a big deal.
BTW, I used to work for a well-known e-commerce company and we did things like this.
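A sketch of the keyword-count cache described above (all names assumed):

CREATE TABLE dbo.SearchFacetCache (
    Keywords    NVARCHAR(200) NOT NULL,
    FacetName   NVARCHAR(50)  NOT NULL,   -- 'Manufacturer' or 'Colour'
    FacetValue  NVARCHAR(100) NOT NULL,
    ResultCount INT           NOT NULL,
    RefreshedAt DATETIME      NOT NULL,
    CONSTRAINT PK_SearchFacetCache PRIMARY KEY (Keywords, FacetName, FacetValue)
);
-- At request time: if RefreshedAt for these keywords is older than ~5 minutes,
-- recompute the counts (e.g. with the GROUPING SETS query above), overwrite the
-- rows with a new time stamp, and return them; otherwise serve the cached rows.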

How to calculate "active users" and monitor stickiness?

Currently I was doing this:
I have a table called tracking_log where I insert a row whenever a user uses our app. We use this table to calculate users that are sticking with us and continuing to use the software.
This was working OK. But now I also want to see the number of active users for:
each day since launch
week-wise since launch
month-wise since launch.
The table structure is something like:
tracking_log
- user_id (integer, pk)
- when_used (timestamp)
- event (event-type triggered by user. Kept for future usage.)
Our definition for "Active Users On a Day d1": users who had signed up earlier than (d1-15) days and have used the product within (d1-7) days.
The tracking_log table has around 500K records and counting. I was writing MySQL queries to calculate the above numbers but they are turning out to be very slow.
What is the best way to implement it? Are there any existing solutions to generate such reports with less effort?
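For what it's worth, a minimal sketch of the count for a single measurement date d1 under that definition, assuming a users table with a signed_up_on column (not shown in the post):

SET @d1 = '2024-01-31';   -- the measurement date (example value)

SELECT COUNT(DISTINCT t.user_id) AS active_users
FROM   tracking_log t
JOIN   users u ON u.id = t.user_id
WHERE  u.signed_up_on < DATE_SUB(@d1, INTERVAL 15 DAY)   -- signed up earlier than d1-15
  AND  t.when_used   >= DATE_SUB(@d1, INTERVAL 7 DAY)    -- used the product within d1-7
  AND  t.when_used    < DATE_ADD(@d1, INTERVAL 1 DAY);

An index on tracking_log (when_used, user_id) is usually what keeps a query like this fast.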

What Database / Technology should be used to calculate unique visitors in time scope

I've got a performance problem with my reporting database (tables with millions of records, 50+) when I want to calculate a distinct count on a column that indicates visitor uniqueness, let's say some hashkey.
For example:
I have these columns:
hashkey, name, surname, visit_datetime, site, gender, etc...
I need to get the distinct count over a time span of 1 year, in less than 5 seconds:
SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
This query will be fast for short time ranges, but if the range is longer than one month it can take more than 30 seconds.
Is there a better technology to calculate something like this than relational databases?
I'm wondering what Google Analytics uses to calculate its unique visitors on the fly.
For reporting and analytics, the kind of thing you're describing, these statistics tend to be pulled out, aggregated, and stored in a data warehouse. They are stored in a form chosen for query performance rather than the normalized relational structures optimized for OLTP (online transaction processing). This pre-aggregated approach is called OLAP (online analytical processing).
You could have another table store the count of unique visitors for each day, updated daily by a cron function or something.
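One caveat: per-day counts alone can't simply be summed into a yearly unique count (returning visitors would be double-counted), so the rollup is usually kept at one row per (day, hashkey). A sketch, with an assumed source table named visits:

CREATE TABLE daily_visitors (
    visit_date DATE        NOT NULL,
    hashkey    VARCHAR(64) NOT NULL,
    PRIMARY KEY (visit_date, hashkey)
);

-- Daily load job (bind :day_start / :day_end to the day being loaded)
INSERT INTO daily_visitors (visit_date, hashkey)
SELECT DISTINCT CAST(visit_datetime AS DATE), hashkey
FROM   visits
WHERE  visit_datetime >= :day_start
  AND  visit_datetime <  :day_end;

-- The year-long unique-visitor count now scans far fewer rows
SELECT COUNT(DISTINCT hashkey)
FROM   daily_visitors
WHERE  visit_date BETWEEN '2023-01-01' AND '2023-12-31';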
Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.
All GA analytics parameters are packed into the Request URL, e.g.,
http://www.google-analytics.com/_utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Within that URL is a piece keyed to __utmcc; these are the GA cookies. Within __utmcc is a string keyed to __utma, which is a string comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:
__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1
In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier of each visitor.
So you can see the flaws of this technique for identifying unique visitors: entering the site using a different browser, or a different device, or after deleting the cookies will cause you to appear to GA as a new unique visitor (i.e., it looks for its cookies and doesn't find any, so it sets them).
There is an excellent article by the EFF on this topic, i.e., how uniqueness can be established, with what degree of certainty, and how it can be defeated.
Finally, one technique I have used to determine whether someone has visited our site before (assuming the hard case, which is that they have deleted their cookies, etc.) is to examine the client request for our favicon. The directories that store favicons are quite often overlooked, whether during a manual sweep or programmatically using a script.
