About Youtube views count - database

I'm implementing an app that keeps track of how many times a post is viewed, and I'd like to count views in a 'smart' way: the counter should not go up just because a user refreshes the browser.
So I decided to increase the view counter only if the combination of IP address and user agent (browser) is unique, which is working so far.
But then I thought: if YouTube does it this way, with videos that have thousands or even millions of views, their views table in the database would fill up with IP addresses and user agents.
That leads me to assume their video table has a counter cache for views (i.e. views_count). When a user clicks on a video, the IP and user agent are stored, and the counter cache column in the video table is increased.
Every time a video is clicked, YouTube would need to query the views table and count the number of entries. Won't this affect performance drastically?
Is this how they do it? Or is there a better way?

I would use client-side browser fingerprinting to identify unique viewers. This library seems to be getting significant traction:
https://github.com/Valve/fingerprintJS
I would also recommend Redis for anything to do with counts. Its atomic increment commands are easy to use and ensure that concurrent increments are not lost to race conditions.
This is the command you would use to increment your counters:
http://redis.io/commands/incr
The key in this case would be the browser fingerprint hash sent to you from the client. You could then keep a Redis set containing all browser fingerprints known to be associated with a given user_id (the key for the set would be the user_id).
Finally, if you really need to, you can run a cron job or other async process that dumps the view counts for each user into the counter cache field in your relational database.
You could also store user_id, browser fingerprint, and timestamps in a relational database (MySQL?) and periodically roll them up into a counter cache on your user table (probably via cron).
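A minimal sketch of the Redis idea above, in Python. It is one possible reading, not the exact design described: the set of fingerprints is keyed per post rather than per user_id, on the assumption that the goal is unique views per post. InMemoryStore is a hypothetical stand-in so the example runs without a server; its sadd and incr methods mimic the return values of the Redis SADD and INCR commands, and the key names are made up for illustration.

```python
class InMemoryStore:
    """Hypothetical stand-in for a Redis client, for illustration only."""

    def __init__(self):
        self.sets = {}
        self.counters = {}

    def sadd(self, key, member):
        # Like Redis SADD with one member: 1 if it was added, 0 if already present.
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def incr(self, key):
        # Like Redis INCR: add one and return the new value.
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


def record_view(store, post_id, fingerprint):
    """Count a view only the first time a fingerprint is seen for a post."""
    is_new = store.sadd(f"post:{post_id}:viewers", fingerprint)
    if is_new:
        store.incr(f"post:{post_id}:views")
    return bool(is_new)


store = InMemoryStore()
record_view(store, 42, "fingerprint-a")  # first visit: counted
record_view(store, 42, "fingerprint-a")  # refresh: ignored
record_view(store, 42, "fingerprint-b")  # different browser: counted
print(store.counters["post:42:views"])
```

With a real client such as redis-py, record_view could be handed the client object instead, since it only calls sadd and incr. Because SADD is atomic, only one of two concurrent requests carrying the same fingerprint sees a return value of 1.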

First of all, AFAIK YouTube uses BigTable, so don't worry about how they query the count; we don't know the exact structure of their database anyway.
Assuming you are on a relational model, create a view_count column, but do not update it on every refresh. Record the visits and update the cached count periodically.
Also, you can generate a hash from the IP, browser, date and any other information you use to detect a unique view, and store that instead of the whole data.
You can also use the session or a cookie to record which videos have been viewed. Since it will expire, memory shouldn't be much of a problem; I don't believe anyone views thousands of videos in one session.
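The hashing suggestion could look roughly like this in Python. The function name view_key, the choice of SHA-256, and the "|" separator are assumptions made for illustration; any stable hash over the same fields would serve.

```python
import hashlib
from datetime import date


def view_key(ip, user_agent, day):
    """Build a fixed-size key from the fields that define a unique view."""
    raw = f"{ip}|{user_agent}|{day.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


key = view_key("203.0.113.7", "Mozilla/5.0", date(2024, 1, 1))
print(len(key))  # a SHA-256 hex digest is always 64 characters
```

Storing only this key (for example in an indexed column) keeps each row the same size regardless of how long the user agent string is. Including the date means the same visitor counts again on a later day.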

If you want to store all the IPs and browsers, make sure you have enough DB storage space, add an index, and that's it.
If not, you can use the Rails session to store the list of videos a user has visited, and only increment a video's view_count attribute when the user visits a video that is not yet in the list.
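The session approach is framework-independent, so here is a small Python sketch of the logic rather than Rails code. The session is modelled as a plain dict and view_counts as an in-memory stand-in for the view_count column; both, along with the function name, are assumptions for illustration.

```python
def count_view_once(session, view_counts, video_id):
    """Increment the counter only if this session has not seen the video yet."""
    seen = session.setdefault("viewed_videos", [])
    if video_id in seen:
        return False
    seen.append(video_id)
    view_counts[video_id] = view_counts.get(video_id, 0) + 1
    return True


session = {}
view_counts = {}
count_view_once(session, view_counts, "video-1")  # counted
count_view_once(session, view_counts, "video-1")  # refresh, not counted
print(view_counts)
```

A list is used for the seen videos because sessions are usually serialized, and a list survives JSON encoding where a set would not.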

Related

Notion API Pagination Random Database Entry

I'm trying to retrieve a random entry from a database using the Notion API. There is a page limit on how many entries you can retrieve at once, so pagination is utilized to sift through the pages 100 entries at a time. Since there is no database attribute telling you how long the database is, you have to go through the pages in order until reaching the end in order to choose a random entry. This is fine for small databases, but I have a cron job going that regularly chooses a random entry from a notion database with thousands of entries. Additionally, if I make too many calls simultaneously I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, consider caching the pages; that saves a lot of execution time in your cron job. For the rate limit, if you use Node.js, you can build a rate-limited queue (3 requests/second) fairly easily with something like bull.
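The caching suggestion can be sketched as follows, in Python. fetch_page is a hypothetical callable standing in for the actual API request; it is assumed to return a dict with the "results", "has_more" and "next_cursor" fields that a paginated Notion database query response carries. The cron job would walk the pages once, keep the collected IDs, and pick from the cached list on later runs.

```python
import random


def collect_ids(fetch_page):
    """Walk every page once and return the IDs of all entries."""
    ids = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        ids.extend(entry["id"] for entry in page["results"])
        if not page["has_more"]:
            return ids
        cursor = page["next_cursor"]


def pick_random(cached_ids):
    """Choose one entry ID from the cached list without further requests."""
    return random.choice(cached_ids)


# Fake two-page response used only to exercise the loop.
fake_pages = {
    None: {"results": [{"id": "a"}, {"id": "b"}], "has_more": True, "next_cursor": "c1"},
    "c1": {"results": [{"id": "c"}], "has_more": False, "next_cursor": None},
}
cached = collect_ids(lambda cursor: fake_pages[cursor])
print(pick_random(cached))
```

How often the cached list is refreshed depends on how often entries are added or removed; that trade-off is not covered by this sketch.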

Efficiently search/exist() Firestore without exhausting free quota

I'm working on a side project and want to let my users check if their friends have accounts.
Currently I've implemented it like this:
1. Read phone contacts for emails
2. Loop through the emails
3. Make a .get() query on the user database [1] for users with that email
4. If data comes back, the friend is on the platform and an invite button is displayed
5. Free quota [2] exceeded within an hour
The thing is that any .get is considered a read operation, even if no data comes back. Their doc.exists can only be run after a .get, so a document read is needed to check for existence.
I'm sure I'm overlooking something obvious; what I want is, in essence, to do an .exist()-like query that does not 'cost' a read.
[1]: I'm not actually storing emails in Firestore but their hashes, and am querying those. Same effect, but it allows me to query a secondary user database that doesn't expose true emails and other data.
[2]: Not trying to be cheap per se, but if this app turns commercial this would make the billing a nightmare.
According to your comment, you keep the contacts in memory and, for each contact (email address), search your existing Firestore database for matches.
"Free quota exceeded within an hour"
This means that you are searching the Firestore database for a huge number of contacts.
"The thing is that any .get is considered a read operation, even if no data comes back."
That's correct. The official documentation on Firestore pricing states this clearly:
"Minimum charge for queries. There is a minimum charge of one document read for each query that you perform, even if the query returns no results."
So if you have, for example, 1000 contacts and you query the database for each one of them, you are still charged for 1000 read operations even if the queries return no results.
"I'm sure I'm overlooking something obvious, what I want to do is in essence to an .exist() like query that does not 'cost' a read."
That's not how Firestore works. Every query incurs a cost of at least one document read, no matter the results.
"1: I'm not actually storing emails in firestore but their hashes, and am querying those. Same effect, but it allows me to query a secondary user database that doesn't expose true emails and other data."
As you already noticed, it doesn't matter whether you store the actual email address or the corresponding hash; the result is the same.
"2: Not trying to be cheap per se, but if this app turns commercial this would make the billing a nightmare."
For this feature, try the Firebase Realtime Database, which is billed by storage and bandwidth rather than per read. Believe me, both work very well together in the same project.

Browser: How to cache large data yet enable small parts to be updated?

I have a list of 20k employees to display in a React table. When the admin user changes one, I want the change reflected in the table - even if she does a reload - but I don't want to re-fetch all 20k including the unchanged 19 999.
(The table is of course paged and shows max N at once but I still need all 20k to support search and filtering, which is impractical to do server side for various reasons)
The solution I can think of is to set caching headers for /api/employees so that it is cached for e.g. one hour, and to have another endpoint, /api/employees?changedSince=, while somehow ensuring that the server knows which employees have changed. But I am sure somebody has already implemented a solution for this...
Thank you!
A timestamp solution would be the best, and simplest, way to implement it. It would only require a small amount of extra data to be stored and would provide the most maintainable and expandable solution.
All you would need to do is update the timestamp when an item in the list is updated. Then, when the page loads for the first time, access /api/employees, then periodically request /api/employees?changedSince to return all of the changed rows in the table, for React to then update.
In terms of caching the main /api/employees endpoint, I’m not sure how much benefit you would gain from doing that, but it depends on how often the data is updated.
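A rough Python sketch of the timestamp approach described above. The field names id and updated_at, the integer timestamps, and both function names are assumptions for illustration: changed_since stands for what the /api/employees?changedSince endpoint would compute on the server, and merge_changes for what the client does with the response.

```python
def changed_since(employees, since):
    """Server side: return only the rows updated after the given timestamp."""
    return [row for row in employees if row["updated_at"] > since]


def merge_changes(cache, changes):
    """Client side: overwrite cached rows by id, return the newest timestamp held."""
    for row in changes:
        cache[row["id"]] = row
    return max((row["updated_at"] for row in cache.values()), default=0)


server_rows = [
    {"id": 1, "name": "Ann", "updated_at": 100},
    {"id": 2, "name": "Bob", "updated_at": 100},
]
cache = {}
last_sync = merge_changes(cache, changed_since(server_rows, 0))  # initial full load

server_rows[1] = {"id": 2, "name": "Robert", "updated_at": 150}  # admin edits one row
delta = changed_since(server_rows, last_sync)  # only the edited row comes back
last_sync = merge_changes(cache, delta)
print(cache[2]["name"], last_sync)
```

Deleted employees are not handled here; one common option is a soft-delete flag that also bumps the timestamp, so the deletion shows up in the delta like any other change.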
Since you say you are in control of the frontend's backend, IMHO this backend should cache all of the upstream data in its own (SQL or other) database. The backend can then expose a proper API (with pagination and search).
The backend can also implement some logic to identify which rows have changed.
If the frontend needs live updates about changes, you can use a technology that allows bi-directional communication (SignalR if your backend is .NET based, something like socket.io if you have a Node backend, or even plain WebSockets).

Handling user activity on web portal with performance

The users on my website do operations like login, logout, update profile, change passwords etc. I am trying to come up with something that can store these user activities for my users and also return the matching records in case some system asks me for them based on userIds.
The problem, as far as I can tell, is write intensive (users log into the website very often and I have to record that). Only once in a while (say, when a user clicks on history or some reporting team needs it) are the records read and returned.
So my application is write intensive.
There are various approaches I can think of.
Approach 1: The system that receives the user activities keeps writing them into a queue, and another one keeps fetching from that queue (periodically or when it is full) and writes them into a database that has been sharded (assume by hash of userId).
The problem with this approach is that if my activity manager runs on multiple nodes, it has to send those records to various shards, which means a lot of data movement over the network.
Approach 2: The system that receives the user activities keeps writing them into a queue, and another one keeps fetching from that queue (periodically or when it is full) and writes them into a read-through/write-through cache, which would take care of writing into the database.
The problem with this approach is that I do not know whether I can control where those records are written (I mean, to which shard). Basically, I do not know how the write-through cache works (whether it maps to a local DB or can send data to shards).
Approach 3: The login operation is the most common user activity in my system. I can have a separate queue for logins, which must be periodically flushed to disk.
Approach 4: Use some cloud-based storage that acts as an in-memory queue where data coming from all nodes is stored. This would be a reliable cache that guarantees no data loss. Periodically read from this cache and store the data in the database shards.
There are many problems to solve:
1. Ensuring I do not lose the data (what kind of data replication to use, i.e. any queue that ensures reliability)
2. Ensuring my frequent writes do not degrade performance
3. Avoiding a single point of failure
4. Achieving infinite scale
I need suggestions, based on the above, from the existing solutions available.
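To make Approach 1 concrete, here is a small Python sketch of the consumer side: a batch drained from the queue is grouped by shard so that each shard receives one bulk write. The event shape, the function names, and the use of SHA-256 are assumptions for illustration; the queue and the database writes themselves are left out.

```python
import hashlib
from collections import defaultdict


def shard_for(user_id, shard_count):
    """Map a user ID to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(events, shard_count):
    """Split one drained batch into per-shard lists, preserving event order."""
    batches = defaultdict(list)
    for event in events:
        batches[shard_for(event["user_id"], shard_count)].append(event)
    return dict(batches)


batch = [
    {"user_id": "u1", "action": "login"},
    {"user_id": "u2", "action": "login"},
    {"user_id": "u1", "action": "update_profile"},
]
for shard, rows in group_by_shard(batch, 4).items():
    print(shard, [row["action"] for row in rows])
```

A cryptographic hash is used instead of Python's built-in hash() because the latter is salted per process for strings, so two nodes would disagree about which shard a user belongs to.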

displaying # views on a page without hitting database all the time

More and more sites are displaying the number of views (and clicks like on dzone.com) certain pages receive. What is the best practice for keeping track of view #'s without hitting the database every load?
I have a bunch of potential ideas on how to do this in my head but none of them seem viable.
Thanks,
first time user.
I would try the database approach first - returning the value of an autoincrement counter should be a fairly cheap operation so you might be surprised. Even keeping a table of many items on which to record the hit count should be fairly performant.
But the question was how to avoid hitting the db every call. I'd suggest loading the table into the webapp and incrementing it there, only backing it up to the db periodically or on webapp shutdown.
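A minimal Python sketch of counting in the web app and writing to the database only periodically. It is a variant of the suggestion above: instead of loading the whole table, it buffers only the increments since the last flush. BufferedViewCounter and the flush_to_db callback are hypothetical names; the callback stands for whatever performs the actual database update.

```python
import threading
from collections import Counter


class BufferedViewCounter:
    """Accumulate hits in memory and hand them to the database in batches."""

    def __init__(self, flush_to_db):
        self._flush_to_db = flush_to_db
        self._pending = Counter()
        self._lock = threading.Lock()

    def hit(self, page_id):
        # Called on every page load; touches memory only.
        with self._lock:
            self._pending[page_id] += 1

    def flush(self):
        # Called from a timer or on shutdown; swaps the buffer out under the lock.
        with self._lock:
            pending, self._pending = self._pending, Counter()
        if pending:
            self._flush_to_db(dict(pending))


written = []
counter = BufferedViewCounter(written.append)
counter.hit("home")
counter.hit("home")
counter.hit("about")
counter.flush()
print(written)
```

Hits recorded since the last flush are lost if the process crashes, and with several web processes each one keeps its own buffer, so the database total is only as fresh as the flush interval.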
One cheap trick would be to simply cache the value for a few minutes.
The exact number of views doesn't matter much anyway since, on a busy site, in the time a visitor goes through the page, a whole batch of new views is already in.
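Caching the displayed value for a few minutes could be sketched like this in Python. CachedValue, the load callback (standing for the database query), and the injectable clock are assumptions for illustration; the clock parameter exists only so the expiry can be exercised without waiting.

```python
import time


class CachedValue:
    """Return a stored value and reload it only after ttl_seconds have passed."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._load()  # the only place the database is hit
            self._expires_at = now + self._ttl
        return self._value


view_count = CachedValue(load=lambda: 1234, ttl_seconds=300)
print(view_count.get())  # first call loads the value
print(view_count.get())  # served from memory for the next five minutes
```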
One way is to use memcached as a counter. You could modify this rate limit implementation to act as a general counter instead. The key could be in yyyymmddhhmm format with an expiration of 15 or 30 minutes (depending on what you consider to be concurrent visitors), and you then simply get those keys when displaying the page.
Nice libraries for communicating with the memcache server are available in many languages.
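The minute-bucket key scheme can be illustrated in Python with a plain dict standing in for the memcached server. The key prefix and function names are assumptions; with a real memcached client the increment would use the server's atomic incr operation and each key would be stored with an expiration, neither of which the dict models.

```python
from datetime import datetime, timedelta


def bucket_key(page_id, moment):
    """One key per page per minute, in yyyymmddhhmm format."""
    return f"views:{page_id}:{moment.strftime('%Y%m%d%H%M')}"


def record_hit(cache, page_id, moment):
    key = bucket_key(page_id, moment)
    cache[key] = cache.get(key, 0) + 1


def recent_views(cache, page_id, moment, window_minutes=15):
    """Sum the buckets for the last window_minutes, including the current minute."""
    total = 0
    for offset in range(window_minutes):
        key = bucket_key(page_id, moment - timedelta(minutes=offset))
        total += cache.get(key, 0)
    return total


cache = {}
now = datetime(2024, 1, 1, 12, 30)
record_hit(cache, "post-1", now)
record_hit(cache, "post-1", now - timedelta(minutes=3))
print(bucket_key("post-1", now), recent_views(cache, "post-1", now))
```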
You could set up a flat file that has the number of hits in it. This would have issues scaling, but it could work.
If you don't care about displaying the number of page views, you could use something like Google Analytics or Piwik. Both make their requests after the page has loaded, so they won't impact load times. There might be a way to make an AJAX request to the analytics server, but I don't know for sure. Piwik is open source, so you can probably hack something together.
If you are using server-side scripting, you could increment the count in a variable. It is likely to be reset when you restart the services, so this is not a good idea if accuracy is needed.

Resources