What is the best way to do basic View tracking on a web page? - spam-prevention

I have a web-facing, anonymously accessible blog directory and blogs, and I would like to track the number of views each of the blog posts receives.
I want to keep this as simple as possible; accuracy need only be an approximation. This is not for analytics (we have Google for that) and I don't want to do any log analysis to pull out the stats, as running background tasks in this environment is tricky and I want the numbers to be as fresh as possible.
My current solution is as follows:
A web control that simply records a view in a table for each GET.
Excludes a list of known web crawlers using a regex and UserAgent string
Provides for the exclusion of certain IP Addresses (known spammers)
Provides for locking down some posts (when the spammers come for it)
This actually seems to do a pretty good job, but a couple of things annoy me. The spammers still hit some posts, thereby skewing the views, and I still have to manually monitor the views and update my list of "bad" IP addresses.
Does anyone have some better suggestions for me? Anyone know how the views on StackOverflow questions are tracked?

It sounds like your current solution is actually quite good.
We implemented one where the server code which delivered the view content also updated a database table which stored the URL (actually a special ID code for the URL since the URL could change over time) and the view count.
This was actually for a system with user-written posts that others could comment on but it applies equally to the situation where you're the only user creating the posts (if I understand your description correctly).
We had to do the following to minimise (not eliminate, unfortunately) skew.
For logged-in users, each user could only add one view point to a post. EVER. NO exceptions.
For anonymous users, each IP address could only add one view point to a post each month. This was slightly less reliable as IP addresses could be 'shared' (NAT and so on) from our point of view. The reason we relaxed the "EVER" requirement above was for this sharing reason.
The posts themselves were limited to having one view point added per time period (the period started low (say, 10 seconds) and gradually increased (to, say, 5 minutes) so new posts were allowed to accrue views faster, due to their novelty). This took care of most spam-bots, since we found that they tend to attack long after the post has been created.
Removal of a spam comment on a post, or a failed attempt to bypass CAPTCHA (see below), automatically added that IP to the blacklist and reduced the view count for that post.
If a blacklisted IP hadn't tried to leave a comment in N days (configurable), it was removed from the blacklist. This rule, and the previous rule, minimised the manual intervention in maintaining the blacklist; we only had to monitor responses for spam content.
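The anti-skew rules above (one view per IP per post per month, plus a per-post rate limit whose window grows with the post's age) could be sketched roughly like this. All class names, thresholds, and the in-memory storage are illustrative only; a real version would persist this state in the database:

```python
import time

MONTH = 30 * 24 * 3600  # one month, in seconds (approximation)

class ViewCounter:
    """Illustrative sketch: counts views subject to the IP and rate rules."""

    def __init__(self):
        self.counts = {}        # post_id -> accepted view count
        self.last_seen = {}     # (post_id, ip) -> time of last counted view
        self.last_counted = {}  # post_id -> time of last counted view (any IP)
        self.created = {}       # post_id -> creation time

    def _min_interval(self, post_id, now):
        # Window starts around 10 seconds for brand-new posts and grows
        # toward 5 minutes, so new posts can accrue views faster.
        age = now - self.created.get(post_id, now)
        return min(10 + age / 1000.0, 300)

    def record_view(self, post_id, ip, now=None):
        now = time.time() if now is None else now
        self.created.setdefault(post_id, now)
        # Rule: one view point per IP address per post per month.
        if now - self.last_seen.get((post_id, ip), -MONTH) < MONTH:
            return False
        # Rule: the post itself accepts at most one view per interval.
        if now - self.last_counted.get(post_id, -1e12) < self._min_interval(post_id, now):
            return False
        self.last_seen[(post_id, ip)] = now
        self.last_counted[post_id] = now
        self.counts[post_id] = self.counts.get(post_id, 0) + 1
        return True
```

Note that the blacklist and view-subtraction rules would sit on top of this, removing entries from `counts` when an IP turns out to be a spammer.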
CAPTCHA. This solved a lot of our spam problems, especially since we didn't just rely on OCR-type things (like "what's this word -> 'optionally'"); we actually asked questions (like "what's 2 multiplied by half of 8?") that break the dumb character-recognition bots. It won't beat the hordes of cheap-labour CAPTCHA breakers (unless their maths is really bad :-) but the improvement over having no CAPTCHA was impressive.
Logged-in users weren't subject to CAPTCHA, but spam got the account immediately deleted, the IP blacklisted, and their view subtracted from the post.
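The question-style CAPTCHA could be generated along these lines; the question template and ranges here are made up for illustration:

```python
import random

def make_captcha():
    """Return a (question, answer) pair that defeats OCR-style bots,
    since there is no distorted word to recognise, only a small sum."""
    a = random.randint(2, 9)
    b = random.choice([2, 4, 6, 8])  # even, so "half of b" is a whole number
    question = "What is %d multiplied by half of %d?" % (a, b)
    return question, a * (b // 2)

def check_captcha(expected, user_input):
    try:
        return int(user_input.strip()) == expected
    except ValueError:
        return False
```

The expected answer would live server-side (in the session), never in the page itself.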
I'm ashamed to admit we didn't actually discount the web crawlers (I hope the client isn't reading this :-). To be honest, they're probably only adding a minimal number of view points each month due to our IP address rule (unless they're swarming us with multiple IP addresses).
So basically, I'm suggesting the following as possible improvements. You should, of course, always monitor them to see whether they're working or not.
CAPTCHA.
Automatic blacklist updates based on user behaviour.
Limiting view count increases from identical IP addresses.
Limiting view count increases to a certain rate.
No scheme you choose will be perfect (e.g., our one month rule) but, as long as all posts are following the same rule set, you still get a good comparative value. As you said, accuracy need only be an approximation.

Suggestions:
Move the hit count logic from a user control into a base Page class.
Redesign the exclusions list to be dynamically updatable (i.e. store it in a database or even in an xml file)
Record all hits. On a regular interval, have a cron job run through the new hits and determine whether they are included or excluded. If you do the exclusion for each hit, each user has to wait for the matching logic to take place.
Come up with some algorithm to automatically detect spammers/bots and add them to your blacklist. And/Or subscribe to a 3rd party blacklist.
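The "record all hits, exclude later" suggestion could be sketched like this. The data shapes, crawler regex, and function names are illustrative; a real version would read raw hits from a table and write the accepted counts back:

```python
import re

# A few common crawler tokens; a real list would be longer and configurable.
CRAWLER_RE = re.compile(r"(googlebot|bingbot|slurp|crawler|spider)", re.I)

def process_hits(raw_hits, ip_blacklist):
    """raw_hits: iterable of dicts with 'post_id', 'ip', 'user_agent'.
    Returns {post_id: accepted_view_count}, skipping crawlers and
    blacklisted IPs. Meant to run from a periodic job, so the matching
    cost never sits in the request path."""
    counts = {}
    for hit in raw_hits:
        if CRAWLER_RE.search(hit.get("user_agent", "")):
            continue  # known crawler
        if hit["ip"] in ip_blacklist:
            continue  # known spammer
        counts[hit["post_id"]] = counts.get(hit["post_id"], 0) + 1
    return counts
```

Because exclusion runs in a batch, the blacklist can also be updated between runs without touching the hit-recording code.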

Related

Logging requests into database

Should I log request info (client IP, request status code, execution time, etc.) from my web app into the database to analyse user behaviour and errors that arise? And what info should I log for a better experience?
It's often tempting to log lots of information; however, I usually find that when I come to use it to answer a question, the wrong piece of information has been recorded, or only partially. Or it has been recorded but not stored in a usable way, and it takes further programming to turn the log into meaningful information.
So I would start with the question of what you want to see/find and log accordingly. Generally then logging capability can be expanded in the future as new issues/insights are required.
Remember that every time you log something you are slowing your application down. You are also using more disk space; no one is going to thank you for buying more disk (or longer backups) just because you have logged everything on every action.
I guess I would follow a train of thought a bit like:
1) What are you trying to find? If it's an error you can predict, then why not cater for it in your code to start with? If it's usability, what format does the data need to be in, and at what points should it be recorded?
2) How long do you need it for? Be sure to purge the logs after a period to conserve disk space.
3) Every element stored is a performance hit; it might be small, but for a high number of transactions it adds up.
4) Be wary of privacy rules: an IP address may be considered identifiable data, in which case you need to publish a data privacy policy (see point 2).
5) Consider using a flag to turn logging on or off. Then you can use it at times of a known issue, without recording everything all the time when it's not needed.

Should I set max limits on as many entities as possible in my webapp?

I am working on a CRUD webapp where users create and manage their data assets.
Since I don't want to be bombarded with tons of data from day one, I think it would be reasonable to set limits where possible in the database.
For example, limit the number of created items to A = 10, B = 20, C = 50; then, if a user reaches the limit, have a look at their account and decide whether to update the rules, as long as doing so doesn't break the code or performance.
Is it good practice at all to set such limits from the performance/maintenance side (not the business side), or should I treat data entities as unlimited and try to make the app perform well with lots of data from the start?
You are suggesting testing your application's performance on real users, which is bad. In addition, your solution will inconvenience users by limiting them when there is no reason for it (at least from the user's point of view), which decreases user satisfaction.
Instead, you should test performance before you release. That will give you an understanding of your application's and infrastructure's limits under high load, and it will help you find and eliminate bottlenecks in your code. You can perform such testing with tools like JMeter, among many others.
Also, if you're afraid of tons of data at launch, you can release your application as a private beta: just make a simple form where users can ask for early access and receive an invite. By sending invites you can easily control the growth of the user base, and therefore the load on your app.
But you should, of course, create limits where they are necessary; for example, limit the number of items per page, etc.

Strategy for caching of remote service; what should I be considering?

My web app contains data gathered from an external API of which I do not have control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database. Each of these items is essentially a cached version. Consider that it takes 1 request to update the cache of 1 item. Obviously, it is not possible to have a perfectly up-to-date cache under these circumstances. So, what things should I be considering when developing a strategy for caching the data. These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of.
time since item was created (less time means more important)
number of 'likes' a particular item has (could mean higher probability of being viewed)
time since last updated
A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients (therefore they should take priority). Though I only have 250K items in the database now, that number is increasing rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).
Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?
Otherwise, I think your problem sounds much like the problem any search engine has: have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.
Anyway, to throw out some thoughts on standards for crawling:
Update most often the things that change often. Conversely, if an item hasn't changed in the last five updates, you could assume it won't change as often and update it less frequently.
Create a score for each image, and update the ones with the highest scores. Or the lowest scores (depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
A statistical model of the chance of an image being updated and needing to be recached.
An importance score for each image, using things like the recency of the image, or the currency of its event.
Update things that are being viewed frequently.
Update things that have many views.
Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? Slow down the frequency of checks of older images.
Allocate part of your requests to slowly updating everything, and split up other parts to process results from several different algorithms simultaneously. So, for example, have the following (numbers are for show/example only--I just pulled them out of a hat):
5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
2,500 requests processing new images (which you mentioned are more important)
2,500 requests processing images of current events
2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
2,500 requests processing images that have been viewed at least
Total: 15,000 requests per hour.
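The budget split above could be expressed as a small allocator that hands out the hourly request quota across strategies, keeping the remainder as headroom. The strategy names and fractions mirror the hat-pulled example numbers and are purely illustrative:

```python
HOURLY_BUDGET = 20000  # the API's stated limit of requests per hour

def allocate_budget(strategies, budget=HOURLY_BUDGET):
    """strategies: list of (name, fraction_of_budget) pairs.
    Returns {name: request_count}; anything unallocated stays as headroom
    for retries or bursts."""
    return {name: int(budget * fraction) for name, fraction in strategies}

plan = allocate_budget([
    ("full_sweep", 0.25),       # churn through the whole database over time
    ("new_images", 0.125),      # newer images are more important
    ("current_events", 0.125),  # events happening now get priority
    ("most_viewed", 0.125),     # top-viewed images, on a decreasing schedule
    ("recently_viewed", 0.125),
])
```

Each strategy would then pull candidate items from its own query (newest first, current events, view counts, and so on) until its slice of the budget is spent.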
How many (unique) photos / events are viewed on your site per hour? Those photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
andyg0808 has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.
At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.

Best implementation of turn-based access on App Engine?

I am trying to implement a 2-player turn-based game with a GAE backend. The first thing this game requires is a very simple match making system that operates like this:
User A asks the backend for a match. The backend tells him to come back later.
User B asks the backend for a match. He will be matched with A.
User C asks the backend for a match. The backend tells him to come back later.
User D asks the backend for a match. He will be matched with C.
and so on...
(edit: my assumption is that if I can figure this one out, most other operations in a turn-based game can use the same implementation)
This can be done quite easily in Apple Gamecenter and Xbox Live, however I would rather implement this on an open and platform independent backend like GAE. After some research, I have found the following options for a GAE implementation:
Use memcache. However, there is no guarantee that memcache is synchronized across different instances. I did some tests and could actually see match requests disappearing due to memcache mis-synchronization.
Harden memcache with Sharding Counters. This does not always solve the multiple-instance problem, and it may result in high memcache quota usage.
Use memcache with Compare and Set. Does not solve the multiple-instance problem when used as a mutex.
Task queues. I have no idea how to use these, but someone mentioned them as a possible solution. However, I am afraid that queues will eat my GAE quota very quickly.
Push queues. Same as above.
Transactions. Same as above; also probably very expensive.
Channels. Same as above; also probably very expensive.
Given that the match making is a very basic operation in online games, I cannot be the first one encountering this. Hence my questions:
Do you know of any safe mechanism for match making?
If multiple solutions exist, which is the cheapest (in terms of GAE quota usage) solution?
You could accomplish this using a cron task in a scheme like this:
class MatchRequest(db.Model):
    requestor = db.StringProperty()
    opponent = db.StringProperty(default='')
User A asks for a match; a MatchRequest entity is created with A as the requestor and the opponent blank.
User A polls to see when the opponent field has been filled.
User B asks for a match; a MatchRequest entity is created with B as the requestor.
User B polls to see when the opponent field has been filled.
A cron job that runs every 20 seconds or so does the following:
Grab all MatchRequest where opponent == ''
Make all appropriate matches
Put all the MatchRequests as a transaction
Now when A and B poll next, they will see that they have an opponent.
According to the GAE docs on cron, free apps can have up to 20 free cron tasks. The computation required by these crons for a small number of users should be small.
This would be a safe way but I'm not sure if it is the cheapest way. It's also pretty easy to implement.
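The pairing step of that cron job could be sketched as a pure function, with the datastore fetch of open MatchRequests and the transactional put left out (those parts are GAE-specific); the function name is illustrative:

```python
def pair_requests(requestors):
    """Pair waiting requestors two at a time, in arrival order.
    Returns (pairs, leftover): `pairs` is a list of matched
    (requestor, opponent) tuples, and `leftover` is the odd one out
    (if any), who keeps waiting for the next cron run."""
    pairs = [(requestors[i], requestors[i + 1])
             for i in range(0, len(requestors) - 1, 2)]
    leftover = requestors[-1] if len(requestors) % 2 else None
    return pairs, leftover
```

Since only the cron job ever fills in opponents, there is a single writer and the multiple-instance races that plague the memcache approaches go away.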

displaying # views on a page without hitting database all the time

More and more sites are displaying the number of views (and clicks like on dzone.com) certain pages receive. What is the best practice for keeping track of view #'s without hitting the database every load?
I have a bunch of potential ideas on how to do this in my head but none of them seem viable.
Thanks,
first time user.
I would try the database approach first - returning the value of an autoincrement counter should be a fairly cheap operation so you might be surprised. Even keeping a table of many items on which to record the hit count should be fairly performant.
But the question was how to avoid hitting the db on every call. I'd suggest loading the table into the webapp and incrementing it there, only backing it up to the db periodically or on webapp shutdown.
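A minimal sketch of that idea, assuming the webapp runs in one process: counts accumulate in memory and a periodic `flush()` (from a timer thread or a shutdown hook) pushes the deltas to whatever persistence call the app uses, represented here by the hypothetical `flush_to_db` callback:

```python
import threading

class HitCounter:
    """In-memory hit counter with periodic flushing to the database."""

    def __init__(self, flush_to_db):
        self._counts = {}
        self._lock = threading.Lock()
        self._flush_to_db = flush_to_db  # callable(page_id, delta)

    def hit(self, page_id):
        with self._lock:
            self._counts[page_id] = self._counts.get(page_id, 0) + 1

    def flush(self):
        # Swap the dict out under the lock, then write outside it so the
        # database call never blocks incoming hits.
        with self._lock:
            pending, self._counts = self._counts, {}
        for page_id, delta in pending.items():
            self._flush_to_db(page_id, delta)
```

The obvious trade-off: hits accumulated since the last flush are lost if the process dies, which is fine given that the question only needs approximate numbers.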
One cheap trick would be to simply cache the value for a few minutes.
The exact number of views doesn't matter much anyway since, on a busy site, in the time a visitor goes through the page, a whole batch of new views is already in.
One way is to use memcached as a counter. You could modify this rate-limit implementation to act as a general counter instead. The key could be in yyyymmddhhmm format with an expiration of 15 or 30 minutes (depending on what you consider concurrent visitors), and then you simply get those keys when displaying the page.
Nice libraries for communicating with the memcache server are available in many languages.
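The time-bucketed counter could look roughly like this. To keep the sketch self-contained, `cache` is a plain dict standing in for a memcached client (with a real client you would use its add/incr operations and let the expiration drop old buckets); the bucket size and key format are assumptions:

```python
import time

BUCKET_SECONDS = 60 * 15  # 15-minute buckets, per the answer's suggestion

def bucket_key(page_id, now=None):
    """Key for the current time bucket of a page's view counter."""
    now = time.time() if now is None else now
    return "views:%s:%d" % (page_id, int(now // BUCKET_SECONDS))

def record_view(cache, page_id, now=None):
    key = bucket_key(page_id, now)
    cache[key] = cache.get(key, 0) + 1  # real client: add(key, 0) then incr(key)

def recent_views(cache, page_id, buckets=2, now=None):
    """Sum the last few buckets to display an approximate recent count."""
    now = time.time() if now is None else now
    return sum(cache.get(bucket_key(page_id, now - i * BUCKET_SECONDS), 0)
               for i in range(buckets))
```

Expired buckets simply disappear from the cache, so the count stays approximate and self-cleaning without any database writes on the hot path.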
You could set up a flat file that has the number of hits in it. This would have issues scaling, but it could work.
If you don't care about displaying the number of page views, you could use something like Google Analytics or Piwik. Both make requests after the page has already loaded, so they won't impact load times. There might be a way to make an Ajax request to the analytics server, but I don't know for sure. Piwik is open source, so you can probably hack something together.
If you are using server-side scripting, increment it in a variable. It's likely to get reset if you restart the service, so it's not such a good idea if accuracy is needed.
