The ways I can think of are:
1. Measure the time between actions (sketched below).
2. Compare the posts' content (are they too similar to each other?) or, better yet, only the posted links.
3. Check how the user's activity is distributed over time (if a user posts, say, once every hour for a week straight, we either have a superman or a bot here).
4. Expect some natural activity: on Stack Overflow, for example, I would expect users to click their user name link (top middle) to see their new answers, comments, questions, etc.
5. (added by chakrit) The number of links in a post.
6. Not a heuristic: use some async JS for user login. (It just makes life a bit harder for the bot programmer.)
7. (added by Alekc) Not a heuristic: check User-Agent values.
8. And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as spam; enough spam votes means this is a spam user. (Figuring out how many votes is "enough" is the work here.)
Any more ideas?
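For illustration only (none of this is from the original list), here's roughly what checks 1 and 5 could look like; the thresholds are made-up example values:

```python
# Illustrative sketch of two heuristics from the list above: a minimum delay
# between actions and a link-count limit. The limits are example values.
import re
import time

MIN_SECONDS_BETWEEN_POSTS = 20   # heuristic 1: humans rarely post faster than this
MAX_LINKS_PER_POST = 5           # heuristic 5: lots of links smells like spam

last_post_time = {}              # user_id -> unix timestamp of the user's last post

def looks_like_bot(user_id, post_body, now=None):
    now = now or time.time()
    suspicious = []

    previous = last_post_time.get(user_id)
    if previous is not None and now - previous < MIN_SECONDS_BETWEEN_POSTS:
        suspicious.append("posting too fast")

    links = re.findall(r"https?://\S+", post_body)
    if len(links) > MAX_LINKS_PER_POST:
        suspicious.append("too many links")

    last_post_time[user_id] = now
    return suspicious   # empty list means nothing obviously bot-like
```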
I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent bot creator. Using the C# browser control to create your bot would pretty much render number 6 useless, and from what I've seen that's a pretty common approach with that type of software.
Validating on the user agent is pretty much useless too; all of the blog spam I used to get came from bots appearing to be valid web browsers.
I used to get a lot of blog spam and would literally be deleting hundreds of comments a day. Since I started using reCAPTCHA I might get one a month.
If you really want to build something like this, I would attempt the following:
User starts off with no ability to post a url.
After X posts have been analyzed in relation to the other posts in the thread, give them the ability to post URLs.
The user's activity on the site, their post quality, and whatever other factors you deem necessary become a reputation score for that user's IP.
Then, based on the reputation of that IP and of the other IPs on the same subnet, you can make whatever further decisions you want (a rough sketch follows).
That was just the first thing that came to mind. Hope it helps.
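To make that concrete, here's a rough sketch of the IP-reputation gating I'm describing; the in-memory store, the thresholds, and the /24 subnet choice are all placeholders for whatever you actually use:

```python
# Sketch of IP-reputation gating: no URLs until an IP (and its subnet) has
# earned enough reputation from analyzed posts. Thresholds are examples.
from collections import defaultdict
from ipaddress import ip_address, ip_network

reputation = defaultdict(int)    # ip -> accumulated reputation score
post_counts = defaultdict(int)   # ip -> number of posts analyzed so far

POSTS_BEFORE_URLS = 10           # the "X" posts mentioned above
URL_REPUTATION_THRESHOLD = 10

def record_post(ip, quality_score):
    """quality_score comes from however you rate a post against its thread."""
    post_counts[ip] += 1
    reputation[ip] += quality_score

def subnet_reputation(ip, prefix=24):
    net = ip_network(f"{ip}/{prefix}", strict=False)
    return sum(score for other, score in reputation.items()
               if ip_address(other) in net)

def may_post_urls(ip):
    return (post_counts[ip] >= POSTS_BEFORE_URLS
            and subnet_reputation(ip) >= URL_REPUTATION_THRESHOLD)
```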
The number of links in a post.
I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.
And most of the spam comments on my blog contain 10+ links.
Speaking of which, you might just want to check out the Akismet API itself; it is extremely effective.
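If you do go the Akismet route, a comment check is a single POST; this sketch uses the requests library, and the key and blog URL are obviously placeholders:

```python
# Hedged sketch of a call to Akismet's comment-check endpoint.
import requests

AKISMET_KEY = "your-akismet-key"       # placeholder: your Akismet API key
BLOG_URL = "https://example.com"       # placeholder: the site you registered

def is_spam(user_ip, user_agent, comment_content, author=""):
    resp = requests.post(
        f"https://{AKISMET_KEY}.rest.akismet.com/1.1/comment-check",
        data={
            "blog": BLOG_URL,
            "user_ip": user_ip,
            "user_agent": user_agent,
            "comment_author": author,
            "comment_content": comment_content,
            "comment_type": "comment",
        },
        timeout=10,
    )
    # Akismet answers with the literal string "true" (spam) or "false" (ham).
    return resp.text.strip() == "true"
```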
How about searching for spam-related keywords in the post body?
Not a heuristic but an effective approach: You can also keep up-to-date with the stats published by StopForumSpam using their APIs.
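A lookup against their query API can be as small as this; I'm assuming the JSON field names from memory, so double-check their docs before relying on them:

```python
# Sketch of a StopForumSpam lookup: their query API takes ip/email/username
# parameters and can return JSON. Field names are an assumption; verify them.
import requests

def stopforumspam_check(ip=None, email=None):
    params = {"json": ""}            # ask for a JSON response
    if ip:
        params["ip"] = ip
    if email:
        params["email"] = email
    resp = requests.get("https://api.stopforumspam.org/api", params=params, timeout=10)
    data = resp.json()
    # "appears" is non-zero for a field that has been reported as a spam source.
    return any(section.get("appears") for section in data.values()
               if isinstance(section, dict))
```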
Time between page visits is common I believe.
I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.
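One simple way to make those links unforgeable is to sign the comment id with an HMAC; the secret, URL format, and helper names here are just illustrative:

```python
# Sketch of a signed "publish comment" link: the token is an HMAC of the
# comment id, so only the server can mint valid links.
import hashlib
import hmac

SECRET = b"change-me"   # keep this private on the server

def publish_token(comment_id: int) -> str:
    return hmac.new(SECRET, str(comment_id).encode(), hashlib.sha256).hexdigest()

def publish_link(comment_id: int) -> str:
    # Hypothetical URL layout for the email.
    return f"https://example.com/comments/publish?id={comment_id}&token={publish_token(comment_id)}"

def verify_token(comment_id: int, token: str) -> bool:
    return hmac.compare_digest(publish_token(comment_id), token)
```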
You might want to check whether they've come from a blacklisted spam IP address (see http://www.spamhaus.org/).
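A DNSBL check against Spamhaus is just a reversed-octet DNS lookup; note they restrict queries coming through large public resolvers, so treat this as a sketch:

```python
# Sketch of a Spamhaus Zen DNSBL lookup: reverse the IPv4 octets, append the
# zone, and do a DNS A lookup; NXDOMAIN means the address is not listed.
import socket

def is_blacklisted(ipv4: str, zone: str = "zen.spamhaus.org") -> bool:
    reversed_ip = ".".join(reversed(ipv4.split(".")))
    try:
        socket.gethostbyname(f"{reversed_ip}.{zone}")
        return True          # any A record means the IP is listed
    except socket.gaierror:
        return False         # NXDOMAIN: not listed
```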
There is another answer that suggests using Akismet for detecting spam, which I completely endorse.
However, they are not the only player on the block.
There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It is safe to say they take pretty much the same approach as Akismet.
You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the IP address of the user, and if it is a known malicious IP, it will tell you (harvester or something like that).
Finally, you can check LinkSleeve which approaches comment spam with what it claims to be a different way. Basically, it checks the links that are being linked to in comments, and based on where the links are going to, makes a determination.
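For Project Honeypot, as I understand their http:BL interface you query a DNS zone with your (free) access key prepended and decode the returned octets; verify the response format against their documentation before trusting this sketch:

```python
# Rough sketch of a Project Honeypot http:BL lookup as I understand it:
# query {accesskey}.{reversed-ip}.dnsbl.httpbl.org; the returned address
# encodes days-since-last-activity, threat score, and visitor type.
import socket

HTTPBL_KEY = "youraccesskey"   # placeholder: free key from projecthoneypot.org

def httpbl_lookup(ipv4: str):
    reversed_ip = ".".join(reversed(ipv4.split(".")))
    try:
        answer = socket.gethostbyname(f"{HTTPBL_KEY}.{reversed_ip}.dnsbl.httpbl.org")
    except socket.gaierror:
        return None                      # not listed
    _, days, threat, visitor_type = (int(octet) for octet in answer.split("."))
    return {
        "days_since_activity": days,
        "threat_score": threat,
        "is_harvester": bool(visitor_type & 2),
        "is_comment_spammer": bool(visitor_type & 4),
    }
```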
Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
I have some doubts about point 4; anyway, I would also add the User-Agent. It's pretty easy to fake, but in my experience about 90% of bots use Perl as their UA.
I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords. If the content is too rich in keywords, suspect it of being spam.
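Something as crude as this keyword-density check would be a starting point; the keyword list and 10% threshold are invented examples:

```python
# Simple keyword-density check along the lines described above.
import re

SPAM_KEYWORDS = {"viagra", "casino", "payday", "loans", "cheap", "pills"}

def keyword_density(text: str) -> float:
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in SPAM_KEYWORDS)
    return hits / len(words)

def looks_spammy(text: str, threshold: float = 0.10) -> bool:
    return keyword_density(text) > threshold
```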
Before all the unhelpful comments come out of the woodwork (i.e. "if you want to protect it, don't upload it") - I get it. The only way to protect it is to not put it out there. Not an option, not helpful. Home security isn't about never allowing someone to break into your home; if they want in, they can cut a hole in the wall even if your windows and doors are secure. The idea is to make it harder so the average burglar will move on. That's what I'm looking for here: make it a pain so the average would-be thief will move on. With that out of the way, here is my situation:
I'm putting the finishing touches on a portfolio site for a photographer. The site is a headless React WP site. I've seen so many "suggestions" for how to protect their copyrighted materials that my head is spinning. "Disable right click." "DON'T disable right click." "Upload only smaller, compressed images." Trust me, I know that a Google search will give tons of answers. Problem is, too many of them are conflicting or generic. I figured I would go to a group of people who are generally more technically savvy, and have probably had to face similar situations.
Here is what I have done so far:
Employed plugins that automatically compress and limit the size of uploaded images. This allows the client to upload what they want, and keeps the images safe.
Employed a plugin that automatically puts copyright info and a QR code onto the image that directs whoever scans it back to the main page of the site (a rough sketch of this step follows this list). As the point of the site is to show off the beautiful images, we decided to put the mark only in the corner, even though covering more of the image would be more effective.
Decided NOT to bother with blocking right-click, as it's a nuisance at best and is easy to circumvent even by users with limited technical knowledge (is this the right call?).
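For what it's worth, here's the kind of thing that corner watermark/QR step boils down to, sketched with Pillow and the qrcode package; the sizes, margin, and URL are example values, not what the actual plugin does:

```python
# Rough sketch of stamping a QR code pointing back to the site into the
# bottom-right corner of a photo. Requires the Pillow and qrcode packages.
from PIL import Image
import qrcode

def stamp_image(photo_path: str, out_path: str, site_url: str = "https://example.com"):
    photo = Image.open(photo_path).convert("RGB")

    # Build a QR code that points back at the portfolio's home page.
    qrcode.make(site_url).save("/tmp/qr.png")
    qr = Image.open("/tmp/qr.png").convert("RGB")

    # Scale the QR code to roughly an eighth of the photo width and paste it
    # into the bottom-right corner with a small margin.
    qr_size = max(64, photo.width // 8)
    qr = qr.resize((qr_size, qr_size))
    margin = 16
    photo.paste(qr, (photo.width - qr_size - margin, photo.height - qr_size - margin))

    photo.save(out_path, quality=85)
```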
Any other ideas I should entertain?
I'm working on a social web application, and I'm really torn between using ACLs and writing my own permissions system. I've spent several days working with ACLs to see how they behave and to learn what I could and could not do.
In my application, members can set permissions on the things they own (profile, photo album, etc.). I also need to set permissions for moderators, who can review the content of a profile or a photo album but do not have the right to modify or delete it, while the administrator has all rights.
For the profile, the user can decide to make it public, visible only to members, or visible only to friends. For that, I could easily create a ProfileSettings table with different fields describing which information is visible and who can access the profile. The problem comes if I want to implement a choice based on a list of friends, the way Facebook does. I imagined the scenario with ACLs, but I fear the resources required would be enormous: creating an ACO for each profile is not bad in itself, but it adds up once the database contains, say, 300 members, not to mention the need to save all the permissions for each user's choices. So I'm thinking of not letting members define which individual users can access their objects, only which groups.
The question about ACLs still remains: in my case, is this a good idea? I also have to set permissions on the forums, e.g. a moderator can update a thread/post, a forum can be visible only to admins/moderators, etc.
What makes me afraid to use ACLs is that the slightest mistake can cause serious damage; at the same time, they would let me set up permissions faster.
Thank you for your help and your advice :)
The correct answer is probably "Yes, your use-case makes sense to use Cake's ACL."
However, I've tried a number of times to use ACL. The learning curve was too high/time consuming for me. I used it on an app and a few months in had the problem you mentioned - one small issue, and everything fell apart. And without knowing exactly what I was doing / what I needed to do to resolve, it became a nightmare. Not only that, but when I wanted to extend it or do ANYTHING other than the specific example, I couldn't get it to work.
From then on, I just built my own to the specific needs of the project.
If you have someone you know who uses it regularly and can teach you the ins and outs of Cake's ACL, then it might be fine to use it. Otherwise, I'd build my own. Even if it's not "better", at least if/when something goes wrong, you'll understand it enough to fix it.
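For the sake of argument, here's (in Python rather than PHP, and purely as a sketch of the logic) what a hand-rolled check like mine tends to boil down to; the role names and visibility levels mirror the ones described in the question:

```python
# Hand-rolled permission check: owner and admin can do anything, moderators
# can view but not edit, everyone else is gated by the object's visibility.
VISIBILITY_PUBLIC, VISIBILITY_MEMBERS, VISIBILITY_FRIENDS = "public", "members", "friends"

def can_view(viewer, owner, visibility, friends_of_owner):
    if viewer is None:                                   # anonymous visitor
        return visibility == VISIBILITY_PUBLIC
    if viewer["id"] == owner["id"] or viewer["role"] == "admin":
        return True
    if viewer["role"] == "moderator":                    # can review content
        return True
    if visibility == VISIBILITY_MEMBERS:
        return True
    if visibility == VISIBILITY_FRIENDS:
        return viewer["id"] in friends_of_owner
    return visibility == VISIBILITY_PUBLIC

def can_edit(viewer, owner):
    return viewer is not None and (viewer["id"] == owner["id"] or viewer["role"] == "admin")
```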
I'm concerned: when I take a picture, I'm usually (i.e., as recently as last week) able to share the image to my app.
Now, however, only Google+ contacts appear as share targets. If I turn off sharing to G+, I get no Share options at all, only a greyed-out Share dialog that says "Visit google.com/myglass to add friends".
However, when I go to that address I clearly see my app and a number of contacts (who aren't in G+) who also usually show up.
Has something changed to cause this behavior? For example, is the code listed in the starter-project no longer sufficient to register a share target for photos?
For example, I could imagine that suddenly the acceptTypes[] parameter was now mandatory. But I'd love to hear someone closer to the API weigh in, if possible.
Thanks!
AKA
I solved this by following the advice in Alain's comment.
It's very easy to think that the "Contacts" page you see at https://glass.google.com/myglass is all there is.
But if you want your app to receive shared stuff, you have to go here: https://glass.google.com/myglass/share
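For reference, this is roughly how the Python starter project registers a share contact, with an explicit acceptTypes list added; I'm recalling the Mirror API shape from memory, so treat the field names as assumptions:

```python
# Hedged sketch of inserting a Mirror API share contact that declares the
# photo MIME types it accepts. The id, name, and icon URL are placeholders.
from apiclient.discovery import build   # googleapiclient.discovery in newer clients

def insert_share_contact(http_authorized):
    service = build("mirror", "v1", http=http_authorized)
    body = {
        "id": "my-app-share-target",                  # placeholder contact id
        "displayName": "My App",
        "imageUrls": ["https://example.com/icon.png"],
        "acceptTypes": ["image/jpeg", "image/png"],   # declare photo sharing
    }
    return service.contacts().insert(body=body).execute()
```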
I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus & train) have good websites providing their respective information, and both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules. The client wants to give users the ability to search for a beginning and end location and determine, using the external websites' information, how they can best get there, providing a route with schedule times for the chosen modes of transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external sites' servers (via an API or some other means) and retain it in a database, which can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven problematic, mainly due to the language barrier.
My client suggested what is basically "screen scraping", but that sounds like it would be complicated at best: downloading the web page(s) and filtering through the HTML for the relevant/necessary data to put into the database. My worry is that the info on these sites is so static that the data isn't even kept in a database to build the pages, and each page is simply updated (hard-coded) when something changes.
I could really use some help and suggestions here. Thanks!
Screen scraping is always problematic IMO as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could then snapshot the page when you transcribe the info and run a job to periodically check whether the page has changed from the snapshot. When it does, it sends an email for you to update it.
The above method could also be used in conjunction with some sort of screen scraper, which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a question of how much effort (cost) your client is willing to bear for accuracy.
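A sketch of that snapshot-and-alert job might look like this; the URL, SMTP setup, and snapshot file are placeholders:

```python
# Sketch of the "snapshot and alert" idea: hash the page, compare against the
# stored hash, and send yourself an email when it changes.
import hashlib
import smtplib
from email.message import EmailMessage
from pathlib import Path

import requests

SNAPSHOT = Path("schedule_page.sha256")
URL = "http://example.com/train-schedule"   # placeholder source page

def page_changed() -> bool:
    digest = hashlib.sha256(requests.get(URL, timeout=30).content).hexdigest()
    old = SNAPSHOT.read_text().strip() if SNAPSHOT.exists() else None
    SNAPSHOT.write_text(digest)
    return old is not None and old != digest

def notify(to_addr="me@example.com"):
    msg = EmailMessage()
    msg["Subject"] = f"Schedule page changed: {URL}"
    msg["From"] = "monitor@example.com"
    msg["To"] = to_addr
    msg.set_content("The source page no longer matches the stored snapshot; please re-check the data.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if page_changed():
        notify()
```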
I have done this for the following site: http://www.buscatchers.com/ so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything went wrong during the scraping process. On the site, I use a two day window so that I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for some examples. There is some simplified source code here: http://www.buscatchers.com/about/guide. The full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.
I can tell that the data is dynamic; it's too well structured. It's not hard for someone who is familiar with XPath to scrape this site.
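A generic XPath scrape with lxml is only a few lines; the URL and the XPath expression below are hypothetical, since the real page's markup isn't shown here:

```python
# Generic XPath scraping sketch with lxml; adjust the expressions to the
# actual schedule page's markup.
import requests
from lxml import html

def scrape_schedule(url="http://example.com/train-schedule"):
    tree = html.fromstring(requests.get(url, timeout=30).content)
    rows = []
    # Hypothetical structure: each schedule entry is a <tr> inside a table
    # with class "schedule".
    for tr in tree.xpath("//table[@class='schedule']//tr[td]"):
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        rows.append(cells)   # e.g. [train_no, origin, departure, destination, arrival]
    return rows
```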
What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?
Captchas
Form submitted in less than a second
A hidden (by CSS) field gets a value submitted during form submission (see the sketch after this list)
Frequent page visits
Simple bots cannot scrape text from Flash, images, or sound.
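The hidden-field and timing checks are only a few lines; the field name and the one-second threshold here are example values:

```python
# Minimal sketch of the "hidden field" (honeypot) and "submitted too fast"
# checks from the list above.
import time

MIN_FILL_SECONDS = 1.0

def render_form():
    """The rendered form would include a CSS-hidden 'website' field and embed
    the render timestamp (ideally signed) so it comes back on submit."""
    return {"rendered_at": time.time()}

def is_probably_bot(form_data: dict, rendered_at: float) -> bool:
    if form_data.get("website"):                       # honeypot: humans never see it
        return True
    if time.time() - rendered_at < MIN_FILL_SECONDS:   # filled in under a second
        return True
    return False
```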
Unfortunately your question is similar to people asking how to block spam. There's no fixed answer, and it won't stop a person or bot that is persistent.
However, here are some methods that can be implemented:
Check User-Agent (this could be spoofed though)
Use robots.txt (well-behaved bots will, hopefully, respect this)
Detect IP addresses that access a lot of pages too consistently (every "x" seconds); a sketch follows this list.
Manually, or via flags in your system, check who is hitting your site and block certain routes the scrapers take.
Don't use a standard template on your site; use generic CSS class names, and don't put HTML comments in your code.
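The "too many pages, too consistently" check can be as simple as a sliding window per IP; the window and limit below are arbitrary examples:

```python
# Sketch of per-IP request-rate detection: count requests in a sliding window
# and flag IPs that exceed a limit.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

requests_by_ip = defaultdict(deque)   # ip -> timestamps of recent requests

def register_request(ip: str) -> bool:
    """Return True if this IP now looks like a scraper."""
    now = time.time()
    window = requests_by_ip[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```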
You can use robots.txt to block bots that take notice of it (while still letting through known good ones such as Google, etc.), but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted to, you could block particular user agents from accessing your website by returning an empty/default page and/or a particular status code.
I don't think there is a way of doing exactly what you need, because crawlers/scrapers can set any headers they want when requesting a page, including User-Agent, so you won't be able to tell whether a request comes from a real Mozilla Firefox user or just a scraper/crawler...
Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
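One illustrative way to do that is to rewrite class names with a per-request random suffix before the HTML goes out (your stylesheet would need the same rewriting, or you'd have to target elements some other way):

```python
# Illustrative only: suffix every class name with a per-request random token
# so CSS selectors scraped yesterday don't match today.
import re
import secrets

def randomize_classes(html_text: str) -> str:
    suffix = secrets.token_hex(4)
    return re.sub(
        r'class="([^"]+)"',
        lambda m: 'class="' + " ".join(f"{c}-{suffix}" for c in m.group(1).split()) + '"',
        html_text,
    )

# randomize_classes('<div class="price item">...</div>')
# -> '<div class="price-3f2a1b4c item-3f2a1b4c">...</div>'
```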
Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/
From their site:
Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.
Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.