I see that on websites such as Facebook or Twitter, images such as profile pictures have filenames and locations like 640122062739084800/BXK8aBbv.jpg.
This is quite clearly generated. But why do websites do this? Why not have (user_id)/image.jpg instead, which seems much more logical?
Is there a security risk or is there another reason? Thanks.
There's a script behind every 'token' you see in a URL.
Tokens are a way to control what happens, and when, with some security built in.
Some character sequences encode details of the request, even if you can't make sense of them at first glance.
In short: yes, it's generally for security purposes, but also for control and for encoding request details.
Hope it was useful.
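For what it's worth, here is a minimal sketch (Node-style TypeScript) of how such a key might be generated; the helper name and token length are my own assumptions, not how Facebook or Twitter actually do it. The point is simply that the filename is random and unguessable, so nobody can enumerate other users' images by guessing paths.

```typescript
import { randomBytes } from "crypto";

// Hypothetical helper: build an unguessable storage key for an upload,
// rather than a predictable path like `${userId}/image.jpg`.
function storageKeyFor(numericId: string, extension: string): string {
  const token = randomBytes(9).toString("base64url"); // ~72 bits of randomness (Node 15.7+)
  return `${numericId}/${token}.${extension}`;
}

console.log(storageKeyFor("640122062739084800", "jpg"));
// e.g. "640122062739084800/BXK8aBbvQx9a.jpg"
```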
I am designing a system which takes user-submitted code and saves it in a database. Code can be in any language: Ruby, Python, Elixir, JavaScript, etc. There's no restriction on language. Code saved in the database is never meant to be run. It will be displayed in a blog article or converted into a file for download. Similar examples might be GitHub Gist or Cacher; both take user-submitted code and display it on the website.
How do I make sure user-submitted code is sanitised and safe to display on a webpage with a code highlighter?
What processing do I need to do on code such that I can safely display it? I don't want to impose strict restrictions on users.
Any gotchas I need to be aware of?
Any idea how those websites implement this feature?
I am using Elixir and the Phoenix framework. Are there any pitfalls I should be careful about? I am thinking of using the Phoenix.HTML module to escape the code. I just want to be sure that my approach doesn't have known loopholes.
I think you are looking for this https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
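The core idea there is output encoding: escape the untrusted text before it goes into the page, and let the highlighter work on the escaped text. Here is a minimal TypeScript sketch of that principle; the helper is illustrative only, and in your Phoenix app the Phoenix.HTML escaping you mentioned plays this role.

```typescript
// Minimal sketch: HTML-escape the untrusted snippet, then render it
// inside <pre><code> and run the highlighter on the escaped text.
function escapeHtml(source: string): string {
  return source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const submitted = `<script>alert("not code, an attack")</script>`;
console.log(`<pre><code>${escapeHtml(submitted)}</code></pre>`);
// <pre><code>&lt;script&gt;alert(&quot;not code, an attack&quot;)&lt;/script&gt;</code></pre>
```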
I have a large link database that I want to protect against others who would want to copy it. Is there anything I can do other than forcing people to enter a CAPTCHA before each link?
You can output the links using ROT13, and then use JavaScript to put them back to normal.
This way, scrapers must support JavaScript in order to steal your links, which should cut down on the number of eligible scrapers.
Bonus points: replace ROT13 with something harder, and obfuscate your 'decode' JavaScript.
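A rough TypeScript sketch of that idea; the `data-obfuscated` attribute is something I made up for illustration. The server would emit rotated hrefs, and this runs in the page to restore them.

```typescript
// ROT13 is its own inverse, so the same function encodes and decodes.
function rot13(s: string): string {
  return s.replace(/[a-z]/gi, (c) => {
    const base = c <= "Z" ? 65 : 97;
    return String.fromCharCode(((c.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Decode every link that was served obfuscated.
document.querySelectorAll<HTMLAnchorElement>("a[data-obfuscated]").forEach((a) => {
  a.href = rot13(a.dataset.obfuscated ?? "");
});
```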
The JavaScript suggestion could work, but you would render your page inaccessible to those using assistive technologies like screen readers, as well as to anyone without JavaScript.
Another possible option would be to generate a cryptographic nonce. This technique is currently used to protect against CSRF attacks, but could also be used to ensure that the scraper would have to request a page from your site before accessing a link. This approach may not be appropriate if you support hotlinking, but if you just want to make sure that someone went to your site first, it could work.
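A bare-bones sketch of that nonce idea in TypeScript; the in-memory storage and function names are assumptions for illustration, and a real setup would persist and expire the tokens.

```typescript
import { randomBytes } from "crypto";

// Issue a one-time token when rendering the page, embed it in each outgoing
// link, and only honour link requests that present a token you actually issued.
const issuedNonces = new Set<string>();

function issueNonce(): string {
  const nonce = randomBytes(16).toString("hex");
  issuedNonces.add(nonce);
  return nonce;
}

function consumeNonce(nonce: string): boolean {
  // delete() returns true only if the nonce existed, so each one works once.
  return issuedNonces.delete(nonce);
}
```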
Another somewhat ghetto option would be to use referrers. These can be easily faked, but it might prevent some of the dumber scrapers. This also requires that you know where your users came from before they hit your site.
Can you let us know if you are hotlinking or if the user comes to your site before going to the protected link? We might be able to provide better advice that way.
The ways I can think of are:
Measure the time between actions (a rough sketch of this check follows the list).
Compare the posts' content (if they're too similar to each other) or, better yet, only the posted links.
Check the distribution of activity over the period the user is active (if the user is posting, say, once every hour for a week straight, then we either have a superman or a bot here).
Some special activity is expected: on Stack Overflow, for example, I would expect users to click their user name link (top middle) to see their new answers, comments, questions, etc.
(added by chakrit) Number of links in a post.
Not a heuristic: use some async JS for user login. (It just makes life a bit harder for the bot programmer.)
(added by Alekc) Not a heuristic: User-Agent values.
And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as spam; enough spam votes means this is a spam user. (Calculating what counts as enough votes is the work here.)
Any more ideas?
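To make item 1 a bit more concrete, here is a rough TypeScript sketch of the time-between-actions check; the threshold and in-memory storage are placeholders, not tuned values.

```typescript
// Flag a user who posts faster than a human plausibly could.
const lastActionAt = new Map<string, number>(); // userId -> timestamp (ms)

function isSuspiciouslyFast(userId: string, minIntervalMs = 3000): boolean {
  const now = Date.now();
  const previous = lastActionAt.get(userId);
  lastActionAt.set(userId, now);
  return previous !== undefined && now - previous < minIntervalMs;
}
```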
I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent one. Using the C# browser control to create your bot would pretty much render it useless. From what I've seen of that type of software, that's a pretty common approach.
Validating on the user agent is pretty much useless too; all of the blog spam I used to get was from bots appearing to be valid web browsers.
I used to get a lot of blog spam. I would literally be deleting hundreds of comments a day. I made use of reCAPTCHA, and now I might get one a month.
If you really want to make something like this, I would attempt it by doing the following:
User starts off with no ability to post a URL.
After X posts have been analyzed in relation to the other posts in the thread, give them access to post URLs.
The user's activity on the site, the post quality, and whatever other factors you deem necessary will form a reputation for that user's IP.
Then, based on the reputation of the IP and of the other IPs on the same subnet, you can make whatever other decisions you want.
That was just the first thing that came to mind. Hope it helps.
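A very rough TypeScript sketch of the per-IP reputation idea above; the scoring and thresholds are made-up placeholders rather than a recommendation.

```typescript
interface IpReputation {
  posts: number;
  score: number;
}

const reputations = new Map<string, IpReputation>();

// Each analyzed post nudges the IP's reputation up or down.
function recordPost(ip: string, qualityScore: number): void {
  const rep = reputations.get(ip) ?? { posts: 0, score: 0 };
  rep.posts += 1;
  rep.score += qualityScore;
  reputations.set(ip, rep);
}

// New users start without the ability to post URLs and earn it once enough
// of their posts have been judged acceptable.
function mayPostUrls(ip: string, minPosts = 5, minScore = 10): boolean {
  const rep = reputations.get(ip);
  return rep !== undefined && rep.posts >= minPosts && rep.score >= minScore;
}
```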
The number of links in a post.
I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.
And most of the spam comments on my blog contain 10+ links.
Speaking of which... you might just want to check out the Akismet API itself; they are extremely effective.
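For what it's worth, a quick TypeScript sketch of that link-count heuristic; the threshold is a guess you would tune against your own spam.

```typescript
// Count http(s) links in a comment and flag it when there are too many.
function looksLikeLinkSpam(body: string, maxLinks = 5): boolean {
  const links = body.match(/https?:\/\/\S+/gi) ?? [];
  return links.length > maxLinks;
}

console.log(looksLikeLinkSpam("great post, thanks"));                        // false
console.log(looksLikeLinkSpam("cheap pills http://spam.example ".repeat(12))); // true
```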
How about a search for spam-related keywords in the post body?
Not a heuristic but an effective approach: You can also keep up-to-date with the stats published by StopForumSpam using their APIs.
Time between page visits is common I believe.
I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.
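If it helps, a small TypeScript sketch of that flow; the in-memory storage and the surrounding mailer are stubbed assumptions. The comment is held as pending, a one-time token goes into the emailed link, and the comment is published when the link is clicked.

```typescript
import { randomBytes } from "crypto";

interface PendingComment {
  email: string;
  body: string;
}

const pending = new Map<string, PendingComment>(); // token -> comment

// Store the comment unpublished and return the link to email to the author.
function queueComment(email: string, body: string, baseUrl: string): string {
  const token = randomBytes(16).toString("hex");
  pending.set(token, { email, body });
  return `${baseUrl}/publish?token=${token}`; // goes into the confirmation email
}

// Called when the "publish comment" link is followed.
function publishComment(token: string): PendingComment | undefined {
  const comment = pending.get(token);
  if (comment) pending.delete(token); // one-time use
  return comment;
}
```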
You might want to check if they've come from a Spam blacklist IP address (See http://www.spamhaus.org/)
There is another answer that suggests using Akismet for detecting spam, which I completely endorse.
However, they are not the only player on the block.
There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It's safe to say they pretty much take the same approach as Akismet.
You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the IP address of the user, and if it is a known malicious IP, it will tell you (harvester or something like that).
Finally, you can check LinkSleeve, which approaches comment spam in what it claims is a different way. Basically, it checks the links posted in comments and makes a determination based on where those links go.
Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
I have some doubts about the 4th point; anyway, I would also add the User-Agent. It's pretty easy to fake, but in my experience about 90% of bots use Perl as the UA.
I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords. If the content is too rich in keywords, suspect it as spam.
So, here is the task I've found myself thinking about. Pretend for a moment that I have a large body of content. I want to see what websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about those that aren't using tools capable of dealing with that?
It would seem that some form of Web Crawler that looks for pages linking to the original document might be useful. My question to the greater community is what would be the best way to get started here? Do TrackBack and PingBack do more than I assume? Are there services or tools out there that already do what I'm thinking?
Google is your friend!
Use the link prefix:
link:whatsite.com
And yes, trackbacks do more.
If you have HTTP referrers set up in your logs, you can mine them.
You can even discover pages that Google does not know about.
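A rough TypeScript sketch of mining referrers, assuming a combined-format access log and a made-up file name and domain; in that format the last two quoted fields of each line are the referrer and the user agent.

```typescript
import { readFileSync } from "fs";

const logText = readFileSync("access.log", "utf8"); // path is an assumption
const referrers = new Map<string, number>();

for (const line of logText.split("\n")) {
  // In the combined log format the referrer is the second-to-last quoted field.
  const match = line.match(/"([^"]*)"\s+"[^"]*"\s*$/);
  const ref = match?.[1];
  if (ref && ref !== "-" && !ref.includes("mysite.example")) {
    referrers.set(ref, (referrers.get(ref) ?? 0) + 1);
  }
}

// Top 20 external pages linking to you, by hit count.
console.log([...referrers.entries()].sort((a, b) => b[1] - a[1]).slice(0, 20));
```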
Otherwise, there is the paid Linkscape from SEOmoz or the free MajesticSEO (if you confirm ownership of the domain).
MajesticSEO has a bigger backlink index and an API (you need to log in!).
I'm storing URLs in a database, and I want to be able to know if two URLs are identical.
Generally, a trailing slash at the end doesn't change the response you'd get from a server. (i.e. http://www.google.com/ is the same as http://www.google.com)
Can I always blindly remove the trailing slash from any URL, without looking at anything?
Is that safe?
What I mean by "without looking at anything" is that I'd remove the slash from:
http://www.google.com/q?xxx=something&yyy=something/
I know the web server could theoretically return completely different things if it wanted, and I know sometimes going to a URL without the slash will redirect to one with the slash. My only intention here is determining if both URLs are the same.
Is this method safe?
No, it is not always safe. A web server could interpret the path part of the URL any way it likes. You cannot tell what it will do (how it will resolve the URI) without issuing a GET or HEAD on the URL.
It may be safe in the sense that you'll get the same response with or without a trailing slash (and I can't guarantee that's true), but they can definitely mean different things. Consider a URL that references a directory, or something presented by the site as a directory. Using the URL
http://www.somesite.com/directory/
...makes it clear you're asking for a directory. If you hack off the trailing slash:
http://www.somesite.com/directory
...the site's going to take this as a request for a file called "directory", and get all confused for a moment. It'll likely interpret this as a request for a directory, but the meanings are not the same, and you might not get what you expect.
See this article for more detail.
No. I've encountered situations where, depending on the settings in a .htaccess file, some directories or "clean URLs" (such as those generated by a CMS) could not be accessed without a trailing slash. It's rare and it might be a mistake on the part of the webmaster, but it can happen.
As others have noted, it's not always safe. If it will work for you, my recommendation is to store the URLs with the slashes and strip them off when you do your comparison. You'll take a performance hit, but I'd think that's better than sending someone to the wrong web page.
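Here is a small TypeScript sketch of that comparison approach, assuming the WHATWG URL class is available: the stored URL stays untouched, and only a trailing slash on the path is stripped when building the key used to compare.

```typescript
// Build a key used only for comparing URLs; the stored value stays untouched.
function comparisonKey(raw: string): string {
  const u = new URL(raw);
  const path =
    u.pathname !== "/" && u.pathname.endsWith("/")
      ? u.pathname.slice(0, -1)
      : u.pathname;
  return `${u.protocol}//${u.host}${path}${u.search}${u.hash}`;
}

console.log(comparisonKey("http://www.google.com/") === comparisonKey("http://www.google.com")); // true
console.log(comparisonKey("http://example.com/a/") === comparisonKey("http://example.com/a"));   // true
```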