I thought there would be an easy, well-documented answer to this, but I can't find one anywhere, so maybe I've missed it; sorry if that's the case.
My website has an input field where users can write comments on a post, and I want them to be able to put links in these comments. An example input from a user would be 'I think https://example.com is a great site'. I've seen that some sites have a link button, which I guess they use to make this process way simpler. Is there a way to automatically detect the link? And how is this stored in a database so it can be displayed on a page?
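One common approach is to store the comment text exactly as the user typed it and auto-link anything that looks like a URL at display time. A minimal PHP sketch of that idea (the regex and function name below are just illustrative, not a production-grade linkifier):

```php
<?php
// Minimal sketch: keep the raw comment in the database, linkify it when rendering.
// The regex is deliberately simple; a real linkifier needs more care
// (trailing punctuation, bare "www." domains, etc.).
function linkify(string $comment): string
{
    // Escape first so user input can't inject HTML.
    $safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

    // Wrap anything that looks like an http(s) URL in an anchor tag.
    return preg_replace(
        '~(https?://[^\s<]+)~i',
        '<a href="$1" rel="nofollow">$1</a>',
        $safe
    );
}

// The raw text goes into the database unchanged...
$raw = "I think https://example.com is a great site";
// ...and linkify() is applied only when rendering the page.
echo linkify($raw);
// Output: I think <a href="https://example.com" rel="nofollow">https://example.com</a> is a great site
```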
I have a webpage that needs to be scraped to look for certain text. The problem is that it's not really web scraping that I am trying to achieve. The website is opened by a separate process. I am specifically talking about a webpage, but really it is more of a universal screen-scraping issue. Conceptually, it's more like I am scraping the browser instead of the page itself. Is there a program that can scan any open process and look for and match text? To put it another way, it would be like having a find function separate from the browser's built-in Ctrl+F. I just need a simple utility to tell me, in a boolean fashion, whether a given text is present. I realize this is a very broad question, but I haven't been able to find anything about it. Maybe I don't quite know how to articulate it in a Google search, because my research keeps coming up empty.
If you already know the structure of the page, like it's always Google search results, or always an Amazon product, you might look at Selenium or one of the many Chrome screen-scraping add-ons.
If you want to grab data off of any page without knowing the format in advance, I don't know a way.
Sorry about the broad question. I'm just curious if someone could point me in the right direction.
Say there's a database of contact information, and there's a site where you can input a person's name and it brings you to a page with all of their information from that database. How does this happen exactly? The server would have to create this page dynamically, but does it have a generic format that it just fills with the information? How does that work?
Like you said, this is an extremely broad question. It could be either way: the server could generate the entire contents dynamically, or it could be "filling in the blanks" of a preformatted layout.
Google some basic PHP tutorials. That should give you a good idea of how this "dynamism" works. Sorry, but your question is too broad to elaborate more.
The server would dynamically create the page using PHP and SQL. There is a quick tutorial at http://www.mysqltutorial.org/php-querying-data-from-mysql-table/ that shows how it would be set up.
If I understood your question right, you are asking how a page like this one was created, for example, in which case it can be as simple as a basic PHP and SQL combination. You can check the "Try it Yourself" example on the W3Schools website.
There would be special placeholders for the data, and a query would extract the data and put it into those placeholders. Please note that you can also use loops to add things like tables, fetch through multiple rows, and so on.
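A minimal sketch of that query-plus-placeholders idea, using PDO. The database credentials, the contacts table, and its column names are made up purely for illustration:

```php
<?php
// Sketch of the "template + query" idea with PDO prepared statements.
// Table and column names (contacts, name, email, phone) are examples only.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The placeholder (:name) is filled with the value the visitor typed in.
$stmt = $pdo->prepare('SELECT name, email, phone FROM contacts WHERE name = :name');
$stmt->execute([':name' => $_GET['name'] ?? '']);

// Loop over the result set and drop each row into the page layout.
echo '<table>';
echo '<tr><th>Name</th><th>Email</th><th>Phone</th></tr>';
foreach ($stmt as $row) {
    echo '<tr>';
    echo '<td>' . htmlspecialchars($row['name'])  . '</td>';
    echo '<td>' . htmlspecialchars($row['email']) . '</td>';
    echo '<td>' . htmlspecialchars($row['phone']) . '</td>';
    echo '</tr>';
}
echo '</table>';
```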
I'm concerned: when I take a picture, I am usually (i.e., as recently as last week) able to share the image to my app.
Now, however, only Google+ contacts appear as share targets. For example, if I turn off sharing to G+, I get no Share options at all, only a greyed-out Share dialog that says "Visit google.com/myglass to add friends".
However, when I go to that address I clearly see my app and a number of contacts (who aren't in G+) who also usually show up.
Has something changed to cause this behavior? For example, is the code listed in the starter-project no longer sufficient to register a share target for photos?
For example, I could imagine that suddenly the acceptTypes[] parameter was now mandatory. But I'd love to hear someone closer to the API weigh in, if possible.
Thanks!
AKA
I solved this by following the advice in Alain's comment.
It's very easy to think that the "Contacts" page you see at https://glass.google.com/myglass is all there is.
But if you want your app to receive shared stuff, you have to go here: https://glass.google.com/myglass/share
I have a large link database that I want to protect from people who would copy it. Is there anything I can do other than force people to enter a CAPTCHA before each link?
You can output the links using ROT13, and then use JavaScript to put them back to normal.
This way, the scrapers must support JavaScript in order to steal your links, which should cut down on the number of eligible scrapers.
Bonus points: replace ROT13 with something harder, and obfuscate your 'decode' JavaScript.
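A minimal sketch of that idea as a PHP template: the server writes out the link ROT13-encoded in a data attribute, and a small inline script decodes it in the visitor's browser. Scrapers that don't execute JavaScript only ever see the scrambled value. The class and attribute names are just examples.

```php
<?php
// Server side: emit the link encoded with ROT13 instead of in the clear.
$url = 'https://example.com/some/protected/link';
$encoded = str_rot13($url);
?>
<a href="#" class="obf" data-href="<?= htmlspecialchars($encoded) ?>">link</a>

<script>
// Client side: undo ROT13 and restore the real href.
function rot13(s) {
  return s.replace(/[a-zA-Z]/g, function (c) {
    var base = c <= 'Z' ? 65 : 97;
    return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
  });
}
document.querySelectorAll('a.obf').forEach(function (a) {
  a.href = rot13(a.dataset.href);
});
</script>
```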
The javascript suggestion could work, but you would render your page inaccessible to those using assistive technologies like screen readers as well as anyone without javascript.
Another possible option would be to generate a cryptographic nonce. This technique is currently used to protect against CSRF attacks, but could also be used to ensure that the scraper would have to request a page from your site before accessing a link. This approach may not be appropriate if you support hotlinking, but if you just want to make sure that someone went to your site first, it could work.
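A rough sketch of the nonce idea applied to outbound links: hand the visitor a random per-session token when they load your listing page, route every link through a redirect script, and only redirect if the token matches. The file names and parameter names below are made up; the point is just the flow.

```php
<?php
// --- links.php: issue a token and route links through a redirect script ---
session_start();

// Issue a random token the first time this visitor loads a listing page.
if (empty($_SESSION['link_token'])) {
    $_SESSION['link_token'] = bin2hex(random_bytes(16));
}

$target = 'https://example.com/protected/link';
$href = 'go.php?url=' . urlencode($target) . '&t=' . $_SESSION['link_token'];
echo '<a href="' . htmlspecialchars($href) . '">link</a>';

// --- go.php: only redirect if the token matches this session's token, ---
// --- i.e. the client actually loaded our page first.                  ---
session_start();

$token = $_GET['t'] ?? '';
if (empty($_SESSION['link_token']) || !hash_equals($_SESSION['link_token'], $token)) {
    http_response_code(403);
    exit('Forbidden');
}

// A real version should also validate the target URL to avoid an open redirect.
header('Location: ' . $_GET['url']);
```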
Another somewhat ghetto option would be to use referrers. These can be easily faked, but it might prevent some of the dumber scrapers. This also requires that you know where your users came from before they hit your site.
Can you let us know if you are hotlinking or if the user comes to your site before going to the protected link? We might be able to provide better advice that way.
The ways I can think of are:
1. Measure the time between actions.
2. Compare the posts' content (if they're too similar to each other) or, better yet, compare only the posted links.
3. Check the distribution of activity over the period the user is active (if the user posts, say, once every hour for a whole week, then we either have a superman or a bot here).
4. Look for the special activity you'd expect: on Stack Overflow, for example, I would expect users to click their user name link (top middle) to see their new answers, comments, questions, etc.
5. (added by chakrit) The number of links in a post.
6. Not a heuristic: use some async JS for the user login. (This just makes life a bit harder for the bot programmer.)
7. (added by Alekc) Not a heuristic: User-Agent values.
8. And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as spam; enough spam votes means this is a spam user (deciding how many votes is "enough" is the work here).
Any more ideas?
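A rough sketch in PHP combining a few of the heuristics above (time since the last post, link count, similarity to the previous post). The thresholds are arbitrary examples, not tuned values:

```php
<?php
// Toy spam score combining heuristics 1, 2 and 5 from the list above.
function spam_score(string $post, string $previousPost, int $secondsSinceLastPost): int
{
    $score = 0;

    // Heuristic 1: posting again within a few seconds is suspicious.
    if ($secondsSinceLastPost < 10) {
        $score += 2;
    }

    // Heuristic 5: lots of links is a classic spam signal.
    $linkCount = preg_match_all('~https?://~i', $post);
    if ($linkCount > 2) {
        $score += $linkCount;
    }

    // Heuristic 2: near-duplicate of the user's previous post.
    similar_text($post, $previousPost, $percent);
    if ($percent > 90) {
        $score += 3;
    }

    return $score; // e.g. treat anything above 4 as "needs review"
}
```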
I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent bot creator. Using the C# browser control to create your bot would pretty much render 6 useless, and from what I've seen, that's a pretty common approach with that type of software.
Validating on the user agent is pretty much useless too; all of the blog spam I used to get came from bots appearing to be valid web browsers.
I used to get a lot of blog spam. I would literally be deleting hundreds of comments a day. I made use of reCAPTCHA and now I might get one a month.
If you really want to build something like this, I would attempt it by doing the following:
The user starts off with no ability to post a URL.
After X posts have been analyzed in relation to the other posts in the thread, give them access to post URLs.
The user's activity on the site, the post quality, and whatever other factors you deem necessary feed into a reputation for that user's IP.
Then, based on the reputation of that IP and of the other IPs on the same subnet, you can make whatever other decisions you want.
That was just the first thing that came to mind. Hope it helps.
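A minimal sketch of that "no URLs until you've earned it" gate. The function and field names are made up; the $currentUser data would come from your own user or IP records:

```php
<?php
// Require a handful of approved posts and a clean record before links are allowed.
function may_post_urls(array $user): bool
{
    return ($user['approved_posts'] ?? 0) >= 5
        && ($user['spam_flags'] ?? 0) === 0;
}

// Example record; in practice this would be loaded from your database.
$currentUser = ['approved_posts' => 2, 'spam_flags' => 0];

$post = $_POST['body'] ?? '';
if (preg_match('~https?://~i', $post) && !may_post_urls($currentUser)) {
    exit('Sorry, new accounts cannot post links yet.');
}
```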
The number of links in a post.
I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.
And most of the spam comments on my blog contain 10+ links.
Speaking of which... you just might want to check out the Akismet API itself; they are extremely effective.
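A rough sketch of calling Akismet's comment-check endpoint from PHP. The endpoint and parameter names below are from memory; double-check them against the Akismet API documentation before relying on this.

```php
<?php
// Assumed endpoint and parameters; verify against the official Akismet docs.
$apiKey = 'your-akismet-key';
$fields = http_build_query([
    'blog'            => 'https://yourblog.example.com',
    'user_ip'         => $_SERVER['REMOTE_ADDR'],
    'user_agent'      => $_SERVER['HTTP_USER_AGENT'] ?? '',
    'comment_type'    => 'comment',
    'comment_author'  => $_POST['author'] ?? '',
    'comment_content' => $_POST['body'] ?? '',
]);

$context = stream_context_create(['http' => [
    'method'  => 'POST',
    'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
    'content' => $fields,
]]);

$response = file_get_contents(
    "https://$apiKey.rest.akismet.com/1.1/comment-check",
    false,
    $context
);

// Akismet is documented as answering with the literal string "true" for spam.
$isSpam = trim((string) $response) === 'true';
```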
How about a search for spam-related keywords in the post body?
Not a heuristic but an effective approach: You can also keep up-to-date with the stats published by StopForumSpam using their APIs.
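A quick sketch of a StopForumSpam lookup by IP. The query parameters and response field names here are from memory; check their API documentation for the exact format.

```php
<?php
// Assumed request/response shape; verify against the StopForumSpam API docs.
$ip = $_SERVER['REMOTE_ADDR'];
$json = file_get_contents('https://api.stopforumspam.org/api?ip=' . urlencode($ip) . '&json');
$data = json_decode((string) $json, true);

// "appears" indicates the IP is in their database; "frequency" is how often it was reported.
$looksSpammy = !empty($data['ip']['appears']) && ($data['ip']['frequency'] ?? 0) > 0;
```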
Time between page visits is common I believe.
I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.
You might want to check whether they've come from an IP address on a spam blacklist (see http://www.spamhaus.org/).
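A minimal sketch of a DNS blacklist check against Spamhaus ZEN: reverse the IPv4 octets, append the zone, and see whether an A record comes back.

```php
<?php
// DNSBL lookup: 1.2.3.4 becomes 4.3.2.1.zen.spamhaus.org.
function is_listed_in_zen(string $ip): bool
{
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    return checkdnsrr($reversed . '.zen.spamhaus.org.', 'A');
}

if (is_listed_in_zen($_SERVER['REMOTE_ADDR'])) {
    // Flag for moderation rather than rejecting outright; blacklists have false positives.
    $needsReview = true;
}
```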
There is another answer that suggests using Akismet for detecting spam, which I completely endorse.
However, they are not the only player on the block.
There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It's safe to say they take pretty much the same approach as Akismet.
You might also want to check out Project Honeypot. From what I can tell, it can do a lookup based on the user's IP address, and if it is a known malicious IP, it will tell you (harvester or something like that).
Finally, you can check out LinkSleeve, which claims to approach comment spam in a different way: it checks the links in the comment and makes a determination based on where those links point.
Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
I have some doubts about point 4; anyway, I would also add the User-Agent header. It's pretty easy to fake, but in my experience about 90% of bots use Perl as the UA.
I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords, and if the content is too rich in keywords, suspect it as spam.
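A rough sketch of that keyword-density idea. The keyword list below is just a made-up example; in practice you would pull a real list from whatever service provides it.

```php
<?php
// Ratio of spam keywords to total words in the post body.
function keyword_density(string $post, array $keywords): float
{
    $words = preg_split('~\s+~', strtolower(trim($post)));
    if (count($words) === 0) {
        return 0.0;
    }
    $hits = count(array_intersect($words, $keywords));
    return $hits / count($words);
}

$spamKeywords = ['viagra', 'casino', 'cheap', 'loans', 'pills']; // example list only
if (keyword_density($_POST['body'] ?? '', $spamKeywords) > 0.2) {
    // Too keyword-rich: treat as suspected spam.
    $suspectedSpam = true;
}
```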