block compete.com - alexa

Is there a way to block compete.com or Alexa from getting the traffic numbers for my website? For competitive reasons, I don't want anyone knowing that information. I don't have advertising, just a paid-only site. Would blocking compete.com or Alexa in any way affect my search engine rankings?

Is there a way to block compete.com or alexa from getting the traffic on my website?
As far as I know, Alexa and Compete.com both use user-installed toolbars that analyze what sites the user visits (similar to how TV Ratings are done; see e.g. the Wikipedia article on the Alexa Toolbar). They do not fetch your traffic numbers as such; they estimate their own.
To mess with their counting process, you would have to block users who have the Alexa Toolbar installed. While that may be possible, it would be foolish in the extreme: you would be locking out innocent users and lowering your own Alexa rating, without affecting Alexa or Compete in the slightest.

Yes, you can do it easily and automatically, without having to contact them.
For Alexa, you can do it using robots.txt:
User-agent: ia_archiver
Disallow: /
The toolbar plugin is what is actually used to monitor traffic.
I don't have details for the others, but I've just found some information about robots.txt for Compete, Quantcast and Alexa: look at the end of this article for practical details.
Please note that adding this to your robots.txt will also force them to remove, retroactively, all the data they have gathered.
This is useful for websites like archive.org (linked to Alexa; in fact the ia in ia_archiver stands for Internet Archive). They explain the process here: http://www.archive.org/about/exclude.php
Please note that in most countries, being able to remove your data from databases is a legal right.

Related

Creating different responses per device

Is it possible for an Alexa user to get different responses based on configuration in the app? For example, my skill returns measurements. Some users may prefer metric and others imperial. I'd like users to be able to specify this (and maybe some other things) to get a personalised experience. Can this be configured in the Amazon Alexa app?
I was thinking I might have to have some persistent storage for this (DDB, for example), which would mean the app would write to DDB and the skill would read from it to get the personalised response.
Thanks
Can this be configured in the Amazon Alexa app?
Unfortunately, not in the way you seem to be suggesting.
If you really wanted users to set preferences through the app, this could be done through account linking. However, it is generally discouraged (Alexa is meant to be "Voice-First") and likely to present additional obstacles if what you're wanting to do is allow users to set preferences for different devices.
However, using persistent storage for user preferences is in general a good idea and, as you've suggested, DynamoDB can do this.
If you take this approach, you could ask users what their preferences are the first time they use the skill on a device and store this together with the device ID (see the sketch below).
There is some good information about device ID in the Amazon documentation and some helpful tips here:
Get unique device id for every amazon echo devices
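To make that concrete, here is a minimal sketch in Python using boto3, assuming a hypothetical DynamoDB table named UserPrefs with deviceId as its partition key (the table and attribute names are illustrative, not from the question):

import boto3

# Hypothetical table: UserPrefs, partition key "deviceId" (string).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserPrefs")

def save_preference(device_id, units):
    # Store the user's preferred unit system ("metric" or "imperial").
    table.put_item(Item={"deviceId": device_id, "units": units})

def get_preference(device_id, default="metric"):
    # Look up the stored preference, falling back to a default.
    item = table.get_item(Key={"deviceId": device_id}).get("Item")
    return item["units"] if item else default

The skill would call get_preference with the device ID from the incoming request, and ask for (and save) the preference the first time no record is found.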

Options to change the identified intent in Alexa fulfillment

My understanding is that Amazon ASK still does not provide:
The raw user input
An option for a fallback intent
An API to dynamically add possible options from which Alexa can be better informed to select an intent.
Is this right or am I missing out on knowing about some critical capabilities?
Actions on Google with Dialogflow provides:
Raw user input for analysis: request.body.result.resolvedQuery
Fallback intents: https://dialogflow.com/docs/intents#fallback_intents
An API to dynamically add user expressions (aka sample utterances): PUT /intents/{id}
These tools give developers the ability to check whether the identified intent is correct and, if not, fix it.
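For illustration, a minimal sketch of the Dialogflow side (a v1-style webhook payload, matching the resolvedQuery field path cited above; the Flask setup and route name are my own assumptions):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    # resolvedQuery is the raw text of what the user actually said.
    raw_text = body["result"]["resolvedQuery"]
    intent = body["result"]["metadata"]["intentName"]
    # ...check raw_text against your own data and override the chosen
    # intent here if it looks wrong...
    return jsonify({"speech": "You said: " + raw_text})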
I know there have been a lot of questions asked previously, just a few here:
How to add slot values dynamically to alexa skill
Can Alexa skill handler receive full user input?
Amazon Alexa dynamic variables for intent
I have far more users on my Alexa skill than my AoG app simply because of Amazon's dominance to date in the market - but their experience falls short of a Google Assistant user experience because of these limitations. I've been waiting for almost a year for new Alexa capabilities here, thinking that after Amazon's guidance to not use AMAZON.LITERAL there would be improvements coming to custom slots. To date it still looks like this old blog post is still the only guidance given. With Google, I dynamically pull in utterance options from a db that are custom for a given user following account linking. By having the user's raw input, I can correct the choice of intent if necessary.
If you've wanted these capabilities but have had to move forward without them, what tricks do you have to get accurate intent handling with Amazon when you don't know what the user will say?
EDIT 11/21/17:
In September Amazon announced the Alexa Skill Management API (SMAPI) which does provide the 3rd bullet above.
This should really be a comment, but I haven't written enough on Stack Overflow to be able to comment. I agree with you on all points.
But Amazon's Alexa also has one big advantage:
the intent schema seems to directly influence the voice-to-text recognition. (By the way, can someone confirm whether this is correct?)
On Google Home this does not seem to be the case,
so matching unusual names is even harder than on Alexa,
and it sometimes recognizes complete nonsense.
I'm not sure which I prefer at the moment.
My feeling is that Alexa is much better for small apps, because it matches the intent phrases better when it has fewer choices.
But with large intent schemas it gets into real trouble, and in my tests some of the intents were never matched correctly.
Here Google Home and the Actions SDK probably win: speech-to-text seems to happen first, and then the resulting string is pattern-matched against the intent schema, which is probably more robust for larger schemas.
To give you something like an answer to your question:
You can try to add as much as possible of what might be said to a slot, and then match the value from the Alexa request against your database using Jaro-Winkler or some other string distance (see the sketch below).
What I tried for Alexa was to find phrases that are close to what the user says and add those as phrases to fill a slot.
So each module on our web page became an intent in the schema, and I then asked the user to say what exactly should be done in that module (this was the slot-filling request); the answer was the slot-filling utterance.
For me that worked slightly better than the regular intent schema, but it requires more talking, so I don't like it that much.
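A minimal sketch of the matching step (using Python's standard-library difflib ratio as a stand-in for Jaro-Winkler; swap in a dedicated string-distance library if you want that exact metric):

from difflib import SequenceMatcher

def best_match(heard, candidates, threshold=0.7):
    # Compare what Alexa heard against the known values from your
    # database and return the closest one, or None if nothing is
    # close enough to trust.
    scored = [(SequenceMatcher(None, heard.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# e.g. best_match("toy yoda", ["Toyota", "Jeep", "Honda"]) would pick "Toyota"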
Let me go straight to answering your 3 questions:
1) Alexa does provide the raw input via the slot type AMAZON.LITERAL, but it's now deprecated and you're advised to use AMAZON.SearchQuery for free-form capture. However, if instead of using SearchQuery you define a custom slot type and provide samples (training data), the ASR will work better.
2) Alexa has supported AMAZON.FallbackIntent since, I believe, May 2018. It works by automatically generating a model for your skill in which out-of-domain requests are routed to the fallback intent. It works well.
3) Dynamically adding slot type values is not feasible, because when you provide samples you're really providing training data for a model that will then be able to process similar values beyond the ones you defined. You may have noticed that after providing a voice interaction model schema you have to build the model; it is in this step that the training data in the samples is used. For example, if you define a custom slot type "Car" and provide the samples "Toyota", "Jeep", "Chevrolet" and "Honda", the system will route to the same intent even if the user says "Ford".
Note: SMAPI does allow you to get and update the interaction model, so technically you could download the model via the API, modify it with new training data, upload it again and rebuild the model. This is rather awkward, though.
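A rough sketch of that workflow in Python (the endpoint path is the SMAPI v1 interaction-model route as I recall it from the documentation; the skill ID, token, and slot type name are placeholders, so verify everything against the current SMAPI docs):

import requests

SKILL_ID = "amzn1.ask.skill.XXXX"   # placeholder
LOCALE = "en-US"
TOKEN = "..."   # Login with Amazon access token (placeholder)

url = ("https://api.amazonalexa.com/v1/skills/{}/stages/development/"
       "interactionModel/locales/{}").format(SKILL_ID, LOCALE)
headers = {"Authorization": TOKEN}

# 1. Download the current interaction model.
model = requests.get(url, headers=headers).json()

# 2. Add a new value to a custom slot type (name assumed for illustration).
for slot_type in model["interactionModel"]["languageModel"]["types"]:
    if slot_type["name"] == "Car":
        slot_type["values"].append({"name": {"value": "Ford"}})

# 3. Upload the modified model; Alexa then rebuilds it asynchronously.
requests.put(url, headers=headers, json=model)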

How to get site data

I have a site that features other websites and displays details about them. Now I want to get more information about the sites I feature, like page views, visits, etc.
How do I do that? Is there an API for it?
First of all, information about how many visits, pageviews, etc. other websites get is generally not publicly available, because (obviously) many companies and website owners don't want to share that information, and there's no general-purpose way of getting it.
That said, here's a list of websites which attempt to display that kind of information:
Quantcast
Alexa
Compete.com
Google Ad Planner
I'm sure there are others, but these are the ones I'm familiar with. Some of them have APIs, but keep in mind that none of them provides accurate data, only estimates, simply because the exact numbers are unknown unless the website owner publishes them.
Have a look at Google Analytics. It can give you information about visits, pageviews, trends, browsers used, screen resolutions and much more! (Note, though, that it only reports on sites where you can install its tracking code.)

Stopping spam in web page

So right now my only spam protection is going to be to check all incoming messages against this table, http://www.stopforumspam.com/downloads/, which I have imported into my database; if the IP is found, the message will not be posted (sketched below).
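(For reference, the lookup being described might look like this; a sketch assuming the downloaded dump has been imported into a table named banned_ips:)

import sqlite3

# Assumes the stopforumspam dump was imported into
# a table banned_ips(ip TEXT PRIMARY KEY).
conn = sqlite3.connect("spam.db")

def is_banned(ip):
    row = conn.execute("SELECT 1 FROM banned_ips WHERE ip = ?",
                       (ip,)).fetchone()
    return row is not None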
We don't really want to hinder usability by having one of those "Type what you see..." or a sort of e-mail confirm system similar to Craigs List.
Will this IP check be enough to get rid of (most) spam comments, or should I really look into adding something else? Maybe there is some free plugin I haven't found that doesn't hinder usability and would help us out more?
Thanks!
There you go :) http://akismet.com/
There's an API: you send them the comment body and they reply whether it's spam or not. This is maybe the best spam-hunting service; they have large word databases and good self-learning filters.
Additionally, it's free for personal use. I don't know how much it costs for business.
I'm in no way affiliated with them, I just found it by chance a couple of years ago.
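A minimal sketch of the comment-check call in Python (the API key and blog URL are placeholders; Akismet issues the real key when you register):

import requests

API_KEY = "your-akismet-key"   # placeholder
ENDPOINT = "https://{}.rest.akismet.com/1.1/comment-check".format(API_KEY)

def is_spam(ip, user_agent, content):
    resp = requests.post(ENDPOINT, data={
        "blog": "https://example.com",   # your site's URL
        "user_ip": ip,
        "user_agent": user_agent,
        "comment_content": content,
    })
    # Akismet answers with the literal string "true" for spam.
    return resp.text == "true"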
akismet.com offers a quality service that will protect your site. Depending on the nature of your site there may be a fee. If your site is a personal blog they have a "WHAT IS AKISMET WORTH TO YOU?" plan where you can choose to pay $0. They would prefer that you pay $3 to $5 per month.
There's a reason captchas ("type what you see..." things) and email confirmation lists exist - there's always someone attempting to circumvent your site's security for personal gain. In all likelihood this will extend beyond spam, as well.
Just keep in mind that you're putting your trust in any external solution that you go with (which is why things like in-application email confirmations and captchas have gotten popular, considering they're not too difficult to implement and you have full control over them).

Identifying hostile web crawlers

I am wondering if there are any techniques to identify a web crawler that collects information for illegal use. Plainly speaking, data theft to create carbon copies of a site.
Ideally, this system would detect a crawling pattern from an unknown source (if not on the list with the Google crawler, etc), and send bogus information to the scraping crawler.
If, as a defender, I detect an unknown crawler that hits the site at regular intervals, the attacker will randomize the intervals.
If, as a defender, I detect the same agent/IP, the attacker will randomize the agent.
And this is where I get lost - if an attacker randomizes the intervals and the agent, how would I not discriminate against proxies and machines hitting the site from the same network?
I am thinking of checking whether the suspect agent supports JavaScript and cookies. If the bogey can't do either consistently, then it's a bad guy.
What else can I do? Are there any algorithms, or even systems designed for quick on-the-fly analysis of historical data?
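(For illustration, the JavaScript-and-cookie check proposed above might be sketched like this, assuming a Flask app; the route, cookie name, and threshold are all illustrative:)

from flask import Flask, make_response, request

app = Flask(__name__)
no_cookie_hits = {}   # page views per IP that never sent the probe cookie back

@app.route("/")
def page():
    ip = request.remote_addr
    if "probe" in request.cookies:
        no_cookie_hits.pop(ip, None)   # real browser: the cookie came back
    else:
        no_cookie_hits[ip] = no_cookie_hits.get(ip, 0) + 1
    # A tiny script sets the cookie; clients that never execute it
    # (or never return it) keep accumulating cookieless hits.
    return make_response(
        "<script>document.cookie = 'probe=1; path=/';</script><p>content</p>")

def looks_like_bot(ip, threshold=5):
    return no_cookie_hits.get(ip, 0) >= threshold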
My solution would be to set a trap. Put some pages on your site where access is banned by robots.txt, make a link to them on your page but hide it with CSS, then IP-ban anybody who visits them.
This will force the offender to obey robots.txt, which means that you can keep important information or services permanently out of his reach, making his carbon-copy clone useless.
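A minimal sketch of such a trap (the Flask app and path name are my own assumptions; in production you would persist the ban list rather than keep it in memory):

from flask import Flask, abort, request

app = Flask(__name__)
banned = set()   # in production, persist this (database, firewall rules, ...)

@app.before_request
def block_banned():
    if request.remote_addr in banned:
        abort(403)

# /trap is disallowed in robots.txt and linked only via a CSS-hidden anchor,
# so neither legitimate users nor well-behaved crawlers should ever hit it.
@app.route("/trap")
def trap():
    banned.add(request.remote_addr)
    abort(403)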
Don't try to recognize crawlers by IP, timing, or intervals; use the data you send to the crawler to trace them.
Create a whitelist of known good crawlers and serve them your content normally. For everyone else, serve pages with an extra bit of unique content that only you know how to look for. Use that signature to later identify who has been copying your content, and block them.
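One way to sketch such a signature (an HMAC of the requesting IP under a private key; the key and token placement are illustrative):

import hashlib
import hmac

SECRET = b"keep-this-private"   # placeholder server-side key

def watermark(ip):
    # A short token tied to the requesting IP. Embed it somewhere
    # innocuous (an HTML comment, a CSS class name); if it later turns
    # up on a copycat site, it identifies which requester scraped the page.
    return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:12]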
And how do you keep someone from hiring a person in a low-wage country to use a browser to access your site and record all of the information? Set up a robots.txt file, invest in security infrastructure to prevent DoS attacks, obfuscate your code (where it's accessible, like JavaScript), patent your inventions, and copyright your site. Let the legal people worry about someone ripping you off.
