Kik search engine not picking up mobile app even though robots.txt is empty

Kik's search engine is not picking up our HTML5 mobile app even though our robots.txt file is empty.
Our mobile app is at http://panabee-games.herokuapp.com/spoof/spoof.
Our robots.txt file is at http://panabee-games.herokuapp.com/robots.txt.
Did this happen to others?
Please note: we were asked by Kik customer support to post this question here.

Kik's indexing service determines listing eligibility based on three things:
kik.js is included
all necessary tags are included (description meta, icon link)
robots.txt doesn't disallow indexing
This is the rule I would recommend using:
User-agent: KikBot
Disallow:
# Other rules if needed
# User-agent: *
# Disallow: /
I would also recommend putting your kik.js script at the bottom of the body. Other than that your code looks good.
Also, keep in mind that once your app is indexed, it is added to a whitelist queue that is reviewed by a Kik representative, who quickly verifies that the app does not contain any inappropriate content and then adds it to the list of searchable apps.

Related

Should I put after-login pages in robots.txt?

On our sites, some pages can only be accessed after login.
Is it good practice to disallow these after-login pages in robots.txt?
I have really searched for an answer on Google, but nothing helped.
In general, I would heed the advice from this article:
To summarize, always add the login page to the robots exclusion protocol file, otherwise you will end up:
1 - sacrificing valuable "search engine crawling time" in your site.
2 - spending unnecessary bandwidth and server resources.
3 - potentially even blocking crawlers from your content.
https://blogs.msdn.microsoft.com/carlosag/2009/07/06/seo-tip-beware-of-the-login-pages-add-them-to-robots-exclusion/
Similarly:
https://webmasters.stackexchange.com/questions/86395/using-robots-txt-to-block-sessionid-urls
Ideally, you'd be able to exclude all of those pages via a common URL prefix or wildcard pattern. For example, if all of the URLs for these pages started with /my-account/, then you should be able to do this:
Disallow: /my-account/
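Putting the two pieces of advice together, a robots.txt sketch might look like this (the /login and /my-account/ paths are only examples; substitute your own URL prefixes):
User-agent: *
# Keep crawlers away from the login page and everything behind it.
Disallow: /login
Disallow: /my-account/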

Why couldn't I see any text in "http://crawlservice.appspot.com/?key=123456&url=http://mydomain.com#!article"?

OK, I found this link https://code.google.com/p/gwt-platform/wiki/CrawlerSupport#Using_gwtp-crawler-service that explains how you can make your GWTP app crawlable.
I have some GWTP experience, but I know nothing about App Engine.
Google said its "crawlservice.appspot.com" can parse any Ajax page. Now I have a page "http://mydomain.com#!article" that contains an article pulled from the database. Say that page has the text "this is my article". Now I open this link:
crawlservice.appspot.com/?key=123456&url=http://mydomain.com#!article. I can see all the JavaScript, but I couldn't find the text "this is my article".
Why?
Now let's check with a real-life example.
Open this link https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k and you will see the text "If i open that url in IE".
Now open http://crawlservice.appspot.com/?key=123456&url=https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k: you can see all the JavaScript, but there is no text "If i open that url in IE".
Why is that?
So if I use http://crawlservice.appspot.com/?key=123456&url=mydomain#!article, will the Google crawler be able to see the text in mydomain#!article?
Also, why key=123456? Does that mean everyone can use this service? Do we have our own key? Does Google limit the number of calls to the service?
Could you explain all of these things?
Extra Info:
Christopher suggested that I use this example:
https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service
However, I ran into another problem. My app is pure GWTP; it doesn't have an appengine-web.xml in WEB-INF. I have no idea what App Engine or GAE means, or what Maven is.
Do I need to register for App Engine?
My app may have a lot of traffic. Also, I am using a GoDaddy VPS. I don't want to register for App Engine, since I would have to pay Google for extra traffic.
Everything in my GWTP app is OK right now except the crawler function.
So if I don't use Google App Engine, how can I build the crawler function for GWTP?
I tried to use HTMLUnit for my app, but HTMLUnit doesn't work for GWTP (see details here: Why HTMLUnit always shows the HostPage no matter what url I type in (Crawlable GWT APP)?).
I believe you are not allowed to crawl Google Groups. Probably they are actively trying to prevent this, so you do not see the expected content.
There are a couple of points I wish to elaborate on:
The Google Code documentation is no longer maintained. You should look on Github instead: https://github.com/ArcBees/GWTP/wiki/Crawler-Support
You shouldn't use http://crawlservice.appspot.com. This isn't a Google service; it's out of date, and we may decide to delete it down the road. It only serves as a public example. You should create your own application on App Engine (https://appengine.google.com/).
There is a sample here (https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service) using GWTP's Crawler Service. You can basically copy-paste it. Just make sure you update the <application> tag in appengine-web.xml to the name of your application and use your own service key in CrawlerModule.
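For reference, the <application> tag lives in the app's WEB-INF/appengine-web.xml. A minimal sketch (the application id and version below are placeholders, not values from the sample):
<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <!-- Replace with your own App Engine application id -->
  <application>your-app-id</application>
  <version>1</version>
  <threadsafe>true</threadsafe>
</appengine-web-app>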
Finally, if your client uses GWTP and you followed the documentation, it will work. If you want to try it manually, you must encode the query parameters.
For example, http://crawlservice.appspot.com/?key=123456&url=http://www.arcbees.com#!service will not work, because the hash (everything including and after #) is not sent to the server.
On the other hand http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23!service will work.
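If you build such a URL by hand, the target URL has to be percent-encoded so the fragment survives the trip to the server. A minimal sketch in Python, using the same placeholder key and URL as above:
from urllib.parse import urlencode

# Percent-encode the target URL so the "#!" fragment reaches the crawl service
# instead of being stripped by the browser before the request is sent.
params = {"key": "123456", "url": "http://www.arcbees.com/#!service"}
print("http://crawlservice.appspot.com/?" + urlencode(params))
# -> http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23%21service
# (%21 is just the encoded form of "!", so this is equivalent to the URL above)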

Get all the urls under domain (YQL?)

I want to get all the URLs under a domain.
When I looked at their robots.txt, it clearly states that some of the folders are off-limits to robots, but I am wondering whether there is a way to get all the URLs that are open to robots. There is no sitemap in the robots.txt.
For example, their robots.txt has information that looks similar to this:
User-agent: *
Allow: /
Disallow: /A/
Disallow: /B/
Disallow: /C/
...
But I am interested in all the URLs that are available to robots yet not included in this blacklist, like:
/contact
/welcome
/product1
/product2
...
Any idea will be appreciated. I am also curious whether there is a Yahoo Query Language (YQL) solution for this problem, because this work has probably already been done by Yahoo.
Thanks!
Yes, there is a way to get all the URLs open to robots.
A simple solution would be to go to www.google.com and type site:www.website.com into the search bar.
While that isn't guaranteed to get you every page, it will get you all the pages Google has indexed. And Google adheres to robots.txt, so it seems to fit your purpose.
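If you also want to check programmatically whether a given URL is open to robots, the Python standard library can evaluate candidate URLs against the robots.txt. A small sketch (the domain and paths are placeholders):
from urllib import robotparser

# Point the parser at the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("http://www.website.com/robots.txt")
rp.read()

# Check candidate URLs against the rules for a generic crawler ("*").
for path in ["/contact", "/welcome", "/A/hidden"]:
    print(path, rp.can_fetch("*", "http://www.website.com" + path))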

Caching & GZip on GAE (Community Wiki)

Why does it seem like Google App Engine isn’t setting appropriate cache-friendly headers (like far-future expiration dates) on my CSS stylesheets and JavaScript files? When does GAE gzip those files? My app.yaml marks the respective directories as static_dirs, so the lack of far-future expiration dates is kind of surprising to me.
This is a community wiki to showcase the best practices regarding static file caching and gzipping on GAE!
How does GAE handle caching?
It seems GAE sets near-future cache expiration times, but it does use the ETag header. This is used so browsers can ask, “Has this file changed since it had an ETag of X68f0o?” and hear “Nope – 304 Not Modified” back in response.
As opposed to far-future expiration dates, this has the following trade-offs:
Your end users will get the latest copies of your resources, even if they have the same name (unlike far-future expiration). This is good.
Your end users will however still have to make a request to check on the status of that file. This does slow down your site, and is “pure overhead” when the content hasn’t changed. This is not ideal.
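You can watch the ETag round trip yourself with a conditional request. A quick sketch using only the Python standard library (the URL is a placeholder for one of your static files):
import urllib.request, urllib.error

url = "http://your-app.appspot.com/static/styles.css"  # placeholder URL

# First request: note the ETag the server hands back.
with urllib.request.urlopen(url) as resp:
    etag = resp.headers.get("ETag")

# Second request: ask "has it changed since ETag X?". An unchanged file comes
# back as 304 Not Modified with no body (urllib surfaces it as an HTTPError).
req = urllib.request.Request(url, headers={"If-None-Match": etag})
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as err:
    print(err.code)  # 304 when the file has not changed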
Opting for far-future cache expiration instead of (just) etag
To use far-future expiration dates takes two steps and a bit of understanding.
You have to manually update your app to request new versions of resources, e.g. by naming files mysitestyles.2011-02-11T0411.css instead of mysitestyles.css. There are tools to help automate this, but I’m not aware of any that directly relate to GAE.
Configure GAE to set the expiration times you want by using default_expiration and/or expiration in app.yaml. See the GAE docs on static files.
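For step 2, a minimal app.yaml sketch (the paths and durations are examples, not recommendations):
# Fallback for static handlers that don't set their own expiration.
default_expiration: "10d"

handlers:
# Far-future expiration for versioned assets (e.g. mysitestyles.2011-02-11T0411.css).
- url: /static
  static_dir: static
  expiration: "365d"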
A third option: Application manifests
Cache manifests are an HTML5 feature that overrides cache headers. MDN article, DiveIntoHTML5, W3C. This affects more than just your script and style files' caching, however. Use with care!
When does GAE gzip?
According to Google’s FAQ,
Google App Engine does its best to serve gzipped content to browsers that support it. Taking advantage of this scheme is automatic and requires no modifications to applications.
We use a combination of request headers (Accept-Encoding, User-Agent) and response headers (Content-Type) to determine whether or not the end-user can take advantage of gzipped content. This approach avoids some well-known bugs with gzipped content in popular browsers. To force gzipped content to be served, clients may supply 'gzip' as the value of both the Accept-Encoding and User-Agent request headers. Content will never be gzipped if no Accept-Encoding header is present.
This is covered further in the runtime environment documentation (Java | Python).
Some real-world observations do show this to generally be true. Assuming a gzip-capable browser:
GAE gzips actual pages (if they have proper content-type headers like text/html; charset=utf-8)
GAE gzips scripts and styles in static_dirs (defined in app.yaml).
Note that you should not expect GAE to gzip images like GIFs or JPEGs as they are already compressed.
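If you want to spot-check this behavior, you can force the gzip path described in the FAQ and inspect the response headers. A small sketch using only the Python standard library (the URL is a placeholder for your own app):
import urllib.request

# Per the FAQ, supplying "gzip" as both Accept-Encoding and User-Agent
# makes App Engine serve gzipped content whenever it can.
req = urllib.request.Request(
    "http://your-app.appspot.com/static/styles.css",  # placeholder URL
    headers={"Accept-Encoding": "gzip", "User-Agent": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Content-Encoding"))  # "gzip" when compression was applied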

Subdomains are preventing my search results from rising as they should in page rank

My problem is that I have a site which requires a dedicated page for every city I choose to support. Early on, I decided to use subdomains rather than directories after my domain (i.e. I used la.truxmap.com rather than truxmap.com/la). I realize now that this was a major mistake, because Google seems to treat la.truxmap.com as a completely different site from ny.truxmap.com. So, for instance, if I search "la food truck map" my site will be near the top; however, if I search "nyc food truck map" I'm nowhere in sight, because ny.truxmap.com wouldn't rank very high by itself, and it doesn't get the boost it ought to be getting from the better-known la.truxmap.com.
So a mistake I made a year ago is now haunting my page rank. I'd like to know the most painless way of resolving my dilemma. I have received so much press at la.truxmap.com that I can't just kill the site, but could I redirect all requests to la.truxmap.com to truxmap.com/la, and do the same for all supported cities, without trashing the current, satisfactory page rank results I'm getting from la.truxmap.com?
EDIT
I left out some critical information. I am using Google Apps to manage my domain (that is, to add the subdomains) and Google App Engine to host my site. Google Apps provides a simple mechanism to mask truxmap.appspot.com (the App Engine domain) as la.truxmap.com, but I don't see how I can mask it as truxmap.com/la. If I can get this done, then I can just 301 redirect la.truxmap.com to truxmap.com/la as suggested below.
Thanks so much!
You could send a "301 Moved Permanently" redirect to cause the Google crawler to update its references to your site, no?
See this article on 301 redirects and SEO.
You'll need to modify your app as follows:
Add www.truxmap.com as an alias for the app (you can't serve naked domains in App Engine, so just truxmap.com won't work)
Add support to your app for handling URLs of the form www.truxmap.com/something/, routing to the same handlers as the subdomain. You'll need to make sure you've debugged any relative path issues well before continuing.
Modify your app to serve 301 (permanent) redirects from every URL under something.truxmap.com/whatever to www.truxmap.com/something/whatever.
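A rough sketch of that last step, assuming a Python runtime with webapp2 (the handler and routing below are hypothetical, not your actual app, and would only be wired up for requests that arrive on the old subdomains):
import webapp2

class CityRedirect(webapp2.RequestHandler):
    # Permanently redirect la.truxmap.com/whatever to www.truxmap.com/la/whatever.
    def get(self, path=""):
        city = self.request.host.split(".")[0]  # e.g. "la" from la.truxmap.com
        target = "http://www.truxmap.com/%s/%s" % (city, path)
        if self.request.query_string:
            target += "?" + self.request.query_string
        self.redirect(target, permanent=True)  # 301 Moved Permanently

# Catch-all route so every path on the old subdomains gets redirected.
app = webapp2.WSGIApplication([(r"/(.*)", CityRedirect)])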
