Is there a way to make more than 10K requests on Google search from the same IP? - screen-scraping

I am currently working on an app that requires scraping data from Google's search results, for example google.com/search?q=domain.com and so on. But Google blocks my IP address after a number of requests. I know there are Google APIs, but there are many sites around that just scrape the data directly.

Scraping Google search results is a breach of the terms of service. Google actively discourages this and blocks those who do it. They share their information with you free of charge, but they don't appreciate you trying to copy all of it.
You would be better off crawling the domain yourself.

Too bad I did not see your question earlier; in case it's not too late:
Scraping Google does indeed violate their terms of service; on the other hand, you may choose not to accept them. You accept their TOS when you create a Google account, for example, but as far as I know you can also withdraw that acceptance (at least when they change the terms).
For a smaller amount of data you can use their API, or their commercial API, but if you need the results and rankings exactly as a user sees them (for SEO purposes), I know of no official way to get their permission.
I am not a lawyer, so you might want to consult one if you want to be sure about the legal consequences.
However, scraping Google usually does not lead to legal problems. Even Bing (Microsoft's engine) was caught scraping Google for unknown keywords a few years ago. My personal guess is that a majority of their original results were secretly copied from Google.
There is an open source project, http://google-rank-checker.squabbel.com, which can scrape large numbers of Google results. As far as I remember, without modification it is limited to roughly 50-70k result pages per day.
I suggest taking a look at the code; it's PHP with libcURL.
You will also need clean IP addresses (not shared, not previously abused). Scraping from a single IP will get you blocked by Google within an hour.
Usually the first thing that happens is a captcha; solving the captcha sets a cookie that lets you keep making requests.
If you continue anyway, you will receive a complete ban.
And if you hammer Google with a huge volume of requests, you will alert their staff, who can place a manual ban on a whole ISP or network block.
A sustainable rate is around 10 requests per hour per IP; that is what I have stuck to on my related projects.
So if you scrape Google, make sure you have functions that validate the results and watch for unexpected responses. When one is detected, your code should immediately stop accessing Google so it does not keep requesting a page that is only showing a captcha.
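A minimal sketch of such a guard in Python with the requests library (the captcha markers, user agent, and delay here are assumptions for illustration, not values documented by Google):

import time
import requests

# Strings that typically appear on Google's block/captcha pages;
# these markers are assumptions, not an official list.
CAPTCHA_MARKERS = ("unusual traffic", "/sorry/", "g-recaptcha")

def fetch_serp(session, query):
    resp = session.get(
        "https://www.google.com/search",
        params={"q": query},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    body = resp.text.lower()
    # Validate the result: stop immediately on a captcha or block page.
    if resp.status_code != 200 or any(m in body for m in CAPTCHA_MARKERS):
        raise RuntimeError("Captcha or block page detected; stop scraping")
    return resp.text

session = requests.Session()
for query in ["site:example.com", "example keyword"]:
    html = fetch_serp(session, query)
    time.sleep(360)  # ~10 requests per hour per IP, as suggested above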

Related

Method to find API call usage

I am operating in a production environment with a number of different applications using the Amazon API. Of these, some are our own home-grown apps, and others are 3rd party shipping applications.
I have a situation where I am hitting an hourly throttle for the Reports API 'GetReport' request, and I am trying to determine what is causing us to be throttled. By my count, we shouldn't be exceeding ~60 calls per hour at the absolute maximum. (Just a note: while the API info says this call is throttled at 60 requests per hour, the exception I received back indicated a cap of 120 requests per hour. Maybe the exception is wrong and I'm hitting a 60-request cap?)
Is there either an API call to determine current call usage, or a method of accessing this information via Amazon Seller Central / Developers Program? I've done some searching around but everything I can find is describing how the throttling works which isn't my problem.
I am currently using C# Amazon MWS libraries for all function calls, although that information is a bit superfluous. Any insight into the proper API call to use, or how to gain access to this information would be greatly appreciated.
Most calls return something like the following headers in the response:
"x-mws-quota-max"=>"60.0",
"x-mws-quota-remaining"=>"51.0",
"x-mws-quota-resetsOn"=>"2016-03-25T16:00:00.000Z"
You should be able to use these to figure out what is causing you to hit the limit quicker than expected, perhaps by logging each call together with the quota data above.
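For example, a small logging helper might look like this (a sketch in Python; the header names come from the response shown above, and the surrounding client code is assumed):

import logging

def log_quota(action, response_headers):
    # Record the quota state after each MWS call so the culprit
    # requests stand out in the logs.
    logging.info(
        "%s: quota max=%s remaining=%s resets=%s",
        action,
        response_headers.get("x-mws-quota-max"),
        response_headers.get("x-mws-quota-remaining"),
        response_headers.get("x-mws-quota-resetsOn"),
    )

# e.g. after each call:
# log_quota("GetReport", response.headers)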
Contact MWS Support here and ask for clarification on your issue. They certainly track your usage, since they are able to cap it. I met with the MWS team a few months ago in Detroit, and they said to ask them any time you have a technical question. They've been really helpful to me.

Gmail API quota units cost

We are building a service that uses the Gmail API. To understand our costs as we scale, I would like to know how much it costs to use the Gmail API. I've followed the instructions at https://developers.google.com/gmail/api/v1/reference/quota up to the point where it says:
If you have enabled billing for your project [we have], clicking Quota takes you to a page where you can view and change quota-related settings.
The only option on that page for changing our daily quota is to "Apply for higher quota"; however, clicking that opens a window that says:
Please be sure to review the existing quota limits to confirm you need more than the daily default.... If you simply have a question on limits, please ask it on the Stack Overflow forum
Thus, I am asking here: what is the cost per API unit when one's needs exceed the daily free quota?
The API isn't marked as "billable", meaning it's free up to a limit and there's no set or published pricing above that. If you are using up your existing quota, or are getting close and want to ask for more, I think the best place is the quota request form. It's quite reasonable to ask for quota to cover a few quarters of growth, IMO, and if you're migrating from some other API (e.g. IMAP, the Atom feed, DOM hacking) then it should also be reasonable to provision for all of that beforehand.

Gmail API all messages

I need to get all messages in the Inbox with the Gmail API, but I can see only one way to do it.
Get the list of messages (id, threadId):
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages?labelIds=INBOX&key={YOUR_API_KEY}
Then, with those IDs, fetch every message in a loop:
While there are message IDs left
GET https://www.googleapis.com/gmail/v1/users/somebody%40gmail.com/messages/147199d21bbaf5a5?key={YOUR_API_KEY}
End While
But this way needs a huge number of requests.
Does anybody have an idea how to get all messages (or just the payload field) in one request?
Use batch requests, fetching 100 messages at a time. You will still need to make 1,000 requests, but the good news is that this is quite fine, and it's easier for everyone (no downloading a 1 GB response in a single request!).
Documented at:
https://developers.google.com/gmail/api/guides/batch
A few other people have asked about batching the Gmail API here on Stack Overflow, so a quick search will turn up answers and examples.
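A sketch of that batching pattern with the google-api-python-client library (assuming creds holds valid OAuth2 credentials and message_ids is a list of IDs from users.messages.list, collected as shown further below):

from googleapiclient.discovery import build

service = build("gmail", "v1", credentials=creds)
messages = []

def collect(request_id, response, exception):
    # Called once per sub-request in the batch.
    if exception is None:
        messages.append(response)

# The Gmail API accepts at most 100 calls per batch request.
for start in range(0, len(message_ids), 100):
    batch = service.new_batch_http_request(callback=collect)
    for msg_id in message_ids[start:start + 100]:
        batch.add(service.users().messages().get(userId="me", id=msg_id))
    batch.execute()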
The approach you are using is correct, as there is no 'GetAll' API to download them all.
The reasons include:
Unbounded Result Sets
Pulling an unlimited number of emails (an unbounded result set) is a resource hog on Google's servers. Did you want the attachments AND images too? That could be gigabytes of data.
Network Problems
Google would have to read gigabytes from disk, hold them in memory, and send them over the internet. Google's servers could handle it, but the bandwidth of the internet connection would not. Worst of all, if you issued this request again and again, you could effectively mount a DDoS attack on Google.
Security Risk
If someone obtained another user's API key, they could download that user's entire mailbox. Paging lets Google provide a more secure service and reduce resource contention.
So paging is there to protect you, other users, and Google itself.
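For completeness, collecting all of the message IDs in the first place still means paging through users.messages.list; a sketch using the same client library (the service object is assumed to be built as in the batching example above):

# Page through users.messages.list to collect every message ID
# in the Inbox; each response holds one page plus a nextPageToken.
message_ids = []
page_token = None
while True:
    params = {"userId": "me", "labelIds": ["INBOX"]}
    if page_token:
        params["pageToken"] = page_token
    resp = service.users().messages().list(**params).execute()
    message_ids.extend(m["id"] for m in resp.get("messages", []))
    page_token = resp.get("nextPageToken")
    if not page_token:
        break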

Google app engine - Sudden Increase of Datastore Read Operations

I maintain a blog app (blog.wokanxing.info, it's in Chinese) for myself, built on Google App Engine. It's been two or three years since the first deployment, and I had never hit a quota issue thanks to its simplicity and small visit count.
However, since early last month, I've noticed that from time to time the app reports a 500 server error, and the admin panel shows a mysteriously fast consumption of the free datastore read operation quota. Within a single hour, about 10% of the free read quota (~5k ops) is consumed, but I count only a dozen requests that involve datastore reads, 30 tops, which means an average of 150 to 200 read ops per request; that sounds impossible to me.
I have not committed any change to my codebase for months, and I see no change in the datastore or quota policy either. It also confuses me how such consumption could happen at all. I use memcache a lot, which leaves the front page as the biggest consumer; it fetches the first threads using Post.all.order('-date').fetch(10, offset). Other requests merely fetch a single model using Post.get_by_key_name and iterate over post.comment_set.
Sorry for my poor English, but can anyone give me some clues? Thanks.
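For reference, the cached read pattern described above might look like this (a sketch using the GAE Python db and memcache APIs; Post is the question's model, and the cache key and timeout are illustrative):

from google.appengine.api import memcache

def front_page_posts(offset=0):
    # Serve the front-page query from memcache so repeated hits
    # cost no datastore read operations.
    key = "front_page:%d" % offset
    posts = memcache.get(key)
    if posts is None:
        posts = Post.all().order("-date").fetch(10, offset)
        memcache.set(key, posts, time=600)  # cache for 10 minutes
    return posts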
From the Admin Console, check your log.
Do not check only for errors; look at all types of messages in the log.
Look for requests made by robots/web crawlers. In most cases you can detect such "users" by the words "robot" or "bot" in the user-agent (well, if they are honest...).
The first thing you can do is edit your robots file. For more detail, read How to identify web-crawler?. GAE also has documentation on serving a robots file.
If that fails, try to identify the IP addresses used by the bot or bots. Using the GAE Admin Console, put those addresses on a blacklist and check your quota consumption again.
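To make crawler traffic stand out in the log, you can record the user-agent of incoming requests; a sketch for a webapp2 handler on the Python runtime (the handler and its response body are illustrative):

import logging
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        ua = self.request.headers.get("User-Agent", "")
        # Flag likely crawlers so they are easy to spot in the log.
        if "bot" in ua.lower():
            logging.info("Crawler request from %s: %s",
                         self.request.remote_addr, ua)
        self.response.write("...")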

Project suited for Google App Engine?

I'm currently developing a small hobby project (open sourced at https://github.com/grav/mailbum) which quite simply takes images from a Gmail account and puts them in albums on Picasa Web.
Since it's (currently) only dealing with Google-hosted data, I was thinking about hosting it on Google App Engine, but I'm not sure if it's well-suited for GAE:
Will the maximum execution time be a problem? It's currently 10 minutes according to http://googleappengine.blogspot.com/2010/12/happy-holidays-from-app-engine-team-140.html, but I'd think the tasks (i.e. processing a single mail) would be easy to run in parallel. I'm also guessing that dealing with Google-hosted data would be quite efficient on GAE?
Will the fact that it's written in Clojure be an obstacle? I've done a bit of research into getting Clojure to run on GAE, but I've never tried it. Any pointers?
Thanks for any advice and thoughts on the project!
It seems like your application is doable on GAE. My points of concern would be:
Does your code ever store the images it is processing to temporary files? If so, it will need to be changed to do everything in memory, because GAE applications are sandboxed and not allowed to write to the filesystem (if you need temporary persistent storage, you might be able to work something out where you write your file data to a blob field in the GAE datastore; see the sketch below).
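A sketch of that workaround, shown with the Python runtime's db API for illustration (the Java datastore API has an equivalent Blob type; the model name is hypothetical):

from google.appengine.ext import db

class TempImage(db.Model):
    # File bytes kept in the datastore instead of on the forbidden
    # filesystem; note that datastore entities are capped at 1 MB.
    data = db.BlobProperty()

def stash(image_bytes):
    entity = TempImage(data=db.Blob(image_bytes))
    entity.put()
    return entity.key()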
How do you get the images into Picasa Web? If they provide a simple REST/HTTP API then all is well. If you need something more involved than that (like a raw TCP socket) then it won't work.
The 10-minute execution time limit only applies to background tasks. When actually servicing web requests, the limit is 30 seconds. So if you provide a web-based interface to your app, you need to structure it so that the interface only schedules jobs that run in the background (i.e. you can't fire off a job directly as part of servicing a web request); see the sketch below.
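That schedule-then-return pattern might look like this on the Python runtime (the worker URL and parameter name are illustrative; a Java/Clojure app would do the same through the Java task queue API):

import webapp2
from google.appengine.api import taskqueue

class StartJob(webapp2.RequestHandler):
    def post(self):
        # The web request only enqueues the work; a task queue handler
        # processes the mail under the 10-minute background limit.
        taskqueue.add(url="/worker/process_mail",
                      params={"message_id": self.request.get("message_id")})
        self.response.write("Job scheduled")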
If none of those sound like show-stoppers to you, then I think your app should work just fine on GAE.
I can't really say whether Clojure will work, though. I have, however, spent time in the past getting third-party libraries to work on App Engine. Generally all I had to do was remove, modify, or disable any parts of the library that accessed features forbidden by the sandbox (for instance, I had to disable the automatic caching to disk to get commons-fileupload to work on GAE). I'm not sure whether the same would apply to Clojure, or even what the scope of such a task would be.
I have been dabbling with Clojure and App Engine for a while now, and I have to recommend appengine-magic. It abstracts most of the Java stuff away and is very easy to use. As a plus, the project seems to be very active.
