I am using Nokogiri to screen scrape a few websites.
My website is hosted on US servers, so when it fetches a site, that site responds as if the user were in the US. I need the site to respond as if I were an Australian user, even though the server is located in the US.
When running locally it works fine, because locally the site responds as if I were in Australia.
How can I read the site using Nokogiri as if I was from another country?
You have to run your requests through a proxy in Australia.
This doesn't have anything to do with Nokogiri - it applies no matter how you're trying to scrape a page. HTTP runs over TCP, which needs packets to flow in both directions, so you can't spoof your source IP address. If you tried to spoof the IP address of a TCP packet, the reply would be sent to the spoofed address and you would never get your response back.
You can configure Tor to always use exit nodes from a specific country. However, please do not use this method if it would put the Tor network under serious strain (e.g. fetching pages continually); in that case, consider paying for an Australia-based anonymizing service (or simply a proxy).
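Since the advice applies to any HTTP client, here is a minimal sketch in Python (not Nokogiri) of sending requests through a proxy using only the standard library. The proxy URL in the usage note is a hypothetical placeholder, and `build_proxied_opener` and `fetch` are illustrative names, not part of any library.

```python
import urllib.request


def build_proxied_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener, timeout=30):
    """Fetch url through the given opener and return the raw response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Assuming a proxy located in Australia were reachable at, say, `http://au-proxy.example.com:3128` (a made-up address), `fetch("http://example.com/", build_proxied_opener("http://au-proxy.example.com:3128"))` would make the target site see the proxy's IP address rather than the server's.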
Related
I am implementing a Web proxy (in C), with the end goal of implementing some simple caching and adblocking. Currently, the proxy supports normal HTTP sites, and also supports HTTPS sites by implementing tunneling with HTTP CONNECT. The proxy works great running from localhost and configured with my browser.
Despite all of this, I'll never be able to implement my desired features as long as the proxy can not decrypt HTTPS traffic. The essence of my question is: what general steps do I need to take to be able to decrypt this traffic and implement what I would like? I've been researching this, and there seems to be a good amount of information on existing proxies that are capable of this, such as Squid.
Currently, my server uses select() and keeps all client ids in an fd_set. When a CONNECT request is made, it makes a TCP connection to the specified host, and places the file descriptor of both the client and the host into the fd_set. It also places the tuple of fd's into a list, and the list is scanned whenever more data is ready from select() to see if data is coming from an existing tunnel. The data is then read and forwarded blindly. I am struggling to see how to intercept this data at all, due to the nature of the CONNECT verb requiring opening a simple TCP socket to the desired host, and then "staying out of it" while the client and host set up their own SSL sockets. I am simply asking for the right direction for how I can go about using the proxy as a MITM attacker in order to read and manipulate the data coming in.
As a brief aside, this project is solely for my own use, so no security or advanced functionality is needed. I just need it to work for one browser, and I am happy to get any warnings from the browser if certificate-spoofing is the best approach.
"proxy can not decrypt HTTPS traffic"
You are trying to mount a man-in-the-middle attack. SSL is designed to prevent that. But - there is a weak point - a list of trusted certificate authorities.
"I am simply asking for the right direction for how I can go about using the proxy as a MITM attacker in order to read and manipulate the data coming in."
You can get inspiration from Fiddler. Fiddler has its own CA (certification authority) certificate; once you add this CA certificate as trusted, Fiddler generates a server certificate on the fly for each site you connect to.
This comes with a serious security consideration: your browser will trust any site the proxy presents a certificate for. I've even seen the Fiddler core used inside malware, so be careful.
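The interception step can be sketched roughly as follows, in Python rather than the asker's C and using only the standard `ssl` and `socket` modules. It assumes a certificate and key for the requested host have already been generated and signed by your own trusted CA; `certfile`, `keyfile`, and both function names are hypothetical. Instead of forwarding bytes blindly after CONNECT, the proxy terminates TLS on the client side and opens its own TLS connection to the real server.

```python
import socket
import ssl


def parse_connect_target(request_line):
    """Return (host, port) from a line such as 'CONNECT example.com:443 HTTP/1.1'."""
    method, target, _version = request_line.strip().split(" ", 2)
    if method.upper() != "CONNECT":
        raise ValueError("not a CONNECT request")
    host, _, port = target.rpartition(":")
    return host, int(port)


def intercept_tunnel(client_sock, request_line, certfile, keyfile):
    """Return (tls_client, tls_upstream); plaintext can be read from either side."""
    host, port = parse_connect_target(request_line)
    # Tell the browser the tunnel is ready, as a non-intercepting proxy would.
    client_sock.sendall(b"HTTP/1.1 200 Connection Established\r\n\r\n")
    # Act as the TLS server towards the browser, using the per-host certificate
    # (assumed to be signed by a CA the browser has been told to trust).
    server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    server_ctx.load_cert_chain(certfile, keyfile)
    tls_client = server_ctx.wrap_socket(client_sock, server_side=True)
    # Act as an ordinary TLS client towards the real server.
    upstream = socket.create_connection((host, port))
    tls_upstream = ssl.create_default_context().wrap_socket(
        upstream, server_hostname=host
    )
    return tls_client, tls_upstream
```

Data read from `tls_client` is the decrypted request, which can be inspected or modified before being written to `tls_upstream`, and vice versa for responses.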
Using the Windows hosts file located in
windows/system32/drivers/etc/hosts
is it possible to respond to a request from an application as if it were offline (not connected to the Internet)? Could you please give an example of how this is done?
The hosts file only lists aliases for ip-addresses. For example:
192.168.0.1 foo bar foo.com bar.com
If that line is in the hosts file, then you can use the host-names foo, bar, foo.com and bar.com to reach the computer with ip-address 192.168.0.1.
If the computer, or the service you want to reach on that address, is not online, you can't reach it no matter what you have in your hosts file.
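To illustrate that the hosts file is only a name-to-address table, here is a small sketch that parses hosts-style text into a dictionary. `parse_hosts` is an illustrative helper, not an operating system API.

```python
def parse_hosts(text):
    """Map each host name in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first entry when a name appears more than once.
            mapping.setdefault(name.lower(), address)
    return mapping
```

For the example line above, `parse_hosts("192.168.0.1 foo bar foo.com bar.com")` maps all four names to `192.168.0.1`. Nothing in the table says whether that machine is actually reachable.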
If you want to map your local development environment to a domain name, you can add the domain name to the hosts file and map it to 127.0.0.1, which is the loopback address.
That way, any request made to that particular domain will go to your local machine.
You can also assign different LAN/WAN IP addresses.
When your development phase is done, you can remove the entry.
I would not recommend doing so, though: stick with localhost and use it to test your virtual host setup or other domain-based configuration.
If there is anything else I can answer, please don't hesitate to post further comments on my answer.
If you mean to respond to HTTP requests, then you need a web server configured to respond to any host (or that specific host name) on port 80. If you are not using it for anything else, IIS can do this [1]: configure it to return 404 (not found) or some other relatively neutral failure response.
[1] If IIS is already being used then things get much harder. Later versions of IIS are more flexible, either by using HTTP.SYS to allow other applications to respond to certain URLs or by using different Web Sites in IIS (until Windows 7, or maybe Vista, this was only available on Server editions of Windows).
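If IIS is not an option, the same idea can be sketched with Python's built-in `http.server`: a server that answers every request, whatever the Host header, with 404. The names `NotFoundHandler` and `make_server` are illustrative only.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Answer every request with a neutral 404, regardless of host or path."""

    def do_GET(self):
        self.send_error(404, "Not Found")

    # Treat the other common methods the same way.
    do_HEAD = do_GET
    do_POST = do_GET

    def log_message(self, format, *args):
        pass  # keep the console quiet


def make_server(host="127.0.0.1", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), NotFoundHandler)
```

With the blocked host names mapped to 127.0.0.1 in the hosts file, running `make_server(port=80).serve_forever()` (binding to port 80 usually needs administrator rights) would make requests to those names fail quickly with a 404.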
I want to host an SMS application on GAE, and all my traffic will come from an SMS gateway with a single IP address.
Is that fine (I'm expecting 500 dynamic requests/sec)?
Will there be any problems, like unusual-traffic errors or any other issues?
EDITED
More info:
My users send queries through SMS, which are routed to my app from the SMS gateway (a single IP address).
My app processes those queries and replies through SMS (again via the SMS gateway).
I can reply using URL Fetch (not a problem), but what I'm worried about is that if I receive some 500 dynamic requests/sec from a single IP address, GAE might block them, treating them as a DoS (denial of service) attack.
It seems that GAE either asks the user to enter a captcha at https://www.google.com/accounts/DisplayUnlockCaptcha or redirects to sorry.google.com and displays an error message if it receives unusual traffic from a single IP. But my users access the app only through SMS.
Please look at this production issue filed.
It would be technically doable - your app can detect the user's IP via the REMOTE_ADDR environment variable, and if it's the one you want, show them the actual page (showing them a 403 otherwise). Your second question is a bit trickier to answer - your App Engine app could handle it assuming you wrote it in a scalable manner (not a trivial assumption!) and if you can afford the amount of traffic you're trying to throw at it.
You're right to be concerned that getting that level of traffic from a single IP might set off some form of DoS protection - it shouldn't, but it's impossible to rule it out. If it were to happen, you could file a production issue, and we'd take care of it.
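The REMOTE_ADDR check mentioned above could look roughly like this as a plain WSGI application. The gateway address is a hypothetical placeholder from a documentation range, and the handler body is only a stub for the real SMS processing.

```python
# Hypothetical gateway address (TEST-NET-3 documentation range); replace with the real one.
ALLOWED_GATEWAY_IP = "203.0.113.10"


def app(environ, start_response):
    """Serve requests from the SMS gateway and return 403 to everyone else."""
    if environ.get("REMOTE_ADDR") != ALLOWED_GATEWAY_IP:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    # Real query processing would go here.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK\n"]
```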
I'm working with a third party webservice who requires that all calls to their service are made from whitelisted IP addresses. That is, I must give them IP addresses from which I will be making calls to their service.
Problem is I'm using Google Appengine. Is there any way to get a static IP address when making outgoing http requests from Appengine? Failing that - is there a block of IP addresses that all requests will come from? I could get the entire bloc whitelisted. If this exists, how likely is it to change?
I know I could setup a simple Amazon EC2 instance to use as a proxy (will ask another question for how to do this specifically) but just wanted to make sure there was no other way.
I had the same problem a couple of weeks ago connecting via URL Fetch from Google App Engine to the Stack Exchange API (the team promptly fixed the problem by whitelisting all the GAE IPs).
The range of IP addresses that urlfetch connections may come from can be found by performing the following DNS lookup:
dig -t TXT _netblocks.google.com @ns1.google.com
Last I checked this wasn't possible. You can get the current IP address dynamically, but it isn't predictable.
Please note: _netblocks.google.com is apparently not accurate. Currently I have noticed that GAE connects from addresses not listed when you dig _netblocks, for example from 8.35.201.166.
This address is not covered by any range listed in _netblocks, _netblocks2 or _netblocks3.
Current dig output:
ip4:216.239.32.0/19
ip4:64.233.160.0/19
ip4:66.249.80.0/20
ip4:72.14.192.0/18
ip4:209.85.128.0/17
ip4:66.102.0.0/20
ip4:74.125.0.0/16
ip4:64.18.0.0/20
ip4:207.126.144.0/20
ip4:173.194.0.0/16
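To check an address against ranges like these, a small sketch with Python's standard `ipaddress` module is enough. The input is assumed to be the `ip4:` tokens copied from the dig output above; the function names are illustrative.

```python
import ipaddress


def parse_ip4_networks(txt):
    """Extract the networks from 'ip4:a.b.c.d/n' tokens in SPF-style text."""
    return [
        ipaddress.ip_network(token[len("ip4:"):])
        for token in txt.split()
        if token.startswith("ip4:")
    ]


def is_listed(address, networks):
    """Return True if address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)
```

Fed the ten ranges listed above, `is_listed("8.35.201.166", networks)` returns False, which matches the observation that this address is not covered.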
I would like to know the IP address of a Silverlight 4 out-of-browser application. This would be the IP address that is visible on the internet, not the LAN IP. I am communicating with a WCF service that is NOT hosted by IIS but by my own Windows service.
I'd say the best way is to send a request off to a really dumb web service whose sole job is to return the IP address of the requester back.
It'd be async, though, so that may not be great depending on your scenario. And of course if the IP address changed (DHCP renewed, say) then you wouldn't know to go ask again.
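Such a "dumb" service can be very small. Below is a sketch using Python's built-in `http.server` (not WCF or Silverlight); `EchoIPHandler` and `make_echo_server` are illustrative names. It returns whatever address the connection arrived from, which behind NAT would be the client's public address as seen by the server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoIPHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the caller's IP address as plain text."""

    def do_GET(self):
        body = self.client_address[0].encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def make_echo_server(host="127.0.0.1", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), EchoIPHandler)
```

If the service sits behind a reverse proxy or load balancer, `client_address` would be the proxy's address instead, so that case would need separate handling.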
Why do you want to know the client IP address? Maybe there's a way to solve your problem without needing to know it.