scrapy how to set referer url - screen-scraping

I need to set the referer URL before scraping a site. The site uses referer-based authentication, so it does not let me log in unless the referer is valid.
Could someone tell me how to do this in Scrapy?

If you want to change the referer in your spider's request, you can change DEFAULT_REQUEST_HEADERS in the settings.py file:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.google.com'
}

You should do exactly as @warwaruk indicated. Below is my more detailed example for a crawl spider:
from scrapy import Request
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    # The commas matter: without them Python joins the strings into one URL.
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        # Attach the Referer header to every initial request.
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)
This should produce the following log lines in your terminal:
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)
This works the same way with BaseSpider (named scrapy.Spider in current Scrapy releases), because start_requests is a method of that base class, which CrawlSpider inherits from.
The documentation describes further options that can be set on a Request besides headers, such as cookies, the callback function, and the priority of the request.

Just set the Referer URL in the Request headers:
class scrapy.http.Request(url[, method='GET', body, headers, ...
headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
Example:
return Request(url=your_url,
               headers={'Referer': 'http://your_referer_url'})

Override BaseSpider.start_requests and create your custom Request there, passing it your Referer header.
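If you want to sanity-check the header idea without running a crawl, here is a minimal sketch that does the same thing with Python's standard library instead of Scrapy. Note that urllib.request.Request is a different class from Scrapy's Request, and the helper name request_with_referer is made up for this example; the URLs are the placeholder ones from the answers above.

```python
import urllib.request


def request_with_referer(url, referer):
    """Build a GET request object that carries a Referer header.

    Hypothetical helper for illustration; this is urllib, not Scrapy.
    """
    return urllib.request.Request(url, headers={"Referer": referer})


# Build the request only; nothing is sent over the network here.
req = request_with_referer("http://example.com/foo", "http://www.example.com/")
print(req.get_header("Referer"))
```

The Scrapy versions above follow the same pattern: the Referer is just one more entry in the headers dict passed when the request is constructed.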

Related

CORS error with react native app run with expo web [duplicate]

I created an API endpoint using Google Cloud Functions and am trying to call it from a JS fetch function.
I am running into errors that I am pretty sure are related to either CORS or the output format, but I'm not really sure what is going on. A few other SO questions are similar, and helped me realize I needed to remove the mode: "no-cors". Most mention enabling CORS on the BE, so I added response.headers.set('Access-Control-Allow-Origin', '*') - which I learned of in this article - to ensure CORS would be enabled... But I still get the "Failed to fetch" error.
The Full Errors (reproducible in the live demo linked below) are:
Uncaught Error: Cannot add node 1 because a node with that id is
already in the Store. (This one is probably unrelated?)
Access to fetch at
'https://us-central1-stargazr-ncc-2893.cloudfunctions.net/nearest_csc?lat=37.75&lon=-122.5'
from origin 'https://o2gxx.csb.app' has been blocked by CORS policy:
Request header field access-control-allow-origin is not allowed by
Access-Control-Allow-Headers in preflight response.
GET
https://us-central1-stargazr-ncc-2893.cloudfunctions.net/nearest_csc?lat=37.75&lon=-122.5 net::ERR_FAILED
Uncaught (in promise) TypeError: Failed to fetch
See Code Snippets below, please note where I used <---- *** Message *** to denote parts of the code that have recently changed, giving me one of those two errors.
Front End Code:
function getCSC() {
  let lat = 37.75;
  let lng = -122.5;
  fetch(
    `https://us-central1-stargazr-ncc-2893.cloudfunctions.net/nearest_csc?lat=${lat}&lon=${lng}`,
    {
      method: "GET",
      // mode: "no-cors", <---- **Uncommenting this predictably gets rid of CORS error but returns a Opaque object which seems to have no data**
      headers: {
        // Accept: "application/json", <---- **Originally BE returned stringified json. Not sure if I should be returning it as something else or if this is still needed**
        Origin: "https://lget3.csb.app",
        "Access-Control-Allow-Origin": "*"
      }
    }
  )
    .then(response => {
      console.log(response);
      console.log(response.json());
    });
}
Back End Code:
import json
import math
import os
import flask


def nearest_csc(request):
    """
    args: request object w/ args for lat/lon
    returns: String, either with json representation of nearest site information or an error message
    """
    lat = request.args.get('lat', type = float)
    lon = request.args.get('lon', type = float)
    # Get list of all csc site locations
    with open(file_path, 'r') as f:
        data = json.load(f)
    nearby_csc = []
    # Removed from snippet for clarity:
    # populate nearby_csc (list) with sites (dictionaries) as elems
    # Determine which site is the closest, assigned to var 'closest_site'
    # Grab site url and return site data if within 100 km
    if dist_km < 100:
        closest_site['dist_km'] = dist_km
        # return json.dumps(closest_site) <--- **Original return statement. Added 4 lines below in an attempt to get CORS set up, but did not seem to work**
        response = flask.jsonify(closest_site)
        response.headers.set('Access-Control-Allow-Origin', '*')
        response.headers.set('Access-Control-Allow-Methods', 'GET, POST')
        return response
    return "No sites found within 100 km"
Fuller context for code snippets above:
Here is a Code Sandbox Demo of the above.
Here is the full BE code on GitHub, minus the most recent attempt at adding CORS.
The API endpoint.
I'm also wondering if it's possible that CodeSandbox does CORS in a weird way, but have had the same issue running it on localhost:3000, and of course in prod would have this on my own personal domain.
The Error would appear to be CORS-related ( 'https://o2gxx.csb.app' has been blocked by CORS policy: Request header field access-control-allow-origin is not allowed by Access-Control-Allow-Headers in preflight response.) but I thought adding response.headers.set('Access-Control-Allow-Origin', '*') would solve that. Do I need to change something else on the BE? On the FE?
TLDR;
I am getting the Errors "Failed to fetch" and "field access-control-allow-origin is not allowed by Access-Control-Allow-Headers" even after attempts to enable CORS on backend and add headers to FE. See the links above for live demo of code.
Drop the part of your frontend code that adds an Access-Control-Allow-Origin header.
Never add Access-Control-Allow-Origin as a request header in your frontend code.
The only effect that’ll ever have is a negative one: it’ll cause browsers to do CORS preflight OPTIONS requests even in cases when the actual (GET, POST, etc.) request from your frontend code would otherwise not trigger a preflight. And then the preflight will fail with this message:
Request header field Access-Control-Allow-Origin is not allowed by Access-Control-Allow-Headers in preflight response
…that is, it’ll fail with that unless the server the request is being made to has been configured to send an Access-Control-Allow-Headers: Access-Control-Allow-Origin response header.
But you never want Access-Control-Allow-Origin in the Access-Control-Allow-Headers response-header value. If that ends up making things work, you’re actually just fixing the wrong problem. Because the real fix is: never set Access-Control-Allow-Origin as a request header.
Intuitively, it may seem logical to look at it as “I’ve set Access-Control-Allow-Origin both in the request and in the response, so that should be better than just having it in the response” — but it’s actually worse than only setting it in the response (for the reasons described above).
So the bottom line: Access-Control-Allow-Origin is solely a response header, not a request header. You only ever want to set it in server-side response code, not frontend JavaScript code.
The code in the question was also trying to add an Origin header. You also never want to try to set that header in your frontend JavaScript code.
Unlike the case with the Access-Control-Allow-Origin header, Origin is actually a request header — but it’s a special header that’s controlled completely by browsers, and browsers won’t ever allow your frontend JavaScript code to set it. So don’t ever try to.
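To make the preflight reasoning above concrete, here is a rough sketch in Python (the language of the backend in the question) that models the check a browser performs. It is a simplified model and not the full Fetch specification algorithm: the safelist ignores the restrictions on header values, the `*` wildcard is not modelled, and the function names are made up for this example. Origin is left out of the sample headers because the browser controls it anyway.

```python
# Request headers a browser may send cross-origin without a preflight.
# Simplified: the real safelist also restricts the header values.
SAFELISTED_REQUEST_HEADERS = {
    "accept",
    "accept-language",
    "content-language",
    "content-type",
}


def headers_needing_preflight(request_headers):
    """Return the lowercased, sorted header names that force a preflight."""
    names = {name.lower() for name in request_headers}
    return sorted(names - SAFELISTED_REQUEST_HEADERS)


def preflight_allows(request_headers, access_control_allow_headers=""):
    """Check whether the preflight response covers every non-safelisted header.

    access_control_allow_headers is the raw Access-Control-Allow-Headers
    value sent by the server (empty if the server sends none).
    """
    allowed = {
        part.strip().lower()
        for part in access_control_allow_headers.split(",")
        if part.strip()
    }
    return all(name in allowed for name in headers_needing_preflight(request_headers))


# The custom header the question's frontend code adds to the request.
frontend_headers = {"Access-Control-Allow-Origin": "*"}
print(headers_needing_preflight(frontend_headers))
print(preflight_allows(frontend_headers))
# With the header dropped, nothing triggers a preflight any more.
print(headers_needing_preflight({"Accept": "application/json"}))
```

In this model the request from the question fails only because of the header the frontend added; once it is removed, the Access-Control-Allow-Origin value set on the response by the backend is all that matters.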

Service account request to IAP-protected app results in 'Invalid GCIP ID token: JWT signature is invalid'

I am trying to programmatically access an IAP-protected App Engine Standard app via Python from outside of the GCP environment.
I have tried various methods, including the method shown in the docs here: https://cloud.google.com/iap/docs/authentication-howto#iap-make-request-python. Here is my code:
from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests
def make_iap_request(url, client_id, method='GET', **kwargs):
    """Makes a request to an application protected by Identity-Aware Proxy.

    Args:
      url: The Identity-Aware Proxy-protected URL to fetch.
      client_id: The client ID used by Identity-Aware Proxy.
      method: The request method to use
              ('GET', 'OPTIONS', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE')
      **kwargs: Any of the parameters defined for the request function:
                https://github.com/requests/requests/blob/master/requests/api.py
                If no timeout is provided, it is set to 90 by default.

    Returns:
      The page body, or raises an exception if the page couldn't be retrieved.
    """
    # Set the default timeout, if missing
    if 'timeout' not in kwargs:
        kwargs['timeout'] = 90
    # Obtain an OpenID Connect (OIDC) token from metadata server or using service
    # account.
    open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
    print(f'{open_id_connect_token=}')
    # Fetch the Identity-Aware Proxy-protected URL, including an
    # Authorization header containing "Bearer " followed by a
    # Google-issued OpenID Connect token for the service account.
    resp = requests.request(
        method, url,
        headers={'Authorization': 'Bearer {}'.format(
            open_id_connect_token)}, **kwargs)
    print(f'{resp=}')
    if resp.status_code == 403:
        raise Exception('Service account does not have permission to '
                        'access the IAP-protected application.')
    elif resp.status_code != 200:
        raise Exception(
            'Bad response from application: {!r} / {!r} / {!r}'.format(
                resp.status_code, resp.headers, resp.text))
    else:
        return resp.text


if __name__ == '__main__':
    res = make_iap_request(
        'https://MYAPP.ue.r.appspot.com/',
        'Client ID from IAP>App Engine app>Edit OAuth Client>Client ID'
    )
    print(res)
When I run it locally, I have the GOOGLE_APPLICATION_CREDENTIALS environment variable set to a local JSON credential file containing the keys for the service account I want to use. I have also tried running this in Cloud Functions so it would presumably use the metadata service to pick up the App Engine default service account (I think?).
In both cases, I am able to generate a token that appears valid. Using jwt.io, I see that it contains the expected data and the signature is valid. However, when I make a request to the app using the token, I always get this exception:
Bad response from application: 401 / {'X-Goog-IAP-Generated-Response': 'true', 'Date': 'Tue, 09 Feb 2021 19:25:43 GMT', 'Content-Type': 'text/html', 'Server': 'Google Frontend', 'Content-Length': '47', 'Alt-Svc': 'h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"'} / 'Invalid GCIP ID token: JWT signature is invalid'
What could I be doing wrong?
The solution to this problem is to exchange the Google Identity Token for an Identity Platform Identity Token.
The error Invalid GCIP ID token: JWT signature is invalid is caused by using a Google Identity Token, which is signed by a Google RSA private key and not by a Google Identity Platform RSA private key. I overlooked GCIP in the error message, which would have told me the solution once we validated that the token was not corrupted in use.
In the question, this line of code fetches the Google Identity Token:
open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
The above line of code requires that Google Cloud Application Default Credentials are set up. Example: set GOOGLE_APPLICATION_CREDENTIALS=c:\config\service-account.json
The next step is to exchange this token for an Identity Platform token:
import requests


def exchange_google_id_token_for_gcip_id_token(google_open_id_connect_token):
    SIGN_IN_WITH_IDP_API = 'https://identitytoolkit.googleapis.com/v1/accounts:signInWithIdp'
    API_KEY = ''  # Identity Platform API key
    url = SIGN_IN_WITH_IDP_API + '?key=' + API_KEY
    data = {
        'requestUri': 'http://localhost',
        'returnSecureToken': True,
        'postBody': 'id_token=' + google_open_id_connect_token + '&providerId=google.com'}
    try:
        resp = requests.post(url, data)
        res = resp.json()
        if 'error' in res:
            print("Error: {}".format(res['error']['message']))
            exit(1)
        # print(res)
        # The Identity Platform token is returned in the idToken field.
        return res['idToken']
    except Exception as ex:
        print("Exception: {}".format(ex))
        exit(1)
The API key can be found in the Google Cloud Console under Identity Platform: click "Application Setup Details" at the top right, which shows the apiKey and authDomain.
More information can be found at this link:
Exchanging a Google token for an Identity Platform token
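A quick way to tell the two kinds of token apart before sending one to IAP is to look at the iss claim in the payload. Here is a minimal sketch using only the standard library. It decodes the payload without verifying the signature, so it is for debugging only. The issuer values it checks are my assumption about what Google and Identity Platform put in their tokens, and the helper names jwt_claims and token_kind are made up for this example.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # base64url payloads omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_kind(token):
    """Guess which service issued the token from its iss claim.

    Assumption: Identity Platform tokens use an issuer under
    https://securetoken.google.com/ and Google ID tokens use
    accounts.google.com. Check this against a real token.
    """
    iss = jwt_claims(token).get("iss", "")
    if iss.startswith("https://securetoken.google.com/"):
        return "identity-platform"
    if iss in ("https://accounts.google.com", "accounts.google.com"):
        return "google"
    return "unknown"
```

For example, calling token_kind(open_id_connect_token) on the token from the question's code and again on the value returned by the exchange function should show whether the exchange actually happened.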

How to access Google Cloud Endpoints request header in Python and Java

In the endpoints method, how to access request header information?
Python:
In the endpoint method, self.request_state.headers provides this information.
E.g., self.request_state.headers.get('authorization')
Java:
Add an HttpServletRequest (req) parameter to your endpoint method. The headers are accessible through the method getHeader()
e.g., req.getHeader("Authorization")
See this question
What is working for me in python is the following:
The request I use: http://localhost:8080/api/hello/v1/header?message=Hello World!
python code:
@endpoints.api(name='hello', version='v1', base_path='/api/')
class EchoApi(remote.Service):

    @endpoints.method(
        # This method takes a ResourceContainer defined above.
        message_types.VoidMessage,
        # This method returns an Echo message.
        EchoResponse,
        path='header',
        http_method='GET',
        name='header')
    def header(self, request):
        header = request._Message__unrecognized_fields
        output_message = header.get(u'message', None)[0]
        return EchoResponse(message=output_message)

Laravel 5.2 and angularJS token mismatch

I am developing a webapp with static files on one server and api on another. The front end is developed using angular and backend using laravel.
For CSRF-TOKEN fetching during the first load, within angular run block I have this code
if(!$cookies.get('XSRF-TOKEN')){
    $http.get(API+'/csrf_token').success(function(d){
        $cookies.put('XSRF-TOKEN',d.XSRF_TOKEN);
        //$cookies.put('laravel-session',d.LARAVEL_ID);
        //$http.defaults.headers.common.X-CSRF-TOKEN = 'Basic YmVlcDpib29w';
        //$http.defaults.headers.post['X-CSRF-TOKEN']=$cookies.get('XSRF-TOKEN');
        $http.defaults.headers.post['X-CSRF-TOKEN']=d.XSRF_TOKEN;
    });
}
The other way I have tried to get the same was using this way.
I also set $httpProvider.defaults.withCredentials = true; so that cookies are sent along with requests.
The route /csrf_token is set up as:
Route::get("/csrf_token", function(){
    //return \Response::json("asd",200)->withCookie(cookie("XSRF-TOKEN",csrf_token()));
    return csrf_token(); //\Crypt::encrypt(csrf_token())
});
All the AJAX POST requests throw TokenMismatchException in VerifyCsrfToken.php line 67.
Next I sent the csrf_token value as a _token parameter attached to the POST parameters, but still got the same problem.
I tried all of the above again while returning an encrypted token from /csrf_token, but still the same problem.
I repeated all the steps after clearing config:cache and running composer dumpautoload on the API server, but still the same problem.
I reviewed the config file; some of the values:
'driver' => env('SESSION_DRIVER', 'file'),
'encrypt' => false,
'files' => storage_path('framework/sessions'),
'secure' => false,
(These values seem to be okay)
Next reviewed Virtual config file for CORS configuration (inside directory tag)
Header set Access-Control-Allow-Origin "www.mydomain.com" #real domain not posted
Header set Access-Control-Allow-Credentials 'true'
Header always set Access-Control-Max-Age "2000"
Header set Access-Control-Allow-Headers 'X-CSRF-TOKEN'
Header always set Access-Control-Allow-Headers "X-Requested-With, Content-Type, Origin, Authorization, Accept, Client-Security-Token, Accept-Encoding"
Header always set Access-Control-Allow-Methods "POST, GET, OPTIONS, DELETE, PUT"
Options Indexes FollowSymLinks Includes ExecCGI
AllowOverride All
Require local
I have wasted hours googling and am frustrated. I need help.
NB: I couldn't find any other tutorials or answers about the token mismatch problem, on Stack Overflow or any other website, that I haven't already tried. Thanks.

NetworkError: 405 Method Not Allowed AngularJS REST

In AngularJS, I had the following function, which worked fine:
$http.get( "fruits.json" ).success( $scope.handleLoaded );
Now I would like to change this from a file to a url (that returns json using some sweet Laravel 4):
$http.get( "http://localhost/fruitapp/fruits").success( $scope.handleLoaded );
The error I get is:
"NetworkError: 405 Method Not Allowed - http://localhost/fruitapp/fruits"
What's the problem? Is it because fruit.json was "local" and localhost is not?
From w3:
10.4.6 405 Method Not Allowed
The method specified in the Request-Line is not allowed for the resource
identified by the Request-URI. The response MUST include an Allow header
containing a list of valid methods for the requested resource.
This means that, for the URL http://localhost/fruitapp/fruits, the server is responding that the GET method isn't allowed. Does the route expect a POST or PUT?
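If you want to see what the quoted text means in practice, here is a small self-contained sketch using only Python's standard library. It starts a throwaway local server that refuses GET with a 405 plus an Allow header, then reads that header back from the client side. The handler class, the helper names, and the /fruits path are stand-ins I made up; this is not the Laravel app from the question.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PostOnlyHandler(BaseHTTPRequestHandler):
    """Hypothetical resource that accepts POST but not GET."""

    def do_GET(self):
        # Refuse GET and advertise the methods that are allowed.
        self.send_response(405)
        self.send_header("Allow", "POST")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_POST(self):
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def get_status_and_allow(host, port, path):
    """Send a GET and return (status code, value of the Allow header)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Allow")
    finally:
        conn.close()


def demo():
    # Port 0 lets the OS pick any free port for the throwaway server.
    server = HTTPServer(("127.0.0.1", 0), PostOnlyHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return get_status_and_allow("127.0.0.1", server.server_port, "/fruits")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Against a real server, the Allow value in the 405 response (visible in the browser's network tab as well) tells you which method the route is actually registered for.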
The AngularJS version you are using is probably <= 1.2.9.
If so, try this:
return $http({
    url: 'http://localhost/fruitapp/fruits',
    method: "GET",
    headers: {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
});
I had a similar issue with my Spring Boot project. I was getting the same error in the browser console, but I saw a different error message when I looked at the back-end log: "org.springframework.web.HttpRequestMethodNotSupportedException, message=Request method 'DELETE' not supported". It turned out that I was missing the {id} parameter in the back-end controller:
Wrong code:
@RequestMapping(value="books",method=RequestMethod.DELETE)
public Book delete(@PathVariable long id){
    Book deletedBook = bookRepository.findOne(id);
    bookRepository.delete(id);
    return deletedBook;
}
Correct code:
@RequestMapping(value="books/{id}",method=RequestMethod.DELETE)
public Book delete(@PathVariable long id){
    Book deletedBook = bookRepository.findOne(id);
    bookRepository.delete(id);
    return deletedBook;
}
For me, it was the server not being configured for CORS.
Here is how I did it on Azure: CORS enabling on Azure
I hope something similar works with your server, too.
I also found a suggestion for configuring CORS in web.config, though I can't guarantee it works: configure CORS in the web.config. In general, the browser sends a preflight request to your server, and if you make a cross-origin request (that is, from a different origin than your server's), the server needs to allow that origin, for example with Access-Control-Allow-Origin: *.
