Retrieve the content of a remote txt file with CasperJS, without HTML tags (screen-scraping)

When I run this CasperJS script:
var casper = require('casper').create();
var url = 'https://www.youtube.com/robots.txt';

casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].innerHTML);
});

casper.run();
Instead of getting this:
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /bulletin
Disallow: /comment
Disallow: /forgot
Disallow: /get_video
Disallow: /get_video_info
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /verify_age
Disallow: /videos
Disallow: /watch_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
I get this result:
<head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;"># robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /bulletin
Disallow: /comment
Disallow: /forgot
Disallow: /get_video
Disallow: /get_video_info
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /t/privacy
Disallow: /verify_age
Disallow: /videos
Disallow: /watch_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
</pre></body>
It seems that CasperJS is adding HTML tags. How can I get the plain text file exactly as it appears in the source?

What about the download function? The script becomes:
var casper = require('casper').create();
var url = 'https://www.youtube.com/robots.txt';

casper.start(url, function() {
    this.download(url, 'robots.txt');
});

casper.run();
UPDATE
If you want to store the remote file's contents in a string, use base64encode:
var casper = require('casper').create();
var url = 'https://www.youtube.com/robots.txt';
var contents;

casper.start(url, function() {
    contents = atob(this.base64encode(url));
    console.log(contents);
});

casper.run();
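Another option worth trying: since the browser wraps a plain-text response in a <pre> element (as seen in the output above), you can read just that element's text with fetchText. A minimal sketch of that approach, assuming the <pre> wrapper is always present for plain-text responses:
var casper = require('casper').create();
var url = 'https://www.youtube.com/robots.txt';

casper.start(url, function() {
    // The browser renders a plain-text response inside a <pre> element,
    // so fetching that element's text returns the file without HTML tags.
    this.echo(this.fetchText('pre'));
});

casper.run();
Unlike the download/base64encode approaches above, this does not re-request the URL.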

Related

How to handle preflight requests in Laravel

I am using an Ionic 3 app with Laravel 5.5 as the backend. I have a GET request in my Ionic app:
const httpOptions = {
    headers: new HttpHeaders({
        'Authorization': 'Bearer ' + localStorage.getItem('token')
    })
};

this.prf = this.httpClient.get('http://localhost/blog/public/api/user', httpOptions);
It sends an OPTIONS request first and, after getting a success response, sends the GET request. Currently I handle this in my api.php file with two routes:
// first route
Route::middleware('cors', 'auth:api')->get('/user', function (Request $request) {
    return $request->user();
});

// second route
Route::options('user', function () {
    return response(200);
})->middleware('cors');
This works correctly. My request headers include:
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Access-Control-Request-Headers: authorization
Access-Control-Request-Method: GET
Connection: keep-alive
Host: localhost
Origin: http://localhost:8100
User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36
The MDN CORS article (https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) says:
However, if the request is one that triggers a preflight due to the
presence of the Authorization header in the request, you won’t be
able to work around the limitation using the steps above. And you
won’t be able to work around it at all unless you have control over
the server the request is being made to.
Is there any other way/best practice for handling this?
I found a solution. I created a middleware, app/Http/Middleware/PreflightResponse.php:
<?php

namespace App\Http\Middleware;

use Closure;

class PreflightResponse
{
    /**
     * Handle an incoming request.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  \Closure  $next
     * @return mixed
     */
    public function handle($request, Closure $next)
    {
        if ($request->getMethod() === "OPTIONS") {
            return response('');
        }

        return $next($request);
    }
}
Then register it in app/Http/Kernel.php (in the $routeMiddleware array):
'preflight' => \App\Http\Middleware\PreflightResponse::class,
But I still have to declare all of these OPTIONS routes in my api.php:
Route::middleware('cors', 'preflight')->group(function () {
    Route::options('register', function () {});
    Route::options('login', function () {});
    Route::options('address', function () {});
    Route::options('address/{id}', function () {});
    Route::options('getaddress', function () {});
    Route::options('getorders', function () {});
    Route::options('order', function () {});
    Route::options('orderdetails', function () {});
    Route::options('user', function () {});
    Route::options('profile', function () {});
    Route::options('calendar', function () {});
    Route::options('calendarapp', function () {});
});

request.body always empty for multipart form data

I have the following in the request being sent from the browser:
Remote Address:127.0.0.1:80
Request URL:http://doctor.com/api/v2/chat/message
Request Method:POST
Status Code:501 Not Implemented
Response headers:
Access-Control-Allow-Headers:Content-Type,X-Requested-With
Access-Control-Allow-Methods:POST, GET, PUT, DELETE, OPTIONS
Access-Control-Allow-Origin:*
Connection:keep-alive
Content-Length:61
Content-Type:text/html; charset=utf-8
Date:Wed, 24 Jun 2015 11:16:33 GMT
ETag:W/"3d-70662653"
X-Powered-By:Express
Request headers:
Accept:application/json, text/plain, */*
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Connection:keep-alive
Content-Length:39235
Content-Type:multipart/form-data; boundary=----WebKitFormBoundary3qsbh041bbj3MYfd
Cookie:serviceToken=558a86bb69f3197ab93fd64c
DNT:1
Host:doctor.com
Origin:http://doctor.com
Pragma:no-cache
Referer:http://doctor.com/platform/chat
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36
Request Payload
------WebKitFormBoundary3qsbh041bbj3MYfd
Content-Disposition: form-data; name="id"
55896d9bc57f69df66284176
------WebKitFormBoundary3qsbh041bbj3MYfd
Content-Disposition: form-data; name="attachment"; filename="Screen Shot 2015-06-24 at 3.18.10 am.png"
Content-Type: image/png
------WebKitFormBoundary3qsbh041bbj3MYfd--
This request is intercepted by a Node server. Here is how it looks:
var express = require('express');
var http = require('http');
var bodyParser = require('body-parser');
var cookieParser = require('cookie-parser');
var fs = require('fs');
var path = require('path');
var request = require('request');
var _ = require('underscore-node');

var app = express();

app.use(bodyParser.json());
app.use(cookieParser());
app.use(bodyParser.urlencoded({extended: false}));

app.use('/api/*', function (req, res, next) {
    console.log(req.body);
});
The problem is that req.body is always empty. It works fine when JSON is posted.
From the body-parser documentation (https://github.com/expressjs/body-parser), you need an extra middleware for multipart data (a minimal multer sketch follows the list below):
This does not handle multipart bodies, due to their complex and
typically large nature. For multipart bodies, you may be interested in
the following modules:
busboy and connect-busboy
multiparty and connect-multiparty
formidable
multer
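For example, with multer the multipart request shown above could be parsed roughly like this. This is only a minimal sketch, not the poster's actual setup: the route path and the 'attachment' field name are taken from the request dump above, and 'uploads/' is an assumed destination directory.
var express = require('express');
var multer = require('multer'); // npm install multer

var app = express();

// Store uploaded files on disk; 'uploads/' is an assumed local directory.
var upload = multer({ dest: 'uploads/' });

// 'attachment' must match the form-data field name sent by the browser.
app.post('/api/v2/chat/message', upload.single('attachment'), function (req, res) {
    console.log(req.body); // text fields, e.g. req.body.id
    console.log(req.file); // metadata for the uploaded file (path, size, mimetype)
    res.sendStatus(200);
});

app.listen(3000);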

AngularJS http request doesn't send session cookie for CORS

My script is:
<script>
    function eventController($scope, $http) {
        $http.get("http://example.com/api/Event", {withCredentials: true})
            .success(function (response) {
                $scope.events = response;
            });
    }
</script>
Request header from Fiddler:
GET http://example.com/api/Event HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: application/json, text/plain, */*
Origin: http://www.example.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Referer: http://www.example.com/Home/Event
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
I need an authentication session cookie to be sent with this request, but the cookie is only sent when the page URL has no www prefix. I found a similar question on this site, and the answer is:
.config(function ($routeProvider, $httpProvider) {
    $httpProvider.defaults.withCredentials = true;
    // rest of route code
But where do I put this code in my script? The web application I'm developing is an ASP.NET MVC 5 app with Web API 2.
That configuration needs to be injected into the angular module, like this:
// get an existing module by name
var app = angular.module('app');

// inject $httpProvider configuration
app.config(function ($httpProvider) {
    $httpProvider.defaults.withCredentials = true;
});
It turns out that my AngularJS script works fine (just pass {withCredentials: true} as a parameter to $http.get). What went wrong was that my MVC application didn't store the session cookie for the subdomain, so I fixed it by setting CookieDomain in the CookieAuthenticationOptions.

Unable to see uploaded images on Azure Blob

I am new to Azure Blob Storage and I am trying to upload an image to it. I am using Angular on the client and I upload the image with the following headers:
'Content-type'
'x-ms-blob-type'
'Content-Length'
My blob is saved and I can see it in the Azure Portal, but I am not able to view the images, and I cannot work out why.
A link to my client is http://educms.azurewebsites.net/#/pages/results. There is no upload button; as soon as you select an image file, it is uploaded. You can see the uploaded file at https://hobcity.blob.core.windows.net/images2/filename.extension.
AngularJS upload code:
$scope.uploadFile = function(files) {
    var fd = new FormData();
    // Take the first selected file
    fd.append("file", files[0]);
    var size = files[0].size;
    var name = files[0].name;
    var type = files[0].type;
    var postData = {"name": name};
    postData.containerName = 'images2';
    DataService.save('/tables/results/', postData).then(function(data) {
        var header = {
            'Access-Control-Allow-Origin': '*',
            'Content-type': type,
            'x-ms-blob-type': 'BlockBlob',
            'Content-Length': size
        };
        var url = data.imageUri;
        var queryString = data.sasQueryString;
        var uploadUrl = url + '/' + name + '/?' + queryString;
        $http.put(uploadUrl, fd, {
            headers: header,
            transformRequest: angular.identity
        }).success(function(data) {
            console.log(data);
        }).error(function(err) {
            console.error(err);
        });
    });
};
HTML: the code can be seen live at http://educms.azurewebsites.net/scripts/controllers/results.js
Does anyone know what's wrong?
So I uploaded a simple text file and traced the request through Fiddler. Here's what I saw:
PUT http://hobcity.blob.core.windows.net/images2/simpletextfile.txt/?se=2014-08-11T17%3A13%3A52Z&sr=c&sp=w&sig=SlY7wURwfSjM72Hw22507OHpnaCC1Ky6POk6hhR6fbU%3D HTTP/1.1
Accept: application/json, text/plain, */*
Access-Control-Allow-Origin: *
Content-Type: text/plain, multipart/form-data; boundary=---------------------------7de26921205a0
x-ms-blob-type: BlockBlob
Referer: http://educms.azurewebsites.net/#/pages/results
Accept-Language: en-US
Origin: http://educms.azurewebsites.net
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
Host: hobcity.blob.core.windows.net
Content-Length: 254
DNT: 1
Connection: Keep-Alive
Pragma: no-cache
-----------------------------7de26921205a0
Content-Disposition: form-data; name="file"; filename="simpletextfile.txt"
Content-Type: text/plain
https://hobcity.blob.core.windows.net/images2/Add-Item.png
-----------------------------7de26921205a0--
I believe you're running into this issue because you're uploading the file as is (note that your Content-Type is multipart/form-data), and this corrupts the data. What you need to do is read the file contents into a byte array and then upload that byte array. If you search for the HTML5 File API, you will find examples of how to read a file using JavaScript. I also wrote a blog post about uploading files to Azure Blob Storage using JavaScript, which you may find useful: http://gauravmantri.com/2013/12/01/windows-azure-storage-and-cors-lets-have-some-fun/ (the post uses jQuery instead of Angular, but it should give you some idea).
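As a rough illustration of that suggestion (not the exact code from the blog post), the upload could read the file into an ArrayBuffer with the File API and PUT the raw bytes instead of a FormData object. The uploadUrl, type and files variables are assumed to be the same ones built in the question's controller above:
// Read the raw bytes of the selected file instead of wrapping it in FormData.
var reader = new FileReader();

reader.onload = function (e) {
    // e.target.result is an ArrayBuffer containing the file's bytes.
    $http.put(uploadUrl, e.target.result, {
        headers: {
            'Content-Type': type,           // the file's own MIME type, e.g. image/png
            'x-ms-blob-type': 'BlockBlob'
        },
        transformRequest: angular.identity  // stop Angular from serializing the body
    }).success(function () {
        console.log('upload complete');
    }).error(function (err) {
        console.error(err);
    });
};

reader.readAsArrayBuffer(files[0]);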

How and where to implement basic authentication in Kibana 3

I have put my Elasticsearch server behind an Apache reverse proxy that provides basic authentication.
Authenticating to Apache directly from the browser works fine. However, when I use Kibana 3 to access the server, I receive authentication errors, obviously because no auth headers are sent along with Kibana's Ajax calls.
To implement authentication quick and dirty, I added the line below to elastic-angular-client.js in the Kibana vendor directory, but for some reason it does not work.
$http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
What is the best approach and place to implement basic authentication in Kibana?
/*! elastic.js - v1.1.1 - 2013-05-24
 * https://github.com/fullscale/elastic.js
 * Copyright (c) 2013 FullScale Labs, LLC; Licensed MIT */

/*jshint browser:true */
/*global angular:true */
'use strict';

/*
Angular.js service wrapping the elastic.js API. This module can simply
be injected into your angular controllers.
*/
angular.module('elasticjs.service', [])
    .factory('ejsResource', ['$http', function ($http) {

        return function (config) {

            var
                // use existing ejs object if it exists
                ejs = window.ejs || {},

                /* results are returned as a promise */
                promiseThen = function (httpPromise, successcb, errorcb) {
                    return httpPromise.then(function (response) {
                        (successcb || angular.noop)(response.data);
                        return response.data;
                    }, function (response) {
                        (errorcb || angular.noop)(response.data);
                        return response.data;
                    });
                };

            // check if we have a config object
            // if not, we have the server url so
            // we convert it to a config object
            if (config !== Object(config)) {
                config = {server: config};
            }

            // set url to empty string if it was not specified
            if (config.server == null) {
                config.server = '';
            }

            /* implement the elastic.js client interface for angular */
            ejs.client = {
                server: function (s) {
                    if (s == null) {
                        return config.server;
                    }
                    config.server = s;
                    return this;
                },
                post: function (path, data, successcb, errorcb) {
                    $http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
                    console.log($http.defaults.headers);
                    path = config.server + path;
                    var reqConfig = {url: path, data: data, method: 'POST'};
                    return promiseThen($http(angular.extend(reqConfig, config)), successcb, errorcb);
                },
                get: function (path, data, successcb, errorcb) {
                    $http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
                    path = config.server + path;
                    // no body on get request, data will be request params
                    var reqConfig = {url: path, params: data, method: 'GET'};
                    return promiseThen($http(angular.extend(reqConfig, config)), successcb, errorcb);
                },
                put: function (path, data, successcb, errorcb) {
                    $http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
                    path = config.server + path;
                    var reqConfig = {url: path, data: data, method: 'PUT'};
                    return promiseThen($http(angular.extend(reqConfig, config)), successcb, errorcb);
                },
                del: function (path, data, successcb, errorcb) {
                    $http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
                    path = config.server + path;
                    var reqConfig = {url: path, data: data, method: 'DELETE'};
                    return promiseThen($http(angular.extend(reqConfig, config)), successcb, errorcb);
                },
                head: function (path, data, successcb, errorcb) {
                    $http.defaults.headers.common.Authorization = 'Basic ' + Base64Encode('user:Password');
                    path = config.server + path;
                    // no body on HEAD request, data will be request params
                    var reqConfig = {url: path, params: data, method: 'HEAD'};
                    return $http(angular.extend(reqConfig, config))
                        .then(function (response) {
                            (successcb || angular.noop)(response.headers());
                            return response.headers();
                        }, function (response) {
                            (errorcb || angular.noop)(undefined);
                            return undefined;
                        });
                }
            };

            return ejs;
        };
    }]);
UPDATE 1: I implemented Matt's suggestion. However, the server returns a weird response; it seems that the authorization header is not working. Could it have to do with the fact that I am running Kibana on port 81 and Elasticsearch on 8181?
OPTIONS /solar_vendor/_search HTTP/1.1
Host: 46.252.46.173:8181
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Origin: http://46.252.46.173:81
Access-Control-Request-Method: POST
Access-Control-Request-Headers: authorization,content-type
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
This is the response:
HTTP/1.1 401 Authorization Required
Date: Fri, 08 Nov 2013 23:47:02 GMT
WWW-Authenticate: Basic realm="Username/Password"
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 346
Connection: close
Content-Type: text/html; charset=iso-8859-1
UPDATE 2: I updated all instances with the modified headers in these Kibana files:
root#localhost:/var/www/kibana# grep -r 'ejsResource(' .
./src/app/controllers/dash.js: $scope.ejs = ejsResource({server: config.elasticsearch, headers: {'Access-Control-Request-Headers': 'Accept, Origin, Authorization', 'Authorization': 'Basic XXXXXXXXXXXXXXXXXXXXXXXXXXXXX=='}});
./src/app/services/querySrv.js: var ejs = ejsResource({server: config.elasticsearch, headers: {'Access-Control-Request-Headers': 'Accept, Origin, Authorization', 'Authorization': 'Basic XXXXXXXXXXXXXXXXXXXXXXXXXXXXX=='}});
./src/app/services/filterSrv.js: var ejs = ejsResource({server: config.elasticsearch, headers: {'Access-Control-Request-Headers': 'Accept, Origin, Authorization', 'Authorization': 'Basic XXXXXXXXXXXXXXXXXXXXXXXXXXXXX=='}});
./src/app/services/dashboard.js: var ejs = ejsResource({server: config.elasticsearch, headers: {'Access-Control-Request-Headers': 'Accept, Origin, Authorization', 'Authorization': 'Basic XXXXXXXXXXXXXXXXXXXXXXXXXXXXX=='}});
And I modified my vhost config for the reverse proxy like this:
<VirtualHost *:8181>
    ProxyRequests Off
    ProxyPass / http://127.0.0.1:9200/
    ProxyPassReverse / https://127.0.0.1:9200/
    <Location />
        Order deny,allow
        Allow from all
        AuthType Basic
        AuthName "Username/Password"
        AuthUserFile /var/www/cake2.2.4/.htpasswd
        Require valid-user
        Header always set Access-Control-Allow-Methods "GET, POST, DELETE, OPTIONS, PUT"
        Header always set Access-Control-Allow-Headers "Content-Type, X-Requested-With, X-HTTP-Method-Override, Origin, Accept, Authorization"
        Header always set Access-Control-Allow-Credentials "true"
        Header always set Cache-Control "max-age=0"
        Header always set Access-Control-Allow-Origin *
    </Location>
    ErrorLog ${APACHE_LOG_DIR}/error.log
</VirtualHost>
Apache sends back the new response headers but the request header still seems to be wrong somewhere. Authentication just doesn't work.
Request Headers
OPTIONS /solar_vendor/_search HTTP/1.1
Host: 46.252.26.173:8181
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Origin: http://46.252.26.173:81
Access-Control-Request-Method: POST
Access-Control-Request-Headers: authorization,content-type
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Response Headers
HTTP/1.1 401 Authorization Required
Date: Sat, 09 Nov 2013 08:48:48 GMT
Access-Control-Allow-Methods: GET, POST, DELETE, OPTIONS, PUT
Access-Control-Allow-Headers: Content-Type, X-Requested-With, X-HTTP-Method-Override, Origin, Accept, Authorization
Access-Control-Allow-Credentials: true
Cache-Control: max-age=0
Access-Control-Allow-Origin: *
WWW-Authenticate: Basic realm="Username/Password"
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 346
Connection: close
Content-Type: text/html; charset=iso-8859-1
SOLUTION:
After doing some more research, I found out that this is definitely a CORS configuration issue. There are quite a few posts on the topic, but it appears that solving my problem this way would require some very granular Apache configuration and making sure the right things are sent from the browser.
So I reconsidered the strategy and found a much simpler solution: just modify the vhost reverse proxy config to serve the Elasticsearch server AND Kibana on the same HTTP port. This also adds even better security to Kibana.
This is what I did:
<VirtualHost *:8181>
    ProxyRequests Off
    ProxyPass /bigdatadesk/ http://127.0.0.1:81/bigdatadesk/src/
    ProxyPassReverse /bigdatadesk/ http://127.0.0.1:81/bigdatadesk/src/
    ProxyPass / http://127.0.0.1:9200/
    ProxyPassReverse / https://127.0.0.1:9200/
    <Location />
        Order deny,allow
        Allow from all
        AuthType Basic
        AuthName "Username/Password"
        AuthUserFile /var/www/.htpasswd
        Require valid-user
    </Location>
    ErrorLog ${APACHE_LOG_DIR}/error.log
</VirtualHost>
Here is a ready-made solution: https://github.com/fangli/kibana-authentication-proxy. It supports not only a basic-auth backend, but also Google OAuth and basic auth for the client. If it works for you, please give it a star, thanks.
In Kibana, replace the existing elastic-angular-client.js with the latest version, which can be found here. Then, in the Kibana code, replace all instances of:
$scope.ejs = ejsResource(config.elasticsearch);
with
$scope.ejs = ejsResource({server: config.elasticsearch, headers: {'Access-Control-Request-Headers': 'accept, origin, authorization', 'Authorization': 'Basic ' + Base64Encode('user:Password')}});
That should be all you need.
Update:
Is Apache configured for CORS? See this.
Header always set Access-Control-Allow-Methods "GET, POST, DELETE, OPTIONS, PUT"
Header always set Access-Control-Allow-Headers "Content-Type, X-Requested-With, X-HTTP-Method-Override, Origin, Accept, Authorization"
Header always set Access-Control-Allow-Credentials "true"
Header always set Cache-Control "max-age=0"
Header always set Access-Control-Allow-Origin *
You are correct in that it's a CORS issue. Kibana 3 uses CORS to communicate with ElasticSearch.
In order to enable HTTP authentication headers and cookies to be sent with Kibana's CORS requests, you need to do two things:
ONE: In your Kibana config.js file, find the setting where your ElasticSearch server is defined:
elasticsearch: "http://localhost:9200",
This needs to be changed to:
elasticsearch: {server: "http://localhost:9200", withCredentials: true},
This will tell Kibana to send the authentication headers and cookies IF the server is capable of receiving them.
TWO: Next you need to go into your ElasticSearch config file (elasticsearch.yml on the host server; mine was located at /etc/elasticsearch/elasticsearch.yml on a CentOS7 server). In this file you will find a "Network And HTTP" section. You will need to find the line that says:
#http.port: 9200
Uncomment this line and change the port to the port that you want ElasticSearch to run on. I chose 19200. Then do the same for the #transport.tcp.port: 9300 setting. Again I chose 19300.
Lastly, at the end of this section (just for organizational sake, you could also simply append the following to the end of the file) add in:
http.cors.allow-origin: http://localhost:8080
http.cors.allow-credentials: true
http.cors.enabled: true
You can change the above origin address to wherever your web server serves Kibana from. Alternatively, you could simply put /.*/ to match all origins, but this is not advisable.
Now save the elasticsearch.yml file and restart the elasticsearch server. Your reverse proxy should be configured to run on port 9200 and point to 19200 if the request authenticates.
Word of warning, if you are using Cookies to authenticate requests you should make sure to white list the HTTP OPTIONS method in your reverse proxy configuration as only GET, PUT, POST and DELETE requests include the cookies. I haven't tested whether OPTIONS include the Authentication header as well but it may be the same situation as the cookies. Kibana will not function correctly if the OPTIONS requests cannot get through as well.
It is also a good idea to configure your reverse proxy to blacklist any request that ends in _shutdown as this command shouldn't be needed via external requests in most cases.
