I am using R to web-scrape data about reviews of 3D printer hubs from 3D Hubs (URL below). I need to grab the URL for each of the hubs in the search results. I started with the rvest package, but the data is loaded dynamically (I believe via AngularJS), so rvest could not capture it.
After reviewing Stack Overflow, I found I could load the webpage onto my computer and save it as an HTML file using PhantomJS (phantomjs.org). I did that with the following code.
# this example scrapes the user table from:
url <- "https://www.3dhubs.com/3dprint#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
system("phantomjs scrape.js > scrape.html")
# use rvest as you would normally use it
library(rvest)
page_html <- read_html("scrape.html")
The above code did not load any of the desired data into R. I then found the rdom package (https://github.com/cpsievert/rdom). rdom uses a similar technique, and it was able to load the names of each of the hubs, but not the links to the hub pages.
tbl <- rdom::rdom("https://www.3dhubs.com/3dprint#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY")
htmltxt <- paste(capture.output(tbl, file=NULL), collapse="\n")
write(htmltxt, file = "scrape.html")
page_html <- read_html("scrape.html")
I have a very basic working knowledge of GET and POST requests. Using the Firebug add-on for Firefox, I was able to find the POST request that populates the fields:
https://hub-listings.3dhubs.com/listings
Judging by the response headers, the website only allows requests from 3dhubs.com. Here are the headers for reference:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://www.3dhubs.com
access-control-expose-headers: api-version, content-length, content-md5, content-type, date, request-id, response-time
Content-Type: application/json
Date: Wed, 14 Sep 2016 15:48:30 GMT
Content-Length: 227629
Connection: keep-alive
Is there some other technique I should try? Or does the “Access-Control-Allow-Origin” header make it impossible?
An additional question: the search results are paginated. The second page is only loaded when the “2” is selected at the bottom of the page, but the URL does not change from page 1 to page 2. How would you account for this when scraping?
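One thing worth noting: Access-Control-Allow-Origin is enforced by browsers, not by the server itself, so a script talking to the endpoint directly is not blocked by CORS (though the server may still check the Origin or Referer headers). Below is a rough httr sketch of replaying that POST; the body field names and the page parameter are assumptions, not confirmed names, and would need to be checked against the actual request Firebug shows.
library(httr)

# Hypothetical replay of the listings POST; field names are assumptions.
res <- POST(
  "https://hub-listings.3dhubs.com/listings",
  add_headers(
    Origin  = "https://www.3dhubs.com",
    Referer = "https://www.3dhubs.com/3dprint"
  ),
  body = list(
    place = "New York",
    latitude = 40.7144,
    longitude = -74.006,
    distanceLimit = 250,
    distanceUnit = "miles",
    shipsToCountry = "US",
    shipsToState = "NY",
    page = 2  # assumed pagination parameter; bump it for page 2, 3, ...
  ),
  encode = "json"
)
listings <- content(res, as = "parsed")  # the response is JSON (see Content-Type above)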
Here is another approach that can be considered:
library(RSelenium)
url <- "https://www.hubs.com/3d-printing/#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
# start a standalone-firefox Selenium server in Docker (shell() is Windows-specific; use system() on other platforms)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
htmltxt <- remDr$getPageSource()[[1]]
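From there the rendered source can be handed to rvest. A minimal sketch; the CSS selector is a placeholder, since the real class names must be read off the rendered page:
library(rvest)

# Parse the rendered page source and pull the href of each hub link.
# "a.hub-link" is a hypothetical selector -- inspect the page for the real one.
hub_links <- read_html(htmltxt) %>%
  html_nodes("a.hub-link") %>%
  html_attr("href")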
You can also consider the following approach:
library(RDCOMClient)
url <- "https://www.hubs.com/3d-printing/#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5) # crude fixed wait for the page to render; see the busy-wait sketch below
doc <- IEApp$Document()
htmltxt <- doc$documentElement()$innerHtml()
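A fixed Sys.sleep(5) can be flaky on slow pages. One alternative is to poll the standard InternetExplorer.Application COM properties until the document is ready, for instance:
# Busy-wait until IE reports the page fully loaded (READYSTATE_COMPLETE == 4).
while (IEApp[["Busy"]] || IEApp[["ReadyState"]] != 4) {
  Sys.sleep(1)
}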
Related
I'm running a Squid proxy and want to see the full requested URL in the /var/log/squid/access.log log.
The only thing that is shown is:
1607565964.095 64 XXX.XXX.126.82 TCP_TUNNEL/200 13744 CONNECT www.mediamarkt.de:443 XXX HIER_DIRECT/XXX.XXX.208.234 -
I requested a full link from mediamarkt, not just the main homepage.
So is there a way to see the full requested link?
For requests against my server it works:
1607565951.279 0 XXX.XXX.122.16 NONE/400 3947 GET /boaform/admin/formLogin?username=XXXXX&psd=XXXXX - HIER_NONE/- text/html
To show the full requested URL, I need to use ssl_bump.
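For reference, a minimal squid.conf sketch for bumping TLS so that full HTTPS URLs appear in access.log; the paths are placeholders, directive names vary a little between Squid versions, and clients must trust the generated CA certificate:
# Bump TLS so Squid can see (and log) full HTTPS URLs.
http_port 3128 ssl-bump cert=/etc/squid/ssl/squid.pem generate-host-certificates=on dynamic_cert_mem_cache_size=4MB
sslcrtd_program /usr/lib/squid/security_file_certgen -s /var/lib/squid/ssl_db -M 4MB

acl step1 at_step SslBump1
ssl_bump peek step1   # peek at the TLS client hello first
ssl_bump bump all     # then decrypt the rest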
When I try to synchronize my CalDAV server implementation with Thunderbird 45.4.0 and Lightning 4.7.4 (one particular calendar collection), it doesn't show any data or events in the calendar, even though the last call of the sequence provided the data.
In the Thunderbird error log I can see one error:
Timestamp: 07.11.16, 14:21:12
Error: [calCachedCalendar] replay action failed: null,
uri=http://127.0.0.1:8003/sap/sports/webdav/appsvc/webdav/services/
server.xsjs/cal/_D043133/, result=2147500037, op=[xpconnect wrapped
calIOperation]
Source file:
file:///Users/d043133/Library/Thunderbird/Profiles/hfbvuk9f.default/
extensions/%7Be2fda1a4-762b-4020-b5ad-a41df1933103%7D/calendar-
js/calCachedCalendar.js
Line: 327
The call sequence is as follows (detailed content via gist links):
Propfind Request - Response
Options Request - Response
Propfind Request - Response
Report Request - Response - Response Raw
The synchronization with other clients like the macOS calendar and the iOS calendar works in principle and shows the data. Does anyone have a clue what is going wrong here?
Not sure whether that is the cause, but I can see two incorrect things:
a) Your <href/> property has trailing spaces:
<d:href>/sap/sports/webdav/appsvc/webdav/services/server.xsjs/cal/_D043133/EVENT%3A070768ba5dd78ff15458f1985cdaabb1.ics
</d:href>
b) Your ORGANIZER property is not a valid URI:
ORGANIZER:_D043133
I was able to find the cause of the above issue by debugging Thunderbird as proposed by Philipp. The REPORT response has HTTP status code 200, but as it is a multistatus response, Thunderbird/Lightning expects status code 207 ;-)
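For anyone hitting the same thing, the fix is on the server side: a multistatus REPORT response should carry the 207 status line, roughly like this (body trimmed):
HTTP/1.1 207 Multi-Status
Content-Type: application/xml; charset=utf-8

<?xml version="1.0" encoding="utf-8"?>
<d:multistatus xmlns:d="DAV:">
  ...
</d:multistatus>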
Thanks for the hints!
I've run a pen-test tool (Burp) against my Node (Express)/Angular application and it identified a reflected XSS vulnerability, specifically when attempting a GET request for static assets (notably, vulnerabilities were not found for any of the requests made when a user interacts with the application).
The issue detail is:
The name of an arbitrarily supplied URL parameter is copied into a
JavaScript expression which is not encapsulated in any quotation
marks. The payload 41b68(a)184a9=1 was submitted in the name of an
arbitrarily supplied URL parameter. This input was echoed unmodified
in the application's response.
This behavior demonstrates that it is possible to inject JavaScript
commands into the returned document. An attempt was made to identify a
full proof-of-concept attack for injecting arbitrary JavaScript but
this was not successful. You should manually examine the application's
behavior and attempt to identify any unusual input validation or other
obstacles that may be in place.
The vulnerability was tested by passing an arbitrary URL parameter to the request, like so:
GET /images/?41b68(a)184a9=1
The response was:
HTTP/1.1 404 Not Found
X-Content-Security-Policy: connect-src 'self'; default-src 'self'; font-src 'self'; frame-src; img-src 'self' *.google-analytics.com; media-src; object-src; script-src 'self' 'unsafe-eval' *.google-analytics.com; style-src 'self' 'unsafe-inline'
X-XSS-Protection: 1; mode=block
X-Frame-Options: DENY
Strict-Transport-Security: max-age=10886400; includeSubDomains; preload
X-Download-Options: noopen
X-Content-Type-Options: nosniff
Content-Type: text/html; charset=utf-8
Content-Length: 52
Date: Wed, 08 Oct 2015 10:46:43 GMT
Connection: close
Cannot GET /images/?41b68(a)184a9=1
You can see that I have CSP in place (implemented with Helmet) and other protections against exploits. The app is served over HTTPS, but no user auth is required. CSP restricts requests to the app's own domain, plus Google Analytics.
The pen-test report advises validating input (I am, but surely that would make requests including user-supplied data unsafe if I wasn't?) and encoding HTML, which Angular does by default.
I'm really struggling to find a way to prevent or mitigate this for those requests for static assets:
Should I whitelist all requests for my application under CSP?
Can I even do this, or will it only whitelist domains?
Can/should all responses from node/express to requests for static assets be encoded in some way?
The report states that "The name of an arbitrarily supplied URL parameter is copied into a JavaScript expression which is not encapsulated in any quotation marks". Could this expression be somewhere in the express code that handles returning static assets?
Or could that GET request param somehow be evaluated in my application code?
Update
Having done some investigation, it seems that at least part of the mitigation is to escape data in URL param values and sanitize the input in the URL.
Escaping of the URL is already in place, so:
curl 'http://mydomain/images/?<script>alert('hello')</script>'
returns
Cannot GET /images/?<script>alert(hello)</script>
I've also put express-sanitized in place on top of this.
However, if I curl the original test the request param is still reflected back.
curl 'http://mydomain/images/?41b68(a)184a9=1'
Cannot GET /images/?41b68(a)184a9=1
That is what you would expect, because HTML is not being inserted into the URL.
The responses to GET requests for static assets are all handled by app.use(express.static('static-dir')), so the query is passed into this. express.static is based on serve-static, which depends on parseurl.
The cause of the issue is that for invalid GET requests Express will return something like:
Cannot GET /pathname/?yourQueryString
In many cases that is a valid response, even for serving static assets. However, in my case, and I'm sure for others, the only valid requests for static assets will be something like:
GET /pathname/your-file.jpg
I have a custom 404 handler that returns a data object:
var data = {
status: 404,
message: 'Not Found',
description: description,
url: req.url
};
This is only handled for invalid template requests in app.js with:
app.use('/template-path/*', function(req, res, next) {
custom404.send404(req, res);
});
I've now added explicit handlers for requests to static folders:
app.use('/static-path/*', function(req, res, next) {
custom404.send404(req, res);
});
Optionally I could also strip out request query params before the 404 is returned:
var url = require('url'); // core module, needed for url.parse

var data = {
status: 404,
message: 'Not Found',
description: description,
url: url.parse(req.url).pathname // query string stripped from the reflected URL
};
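Putting the pieces together, a minimal sketch of the wiring (the paths and the inline 404 handler are illustrative, standing in for the custom404 module above):
var express = require('express');
var url = require('url');
var app = express();

// Serve real files; anything not found falls through to the 404 handler.
app.use('/static-path', express.static('static-dir'));

// Catch-all 404 that never reflects the query string back to the client.
app.use(function (req, res) {
  res.status(404).json({
    status: 404,
    message: 'Not Found',
    url: url.parse(req.url).pathname // query string stripped
  });
});

app.listen(3000);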
I am trying to access a website and then return whatever it outputs in the body, e.g. "Success" or "Failed".
When I try with my code, I get the following back:
<<< REQ >>>
HTTP/1.1 200 OK
Date: Sat, 30 Aug 2014 17:36:31 GMT
Content-Type: text/html
Connection: close
Set-Cookie: __cfduid=d8a4fc3c84849b6786c6ca890b92e2cc01409420191023; expires=Mon, 23-Dec-2019 23:50:00 GMT; path=/; domain=.japseyz.com; HttpOnly
Vary: Accept-Encoding
X-Powered-By: PHP/5.3.28
Server.
My code is: http://pastebin.com/WwWbnLNn
If all you want to know is whether the HTTP transaction succeeded or failed, then you need to examine the HTTP response code, which is in the first line of the response. In your example it is "200"; the human-readable interpretation of it is "OK".
Here is a link to most of the HTTP 1.1 response codes: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
Your question indicated you wanted to extract this information from the "body", but that information is not located in the "body"; it is in the status line, the first line of the response, as described above.
Have you tried the ethercard samples? There is a webClient sample in which you can find a procedure called CALLBACK; in that procedure you can process the data stored in the buf variable.
In your case you need to look for the first empty line, which tells you that the headers have ended and the page content (i.e., what PHP writes to the page) follows; see the sketch below.
How familiar are you with pointers? How deeply do you need to process the page output? I.e., is "OK" or "ERROR" enough, or do you need to pass some parameters back to the 'duino?
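To make the empty-line idea concrete, here is a rough fragment in the shape of the EtherCard webClient example's callback; the buffer size and names follow that example, and the status-line offset assumes a reply starting with "HTTP/1.1 ":
#include <EtherCard.h>
#include <string.h>

byte Ethernet::buffer[700];

// Called when the reply arrives; off/len locate it inside Ethernet::buffer.
static void my_callback (byte status, word off, word len) {
  Ethernet::buffer[off + len] = 0;               // terminate the reply string
  const char* reply = (const char*) Ethernet::buffer + off;
  // "HTTP/1.1 200 OK" -> the response code starts at offset 9.
  int code = atoi(reply + 9);
  // The body starts after the first empty line that ends the headers.
  const char* body = strstr(reply, "\r\n\r\n");
  if (code == 200 && body) {
    Serial.println(body + 4);                    // e.g. "Success" or "Failed"
  }
}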
Fetching data from the server returns JSON data as a string datatype rather than as application/json, and as a result the collection does not get refreshed.
I have tried giving the jquery.ajax option contentType:"application/json" to the fetch options, but it still does not work.
How can I make it work? Do I need to send a MIME type from the server? If so, how?
I am using json_encode on the data being sent.
preloader.fetch({
contentType:'application/json'
});
preloader is an instance of my collection.
Edit:
My template for a subview was not getting detected, as I had kept it outside the master view's $el element. I corrected that, and now I am getting an underscore.js error that str is null in
str.replace(/\\/g, '\\\\') // at line 913
Is this because the Backbone app is not taking it as a JSON object?
Response headers:
Connection: close
Content-Type: text/html
Date: Thu, 12 Apr 2012 13:00:58 GMT
Server: Apache
Transfer-Encoding: chunked
Vary: Accept-Encoding
The request headers have the line:
Accept: application/json, text/javascript, */*; q=0.01
Does that mean it is JSON? Then what is the problem?
I think the contentType option is for the request (your request).
Try dataType:"json".
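If the server really is sending JSON with a text/html Content-Type (as the response headers above suggest), it helps to fix both ends. On the PHP side, send the right MIME type before the json_encode output with header('Content-Type: application/json');. On the client side, a minimal sketch of the fetch call:
// dataType tells jQuery how to parse the *response*;
// contentType only describes the request body you send.
preloader.fetch({
  dataType: 'json',
  success: function (collection, response) {
    console.log(collection.length); // collection should now be populated
  }
});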