Show full URL in the squid access.log - Ubuntu 18.04

I'm running a Squid proxy and want to see the full requested URL in /var/log/squid/access.log.
The only thing that is shown is:
1607565964.095 64 XXX.XXX.126.82 TCP_TUNNEL/200 13744 CONNECT www.mediamarkt.de:443 XXX HIER_DIRECT/XXX.XXX.208.234 -
I requested a full link from mediamarkt, not just the main homepage.
So is there a way to see the full requested link?
For requests against my server it works:
1607565951.279 0 XXX.XXX.122.16 NONE/400 3947 GET /boaform/admin/formLogin?username=XXXXX&psd=XXXXX - HIER_NONE/- text/html

To show the full requested URL for HTTPS traffic, you need to use ssl_bump.
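A minimal squid.conf sketch of such an ssl_bump setup (a sketch only: the certificate path, helper path, and cache sizes are assumptions; on Squid 3.5 the helper is called ssl_crtd rather than security_file_certgen, and the stock Ubuntu squid package may need to be rebuilt with OpenSSL support for ssl-bump to be available):

# Hypothetical ssl_bump configuration - adjust paths to your system
http_port 3128 ssl-bump cert=/etc/squid/ssl/squid-ca.pem generate-host-certificates=on dynamic_cert_mem_cache_size=4MB
sslcrtd_program /usr/lib/squid/security_file_certgen -s /var/lib/squid/ssl_db -M 4MB
acl step1 at_step SslBump1
ssl_bump peek step1
ssl_bump bump all

Once the tunnel is bumped, access.log records the decrypted GET/POST lines with their full URLs instead of just the CONNECT entry, but every client has to trust the CA certificate referenced above or it will get certificate warnings.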

Related

Remove L parameter in request URL

I'm using the Solr extension with TYPO3 9.5.3 and I can't index the pages; I get this error: https://imgur.com/1e6LfIy
Failed to execute Page Indexer Request. Request ID: 5d78d130b8b4d
When I look at the Solr log, I see that TYPO3 adds &L=0 to the request URL, and the pages with &L=0 return a '404 page not found' error:
request url => 'http://example.com/index.php?id=5&L=0' (43 chars)
I added the following code to my TypoScript setup, but that did not work and the request URL still ends with &L=0:
plugin.tx_solr.index.queue.pages.fields.url.typolink.additionalParams >
I'm not sure that's the only reason Solr doesn't index the pages (news records are indexed without any problem), but first: how can I remove &L=0 from the request URL in Solr?
Can you check your TypoScript for a configuration like
config.defaultGetVars.L = 0
or whether other old language settings exist. I'm not sure, but do you have an older language configuration somewhere that still defines the language parameter?
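For illustration only (these lines are an assumption about what such a legacy setup might look like, not taken from your site): a configuration like the first block keeps appending L to every generated link, and clearing those options, as in the second block, should make the &L=0 disappear.

# legacy language handling that still appends the L parameter
config.linkVars = L(int)
config.defaultGetVars.L = 0

# remove the legacy settings
config.linkVars >
config.defaultGetVars.L >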

Scrapy: continue processing the result from a parse function

I am trying to parse page A, download the files listed on the page to local disk, replace the URLs in page A with the paths of the saved files, and finally save page A to local disk.
I tried the files pipeline but it just does not work. The URLs in page A look like http:...php?id=1234, so the built-in file_path() returns an error. Overriding file_path() just stops the pipeline from working, without any debug output.
So I found this post (the answer I referred to).
After I applied it, I found that the parsing function doesn't change the data I passed in meta. My code looks like this:
def ParseClientCaseNote(self, response):
    # The function is to download all attachments and replace the URLs inside to point to local files
    TestMeta = 'this is to test meta argu'
    for a in AttachmentList:
        yield scrapy.Request(a, callback=self.DownClientCaseNoteAttach, meta={'test': TestMeta})
    self.logger.info('ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: ' + TestMeta)
    return

def DownClientCaseNoteAttach(self, response):
    TestArg = response.meta['test']
    self.logger.info('DownClientCaseNoteAttach: test meta')
    self.logger.info(TestArg)
    TestArg = 'this is revised from DownClientCaseNoteAttach'
    with open(AbsPath, 'wb') as f:
        f.write(response.body)
    return
I got below result in log:
2018-09-29 09:26:13 [debug] INFO: ParseClientCaseNote: after call DownClientCaseNoteAttach, testmeta is: this is to test meta argu
2018-09-29 09:26:17 [debug] INFO: DownClientCaseNoteAttach: test meta
2018-09-29 09:26:17 [debug] INFO: this is to test meta argu
It seems the parsing function is deferred. How can I get the result correctly?
Thanks
I used a workaround to address this. In page A I read the file name shown on the web page, rewrite the URL so it points to a local file with that name, and pass the name to my own download function.
In the download function I verify the file name from response.headers['Content-Disposition'].decode(response.headers.encoding) to make sure it matches the name I found on page A before saving the file.
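A rough sketch of that idea, not the original spider: the start URL, attachment selector, file-naming rule, and ATTACHMENT_DIR below are made-up placeholders. The key point is that a callback cannot change variables in the method that yielded the request (the request is only scheduled and runs later), so the URL rewriting is done up front and everything that depends on the downloaded response lives in the callback:

import os
from urllib.parse import urljoin

import scrapy


ATTACHMENT_DIR = '/tmp/attachments'  # hypothetical local folder for the downloaded files


class CaseNoteSpider(scrapy.Spider):
    name = 'casenotes'  # hypothetical spider name
    start_urls = ['https://example.com/pageA']  # placeholder

    def parse(self, response):
        # Delegate to the page-A handler; in the real spider this may be reached differently.
        return self.parse_client_case_note(response)

    def parse_client_case_note(self, response):
        os.makedirs(ATTACHMENT_DIR, exist_ok=True)
        body = response.text
        # Hypothetical selector; adapt it to how the attachments are linked on page A.
        for link in response.css('a.attachment::attr(href)').getall():
            file_name = link.rsplit('=', 1)[-1]  # placeholder naming rule based on the ?id= value
            # Rewrite page A right here; a later callback cannot change this method's variables.
            body = body.replace(link, os.path.join(ATTACHMENT_DIR, file_name))
            yield scrapy.Request(
                urljoin(response.url, link),
                callback=self.save_attachment,
                meta={'file_name': file_name},
            )
        # Save the rewritten page A once all attachment requests have been scheduled.
        with open(os.path.join(ATTACHMENT_DIR, 'pageA.html'), 'w', encoding='utf-8') as f:
            f.write(body)

    def save_attachment(self, response):
        # Everything that depends on the downloaded response happens in the callback.
        file_name = response.meta['file_name']
        sent_name = response.headers.get('Content-Disposition', b'').decode('utf-8', 'ignore')
        if file_name not in sent_name:
            self.logger.warning('Name mismatch for %s: %s', response.url, sent_name)
        with open(os.path.join(ATTACHMENT_DIR, file_name), 'wb') as f:
            f.write(response.body)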

Single page addition to 'singlepage' website not working

I might be asking a stupid question, but I've spent days on Stack Overflow and GitHub as well as Hugo's official documentation, and I've found 15 different ways of doing this and nothing seems to work.
I have a one-page Hugo website and I want to add a privacy policy.
Within the root/config.toml I have the following:
[[params.footer.quicklinks]]
text = "Privacy Policy"
link = "privacypolicy.html"
Within root/content I have a file called privacypolicy.md with the following:
---
title: "Privacy Policy"
type: page
page: "privacypolicy.html"
---
Within root/layout/page I have privacypolicy.html
When I click the link on the core page to go to the privacy policy I get a '404 page not found'
Fix the typo in layouts: put the privacypolicy.html file in the root/layouts/page dir (note layouts, not layout).
Create a new page dir and put privacypolicy.md in root/content/page.
Use the url field in the md file like this:
---
title: "your title"
type: page
Url: page/privacypolicy
---
Your content here...
This page will then be served at http://baseUrl/page/privacypolicy. It is recommended to rerun hugo server and hard refresh (Ctrl+Shift+R) the page.
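If you go this route, the footer quicklink in config.toml should point at the new path as well; a small sketch (the relative URL below assumes the default baseURL handling):

[[params.footer.quicklinks]]
text = "Privacy Policy"
link = "page/privacypolicy/"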

Thunderbird Lightning caldav sync doesn't show any data/events

When I try to synchronize my CalDAV server implementation with Thunderbird 45.4.0 and Lightning 4.7.4 (one particular calendar collection), it doesn't show any data or events in the calendar, although the last call of the sequence provided the data.
In the Thunderbird error log I can see one error:
Timestamp: 07.11.16, 14:21:12
Error: [calCachedCalendar] replay action failed: null,
uri=http://127.0.0.1:8003/sap/sports/webdav/appsvc/webdav/services/
server.xsjs/cal/_D043133/, result=2147500037, op=[xpconnect wrapped
calIOperation]
Source file:
file:///Users/d043133/Library/Thunderbird/Profiles/hfbvuk9f.default/
extensions/%7Be2fda1a4-762b-4020-b5ad-a41df1933103%7D/calendar-
js/calCachedCalendar.js
Line: 327
The call sequence is as follows (detailed content via gist links):
Propfind Request - Response
Options Request - Response
Propfind Request - Response
Report Request - Response - Response Raw
Synchronization with other clients like the macOS Calendar and the iOS Calendar works in principle and shows the data. Does anyone have a clue what is going wrong here?
Not sure whether that is the cause but I can see two incorrect things:
a) Your <href/> property has trailing spaces:
<d:href>/sap/sports/webdav/appsvc/webdav/services/server.xsjs/cal/_D043133/EVENT%3A070768ba5dd78ff15458f1985cdaabb1.ics
</d:href>
b) Your ORGANIZER property is not a valid URI:
ORGANIZER:_D043133
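For comparison, ORGANIZER must carry a URI, typically a mailto: address; the address below is only a placeholder:

ORGANIZER;CN=D043133:mailto:d043133@example.com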
I was able to find the cause of the above issue by debugging Thunderbird as proposed by Philipp. The REPORT response has HTTP status code 200, but since it is a multistatus response, Thunderbird/Lightning expects status code 207 ;-)
Thanks for the hints!

Web scraping dynamically loading data in R

I am using R to scrape data about reviews of 3D printer hubs from here. I need to grab the URL for each of the hubs in the search results. I started with the rvest package, but the data is loaded dynamically (I believe using AngularJS) and rvest could not capture it.
After reviewing Stack Overflow, I found I could load the web page onto my computer and save it as an HTML file using PhantomJS (phantomjs.org). I did that with the following code.
# this example scrapes the user table from:
url <- "https://www.3dhubs.com/3dprint#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
system("phantomjs scrape.js > scrape.html")
# use rvest as you would normally use it
page_html <- read_html("scrape.html")
The above code did not load any of the desired data into R. Then I found the package rdom (https://github.com/cpsievert/rdom). rdom uses a technique similar to the one above, and it was able to load the names of each of the hubs, but not the links to the hub pages.
tbl <- rdom::rdom("https://www.3dhubs.com/3dprint#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY")
htmltxt <- paste(capture.output(tbl, file=NULL), collapse="\n")
write(htmltxt, file = "scrape.html")
page_html <- read_html("scrape.html")
I have a very basic working knowledge of GET and POST requests. Using the Firebug add-on in Firefox, I was able to find the POST request that populates the fields:
https://hub-listings.3dhubs.com/listings
In the response headers, the website only allows requests from 3dhubs.com. Here are the headers for reference:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://www.3dhubs.com
access-control-expose-headers: api-version, content-length, content-md5, content-type, date, request-id, response-time
Content-Type: application/json
Date: Wed, 14 Sep 2016 15:48:30 GMT
Content-Length: 227629
Connection: keep-alive
Is there some other technique I should try? Or does the "Access-Control-Allow-Origin" header make it impossible?
An additional question: the search results are paginated. The second page is only loaded when the "2" is selected at the bottom of the page, but the URL does not change from page 1 to page 2. How would you account for this in web scraping?
Here is another approach that can be considered:
library(RSelenium)
url <- "https://www.hubs.com/3d-printing/#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
htmltxt <- remDr$getPageSource()[[1]]
You can also consider the following approach:
library(RDCOMClient)
url <- "https://www.hubs.com/3d-printing/#/?place=New%20York&latitude=40.7144&longitude=-74.006&distanceLimit=250&distanceUnit=miles&shipsToCountry=US&shipsToState=NY"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
htmltxt <- doc$documentElement()$innerHtml()
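Whichever of the two approaches renders the page, the resulting htmltxt can then be parsed with rvest to pull out the hub links; the CSS selector below is a placeholder that has to be adapted to the page's actual markup. For the pagination question, RSelenium can also locate the "2" button with remDr$findElement() and call clickElement() on it before grabbing the page source again, since the next page is loaded in place without a URL change.

library(rvest)

page <- read_html(htmltxt)
hub_links <- page %>%
  html_nodes("a.hub-card") %>%   # placeholder selector for the hub links
  html_attr("href")
hub_links <- unique(hub_links[!is.na(hub_links)])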
