Programmatically get web request initiator - selenium-webdriver

The Chrome Dev Tools network tab has an initiator column that will show you exactly what code initiated the network request.
I'd like to be able to get network request initiator information programmatically, so I could run a script with a url and request search string argument, and it would return details about where every request with a url matching request search string came from on the page at url. So given the arguments www.stackoverflow.com and google the output might look something like this (showing requesting url, line number, and requested url):
/ 19 http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
/ 4291 http://www.google-analytics.com/analytics.js
I looked into PhantomJS, but its onResourceRequested callback doesn't provide any initiator information, or context from which it can be derived, according to the documentation: http://phantomjs.org/api/webpage/handler/on-resource-requested.html
Is it possible to do this with PhantomJS at all, or with some other tool or service such as Selenium?
UPDATE
From the comments and answers so far it seems as though this isn't currently supported by Phantom, Selenium or anything else. So here's an alternative approach that might work: Load the page, and all of the assets, and then find any occurrences of request search string in all of the files. How could I do that?

You should file a feature request in the issue tracker against the DevTools. The initiator information is not exported in the HAR, so getting it out of there isn't going to work. As far as I know, no existing API allows for this either.

I've been able to implement a solution that uses PhantomJS to get all of the URLs loaded by a page, and then uses a combination of xargs, curl and grep to find the search string at those URLs.
The first piece is this PhantomJS script, which simply outputs every URL requested by a page:
var system = require('system');
var page = require('webpage').create();

// Print the URL of every resource the page requests.
page.onResourceRequested = function(req) {
    console.log(req.url);
};

page.open(system.args[1], function(status) {
    phantom.exit(1);
});
Here it is in action:
$ phantomjs urls.js http://www.stackoverflow.com | head -n6
http://www.stackoverflow.com/
http://stackoverflow.com/
http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
http://cdn.sstatic.net/Js/stub.en.js?v=06bb9dbfaca7
http://cdn.sstatic.net/stackoverflow/all.css?v=af4b547e0e9f
http://cdn.sstatic.net/img/share-sprite-new.svg?v=d09c08f3cb07
For my problem I'm not interested in images, and those can be filtered out by adding the PhantomJS argument --load-images=no.
The second piece is taking all of those URLs and searching them. It's not enough to just output the match; I also need some context around it, which URL it was found in, and ideally the line number too. Here's how to do that:
$ cat urls | xargs -I% sh -c "curl -s % | grep -E -n -o '(.{0,30})SEARCH_TERM(.{0,30})' | sed 's#^#% #'"
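To make the stages easier to follow, here is an equivalent pipeline for a single URL (the url and term values below are placeholders for illustration):
url=http://www.stackoverflow.com/
term=SEARCH_TERM
curl -s "$url" |                                # fetch the resource
  grep -E -n -o "(.{0,30})${term}(.{0,30})" |   # line number plus ~30 chars of context on each side
  sed "s#^#${url} #"                            # prefix each match with the URL it came from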
We can wrap this all up in a little script, where we'll pipe the output back through grep to get color highlighting on the search string:
#!/bin/bash
phantomjs --load-images=no urls.js $1 | xargs -I% sh -c "curl -s % | grep -E -n -o '(.{0,30})$2(.{0,30})' | sed 's#^#% #' | grep $2 --color=always"
We can then use it to search for any term on any site. Here we're looking for adzerk.net on stackoverflow.com:
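The invocation for that example is along these lines (urlgrep.sh is just a placeholder name for the wrapper script above; the original post showed the highlighted output as a screenshot):
chmod +x urlgrep.sh          # hypothetical file name for the wrapper above
./urlgrep.sh http://www.stackoverflow.com adzerk.net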
So you can see that the adzerk.net request gets initiated somewhere around line 4158 of the main stackoverflow page. It's not a perfect solution, because the invocation might be somewhere completely different from where the URL is defined, but it's probably close, and certainly a good place to start tracking down the exact invocation site.
There might be a better way to search the contents of each URL. It doesn't look like PhantomJS's onResourceReceived handler currently exposes the resource content, but there is ongoing work to address that, and once that's available all of this will be much simpler.

You can use Chrome's debugger protocol from a process external to Chrome or use the chrome.debugger API in a Chrome extension (see How to retrieve the Initiator of a request when extending Chrome DevTool?).
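For completeness, a rough sketch of the first route, an external process talking to Chrome's debugger protocol. The binary name, port and flags below are assumptions that vary by platform and Chrome version; the point is that a DevTools-protocol client subscribed to Network.requestWillBeSent receives an initiator field with every request:
# Start Chrome with the remote debugging port open (flag support depends on your Chrome build).
google-chrome --headless --remote-debugging-port=9222 "http://www.stackoverflow.com" &
# Each debuggable target listed here exposes a webSocketDebuggerUrl. A DevTools-protocol
# client can connect to it, send Network.enable, and then read Network.requestWillBeSent
# events, which carry the initiator information the DevTools UI shows.
curl -s http://localhost:9222/json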

Related

Is there any way to connect to a remote container without opening a folder?

I am implementing a dev environment for Arduino and other MCUs. I have a container image with all the compilers and tool-chains required, and I have a script to connect VSCode to it.
The connection magic is done by this:
CONTAINER_NAME="dev-environments-mcus"
hex=$(printf \{\"containerName\"\:\""$CONTAINER_NAME"\"\} | od -A n -t x1 | tr -d '[\n\t ]')
code --folder-uri vscode-remote://attached-container+${hex}/App_Home/mcu-projects
This works perfectly, but the problem is that by doing this I am opening a specific folder in the container, which is not ideal for a generic dev environment.
I would like to know if it is possible to replicate on the command line the "Attach in new window" button behaviour, which opens an "empty" window when you click on it.
Edit 1: Replacing --folder-uri with --file-uri makes my script work better, but I would like to open no file at all, or at least open the start page.
PS: Just in case you are curious, this is the project's GitHub.
OK, I think I managed to solve it. I will share what I did just in case anyone finds themselves in the same situation.
I just had to use the option --file-uri rather than --folder-uri and append a slash / at the end of the command. Now no folder or empty file is opened when VSCode starts.
This is how the script looks now:
CONTAINER_NAME="dev-environments-mcus"
hex=$(printf \{\"containerName\"\:\""$CONTAINER_NAME"\"\} | od -A n -t x1 | tr -d '[\n\t ]')
code --file-uri vscode-remote://attached-container+${hex}/
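As an optional sanity check, you can decode the hex string back (assuming xxd is available) to confirm the JSON payload that VSCode receives:
# Reverse the od hex dump; should print {"containerName":"dev-environments-mcus"}
echo "$hex" | xxd -r -p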

How to track Icecast2 visits with Matomo?

My beloved web radio has an Icecast2 instance and it just works. We also have a Matomo instance to track visits on our WordPress website, using only Free/Libre and open source software.
The main issue is that, since Matomo tracks visits via JavaScript, direct visits to the web-radio stream are not intercepted by Matomo by default.
How to use Matomo to track visits to Icecast2 audio streams?
Yep, it's possible. Here's my way.
First of all, try the Matomo internal import script. Be sure to set your --idsite= and the correct path to your Matomo installation:
su www-data -s /bin/bash
python2.7 /var/www/matomo/misc/log-analytics/import_logs.py --show-progress --url=https://matomo.example.com --idsite=1 --recorders=2 --enable-http-errors --log-format-name=icecast2 --strip-query-string /var/log/icecast2/access.log
NOTE: If you see this error:
[INFO] Error when connecting to Matomo: HTTP Error 400: Bad Request
then be sure to have all needed plugins activated:
Administration > System > Plugins > Bulk plugin
So, if the script works, it should start printing something like this:
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log /var/log/icecast2/access.log...
1013 lines parsed, 200 lines recorded, 99 records/sec (avg), 200 records/sec (current)
If so, immediately stop the script to avoid importing duplicate entries before installing the definitive solution.
To stop the script use CTRL+C.
Now we need to run this script every time the log is rotated, just before rotation.
The official documentation suggests a crontab, but I don't recommend that solution; I suggest configuring logrotate instead.
Configure the file /etc/logrotate.d/icecast2. From:
/var/log/icecast2/*.log {
    ...
    weekly
    ...
}
To:
/var/log/icecast2/*.log {
    ...
    daily
    prerotate
        su www-data -s /bin/bash --command 'python2.7 ... /var/log/icecast2/access.log' > /var/log/logrotate-icecast2-matomo.log
    endscript
    ...
}
IMPORTANT: In the above example replace ... with the right command.
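For reference, with the import command from the first step dropped in, the prerotate block would look roughly like this (same paths, site ID and Matomo URL as assumed above):
prerotate
    su www-data -s /bin/bash --command 'python2.7 /var/www/matomo/misc/log-analytics/import_logs.py --show-progress --url=https://matomo.example.com --idsite=1 --recorders=2 --enable-http-errors --log-format-name=icecast2 --strip-query-string /var/log/icecast2/access.log' > /var/log/logrotate-icecast2-matomo.log
endscript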
Now you can also try it manually:
logrotate -vf /etc/logrotate.d/icecast2
From another terminal you should be able to see its result in real-time with:
tail -f /var/log/logrotate-icecast2-matomo.log
If it works, it means everything will work automatically from now on, importing all visits every day, without any duplicates and without missing any lines.
More documentation here about the import script itself:
https://github.com/matomo-org/matomo-log-analytics
More documentation here about logrotate:
https://linux.die.net/man/8/logrotate

Defining a variable from a value in a JSON array in Bash

The US Naval Observatory has an API that outputs a JSON file containing the sunrise and sunset times, among other things, as documented here.
Here is an example of the output JSON file:
{
"error":false,
"apiversion":"2.0.0",
"year":2017,
"month":6,
"day":10,
"dayofweek":"Saturday",
"datechanged":false,
"lon":130.000000,
"lat":30.000000,
"tz":0,
"sundata":[
{"phen":"U", "time":"03:19"},
{"phen":"S", "time":"10:21"},
{"phen":"EC", "time":"10:48"},
{"phen":"BC", "time":"19:51"},
{"phen":"R", "time":"20:18"}],
"moondata":[
{"phen":"R", "time":"10:49"},
{"phen":"U", "time":"16:13"},
{"phen":"S", "time":"21:36"}],
"prevsundata":[
{"phen":"BC","time":"19:51"},
{"phen":"R","time":"20:18"}],
"closestphase":{"phase":"Full Moon","date":"June 9, 2017","time":"13:09"},
"fracillum":"99%",
"curphase":"Waning Gibbous"
}
I'm relatively new to using JSON, but I understand that everything in square brackets after "sundata" is a JSON array (please correct me if I'm wrong). So I searched for instructions on how to get a value from a JSON array, without success.
I have downloaded the file to my system using:
wget -O usno.json "http://api.usno.navy.mil/rstt/oneday?ID=iOnTheSk&date=today&tz=0&coords=30,130"
I need to extract the time (in HH:MM format) from this line:
{"phen":"S", "time":"10:21"},
...and then use it to create a variable (that I will later write to a separate file).
I would prefer to use Bash if possible, preferably using a JSON parser (such as jq) if it'll be easier to understand/implement. I'd rather not use Python (which was suggested by a lot of the articles I have read previously) if possible as I am trying to become more familiar with Bash specifically.
I have examined a lot of different webpages, including answers on Stack Overflow, but none of them have specifically covered an array line with two key/value pairs per line (they've only explained how to do it with only one pair per line, which isn't what the above file structure has, sadly).
Specifically, I have read these articles, but they did not solve my particular problem:
https://unix.stackexchange.com/questions/177843/parse-one-field-from-an-json-array-into-bash-array
Parsing JSON with Unix tools
Parse json array in shell script
Parse JSON to array in a shell script
What is JSON and why would I use it?
https://developers.squarespace.com/what-is-json/
Read the json data in shell script
Thanks in advance for any thoughts.
Side note: I have managed to do this with a complex 150-odd line script made up of "sed"s, "grep"s, "awk"s, and whatnot, but obviously if there's a one-liner JSON-native solution that's more elegant, I'd prefer to use that as I need to minimise power usage wherever possible (it's being run on a battery-powered device).
(Side-note to the side-note: the script was so long because I need to do it for each line in the JSON file, not just the "S" value)
If you already have jq you can easily select your desired time with:
sun_time=$(jq '.sundata[] | select(.phen == "S").time' usno.json)
echo $sun_time
# "10:21"
If you must use "regular" bash commands (really, use jq):
wget -O - "http://api.usno.navy.mil/rstt/oneday?ID=iOnTheSk&date=today&tz=0&coords=30,130" \
| sed -n '/^"sundata":/,/}],$/p' \
| sed -n -e '/"phen":"S"/{s/^.*"time":"//'\;s/...$//\;p}
Example:
$ wget -O - "http://api.usno.navy.mil/rstt/oneday?ID=iOnTheSk&date=today&tz=0&coords=30,130" | sed -n '/^"sundata":/,/}],$/p' | sed -n -e '/"phen":"S"/{s/^.*"time":"//'\;s/...$//\;p}
--2017-06-10 08:02:46-- http://api.usno.navy.mil/rstt/oneday?ID=iOnTheSk&date=today&tz=0&coords=30,130
Resolving api.usno.navy.mil (api.usno.navy.mil)... 199.211.133.93
Connecting to api.usno.navy.mil (api.usno.navy.mil)|199.211.133.93|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘STDOUT’
- [ <=> ] 753 --.-KB/s in 0s
2017-06-10 08:02:47 (42.6 MB/s) - written to stdout [753]
10:21
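And since the side note mentions needing every line of sundata, not just the "S" entry, a jq variation along these lines (a sketch against the same usno.json file) emits every phenomenon code with its time:
# Prints one "code time" pair per line, e.g. "U 03:19", "S 10:21", ...
jq -r '.sundata[] | "\(.phen) \(.time)"' usno.json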

How do I test a manual check in check_mk / Nagios

My organization is using Nagios with the check_mk plugin to monitor our nodes. My question is: is it possible to run a manual check from the command line? It is important, process-wise, to be able to test a configuration change before deploying it.
For example, I've prepared a configuration change which uses the ps.perf check type to check the number of httpd processes on our web servers. The check looks like this:
checks = [
    ( ["web"], ALL_HOSTS, "ps.perf", "Number of httpd processes", ( "/usr/sbin/httpd", 1, 2, 80, 100 ) )
]
I would like to test this configuration change before committing and deploying it.
Is it possible to run this check via the command line, without first adding it to main.mk? I'm envisioning something like:
useful_program -H my.web.node -c ps.perf -A /usr/sbin/httpd,1,2,80,100
I don't see any way to do something like this in the check_mk documentation, but am hoping there is a way to achieve something like this.
Thanks!
That is easy to check.
Just make your config changes and then run:
cmk -nv HOSTNAME
That will try to run everything (-n) and print the output (-v), so you can see the same results you would later see in the GUI.
List the check:
$ check_mk -L | grep ps.perf
If it lists ps.perf, then run the following command:
$ check_mk --checks=ps.perf -I Hostname
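Putting the two suggestions together, a quick test session against the host from the question might look like this (my.web.node is the hypothetical host name from the question; the commands themselves are the ones shown above):
check_mk -L | grep ps.perf                  # confirm the ps.perf check type is known
check_mk --checks=ps.perf -I my.web.node    # inventory/run only the ps.perf check against that host
cmk -nv my.web.node                         # test run of all checks with verbose output, as described above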

Nutch updatedb killed and skipped batch ids

I am using Nutch 2.0 and Solr 4.0 and am having minimal success. I have 3 URLs and my regex-urlfilter.xml is set to allow everything.
I ran this script
#!/bin/bash
# Nutch crawl
export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local
# depth in the web exploration
n=1
# number of selected urls for fetching
maxUrls=50000
# solr server
solrUrl=http://localhost:8983
for (( i = 1 ; i <= $n ; i++ ))
do
    log=$NUTCH_HOME/logs/log
    # Generate
    $NUTCH_HOME/bin/nutch generate -topN $maxUrls > $log
    batchId=`sed -n 's|.*batch id: \(.*\)|\1|p' < $log`
    # rename log file by appending the batch id
    log2=$log$batchId
    mv $log $log2
    log=$log2
    # Fetch
    $NUTCH_HOME/bin/nutch fetch $batchId >> $log
    # Parse
    $NUTCH_HOME/bin/nutch parse $batchId >> $log
    # Update
    $NUTCH_HOME/bin/nutch updatedb >> $log
    # Index
    $NUTCH_HOME/bin/nutch solrindex $solrUrl $batchId >> $log
done
Of course I run bin/nutch inject urls before I run the script, but when I look at the logs, I see Skipping : different batch id, and some of the URLs I see are ones that aren't in the seed.txt. I want to include them in Solr, but they aren't added.
I have 3 urls in my seed.txt
After I ran this script I had tried
bin/nutch parse -force -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex
My questions are as follows.
1. Why were the last three commands necessary?
2. How do I get all of the URLs during the parse job? Even with -force -all I still get the "different batch id" skipping.
3. In the script above, if I set generate -topN to 5, does this mean that if a site links to another site, which links to another site, and so on, all of those will be included in the fetch/parse cycle?
4. What about this command, why is this even mentioned:
bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/sites -depth 3 -topN 10000 -threads 3.
5. When I run bin/nutch updatedb it takes 1-2 minutes and then it echoes Killed. This concerns me. Please help.
And yes, I have read a lot of pages on nutch and solr and I have been trying to figure this out for weeks now.
some of the URLs that I see are ones that aren't in the seed.txt
I think this is happening due to URL normalization. Nutch normalizes URLs, which changes or converts the original URL to a more standard form.
for #1: You injected and then executed the generate-fetch phases... right? Those 3 commands in your question are required for parsing the crawled data, updating the db with newly discovered pages, and indexing them, respectively.
for #2: Sorry but I didn't get your question.
for #3: No. topN set to 5 means Nutch will select the top 5 URLs from the whole set of URLs eligible for fetching. It will consider only these selected, highest-scored URLs for fetching.
for #4: That is a single command which invokes all Nutch phases automatically, so you won't have to manually execute a separate command for each phase. One command does all the work.
for #5: There will be some exception logged in the Hadoop logs. Provide the stack trace and error message so that I can comment on it. Without that I can't think of anything.
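If it helps in finding that stack trace: with a local (non-distributed) Nutch runtime the exceptions usually end up in hadoop.log under the runtime's logs directory; the path below is an assumption based on the NUTCH_HOME used in the script above.
tail -n 100 ~/java/workspace/Nutch2.0/runtime/local/logs/hadoop.log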
