phantomjs page.evaluate to scrape "resultStats" from http://www.google.com/search?q=site:%s works in local server but not production server - screen-scraping

Using phantomjs page.evaluate to extract "resultStats" (a div id) from http://www.google.com/search?q=site:%s works on my local server but not on my production server.
NOTE: I'm using the latest phantomjs, 1.9.7; however, I experienced the same issue with the previous version, 1.9.6.
NOTE: phantomjs page.render (on the Google home page as well as any other domain name) works on both servers and creates nice screenshots.
On my production server (Debian stable 7.3, on linode.com), the PHP code below with a top-level domain name as "$url" returns:
TypeError: 'null' is not an object (evaluating 'document.getElementById('resultStats').textContent') phantomjs://webpage.evaluate():2 phantomjs://webpage.evaluate():3 phantomjs://webpage.evaluate():3 null
On my local server (Debian testing), the PHP code below with the same "$url" returns:
About 43 results
This happens with any domain name/URL I use as the argument; I've tested dozens.
What might cause this to occur on my remote production server but not on my local server?
gsiteindex.js
var page = require('webpage').create();
var site = phantom.args[0];

page.open("https://www.google.com/search?q=site:" + site, function (status) {
    var result = page.evaluate(function () {
        return document.getElementById('resultStats').textContent;
    });
    console.info(result);
    phantom.exit();
});
.php
$phantomjs = "phantomjs";
$script = "gsiteindex.js";
$site = $url;
$command = "$phantomjs $script $site";
$googlestring = shell_exec($command);
echo $googlestring;
die();

Well, nrabinowitz was right. I tested this further on my own server using proxies: most timed out, some returned the above error, and a couple returned correct results (I assume they were correct based on the location of the proxy's IP address, because the figures were a little different from those returned via my ISP's public IP address in California, USA).
So it's simply a matter of Google blocking certain types of requests from certain IP addresses.
Thanks again for the comment.
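For anyone who wants to reproduce this test, PhantomJS accepts a --proxy switch on the command line, so only the PHP wrapper needs to change; a minimal sketch, assuming a hypothetical proxy at 203.0.113.10:8080:
<?php
// Route the PhantomJS request through a proxy to test for IP-based blocking.
// The proxy address below is a placeholder -- substitute one you control.
$phantomjs = "phantomjs";
$proxy     = "--proxy=203.0.113.10:8080 --proxy-type=http";
$script    = "gsiteindex.js";
$site      = escapeshellarg($url);   // quote the argument before shelling out
$command   = "$phantomjs $proxy $script $site";
$googlestring = shell_exec($command);
echo $googlestring;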

Include a header with a user agent, e.g.:
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
Without a user agent you get Google's default-style page, which has no resultStats. I also had this issue, and adding the header helped.
The default Google search page looks like this: [screenshot of the default results page omitted]
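In the PhantomJS script from the question, the equivalent fix is to set the user agent on the page before calling page.open; a sketch of gsiteindex.js with that one addition:
var page = require('webpage').create();
var site = phantom.args[0];

// Pretend to be a desktop browser so Google serves the full results page,
// which contains the #resultStats element.
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0';

page.open('https://www.google.com/search?q=site:' + site, function (status) {
    var result = page.evaluate(function () {
        return document.getElementById('resultStats').textContent;
    });
    console.info(result);
    phantom.exit();
});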

Related

Phpstan doctrine database connection error

I'm using Doctrine in a project (not Symfony). In this project I also use PHPStan; I installed both phpstan/phpstan-doctrine and phpstan/extension-installer.
My phpstan.neon is like this:
parameters:
    level: 8
    paths:
        - src/
    doctrine:
        objectManagerLoader: tests/object-manager.php
Inside tests/object-manager.php, it returns the result of a call to a function that returns the entity manager.
Here is the code that creates the entity manager:
$database_url = $_ENV['DATABASE_URL'];
$isDevMode = $this->isDevMode();
$proxyDir = null;
$cache = null;
$useSimpleAnnotationReader = false;
$config = Setup::createAnnotationMetadataConfiguration(
[$this->getProjectDirectory() . '/src'],
$isDevMode,
$proxyDir,
$cache,
$useSimpleAnnotationReader
);
// database configuration parameters
$conn = [
'url' => $database_url,
];
// obtaining the entity manager
$entityManager = EntityManager::create($conn, $config);
When I run vendor/bin/phpstan analyze I get this error:
Internal error: An exception occurred in the driver: SQLSTATE[08006] [7] could not translate host name "postgres_db" to address: nodename nor servname provided, or not known
This appears because I'm using Docker and my database URL is postgres://user:password@postgres_db/database. postgres_db is the name of my database container, so the hostname is only known inside the Docker network.
When I run phpstan inside the container, I do not get the error.
So is there a way to run phpstan outside Docker? I'm pretty sure that when I push my code, the GitHub workflow will fail because of this.
Does phpstan need to try to reach the database?
I opened an issue on the phpstan-doctrine GitHub and got an answer from @jlherren that explained:
The problem is that Doctrine needs to know what version of the DB server it is working with in order to instantiate the correct AbstractPlatform implementation, of which there are several available for the same DB vendor (e.g. PostgreSQL94Platform or PostgreSQL100Platform for postgres, and similarly for other DB drivers). To auto-detect this information, it will simply connect to the DB and query the version.
I just changed my database URL from:
DATABASE_URL=postgres://user:password@database_ip/database
to:
DATABASE_URL=postgres://user:password@database_ip/database?serverVersion=14.2
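The same hint can be supplied when the connection is configured as an array rather than a URL; a sketch, assuming Doctrine DBAL's serverVersion connection parameter:
// With serverVersion set, Doctrine can pick the right AbstractPlatform
// without having to connect to the database to query its version.
$conn = [
    'url' => $_ENV['DATABASE_URL'],
    'serverVersion' => '14.2',
];
$entityManager = EntityManager::create($conn, $config);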

In Selenium webdriver, for remote Firefox how to use OSS bridge instead of w3c bridge for handshaking

I am using Selenoid for remote browser testing in Ruby, with 'selenium-webdriver', 'capybara', and 'rspec' for automation, and I am using the attach_file method to upload a file to the browser.
I want to upload a file in both Firefox and Chrome, but it raises an error in each:
In Chrome:
Selenium::WebDriver::Error::UnknownCommandError: unknown command: unknown command: session/***8d32e045e3***/se/file
In Firefox:
unexpected token at 'HTTP method not allowed'
After searching, I found the solution for Chrome, which is to set the w3c option to false in caps['goog:chromeOptions']: caps['goog:chromeOptions'] = {w3c: false}
So now Chrome uses the OSS bridge for handshaking, but I don't know how to do the same in Firefox; a similar solution is not available for Firefox.
My browser capabilities are following:
if ENV['BROWSER'] == 'firefox'
  caps = Selenium::WebDriver::Remote::Capabilities.new
  caps['browserName'] = 'firefox'
  # caps['moz:firefoxOptions'] = {w3c: false} ## It is not working
else
  caps = Selenium::WebDriver::Remote::Capabilities.new
  caps["browserName"] = "chrome"
  caps["version"] = "81.0"
  caps['goog:chromeOptions'] = {w3c: false}
end
caps["enableVNC"] = true
caps["screenResolution"] = "1280x800"
caps['sessionTimeout'] = '15m'

Capybara.register_driver :selenium do |app|
  Capybara::Selenium::Driver.new(app, browser: :remote,
                                 :desired_capabilities => caps,
                                 :url => ENV["REMOTE_URL"] || "http://*.*.*.*:4444/wd/hub")
end

Capybara.configure do |config|
  config.default_driver = :selenium
end
I have found the problem. There is a bug in the Selenium server (which runs on Java), so I had to change my selenium-webdriver gem to version 3.142.7 and monkey-patch it.
You can find more information here about the bug and its resolution.
For now I have to change my gem and patch the selenium-webdriver-3.142.7\lib\selenium\webdriver\remote\w3c\commands.rb file. Check for the line below, which is at line no. 150:
upload_file: [:post, 'session/:session_id/se/file']
and update it to
upload_file: [:post, 'session/:session_id/file']
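If you would rather not edit the installed gem, the same change can be applied at runtime; a sketch, assuming the 3.142.x layout where the W3C command table lives in Selenium::WebDriver::Remote::W3C::Bridge::COMMANDS (verify the constant path for your version):
require 'selenium-webdriver'

# Rebuild the frozen command table with the standard-compliant upload path.
commands = Selenium::WebDriver::Remote::W3C::Bridge::COMMANDS.dup
commands[:upload_file] = [:post, 'session/:session_id/file']
Selenium::WebDriver::Remote::W3C::Bridge.send(:remove_const, :COMMANDS)
Selenium::WebDriver::Remote::W3C::Bridge.const_set(:COMMANDS, commands.freeze)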
I had a similar issue with Rails 7. The issue is connected with the w3c standard. The core problem is that the webdriver for Chrome uses a non-w3c-standard URL for handling file uploads: when uploading a file, the webdriver uses the /se/file URL path. This path is only supported by the Selenium server, so the Docker image provided by Selenium works fine; yet if we use chromedriver, the upload fails. More info.
We can solve this by forcing the webdriver to use the standard-compliant URL, overriding the :upload_file key in Selenium::WebDriver::Remote::Bridge::COMMANDS. Since the initialization of the COMMANDS constant does not happen when the module is loaded, we can override the attach_file method to make sure the constant is set correctly. Here is the hacky code:
module Capybara::Node::Actions
  alias_method :original_attach_file, :attach_file

  def attach_file(*args, **kwargs)
    implement_hacky_fix_for_file_uploads_with_chromedriver
    original_attach_file(*args, **kwargs)
  end

  def implement_hacky_fix_for_file_uploads_with_chromedriver
    return if @hacky_fix_implemented

    original_verbose, $VERBOSE = $VERBOSE, nil # ignore warnings
    cmds = Selenium::WebDriver::Remote::Bridge::COMMANDS.dup
    cmds[:upload_file] = [:post, "session/:session_id/file"]
    Selenium::WebDriver::Remote::Bridge.const_set(:COMMANDS, cmds)
    $VERBOSE = original_verbose

    @hacky_fix_implemented = true
  end
end
In Firefox images we support the /session/<id>/file API by adding a Selenoid binary which emulates this API in place of Geckodriver (which does not implement it).

Gatling not logging to influxdb?

I've tried following the guide at http://gatling.io/docs/2.2.3/realtime_monitoring/index.html to log my test results to InfluxDB and display the data in a Grafana instance I have previously set up. However, I can't see any of the data that Gatling is supposed to log anywhere in InfluxDB.
I've edited my influxdb.conf file so that it contains the following fields:
[[graphite]]
  enabled = true
  database = "gatlingdb"
  bind-address = ":2003"
  protocol = "tcp"
  consistency-level = "one"
  name-separator = "."
  templates = [
    "gatling.*.*.*.count measurement.simulation.request.status.field",
    "gatling.*.*.*.min measurement.simulation.request.status.field",
    "gatling.*.*.*.max measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles50 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles75 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles95 measurement.simulation.request.status.field",
    "gatling.*.*.*.percentiles99 measurement.simulation.request.status.field"
  ]
and my gatling.conf file contains the following fields:
data {
  writers = [console, file, graphite] # The list of DataWriters to which Gatling write simulation data (currently supported : console, file, graphite, jdbc)
  console {
    #light = false # When set to true, displays a light version without detailed request stats
  }
  graphite {
    #light = false # only send the all* stats
    host = "127.0.0.1" # The host where the Carbon server is located
    port = 2003 # The port to which the Carbon server listens to (2003 is default for plaintext, 2004 is default for pickle)
    protocol = "tcp" # The protocol used to send data to Carbon (currently supported : "tcp", "udp")
    rootPathPrefix = "gatling" # The common prefix of all metrics sent to Graphite
    #bufferSize = 8192 # GraphiteDataWriter's internal data buffer size, in bytes
    #writeInterval = 1 # GraphiteDataWriter's write interval, in seconds
  }
}
Whenever I run my Gatling tests I see no error messages or anything that indicates something is wrong, but I cannot see anything in the influxd logs indicating that data has been logged to InfluxDB, nor can I see any data in the gatlingdb database. I am using InfluxDB v0.10 and Gatling v2.2.3 on Ubuntu.
Can anyone help me figure out what I am doing wrong?
I updated to InfluxDB v1.1 and the problem seems to have resolved itself.
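For anyone still stuck on this, a quick way to check whether the graphite listener is receiving anything at all, independent of Gatling, is to push one metric by hand over the plaintext protocol and then query InfluxDB; a sketch, assuming nc and the influx CLI are installed locally:
# Send one fake metric in graphite plaintext format: <name> <value> <timestamp>
echo "gatling.testsim.request.ok.count 1 $(date +%s)" | nc -q 1 localhost 2003

# Then check whether it arrived
influx -database 'gatlingdb' -execute 'SHOW MEASUREMENTS'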

X-Appengine-Inbound-Appid header not set

I have two AppEngine modules: a default module running Python and a "java" module running Java. I'm accessing the Java module from the default module using urlfetch. According to the AppEngine docs (cloud.google.com/appengine/docs/java/appidentity), I can verify in the Java module that a request originates from a module in the same app by checking the X-Appengine-Inbound-Appid header.
However, this header is not being set (in a production deployment). I use urlfetch in the Python module as follows:
hostname = modules.get_hostname(module="java")
hostname = hostname.replace('.', '-dot-', 2)
url = "http://%s/%s" % (hostname, "_ah/api/...")
result = urlfetch.fetch(url=url, follow_redirects=False, method=urlfetch.GET)
Note that I'm using the notation:
<version>-dot-<module>-dot-<app>.appspot.com
rather than the notation:
<version>.<module>.<app>.appspot.com
which for some reason results in a 404 response.
In the Java module I'm running a servlet filter which looks at all the request headers as follows:
Enumeration<String> headerNames = httpRequest.getHeaderNames();
while (headerNames.hasMoreElements()) {
    String headerName = headerNames.nextElement();
    String headerValue = httpRequest.getHeader(headerName);
    mLog.info("Header: " + headerName + " = " + headerValue);
}
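For reference, the check I'm ultimately trying to make work looks roughly like this (a sketch; the app id literal and the httpResponse variable are placeholders for my actual filter code):
// Trust the request only if App Engine stamped it with our own app id.
String inboundAppId = httpRequest.getHeader("X-Appengine-Inbound-Appid");
if (inboundAppId == null || !inboundAppId.equals("my-app-id")) {
    httpResponse.sendError(HttpServletResponse.SC_FORBIDDEN);
    return;
}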
AppEngine does set some headers, e.g. X-AppEngine-Country. But the X-Appengine-Inbound-Appid header is not set.
Why am I not seeing the documented behaviour? Any suggestions would be much appreciated.
Have a look at the answer I received on Google Groups, which led to an issue being opened on the public issue tracker.
As suggested in that answer, you can follow the issue there for any updates.

How to properly connect cloudant + sag

I'm working on getting a connection to Cloudant working.
The following uses the Sag library for PHP:
<?php
header('Content-Type: text/html; charset=utf-8');
require_once('../../src/Sag.php');

// these credentials are from an API key
$uName = "";
$pName = "";

$sag = new Sag('user.cloudant.com');
$sag->login($uName, $pName);
$sag->setDatabase('test');

try {
    $result = $sag->get('/test/_design/wordsP/_view/errores');
    echo ($result);
} catch (Exception $e) {
    error_log("Something's wrong");
    var_dump($e);
}
?>
However, I'm not getting the expected result. The view does work if called directly from the URL bar.
The response is:
object(SagException)#3 (6) {
  ["message:protected"]=>
  string(50) "Sag Error: cURL error #7: couldn't connect to host"
  ["string:private"]=>
  string(0) ""
  ["code:protected"]=>
  int(0)
  ["file:protected"]=>
  string(73) "/home2/.../public_html/clant/src/httpAdapters/SagCURLHTTPAdapter.php"
  ["line:protected"]=>
  int(134)
  ["trace:private"]=>
  array(3) { .............
Is there something I'm not using correctly in the PHP script? (I removed the current password and username, as well as the account name, but they're there.)
Curl error 7 means you are unable to connect to the host or proxy:
CURLE_COULDNT_CONNECT (7)
Failed to connect() to host or proxy.
Source: http://curl.haxx.se/libcurl/c/libcurl-errors.html
If your browser uses a proxy when you connect to the URL, you will also need to configure Sag to use that proxy.
Also, when you say that you replaced the account in your code example, did you mean that you replaced user in $sag = new Sag('user.cloudant.com'); with your actual username? If not, you will need to use your actual username.
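One way to narrow this down is to probe the same host with plain cURL from the server Sag runs on; a sketch, where user.cloudant.com stands in for your real account host:
<?php
// If this also fails with cURL error #7, the problem is at the network/proxy
// level, not in Sag itself.
$ch = curl_init('https://user.cloudant.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if ($response === false) {
    echo 'cURL error #' . curl_errno($ch) . ': ' . curl_error($ch);
}
curl_close($ch);
?>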
