I am not very knowledgeable when it comes to networking (i.e. HTTP) or JSoup. I am using JSoup to get meta tag contents from a URL. I am getting the error
Connection closed unexpectedly by server at URL: http://blahblah
Here is my code
Document doc = Jsoup.connect(url).get();
Elements metas = doc.getElementsByTag("meta");
...
How do I "configure" JSoup to just grab the content of the webpage, close the connection, and then parse the content obtained? I ask it this way because I imagine the connection is being closed because the request takes too long. Or is it something else, such as the server detecting that the caller is not a human? Say the site is CNN or similar and I am trying to parse a news article for meta tag contents. And no, I am not crawling: I am given a URL and I am sifting through that one page.
Maybe you have to send some header data, as below.
Please try it.
Document doc = Jsoup
        .connect(url.trim())
        .timeout(3000) // milliseconds
        // Host, Content-Length and Content-Type are not set by hand here:
        // a GET has no body, and the HTTP client fills in Host itself.
        .header("Connection", "keep-alive")
        .header("Cache-Control", "max-age=0")
        .header("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        .header("Referer", url.trim())
        .header("Accept-Encoding", "gzip,deflate,sdch")
        .header("Accept-Language", "en-US,en;q=0.8,ru;q=0.6")
        // userAgent() sets the User-Agent header, so it is not repeated via header().
        .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")
        .get();
I have absolutely no idea why, but the problem stops when I do
Connection connection = Jsoup.connect(url);
Document doc = connection.get();
Elements metas = doc.getElementsByTag("meta");
...
Instead of
Document doc = Jsoup.connect(url).get();
Elements metas = doc.getElementsByTag("meta");
...
It makes no sense to me at all, but it is what it is. I have heard of "constructors escaping", which is what led me to try the separation. This is probably not the same thing, but some similar kind of voodoo may be happening under the hood that I just don't understand.
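As a rough sketch of the "download first, parse afterwards" idea from the question: the snippet below does not use JSoup at all, it swaps in the JDK's own java.net.http client (Java 11+) to fetch the page with a timeout and a browser-like User-Agent. The class name PageFetcher, the 10 second timeout and the User-Agent string are assumptions made for illustration, and whether any of this avoids the "Connection closed unexpectedly" error depends on the server.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of JSoup: downloads the page so that parsing is a separate step.
final class PageFetcher {

    // Assumed values; tune both for the sites you actually fetch.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36";
    static final Duration TIMEOUT = Duration.ofSeconds(10);

    // Builds a plain GET with a browser-like User-Agent and a response timeout.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url.trim()))
                .timeout(TIMEOUT)
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Sends the request and returns the whole response body as a String.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(TIMEOUT)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher.java <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters downloaded");
    }
}
```

The returned string can then be handed to Jsoup.parse(html, url), after which doc.getElementsByTag("meta") works as in the question, with no network connection involved in the parsing step.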
I am trying to use Shibboleth 3 as the SP and Azure AD as the IdP, and I can see from the Shibboleth transaction log that I have implemented it successfully.
2022-12-16 12:35:54|Shibboleth-TRANSACTION.AuthnRequest|||https://sts.windows.net/c04845f0-4224-4637-aed2-9beea8319b5b/||||||urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect||||||
2022-12-16 12:35:55|Shibboleth-TRANSACTION.Login||_292e2cf9f81890bcdf7ffe1cd147c92f|https://sts.windows.net/c04845f0-4224-4637-aed2-9beea8319b5b/|_ff1422a3-4c91-4255-adec-fa6fd52d2600|urn:oasis:names:tc:SAML:2.0:ac:classes:Password|2022-12-16T07:00:19|authnmethodsreferences(2),displayname(1),emailaddress(1),givenname(1),groups(1),identityprovider(1),objectidentifier(1),surname(1),tenantid(1)|davisg1#XXXXX.com|urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST||urn:oasis:names:tc:SAML:2.0:status:Success|||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46|167.244.201.154
I changed the email in the text above to "davisg1#XXXXX.com" for obvious reasons.
However, I can't seem to retrieve the variables on my ColdFusion page. I have googled endlessly and not found an answer.
I tried dumping cgi and getHTTPRequestData(), and I also tried hardcoding lookups such as http_givenName #cgi['http_givenName']# and HTTP_REMOTE_USER #cgi['HTTP_REMOTE_USER']#, but nothing useful appears.
I have updated my attribute-map.xml to use the "name" field returned by Azure AD, and made sure that in shibboleth2.xml the ApplicationDefaults REMOTE_USER includes persistent-id.
<Attribute name="http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name" id="persistent-id">
    <AttributeDecoder xsi:type="NameIDAttributeDecoder" formatter="$NameQualifier!$SPNameQualifier!$Name" defaultQualifiers="true"/>
</Attribute>
<ApplicationDefaults entityID="https://intranettest.amc.edu/shibboleth-sp"
REMOTE_USER="eppn subject-id pairwise-id persistent-id"
cipherSuites="DEFAULT:!EXP:!LOW:!aNULL:!eNULL:!DES:!IDEA:!SEED:!RC4:!3DES:!kRSA:!SSLv2:!SSLv3:!TLSv1:!TLSv1.1">
The answer was to add useHeaders="true" to the ISAPI tag in shibboleth2.xml
<ISAPI normalizeRequest="true" safeHeaderNames="true" useHeaders="true">
Below is a sample log message that I am receiving from syslog:
<159>Apr 15 17:27:31 192.168.100.40 CEF:0|Websense|Security|7.8.1|68|Transaction permitted|1| act=permitted app=http dvc=192.168.100.40 dst=221.135.111.120 dhost=img-d01.moneycontrol.co.in dpt=80 src=172.16.237.89 spt=55016 suser=LDAP://172.17.251.11 OU\=Users,OU\=Migrated,DC\=abc,DC\=com/Sourabh Jain destinationTranslatedPort=38419 rt=1460721451000 in=496 out=6999 requestMethod=GET requestClientApplication=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0 reason=- cs1Label=Policy cs1=role-8**Default cs2Label=DynCat cs2=0 cs3Label=ContentType cs3=image/jpeg cn1Label=DispositionCode cn1=1048 cn2Label=ScanDuration cn2=3 request=http://img-d01.moneycontrol.co.in/news_html_files/wealth-experts/abhim1132661059.jpg
As you can see, the data contains key-value pairs. Is there any way I can extract the values and store the data? I can't use a space as the separator, because some of the values contain spaces,
e.g:
suser=LDAP://172.17.251.11 OU\=Users,OU\=Migrated,DC\=abc,DC\=com/Sourabh S Jain
There are spaces within "Sourabh S Jain".
I was able to solve it using the OR operator:
(suser=-|suser=LDAP://.{1,150}/)
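A more general take on the same idea, as a minimal sketch rather than a definitive parser: instead of one pattern per key, let each value run lazily until the next unescaped " key=" or the end of the line. It assumes the input is only the extension part of the CEF line (everything after the last header pipe) and that literal "=" characters inside values are escaped as "\=", as in the sample above. The class name CefExtensionParser is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not tied to any log-management product.
final class CefExtensionParser {

    // key=value, where the value runs lazily until the next " key=" or the end of input.
    // An escaped "\=" never satisfies the lookahead, because "\" is not a word character.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=(.*?)(?=\\s+\\w+=|\\s*$)");

    static Map<String, String> parse(String extension) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(extension);
        while (m.find()) {
            // Undo the escaping of '=' inside values.
            fields.put(m.group(1), m.group(2).replace("\\=", "="));
        }
        return fields;
    }

    public static void main(String[] args) {
        String extension = "act=permitted dpt=80 "
                + "suser=LDAP://172.17.251.11 OU\\=Users,OU\\=Migrated,DC\\=abc,DC\\=com/Sourabh S Jain "
                + "destinationTranslatedPort=38419 reason=-";
        parse(extension).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because the lookahead only stops at whitespace followed by a word and an unescaped "=", the spaces inside the suser value stay part of that value. A value containing an unescaped "word=" sequence would still be split, so this relies on the sender escaping "=" correctly.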
I'm looking for a way to get the IP address with the Camel REST DSL and the Netty4 HTTP component.
I checked the documentation, put a breakpoint on my REST route and looked through the headers, the properties... everywhere, and couldn't find a proper way to get this information.
Headers log:
GET: http://localhost:8080/category,
{Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8, Accept-Encoding=gzip, deflate, sdch, Accept-Language=fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, breadcrumbId=ID-nateriver-54582-1445489005229-0-1, CamelCATEGORY_ACTION=listAction, CamelHttpMethod=GET, CamelHttpPath=, CamelHttpUri=/category, CamelHttpUrl=http://localhost:8080/category, CamelJmsDeliveryMode=2, Connection=keep-alive, Content-Length=0, Cookie=JSESSIONID=fowfzar8n09e16ej9jui6nmsv, Host=localhost:8080, JMSCorrelationID=null, JMSDeliveryMode=2, JMSDestination=topic://Statistics, JMSExpiration=0, JMSMessageID=ID:nateriver-54592-1445489009836-3:1:7:1:1, JMSPriority=4, JMSRedelivered=false, JMSReplyTo=null, JMSTimestamp=1445489017233, JMSType=null, JMSXGroupID=null, JMSXUserID=null, Upgrade-Insecure-Requests=1, User-Agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36}
You should get two headers populated:
CamelNettyLocalAddress and CamelNettyRemoteAddress.
See this thread, where the debug log of netty-http shows them clearly:
http://camel.465427.n5.nabble.com/How-to-create-case-insensitive-URI-route-with-netty4-http-td5766517.html#a5766558
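Once the header is there, pulling the IP out of it could look roughly like the sketch below. It assumes the CamelNettyRemoteAddress value is a java.net.InetSocketAddress (which is what Netty normally reports for TCP channels, but worth checking in a debugger), and it uses a plain Map as a stand-in for the Camel message headers so that it runs without Camel on the classpath. RemoteAddressHelper and clientIp are made-up names.

```java
import java.net.InetSocketAddress;
import java.util.Map;

// Hypothetical helper; the Map stands in for the Camel message headers.
final class RemoteAddressHelper {

    static final String REMOTE_ADDRESS_HEADER = "CamelNettyRemoteAddress";

    // Returns the caller's IP, or null when the header is missing or of an unexpected type.
    static String clientIp(Map<String, Object> headers) {
        Object value = headers.get(REMOTE_ADDRESS_HEADER);
        if (value instanceof InetSocketAddress) {
            InetSocketAddress socketAddress = (InetSocketAddress) value;
            // getAddress() is null for unresolved addresses, so fall back to the host string.
            return socketAddress.getAddress() != null
                    ? socketAddress.getAddress().getHostAddress()
                    : socketAddress.getHostString();
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> headers =
                Map.of(REMOTE_ADDRESS_HEADER, new InetSocketAddress("192.0.2.10", 54321));
        System.out.println(clientIp(headers));
    }
}
```

Inside a route, the same helper could be fed with exchange.getIn().getHeaders(). If the service sits behind a reverse proxy, this address is the proxy's, not the original caller's.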
Using phantomjs page.evaluate to extract "resultStats" (div id) from http://www.google.com/search/?q=site:%s works on my local server but not on production server.
NOTE: I'm using the latest phantomjs 1.9.7, however I experienced the same issue with the previous version 1.9.6
NOTE: Phantomjs page.render (on Google home page as well as any other domain name) is working on both servers and creates nice screenshots.
On my production server (Debian stable 7.3 @linode.com) the PHP code below for a top level domain name as the "$url" returns:
TypeError: 'null' is not an object (evaluating 'document.getElementById('resultStats').textContent') phantomjs://webpage.evaluate():2 phantomjs://webpage.evaluate():3 phantomjs://webpage.evaluate():3 null
On my local server (debian testing) the PHP code below for the same "$url" returns:
About 43 results
This happens with any domain name/url I use as the argument - I've tested it on dozens.
What might cause this to occur in my remote production server and not my local server?
gsiteindex.js
var page = require('webpage').create(), site;
var site = phantom.args[0];
page.open("https://www.google.com/search?q=site:" + site, function (status) {
var result = page.evaluate(function () {
return document.getElementById('resultStats').textContent;
});
console.info(result);
phantom.exit();
});
.php
$phantomjs = "phantomjs";
$script = "gsiteindex.js";
$site = $url;
$command = "$phantomjs $script $site";
$googlestring = shell_exec($command);
echo $googlestring;
die();
Well, nrabinowitz was right. I tested it more on my own server using proxies: most timed out, some returned the above error, and a couple returned correct results (I assume they were correct based on the location of the proxy's IP address, because the figures were a little different from those I get with my ISP's public IP address in California, USA).
So it's simply a matter of Google blocking certain types of requests from certain IP addresses.
Thanks again for the comment.
Include a header with a user agent, e.g.
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
Without a user agent you get Google's default-style page, which has no resultStats. I also had this issue, and adding the header helped.
The default Google search page looks like this: [image]
When I go to https://appengine.google.com/blobstore/ I see that images are uploaded (cf. blob viewer).
In prod mode :
In the Chrome dev tools, when I submit the form to upload the image, I see that the request stays in a "PENDING" status. The purpose of this post is to help me understand what is failing. In the Network tab, I have the following headers:
Request URL: http://www.mananaseguro.com/_ah/upload/AMmfu6aImWsEdAeiy_FVrscqQiRoRSvjK2QSX6thgKTaMk4nKLbiJg86RrocrzAqWj2X2vi1gKrY_Yvr2kSQNpFMwxBiUFa1Tk5oEVZGjhMm_9SavhOAjNoteylbfLT7aZ5dUYMaDR2N/ALBNUaYAAAAAUeJL86BP_9txQLF54r96AvYfm1Nuw90l/
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Content-Type:multipart/form-data; boundary=----WebKitFormBoundarymTEOMdBbtHBxoFJb
Origin:http://www.mananaseguro.com
Referer:http://www.mananaseguro.com/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36
Request Payload
------WebKitFormBoundarymTEOMdBbtHBxoFJb
Content-Disposition: form-data; name="myFile"; filename="trailer-8.webp"
Content-Type: image/webp
------WebKitFormBoundarymTEOMdBbtHBxoFJb--
And in the Response tab:
**This request has no data response**
I have set up the following servlet configuration (Guice):
serve("/_ah/upload").with(BlobstoreUploadFinishedServlet.class); // post
serve("/_ah/serve").with(Servlet_Serve.class);
And some related lines of code that I call:
resp.sendRedirect("/_ah/serve?blob-key=" + url);
String url = blobstoreService.createUploadUrl("/_ah/upload");
Question: Can you explain why I do not get a response (the form POST is always pending)?
I also do not know whether I should use "/_ah/" or not (I have decided to put it everywhere).
I have another issue in dev mode: I cannot test uploads to the blobstore, because I get the following error:
WARNING: /_ah/upload/ag1tYW5hbmFzZWd1cm8xch0LEhVfX0Jsb2JVcGxvYWRTZXNzaW9uX18Y2aQCDA
java.lang.NullPointerException
at com.google.appengine.api.blobstore.dev.UploadBlobServlet.getSessionId(UploadBlobServlet.java:134)
Question: What is happening? Is it a problem related to cookies?
Thank you.
I remember dealing with this issue a while back. Sadly, this happens because the blobstore implementation in devmode is different from the production one.
You don't need the /_ah/ prefix, and if I'm not mistaken /_ah/upload is reserved for the blob service, so you shouldn't use it (don't take my word on this).
This is far from optimal, but you can detect on the server side whether you are in devmode
by calling a utility function:
public static boolean isDevelopmentMode() {
return ( SystemProperty.environment.value() ==
SystemProperty.Environment.Value.Development );
}
and implementing a different logic by calling resp.sendRedirect("devmodeServe?blob-key="+blobkey); and binding that servlet to a DevmodeServeServlet class:
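import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;

class DevmodeServeServlet extends HttpServlet {
    BlobstoreService blobstoreService = BlobstoreServiceFactory
            .getBlobstoreService();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Serve the blob identified by the "blob-key" query parameter.
        BlobKey blobKey = new BlobKey(req.getParameter("blob-key"));
        blobstoreService.serve(blobKey, resp);
    }
}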
I also don't remember why, but I think you have to use your computer name instead of 127.0.0.1 or localhost in the browser address bar (you might need to add it to the allowed list in the GWT devmode plugin in your browser).