I have a list of URLs and I want to fetch all of the web pages. Here is what I have done:
for each url:
    getaddrinfo(hostname, port, &hints, &res); // DNS
    // create socket
    sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(sockfd, res->ai_addr, res->ai_addrlen);
    creatGET();
    /* for example:
       GET / HTTP/1.1\r\n
       Host: stackoverflow.cn\r\n
       ...
    */
    writeHead(); // send GET head to host
    recv();      // get the webpage content
end
I have noticed that many URLs are under the same host, for example:
http://job.01hr.com/j/f-6164230.html
http://job.01hr.com/j/f-6184336.html
http://www.012yy.com/gangtaiju/32692/
http://www.012yy.com/gangtaiju/35162/
So I wonder: can I connect just once to each host, and then just creatGET(), writeHead() and recv() for each URL? That could save a lot of time. So I changed my program like this:
split the urls into groups by their host;
for each group:
    get the hostname of the group;
    getaddrinfo(hostname, port, &hints, &res);
    sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(sockfd, res->ai_addr, res->ai_addrlen);
    for each url in the group:
        creatGET();
        writeHead();
        recv();
    end
end
Unfortunately, I find that my program only gets the first web page in each group back; the rest all return empty files.
Am I missing something? Maybe the sockfd needs some kind of reset for each recv()?
Thank you for your generous help.
HTTP/1.1 connections are persistent, meaning that after, for example, a POST/GET - 200 OK sequence, the next request-response exchange can reuse the already established TCP connection.
But this is not mandatory. The connection could close at any time, so you should code for that as well.
It also seems to me that you are trying to implement your own HTTP client.
I am not sure why you would want to do that, but if you must, you should read a bit of the HTTP RFC to understand the various headers that keep the underlying TCP connection open as long as possible.
Of course, if your server is an old HTTP/1.0 server, you should not expect any connection reuse unless it is explicitly indicated via keep-alive headers.
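For what it's worth, the key to actually reusing one connection for several requests is knowing where each response ends: on a persistent connection you cannot simply recv() until the peer closes, because the server keeps the socket open. The usual approach is to parse the Content-Length header (or handle chunked transfer encoding). Below is a minimal sketch of that idea, assuming a blocking socket, plain Content-Length responses with that exact capitalization, and that you send one request at a time and read its full response before sending the next; read_one_response is an illustrative helper, not part of your existing code.

/* Sketch only: read exactly one HTTP/1.1 response from an already connected,
 * blocking socket. Error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

static int read_one_response(int sockfd, FILE *out)
{
    char hdr[8192];
    size_t used = 0;
    char *body = NULL;

    /* 1. Read until the blank line that terminates the headers. */
    while (used < sizeof(hdr) - 1) {
        ssize_t n = recv(sockfd, hdr + used, sizeof(hdr) - 1 - used, 0);
        if (n <= 0)
            return -1;                     /* server closed or error */
        used += (size_t)n;
        hdr[used] = '\0';
        if ((body = strstr(hdr, "\r\n\r\n")) != NULL)
            break;
    }
    if (body == NULL)
        return -1;
    body += 4;                             /* first byte of the body, if any */

    /* 2. Find the Content-Length header. */
    long content_length = -1;
    char *cl = strstr(hdr, "Content-Length:");
    if (cl != NULL)
        content_length = strtol(cl + strlen("Content-Length:"), NULL, 10);
    if (content_length < 0)
        return -1;                         /* chunked/other: not handled here */

    /* 3. Write the body bytes already read, then read the rest. */
    long have = (long)(used - (size_t)(body - hdr));
    fwrite(body, 1, (size_t)have, out);
    while (have < content_length) {
        char buf[4096];
        ssize_t n = recv(sockfd, buf, sizeof(buf), 0);
        if (n <= 0)
            return -1;
        fwrite(buf, 1, (size_t)n, out);
        have += n;
    }
    return 0;                              /* connection still usable */
}

You would call something like this once per URL after writeHead() on the same sockfd, and reconnect only when it fails (for example when the server sends Connection: close).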
I want to forward a message from one user to the others, with a server in between.
I accept the connection with connfd = accept(tcpfd, (struct sockaddr*)&cliaddr, &len); I saw some code on the internet that uses the return value of accept() to identify the different clients that connect to the server, but when I checked the return value of accept() for 5 clients, all of them returned 4.
I can send a message to the first client with send(connfd, *, *, *); but how do I get something for the other clients that I can pass to send() in order to message them?
Thanks in advance.
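One note on the accept() return value: it is a new socket descriptor for each accepted connection, and most systems hand out the lowest free number, so seeing 4 every time usually just means the previous descriptor had already been closed when the next client connected. To send to a particular client later, you keep every descriptor that accept() returns, for example in an array. A minimal sketch of that idea follows; the names and the fixed client limit are illustrative, not from your code.

/* Sketch only: remember each accepted descriptor so the server can later
 * send() to any particular client. */
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define MAX_CLIENTS 32

int client_fds[MAX_CLIENTS];
int client_count = 0;

void accept_loop(int tcpfd)
{
    struct sockaddr_in cliaddr;
    socklen_t len;

    while (client_count < MAX_CLIENTS) {
        len = sizeof(cliaddr);                  /* reset before each accept */
        int connfd = accept(tcpfd, (struct sockaddr *)&cliaddr, &len);
        if (connfd < 0)
            continue;                           /* accept failed, try again */
        client_fds[client_count++] = connfd;    /* remember this client */
    }
}

/* Forward a message from one client to all of the others. */
void forward_to_others(int from_fd, const char *msg, size_t msglen)
{
    int i;
    for (i = 0; i < client_count; i++) {
        if (client_fds[i] != from_fd)
            send(client_fds[i], msg, msglen, 0);
    }
}

In a real server you would combine this with select()/poll() or one thread per client, so that waiting in accept() does not stop you from receiving data on the descriptors you already have.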
I am trying to create a client application that will use asynchronous I/O with IOCP. I have already written a similar server application and it works correctly; however, I can't find any information on how to extract the local endpoint information from a socket that is connected via the ConnectEx API.
With server sockets, the documentation states that info about both the local and remote endpoints will be part of the buffer passed to AcceptEx. There is no similar thing in ConnectEx. I tried to extract the local endpoint information through getsockname, however this returned some garbage values. I also tried to use setsockopt(clientSocket, SOL_SOCKET, SO_UPDATE_CONNECT_CONTEXT, ...) prior to calling getsockname, however the result was the same as without it. Is there a way to do this, or am I misunderstanding something?
I also use the ConnectEx() function, and when the asynchronous connect operation completes on the IOCP, I normally call
::setsockopt(m_Socket, SOL_SOCKET, SO_UPDATE_CONNECT_CONTEXT, nullptr, 0);
where m_Socket is my connecting socket on the client side.
It works, and then I can get the local name with ::getsockname() like this:
#define GoClearStruct(Struct) memset(&Struct, 0, sizeof(Struct))

sockaddr_storage _AddrStorage;
sockaddr *_Addr = (sockaddr *)&_AddrStorage;
int _AddrLen = sizeof(_AddrStorage);

GoClearStruct(_AddrStorage);

if (::getsockname(m_Socket, _Addr, &_AddrLen) != SOCKET_ERROR)
{
    // extract the local name from _Addr / _AddrLen
}
It is difficult to find the problem in your code; could you share more of it?
Did you check the return values of ::setsockopt() and ::getsockname()?
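To make those checks concrete, here is a small sketch of how they might look, assuming a plain Winsock2 build; if either call fails, WSAGetLastError() should explain why ::getsockname() appears to return garbage. The function name is illustrative.

/* Sketch only: check both calls after the ConnectEx completion is dequeued. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <mswsock.h>
#include <string.h>
#include <stdio.h>

void UpdateAndQueryLocalName(SOCKET s)
{
    if (setsockopt(s, SOL_SOCKET, SO_UPDATE_CONNECT_CONTEXT,
                   NULL, 0) == SOCKET_ERROR) {
        printf("SO_UPDATE_CONNECT_CONTEXT failed: %d\n", WSAGetLastError());
        return;
    }

    struct sockaddr_storage addr;
    int addrLen = (int)sizeof(addr);
    memset(&addr, 0, sizeof(addr));

    if (getsockname(s, (struct sockaddr *)&addr, &addrLen) == SOCKET_ERROR) {
        printf("getsockname failed: %d\n", WSAGetLastError());
        return;
    }

    /* addr now holds the local endpoint; interpret it as sockaddr_in or
       sockaddr_in6 depending on addr.ss_family. */
}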
I am unable to receive the responses to multiple HTTP requests when I attempt to enqueue data to send to a server.
We are able to establish a connection to a server and immediately issue an HTTP request inside the connected_callback() function (called as soon as a connection to the server is established) using the tcp_write() function. However, if I attempt to generate two or more HTTP requests using the following syntax:
err_t connected_callback(void *arg, struct tcp_pcb *tpcb, err_t err) {
    xil_printf("Connected to JUPITER server\n\r");
    LWIP_UNUSED_ARG(arg);

    /* set callback values & functions */
    tcp_sent(tpcb, sent_callback);
    tcp_recv(tpcb, recv_callback);

    if (err == ERR_OK) {
        char* request = "GET /circuits.json HTTP/1.1\r\n"
                        "Host: jupiter.info.polymtl.ca\r\n\r\n";
        (void) tcp_write(tpcb, request, 100, 1);

        request = "GET /livrable1/simulation.dee HTTP/1.1\r\n"
                  "Host: jupiter.info.polymtl.ca\r\n\r\n";
        (void) tcp_write(tpcb, request, 100, 1);

        tcp_output(tpcb);
        xil_printf("tcp_write \n");
    } else {
        xil_printf("Unable to connect to server");
    }

    return err;
}
I manage to send all of the data to the server, but I never receive any data for the second HTTP request. I can print the payload of the first request (the JSON file), but I never receive anything for the .dee file. Are there any specific instructions for enqueueing HTTP requests together with lwIP, or am I missing something?
If you require any more code to accurately analyze my problem, feel free to say so.
Thanks!
The problem I see is that you have a double \r\n combination at the end of your request header statement.
You need the \r\n\r\n only at the end of your headers; right now you have it twice. Remove it from the first write.
I am making an HTTP client socket in C. So far I have made a custom URL parser, and now the problem is connecting to absolute URLs. The program works fine with relative URLs but not absolute ones.
Here is sample output for the results of both absolute and relative URLs:
absolute url: http://www.google.com
relative url: http://techpatterns.com/downloads/firefox/useragentswitcher.xml
For an absolute URL it gives a 301/302 status code, while for a relative URL the status is 200 OK.
Here is sample code of the key areas:
char ip[100], *path, *domain, *abs_domain, *proto3;
char *user_agent = "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";
char *accept_type = "Accept: text/html, application/xhtml+xml, */*\r\nAccept-Language: en-US\r\n";
char *encoding = "Accept-Encoding: gzip, deflate\r\n";
char *proxy_conn = "Proxy-Connection: Keep-Alive\r\n";
char hostname[1000];

url:
fgets(hostname, sizeof(hostname), stdin);
for (i = 0; i < strlen(hostname); i++) { // remove the trailing newline
    if (hostname[i] == '\n') {
        hostname[i] = '\0';
    }
}

proto3 = get_protocol(hostname); // get protocol, i.e. http, ftp, etc.

// get domain, e.g. http://mail.google.com/index  -> mail.google.com
//                  http://www.google.com/ssl_he  -> www.google.com
domain = get_domain(hostname);
if (strlen(domain) == 0) {
    printf("invalid url\n\n");
    goto url;
}

abs_domain = get_abs_domain(hostname); // gets abs domain: google.com, facebook.com, etc.
path = get_path(hostname);

// getting the IP address from the hostname
if ((he = gethostbyname(abs_domain)) == NULL)
{
    printf("gethostbyname failed : %d", WSAGetLastError());
    goto url;
}

// Cast h_addr_list to in_addr, since h_addr_list also holds the IP address, only in long format
addr_list = (struct in_addr **) he->h_addr_list;
for (i = 0; addr_list[i] != NULL; i++)
{
    // Return the first one;
    strcpy(ip, inet_ntoa(*addr_list[i]));
}

clientService.sin_addr.s_addr = inet_addr(ip);
clientService.sin_family = AF_INET;
clientService.sin_port = htons(80);

sprintf(sendbuf, "GET /%s HTTP/1.1\r\n%sUser-Agent: %s\r\nHost: %s\r\n\r\n", path, accept_type, user_agent, abs_domain);
Brief explanation of the code:
For example, if the URL entered by the user is http://mail.deenze.com/control_panel/index.php:
the protocol will be -> http
the domain will be -> mail.deenze.com
the abs_domain will be -> deenze.com
the path will be -> control_panel/index.php
Finally, these values, in conjunction with the user agent, are used to send the request.
301 and 302 status codes are redirects, not errors. They indicate that you should try the request at a different URL instead.
In this case, it looks like despite the fact that you entered the URL http://www.google.com/, the Host header you are sending only includes google.com. Google is sending you back a redirect telling you to use www.google.com instead.
I notice that you seem to have a get_abs_domain function that is stripping the www off; there is no reason you should do this. www.google.com and google.com are different hostnames, and may give you entirely different contents. In practice, most sites will give you the same result for them, but you can't depend on that; some will redirect from one to the other, some will simply serve up the same content, and some may only work at one or the other.
Instead of trying to rewrite one to the other, you should just follow whatever redirect you are returned.
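To make that concrete, here is a hedged sketch of the first step in following a redirect at this level: pull the Location header out of the 3xx response you already received, then re-run the request against that URL (and cap how many redirects you will follow). find_location is an illustrative helper, not an existing API, and it does a case-sensitive match for simplicity.

/* Sketch only: extract the Location header from a redirect response that is
 * already in a NUL-terminated buffer. Returns 1 if found, 0 otherwise. */
#include <string.h>

int find_location(const char *response, char *out, size_t outsize)
{
    const char *p = strstr(response, "\r\nLocation:");
    const char *end;

    if (p == NULL)
        return 0;
    p += strlen("\r\nLocation:");
    while (*p == ' ')
        p++;                                  /* skip optional whitespace */

    end = strstr(p, "\r\n");                  /* header value ends at CRLF */
    if (end == NULL || (size_t)(end - p) >= outsize)
        return 0;

    memcpy(out, p, (size_t)(end - p));
    out[end - p] = '\0';                      /* e.g. "http://www.google.com/" */
    return 1;
}

If the status line says 301 or 302, you would feed the extracted URL back through your parser and repeat the request; and, as noted in the comment below, a redirect to an https:// URL additionally requires SSL support.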
I would recommend using an existing HTTP client library rather than trying to write your own (unless this is just an exercise for your own edification). For example, there's cURL if you want to be portable or HttpClient if you only need to work on Windows (based on your screenshots, I'm assuming that's the platform you're using). There is a lot of complexity in writing an HTTP client that can actually handle most of the web; SSL, compression, redirects, chunked transfer encoding, etc.
@Brian Campbell, I think the problem was the www, because if I use www.google.com it gives me a redirect URL, https://www.google.com/?gws_rd=ssl, the same as my browser. But because it is https, I think I will have to use SSL. Thanks for your answer.
I can't copy-paste the text from my terminal, but I have increased the font size for visibility purposes.
I'm working on an old-school Linux variant (QNX, to be exact) and need a way to grab a web page (no cookies or login; the target URL is just a text file) using nothing but sockets and arrays.
Anyone got a snippet for this?
Note: I don't control the server, and I've got very little to work with besides what is already on the box (adding additional libraries is not really "easy" given the constraints -- although I do love libcurl).
I'd look at libcurl if you want SSL support or anything fancy.
However, if you just want to get a simple web page from port 80, then just open a TCP socket, send "GET /index.html HTTP/1.0\r\n\r\n" and parse the output.
I do have some code, but it also supports (Open)SSL so it's a bit long to post here.
In essence:
parse the URL (split out the URL scheme, host name, port number, and scheme-specific part)
create the socket:
    s = socket(PF_INET, SOCK_STREAM, proto);
populate a sockaddr_in structure with the remote IP and port
connect the socket to the far end:
    err = connect(s, &addr, sizeof(addr));
make the request string:
    n = snprintf(headers, sizeof(headers), "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", ...);
send the request string:
    write(s, headers, n);
read the data:
    while ((n = read(s, buffer, bufsize)) > 0) {
        ...
    }
close the socket:
    close(s);
NB: the pseudo-code above collects both the response headers and the data; the split between the two is the first blank line.
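Putting those steps together, here is a minimal self-contained sketch, assuming plain HTTP on port 80, an IPv4-only host, and no redirect or error handling beyond the basics; fetch() and its arguments are illustrative names.

/* Sketch only: HTTP/1.0 GET over a plain TCP socket, dumping everything the
 * server returns (headers, blank line, then body) to stdout. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

int fetch(const char *host, const char *path)
{
    struct hostent *he = gethostbyname(host);        /* resolve the host */
    if (he == NULL)
        return -1;

    int s = socket(PF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }

    char headers[1024];
    int n = snprintf(headers, sizeof(headers),
                     "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    if (write(s, headers, n) != n) {
        close(s);
        return -1;
    }

    char buffer[4096];
    while ((n = read(s, buffer, sizeof(buffer))) > 0)
        fwrite(buffer, 1, (size_t)n, stdout);

    close(s);
    return 0;
}

/* e.g. fetch("example.com", "somedir/somefile.txt"); */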