HTTP Client for picky web-servers
Whilst writing a data mash-up service for the Predict the Sky challenge at the NASA
Space Apps hack day at the Met Office, I hit a very strange problem with the EXPath
HTTP Client. I needed to scrape data from a web page on the Heavens Above website (http://heavens-above.com/PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET),
so I wrote the following XQuery:
declare namespace http = "http://expath.org/ns/http-client";
http:send-request(
<http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"/>
)
However, that query would always return an HTTP 404 result:
<http:response xmlns:http="http://expath.org/ns/http-client" status="404" message="Not Found">
<http:header name="content-length" value="1176"/>
<http:header name="content-type" value="text/html"/>
<http:header name="server" value="Microsoft-IIS/7.5"/>
<http:header name="x-powered-by" value="ASP.NET"/>
<http:header name="date" value="Sun, 29 Apr 2012 14:36:40 GMT"/>
<http:header name="connection" value="keep-alive"/>
<http:body media-type="text/html"/>
</http:response>
Now, this seemed very strange to me, as I could paste that URL into any web browser
and be returned an HTML web page! So I broke out one of my old favourite tools, Wireshark, to examine the differences between the HTTP request made by the EXPath HTTP Client
(which is really the Apache HttpComponents HttpClient underneath) and the one made by cURL. I decided to use cURL because it is very simple, so I knew it would not insert
unnecessary headers into a request. Of course, I made sure it worked first!
cURL HTTP Conversation
GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
Host: heavens-above.com
Accept: */*
HTTP/1.1 200 OK
Content-Length: 6228
Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=omogf40spcfeh03hvveie1ca; path=/; HttpOnly
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sun, 29 Apr 2012 14:47:51 GMT
Connection: keep-alive
EXPath HTTP Client HTTP Conversation
GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
Host: heavens-above.com
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
HTTP/1.1 404 Not Found
Content-Length: 1176
Content-Type: text/html
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Sun, 29 Apr 2012 14:48:33 GMT
Connection: keep-alive
So what is going on here? Why does one request for the same URL succeed and the other
fail? If we examine the requests, and set aside the 'Accept: */*' header that cURL adds, the only differences are that the HTTPClient request
includes a 'Connection: Keep-Alive' header whereas the cURL request does not, and
that the User-Agent header identifies each client.
Persistent Connections
So what is 'Connection: keep-alive'? The HTTP 1.1 specification describes persistent connections in §8, starting on page 43. Basically, a persistent
connection allows multiple HTTP requests and responses to be sent through the same
TCP connection for efficiency. The specification states in §8.1.1:
"HTTP implementations SHOULD implement persistent connections."
and subsequently in §8.1.2:
"A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent
connections are the default behavior of any HTTP connection. That is, unless otherwise
indicated, the client SHOULD assume that the server will maintain a persistent connection,
even after error responses from the server."
So whilst persistent connections 'SHOULD' be implemented rather than 'MUST' be implemented,
the default behaviour is that of persistent connections, which seems a bit, erm...
strange! So whether the client sends 'Connection: keep-alive' or not, the default
is in effect 'Connection: keep-alive' for HTTP 1.1, therefore cURL and HTTPClient
are semantically making exactly the same request.
If both cURL and HTTPClient are making the same request, why do they get different
responses from the server? Well, we can check if persistent connections from the HTTPClient
are the problem by forcing the HTTPClient to set a 'Connection: close' header as detailed
here:
http:send-request(
<http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
<http:header name="Connection" value="close"/>
</http:request>
)
Unfortunately we yet again get an HTTP 404 response, which is actually what we should expect if we
assume that the implementations and server adhere to the specification. So the only
remaining difference is the User-Agent header.
User Agent
The only remaining difference is the User-Agent string, but why would such a useful
information website block requests from applications written in Java using a very common
library? I don't know! So perhaps we should choose a very common User-Agent string,
for example one from a major web browser, and try the request again:
http:send-request(
<http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
<http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/>
</http:request>
)
and finally success:
<http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK">
<http:header name="cache-control" value="private"/>
<http:header name="content-type" value="text/html; charset=utf-8"/>
<http:header name="server" value="Microsoft-IIS/7.5"/>
<http:header name="x-aspnet-version" value="4.0.30319"/>
<http:header name="x-powered-by" value="ASP.NET"/>
<http:header name="date" value="Sun, 29 Apr 2012 15:54:52 GMT"/>
<http:header name="expires" value="Sun, 29 Apr 2012 15:59:52 GMT"/>
<http:header name="transfer-encoding" value="chunked"/>
<http:body media-type="text/html"/>
</http:response>
<html xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<script src="http://1.2.3.4/bmi-int-js/bmi.js" language="javascript"/>
<head>
<title>ISS - Visible Passes </title>
...
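If you need to do this for more than one page, the workaround can be wrapped up in a small function. What follows is only a minimal sketch: the function name local:get-as-browser and the error code local:http-error are my own hypothetical names, not part of the EXPath HTTP Client, and depending on your XQuery processor you may need to import the HTTP Client module rather than just declare its namespace. It assumes that http:send-request returns the http:response element first, followed by the body.

```xquery
xquery version "1.0";

declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper (not part of the EXPath HTTP Client):
   sends a GET request carrying a browser-like User-Agent header. :)
declare function local:get-as-browser($uri as xs:string) as item()+ {
  http:send-request(
    <http:request method="get" href="{$uri}">
      <http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/>
    </http:request>
  )
};

(: Ampersands must be written as &amp; inside XQuery string literals. :)
let $uri := "http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"
let $result := local:get-as-browser($uri)
let $response := $result[1]  (: the http:response element :)
let $body := $result[2]      (: the response body, if there is one :)
return
  if ($response/@status = "200") then
    $body
  else
    (: Fail loudly rather than silently scraping an error page. :)
    error(
      xs:QName("local:http-error"),
      concat("Unexpected HTTP status: ", $response/@status)
    )
```

Checking the status attribute before touching the body means that a picky server's 404 surfaces as an error, instead of an empty result further down the mash-up.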
Adam Retter posted on Sunday, 29th April 2012 at 14.28 (GMT+01:00)
Updated: Sunday, 29th April 2012 at 14.28 (GMT+01:00)
tags: EXPath, HTTPClient, User Agent, HTTP 1.1, Persistent Connections, XQuery, cURL, IIS