EXPath HTTP Client and Heavens Above
Whilst writing a data mash-up service for the Predict the Sky challenge at the NASA Space Apps hack day at the Met Office, I hit a very strange problem with the EXPath HTTP Client. I needed to scrape data from a web page on the Heavens Above website, http://heavens-above.com/PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET, so I wrote the following XQuery:
    declare namespace http = "http://expath.org/ns/http-client";

    (: Inside an XQuery attribute value, '&' must be written as '&amp;' :)
    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"/>
    )
However, that query would always return an HTTP 404 result:
    <http:response xmlns:http="http://expath.org/ns/http-client" status="404" message="Not Found">
        <http:header name="content-length" value="1176"/>
        <http:header name="content-type" value="text/html"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 14:36:40 GMT"/>
        <http:header name="connection" value="keep-alive"/>
        <http:body media-type="text/html"/>
    </http:response>
Now, this seemed very strange to me, as I could paste that URL into any web browser and be returned an HTML web page! So I broke out one of my old favourite tools, Wireshark, to examine the differences between the HTTP request made by the EXPath HTTP Client (which is really the Apache HttpComponents HttpClient underneath) and the one made by cURL. I decided to use cURL as it's very simple, so I knew it would not insert unnecessary headers into a request. Of course, I made sure it worked first!
cURL HTTP Conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
    Host: heavens-above.com
    Accept: */*

    HTTP/1.1 200 OK
    Content-Length: 6228
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASP.NET_SessionId=omogf40spcfeh03hvveie1ca; path=/; HttpOnly
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:47:51 GMT
    Connection: keep-alive
EXPath HTTP Client HTTP Conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    Host: heavens-above.com
    Connection: Keep-Alive
    User-Agent: Apache-HttpClient/4.1 (java 1.5)

    HTTP/1.1 404 Not Found
    Content-Length: 1176
    Content-Type: text/html
    Server: Microsoft-IIS/7.5
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:48:33 GMT
    Connection: keep-alive
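So what is going on here? Why does one request for the same URL succeed and the other fail? If we examine the two requests, the differences are small: the HTTPClient request includes a 'Connection: Keep-Alive' header whereas the cURL request does not, the cURL request includes an 'Accept: */*' header (which simply says that any media type is acceptable) whereas the HTTPClient request does not, and the User-Agent header identifies each client.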
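The comparison can also be done mechanically rather than by eye. Below is a minimal sketch, assuming the request headers from the two Wireshark traces have been transcribed by hand into plain `header` elements; the function `local:header-diff` and the `header` and `diff` element names are my own invention for illustration and are not part of the EXPath HTTP Client.

```xquery
xquery version "3.0";

(: Hypothetical helper: report every header whose presence or value
   differs between two requests. Header names are compared
   case-insensitively, as HTTP requires. :)
declare function local:header-diff(
    $first as element(header)*,
    $second as element(header)*
) as element(diff)* {
    for $name in distinct-values(($first, $second) ! lower-case(@name))
    let $v1 := $first[lower-case(@name) eq $name]/string(@value)
    let $v2 := $second[lower-case(@name) eq $name]/string(@value)
    where not(deep-equal($v1, $v2))
    order by $name
    return <diff name="{$name}" first="{$v1}" second="{$v2}"/>
};

(: Request headers copied from the cURL trace :)
let $curl := (
    <header name="User-Agent" value="curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5"/>,
    <header name="Host" value="heavens-above.com"/>,
    <header name="Accept" value="*/*"/>
)
(: Request headers copied from the EXPath HTTP Client trace :)
let $httpclient := (
    <header name="Host" value="heavens-above.com"/>,
    <header name="Connection" value="Keep-Alive"/>,
    <header name="User-Agent" value="Apache-HttpClient/4.1 (java 1.5)"/>
)
return local:header-diff($curl, $httpclient)
```

For these two traces the query reports `accept`, `connection` and `user-agent`, and says nothing about `host`, which is identical in both requests.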
Persistent Connections
So what is 'Connection: keep-alive'? The HTTP 1.1 specification (RFC 2616) describes persistent connections in §8 starting on page 43. Basically, a persistent connection allows multiple HTTP requests and responses to be sent through the same TCP connection for efficiency. The specification states in §8.1.1:
"HTTP implementations SHOULD implement persistent connections."
and subsequently in §8.1.2:
"A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server."
So whilst persistent connections 'SHOULD' be implemented rather than 'MUST' be implemented, the default behaviour is that of persistent connections, which seems a bit, erm... strange! So whether the client sends 'Connection: keep-alive' or not, the default for HTTP 1.1 is in effect 'Connection: keep-alive'. As far as the connection is concerned, therefore, cURL and HTTPClient are semantically making exactly the same request.
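To make that rule concrete, here is a small sketch of the decision as I read it from §8.1.2. The function `local:wants-persistent` is hypothetical and deliberately simplified: it only looks at the HTTP version and the Connection header of a single message, and its treatment of HTTP 1.0 (persistent only when 'keep-alive' is sent explicitly) reflects the common convention rather than anything quoted above.

```xquery
xquery version "3.0";

(: Hypothetical helper, a simplification of RFC 2616 §8.1.2: does a message
   with the given HTTP version ("1.0" or "1.1") and Connection header
   value (if any) ask for a persistent connection? :)
declare function local:wants-persistent(
    $http-version as xs:string,
    $connection as xs:string?
) as xs:boolean {
    (: Connection holds a comma-separated, case-insensitive list of tokens :)
    let $tokens :=
        for $t in tokenize(lower-case($connection), ",")
        return normalize-space($t)
    return
        if ($tokens = "close") then false()            (: explicit opt-out :)
        else if ($http-version eq "1.1") then true()   (: persistent by default :)
        else $tokens = "keep-alive"                    (: assumed HTTP 1.0 opt-in :)
};

(
    local:wants-persistent("1.1", ()),            (: cURL: no Connection header :)
    local:wants-persistent("1.1", "Keep-Alive"),  (: HTTPClient :)
    local:wants-persistent("1.1", "close")        (: explicit 'Connection: close' :)
)
```

The first two calls both evaluate to true, which is the point made above: with or without the header, an HTTP 1.1 request asks for a persistent connection. Only the third call, with 'close', evaluates to false.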
If both cURL and HTTPClient are making the same request, why do they get different responses from the server? Well, we can check whether persistent connections from the HTTPClient are the problem by forcing the HTTPClient to send a 'Connection: close' header:
    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="Connection" value="close"/>
        </http:request>
    )
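Unfortunately, we yet again get an HTTP 404 response, which is actually what we should expect if we assume that the client and the server adhere to the specification: whether the connection is kept open or closed should have no bearing on what the server returns. That leaves the User-Agent header as the most suspicious remaining difference.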
User Agent
So we are left with the User-Agent string, but why would such a useful information website block requests from applications written in Java using a very common library? I don't know! So perhaps we should choose a very common User-Agent string, for example one from a major web browser, and try the request again:
    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/>
        </http:request>
    )
and finally success:
    <http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK">
        <http:header name="cache-control" value="private"/>
        <http:header name="content-type" value="text/html; charset=utf-8"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-aspnet-version" value="4.0.30319"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 15:54:52 GMT"/>
        <http:header name="expires" value="Sun, 29 Apr 2012 15:59:52 GMT"/>
        <http:header name="transfer-encoding" value="chunked"/>
        <http:body media-type="text/html"/>
    </http:response>
    <html xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
        <script src="http://1.2.3.4/bmi-int-js/bmi.js" language="javascript"/>
        <head>
            <title>ISS - Visible Passes </title>
            ...
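If you want to repeat the experiment, the two requests can be wrapped up so that only the User-Agent varies. This is a sketch rather than a definitive implementation: `local:get-request` and `local:status` are hypothetical helper names of my own, and depending on your XQuery processor you may need `import module namespace http = "http://expath.org/ns/http-client";` in place of the `declare namespace` used here. I make no promise about the status codes you will see, as the server's behaviour may well have changed since the traces above were captured.

```xquery
xquery version "3.0";

declare namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper: build a GET request carrying an explicit User-Agent.
   The values are passed in as strings, so they are escaped for us when
   the attributes are constructed. :)
declare function local:get-request(
    $href as xs:string,
    $user-agent as xs:string
) as element(http:request) {
    <http:request method="get" href="{$href}">
        <http:header name="User-Agent" value="{$user-agent}"/>
    </http:request>
};

(: Hypothetical helper: send the request and return only the status code.
   http:send-request returns the http:response element as its first item,
   followed by any response bodies. :)
declare function local:status(
    $href as xs:string,
    $user-agent as xs:string
) as xs:integer {
    xs:integer(http:send-request(local:get-request($href, $user-agent))[1]/@status)
};

(: '&' must also be written as '&amp;' inside an XQuery string literal :)
let $url := "http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"
for $ua in (
    "Apache-HttpClient/4.1 (java 1.5)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"
)
return <result user-agent="{$ua}" status="{local:status($url, $ua)}"/>
```

Keeping the request construction in its own function means the User-Agent is the only thing that changes between the two calls, so any difference in status can be attributed to it.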
Adam Retter posted on Sunday, 29th April 2012 at 14.28 (GMT+01:00)
Updated: Sunday, 29th April 2012 at 14.28 (GMT+01:00)