Blog

Ponderings of a kind

This is my own personal blog; each article is an XML document and the code powering it is hand-cranked in XQuery and XSLT. It is fairly simple and has evolved only as I have needed additional functionality. I plan to Open Source the code once it is a bit more mature; however, if you would like a copy in the meantime, drop me a line.

EXPath HTTP Client and Heavens Above

HTTP Client for picky web-servers

Whilst writing a data mash-up service for the Predict the Sky challenge at the NASA Space Apps hack day at the Met Office, I hit a very strange problem with the EXPath HTTP Client. I needed to scrape data from a web page on the Heavens Above website, http://heavens-above.com/PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET, and so I wrote the following XQuery:

    declare namespace http = "http://expath.org/ns/http-client";

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"/>
    )
            

However, that query would always return an HTTP 404 result:

    <http:response xmlns:http="http://expath.org/ns/http-client" status="404" message="Not Found">
        <http:header name="content-length" value="1176"/>
        <http:header name="content-type" value="text/html"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 14:36:40 GMT"/>
        <http:header name="connection" value="keep-alive"/>
        <http:body media-type="text/html"/>
    </http:response>
            

Now, this seemed very strange to me as I could paste that URL into any web browser and be returned an HTML web page! So I broke out one of my old favourite tools, Wireshark, to examine the differences between the HTTP request made by the EXPath HTTP Client (which is really the Apache HttpComponents HttpClient underneath) and the one made by cURL. I decided to use cURL as it's very simple, and so I knew it would not insert unnecessary headers into a request; of course, I made sure it worked first!

cURL HTTP conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
    Host: heavens-above.com
    Accept: */*
            
    HTTP/1.1 200 OK
    Content-Length: 6228     
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASP.NET_SessionId=omogf40spcfeh03hvveie1ca; path=/; HttpOnly
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:47:51 GMT
    Connection: keep-alive
            
EXPath HTTP Client HTTP conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    Host: heavens-above.com
    Connection: Keep-Alive
    User-Agent: Apache-HttpClient/4.1 (java 1.5)
            
    HTTP/1.1 404 Not Found
    Content-Length: 1176     
    Content-Type: text/html
    Server: Microsoft-IIS/7.5
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:48:33 GMT
    Connection: keep-alive
            

So what is going on here? Why does one request for the same URL succeed and the other fail? If we examine the requests, the HTTPClient request includes a 'Connection: Keep-Alive' header whereas the cURL request does not, and the User-Agent header identifies each client. (cURL also sends 'Accept: */*', which the HTTPClient omits, but that turns out not to matter here.)
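
The comparison can also be done mechanically. As a minimal sketch, assuming the request headers are copied by hand from the two captures above into made-up `<headers>` elements (the element names and `local:diff-headers` are hypothetical, not part of any library), the following XQuery lists every header whose value differs between the two requests:

```xquery
(: Hypothetical helper: reports every header whose value differs between
   two requests. Header names are compared case-insensitively. :)
declare function local:diff-headers(
    $a as element(header)*,
    $b as element(header)*
) as element(difference)* {
    for $name in distinct-values(($a, $b)/lower-case(@name))
    let $value-a := $a[lower-case(@name) eq $name]/string(@value)
    let $value-b := $b[lower-case(@name) eq $name]/string(@value)
    where not(deep-equal($value-a, $value-b))
    return
        <difference name="{$name}" curl="{$value-a}" httpclient="{$value-b}"/>
};

(: Request headers copied from the two Wireshark captures. :)
let $curl :=
    <headers>
        <header name="User-Agent" value="curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5"/>
        <header name="Host" value="heavens-above.com"/>
        <header name="Accept" value="*/*"/>
    </headers>
let $httpclient :=
    <headers>
        <header name="Host" value="heavens-above.com"/>
        <header name="Connection" value="Keep-Alive"/>
        <header name="User-Agent" value="Apache-HttpClient/4.1 (java 1.5)"/>
    </headers>
return
    local:diff-headers($curl/header, $httpclient/header)
```

For these two captures it reports three headers: user-agent, connection and accept (the order depends on the implementation). The Host header is identical in both requests and so is not reported.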

Persistent Connections

So what is 'Connection: keep-alive'? The HTTP 1.1 specification describes persistent connections in §8 starting on page 43. Basically, a persistent connection allows multiple HTTP requests and responses to be sent through the same TCP connection for efficiency. The specification states in §8.1.1:

"HTTP implementations SHOULD implement persistent connections."

and subsequently in §8.1.2:

"A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server."

So whilst persistent connections 'SHOULD' be implemented rather than 'MUST' be implemented, the default behaviour is that of persistent connections, which seems a bit, erm... strange! So whether or not the client sends 'Connection: keep-alive', the default for HTTP 1.1 is in effect 'Connection: keep-alive'; therefore, as far as connection handling goes, cURL and HTTPClient are semantically making exactly the same request.
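
The rule quoted from §8.1.2 fits in a few lines of XQuery. This is only a sketch, assuming an HTTP 1.1 request; `local:is-persistent` is a hypothetical helper and not part of the EXPath HTTP Client. The connection is treated as persistent unless the Connection header carries the 'close' token:

```xquery
(: Hypothetical helper for HTTP 1.1 only: a connection is persistent by
   default, unless the Connection header contains the token "close".
   $connection is the value of the Connection header, or the empty
   sequence when the header is absent. :)
declare function local:is-persistent($connection as xs:string?) as xs:boolean {
    let $tokens := tokenize(normalize-space(lower-case($connection)), "\s*,\s*")
    return
        not($tokens = "close")
};

(
    local:is-persistent(()),            (: cURL: no Connection header :)
    local:is-persistent("Keep-Alive"),  (: HTTPClient :)
    local:is-persistent("close")
)
```

The first two calls both return true, which is the sense in which the cURL and HTTPClient requests are equivalent; only the third returns false.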

If both cURL and HTTPClient are making the same request, why do they get different responses from the server? Well, we can check if persistent connections from the HTTPClient are the problem by forcing the HTTPClient to set a 'Connection: close' header as detailed here:

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="Connection" value="close"/>
        </http:request>
    )
            

Unfortunately, we yet again get an HTTP 404 response, which is actually what we should expect if we assume that the implementations and server adhere to the specification. So the only remaining suspect is the User Agent header.

User Agent

The only remaining difference is the User Agent string, but why would such a useful information website block requests from applications written in Java using a very common library? I don't know! So perhaps we should choose a very common User Agent string, for example one from a major web browser, and try the request again:

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/>
        </http:request>
    )                
            

and finally success:

    <http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK">
        <http:header name="cache-control" value="private"/>
        <http:header name="content-type" value="text/html; charset=utf-8"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-aspnet-version" value="4.0.30319"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 15:54:52 GMT"/>
        <http:header name="expires" value="Sun, 29 Apr 2012 15:59:52 GMT"/>
        <http:header name="transfer-encoding" value="chunked"/>
        <http:body media-type="text/html"/>
    </http:response>
    <html xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
        <script src="http://1.2.3.4/bmi-int-js/bmi.js" language="javascript"/>
        <head>
            <title>ISS - Visible Passes </title>
            ...
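
If several pages need to be fetched from a picky server like this one, the header can be set in one place. A minimal sketch, assuming the same EXPath HTTP Client binding as in the queries above; `local:get-as-browser` and `$local:user-agent` are hypothetical names, and some XQuery processors may need an `import module` declaration rather than `declare namespace`:

```xquery
declare namespace http = "http://expath.org/ns/http-client";

(: The browser User-Agent string used in the successful request above. :)
declare variable $local:user-agent as xs:string :=
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19";

(: Hypothetical wrapper: performs an HTTP GET on $uri whilst presenting a
   browser-like User-Agent header. Returns the http:response element
   followed by the response body, as http:send-request does. :)
declare function local:get-as-browser($uri as xs:string) as item()+ {
    http:send-request(
        <http:request method="get" href="{$uri}">
            <http:header name="User-Agent" value="{$local:user-agent}"/>
        </http:request>
    )
};

(: Note that '&' must be written as '&amp;' inside an XQuery string literal. :)
let $response := local:get-as-browser("http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET")
return
    (: the first item is the http:response element; report its status code :)
    $response[1]/@status/string()
```

The wrapper only changes the User-Agent header; everything else about the request is left to the HTTPClient defaults.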
            

Adam Retter posted on Sunday, 29th April 2012 at 14.28 (GMT+01:00)
Updated: Sunday, 29th April 2012 at 14.28 (GMT+01:00)

tags: EXPath, HTTPClient, User Agent, HTTP 1.1, Persistent Connections, XQuery, cURL, IIS
