Content-Encoding in soup – all your gzip are belong to us

One thing everyone forgot to mention about the WebKitGTK+ hackfest is that master Dan Winship added basic Content-Encoding support to libsoup and patched WebKitGTK+ to use it. If you are using recent enough versions of both, you will finally be able to visit web sites, such as the Internet Archive, that send gzipped content even though the browser said it could not handle it.

This was one of those cases in which the web shows all of its potential to behave weirdly. The HTTP/1.1 RFC says that if an Accept-Encoding header is not present, the server MAY assume the client accepts any encoding, so many sites were sending us gzipped content even though we did not support it. We then started sending a header saying “we support identity, and nothing else!”.
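
To make that rule concrete, here is a minimal Python sketch of how a server might decide whether gzip is acceptable to a client. The names `parse_accept_encoding` and `may_send_gzip` are made up for illustration, the request headers are assumed to be a plain dict keyed by the exact string "Accept-Encoding", and the parsing is simplified. It is not how any particular server does it.

```python
def parse_accept_encoding(value):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict (simplified)."""
    codings = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        params = params.strip()
        q = 1.0  # a coding without an explicit qvalue counts as fully acceptable
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable qvalue as "not acceptable"
        codings[name.strip().lower()] = q
    return codings


def may_send_gzip(request_headers):
    """Return True if a server following the RFC rule may answer with gzip."""
    value = request_headers.get("Accept-Encoding")
    if value is None:
        # Header absent: the server MAY assume the client accepts any coding.
        return True
    codings = parse_accept_encoding(value)
    if "gzip" in codings:
        return codings["gzip"] > 0
    # Not listed explicitly: only acceptable if a wildcard allows it.
    return codings.get("*", 0.0) > 0
```

With no header at all, `may_send_gzip({})` returns `True`, which is the situation we were in before we started sending one.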

It turns out the web sucks, so many servers were not happy with a full header and started giving us angry looks (Slashdot, for instance, would not render correctly because it started sending encoded CSS files!). We then simplified the header we were sending, which made those servers happy again. Some sites, though, completely ignored our saying we didn’t support anything except identity, and sent us gzipped content anyway. Most of these were misbehaving caches (this was the case for Wikipedia), so the pages would work after you asked for a forced reload, which bypasses the cache. But some servers, such as the Internet Archive’s, didn’t really want to talk about encodings: they only wanted to send gzip-encoded content.

So, in the end, our only way out was implementing the damn encoding support, which finally happened during the hackfest. Take that, web!
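
For the curious, the client side of this boils down to something like the Python sketch below. This is only an illustration of the idea, not the libsoup code: `decode_body` is a hypothetical helper, and it only handles gzip (plus its `x-gzip` alias) and identity.

```python
import gzip


def decode_body(content_encoding, body):
    """Undo the Content-Encoding applied to a response body (illustrative only)."""
    # A missing header means the body was sent as-is.
    coding = (content_encoding or "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if coding == "identity":
        return body
    raise ValueError("unsupported Content-Encoding: %s" % coding)
```

The point is that the decision is driven by what the response says it contains, not by what the request asked for, so a server that insists on gzip still gets rendered.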

3 Replies to “Content-Encoding in soup – all your gzip are belong to us”

  1. Oh, that’s a very nice improvement! I was already wondering why Wikipedia pages would sometimes show as garbage – thanks for the explanation!

  2. @alvherre: that doesn’t seem to be the same issue. That site does not even advertise that it is going to send gzip:

    kov@goiaba ~> env http_proxy= wget -S --cache=off -O /dev/null
    --2010-04-07 17:57:26--
    Connecting to||:80... connected.
    HTTP request sent, awaiting response...
    HTTP/1.1 200 OK
    Date: Wed, 07 Apr 2010 20:57:32 GMT
    Server: Apache/2.0.63 (FreeBSD)
    Connection: close
    Content-Type: text/html
    Length: unspecified [text/html]
    Saving to: `/dev/null'

    I believe you have a different problem there. Can you try running with WEBKIT_DEBUG=network, and see what the request/response pair says?
