Content-Encoding in soup – all your gzip are belong to us

By kov

January 17, 2010

One thing everyone forgot to talk about the WebKitGTK+ hackfest was that master Dan Winship added basic Content-Encoding support to libsoup, and patched WebKitGTK+ to use it. If you are using a recent enough version of those you will finally be able to visit web sites that send gzipped content despite the browser saying it could not handle it, like the Internet Archive.

This was one of those cases in which the web shows all of its potential to behave weirdly. The HTTP/1.1 RFC says that if an Accept-Encoding header is not present, the server MAY assume the client accepts any encoding, so we were having many sites send us gzip content even though we did not support it. We then started sending a header saying “we support identity, and nothing else!”.

It turns out the web sucks, so many servers were not happy with a full header, and started giving us angry looks (slashdot, for instance, would not render correctly because it started sending encoded CSS files!). We then simplified the header we were sending, which made those servers happy again. Some sites, though, completely ignored our saying we didn’t support anything except identity, and sent us gzipped content anyway. Most of these were misbehaving caches (this was the case for Wikipedia), so would work after you asked for a forced reload, which would ignore the cache, but some servers, such as the Internet Archive’s didn’t really want to talk about encodings – they only wanted to send gzip-encoded content.

So, in the end, our only way out was implementing the damn encoding support, which finally happened during the hackfest. Take that, web!