I remembered this other question that suggested using `id_`, and sure enough the page is displayed correctly.
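For reference, the `id_` flag goes right after the snapshot timestamp in the Wayback Machine URL and makes Archive serve the raw capture, without the injected toolbar and scripts. A minimal sketch of building such a URL (timestamp and target taken from the snapshot discussed below):

```shell
# Build the raw-capture ("id_") variant of a Wayback Machine snapshot URL.
ts='20210813183158'                 # snapshot timestamp
orig='https://peers.community/'     # originally archived URL
raw_url="https://web.archive.org/web/${ts}id_/${orig}"
echo "$raw_url"
# prints: https://web.archive.org/web/20210813183158id_/https://peers.community/
```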
`curl`-ing the Archive page for headers (`curl -v -I https://web.archive.org/web/20210813183158/https://peers.community/`) gives me `content-type: text/html; charset=windows-1252`, and `curl`-ing the actual Peers website for headers gives me `content-type: text/html` (I don't see any "UTF-8" string in the headers). This suggests to me that the Peers webserver is not sending any encoding information, so Archive falls back to its own default, which happens to be "windows-1252". The other archiving website (archive.today) gives me `content-type: text/html; charset=utf-8`, and the page is displayed correctly. So browsers must be picking up UTF-8 from the `<meta charset>`, not from the server headers. However, `<meta charset>` should fit completely within the first 1024 bytes of the document, and I suspect that all the JavaScript automatically inserted by Archive pushes it past that boundary (hence why `id_` actually works).
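To make the header comparison repeatable, here is a small helper that pulls the declared charset out of a response's headers. A sketch (the function name is mine; it assumes GNU grep/sed):

```shell
# Print the charset declared in a content-type header read from stdin,
# or "none" when the server sends no explicit encoding.
charset_of() {
  grep -i '^content-type' \
    | sed -n 's/.*charset=\([^;[:space:]]*\).*/\1/p' \
    | grep . || echo none
}

# Usage against the live servers (network required):
#   curl -sI 'https://web.archive.org/web/20210813183158/https://peers.community/' | charset_of
#   curl -sI https://peers.community/ | charset_of   # expected: none
```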
I would try changing the server setting to return an explicit charset, `content-type: text/html; charset=UTF-8`, and see if Archive picks it up. It should also be possible to do this via `.htaccess` using `AddDefaultCharset utf-8`.
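A sketch of that `.htaccess` entry (assuming the host is Apache and honours `.htaccess` overrides for this directory):

```apache
# .htaccess — serve text responses with an explicit UTF-8 charset.
# AddDefaultCharset appends "charset=utf-8" to text/plain and text/html
# responses that don't already declare an encoding.
AddDefaultCharset utf-8
```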
I'm not sure why forgefed.peers.community works, though. Its page header seems a bit shorter; for example, it has `<html class="default light">` instead of `<html class="default dark" lang="en">`. It would be interesting if you could use the very same header for testing, or maybe remove the `html` class entirely to save some characters and see if it makes any difference. If it does, then it must surely be the 1024-byte limit. I've also noticed that the ForgeFed page uses `<meta charset="UTF-8"/>` whereas Peers uses `<meta charset="utf-8">`. This should not make any difference, but you never know...
EDIT: It looks like HTTP headers take precedence over any `meta` tag every time. So what I think is happening is this:

- https://peers.community is displayed correctly because the server does not return any charset, so the browser falls back to what it finds in `<meta>`.
- Archive doesn't see any charset in the HTTP header when retrieving the page, so it sets its default, "windows-1252". It looks like Archive doesn't scan for a `<meta charset>`; it just applies its own default, and that HTTP header then takes precedence over your `<meta charset>`.
- Other archiving websites show the page correctly because they use a different default, i.e. UTF-8.
- Interestingly, Archive's HTTP headers for forgefed.peers.community contain the correct encoding, `content-type: text/html; charset=utf-8`, but peers.community is instead served with `content-type: text/html; charset=windows-1252`. I'm totally baffled by this. It makes me think that maybe Archive can pick up `<meta charset />` but not `<meta charset>`.
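The mismatch described above boils down to UTF-8 bytes being decoded as windows-1252, which is easy to reproduce locally with `iconv` (the sample string "Voilà" is mine, not from the site):

```shell
# Interpret UTF-8 bytes as windows-1252 — exactly what a browser does when
# Archive's header says charset=windows-1252 — and re-encode for display.
printf 'Voilà' | iconv -f windows-1252 -t utf-8
# prints "VoilÃ " — the "à" byte pair re-read as "Ã" plus a non-breaking
# space: classic mojibake
```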
The way to address this would be to set the HTTP header to UTF-8 server-side, and also to keep the `<meta charset>` with the same encoding as the HTTP header (so that the document is independent of the server, for example if it's downloaded). Before doing this, though, I would try `<meta charset />` (with a self-closed tag) and see if it makes any difference for Archive.
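Once both are in place, the two signals can be checked side by side. A sketch (the helper name is mine; it only looks at the first 1024 bytes, since browsers may not prescan beyond that):

```shell
# Extract the <meta charset> declaration from the first 1024 bytes of a
# page read on stdin, so it can be compared with the content-type header.
meta_charset() {
  head -c 1024 | grep -io '<meta charset=[^>]*>' \
    || echo 'none found in first 1024 bytes'
}

# Usage (network required) — after the fix, the two outputs should agree:
#   curl -sI https://peers.community/ | grep -i '^content-type'
#   curl -s  https://peers.community/ | meta_charset
```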