+2 votes

For example, the emojis at the top and the return char at the bottom of the page are wrong:

https://peers.community/

https://web.archive.org/web/20210813183158/https://peers.community/

I even changed to some of the same emoji that ForgeFed uses, but they come out wrong.

https://web.archive.org/web/20210807015023/https://forgefed.peers.community/

https://web.archive.org/web/20210813190057/https://peers.community/dark/index.html

The last thing I tried was several combinations of lang attributes in the HTML.

EDIT: The new moon emoji shows up as 🌑. I can't actually seem to get it to show up in WOTAS, though.

https://emojipedia.org/new-moon/


1 Answer

+1 vote
 
Best answer

I remembered this other question that suggested using id_, and sure enough the page is displayed correctly.
Running curl on the Archive page for its headers (curl -v -I https://web.archive.org/web/20210813183158/https://peers.community/) gives me content-type: text/html; charset=windows-1252, while running it on the actual Peers website gives me content-type: text/html (I don't see any "UTF-8" string in the headers). This suggests to me that the Peers webserver is not sending any encoding information, so Archive falls back to its own default, which happens to be "windows-1252". The other archiving website (archive.today) gives me content-type: text/html; charset=utf-8 and the page is displayed correctly.

So browsers must be picking up UTF-8 from the <meta charset>, not from the server headers. However, the <meta charset> element should fit completely within the first 1024 bytes of the document, and I suspect that all the JavaScript Archive automatically inserts is pushing it past that limit (which would explain why id_ works).
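To compare the three header values without reading them by eye, here is a minimal Python sketch that pulls the charset parameter out of a Content-Type value. The function name charset_from_content_type and the observed dictionary are just illustrative; the header strings are the ones quoted above, pasted in by hand rather than fetched.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# Header values quoted in the answer above (copied by hand, not fetched).
observed = {
    "web.archive.org": "text/html; charset=windows-1252",
    "peers.community": "text/html",
    "archive.today": "text/html; charset=utf-8",
}

for host, value in observed.items():
    print(host, "->", charset_from_content_type(value))
```

A result of None for peers.community would match the reading that the origin server sends no encoding information at all.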

I would try changing the server settings so that it returns an explicit charset (content-type: text/html; charset=UTF-8) and see if Archive picks it up. It should also be possible to do this via .htaccess using AddDefaultCharset utf-8.

I'm not sure why forgefed.peers.community works, though. Its page head seems a bit shorter; for example, it has <html class="default light"> instead of <html class="default dark" lang="en">. It would be interesting if you could use the very same head for testing, or maybe remove the html class entirely to save some characters, and see if it makes any difference. If it does, then it is almost certainly the 1024-byte limit. I've also noticed that the ForgeFed page uses <meta charset="UTF-8"/> whereas Peers uses <meta charset="utf-8">. This should not make any difference, but you never know...
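One way to test the 1024-byte theory offline is to measure where the <meta charset> element ends in the raw bytes of a saved page. The sketch below is a deliberate simplification: it only recognises the short <meta charset=...> form (with or without the trailing slash), not everything a real browser prescan handles. The names find_meta_charset and fits_in_prescan are made up for this example.

```python
import re

# Simplified pattern: matches only the short <meta charset=...> form, with or
# without a trailing slash. A real browser prescan handles more cases
# (http-equiv, comments, other attributes); this is an approximation.
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_-]+)["\']?\s*/?>',
    re.IGNORECASE,
)

PRESCAN_LIMIT = 1024  # bytes, the limit mentioned in the answer


def find_meta_charset(html: bytes):
    """Return (charset, end_offset) for the first <meta charset>, or None."""
    match = META_CHARSET.search(html)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower(), match.end()


def fits_in_prescan(html: bytes, limit: int = PRESCAN_LIMIT) -> bool:
    """True if the whole <meta charset> element ends within the first `limit` bytes."""
    found = find_meta_charset(html)
    return found is not None and found[1] <= limit


# Hypothetical test documents: a short head, and one with a large block
# placed before the meta element, similar to what an injected script would do.
short_head = b'<html class="default dark" lang="en"><head><meta charset="utf-8">'
padded_head = b"<html><head><script>" + b"x" * 1100 + b'</script><meta charset="utf-8">'

print(fits_in_prescan(short_head))
print(fits_in_prescan(padded_head))
```

Running fits_in_prescan on the bytes of the archived page and on the id_ version would show whether the injected script really pushes the element past the limit.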

EDIT: It looks like HTTP headers always take precedence over any meta tag. So what I think is happening is this:

  • https://peers.community is displayed correctly because the server is not returning any charset, so the browser uses what it finds in <meta>
  • Archive doesn't see any charset in the HTTP header when retrieving the page, so it sets its default, "windows-1252". It looks like Archive doesn't look for a <meta charset> and just applies its own default. That HTTP header then takes precedence over your <meta charset>
  • other archiving websites show the page correctly because they use a different default, i.e. UTF-8
  • interestingly, Archive's HTTP headers for forgefed.peers.community contain the correct encoding, content-type: text/html; charset=utf-8, whereas peers.community is served with content-type: text/html; charset=windows-1252. I'm totally baffled by this. It makes me think that maybe Archive can pick up <meta charset /> but not <meta charset>.
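The precedence described in the list can be sketched in a few lines of Python. This is only a model of the behaviour assumed above (HTTP header first, then <meta charset>, then a fallback), not how any particular browser or Archive is actually implemented; effective_encoding and render are hypothetical names.

```python
def effective_encoding(http_charset, meta_charset, fallback="windows-1252"):
    """Pick the encoding: HTTP header first, then <meta charset>, then a fallback."""
    return http_charset or meta_charset or fallback


def render(body: bytes, http_charset, meta_charset) -> str:
    """Decode the page body the way the model above says a browser would."""
    encoding = effective_encoding(http_charset, meta_charset)
    return body.decode(encoding, errors="replace")


# The new moon emoji from the question, as the UTF-8 bytes the server sends.
page = "🌑".encode("utf-8")

# Live site: no charset in the HTTP header, so <meta charset> decides.
live = render(page, None, "utf-8")

# Archived copy: the header says windows-1252 and overrides the meta tag.
archived = render(page, "windows-1252", "utf-8")

print(repr(live))
print(repr(archived))
```

In the second case the four UTF-8 bytes of the emoji are decoded as four separate windows-1252 characters, which is the kind of garbage the question describes.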

The way to address this would be to set the HTTP header to UTF-8 server-side, and also to keep the <meta charset> with the same encoding as the HTTP header (so that the document does not depend on the server, for example if it's downloaded). Before doing this, though, I would try <meta charset /> (with a self-closing tag) and see if it makes any difference for Archive.

0

Thank you! Those are some great suggestions. I have tried some of them already, but not the .htaccess. I'll report back.

0

Looks like it was the header size. After adding the charset to the .htaccess file it looks good. (I've been trying different pages to make sure I don't hit the 45 minute limit.)

https://web.archive.org/web/20210813224938/https://peers.community/archive/index.html

EDIT: main page saved successfully:

https://web.archive.org/web/20210813224741/https://peers.community/index.html

+1

Could you try these with the old .htaccess (that does not set the encoding):

  • reduce the <head> node to a minimum, i.e. remove everything except the <meta charset> element. If the problem is the head size, this should work, because Archive's JavaScript injection is just below 1KB and so shouldn't overflow the limit
  • add a closing slash to the meta tag, i.e. try <meta charset="utf-8" /> instead of <meta charset="utf-8">. It could be that Archive has an issue with parsing (X)HTML, so maybe it is looking for the tag but cannot parse it.

Trying these would help to better understand what's going on. I would like to think that Archive relies on the server's HTTP headers, but the two sites (peers.community and forgefed.peers.community) are served from the same server, with the same HTTP headers, yet Archive serves one with charset=windows-1252 and the other with charset=utf-8. That's why I think the two options above could fix the issue without needing to change the .htaccess.

0

The only thing I hadn't tried yet is reducing the size of the <head> down to only the charset and the title.

Interestingly, it still breaks:

https://web.archive.org/web/20210815174212/https://peers.community/index.html

I've had some trouble getting the Wayback Machine to save current pages, even outside the 45 minute window. Perhaps I've been marked as a spam bot of some kind. Some pages, even after I archive them, continue to display saves from days earlier. Anyway...

Here's the reverted state of the website with the version of the head that I prefer:

https://web.archive.org/web/20210815174616/https://peers.community/dark/sitemap/index.html

Contributions licensed under CC0
...