Character sets and encoding types (UTF-8...)

What is a character set?

A character set is a well-defined set of characters. One of the simplest is ASCII, which covers the unaccented Latin letters, digits, and punctuation used in US English. The most complete and most modern is Unicode.

What is character encoding?

The encoding of a character set specifies how its characters are actually stored as bytes that a computer can handle. For a given character set, there is often a choice between several encodings, each with its own advantages and disadvantages.
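To make the distinction concrete, here is a minimal Python sketch. The sample character "é" and the three encodings are chosen only for illustration: the character has a single number in the Unicode character set, but its byte representation depends on the encoding.

```python
# Character set: each character has a number (its Unicode code point).
ch = "é"
code_point = ord(ch)

# Encoding: the same character stored as bytes, in three different ways.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-le")    # little-endian, without a byte order mark
as_latin1 = ch.encode("iso-8859-1")

print(f"U+{code_point:04X}")                            # U+00E9
print(as_utf8.hex(), as_utf16.hex(), as_latin1.hex())   # c3a9 e900 e9
```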

For example, here are two widely used encodings for the Unicode character set:

  • UTF-8
    With this one, the basic ASCII characters are encoded in one byte (8 bits). The others are encoded in two to four bytes. This format has the advantage of being compact (if there are a lot of basic ASCII characters), and its ASCII content stays readable in any text editor.
  • UTF-16
    With this one, most characters in common use are encoded in two bytes (16 bits); the others (for example emoji and some rare ideographs) take four bytes. It is less compact than UTF-8 for text in the Latin alphabet. Some programs use it internally because most characters have the same size, but the number of characters in a string cannot be calculated reliably from the space it occupies in memory.
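The size trade-off between the two encodings can be checked with a short Python sketch. The sample strings are arbitrary. Note that the emoji, which lies outside the range UTF-16 covers with a single 16-bit unit, takes four bytes in both encodings.

```python
samples = {
    "ascii": "hello",
    "french": "café",
    "japanese": "日本語",
    "emoji": "😀",
}

def encoded_sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16) for the given text."""
    # utf-16-le is used so that no byte order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label:9} UTF-8: {utf8_size:2} bytes   UTF-16: {utf16_size:2} bytes")
```

For mostly ASCII content such as HTML markup, UTF-8 is the smaller of the two; for Japanese text the comparison goes the other way.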

Which character set and encoding should I choose for my website?

If you have a choice, we recommend that you use the Unicode character set with the UTF-8 encoding. This encoding is supported by all browsers and lets you handle all languages (Japanese, French, English, ...).

If you have a static website, then you must ensure that your HTML files are saved in UTF-8 (all modern text editors offer this option).
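If you prefer to check this with a script rather than by eye, the following Python sketch may help. The helper names `is_valid_utf8` and `save_as_utf8` are made up for this example. Keep in mind that a file that decodes cleanly as UTF-8 is very likely UTF-8, but a pure ASCII file passes the test too.

```python
from pathlib import Path

def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def save_as_utf8(path, html):
    """Write text to disk as UTF-8, whatever the platform default is."""
    Path(path).write_text(html, encoding="utf-8")
```

A typical use would be to run `is_valid_utf8` over every `.html` file of the site and to re-save the ones that fail from an editor, choosing UTF-8 explicitly.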

If you have a dynamic site that combines several technologies (databases, HTML templates, XML, languages such as PHP, Ruby, ASP ...), then you must make sure that you fully control the character conversion chain through each component. It is not always easy, but it is essential if you want to avoid corrupted characters (garbage) in the final display in the browser.
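One link of such a chain can be sketched in Python as follows. This assumes, purely for illustration, a data source that delivers ISO-8859-1 bytes and a page that must be served in UTF-8; `transcode` is a hypothetical helper, not part of any framework.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes from a known source encoding to the target encoding."""
    # Strict decoding raises UnicodeDecodeError when the bytes are not
    # valid in the source encoding, instead of silently producing garbage.
    text = data.decode(source)
    return text.encode(target)

legacy_bytes = b"caf\xe9"                      # "café" stored as ISO-8859-1
page_bytes = transcode(legacy_bytes, "iso-8859-1")
print(page_bytes)                              # b'caf\xc3\xa9'
```

The important point is that the source encoding is stated explicitly at each step. A single-byte encoding such as ISO-8859-1 accepts any byte sequence, so a wrong assumption there does not raise an error; it has to be verified against the data itself.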

How to apply this charset to my website?

Imagine that you have chosen UTF-8 for your static website. All your HTML files have been saved in UTF-8 format, which is perfect. Now the question is: how will the browser that interprets the HTML file know which encoding is used? Here is the logic that most browsers use:

  • the browser first checks whether the web server that sent it the page also sent the charset information. The server can state this in the 'Content-Type' field of its HTTP response (but it is not obligatory).
    • text/html; charset=UTF-8: here the server indicates that it is an HTML file encoded in UTF-8
    • text/html: here the server does not provide the information, so the browser will have to find another way to get it.
  • otherwise the browser checks if the HTML code itself contains any encoding information. In fact, a web designer can provide this information using a meta tag such as:
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
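  • otherwise the browser falls back to a default encoding. Historically this was ISO-8859-1; current browsers typically use a default that depends on the user's locale, such as windows-1252 for Western European languages.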
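The three steps above can be sketched in Python as follows. This is a simplified illustration, not what a browser actually runs: the function names are invented for the example, the regular expressions only cover common cases, and the `fallback` value is a parameter because the real default varies between browsers and locales.

```python
import re

# Matches charset=utf-8, charset="utf-8" or charset='utf-8'.
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE)
META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)

def charset_from_header(content_type):
    """Step 1: look for a charset in the HTTP Content-Type value."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None

def charset_from_meta(html):
    """Step 2: look for a charset declared in a meta tag of the page."""
    for tag in META_RE.findall(html):
        match = CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None

def guess_charset(content_type, html, fallback="iso-8859-1"):
    """Apply the steps in order and return the first charset found."""
    return charset_from_header(content_type) or charset_from_meta(html) or fallback

page = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
print(guess_charset("text/html; charset=UTF-8", ""))   # utf-8 (from the header)
print(guess_charset("text/html", page))                # utf-8 (from the meta tag)
print(guess_charset("text/html", "<p>hello</p>"))      # iso-8859-1 (fallback)
```

In practice it is safest to declare the encoding at both levels, and to make sure the two declarations agree.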

How to diagnose and solve charset issues?

If the display of certain characters (especially those containing accents) is corrupted in the browser, then it is important not to try to "tweak" and make a risky character conversion. This may aggravate the problem. On the contrary, the problem should be completely analyzed in order to be properly solved:

  • check your web platform by asking the questions "what is the input encoding?" and "what is the output encoding?" for each component. If necessary you can use conversion functions (for example, iconv in PHP), but only if you have clearly identified the problem.
  • go through the steps in the previous section to make sure that your web server and your HTML page declare the correct encoding.
  • if your web page is XHTML (i.e. it is actually XML), then you must follow the XML rules to specify the encoding of your page. This is usually done in the first line of the XML file:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    If no encoding is specified, then an XML parser will assume UTF-8 (or UTF-16 if the file starts with a byte order mark).
  • make sure your HTML files, your templates, your localization files and your hard-coded strings are in the desired encoding.
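The most common symptom, an accented letter shown as two strange characters, can be reproduced in a few lines of Python. This sketch assumes the classic case where UTF-8 bytes are read as ISO-8859-1; the helper names `misread` and `repair` are invented for the example.

```python
def misread(text, actual="utf-8", assumed="iso-8859-1"):
    """Show how text looks when its bytes are decoded with the wrong encoding."""
    return text.encode(actual).decode(assumed)

def repair(garbled, actual="utf-8", assumed="iso-8859-1"):
    """Undo one wrong decoding step, once it has been clearly identified."""
    return garbled.encode(assumed).decode(actual)

garbled = misread("café")
print(garbled)           # cafÃ©
print(repair(garbled))   # café
```

Seeing "Ã©" where "é" was expected is therefore a strong hint that UTF-8 data is being declared or decoded as ISO-8859-1 somewhere in the chain. The better fix is usually to correct the declaration at that point rather than to apply `repair` to the output.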

Of course, an analysis of your page with Yakaferci will help you identify any potential problems.

The free tool Yakaferci to analyze the charset of your pages

Analyze your charset encoding with our free SEO Page Analyzer

To start a Yakaferci analysis, you must simply enter the link to your page in the area above and click the Analyze button. You can then access the different pages of the report, particularly those on charset.



Other Yakaferci articles in the SEO and Technology section may be helpful too.