Unicode and HTML

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set.The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike.Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example 合 instead of 合).Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.For a browser from a location where legacy multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.Because of the legacy of 8-bit text representations in programming languages and operating systems and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range.Another factor contributing in the same direction, is the arrival of UTF-8 – which greatly diminishes the need for other encodings, and thus modern editors tends to default, as recommended by the HTML5 specification,[1] to UTF-8.No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications.The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist.But note that Internet Explorer, Chrome and Safari – for both XML and text/html serializations – do not permit the encoding to be overridden whenever the page includes the BOM.They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.