Unicode and HTML

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.

To ensure better compatibility with older browsers, it is still common practice to convert the hexadecimal code point into a decimal value (for example, &#21512; instead of &#x5408;). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

For a browser from a location where legacy multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied. Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Another factor contributing in the same direction is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors thus tend to default to UTF-8, as recommended by the HTML5 specification.[1]

No additional metadata mechanisms are required for these encodings, since the byte-order mark (BOM) includes all of the information necessary for processing applications. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. Note, however, that Internet Explorer, Chrome and Safari, for both XML and text/html serializations, do not permit the encoding to be overridden whenever the page includes the BOM. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.
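
The two mechanisms described above, numeric character references and the byte-order mark, can be illustrated with a minimal Python sketch. The helper names `decimal_reference`, `hex_reference` and `sniff_bom` are hypothetical and chosen only for this example; the sketch uses the standard-library `codecs` and `html` modules and does not reproduce the detection logic of any particular browser.

```python
import codecs
import html


def decimal_reference(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


# Byte-order marks paired with the encoding they signal. The UTF-32 marks
# come first because the UTF-32LE mark begins with the UTF-16LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    ch = "\u5408"  # the CJK ideograph used in the example above
    print(decimal_reference(ch), hex_reference(ch))
    # Both forms decode back to the same character.
    print(html.unescape(decimal_reference(ch)) == html.unescape(hex_reference(ch)))
    print(sniff_bom(codecs.BOM_UTF16_LE + ch.encode("utf-16-le")))
```

The decimal and hexadecimal forms are interchangeable to a conforming parser, so the choice between them is purely a matter of compatibility and readability. In the BOM check, order matters: testing the UTF-16LE mark before the UTF-32LE mark would misidentify UTF-32LE data.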