URI normalization

URI normalization, the process of transforming a URI into a consistent canonical form, is used in several contexts. Search engines employ URI normalization in order to correctly rank pages that may be reachable through multiple URIs, and to reduce the indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web servers may also perform normalization for many reasons (e.g. to more easily intercept security risks coming from client requests, or to use only one absolute file name for each resource stored in their caches and named in log files).

Types of URI normalization

The following normalizations are described in RFC 3986 [1] and result in equivalent URIs:
- Converting the scheme and host to lowercase, since both are case-insensitive
- Capitalizing the hexadecimal digits in percent-encoded triplets (e.g. %3a becomes %3A)
- Decoding percent-encoded octets of unreserved characters (e.g. %7E becomes ~)
- Removing dot-segments ("." and "..") from the path
- Removing the default port (e.g. port 80 for http)

For http and https URIs, the following normalizations may result in equivalent URIs, but this is not guaranteed by the standards:
- Adding a trailing "/" to a path that names a directory
- Removing a default directory index such as index.html

Applying the following normalizations results in a semantically different URI, although it may still refer to the same resource:
- Removing the fragment (e.g. #section)
- Removing or reordering query variables
- Replacing an IP address with the domain name it serves, which is unsafe when multiple virtual web servers share that address and are distinguished only at the application layer

Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists.
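As a concrete illustration, the syntax-based normalizations described above can be sketched in Python. This is a minimal sketch, not a complete RFC 3986 implementation: the table of default ports is an assumption limited to http and https, userinfo is ignored, and dot-segment removal is simplified.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: safe to leave in decoded form.
_UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "0123456789-._~")

def _normalize_percent(text: str) -> str:
    """Uppercase percent-encoding hex digits; decode unreserved octets."""
    def fix(match):
        octet = chr(int(match.group(1), 16))
        return octet if octet in _UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)

def _remove_dot_segments(path: str) -> str:
    """Simplified version of RFC 3986 section 5.2.4 for absolute paths."""
    output = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(output) > 1:      # never pop the leading empty segment
                output.pop()
        else:
            output.append(segment)
    if path.split("/")[-1] in (".", ".."):
        output.append("")            # trailing dot-segment keeps the slash
    return "/".join(output)

def normalize(uri: str) -> str:
    """Apply the equivalence-preserving normalizations to an http(s) URI."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    default_ports = {"http": 80, "https": 443}   # assumed scheme defaults
    port = parts.port
    netloc = host if port in (None, default_ports.get(scheme)) else f"{host}:{port}"
    path = _remove_dot_segments(_normalize_percent(parts.path)) or "/"
    query = _normalize_percent(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

For example, normalize("HTTP://Example.COM:80/a/./b/../c/%7euser") yields "http://example.com/a/c/~user". The fragment is deliberately preserved, since removing it changes the semantics of the URI.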
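The full DustBuster heuristic is more involved, but its core idea, mining substring-substitution rules from a URI list, can be illustrated with a toy sketch. The function names and the pairwise-comparison strategy below are illustrative assumptions, not the algorithm from the paper: a rule is proposed whenever several URI pairs in the list differ only by the same substitution.

```python
from collections import Counter
from itertools import combinations

def diff_rule(u: str, v: str):
    """Strip the longest common prefix and suffix of two URIs; the
    remaining middles form a candidate substitution rule (alpha, beta)."""
    i = 0
    while i < min(len(u), len(v)) and u[i] == v[i]:
        i += 1
    j = 0
    while j < min(len(u), len(v)) - i and u[len(u) - 1 - j] == v[len(v) - 1 - j]:
        j += 1
    return u[i:len(u) - j], v[i:len(v) - j]

def candidate_rules(urls, min_support=2):
    """Toy DUST-rule miner: keep a rule alpha -> beta when at least
    min_support URI pairs in the list differ only by that substitution."""
    support = Counter()
    for u, v in combinations(sorted(set(urls)), 2):
        alpha, beta = diff_rule(u, v)
        if alpha != beta:
            support[(alpha, beta)] += 1
    return [rule for rule, count in support.items() if count >= min_support]
```

Run on a crawl list such as ["http://a.com/", "http://a.com/index.html", "http://a.com/x/", "http://a.com/x/index.html"], this proposes the rule ("", "index.html"), i.e. that appending a directory index does not change the resource. A real system would still have to validate each candidate rule, for example by fetching both URIs and comparing the content.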