PoshCode PowerShell Code Repository

File encoding no BOM by DanWard 12 weeks ago (modification of post by DanWard)

Returns the encoding type of a file. Uses the byte order mark (BOM) if one exists; otherwise analyzes the file's contents to determine the encoding.

#region Function: Get-DTWFileEncoding

<#
.SYNOPSIS
Returns the encoding type of the file
.DESCRIPTION
Returns the encoding type of the file.  It first attempts to determine the
encoding by detecting the byte order mark (BOM) using Lee Holmes' algorithm
(http://poshcode.org/2153).  However, if the file does not have a BOM
it makes an attempt to determine the encoding by analyzing the file content
(does it 'appear' to be Unicode, does it have characters outside the ASCII
range, etc.).  If it can't tell based on the content analyzed, then
it assumes it's ASCII. I haven't checked all editors but PowerShell ISE and
PowerGUI both create their default files as non-ASCII with a BOM (they use
Unicode Big Endian and UTF-8, respectively).  If your file doesn't have a
BOM and 'doesn't appear to be Unicode' (based on my algorithm*) but contains
non-ASCII characters after index ByteCountToCheck, the file will be incorrectly
identified as ASCII.  So put a BOM in there, would ya!

For more information and sample encoding files see:
http://danspowershellstuff.blogspot.com/2012/02/get-file-encoding-even-if-no-byte-order.html
And please give me any tips you have about improving the detection algorithm.

*For a full description of the algorithm used to analyze non-BOM files,
see "Determine if Unicode/UTF8 with no BOM algorithm description".
.PARAMETER Path
Path to file
.PARAMETER ByteCountToCheck
Number of bytes to check; by default the first 10000 bytes are checked.
Must be at least 400.
Depending on the size of your file, this might be the entire content of your file.
.PARAMETER PercentageMatchUnicode
If the percentage of null (value 0) bytes found is greater than or equal to
PercentageMatchUnicode then this file is identified as Unicode.  Default value .5 (50%)
.EXAMPLE
Get-DTWFileEncoding -Path .\SomeFile.ps1 -ByteCountToCheck 1000
Attempts to determine encoding using only the first 1000 bytes
BodyName          : unicodeFFFE
EncodingName      : Unicode (Big-Endian)
HeaderName        : unicodeFFFE
WebName           : unicodeFFFE
WindowsCodePage   : 1200
IsBrowserDisplay  : False
IsBrowserSave     : False
IsMailNewsDisplay : False
IsMailNewsSave    : False
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 1201
#>
function Get-DTWFileEncoding {
  #region Function parameters
  [CmdletBinding()]
  param(
    [Parameter(Mandatory = $true,ValueFromPipeline = $true,ValueFromPipelineByPropertyName = $true)]
    [ValidateNotNullOrEmpty()]
    [Alias("FullName")]
    [string]$Path,
    [Parameter(Mandatory = $false)]
    [int]$ByteCountToCheck = 10000,
    [Parameter(Mandatory = $false)]
    [decimal]$PercentageMatchUnicode = .5
  )
  #endregion
  process {
    # minimum number of bytes to check if no BOM
    [int]$MinCharactersToCheck = 400
    #region Parameter validation
    #region Path must exist; if not, exit
    if ($false -eq (Test-Path -Path $Path)) {
      Write-Error -Message "$($MyInvocation.MyCommand.Name) :: Path does not exist: $Path"
      return
    }
    #endregion
    #region ByteCountToCheck should be at least MinCharactersToCheck
    if ($ByteCountToCheck -lt $MinCharactersToCheck) {
      Write-Error -Message "$($MyInvocation.MyCommand.Name) :: ByteCountToCheck should be at least $MinCharactersToCheck : $ByteCountToCheck"
      return
    }
    #endregion
    #endregion

    #region Determine file encoding based on BOM - if exists
    # the code in this section is mostly Lee Holmes' algorithm: http://poshcode.org/2153
    # until we determine the file encoding, assume it is unknown
    $Unknown = "UNKNOWN"
    $result = $Unknown

    # The hashtable used to store our mapping of encoding bytes to their
    # name. For example, "255-254 = Unicode"
    $encodings = @{}

    # Find all of the encodings understood by the .NET Framework. For each,
    # determine the bytes at the start of the file (the preamble) that the .NET
    # Framework uses to identify that encoding.
    $encodingMembers = [System.Text.Encoding] | Get-Member -Static -MemberType Property
    $encodingMembers | ForEach-Object {
      $encodingBytes = [System.Text.Encoding]::($_.Name).GetPreamble() -join '-'
      $encodings[$encodingBytes] = $_.Name
    }

    # Find out the lengths of all of the preambles.
    $encodingLengths = $encodings.Keys | Where-Object { $_ } | ForEach-Object { ($_ -split "-").Count }

    # Go through each of the possible preamble lengths, read that many
    # bytes from the file, and then see if it matches one of the encodings
    # we know about.
    foreach ($encodingLength in $encodingLengths | Sort-Object -Descending) {
      # read only the bytes needed for this preamble, not the whole file
      # (-Encoding Byte is Windows PowerShell syntax; PowerShell 6+ uses -AsByteStream)
      $bytes = Get-Content -Path $Path -Encoding Byte -TotalCount $encodingLength
      # a file shorter than this preamble cannot match it
      if (@($bytes).Count -ne $encodingLength) { continue }
      $encoding = $encodings[$bytes -join '-']

      # If we found an encoding that had the same preamble bytes,
      # save that output and break.
      if ($encoding) {
        $result = $encoding
        break
      }
    }
    # if encoding determined from BOM, then return it
    if ($result -ne $Unknown) {
      [System.Text.Encoding]::$result
      return
    }
    #endregion

    #region No BOM on file, attempt to determine based on file content
    #region Determine if Unicode/UTF8 with no BOM algorithm description
    <#
       Looking at the content of many code files, most of it is code or
       spaces.  Sure, there are comments/descriptions and there are variable
       names (which could be double-byte characters) or strings but most of
       the content is code - represented as single-byte characters.  If the
       file is Unicode but the content is mostly code, the single-byte
       characters will have a null/value 0 byte as either the first or
       second byte in each group, depending on Endian type.
       My algorithm uses the existence of these 0s:
        - look at the first ByteCountToCheck bytes of the file
        - if any byte is greater than 127, note it (if any are found, the
          file is at least UTF8)
        - count the number of 0s found (in every other byte)
          - if a certain percentage (compared to total # of byte pairs) are
            null/value 0, then assume it is Unicode
          - if the percentage of 0s is less than we identify as a Unicode
            file (less than PercentageMatchUnicode) BUT a byte greater
            than 127 was found, assume it is UTF8.
          - Else assume it's ASCII.
       Yes, technically speaking, the BOM is really only for identifying the
       byte order of the file but c'mon already... if your file isn't ASCII
       and you don't want its encoding to be confused just put the BOM in
       there for Pete's sake.
       Note: if you have a huge amount of text at the beginning of your file which
       is not code and is not single-byte, this algorithm may fail.  Again, put a
       BOM in.
    #>
    #endregion
    $Content = (Get-Content -Path $Path -Encoding byte -ReadCount $ByteCountToCheck -TotalCount $ByteCountToCheck)
    # get actual count of bytes (in case less than $ByteCountToCheck)
    $ByteCount = $Content.Count
    # an empty file has nothing to analyze (and would divide by zero below);
    # per the description, fall back to ASCII
    if (-not $ByteCount) {
      [System.Text.Encoding]::ASCII
      return
    }
    [bool]$NonAsciiFound = $false
    # yes, the big/little endian sections could be combined in one loop
    # sorry, crazy busy right now...

    #region Check if Big Endian
    # check if big endian Unicode first - even-numbered index bytes will be 0
    $ZeroCount = 0
    for ($i = 0; $i -lt $ByteCount; $i += 2) {
      if ($Content[$i] -eq 0) { $ZeroCount++ }
      if ($Content[$i] -gt 127) { $NonAsciiFound = $true }
    }
    if (($ZeroCount / ($ByteCount / 2)) -ge $PercentageMatchUnicode) {
      # create big-endian Unicode with no BOM
      New-Object System.Text.UnicodeEncoding $true,$false
      return
    }
    #endregion

    #region Check if Little Endian
    # check if little endian Unicode next - odd-numbered index bytes will be 0
    $ZeroCount = 0
    for ($i = 1; $i -lt $ByteCount; $i += 2) {
      if ($Content[$i] -eq 0) { $ZeroCount++ }
      if ($Content[$i] -gt 127) { $NonAsciiFound = $true }
    }
    if (($ZeroCount / ($ByteCount / 2)) -ge $PercentageMatchUnicode) {
      # create little-endian Unicode with no BOM
      New-Object System.Text.UnicodeEncoding $false,$false
      return
    }
    #endregion

    #region Doesn't appear to be Unicode; either UTF8 or ASCII
    # Ok, at this point, it's not Unicode based on our percentage rules
    # if not Unicode but non-ASCII byte found, call it UTF8 (no BOM, alas)
    if ($NonAsciiFound -eq $true) {
      New-Object System.Text.UTF8Encoding $false
      return
    } else {
      # if made it this far, I'm calling it ASCII; done deal pal
      [System.Text.Encoding]::ASCII
      return
    }
    #endregion
    #endregion
  }
}
Export-ModuleMember -Function Get-DTWFileEncoding
#endregion
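
The zero-byte heuristic described in the function's comments can be tried without touching disk. The following is a minimal sketch, not part of the original post: the helper name Get-ZeroByteGuess and its Threshold parameter are hypothetical, and it returns plain strings rather than encoding objects. It applies the same idea to a byte array: count zeros at even and odd indexes, compare each count against the number of byte pairs, and fall back to UTF8 or ASCII depending on whether any byte exceeds 127.

```powershell
# Hypothetical helper (assumption, not from the original post): classifies a
# byte array using the same zero-byte idea as the function above.
function Get-ZeroByteGuess {
  param(
    [byte[]]$Bytes,
    [decimal]$Threshold = .5
  )
  # too little data to judge; mirror the "assume ASCII" fallback
  if (-not $Bytes -or $Bytes.Count -lt 2) { return 'ASCII' }

  $evenZeros = 0
  $oddZeros = 0
  $nonAscii = $false
  for ($i = 0; $i -lt $Bytes.Count; $i++) {
    if ($Bytes[$i] -eq 0) {
      if ($i % 2 -eq 0) { $evenZeros++ } else { $oddZeros++ }
    }
    if ($Bytes[$i] -gt 127) { $nonAscii = $true }
  }

  # zeros are compared against the number of two-byte groups
  $pairs = $Bytes.Count / 2
  if (($evenZeros / $pairs) -ge $Threshold) { return 'BigEndianUnicode' }
  if (($oddZeros / $pairs) -ge $Threshold) { return 'Unicode' }
  if ($nonAscii) { return 'UTF8' }
  return 'ASCII'
}

# GetBytes does not emit a BOM, so these arrays look like BOM-less files
$code = 'function Test { "hello" }'
$accented = "caf$([char]0xE9)"   # "cafe" with an accented final letter (U+00E9)

Get-ZeroByteGuess ([System.Text.Encoding]::BigEndianUnicode.GetBytes($code))  # BigEndianUnicode
Get-ZeroByteGuess ([System.Text.Encoding]::Unicode.GetBytes($code))           # Unicode
Get-ZeroByteGuess ([System.Text.Encoding]::UTF8.GetBytes($accented))          # UTF8
Get-ZeroByteGuess ([System.Text.Encoding]::ASCII.GetBytes($code))             # ASCII
```

For real files, Get-DTWFileEncoding accepts pipeline input through its FullName alias, so something like Get-ChildItem *.ps1 | Get-DTWFileEncoding should report one encoding object per file.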
