tomasz86 Posted October 7, 2012

Is there any simple way to check text file encoding from the command line?

I'm working on a script which adds lines to the I386\HIVE*.INF files in the Windows 2000/XP source. On an English system those HIVE*.INF files are encoded in ANSI, but in a Korean Windows XP they're encoded in UCS-2 Little Endian.

I don't really need to know the specific encoding. I'd just like to know whether the file is ANSI or not.

jaclaz Posted October 7, 2012

UCS-2 Little Endian sounds to me a lot like "Unicode". The simplest check you can make is looking for a hex 00: if there is at least one, it's Unicode; conversely, if there are none, it's not Unicode, and it is very likely to be "plain ANSI text".

http://betterexplained.com/articles/unicode/

jaclaz

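The hex 00 test described above can be sketched in a few lines of Python. Python is not one of the Windows default tools the thread is after, so this is only an illustration of the logic. The function names and the 4096-byte sample size are assumptions made for this sketch, and the test is a heuristic: it says "not ANSI" for any file containing a NUL byte, including binary files.

```python
def looks_like_ucs2(data: bytes) -> bool:
    """Heuristic: plain ANSI text should contain no 0x00 bytes."""
    return b"\x00" in data


def file_looks_like_ucs2(path: str, sample_size: int = 4096) -> bool:
    # Only the start of the file is read; for an INF file the header
    # lines are assumed to be enough to decide.
    with open(path, "rb") as handle:
        return looks_like_ucs2(handle.read(sample_size))
```

In UCS-2 Little Endian every ASCII character is stored as the character byte followed by 00, so even a short header such as "[Version]" triggers the check.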
tomasz86 Posted October 7, 2012

Thanks for your suggestions!

@allen2
It would be nice to be able to do it without any external application, although I'll have a look at it if I can't manage to do it with just the Windows default tools.

@jaclaz
Can you view a text file in hex from the command line? After googling, the only "method" I've managed to find is to use "debug.exe", but it actually displays "00"s in ANSI files too.

I wonder what you think about this. If you ECHO something to a text file encoded in UCS-2 Little Endian from CMD (without the /U switch), the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just searching for it with FINDSTR. If it can't be found, the file is UCS-2 Little Endian.

Edited October 7, 2012 by tomasz86

jumper Posted October 8, 2012

endian.zip - endian.exe (console-32 app), endian.bat (sample usage)

endian.exe source snippet based on header info detailed at http://betterexplained.com/articles/unicode/:

    // read first three (or more) bytes from file into byte array s[], then:
    return (s[0]==255) && (s[1]==254)? 255 :
           (s[0]==254) && (s[1]==255)? 254 :
           (s[0]==0xEF) && (s[1]==0xBB) && (s[2]==0xBF)? 239 : 0;

Sample usage: endian.bat

    @echo off
    %0\..\endian %1
    IF ERRORLEVEL 255 GOTO UCS2LE
    IF ERRORLEVEL 254 GOTO UCS2BE
    IF ERRORLEVEL 239 GOTO UTF8
    echo ANSI
    GOTO End
    :UCS2LE
    echo UCS-2 Little Endian
    GOTO End
    :UCS2BE
    echo UCS-2 Big Endian
    GOTO End
    :UTF8
    echo UTF-8
    :End

If debug.exe is available, I think it can be used to achieve the same results. Debug can be scripted to open the file, analyze the first three bytes, and create a temp com file that sets an appropriate ERRORLEVEL. If Debug doesn't set the ERRORLEVEL upon exit itself, the temp file can be eliminated by running the temp program within Debug.

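For readers without the endian.exe download, the same header check can be sketched in Python. This is an illustration, not the endian.exe source: the names `BOMS`, `bom_code` and `bom_name` are made up here, and only the return codes 255, 254, 239 and 0 are taken from the snippet above.

```python
# Byte order marks checked by the snippet above, with the same return codes.
BOMS = (
    (b"\xff\xfe", 255, "UCS-2 Little Endian"),
    (b"\xfe\xff", 254, "UCS-2 Big Endian"),
    (b"\xef\xbb\xbf", 239, "UTF-8"),
)


def bom_code(head: bytes) -> int:
    """Return 255, 254 or 239 for a recognised BOM, or 0 if none is found."""
    for bom, code, _name in BOMS:
        if head.startswith(bom):
            return code
    return 0


def bom_name(head: bytes) -> str:
    """Return the label endian.bat would echo for the given leading bytes."""
    for bom, _code, name in BOMS:
        if head.startswith(bom):
            return name
    return "ANSI"
```

Reading the first three bytes of a file (`open(path, "rb").read(3)`) and passing them to `bom_code` gives the value that endian.bat tests with IF ERRORLEVEL. As with endian.exe, a file without a BOM is reported as ANSI even if it is not.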
jaclaz Posted October 8, 2012

tomasz86 said:
> @jaclaz
> Can you view a text file in hex in the command-line? After googling the only "method" which I've managed to find is to use "debug.exe" but it actually displays "00"s in ANSI files too.
>
> I wonder what you think about this. If you ECHO something to a text file coded in UCS-2 Little Endian from CMD (without the /U switch) the text will be completely broken. I'm thinking about ECHOing a specific string to those HIVE*.INF files and then just search for it with FINDSTR. If it can't find it then it will mean that the file is UCS-2 Little Endian.

Well, you have gsar already used in your "standard" set of tools, haven't you?

I doubt - no offence intended - that an ANSI file contains hex 00, but you can (better) use gsar to find the initial FFFE as jumper suggested.

In any case, FOR /F won't "like" Unicode, thus:

    @ECHO OFF
    ::ISANSI.CMD - small example batch to check if a file is ANSI or UNICODE
    FOR /F %%A in (%1) DO ECHO ANSI&GOTO :EOF
    ECHO UNICODE

BUT this won't work the same:

    @ECHO OFF
    ::NOANSI.CMD - small example batch that will always return ANSI
    FOR /F %%A in ('TYPE %1') DO ECHO ANSI&GOTO :EOF
    ECHO UNICODE

jaclaz

tomasz86 Posted October 8, 2012

Well, in this particular script the only "non-Windows" tool is gsar.exe...

But at the moment I actually managed to solve this specific problem with this very simple check:

    SETLOCAL ENABLEDELAYEDEXPANSION
    FINDSTR /IL "[Version]" I386\hivedef.inf >NUL
    IF !ERRORLEVEL! EQU 0 (
        TYPE temp.txt>>I386\hivedef.inf
    ) ELSE (
        CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
    )

FINDSTR can't find "[Version]" if the file is encoded in UCS-2 Little Endian.

Edited October 8, 2012 by tomasz86

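The logic of this check can be illustrated in Python: look for the marker as a plain byte sequence, then append the new lines in whichever encoding matches. The names `has_ansi_marker` and `append_matching` are made up for this sketch, and `cp1252` is only an assumed ANSI code page; the real code page depends on the system the INF files come from.

```python
def has_ansi_marker(data: bytes, marker: str = "[Version]") -> bool:
    """Case-insensitive search for the marker as single-byte text.

    In UCS-2 Little Endian each character is followed by 0x00, so the
    contiguous single-byte sequence is not present.
    """
    return marker.lower().encode("ascii") in data.lower()


def append_matching(original: bytes, addition: str,
                    ansi_codec: str = "cp1252") -> bytes:
    """Append the new text in the encoding the existing content uses."""
    if has_ansi_marker(original):
        return original + addition.encode(ansi_codec)
    # No BOM is written here: the existing file already starts with one.
    return original + addition.encode("utf-16-le")
```

Like the FINDSTR version, this treats every file that lacks a readable "[Version]" as UCS-2 Little Endian, so it is only safe for files known to contain that section.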
allen2 Posted October 8, 2012

The BOM (if present) might help to detect the encoding of the file.

And this VBS will report the encoding of the file passed as an argument:

    ' Return the byte at position pos as two hex digits.
    ' Hex() alone drops the leading zero, so "00" and "0E" would never match.
    Function byteHex(data, pos)
        byteHex = Right("0" & Hex(AscB(MidB(data, pos, 1))), 2)
    End Function

    Function encoding(fpn)
        ' Read the whole file as binary data
        Set file = CreateObject("ADODB.Stream")
        file.Type = 1
        file.Open
        file.LoadFromFile fpn
        data = file.Read
        file.Close
        ' First four bytes of the file
        a = byteHex(data, 1)
        b = byteHex(data, 2)
        c = byteHex(data, 3)
        d = byteHex(data, 4)
        encoding = "unknown ascii"
        If a = "EF" AND b = "BB" AND c = "BF" Then encoding = "UTF-8"
        If (a = "FE" AND b = "FF" AND not c = "00") Then encoding = "UTF-16 (BE)"
        If (a = "FF" AND b = "FE") Then encoding = "UTF-16 (LE)"
        If (a = "00" AND b = "00" AND c = "FE" AND d = "FF") Then encoding = "UTF-32 (BE)"
        If (a = "FF" AND b = "FE" AND c = "00" AND d = "00") Then encoding = "UTF-32 (LE)"
        If (a = "2B" AND b = "2F" AND c = "76" AND (d = "38" or d = "39" or d = "2B" or d = "2F")) Then encoding = "UTF-7"
        If (a = "F7" AND b = "64" AND c = "4C") Then encoding = "UTF-1"
        If (a = "DD" AND b = "73" AND c = "66" AND d = "73") Then encoding = "UTF-EBCDIC"
        If (a = "0E" AND b = "FE" AND c = "FF") Then encoding = "SCSU"
        If (a = "FB" AND b = "EE" AND c = "28") Then encoding = "BOCU-1"
        If (a = "84" AND b = "31" AND c = "95" AND d = "33") Then encoding = "GB-18030"
    End Function

    WScript.Echo encoding(WScript.Arguments.Item(0))

But many files don't contain a BOM.

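For files without a BOM, one possible fallback is to look at where the zero bytes fall. The Python sketch below is a heuristic invented for illustration, not something proposed in the thread. It assumes the text is mostly ASCII, as INF files usually are, and the function name and the "more than half" threshold are arbitrary choices.

```python
def guess_utf16_without_bom(data: bytes) -> str:
    """Guess the byte order of BOM-less, mostly-ASCII UTF-16 text."""
    pairs = len(data) // 2
    if pairs == 0:
        return "unknown"
    even_nuls = data[0::2].count(0)  # zero bytes at offsets 0, 2, 4, ...
    odd_nuls = data[1::2].count(0)   # zero bytes at offsets 1, 3, 5, ...
    # ASCII characters in little-endian order are stored as <char> 00,
    # so the zeros sit at odd offsets; big-endian order is the reverse.
    if odd_nuls > pairs // 2 and even_nuls == 0:
        return "UTF-16 (LE)"
    if even_nuls > pairs // 2 and odd_nuls == 0:
        return "UTF-16 (BE)"
    return "unknown"
```

Text made up mainly of non-Latin characters (Korean, for example) has few zero bytes in either position, so it comes back as "unknown" rather than being misreported.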
jaclaz Posted October 8, 2012

It is clear how we all give a different meaning to the word "simple".

    FOR /F %%A in (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
    CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
    :skip

jaclaz

Yzöwl Posted October 8, 2012

As some system files are encoded in ANSI, is it necessary to append using the same encoding? What happens if you convert the file to ANSI and append to it as such?

tomasz86 Posted October 9, 2012

jaclaz said:
> It is clear how we all give a different meaning to the word "simple".
>
>     FOR /F %%A in (I386\hivedef.inf) DO TYPE temp.txt>>I386\hivedef.inf&GOTO :skip
>     CMD /U /C "TYPE temp.txt>>I386\hivedef.inf"
>     :skip

Thank you. I meant simpler when compared to the other options, where other tools must be used.

Yzöwl said:
> As some systems files are encoded in ANSI, is it a necessity to append using the same encoding? What happens if you convert the file to ANSI and use / append as such?

It should be possible in the case of these HIVE*.INF files because they use only English and Korean characters, but what about Unicode files where many different languages are used, like this one?

Netrtle.7z

If you check the [strings.*] sections you'll see that there are a lot of them for several different languages. If I try to convert such a file to ANSI, many characters are lost.

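The loss described here can be checked before converting anything. The Python sketch below lists the characters of a string that a given ANSI code page cannot represent. The function name is made up for illustration, and `cp1252` (Western European) and `cp949` (Korean) are simply the codec names Python uses for those code pages.

```python
def unencodable_chars(text: str, codec: str) -> list:
    """Return the distinct characters of text that codec cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in lost:
                lost.append(ch)
    return lost
```

For example, `unencodable_chars("한국어", "cp949")` returns an empty list, while the same string checked against `cp1252` returns all three characters. An INF file with [strings.*] sections in several languages is unlikely to come back empty for any single ANSI code page.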
allen2 Posted October 9, 2012

Completely off topic, but the file netrtle.inf you attached contains an "r" at the beginning of the first line which shouldn't be there, no matter the encoding or the INF file.

tomasz86 Posted October 9, 2012

It's just a typo of mine. You can safely ignore it.

jumper Posted October 10, 2012

For those with pre-ME systems that don't support FINDSTR, here's a batch file that uses FIND to process one file from the command line or a whole directory of files:

    @echo off
    if not %1*==* goto TEST
    for %%s in (*.inf) do call %0 %%s
    goto EXIT
    :TEST
    find>nul /i "[Version]" %1
    IF ERRORLEVEL 1 goto NONANSI
    :ANSI
    echo %1 is ANSI
    goto EXIT
    :NONANSI
    echo %1 is non-ANSI
    :EXIT