Yadong Yadong - 1 year ago 82
Vb.net Question

Non-ASCII characters read from a txt file with "System.IO.StreamReader.ReadLine" are modified by mistake

This looks like a simple question, but I cannot find a good answer to it.
What I am doing is reading one line of a txt file, then parsing it with string.substring(12, 45) to get a substring, which is actually a piece of hex data. (It's a long story why I got into this situation.)
The following screenshot shows how it looks in Notepad++:

[screenshot: the file as displayed in Notepad++]

The hex values are not encoded in any way; each character simply represents a number. I would like to convert this string to a hex array. During testing I found that most of the characters are converted to hex correctly, but some of them come out wrong.
[screenshot]

For example, in the attached picture, I want to parse out the string "00 00 02 87 50 0C" and then convert it into the hex array [0][0][0][0][0][2][8][7][5][0][0][C]. But the hex value "87" cannot be converted correctly.

After taking a deeper look, I found that the cause is ReadLine(). During ReadLine(), the non-ASCII characters do not keep their original values. As a test, I read all lines one by one from the input file and wrote them to an output file; the non-ASCII characters had been changed to something else.
The code I use to read the file is:

Dim fileInput As System.IO.StreamReader = New System.IO.StreamReader("d:\temp\xyz.txt")

Do While fileInput.Peek() <> -1
    Dim oneLine As String = fileInput.ReadLine()
    ' ... blabla
Loop


So is there any way to read the file line by line without changing those non-ASCII characters by mistake?


Answer Source

It seems you want to read some bytes from a file, after some number of lines, as binary, into a byte array.

Since your data is line-oriented with 0d 0a line-endings, it makes some sense to read it as text. However, since it also contains binary, you have to read it with an encoding that allows all byte values 0-255 in any order.

UTF-8 is the default for System.IO.StreamReader. It does not meet this requirement because it does not allow arbitrary ordering of all values. (It encodes some Unicode codepoints into multiple 8-bit code units and they have a particular pattern.)
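To illustrate why a UTF-8 round trip loses data, here is a minimal sketch that decodes the six bytes from the question's example as UTF-8 and re-encodes them. The module name and variable names are made up for the illustration; only the byte values come from the question.

```vbnet
Imports System
Imports System.Text

Module Utf8RoundTripDemo
    Sub Main()
        ' Bytes from the question's example: 00 00 02 87 50 0C
        Dim original As Byte() = {&H0, &H0, &H2, &H87, &H50, &HC}

        ' Decode as UTF-8 (the StreamReader default), then encode again.
        Dim decoded As String = Encoding.UTF8.GetString(original)
        Dim roundTripped As Byte() = Encoding.UTF8.GetBytes(decoded)

        Console.WriteLine(BitConverter.ToString(original))     ' 00-00-02-87-50-0C
        Console.WriteLine(BitConverter.ToString(roundTripped)) ' 00-00-02-EF-BF-BD-50-0C
    End Sub
End Module
```

The byte 87 is not a valid UTF-8 sequence on its own, so the decoder substitutes U+FFFD (the replacement character), which re-encodes as EF BF BD. The original value is gone by the time the string reaches Substring.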

You could use CP437. It allows all values 0-255 in any order and 0d is CR and 0a is LF so it is compatible with the line-endings in your data. It also encodes all characters in 1 byte.
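One way to check the "all values 0-255" claim for yourself is to push every byte value through CP437 and back. This is only a sketch: the module and variable names are invented here, and the RegisterProvider line assumes .NET Core 3.0 or later (or .NET 5+), where code page 437 is not available until the code pages provider is registered. On .NET Framework that line should not be needed.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module Cp437RoundTripCheck
    Sub Main()
        ' Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy code pages
        ' such as 437 must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Dim cp437 As Encoding = Encoding.GetEncoding(437)

        ' Every possible byte value, 0 through 255.
        Dim allBytes As Byte() = Enumerable.Range(0, 256).Select(Function(i) CByte(i)).ToArray()

        ' Decode to a string, then encode back to bytes.
        Dim decoded As String = cp437.GetString(allBytes)
        Dim restored As Byte() = cp437.GetBytes(decoded)

        ' Reports whether the round trip preserved every byte.
        Console.WriteLine(allBytes.SequenceEqual(restored))
    End Sub
End Module
```

If the check reports a mismatch on your runtime, the encoding is not safe for this purpose there and a different single-byte encoding would have to be tested the same way.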

So read to the line you want. Skip to the character position you want and take the substring of characters that your binary data decoded to, then re-encode it as CP437 to get the bytes back.

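' Named cp437 rather than "encoding": VB is case-insensitive, so a local called
' "encoding" would clash with the Encoding type on the same line.
Dim cp437 = System.Text.Encoding.GetEncoding(437)
' get your line (read through a StreamReader created with cp437)
Dim binaryDecodedAsCp437 = onelineDecodedAsCp437.Substring(12, 45)
Dim byteArray = cp437.GetBytes(binaryDecodedAsCp437)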

Since you are encoding back and forth with CP437, the original bytes will be restored.
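Putting the pieces together, a fuller sketch of the approach might look like the following. The file path and the Substring(12, 45) arguments are taken from the question; the module name, the length guard, and the printing are assumptions added for illustration. As above, the RegisterProvider call assumes .NET Core 3.0+ / .NET 5+.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ReadBinaryFromTextLines
    Sub Main()
        ' Assumption: modern .NET, where code page 437 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Dim cp437 As Encoding = Encoding.GetEncoding(437)

        ' Pass False so the reader never switches encodings if the file
        ' happens to start with bytes that look like a byte order mark.
        Using reader As New StreamReader("d:\temp\xyz.txt", cp437, False)
            Dim oneLine As String = reader.ReadLine()
            Do While oneLine IsNot Nothing
                ' Guard: Substring throws if the line is shorter than 12 + 45 characters.
                If oneLine.Length >= 12 + 45 Then
                    ' One character per byte under CP437, so character offsets
                    ' are also byte offsets within the line.
                    Dim payload As Byte() = cp437.GetBytes(oneLine.Substring(12, 45))
                    Console.WriteLine(BitConverter.ToString(payload))
                End If
                oneLine = reader.ReadLine()
            Loop
        End Using
    End Sub
End Module
```

One caveat with any line-based read: ReadLine treats 0d, 0a, and 0d 0a as line terminators, so if the binary portion can itself contain the bytes 0d or 0a, the line will be split early. In that case reading the file as raw bytes (for example with File.ReadAllBytes) and locating the line breaks yourself may be the safer route.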