bernatel bernatel - 3 months ago 9
Python Question

Copying from a binary file using Python inserts new bytes (?)

I am trying to open a file created by a piece of measurement equipment, find the bytes corresponding to metadata, and write everything else to a new binary file. (The metadata part is not the problem: I know the headers and can find them easily, so let's not worry about that.)

The problem is: when I open the file and write the bytes into a new file, new bytes are added, which messes up the relevant data. Specifically, every time there is a '0A' byte in the original file, the new file has a '0D' byte before it.
I've gone through a few iterations of trimming down the code to find the issue. Here is the latest and simplest version, in three different ways that all produce the same result:

import os
import mmap

file_name = raw_input('Name of the file to be edited: ')
f = open(file_name, 'rb')

#1st try: using mmap, to make the metadata search easier
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
full_data = s.read(len(s))
with open(os.path.join('.', 'edited', ('[mmap data]' + file_name + '.bin')), 'a') as data_mmap:
    data_mmap.write(full_data)

#2nd try: using bytes, in case mmap was giving me trouble

f_byte = bytes(f.read())
with open(os.path.join('.', 'edited', ('[bytes data]' + file_name + '.bin')), 'a') as data_bytes:
    data_bytes.write(f_byte)

s.close()
f.close()

#3rd try: using os.read/write(file) instead of file.read() and file.write().
from os.path import getsize

o = os.open(file_name,os.O_BINARY) #only available on Windows
f_os = bytes(os.read(o,getsize(file_name)))
with open(os.path.join('.', 'edited', ('[os data]' + file_name + '.bin')), 'a') as data_os:
    os.write(data_os.fileno(), f_os)

os.close(o)


The resulting files are all identical to each other (compared with HxD). They are also almost identical to the original file, except for the inserted bytes. For example, starting at offset 0120 the original file reads:
A0 0A 00 00
whereas the new file reads:
A0 0D 0A 00
...and then everything is exactly the same until the next occurrence of 0A in the original file, where a 0D byte appears again.

Since the code is really simple, I assume the error comes from the read function (or perhaps from some unavoidable inherent behaviour of the OS; I'm using Python 2.7 on Windows, BTW).
I also suspected the data format at first, but it seems to me it should be irrelevant. I am just copying everything, regardless of value.

I found no documentation that could help, so... anyone know what's causing that?

Edit: the same script works fine on Linux, by the way. So while it was not a big problem, it was very very annoying.

Answer

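Welcome to the world of end of line markers! When a file is opened in text mode under Windows, any raw \n (hex 0x0a) is written as \r\n (hex 0x0d 0x0a). Your reads are fine (the input is opened with 'rb'), but all three output files are opened with mode 'a', which is text mode, so the translation happens on the write side.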
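To make the effect concrete, here is a small sketch that imitates the translation on the four bytes quoted in the question. The helper name simulate_windows_text_write is made up for illustration; it only mimics the \n to \r\n substitution with a plain bytes replace and does not call any Windows-specific API.

```python
def simulate_windows_text_write(data):
    """Mimic a text-mode write on Windows: every LF (0x0A)
    comes out as CR LF (0x0D 0x0A). Hypothetical helper, for illustration only."""
    return data.replace(b"\n", b"\r\n")


def to_hex(data):
    """Format bytes as space-separated uppercase hex, like a hex editor row."""
    return " ".join("%02X" % b for b in bytearray(data))


original = b"\xa0\x0a\x00\x00"          # the bytes at offset 0120 in the question
translated = simulate_windows_text_write(original)

print(to_hex(original))    # A0 0A 00 00
print(to_hex(translated))  # A0 0D 0A 00 00
```

Each 0A in the input costs one extra byte in the output, which is why everything after the first 0A appears shifted in the hex editor.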

Fortunately it is easy to fix: just open the output file in binary mode (note the b):

with open(..., 'ab') as data_...:

and the unwanted \r will no longer bother you :-)
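As a rough sketch of the whole round trip, the snippet below copies a file with both ends opened in binary mode and checks that the copy matches byte for byte. The function name copy_binary, the chunk size, and the temporary file names are assumptions made for this example. It uses 'wb', which overwrites the destination; keep 'ab' if you really want to append as in the question.

```python
import os
import tempfile


def copy_binary(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path without any newline translation.
    Both files are opened in binary mode ('rb' / 'wb')."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            dst.write(chunk)


# Self-check with throwaway files containing plenty of 0x0A bytes.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "original.bin")
dst_path = os.path.join(work_dir, "copy.bin")

sample = b"\xa0\x0a\x00\x00" * 1000
with open(src_path, "wb") as f:
    f.write(sample)

copy_binary(src_path, dst_path)

with open(dst_path, "rb") as f:
    copied = f.read()

print(copied == sample)
```

Reading in chunks is optional for small files; a single read() followed by a single write() behaves the same as long as both files are opened with the b flag.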