Hacksign Hacksign - 2 months ago 5
Linux Question

What will happen if I delete an input file while some program is reading data from that file?

Suppose there is a Python script doing this:

with open('large_input_file.log', 'rb') as f:
    for each_line in f:
        pass  # do something with each_line ...


Let's call this script a.py.

large_input_file.log is about 16GB. a.py will take hours to process this file.

What will happen if I do this (under Linux):


  1. keep a.py running
  2. delete large_input_file.log
  3. replace large_input_file.log with different content but same name



Will a.py still get the correct data from the original large_input_file.log, as it was before I deleted it? (I guess this is what will happen.)

Or will a.py get new data, starting at the same offset, from the new large_input_file.log?


Can you explain it at the kernel level or filesystem level? (How does Linux accomplish this?)

-----------------Added below after some answers------------------------

What if my disk is only 16 GB, so it can store only one large_input_file.log?

What will happen if I delete large_input_file.log and create another 16 GB large_input_file.log file?

Answer

Let's create a file:

# echo foo > test.txt

Now we'll use tail to monitor it for changes:

# tail -f test.txt
foo

Let's open another tab on our terminal, and check the pid of our tail process:

# ps aux | grep -i tail
root      5458  0.0  0.0   7484   724 ?        S    Sep15   0:13 tail -f -n 0 /var/log/syslog
root      5919  0.0  0.0   7484   784 ?        S    Sep15   0:13 tail -f -n 0 /var/log/syslog
root      6381  0.0  0.0   7484   840 ?        S    Sep15   0:14 tail -f -n 0 /var/log/syslog
emil     27789  0.0  0.0   8852   784 pts/8    S+   12:26   0:00 tail -f test.txt
emil     27826  0.0  0.0  15752  1016 pts/9    S+   12:26   0:00 grep -i tail

So, in my case the pid is 27789. We can look at the open files of the process by checking the /proc/27789/fd directory:

# ls -lah /proc/27789/fd/
total 0
dr-x------ 2 emil emil  0 Sep 20 12:26 .
dr-xr-xr-x 9 emil emil  0 Sep 20 12:26 ..
lrwx------ 1 emil emil 64 Sep 20 12:26 0 -> /dev/pts/8
lrwx------ 1 emil emil 64 Sep 20 12:26 1 -> /dev/pts/8
lrwx------ 1 emil emil 64 Sep 20 12:26 2 -> /dev/pts/8
lr-x------ 1 emil emil 64 Sep 20 12:26 3 -> /home/emil/test.txt
lr-x------ 1 emil emil 64 Sep 20 12:26 4 -> anon_inode:inotify
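
Since the question is about a Python script, the same listing can be produced from inside a process. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper name open_file_targets is made up for this illustration and is not part of any library.

```python
import os
import tempfile


def open_file_targets(pid="self"):
    """Return {fd: target} for the entries of /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor used to scan the directory itself may already
            # be closed by the time we try to resolve it; skip it.
            continue
    return targets


if __name__ == "__main__":
    # Hold one extra file open so it shows up next to stdin/stdout/stderr.
    with tempfile.NamedTemporaryFile() as tmp:
        for fd, target in sorted(open_file_targets().items()):
            print(fd, "->", target)
```

Each entry is a symbolic link maintained by the kernel, which is why reading it with os.readlink gives the same targets that ls prints.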

Here we see that tail holds file descriptor 3, which refers to test.txt. What if we delete the file?

# rm test.txt
# ls -lah /proc/27789/fd
total 0
dr-x------ 2 emil emil  0 Sep 20 12:26 .
dr-xr-xr-x 9 emil emil  0 Sep 20 12:26 ..
lrwx------ 1 emil emil 64 Sep 20 12:26 0 -> /dev/pts/8
lrwx------ 1 emil emil 64 Sep 20 12:26 1 -> /dev/pts/8
lrwx------ 1 emil emil 64 Sep 20 12:26 2 -> /dev/pts/8
lr-x------ 1 emil emil 64 Sep 20 12:26 3 -> /home/emil/test.txt (deleted)
lr-x------ 1 emil emil 64 Sep 20 12:26 4 -> anon_inode:inotify

The file descriptor still exists, but the link target is now marked "(deleted)", so ls helpfully shows us that the file has been removed.
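
To tie this back to the original question, here is a small sketch of the delete-and-replace scenario in Python. It is only an illustration: the function name read_across_replace and the tiny file contents are invented, and the file is opened unbuffered so that the later reads really go to the kernel instead of being served from Python's own buffer.

```python
import os
import tempfile


def read_across_replace():
    """Open a file, delete and recreate its name, then keep reading the old handle."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "large_input_file.log")
    with open(path, "wb") as f:
        f.write(b"old line 1\nold line 2\n")

    # buffering=0 so nothing beyond the first line is cached in user space.
    reader = open(path, "rb", buffering=0)
    first = reader.readline()        # read part of the file before deleting it

    os.remove(path)                  # step 2: delete the name
    with open(path, "wb") as f:      # step 3: new file under the same name
        f.write(b"new line 1\nnew line 2\n")

    rest = reader.read()             # continues in the original, deleted file
    reader.close()

    with open(path, "rb") as f:      # a fresh open sees the replacement
        fresh = f.read()

    os.remove(path)
    os.rmdir(workdir)
    return first, rest, fresh


if __name__ == "__main__":
    print(read_across_replace())
```

The handle that was open before the delete keeps returning the old content, while any open performed afterwards resolves the name again and gets the new file.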

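As Igor says, each file has a physical location on disk where the raw data exists. To find files, the filesystem keeps directory entries that map file names to inodes, and each inode records where the file's data blocks live. Removing a file doesn't wipe the data from disk; it removes the directory entry and decrements the inode's link count. Normally the inode and its blocks are released once that count reaches zero, and the data then stays on disk only until something else overwrites it. In this specific case, though, the kernel also tracks open references to the inode, so the file continues to exist, and won't be overwritten, until no process has it open any longer.

That is why a.py keeps reading the original data: a new large_input_file.log created under the same name gets a different inode, and the descriptor a.py already holds never sees it. It also means the deleted file's 16 GB stays allocated while a.py has it open, so on a 16 GB disk there would be no room for a second full-size file until a.py closes the file or exits.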
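
The inode bookkeeping can be observed from user space with fstat. The sketch below is hedged: inode_report is a made-up helper, the link count dropping to zero is what common local Linux filesystems report, and network filesystems such as NFS may behave differently.

```python
import os
import tempfile


def inode_report():
    """Compare the inode of an open-but-deleted file with its same-name replacement."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "test.txt")
    with open(path, "w") as f:
        f.write("foo\n")

    fd = os.open(path, os.O_RDONLY)
    before = os.fstat(fd)            # one name points at the inode
    os.unlink(path)                  # drop the only name
    after = os.fstat(fd)             # inode is still alive through the open fd

    with open(path, "w") as f:       # same name, but a brand-new inode
        f.write("bar\n")
    replacement = os.stat(path)

    old_data = os.read(fd, 100)      # old content is still readable
    os.close(fd)                     # last reference gone: space can be freed
    os.unlink(path)
    os.rmdir(workdir)
    return {
        "old_inode": before.st_ino,
        "new_inode": replacement.st_ino,
        "nlink_before": before.st_nlink,
        "nlink_after": after.st_nlink,
        "old_data": old_data,
    }


if __name__ == "__main__":
    for key, value in inode_report().items():
        print(key, "=", value)
```

The two inode numbers differ because the old inode is still in use when the replacement is created, which shows that a reused name and the original file are unrelated objects as far as the kernel is concerned.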