Samuel Samuel - 1 month ago 16
Linux Question

"Stale file handle" error when a process tries to read a file that another process has already deleted

I'm writing a stress test suite for testing distributed file systems over NFS.

In some cases, when one process deletes a file while another process attempts to read from it, I get a "Stale file handle" error (errno 116).

Is this kind of error expected and acceptable in such a race condition?

The test works as follows:


  1. Start x client machines.

  2. Each client machine runs y processes.

  3. Each process can perform any of the file operations stat/read/delete/open.

  4. These file operations use the standard Python calls os.stat/read/os.remove/open.

  5. All files are empty (0 bytes of data).



The file exists, as a successful stat operation shows:


controller_debug.log.2:2016-10-26 15:02:30,156;INFO -
[LG-E27A-LNX:0xa]: finished 640522b4d94c453ea545cb86568320ca, result:
success | stat |
/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41
| data: {} | 2016/10/26 15:02:30.156


Process 0x1 on client CLIENT-A completed a successful delete:


controller_debug.log.2:2016-10-26 15:02:30,164;INFO -
[CLIENT-A:0x1]: finished 5f5dfe6a06de495f851745a78857eec1, result:
success | delete |
/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41
| data: {} | 2016/10/26 15:02:30.161


3 milliseconds later, process 0xb on client CLIENT-B failed a read operation with "Stale file handle":


controller_debug.log.2:2016-10-26 15:02:30,164;INFO -
[CLIENT-B:0xb]: finished e84e2064ead042099310af1bd44821c0, result:
failed | read |
/mnt/DIRSPLIT-node0.b27-1/JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41
| [errno:116] | Stale file handle | 142 | data: {} | 2016/10/26 15:02:30.160
controller_debug.log.2:2016-10-26 15:02:30,164;ERROR -
Operation read FAILED UNEXPECTEDLY on File
JUyw481MfvsBHOm1KQu7sHRB6ffAXKjwIATlsXmOgWh8XKQaIrPbxLgAo7sucdAM/o6V266xE8bTaUGzk8YDMfDAJp0YIfbT4fIK1oZ2R20tRX3xFCvjISj7WuMEwEV41
due to Stale file handle


Thanks

Answer

This is entirely expected. The NFS specification is clear about what happens when a file handle is used after its object (file or directory) has been deleted. Section 4, which covers filehandles, addresses this directly. For example:

The persistent filehandle will become stale or invalid when the file system object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE.

This is such a common problem that it even has its own entry in section A.10 of the NFS FAQ, which says one common cause of ESTALE errors is that:

The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
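For a stress test that deliberately races deletes against reads, one way to act on this is to classify the failure instead of flagging every error as unexpected. The snippet below is only a sketch: the `classify_failure` helper, the `deleted_concurrently` flag, and the set of "expected" error codes are assumptions for illustration, not part of the test suite in the question. On Linux, the 116 in the log corresponds to `errno.ESTALE`.

```python
import errno

# Errors a reader may legitimately see when another client deletes the file.
# This set is an assumption for illustration, not taken from the test suite.
EXPECTED_RACE_ERRNOS = {errno.ENOENT, errno.ESTALE}


def classify_failure(exc, deleted_concurrently):
    """Return 'expected' or 'unexpected' for a failed file operation.

    `deleted_concurrently` is a hypothetical flag the harness would set
    when it knows a delete was issued for the same path around the same time.
    """
    if isinstance(exc, OSError) and exc.errno in EXPECTED_RACE_ERRNOS:
        return "expected" if deleted_concurrently else "unexpected"
    return "unexpected"
```

With something like this in place, an ESTALE on a file nobody deleted would still be reported as a real problem.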

The expected resolution is that your client app must close and reopen the file to see what has happened. Or, as the FAQ says:

... to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
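In Python, the close-and-reopen recovery could look roughly like the following. This is a minimal sketch, not a definitive implementation: the function name `read_with_reopen` and the `retries` and `opener` parameters are made up for illustration, and it assumes that a file which has really been deleted should be reported as `None` rather than as an error.

```python
import errno
import os


def read_with_reopen(path, retries=3, opener=open):
    """Read a file, reopening it by path if its handle has gone stale.

    Returns the file's bytes, or None if the path no longer exists.
    `retries` and `opener` are illustrative parameters, not from the post.
    """
    for _ in range(retries):
        try:
            # Opening by path lets the NFS client resolve the name again.
            with opener(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            # The path no longer resolves: the file really was deleted.
            return None
        except OSError as exc:
            if exc.errno != errno.ESTALE:
                raise
            # Stale handle: the with-block has closed the file; loop to reopen.
    # Still stale after all attempts: surface the error to the caller.
    raise OSError(errno.ESTALE, os.strerror(errno.ESTALE), path)
```

In the scenario from the question, the reopen would most likely end in "No such file or directory", which tells the reader unambiguously that the file is gone.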
