I have often noticed that some folks reboot systems to fix stale NFS mount problems, which can be disruptive.
Fortunately, that often isn’t necessary. Usually all you have to do is restart the nfs and autofs services. However, that sometimes fails because user processes have files open on the stale partition or users are cd’ed to the stale partition.
Both conditions are easy to fix. The steps below address each of them in turn.
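Before changing anything, it can help to confirm which mount is actually stale without hanging your own shell. The following is a minimal sketch rather than part of the procedure itself: mount_responds is a hypothetical helper name, it assumes the coreutils timeout command is installed, and it assumes the kernel lets SIGKILL interrupt a process that is blocked on NFS, which may not hold on older kernels or with some mount options.

```shell
# Hypothetical helper: succeed only if path $1 answers a stat call
# within $2 seconds (default 5). A stale handle or a timeout both count as failure.
mount_responds() {
    timeout -s KILL "${2:-5}" stat -- "$1" >/dev/null 2>&1
}
```

For example, mount_responds /stale/fs || echo "stale or unreachable" reports a problem mount after at most five seconds instead of blocking indefinitely.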
Step 1. Kill processes with open files on the partition
Use lsof to find the processes that have files open on the partition and then kill those processes using kill or pkill.
    % sudo su -

    % # Find the jobs that are accessing the stale partition and kill them.
    % kill -9 $(lsof |\
        egrep '/stale/fs|/export/backup' |\
        awk '{print $2;}' |\
        sort -fu )

    % # Restart the NFS and AUTOFS services
    % service nfs stop
    % service autofs stop
    % service nfs start
    % service autofs start

    % # Check it
    % ls /stale/fs
Typically this is sufficient but if it fails, you need to go to step 2.
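Because kill -9 gives no second chance, it can be worth previewing the PID list from step 1 before handing it to kill. The sketch below is one way to do that under stated assumptions: pids_from_lsof is a hypothetical helper name, it reads lsof output on standard input, and like the egrep in step 1 it matches the path anywhere on the line rather than parsing the NAME column.

```shell
# Hypothetical helper: read `lsof` output on stdin and print, once each,
# the PIDs (column 2) of lines that mention the path given as $1.
# NR > 1 skips the lsof header line; index() does a literal substring match.
pids_from_lsof() {
    awk -v path="$1" 'NR > 1 && index($0, path) { print $2 }' | sort -un
}
```

Running lsof | pids_from_lsof /stale/fs prints the candidates; once the list looks right, the same pipeline can be placed inside kill -9 $( ... ). The lsof man page also documents a -b option that avoids kernel calls that might block; whether it helps on a particular stale mount is something to test rather than assume.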
Step 2. Kill processes that have cd’ed to the partition
Look at the current working directory of every user process. If any process has its working directory on the partition, that process has to be killed.
    % sudo su -

    % # List the users that are cd'ed to the stale partition and kill their jobs.
    % # NOTE: change /stale/fs to the path to your stale partition.
    % kill -9 $(
        for u in $( who | awk '{print $1;}' | sort -fu ) ; do \
          pwdx $(pgrep -u $u) |\
          grep '/stale/fs' |\
          awk -F: '{print $1;}' ; \
        done)

    % # umount the stale partition
    % umount -f /stale/fs

    % # Restart the NFS and AUTOFS services
    % service nfs stop
    % service autofs stop
    % service nfs start
    % service autofs start

    % # Check it
    % ls /stale/fs
If that fails, you have to kill all of the user processes.
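As a variation on step 2, the working directories can also be read straight from the cwd links under /proc, which avoids the per-user loop. This is a minimal sketch assuming a Linux /proc filesystem: pids_with_cwd_under is a hypothetical helper name, and its optional second argument (the proc root) exists only so the function can be tried against a fake directory tree. Reading the link should not normally require contacting the NFS server, but treat that as an assumption to verify on your own kernel.

```shell
# Hypothetical helper: print the PIDs whose current working directory
# is the directory $1 or anything below it.
# $2 is the proc root (default /proc); it is a parameter only for experimenting.
pids_with_cwd_under() {
    prefix=${1%/}
    procroot=${2:-/proc}
    for link in "$procroot"/[0-9]*/cwd; do
        # readlink fails for processes that have exited or that we may not inspect
        cwd=$(readlink "$link" 2>/dev/null) || continue
        case $cwd in
            "$prefix"|"$prefix"/*)
                pid=${link%/cwd}        # strip the trailing /cwd
                echo "${pid##*/}"       # keep only the PID component
                ;;
        esac
    done
}
```

Run as root, pids_with_cwd_under /stale/fs lists the processes to review; the result can then be passed to kill -9 in the same way as in step 2.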
Step 3. Kill all of the users
If step 2 doesn’t work then there is something strange going on but killing all of the user processes will usually fix it. That is done as follows.
    % sudo su -

    % # Kill all user processes.
    % for u in $( who | awk '{print $1;}' | sort -fu ) ; do \
        kill -9 $(pgrep -u $u) ; \
      done

    % # umount the stale partition
    % umount -f /stale/fs

    % # Restart the NFS and AUTOFS services
    % service nfs stop
    % service autofs stop
    % service nfs start
    % service autofs start

    % # Check it
    % ls /stale/fs
As you can see, it is basically the same as step 2 except that all user processes are killed.
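One hazard in the loops of steps 2 and 3 is that who also lists the administrator's own login, so the kill can take down the session that is running the repair. The following is a minimal sketch of filtering the user list first; logged_in_users_except is a hypothetical helper name, and alice in the usage note stands in for your own login name.

```shell
# Hypothetical helper: read `who` output on stdin and print each login name once,
# skipping every name passed as an argument.
logged_in_users_except() {
    skip=" $* "                      # pad with spaces so whole names are matched
    awk '{ print $1 }' | sort -u | while read -r u; do
        case $skip in
            *" $u "*) ;;             # excluded name: print nothing
            *) echo "$u" ;;
        esac
    done
}
```

With this in place, who | logged_in_users_except root alice can replace the who | awk | sort pipeline inside the for loop, so that only the remaining users' processes are killed.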
If that doesn’t work you need to resort to the nuclear option: rebooting.
Step 4. Reboot
This is the option of last resort but it should always work.
If you know of any other tips for fixing stale NFS mounts I would really like to hear about them.
Enjoy!
Often, umount -lf /stale/fs is sufficient, instead of restarting nfs and autofs.
That is an excellent suggestion. Thank you.
Often, “sudo umount -lf” is sufficient to fix things. It’s amazingly effective.
The problem is that hung nfs mounts also hang lsof. Definitely true for SLES10 and 11. Having that trouble right now… typed lsof half an hour ago and it still hasn’t come back. The same thing happens with a forced lazy umount. Maybe SUSE-specific behavior? Still a PITA.
Just a note, for OpenSuse 12.3: lsof hangs, but umount -lf did the trick instantly.
Thanks — saved me a reboot on a busy production system.
Thank you. That is an excellent suggestion.
Be careful with umount -lf commands. I had a situation where a super-block error occurred.
I would try umount --force first; I’m not that inclined to use lazy umount. Perhaps my filesystem was still busy at the time, even though lsof looked ok.
What about lsof hanging out? Is there any way to prevent this?
What do you mean by “lsof hanging out”?