-
Kirill Smelkov authored
Long ago vfs read/write operations were passing ppos=&file->f_pos directly to .read / .write file_operations methods. That changed in 2004 in 55f09ec0 ("read/write: pass down a copy of f_pos, not f_pos itself.") which started to pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping onto each other toes and overwriting file->f_pos racily. That measure was not complete and in 2014 commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added file->f_pos_lock to completely disable simultaneous read/write/lseek runs. After f_pos_lock was introduced the reason to avoid passing ppos=&file->f_pos directly due to concurrency vanished. Linus explains[1]: In fact, we *used* to (long ago) pass in the address of "file->f_pos" itself to the low-level read/write routines. We then changed it to do that indirection through a local copy of pos (and file_pos_read/file_pos_write) because we didn't do the proper locking, so different read/write versions could mess with each other (and with lseek). But one of the things that commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") did was to add the proper locking at least for the cases that we care about deeply, so we *could* say that we have three cases: - FMODE_ATOMIC_POS: properly locked, - FMODE_STREAM: no pos at all - otherwise a "mostly don't care - don't mix!" and so we could go back to not copying the pos at all, and instead do something like loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos; ret = vfs_write(f.file, buf, count, ppos); and perhaps have a long-term plan to try to get rid of the "don't mix" case entirely (ie "if you use f_pos, then we'll do the proper locking") (The above is obviously surrounded by the fdget_pos()/fdput_pos() that implements the locking decision). Currently for regular files we always set FMODE_ATOMIC_POS and change that to FMODE_STREAM if stream_open is used explicitly on open. That leaves other files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case. Sockets, for example, always check that on read/write the initial pos they receive is 0 and don't update it. And if it is !0 they return -ESPIPE. That suggests that we can do the switch into passing &file->f_pos directly now and incrementally convert to FMODE_STREAM files that were doing the stream-like checking manually in their low-level .read/.write handlers. Note: it is theoretically possible that a driver updates *ppos inside even if read/write returns error. For such cases the conversion will change IO semantic a bit. The semantic that is changing here was introduced in 2013 in commit 5faf153e "don't call file_pos_write() if vfs_{read,write}{,v}() fails". [1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
0192236d