我们在做数据库程序或者IO密集型的程序的时候,通常在更新的时候,比如说数据库程序,希望更新有一定的安全性,我们会在更新操作结束的时候调用fsync或者fdatasync来flush数据到持久设备去。而且通常是以页面为单位,16K一次或者4K一次。 安全性保证了,但是性能就有很大的损害。而且我们更新的时候,通常是更新文件的某一个页面,那么由于是更新覆盖操作,对文件系统的元数据来讲的话,无需变更,所以我们通常不大关心元数据是否写入。 当更新非常频繁的时候,我们时候能够有其他方法减少性能损失。sync_file_range同学出场了。
Linux下系统调用sync_file_range只在内核2.6.17及更高版本是可用的, 我们常用的RHEL 5U4是支持的。
下面是这篇文章的介绍:
The 2.6.17 kernel will include two new system calls which expand the capabilities available to user-space programs in some interesting ways. This article contains a look at the current form of these new interfaces.
splice()
The splice() system call has a long history. First proposed by Larry McVoy in 1998; it was seen as a way of improving I/O performance on server systems. Despite being often mentioned in the following years, no splice() implementation was ever created for the mainline Linux kernel. That situation changed, however, just before the 2.6.17 merge window was closed when Jens Axboe's splice() patch, along with a number of modifications, was merged.
As of this writing, the splice() interface looks like this:
long splice(int fdin, int fdout, size_t len, unsigned int flags);
A call to splice() will cause the kernel to move up to len bytes from the data source fdin to fdout. The data will move through kernel space only, with a minimum of copying. In the current implementation, at least one of the two file descriptors must refer to a pipe device. That requirement is a limitation of the current code, and it could be removed at some future time.
The flags argument modifies how the copy is done. Currently implemented flags are:
SPLICE_F_NONBLOCK: makes the splice() operations non-blocking. A call to splice() could still block, however, especially if either of the file descriptors has not been set for non-blocking I/O.
SPLICE_F_MORE: a hint to the kernel that more data will come in a subsequent splice() call.
SPLICE_F_MOVE: if the output is a file, this flag will cause the kernel to attempt to move pages directly from the input pipe buffer into the output address space, avoiding a copy operation.
Internally, splice() works using the pipe buffer mechanism added by Linus in early 2005 - that is why one side of the operation is required to be a pipe for now. There are two additions to the ever-growing file_operations structure for devices and filesystems which wish to support splice():
ssize_t (*splice_write)(struct inode *pipe, struct file *out,
size_t len, unsigned int flags);
ssize_t (*splice_read)(struct file *in, struct inode *pipe,
size_t len, unsigned int flags);
The new operations should move len bytes between pipe and either in or out, respecting the given flags. For filesystems, there are generic implementations of these operations which can be used; there is also a generic_splice_sendpage() which is used to enable splicing to a socket. As of this writing, there are no splice() implementations for device drivers, but there is nothing preventing such implementations in the future, for char devices at least.
Discussions on the linux-kernel suggest that the splice() interface could change before it is set in stone with the 2.6.17 release. Andrew Tridgell has requested that an offset argument be added to specify where copying should begin - either that, or a separate psplice() should be added. There is also some concern about error handling; if a splice() call returns an error, how does the application tell whether the problem is with the input or the output? Resolving those issues may require some interface changes over the next month or so.
sync_file_range()
Early in the 2.6.17 process, some changes to the posix_fadvise() system call were merged. The new, Linux-specific options were meant to give applications better control over how data written to files is flushed to the physical media. The capabilities provided are needed, but there were concerns about extending a POSIX-defined function in a Linux-specific way. So, after some discussions, Andrew Morton pulled that patch back out and replaced it with a new system call:
long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);
This call will synchronize a file's data to disk, starting at the given offset and proceeding for nbytes bytes (or to the end of the file if nbytes is zero). How the synchronization is done is controlled by flags:
SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any already in-progress writeout of pages (in the given range) completes.
SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the given range which are not already under I/O.