[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: changing semantics of the va_filerev (code review)



On Sun, 12 Apr 2009, Rick Macklem wrote:

In summary, the nfsv4 server needs 3 changes to the FreeBSD kernel:
...
3 - Support for the Change attribute, which is what this post is about.
...
As background, I believe va_filerev/i_modrev was added for nqnfs
long long ago. Since it is not exposed to userland via the stat structure,

va_gen/i_gen/di_gen/st_gen seems to be even more suitable for this
purpose, but it isn't actually a file generation number like its
comments say (it is normally set to a random value on file creation
then never changed) and it is exposed to userland (st_gen).

I don't believe anything outside of the kernel uses it. Inside the kernel,
the only thing that currently uses it is the nfs server, which uses it as
the cookie verifier. (It really doesn't use it, since a client regurgitates it back to the server as opaque bits in the next readdir rpc and the server then ignores those bits. This is correct, since va_filerev is a bogus cookie verifier.) As such, I don't believe changing the semantics of va_filerev will break anything in FreeBSD.

va_gen isn't used much either.  In ext2fs, i_gen is a copy of the
on-disk field i_generation which is documented to be /* for NFS */ but
nfs doesn't use va_gen at all.  nfs3 (getattr, loadattrcache) doesn't
even initialize va_gen.  nfsv2 initializes it to a non-random value
based on a timestamp.  I'm not sure if it does this only on creation
or on every cache miss or on every call.  I think the uninitialized
va_gen gives stack garbage in st_gen, but in tests I get 0 for both
nfsv3 and nfsv2 (as root -- st_gen is always 0 for non-root).  I don't
understand the security issues for *_gen, but remember its being changed
for security.  cvs history shows that it used to actually be a generation
number in at least ffs, but for ffs files and not for individual file
changes (or for individual ffs file systems or all file systems).

I'd like to change the semantics of va_filerev so that it can be used
by the nfsv4 server as the Change attribute. To do this, it needs to
change in 2 ways:
- must change upon metadata changes as well as data changes
- must persist across server reboots (ie. be moved to spare space in
	the on-disk i-node instead of in memory i-node)

Many nonstandard file systems, e.g., msdosfs, have no space to spare.

Read-only file systems like cd9660 and udf probably don't need a a
variable generation count (since they never change), but their current
handling of va_filerev and va_gen is wrong if these fields have any
other semantics.  These file systems just initialize va_gen to 1 for
all files and va_gen to 0 (with an XXX in udf only) for all files.

va_ctime should give what you want for all file systems, since it
should be increased whenever anything changes.  However, most file
systems always set the nsec part to 0, so va_ctime doesn't track
all file changes.  This is a problem for things like make(1) too,
so if nsec timestamps aren't available or are take too long or are
not fine-grained enough, the nsec part should be abused as a generation
counter so that any change gives a strictly larger timestamp.  The
case where someone sets the clock backwards is broken but won't
happen often.  Many nonstandard file systems, e.g., msdosfs, have
no space for an on-disk ctime, so they fake va_ctime using an on-disk
mtime.  Since such file systems don't have many attributes, only a
few more cases are broken.

Here is the patch to ufs for the above, that I have been using for some
time. Please review and comment.

...
--- ufs/ufs/ufs_vnops.c.sav	2009-04-12 02:28:41.000000000 -0400
+++ ufs/ufs/ufs_vnops.c	2009-03-10 16:47:11.000000000 -0400
@@ -157,11 +157,12 @@
	if (ip->i_flag & IN_UPDATE) {
		DIP_SET(ip, i_mtime, ts.tv_sec);
		DIP_SET(ip, i_mtimensec, ts.tv_nsec);
-		ip->i_modrev++;
+		DIP_SET(ip, i_modrev, DIP(ip, i_modrev) + 1);
	}
	if (ip->i_flag & IN_CHANGE) {
		DIP_SET(ip, i_ctime, ts.tv_sec);
		DIP_SET(ip, i_ctimensec, ts.tv_nsec);
+		DIP_SET(ip, i_modrev, DIP(ip, i_modrev) + 1);
	}

IN_UPDATE implies IN_CHANGE (unless there is a bug).  Thus the above
gives an extra increment.

Strictly, if you want to track _all_ metadata changes, then you need
an increment for IN_ACCESS, and va_ctime will no longer be nearly
usable since is not changed by read accesses.  I hope you don't want
this.

Bruce
_______________________________________________
freebsd-fs_(_at_)_freebsd_(_dot_)_org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe_(_at_)_freebsd_(_dot_)_org"


Visit your host, monkey.org