
Understanding readdir and persistent NFS duplication



I was trying to figure out what's going on and I wrote up some of my
thinking. Hopefully somebody will find this useful as a starting point
for a discussion on fixing the problem.

-Costa
csapuntz@mit.edu

The readdir problem

There have been reports about problems with persistent duplicate
entries appearing in directories under NFS. This document is an
attempt to think through the possible ways of getting duplicate
entries. The reason our NFS implementation yields duplicate entries
is towards the bottom (under the section labeled NFS), so feel free to
skip down there.

The opendir/readdir/closedir interface (which I will abbreviate
readdir for the purpose of this document) in OpenBSD has some interesting
semantics in the face of multiple concurrent operations against a
directory. Because of deficiencies in the interfaces underneath
readdir, files can disappear or appear twice over the course of a session
when using the interface.

What are the ideal semantics for readdir? Ideally, we'd like an atomic
snapshot of a directory. That is to say, we'd like the contents returned
to correspond to the directory at a single instant in time.

In reality, most applications do not need such robust semantics. It is
sufficient for the file system to display sequential consistency (that
is, operations by a process against a file system appear to occur in
order).  In addition, names which are not being modified should not
disappear then reappear between consecutive opendir/readdir
sessions. 

Currently, the opendir/readdir/closedir interface does satisfy the
weaker semantics. Opendir/readdir are implemented in terms of getdirentries:

       ssize_t getdirentries(int fd, char *buf, size_t nbytes,
           off_t *basep);

Getdirentries reads as many directory entries into the buffer as it
can and moves the seek pointer on the fd to the end of the entries it
returned. Now, the user can issue additional getdirentries calls to
continue reading the rest of the directory.
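
To make that concrete, here is a minimal sketch (not taken from the libc
sources, and with error handling omitted) of the kind of loop opendir/readdir
run underneath, using the prototype quoted above:

#include <sys/types.h>
#include <dirent.h>

void
read_whole_dir(int fd)
{
        char buf[8192];
        off_t base;
        ssize_t n;

        /* Each call picks up where the kernel left the seek pointer. */
        while ((n = getdirentries(fd, buf, sizeof(buf), &base)) > 0) {
                /* buf now holds n bytes of struct dirent records;
                 * walk them using each entry's d_reclen. */
        }
}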

Current opendir and readdir issue multiple calls to getdirentries to
read large directories. There are several things file systems do to
make sure you can issue repeated calls to getdirentries without files
going missing.  First off, the seek pointer maintained by
getdirentries should point to something valid (some entry or the end
of directory), no matter how long we've waited to call getdirentries
the second time (as long as we've kept the fd open).  The FFS
filesystem accomplishes this by rounding all directory reads to
directory block boundaries. Thus the seek pointer is always at a
directory block boundary. Since every directory block starts with a
valid entry in a well-known format, there is no problem with the
validity of the data.
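
The idea, roughly (this is a paraphrase of what the ufs directory read
does, not a quote of the source):

/* Trim the transfer so it ends on a DIRBLKSIZ boundary.  Since every
 * read then ends on a boundary, the seek pointer handed back to
 * getdirentries is always block-aligned. */
count = uio->uio_resid;
count -= (uio->uio_offset + count) & (DIRBLKSIZ - 1);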

For simplicity, the file system needs to avoid rearranging the names
in a directory. Otherwise, it stands the chance of moving a
long-standing name from after the seek pointer to before the seek
pointer and causing directory entries to disappear. If a file from
before the seek pointer is moved to after the seek pointer, a
duplicate would appear. FFS again prevents this by only reorganizing
entries within a directory block and keeping the seek pointers at directory
block boundaries.

However, even with measures taken by the file system, we cannot keep
duplicate names from appearing. An example:

We do two getdirentries calls. The first call returns the first half
of the directory, including a file called foo.txt. The foo.txt file is
then deleted and created again quickly. This time, however, its entry
is placed at the end of the directory. We then issue the second
getdirentries call and, presto, duplicate entries. 

This scenario does not violate our consistency model, since the name in
question underwent modification between the consecutive readdirs. 

Solutions
---------

#1) Return names in alphabetic order and have the seek pointer be the last name
you returned.  Unfortunately, the implementation of such a scheme is not
practical.

#2) Let the application deal. Applications should be able to deal with
duplicate names in a directory just fine. Just run the output through uniq
and you're done.
(note: if your code deals with transient duplicates,
it should probably deal with persistent duplicates anyway :)
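
For example, a minimal sketch of that in C (the function name and buffer
size here are made up for illustration; error handling is mostly omitted):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
cmpname(const void *a, const void *b)
{
        return (strcmp(*(const char * const *)a, *(const char * const *)b));
}

/* Read a directory, sort the names, and print each name only once. */
void
list_unique(const char *path)
{
        DIR *dirp;
        struct dirent *dp;
        char *names[4096];
        int i, n = 0;

        if ((dirp = opendir(path)) == NULL)
                return;
        while ((dp = readdir(dirp)) != NULL && n < 4096)
                names[n++] = strdup(dp->d_name);
        closedir(dirp);

        qsort(names, n, sizeof(names[0]), cmpname);
        for (i = 0; i < n; i++) {
                if (i == 0 || strcmp(names[i], names[i - 1]) != 0)
                        printf("%s\n", names[i]);
                free(names[i]);
        }
}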

#3) At least with local file systems, the easy way of doing an atomic read
of the directory is to pass in a buffer large enough to hold all the
directory entries. This could be done in a somewhat atomic fashion by the
following code:

       struct stat info;
       char *buffer;
       ssize_t bytes_read;
       off_t base;

repeat:
       fstat(fd, &info);
       buffer = malloc(info.st_size);
       bytes_read = getdirentries(fd, buffer, info.st_size, &base);
       if (bytes_read >= info.st_size) {
            /* directory may have outgrown the buffer; rewind and retry */
            free(buffer);
            lseek(fd, 0, SEEK_SET);
            goto repeat;
       }

This assumes that the total size of the directory entries is less than
the size returned by fstat.

The memory used by this approach is unbounded. In addition, the time
for the operation to complete could be long if the directory is being
updated concurrently by a fast-running process.

Compromises, though, can be made on when such an approach is used.
Environment variables could be used to specify whether or not this
approach was in force, and if so, how many times to try to repeat, etc.
Let your imagination go wild.

NFS, getdirentries revisited
----------------------------

The duplicate entries problem described above, however, is not a
persistent problem. That is, if we re-read the directory (and there
are no concurrent actions against it occurring), the duplicate entry
should disappear. Our NFS client displays persistent duplicates.

The fundamental reason OpenBSD gets persistent duplicates is that we are not
invalidating cached data aggressively enough (cached cookies especially). The
reason we don't do this is (nominally) to prevent missing files.

Let's try to understand the NFS interface a bit better. The NFS
protocol has a very similar interface to getdirentries for reading
directories.  You tell the server how many bytes of directory
information you want to read and it returns to you a list of directory
entries. Along with each directory entry returned, there is a cookie,
analogous to our seek pointer in getdirentries, which, when presented
to the NFS server for that directory, yields the next directory
entry.
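
Roughly, the NFSv2 READDIR call looks like this (a sketch after RFC 1094;
these field names are illustrative, not lifted from our headers):

/* What the client sends. */
struct readdirargs {
        fhandle_t  dir;         /* file handle of the directory */
        u_int32_t  cookie;      /* opaque: where to resume reading */
        u_int32_t  count;       /* max bytes of entries to return */
};

/* What comes back: a list of these, plus an eof flag. */
struct readdirentry {
        u_int32_t  fileid;
        char      *name;
        u_int32_t  cookie;      /* one cookie per returned entry */
        struct readdirentry *next;
};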

There are a couple of fundamental differences between the NFS protocol and the
file system interface. First, we are limited in the number of bytes we can read
from a directory - so we can't grab a large directory in one call. Second,
there is no notion of an open file descriptor in NFSv2, so we don't know when
people have lost interest in the directory. Finally, unlike getdirentries,
which returns only one seek pointer per group of entries (which in the case
of UFS pointed to a block boundary, so everything was OK), NFS returns one
cookie for each entry.  Not surprisingly, in the OpenBSD NFS server
implementation, these cookies correspond to the offsets of file entries in a
directory. These offsets change when a directory is reorganized (e.g. compacted
in FFS), invalidating cookies held by the client.

(Note: easy solution - disallow compaction on NFS mounted partitions)
(Note: harder solution - have NFS server always make the last cookie
in the last directory entry correspond to a block boundary. Have the
NFS client only ever use that last directory entry when requesting
more directory data)

When should the NFS client invalidate its cache of a directory?  When
should it invalidate (toss?) its cookies?  Using the modification time
on the directory, we can find out when the directory was
modified. Presumably, this should invalidate all the directory entries
and cookies in our cache. But what if the directory is open by some process
on the client side and we're in the middle of a getdirentries session?
If we toss all the cookies, what do we pass to the server when the client
issues the next getdirentries?

One answer is to toss all the cached but unused cookies. The easiest
way to do that is to store a cookie along with the seek pointer of
a directory. Unfortunately, we can only pass back a 32-bit quantity
to the application and cookies are at least 64 bits. So, we must maintain
the cookies in the kernel, perhaps in the file descriptor structure along
with the seek pointer.
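
For instance (purely hypothetical names, just to show the shape of it):

/* Hypothetical: the only directory-position state kept per open fd. */
struct nfs_dirpos {
        off_t     dp_offset;    /* seek pointer as seen by userland */
        u_quad_t  dp_cookie;    /* NFS cookie corresponding to that offset */
};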

A disadvantage of this approach is that, according to the getdirentries man
page, any seek pointer returned by getdirentries during a session (i.e. while
an fd is open) should be valid.

A simpler approach is to mark the cookie cache as dirty and flush it
when the number of references on the directory drops to zero. Though it
is possible that overlapping readdirs would return duplicates, eventually
the problem should disappear.
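
In outline (hypothetical function, field, and flag names):

/* Hypothetical sketch: toss the cached cookies once the last
 * reference to the directory goes away. */
void
nfs_dir_close(struct nfsnode *np)
{
        if (--np->n_dirrefs == 0 && (np->n_flag & NCOOKIES_DIRTY)) {
                nfs_flush_cookie_cache(np);
                np->n_flag &= ~NCOOKIES_DIRTY;
        }
}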

The final approach, which is the most reliable, would be to have getdirentries
return an error (ESTALE) whenever a directory changes. An application would then
need to restart the read of the directory from seek pointer zero or risk
getting garbage. This approach could be very costly in large directories with
high contention.
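
From the application side, that might look like this (a sketch, assuming
getdirentries were changed to fail with ESTALE; process() is a hypothetical
per-buffer handler):

        char buf[8192];
        off_t base;
        ssize_t n;

again:
        lseek(fd, 0, SEEK_SET);
        while ((n = getdirentries(fd, buf, sizeof(buf), &base)) > 0)
                process(buf, n);
        if (n == -1 && errno == ESTALE)
                goto again;     /* directory changed; start over from zero */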

Notes
-----

The NFS server can make things easier for clients by never
invalidating a cookie (short of it pointing past the end of a
directory). This is not true of our current implementation due to
FFS compaction.

It's not clear in NFSv2 what the server does with a bad cookie.
Does it try to round the cookie to the next directory
entry? Does it return crap? Does it return nothing? Is there some kind
of error? In NFSv3, I believe the server can return an
NFSERR_BADCOOKIE error.