
17.2. From Function Arguments to Argument Pointers

Most Unix-related operating systems, such as FreeBSD, Linux, and Solaris, use function pointers to isolate the implementation of a filesystem from the code that accesses its contents. Interestingly, FreeBSD also employs indirection to abstract the read function's arguments.

When I first encountered the call vop->vop_read(a), shown in the previous section, I asked myself what that a argument was and what happened to the original four arguments of the hypothetical implementation of the VOP_READ function we saw earlier. After some digging, I found that the kernel uses another level of indirection to layer filesystems on top of each other to an arbitrary depth. This layering allows a filesystem to offer some services (such as translucent views, compression, and encryption) based on the services of another underlying filesystem. Two mechanisms work cleverly together to support this feature: one allows a single bypass function to modify the arguments of any vop_vector function, while another allows all undefined vop_vector functions to be redirected to the underlying filesystem layer.

You can see both mechanisms in action in Figure 17-2. The figure illustrates three filesystems layered on top of one another. On top lies the umapfs filesystem, which the system administrator mounted in order to map user credentials. This is valuable if the system where this particular disk was created used different user IDs. For instance, the administrator might want user ID 1013 on the underlying filesystem to appear as user ID 5325.

Figure 17-2. Example of filesystem layering


Beneath the top filesystem lies the Berkeley Fast Filesystem (ffs), the time- and space-efficient filesystem used by default in typical FreeBSD installations. The ffs, in turn, relies for most of its operations on the code of the original 4.2BSD filesystem implementation, ufs.

In the example shown in the figure, most system calls pass through a common bypass function in umapfs that maps the user credentials. Only a few system calls, such as rename and getattr, have their own implementations in umapfs. The ffs layer provides optimized implementations of read and write; both rely on a filesystem layout that is more efficient than the one employed by ufs. Most other operations, such as open, close, getattr, setattr, and rename, are handled in the traditional way. Thus, a vop_default entry in the ffs vop_vector structure directs all those functions to call the underlying ufs implementations. For example, a read system call will pass through umapfs_bypass and ffs_read, whereas a rename call will pass through umapfs_rename and ufs_rename.

Both mechanisms, the bypass and the default, pack the four arguments into a single structure. This provides commonality between the different filesystem functions and lays the groundwork for the bypass function. It is a beautiful design pattern that is easily overlooked within the intricacies of the C code required to implement it.

The four arguments are packed into a single structure, whose first field (a_gen.a_desc) contains a description of the structure's contents (vop_read_desc, in the following code). As you can see in Figure 17-1, a read system call on a file in the FreeBSD kernel will trigger a call to vn_read, which will set up the appropriate low-level arguments and call VOP_READ. This will pack the arguments and call VOP_READ_APV, which finally calls vop->vop_read and thereby the actual filesystem read function:

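	/* The packed arguments of the read operation */
	struct vop_read_args {
	        struct vop_generic_args a_gen;  /* a_gen.a_desc describes the contents */
	        struct vnode *a_vp;
	        struct uio *a_uio;
	        int a_ioflag;
	        struct ucred *a_cred;
	};

	static __inline int VOP_READ(
	        struct vnode *vp,
	        struct uio *uio,
	        int ioflag,
	        struct ucred *cred)
	{
	        struct vop_read_args a;

	        /* Pack the four arguments behind their descriptor */
	        a.a_gen.a_desc = &vop_read_desc;
	        a.a_vp = vp;
	        a.a_uio = uio;
	        a.a_ioflag = ioflag;
	        a.a_cred = cred;
	        /* Dispatch through the vnode's operations vector */
	        return (VOP_READ_APV(vp->v_op, &a));
	}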

This same elaborate dance is performed for calling all other vop_vector functions (stat, write, open, close, and so on). The vop_vector structure also contains a pointer to a bypass function. This function receives the packed arguments and, after possibly modifying them (for example, by mapping user credentials from one administrative domain to another), passes control through the a_desc field to the appropriate underlying function for the specific call.

Here is an excerpt of how the nullfs filesystem implements the bypass function. The nullfs filesystem just duplicates a part of an existing filesystem into another location of the global filesystem namespace. Therefore, for most of its operations, it can simply have its bypass function call the corresponding function of the underlying filesystem:

	/* Re-issue the call through the descriptor stored in the packed arguments */
	#define VCALL(c) ((c)->a_desc->vdesc_call(c))

	int
	null_bypass(struct vop_generic_args *ap)
	{
	        /* ... */
	        error = VCALL(ap);

In the preceding code, the macro VCALL(ap) will bump the vnode operation that called null_bypass (for instance VOP_READ_APV) one filesystem level down. You can see this trick in action in Figure 17-3.

Figure 17-3. Routing system calls through a bypass function
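
The bypass idea can be sketched outside the kernel. The following is a minimal user-space illustration, not the FreeBSD implementation: every toy_ name, the offset fields in the descriptor, and the fixed 5325-to-1013 mapping are assumptions made for the example. It shows how a descriptor stored in the first field of the packed arguments lets a single bypass function adjust the arguments of an operation it knows nothing about and then reissue the call one layer down.

```c
#include <assert.h>
#include <stddef.h>

struct toy_generic_args;

/* Hypothetical descriptor: where the interesting fields live, and how to
 * reissue the call for this operation. */
struct toy_op_desc {
    size_t node_offset;                     /* offset of the node pointer */
    size_t uid_offset;                      /* offset of the user ID */
    int (*call)(struct toy_generic_args *); /* dispatch again on the node */
};

/* Common header; must be the first field of every packed argument struct. */
struct toy_generic_args {
    const struct toy_op_desc *desc;
};

/* One layer's operations, reduced to a single operation plus a bypass. */
struct toy_ops {
    int (*read)(struct toy_generic_args *);
    int (*bypass)(struct toy_generic_args *);
};

/* A file as seen by one layer; lower is the same file one layer down. */
struct toy_node {
    const struct toy_ops *ops;
    struct toy_node *lower;
};

/* Packed arguments of the toy read operation. */
struct toy_read_args {
    struct toy_generic_args gen;
    struct toy_node *node;
    unsigned uid;
};

unsigned toy_last_uid_seen;  /* records what the bottom layer observed */

/* Dispatch a read on whatever node the packed arguments currently name. */
static int toy_read_call(struct toy_generic_args *ap)
{
    struct toy_read_args *a = (struct toy_read_args *)ap;
    const struct toy_ops *ops = a->node->ops;

    assert(ops->read != NULL || ops->bypass != NULL);
    return (ops->read != NULL ? ops->read(ap) : ops->bypass(ap));
}

const struct toy_op_desc toy_read_desc = {
    offsetof(struct toy_read_args, node),
    offsetof(struct toy_read_args, uid),
    toy_read_call
};

/* Bottom layer: a real implementation would do the I/O here. */
static int toy_bottom_read(struct toy_generic_args *ap)
{
    toy_last_uid_seen = ((struct toy_read_args *)ap)->uid;
    return 0;
}

/* Generic bypass: finds the fields through the descriptor, maps the user
 * ID, swaps in the lower node, reissues the call, then restores both. */
static int toy_map_bypass(struct toy_generic_args *ap)
{
    char *base = (char *)ap;
    struct toy_node **nodep = (struct toy_node **)(base + ap->desc->node_offset);
    unsigned *uidp = (unsigned *)(base + ap->desc->uid_offset);
    struct toy_node *upper = *nodep;
    unsigned saved_uid = *uidp;
    int error;

    assert(upper->lower != NULL);
    if (*uidp == 5325)          /* assumed mapping, after the text's example */
        *uidp = 1013;
    *nodep = upper->lower;
    error = ap->desc->call(ap); /* same role as VCALL(ap) */
    *nodep = upper;
    *uidp = saved_uid;
    return error;
}

const struct toy_ops toy_bottom_ops = { toy_bottom_read, NULL };
const struct toy_ops toy_map_ops = { NULL, toy_map_bypass };

/* Entry point playing the role of VOP_READ: pack, then dispatch. */
int toy_read(struct toy_node *node, unsigned uid)
{
    struct toy_read_args a;

    a.gen.desc = &toy_read_desc;
    a.node = node;
    a.uid = uid;
    return toy_read_call(&a.gen);
}
```

If a node using toy_map_ops is stacked on a node using toy_bottom_ops, calling toy_read on the upper node with user ID 5325 makes the bottom layer see 1013, while any other user ID passes through unchanged. Because the bypass locates the fields through offsets in the descriptor, the same function would work for any other operation whose descriptor supplied them.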


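In addition, the vop_vector contains a field named default that is a pointer to the vop_vector structure of the underlying filesystem layer. Through that field, if a filesystem doesn't implement some functionality, the request is passed on to a lower level. By populating the bypass and the default fields of its vop_vector structure, a filesystem can choose among handling an incoming request on its own, passing the request to a lower-level filesystem through its bypass function (possibly after modifying some arguments), and handing the request directly to the lower-level filesystem.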

In my mind, I visualize this as bits sliding down the ramps, kickers, and spinners of an elaborate pinball machine. The following example from the read system call implementation shows how the system locates the function to call:

	int
	VOP_READ_APV(struct vop_vector *vop, struct vop_read_args *a)
	{
	        [...]
	        /*
	         * Drill down the filesystem layers to find one
	         * that implements the function or a bypass
	         */
	        while (vop != NULL &&
	            vop->vop_read == NULL && vop->vop_bypass == NULL)
	                vop = vop->vop_default;
	        /* Call the function or the bypass */
	        if (vop->vop_read != NULL)
	                rc = vop->vop_read(a);
	        else
	                rc = vop->vop_bypass(&a->a_gen);

Elegantly, at the bottom of all filesystem layers lies a filesystem that returns the Unix "operation not supported" error (EOPNOTSUPP) for any function that wasn't implemented by the filesystems layered on top of it. This is our pinball's drain:

	#define VOP_EOPNOTSUPP ((void*)(uintptr_t)vop_eopnotsupp)

	struct vop_vector default_vnodeops = {
	        .vop_default =          NULL,
	        .vop_bypass =           VOP_EOPNOTSUPP,
	};

	int
	vop_eopnotsupp(struct vop_generic_args *ap)
	{
	        return (EOPNOTSUPP);
	}
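
To see the drill-down and the drain working together, here is a small self-contained sketch. It is a toy model under stated assumptions rather than kernel code: the toy_ names, the exotic operation that no layer implements, and the macro that stamps out one dispatcher per operation are all hypothetical. The layers mirror the ffs and ufs roles described earlier, with a drain at the bottom of the chain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_handler { TOY_NONE, TOY_FFS_READ, TOY_UFS_RENAME };

/* Stand-in for the packed arguments; records which function ran. */
struct toy_args {
    enum toy_handler handled_by;
};

/* Toy operations vector. "default" is a C keyword, hence dflt. */
struct toy_vector {
    const struct toy_vector *dflt;      /* next layer down, or NULL */
    int (*read)(struct toy_args *);
    int (*rename)(struct toy_args *);
    int (*exotic)(struct toy_args *);   /* hypothetical; nobody implements it */
    int (*bypass)(struct toy_args *);
};

static int toy_ffs_read(struct toy_args *a)
{
    a->handled_by = TOY_FFS_READ;
    return 0;
}

static int toy_ufs_rename(struct toy_args *a)
{
    a->handled_by = TOY_UFS_RENAME;
    return 0;
}

/* The drain: every request that falls this far is rejected. */
static int toy_eopnotsupp(struct toy_args *a)
{
    (void)a;
    return EOPNOTSUPP;
}

/* Fields left out of an initializer are NULL, meaning "not implemented". */
const struct toy_vector toy_drain = { .bypass = toy_eopnotsupp };
const struct toy_vector toy_ufs = { .dflt = &toy_drain, .rename = toy_ufs_rename };
const struct toy_vector toy_ffs = { .dflt = &toy_ufs, .read = toy_ffs_read };

/* Define toy_call_<op>: walk down the default chain until a layer offers
 * either the operation or a bypass, then call whichever was found. */
#define TOY_DEFINE_CALL(op)                                               \
int toy_call_##op(const struct toy_vector *v, struct toy_args *a)         \
{                                                                         \
    while (v != NULL && v->op == NULL && v->bypass == NULL)               \
        v = v->dflt;                                                      \
    assert(v != NULL);  /* holds when every chain ends at toy_drain */    \
    return (v->op != NULL ? v->op(a) : v->bypass(a));                     \
}

TOY_DEFINE_CALL(read)
TOY_DEFINE_CALL(rename)
TOY_DEFINE_CALL(exotic)
```

Starting from toy_ffs, toy_call_read is served by the top layer, toy_call_rename falls through to the layer below, and toy_call_exotic slides all the way down to the drain and returns EOPNOTSUPP. Ending every chain at a drain whose bypass is always set is what guarantees that the loop stops at a usable entry.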