





     UUsseerrffss -- FFiilleessyysstteemmss IImmpplleemmeenntteedd aass UUsseerr PPrroocceesssseess
           _J_e_r_e_m_y _F_i_t_z_h_a_r_d_i_n_g_e _<_j_e_r_e_m_y_@_s_w_._o_z_._a_u_>
                     Softway Pty. Ltd.



_1_.  _I_n_t_r_o_d_u_c_t_i_o_n

Userfs is a mechanism by which a normal user process can act
as a Linux filesystem.  There are many uses for this,
including:

Prototype filesystems

     Prototype  new  block  allocation  algorithms in a user
     process and debug with gdb before going into  the  com-
     pile-crash-reboot cycle of kernel development.

Infrequent use filesystems

     You  want to mount "FooBaz 0X" filesystems under Linux,
     but you don't want it that often, and you don't need it
     to  be  maximum  speed.   Rather than trying to get the
     kernel  itself  to  understand,  or  write  specialised
     tools, write a filesystem program.

Add capabilities to existing filesystems

     Want compression, encryption, ACLs?  Have a process
     mirror an existing file tree, but with your own
     extensions and semantics.

Completely virtual filesystems and new interfaces

     Add  a  filesystem-type interface to an existing mecha-
     nism, or a filesystem interface as a new way of  repre-
     senting data.  Sick of FTP?  How about

          $$ mmkkddiirr //ffttpp//ttssxx--1111..mmiitt..eedduu
          $$ ccdd //ffttpp//ttssxx--1111..mmiitt..eedduu//ppuubb//lliinnuuxx
          $$ ccpp RREEAADDMMEE $$HHOOMMEE

     Or mail?

          $$ ccdd //mmaaiill
          $$ llss
          000011..ssbbgg@@ssooccss..uuttss..eedduu..aauu
          000022..LLeerrooyy
          000033..ttlluukkkkaa@@vviinnkkkkuu..hhuutt..ffii
          000044..DDaavvoorr__JJaaddrriijjeevviicc
          $$ ccaatt **//FFrroomm
          FFrroomm:: ssbbgg@@ssooccss..uuttss..eedduu..aauu
          FFrroomm:: lleerrooyy@@ssooccss..uuttss..eedduu..aauu ((LLeerrooyy))
          FFrroomm:: ttlluukkkkaa@@vviinnkkkkuu..hhuutt..ffii
          FFrroomm:: ddaavvoorr%%eemmaarrdd..uuuuccpp@@ddss55000000..iirrbb..hhrr ((DDaavvoorr JJaaddrriijjeevviicc))
          $$ ccaatt **//SSuubbjjeecctt
          SSuubbjjeecctt:: MMoorree tthhiinnggss
          SSuubbjjeecctt:: ((nnoonnee))
          SSuubbjjeecctt:: TThhaatt uusseerrffss tthhiinngg
          SSuubbjjeecctt:: mmaaiillffss aaggaaiinn
          $$

You get the idea.


_2_.  _I_n_s_t_a_l_l_a_t_i_o_n

_2_._1  _K_e_r_n_e_l

First of all, remove traces of previous versions of userfs:
make sure there are no userfs header files in
linux/include/linux and no userfs patches to any of the
kernel source.

Otherwise, the kernel module should just  compile  with  the
rest  of  the  build process.  Userfs is currently supported
for 1.3.x (tested up to 1.3.13), and for 1.2.x.  By  default
it  will  compile  for  1.3.x kernels, so to compile for 1.2
kernels you must edit kernel/src/Makefile.  Just follow  the
comments.   If  you're using a kernel later than 1.3.11, try
it out anyway and tell me what happens.  There are no kernel
patches.

To install the module you need the modutils package, which
should be available from your local Linux ftp archive.  It
should be clear from its documentation what you need to do
with userfs.o to get it into the kernel.  If you get some
warnings  about multiply defined symbols, ignore them.  Only
undefined symbols are a problem.  You can compile the kernel
and  module with either ELF or a.out compilers.  I used ELF.

_2_._2  _N_o_n_-_k_e_r_n_e_l _C_o_d_e

Building the rest of the code should be a matter  of  typing
"make"  at  the  top  userfs  directory.  This will generate
dependencies and build the utilities  needed  (genser),  the
library,  the  clients using the library and the kernel mod-
ule.  There will be some warnings; ignore them.

I used gcc 2.7.0; you probably need to use the  latest  com-
piler  and  libraries  (libg++  2.7.0.1) for the C++ (though
I've avoided templates and exceptions; g++ has enough  prob-
lems with simple things).

_2_._3  _M_a_i_l_i_n_g _l_i_s_t

There  is  a  USERFS  channel  on  the  Linux Activists list
server.  To subscribe, send mail with

     XX--MMnn--AAddmmiinn:: jjooiinn UUSSEERRFFSS

as the first line to linux-activists-request@niksula.hut.fi.
This channel is for general discussion of userfs development
and applications.

I'm not sure if this is still  active,  since  the  rest  of
linux-activists  is  going away, and the server is no longer
maintained.

_2_._4  _B_u_g_s_, _c_o_m_m_e_n_t_s_, _e_t_c

When you find a bug, tell  me.   Please  send  me  the  code
you're  using,  the  kernel version, whatever changes you've
made to userfs kernel code, and instructions or a script  to
reproduce the bug.  Don't just tell me "it broke."

If you've made changes to the kernel code, please send it to
me rather than sending it out to the world.  Please send  me
comments,  ideas for new kernel features, or things that you
think would make good filesystems but  you  can't  do  right
now.   Also  feel  free  to  ask  questions about either the
implementation of my code or how to write  your  own  userfs
clients.

Send    mail    to    either    me    (Jeremy   Fitzhardinge
<jeremy@sw.oz.au>) or to the mailing list (see above).


_3_.  _U_s_i_n_g _c_l_i_e_n_t_s

Clients are generally mounted with the muserfs command.
It's quite simple: it's a program which makes sure the
mount point is legal for the user to mount on, and mounts
the given process with the user's permissions.  Note that
any user can mount a process, so more checking is done on
the mount point than for a normal mount.  Unless the user is
root, the mount point must be owned by the user and
writable.  muserfs has a man page, which is even up to date.
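
The mount point rule can be sketched with the standard
stat(2) call.  This is only an illustration, not the code
muserfs actually runs: the name mount_point_ok is made up
for this example, and reading "writable" as the owner write
bit is an assumption.

```cpp
#include <cassert>
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical helper, not taken from muserfs.
// Returns true if `uid` may mount on `path` under the rule described
// above: root may mount anywhere, other users only on a mount point
// they own and can write to (assumed here to mean the owner write bit).
bool mount_point_ok(const char *path, uid_t uid)
{
    assert(path != nullptr);

    struct stat st;
    if (stat(path, &st) != 0)
        return false;               // the mount point must exist
    if (uid == 0)
        return true;                // root is not restricted
    return st.st_uid == uid && (st.st_mode & S_IWUSR) != 0;
}
```

A real check would also have to decide what to do about
symbolic links and about mount points that are already busy;
the sketch ignores both.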

There are a few useful or semi-useful clients: homer, ftpfs,
mailfs and arcfs.

Homer is written in C++, and uses the C++ library in the lib
     directory to do most of its work.  All it does is set
     up a single directory under its mount point which
     contains symbolic links named after each user name in
     the password file, each pointing to the associated home
     directory.  Mounted on /u it makes a passable
     replacement for ~ expansion in a shell (but it works
     for any program).

Ftpfs is an experimental filesystem which allows read-only
     access to FTP sites, maintaining a long-term disk
     cache.  It is intended primarily for anonymous FTP, but
     can also be used for authenticated FTP sessions.

Mailfs is by Davor Jadrijevic.  It is for reading mail.
     Currently it is read-only and does not track mailbox
     changes, and it is no longer being actively developed.
     Pester Davor (or fix it yourself).

Arcfs was written by David Gymer.  It allows you to mount  a
     compressed  tar  file  as  a  read-only filesystem, and
     inspect it with normal tools.  It's  pretty  neat,  but
     not recommended for heavy "production" use, or for very
     large files.


_4_.  _T_h_e_o_r_y _o_f _o_p_e_r_a_t_i_o_n

The kernel module registers a new filesystem type with the
kernel ("userfs").  The filesystem itself is very simple;
all it does is take the normal kernel filesystem requests,
wrap them up into well-defined packets, squirt them down a
file descriptor (presumably connected to a process) and
wait for the reply on another file descriptor.

If  the  filesystem process is on the same machine, then the
file descriptors  are  probably  ordinary  pipes.   However,
userfs  just  reads  and  writes on the file descriptors, so
they could be anything; files,  sockets,  devices  -  userfs
doesn't care.

The  following  is  not  a comprehensive tutorial on writing
filesystems, or a detailed "how it works"  or  specification
of  the  existing code.  It is intended to give some idea of
what I was thinking, and basic  concepts  to  bear  in  mind
while poking about in my kernel or user code.

_4_._1  _P_r_i_o_r_i_t_i_e_s

I  had  a  number  of goals which I wanted satisfied by this
thing (from most to least important):

Flexibility

     I want the process to have as much of the power of a
     kernel-resident filesystem as possible.  I wanted to
     keep the interfaces as generic and flexible as
     possible.  This has been mostly achieved.

Robustness

     Since I see prototyping and development as a major use
     for userfs, it seems important to make sure that the
     kernel can't crash or lock up, even when the user code
     fails.  As it stands, it should be impossible for a
     user process to crash the kernel, but it is possible
     for a bad user process to lock up processes trying to
     use the filesystem.

     It  is  also possible for a process to go strange while
     it is being mounted, leaving a half-mounted filesystem.
     The mountpoint becomes a nulled out inode, but the ker-
     nel refuses to unmount it (because it  isn't  mounted),
     and  refuses  to mount on it (because it's busy).  This
     happens much  less  often  than  it  used  to,  because
     muserfs  does  a  simple check to see if the filesystem
     process is at all viable.

Availability to users and Security

     I'd like any user to be able to write a filesystem pro-
     cess.    Traditionally,  filesystems  are  things  that
     embody the security of Unix,  and  are  therefore  very
     much  superuser-only things.  However, there are only a
     couple of really sensitive features that  shouldn't  be
     able to be controlled by any user: suid executables and
     device nodes.  Since a  trusted  superuser  process  is
     still  required to call the mount system call, and that
     process can set the no-suid and  no-device  flags,  the
     filesystem  code  can't use these as security holes.  I
     can't think of anything else that  needs  special  care
     from  a  security  point  of  view.  However, since the
     filesystem is completely under the control of the  pro-
     cess,  one  can make no assumptions about its contents.
     For example "." and ".." may not  do  expected  things,
     symlinks  may  point to places other than what readlink
     returns.  This makes navigating such filesystems a  new
     and interesting experience.

Efficiency

     Efficiency  is  my  lowest  priority,  but  it is still
     important.  Unfortunately the  other  requirements  (as
     usual)  make  things less efficient.  The most signifi-
     cant inefficiency is the context switches  between  the
     kernel  and the process.  I think the most benefits can
     be gained by reducing the number of these.   There  are
     several approaches to this:

        o If  the  process  wants a default behaviour for an
          operation, then it can be done in the kernel.  The
          best  example  of this is permission checking - if
          the process wants normal unix permission  checking
          then  it  doesn't need to do it itself.  Otherwise
          it can take all the permission requests  from  the
          kernel,  and  implement other permission policies.
          This is currently implemented.  When the  filesys-
          tem  is first mounted, the kernel asks the process
          what requests it will accept.  From that point the
          kernel   will  do  sensible  default  actions  for
          requests that the process doesn't want  to  handle
          rather than sending them down the connection.

        o Group  requests commonly issued together into one.
          This is hard, since  the  main  kernel  tells  the
          filesystem  code  very  little  about  what  it is
          doing, so it is hard to  know  what  to  do  next.
          However,  there  are  a  couple  of  single kernel
          requests that are implemented in the  protocol  as
          two  or more transactions.  This could be fixed in
          future.

        o Data can be cached in the  kernel.   This  is  the
          most  tricky,  since  kernel caching or read-ahead
          limits the amount of control the process can  have
          over  the  data  once read.  I think this could be
          optionally implemented, depending on  whether  the
          process  says  it  is  OK to do caching, and if so
          what kinds.

          Currently directory readahead is implemented with
          the upp_multireaddir operation.  This allows the
          filesystem process to return as many directory
          entries as it likes.  These entries can now be
          returned to the usermode process in one lump, if
          it has enough space in its return buffer (this
          replaces a complex readahead mechanism).  This is
          a win if there are lots of linear directory
          searches (such as shell globbing, ls or pwd).

        o A larger than 4k maximum packet size can be used,
          now that the kernel memory allocator allows larger
          than 4k memory allocations.  However, since pipes
          are the most common connection between filesystems
          and kernels, and pipes can hold at most 4k of
          data, there would still be a context switch
          between filesystem code and kernel every 4k, so
          there wouldn't be much gain.

     A number of people have suggested adding shared memory
     between the kernel and the filesystem process.  This
     would be quite limiting, and is the option least likely
     to improve things.  At the moment, the filesystem makes
     no assumptions about the nature of the file descriptors
     for talking to the process.  To implement shared memory
     between  the  kernel and the process would require some
     way of finding the process on the other end of the file
     descriptors  (if  any),  and playing around with memory
     maps.  This still wouldn't cut down on  the  number  of
     context switches at all.

_4_._2  _P_r_o_t_o_c_o_l

The protocol used is machine independent, using network byte
order and defined type sizes.  The code to do the packetisa-
tion  and  depacketisation  is  generated automatically by a
program, given the description of each packet.  This is  not
fully  portable,  but  it  avoids  byte  order and structure
alignment problems.

A packet to or from the kernel has two parts.  The first  is
a header that contains a sequence number, an operation type,
a packet type, size of the following data,  and  a  protocol
version  number.  The packet type can either be a request, a
reply or an enquiry.  Requests and enquiries are always from
the  kernel  to the process, and the process only ever sends
replies to the kernel.  A reply's header has one extra field
-  an  error  field,  containing  an  error number.  Replies
always have the same sequence number as their  corresponding
request  or  enquiry.   If there was an error performing the
operation the error field is set to  the  error  number  and
there  is no additional data returned.  If there is no error
the error field is set to 0.

The second part, following a request or reply header, is the
optional operation-specific data.  This is passed through
the protocol for interpretation by the operation routines at
each end.
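
As an illustration of what "network byte order and defined
type sizes" means for the header, here is a minimal sketch
of encoding and decoding one by hand.  It is not the
machine-generated code: the struct name, the field order and
the 32-bit width of every field are assumptions made for
this example, and the reply's extra error field is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical header layout.  The field list follows the text above;
// the 32-bit widths and the field order are assumptions.
struct Header {
    std::uint32_t seq;      // sequence number
    std::uint32_t op;       // operation type
    std::uint32_t kind;     // packet type: request, reply or enquiry
    std::uint32_t size;     // size of the following data
    std::uint32_t version;  // protocol version number
};

const std::size_t HEADER_BYTES = 5 * 4;

// Store a 32-bit value most significant byte first (network byte order).
static void put32(unsigned char *p, std::uint32_t v)
{
    p[0] = static_cast<unsigned char>(v >> 24);
    p[1] = static_cast<unsigned char>(v >> 16);
    p[2] = static_cast<unsigned char>(v >> 8);
    p[3] = static_cast<unsigned char>(v);
}

// Load a 32-bit value stored most significant byte first.
static std::uint32_t get32(const unsigned char *p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Write the header into the first HEADER_BYTES bytes of buf.
void encode_header(const Header &h, unsigned char *buf)
{
    assert(buf != nullptr);
    put32(buf + 0, h.seq);
    put32(buf + 4, h.op);
    put32(buf + 8, h.kind);
    put32(buf + 12, h.size);
    put32(buf + 16, h.version);
}

// Read a header back from the first HEADER_BYTES bytes of buf.
Header decode_header(const unsigned char *buf)
{
    assert(buf != nullptr);
    Header h;
    h.seq = get32(buf + 0);
    h.op = get32(buf + 4);
    h.kind = get32(buf + 8);
    h.size = get32(buf + 12);
    h.version = get32(buf + 16);
    return h;
}
```

Because each byte is placed explicitly, the result does not
depend on the host's byte order or on how the compiler pads
the struct.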

The kernel may have multiple outstanding requests.  In other
words,  the kernel may send a new request before receiving a
reply to a previous one.   This  allows  the  filesystem  to
block one process for a slow operation while other processes
can  use  the  filesystem  for  shorter  operations.    This
improves  performance  on,  for  example, an ftp filesystem,
where one process may  be  using  a  fast  local  link,  and
another  may be using a slow international one, and each has
to wait for its own requests to  be  satisfied.   Of  course
this requires the filesystem process to be written with some
form of  multi-threading.   If  the  filesystem  just  reads
requests,  acts  on  them  and replies then it can do so and
ignore any kernel requests until it is ready  to  deal  with
them.
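
The bookkeeping side of this can be sketched as a table of
outstanding requests keyed by sequence number, so that
replies can be sent in whatever order the work finishes.
The class below is hypothetical and leaves out the threading
itself; it also assumes that sequence numbers are unique
among the requests that are outstanding at any one time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical bookkeeping for requests that have been read from the
// kernel but not yet answered.  Using an int for the operation code is
// an assumption.
class PendingTable {
public:
    // Remember a request that has arrived and still needs a reply.
    void accept(std::uint32_t seq, int op)
    {
        // Assumed: no two outstanding requests share a sequence number.
        assert(pending_.count(seq) == 0);
        pending_[seq] = op;
    }

    // Called when the reply for `seq` is ready.  Returns the operation
    // the reply belongs to, or -1 if no such request is outstanding.
    int finish(std::uint32_t seq)
    {
        std::map<std::uint32_t, int>::iterator it = pending_.find(seq);
        if (it == pending_.end())
            return -1;
        int op = it->second;
        pending_.erase(it);
        return op;
    }

    // Number of requests still waiting for a reply.
    std::size_t outstanding() const { return pending_.size(); }

private:
    std::map<std::uint32_t, int> pending_;
};
```

With this, a slow request accepted first can be finished
after a fast one accepted later; the reply only has to carry
the sequence number of the request it answers.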

_4_._3  _H_a_n_d_l_e_s

The base element of a filesystem is an inode.  There is an
exact one to one relationship between inodes and files
(where a file in this case can be any filesystem object,
like a normal file, a directory  and  so  on).   The  kernel
needs  to  be  able to uniquely identify inodes.  Inodes are
uniquely numbered within  a  filesystem,  but  each  mounted
filesystem  has  its  own  numbering.  Therefore an inode is
completely identified by an inode number  and  a  filesystem
identifier  (or  device,  though  it doesn't mean much for a
filesystem which has no physical  hardware  associated  with
it).

A  device is what distinguishes mounted filesystems from one
another, and an inode is what distinguishes files  within  a
filesystem  from each other.  Inode numbers are generated by
each filesystem, and are used by the kernel to refer to spe-
cific  files  to the filesystem specific code.  User process
filesystems are no exception: between  the  kernel  and  the
filesystem process, files are referred to by using handles,
which are essentially 32 bit unsigned numbers.  When a
filesystem first mentions a file to the kernel, it gives it
a handle, which the kernel uses for all later operations on
the file.  It is the handle which identifies the file,
rather than the name, so it is important to use distinct
handles for distinct files, and never change the handle of a
file once it has been given to the kernel.

_4_._4  _R_a_n_d_o_m _o_p_e_r_a_t_i_o_n _s_p_e_c_i_f_i_c _a_d_v_i_c_e _a_n_d _b_l_u_r_b

This may eventually accurately describe the whole protocol,
but for now it's a list of interesting points and things
that have bitten me.

Normally when  writing  a  filesystem  you  should  use  the
library  libuserfs  (see  below), and use the advice in this
section as a guide on what kind of things should be  put  in
your userfs operation functions, or for idle curiosity.

_4_._4_._1  _M_o_u_n_t_i_n_g

The  mount  is initiated by a user process calling the mount
system call, with the  "userfs"  filesystem  type.   In  the
filesystem  specific  data,  the  process  passes  two  file
descriptor numbers for the kernel  to  read  and  write  to.
These can be any kind of file descriptor at all.  Most
commonly they would be pipes or sockets, but there is no
restriction.  All the kernel requires is that the one it
uses to talk to the process is writable, and the one it
gets replies from is readable.

The  most  important  request  is mounting.  Most important,
because it is one of the two requests that the  process  has
to  implement  (of  course,  not  implementing anything else
would be completely useless).  The  request  itself  is  not
that  complex.   All it does is return a handle of the inode
at the root of the filesystem.  Most commonly, this will  be
a  directory.   Userfs does not enforce this, but the kernel
itself may.

After the process returns the root handle, the  kernel  will
probe  the  process  to see what operations it is willing to
support.  This is done by sending a series of enquire  pack-
ets  to  the  process.  The process should reply with normal
reply packets, with the errno field either set to 0 if it is
supported  or  ENOSYS if it isn't.  No real operation should
be done, and no additional information should be sent in the
reply.   If  the  process replies ENOSYS to an operation, it
will never receive it again, and the kernel will use a
sensible default for it (typically what the kernel would
normally do for an in-kernel filesystem if it doesn't support
the  operation).   Conversely,  if  the  filesystem  process
doesn't get an enquiry about a particular operation from the
kernel,  it  will  never see that operation from the kernel.
The filesystem process should send 0 for the  operations  it
explicitly  supports, and ENOSYS for everything else, so the
protocol can be extended without having to  modify  existing
clients.
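
A minimal sketch of that rule: keep a set of the operations
the client really implements, and answer every enquiry from
it.  The operation codes below are invented for the example;
the real ones come from the generated protocol headers.

```cpp
#include <cassert>
#include <cerrno>
#include <set>

// Hypothetical operation codes, for illustration only.
enum SketchOp { OP_MOUNT = 1, OP_IREAD = 2, OP_READ = 3, OP_WRITE = 4 };

// Value for the error field of the reply to an enquiry about `op`:
// 0 if the client handles the operation, ENOSYS for anything else,
// including operations this client has never heard of.
int answer_enquiry(const std::set<int> &supported, int op)
{
    return supported.count(op) ? 0 : ENOSYS;
}
```

Answering ENOSYS by default, rather than listing the
operations to refuse, is what lets an old client keep
working when new operations are added to the protocol.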


_4_._4_._2  _R_e_a_d_i_n_g _I_n_o_d_e_s

The  most common thing for a filesystem to be asked to do is
to read inodes.  For the process, this involves filling  out
a  structure  much like the kernel's inode structure and the
stat structure.  It is important to make sure the nlinks
field is non-zero.  This field is the number of names the
inode has, that is, the number of directory entries in the
filesystem which refer to this inode.  In theory, this can
never be 0 when the kernel asks for the inode, because that
would mean that the kernel asked for the inode without ever
seeing a name referring to it, implying that the filesystem
never told the kernel about the file.  If it is 0 then the
kernel will never "put" the inode, and it will be impossible
to unmount the filesystem.

When  the kernel wants an inode from the filesystem, it uses
the uupppp__iirreeaadd protocol request to fetch it.  This happens if
something  in  the  kernel  asks for the inode, but it isn't
already in the kernel inode table.  Therefore, once the ker-
nel  has  asked the filesystem for an inode, it will not ask
for it again while anything in the kernel is using it.

Once nothing in the kernel is using the  inode,  the  kernel
will issue an upp_iput operation, which may be preceded by
an upp_iwrite if the inode was modified while in use.  A
filesystem need not implement these operations if there is
no need to do so.

_4_._4_._3  _O_p_e_n _a_n_d _C_l_o_s_e

Reading  and  putting  inodes  are  the  basic   operations:
regardless of what an inode is being used for, it will be
read and put.  The upp_open and upp_close operations
specifically correspond to the open(2) and close(2) system
calls.  Normally a filesystem doesn't need to perform any
special handling for these operations, and would not
normally implement them, except if it wants to know the
identity of the process doing the operations.  When a
program issues an open system call for a file on the user
filesystem, the kernel will send a upp_open operation for
the file, which includes complete identification for the
process which issued the open.  When the filesystem replies
it returns a credentials token.  From then on, that
credentials token is sent to the filesystem in all
operations which correspond to a system call which takes a
file descriptor as an argument, such as read, write,
readdir, lseek and so on.

This may seem a bit complex: why not just send the uid of
the process with the operations?  Well, the credentials of a
process are quite complex, since they include the real,
saved and effective uids and gids of the process, and all
the auxiliary groups.  Sending this with each request would
be quite an overhead.  The idea is that all the info is sent
on an open, and the filesystem process can associate it with
a token internally, and only use the token in correspondence
with the kernel.

Also  note  that the credentials are associated with an open
file descriptor, not the process performing  the  operation.
Mostly a process will deal with file descriptors it has
created itself, but it's quite possible that it can inherit
file descriptors from another process with a different set
of credentials.  In this case the filesystem knows the
original process's credentials, but not those of the process
which is performing the operation.
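
One way a client might keep the association between tokens
and credentials is a small table filled in on open and
emptied on close.  Everything in this sketch is
hypothetical: the Creds fields only mirror the list given in
the text, and the 32-bit token type is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical credentials record; the real packet layout is generated
// from the protocol description and may carry different fields.
struct Creds {
    unsigned ruid, euid, suid;      // real, effective and saved uid
    unsigned rgid, egid, sgid;      // real, effective and saved gid
    std::vector<unsigned> groups;   // auxiliary groups
};

class CredTable {
public:
    CredTable() : next_(1) {}

    // On open: store the credentials and hand back a fresh token.
    std::uint32_t remember(const Creds &c)
    {
        std::uint32_t token = next_++;
        table_[token] = c;
        return token;
    }

    // On read, write and so on: find the credentials recorded at open
    // time.  Returns a null pointer for an unknown token.
    const Creds *lookup(std::uint32_t token) const
    {
        std::map<std::uint32_t, Creds>::const_iterator it =
            table_.find(token);
        return it == table_.end() ? nullptr : &it->second;
    }

    // On close: forget the token.
    void forget(std::uint32_t token) { table_.erase(token); }

private:
    std::uint32_t next_;
    std::map<std::uint32_t, Creds> table_;
};
```

The credentials found this way are those of whoever opened
the file, which is exactly the limitation described above.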

_4_._4_._4  _H_a_n_d_l_e _M_a_n_a_g_e_m_e_n_t

The handle of an inode is the only way the kernel and the
filesystem  can  talk  about a file.  An inode may have more
than one name, or no names at all, so file names are  not  a
good  way  of  keeping  track of a file.  Use inodes in your
filesystem code to keep track of files, even if you  have  a
simple 1:1 name to file mapping.

Handles  must also be consistent.  Of course you must always
keep the handles of files currently in use  consistent,  but
you  must also keep them consistent between uses.  If a pro-
cess opens a file once, closes it and then reopens it,  then
it  will  expect  it  to have the same inode number if it is
supposed to be the same file (which is how processes using a
user filesystem will see the file handles).

Also,  if  you  ever refer to a handle in communication with
the kernel, you must be prepared for the kernel to ask about
it.   For  example, if the kernel reads a directory with the
uupppp__rreeaaddddiirr or uupppp__mmuullttiirreeaaddddiirr operations,  each  entry  in
the reply will have a name and a handle.  Each of those han-
dles must be the handle of the file if the kernel  looks  at
the  file  more closely.  If you make them all the same, for
example, then a program would be entitled  to  believe  that
all the names in the directory refer to one actual file.
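
A simple way to satisfy these rules is to allocate each
handle once and remember it.  The sketch below is
hypothetical; in particular the string key stands for
whatever the filesystem uses internally to identify a file,
which need not be its name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical handle allocator.  `key` is the filesystem's own
// identity for a file (not necessarily a file name).
class HandleTable {
public:
    HandleTable() : next_(1) {}

    // Return the handle for `key`, allocating one on first use.  The
    // same key always gets the same handle, and no two keys share one.
    std::uint32_t handle_for(const std::string &key)
    {
        assert(!key.empty());
        std::map<std::string, std::uint32_t>::iterator it =
            by_key_.find(key);
        if (it != by_key_.end())
            return it->second;
        std::uint32_t h = next_++;
        by_key_[key] = h;
        return h;
    }

private:
    std::uint32_t next_;
    std::map<std::string, std::uint32_t> by_key_;
};
```

Entries are never removed here, so a file keeps its handle
between uses; a long-running client would need some policy
for retiring handles the kernel no longer holds.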

_4_._4_._5  _D_e_a_l_i_n_g _w_i_t_h _m_u_s_e_r_f_s

Writing  a  client  which  can be handled by muserfs is very
easy.  The important thing to remember is that the  filesys-
tem  process can basically ignore muserfs, and ignore issues
like how to quit and so on.

A userfs filesystem process should only terminate under one
condition: it gets an EOF (a read of 0 bytes) from the
kernel on the file descriptor it's reading operation
requests from.  Muserfs will execute it so that most signals
are ignored, so it can handle them itself.  When the muserfs
process is sent a SIGINT or SIGTERM it unmounts the
filesystem mount point with the umount(8) command (used so
that /etc/mtab is updated properly).  This causes the kernel
to send the filesystem process a upp_umount operation.  The
kernel will close its end of the file descriptors, and the
process is expected to do the same, even if only by exiting.
Therefore, when trying to unmount a userfs filesystem, do
not kill the filesystem process directly, and do not kill
muserfs with SIGKILL.  In either case you should still be
able to unmount with umount as root.
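
The "terminate only on EOF" rule amounts to a read loop like
the following sketch.  It uses plain read(2) and simply
counts bytes where a real client would decode and dispatch
requests; the function name is made up for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical main loop of a client.  Reads from `fd` until EOF and
// returns the number of bytes seen, or -1 on a read error.
long serve_until_eof(int fd)
{
    assert(fd >= 0);

    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            return total;       // EOF: the other end has been closed
        if (n < 0) {
            if (errno == EINTR)
                continue;       // interrupted by a signal, try again
            return -1;
        }
        total += n;             // a real client would handle requests here
    }
}
```

Retrying on EINTR matters here because the process is
expected to keep serving across signals and stop only when
the kernel closes its end.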


_5_.  _U_s_i_n_g _l_i_b_u_s_e_r_f_s

_l_i_b_u_s_e_r_f_s is a C++ library designed to make writing filesys-
tem  clients  easier.  It is designed so all the work common
to almost all filesystems is encapsulated into a few generic
classes,  which  can  be  used  as base classes for specific
filesystem functions.

_5_._1  _B_a_s_i_c _C_l_a_s_s_e_s

The most basic classes, Comm, Filesystem and Inode, implement
the basic communication with the kernel and stub methods for
each operation.

The Comm class reads from the kernel and decodes the headers
of  the  operation  packets, and passes the remainder to the
Filesystem class.  The Filesystem performs the operation and
returns  an  unencoded return header and the encoded body of
the reply, if any.  All this is  not  exposed  to  the  code
using the library.

Filesystem  takes  each  operation  and dispatches it to the
appropriate place.  The Filesystem class directly handles
the operations which are global to the whole filesystem,
such as mounting or unmounting.  For operations which
pertain to a particular Inode (such as reading, or looking
up a name in a directory), Filesystem looks up the Inode in
its table and dispatches the operation to it.

The  Inode  class  has  all its methods implemented as stubs
which fail with the "not implemented" error code.   It  also
has members for the standard inode properties of mode, type,
size, ownership, links, timestamps and so on.

These classes are completely useless on their own, so they
must be used as base classes for other classes which
actually do something.  libuserfs has more specific, but
still generally useful classes.

SSiimmpplleeIInnooddee  implements  a  simple  inode with some normally
expected behaviour.  It has a constructor which  initializes
the  inode  properties to sensible values, and methods which
implement simple defaults for the open,  close  and  permis-
sions check operations.

DDiirrIInnooddee,, derived from SimpleInode, implements all the oper-
ations needed for a directory, including linking and unlink-
ing inodes to/from names, rename, and directory scanning and
lookup.  It takes very little extra code to implemement sim-
ple directory behaviour.
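
The "stubs in the base class, behaviour in the derived
class" pattern looks roughly like this.  The classes below
are stand-ins written for this example and are not the
libuserfs classes; their method signatures are deliberately
simplified and should not be taken as the library's.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical base class: every operation is a stub which fails with
// the "not implemented" error code.
class SketchInode {
public:
    virtual ~SketchInode() {}
    virtual int do_read(std::string &out) { (void)out; return ENOSYS; }
    virtual int do_write(const std::string &in) { (void)in; return ENOSYS; }
};

// A derived class overrides only the operations it wants to support.
class HelloInode : public SketchInode {
public:
    int do_read(std::string &out) override
    {
        out = "hello\n";
        return 0;               // 0 means success
    }
};
```

Here HelloInode can be read but not written: do_write falls
through to the base class stub and reports ENOSYS.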

_5_._2  _W_r_i_t_i_n_g _y_o_u_r _o_w_n _f_i_l_e_s_y_s_t_e_m _c_l_a_s_s_e_s

A complete filesystem has two parts: a collection of inodes,
one for each file,  and  the  filesystem  structure  itself,
which  holds all the inodes together.  Each inode represents
a file in the filesystem, regardless of type.  There is only
one  inode  per  file  in  the  filesystem, even if the file
appears multiple times under different names.

_5_._2_._1  _A_r_g_u_m_e_n_t_s _a_n_d _r_e_t_u_r_n _v_a_l_u_e_s _o_f _o_p_e_r_a_t_i_o_n _m_e_t_h_o_d_s

Each method with the name do_something in the Filesystem and
Inode classes corresponds to an operation in the userfs
protocol.  As a result, they all have similar argument
structures.  All such methods have const up_preamble &pre
and upp_repl &repl arguments, which are references to the
operation request and reply packet headers.  Mostly there is
no reason for operation methods to use them, because their
contents are dealt with in lower levels of the library, but
they are there if you want them.

Each userfs protocol operation may  have  arguments,  return
values,  both  or neither, and the method for that operation
will have corresponding arguments.  For an operation named x
the  method  argument with the operation arguments will have
the type ccoonnsstt uupppp___x__ss, and the return values argument  will
have  the  type  uupppp___x__rr, For example, the up_read operation
will correspond to the Inode method

     iinntt IInnooddee::::ddoo__rreeaadd((ccoonnsstt uupp__pprreeaammbbllee &&pprree,, uupppp__rreeppll &&rreeppll,,
                        ccoonnsstt uupppp__rreeaadd__ss &&aarrggss,, uupppp__rreeaadd__rr &&rreett));;

The contents of the  structures,  along  with  encoding  and
decoding  functions,  are  machine  generated, and therefore
have a consistent set of rules.  Mostly  it's  quite  simple,
with  normal  base types directly corresponding to C and C++
types.  However, variable sized types need to  have  both  a
pointer  to  the  data and the size of the data encoded into
them.  Memory for the data is managed with the C++ new and
delete operators: call the aalllloocc method of the variable sized
object to allocate it.  The memory is automatically freed by
the method's caller.  For example, if a return value of a
method contains
a member called nnaammee representing a filename, it  would  be
set  with the following sequence (assuming oouurrnnaammee is a normal
zero-terminated string):

     iinntt nnaammeelleenn == ssttrrlleenn((oouurrnnaammee));;
     rreett..nnaammee..aalllloocc((nnaammeelleenn));;                   //// AAllllooccaattee mmeemmoorryy
     rreett..nnaammee..nneelleemm == nnaammeelleenn;;                  //// SSeett nnaammee lleennggtthh
     mmeemmccppyy((&&rreett..nnaammee..eelleemmss,, oouurrnnaammee,, nnaammeelleenn));; //// SSeett nnaammee ccoonntteennttss
     //// ......

(alternatively, you could just point _r_e_t_._n_a_m_e_._e_l_e_m_s directly
at  _o_u_r_n_a_m_e,  because the library won't try to free memory
that was never allocated with aalllloocc).

Note that strings are never zero terminated; the  length  of
the  returned  string is exactly the number of characters in
the string.
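
The sequence above can be tried outside the library with a
stand-in type.  The following is only a sketch: VarBytes, its
destructor and the two helper functions are hypothetical and
are not the machine generated userfs types; only the member
names alloc, nelem and elems come from the text.  It shows
that the length travels in nelem and that no terminator is
stored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a machine generated variable sized type.
// Only the member names alloc, nelem and elems come from the text.
struct VarBytes {
    char       *elems = nullptr;  // pointer to the data
    std::size_t nelem = 0;        // number of valid bytes in elems
    bool        owned = false;    // true only if alloc() got the memory

    VarBytes() = default;
    VarBytes(const VarBytes &) = delete;
    VarBytes &operator=(const VarBytes &) = delete;

    void alloc(std::size_t n) {
        if (owned) delete[] elems;
        elems = new char[n];
        owned = true;
    }
    // Assumption: models "freed by the method's caller"; memory that
    // was never obtained through alloc() is left alone.
    ~VarBytes() { if (owned) delete[] elems; }
};

// Copy a zero-terminated C string into the field, without the terminator.
void set_name(VarBytes &name, const char *ourname) {
    std::size_t namelen = std::strlen(ourname);
    name.alloc(namelen);
    name.nelem = namelen;
    std::memcpy(name.elems, ourname, namelen);
}

// Read the field back; the length comes from nelem, never from strlen().
std::string get_name(const VarBytes &name) {
    return std::string(name.elems, name.nelem);
}
```

A reader of such a field must always use nelem for the
length, since the byte after the last character is not part
of the data.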

If the operation the method is performing fails,  it  should
return  the  appropriate  error  code,  or 0 if it succeeds.
Don't return -1 unless you mean to - it has special  meaning
(see below, in "Deferring Replies").
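
As an illustration of the return convention, here is a
minimal sketch of a read-like method on an in-memory file.
The read_args, read_ret and MemFile types are hypothetical
(the real upp_read_s and upp_read_r are machine generated and
may have different fields), and the pre and repl arguments
are left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical argument and return structures for this sketch only.
struct read_args { long off; std::size_t size; };
struct read_ret  { std::string data; };

// Hypothetical in-memory file.
struct MemFile {
    std::string contents;

    // Returns 0 on success or an errno value on failure.  It never
    // returns -1, which is reserved for deferred replies.
    int do_read(const read_args &args, read_ret &ret) const {
        if (args.off < 0)
            return EINVAL;                    // the operation failed
        std::size_t off = static_cast<std::size_t>(args.off);
        if (off >= contents.size()) {
            ret.data.clear();                 // zero bytes means EOF
            return 0;
        }
        std::size_t n = std::min(args.size, contents.size() - off);
        ret.data = contents.substr(off, n);   // may be fewer than asked for
        return 0;
    }
};
```

Note that reaching the end of the file is a successful
operation that returns no data, not an error.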

_5_._2_._2  _D_e_r_i_v_i_n_g _f_r_o_m _F_i_l_e_s_y_s_t_e_m

A class derived from Filesystem must implement a number of methods to make
the filesystem viable:

_E_n_q_u_i_r_e is called when the kernel wants to find what  opera-
     tions your filesystem supports.  For all the operations
     that any inode will  implement,  return  0  and  return
     ENOSYS for the rest.

_d_o___m_o_u_n_t takes  no  arguments and returns the handle for the
     inode for the root directory (that is, the  top  direc-
     tory  of your filesystem).  The kernel immediately does
     a ddoo__iirreeaadd operation using this handle.








                           - 14 -



You can also implement _d_o___s_t_a_t_f_s which allows the kernel  to
get  space  and inode usage statistics, such as when "df" is
executed,  and  _d_o___u_m_o_u_n_t  so  the  filesystem  is  formally
informed  when it is unmounted (normally it just gets an EOF
from the kernel, and Comm::Run returns).
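
The Enquire rule can be sketched as a lookup in a set of
supported operations.  Everything named here is an
assumption made for illustration: the Op identifiers, the
ReadOnlyFs class and the Enquire signature are hypothetical,
and the real method takes the usual protocol arguments.

```cpp
#include <cassert>
#include <cerrno>
#include <set>

// Hypothetical operation identifiers; the protocol defines its own.
enum Op { OP_MOUNT, OP_IREAD, OP_LOOKUP, OP_READDIR, OP_READ, OP_WRITE };

// Hypothetical stand-in for a class derived from Filesystem.
class ReadOnlyFs {
    std::set<Op> supported;
public:
    ReadOnlyFs()
        : supported{OP_MOUNT, OP_IREAD, OP_LOOKUP, OP_READDIR, OP_READ} {}

    // 0 reports the operation as supported, ENOSYS as unsupported.
    int Enquire(Op op) const {
        return supported.count(op) ? 0 : ENOSYS;
    }
};
```

Keeping the list in one place makes it easier to check that
every operation reported as supported is really handled by
every inode class.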

_5_._2_._3  _D_e_r_i_v_i_n_g _f_r_o_m _I_n_o_d_e

Most of the work of the filesystem is done  in  the  inodes.
All  inode classes must be derived from Inode, and generally
there will be a number of different Inode based classes.

It is probably better to use SimpleInode as  a  base  rather
than  plain Inode, because it implements simple defaults for
some methods,  which  would  otherwise  fail.   If  Filesys-
tem::Enquire  says that the filesystem supports a particular
operation, then any inode should be  prepared  to  get  that
operation from the kernel.

Similarly,  unless you are doing something special, deriving
directories from DirInode saves a lot of work.

Only  _d_o___i_r_e_a_d  need  be  implemented,  but  obviously   the
filesystem  will  do nothing interesting unless other opera-
tions are implemented.  do_iread returns the details of  the
inode.  Note that the Filesystem class calls the do_iread of
the Inode when the operation comes from the kernel,  so  the
inode  must  exist  by the time the kernel asks for it.  The
constructor for Inode automatically registers the  inode  in
the  Filesystem's  inode  table;  conversely, the destructor
removes it.
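
The registration behaviour can be pictured with the
following sketch.  These Filesystem and Inode classes are
hypothetical stand-ins written for this example (the handle
type, the find and count methods and the constructor
arguments are assumptions); they only show the idea of an
inode entering the table when constructed and leaving it
when destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

class Inode;

// Hypothetical stand-in for the Filesystem's inode table.
class Filesystem {
    std::map<unsigned long, Inode *> table;   // handle -> inode
    friend class Inode;
public:
    Inode *find(unsigned long handle) const {
        auto it = table.find(handle);
        return it == table.end() ? nullptr : it->second;
    }
    std::size_t count() const { return table.size(); }
};

// Hypothetical stand-in for Inode: it registers itself on
// construction and removes itself on destruction.
class Inode {
    Filesystem   &fs;
    unsigned long handle;
public:
    Inode(Filesystem &f, unsigned long h) : fs(f), handle(h) {
        fs.table[handle] = this;
    }
    virtual ~Inode() { fs.table.erase(handle); }
    Inode(const Inode &) = delete;
    Inode &operator=(const Inode &) = delete;
    unsigned long get_handle() const { return handle; }
};
```

In this model an inode whose handle has been given to the
kernel must stay alive until the kernel is finished with it,
otherwise a later lookup of the handle finds nothing.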

Here are some other useful methods for an Inode.  The
descriptions are brief and general, and don't necessarily
refer to all the arguments and return values; those not
mentioned can usually be ignored.

_d_o___i_w_r_i_t_e is,  obviously, the opposite of do_iread.  It sim-
     ply sets the various Inode values.

_d_o___i_p_u_t is called when the kernel is  no  longer  using  the
     inode.   That is, the inode is no longer open, the cur-
     rent or root directory of  a  process,  being  executed
     from or being mapped from.  If an inode is iput and has
     no names (has no name to inode mapping  in  any  direc-
     tory) it can be destroyed.

_d_o___r_e_a_d allows data to be read from the file.  The arguments
     are the offset in the file to start reading  from,  and
     the number of bytes desiried.  The method may return as
     many bytes up to that number as it likes, including  0,
     which means EOF.










                           - 15 -



_d_o___w_r_i_t_e does the converse; a block of data and an offset is
     passed in, and the method returns the number  of  bytes
     actually written.

_d_o___l_o_o_k_u_p translates  a  name into an inode reference.  This
     is typically implemented for directories; if  the  name
     exists  in  the  directory the method should return the
     handle of the inode, and otherwise fail with ENOENT.

_d_o___d_i_r_r_e_a_d returns the next directory entry  at  the  passed
     offset.  It returns the name and inode of the next file
     in the directory, and the size of the  entry  returned.
     This  is  added  by the kernel to the current offset in
     the directory to form the offset of the next  directory
     entry  for  the next call.  Since the directory entries
     don't correspond to real file storage as in other, more
     conventional  filesystems,  a  directory  entry  can be
     regarded as having a size of 1.

     If the end of the directory has been reached, it should
     return a new offset of 0.

_d_o___m_u_l_t_i_r_e_a_d_d_i_r is similar to do_readdir, but can return any
     number of directory entries,  which  are  cached  in  a
     readahead  buffer in the kernel.  If a program asks for
     a directory entry for  an  inode  which  has  a  cached
     directory  entry  then  the entry will come from within
     the kernel rather than asking the  filesystem  process.
     This  operation  can  return  only one entry (and so is
     like do_readdir), or as many as will fit  in  a  return
     packet  (up  to  4k  or  so  of entries).  Returning no
     entries  means  the  end  of  the  directory  has  been
     reached.   Returning multiple entries improves the per-
     formance of directory scans, most  frequently  done  by
     ls, pwd and shell globbing.

     Look at the implementation of DirInode::do_multireaddir
     for details of how this should be dealt with.

_d_o___c_r_e_a_t_e does all file creation, whether  it  be  a  normal
     file,  a  directory, a fifo file or a device node.  The
     mode contains the type of the file in the same way  as  the
     stat structure member sstt__mmooddee..

_d_o___u_n_l_i_n_k is the opposite, and is used for unlinking (remov-
     ing a name to inode mapping) files and directories.  If
     an  inode is not in use and has no links then it can be
     destroyed and its handle can be reused.

_d_o___s_y_m_l_i_n_k is used to create new symlink inodes.  It returns
     the handle of the new inode.

_d_o___r_e_a_d_l_i_n_k returns  the  pathname  which a symbolic link is
     pointing to.

_d_o___f_o_l_l_o_w_l_i_n_k returns the pathname of the  file  a  symbolic
     link  is  really  referring to.  If Filesystem::Enquire
     says the filesystem does not  support  this  operation,
     the readlink operation is used instead.

_d_o___o_p_e_n is  called  when  a  file is actually opened.  It is
     only necessary to implement this if it is important  to
     know whether a file is being opened as opposed to being
     used in any  other  way.   This  operation  passes  the
     filesystem  the  complete authentication credentials of
     the process doing the open, so that the filesystem  can
     do  extended  security checking or change the behaviour
     of the file depending on the user.

     This method can return a credentials token, which is a
     magic number used by the filesystem process to refer to
     the set of credentials passed by the kernel.  The
     kernel attaches this credentials token to each
     operation generated by system calls on the file
     descriptor created by the open (read(), write(),
     readdir() and close()).  The credentials token is part
     of the file descriptor, so it is inherited unchanged if
     the descriptor is passed to another process, even if
     that process has different credentials.

     When  a  file is opened, a new file table entry for the
     inode is created.  That file table entry has  a  single
     file descriptor referring to it.  More file descriptors
     can be made to refer to the file table entry  with  the
     dduupp(2) system call, and can be removed with cclloossee(2).

_d_o___c_l_o_s_e is  called when the last file descriptor for a file
     table entry is closed.  The only argument for  this  is
     the  credentials  token  for  that file table entry, so
     that the filesystem can free all references to it.

_d_o___p_e_r_m_i_s_s_i_o_n is called when the filesystem says it wants to
     do permissions checking.  This is called a lot, and can
     cause many more operations to pass between  the  kernel
     and  filesystem  process.   If  the filesystem does not
     implement it the normal unix user/group/others checking
     is performed.

_d_o___r_e_n_a_m_e moves  a  file  from  one  directory  to a new one
     (though it may be the same).
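
The credentials token handling described for do_open and
do_close can be sketched as a small table kept by the
filesystem process.  This is only an illustration: the Creds
fields, the CredTable class and the choice of an unsigned
long token are assumptions, not the library's own types.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical credential record; what the kernel really passes is
// defined by the userfs protocol and may differ.
struct Creds { int uid; int gid; };

// Hypothetical token table kept by the filesystem process.
class CredTable {
    std::map<unsigned long, Creds> live;   // token -> credentials
    unsigned long next = 1;                // assumption: 0 is never issued
public:
    // For a do_open-like method: remember the credentials and hand
    // back the magic number attached to later operations.
    unsigned long open(const Creds &c) {
        unsigned long token = next++;
        live[token] = c;
        return token;
    }
    // For do_read, do_write and so on: find out who opened the file.
    const Creds *lookup(unsigned long token) const {
        auto it = live.find(token);
        return it == live.end() ? nullptr : &it->second;
    }
    // For a do_close-like method: the file table entry is gone, so
    // every reference to the token can be dropped.
    void close(unsigned long token) { live.erase(token); }
    std::size_t size() const { return live.size(); }
};
```

Because the token stays with the open file rather than with
the process, lookup gives the credentials of whoever did the
open, which is the behaviour described above.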

_5_._2_._4  _D_e_r_i_v_i_n_g _f_r_o_m _D_i_r_I_n_o_d_e

DirInode implements a number of userfs operation methods for
directories,  such  as readdir, multireaddir and lookup.  It
also automatically constructs directories with "." and  ".."
entries pointing to the appropriate places.

DirInode deals with strings a lot, and rather than using the
normal cchhaarr ** it uses the libg++ SSttrriinngg class for all string
arguments  to  its  own methods (but not, of course, for the
userfs protocol operation methods).

DirInode expects to be given a pointer to the parent
directory, which is also of a class derived from DirInode.
If the directory is at the top of the filesystem's tree, the
pointer given should be NULL.  The protected member ppaarreenntt
points to the parent inode, or to tthhiiss for the top one; the
member itself is never NULL.

DirInode keeps a list of files in the  directory,  but  does
not allow that list to be directly visible.  The only opera-
tions for manipulating the directory contents for a  derived
class are:

iinntt lliinnkk((ccoonnsstt SSttrriinngg nnaammee,, IInnooddee **)) which  links a new name
     into the directory, updating all the reference and link
     counts;

iinntt uunnlliinnkk((ccoonnsstt SSttrriinngg nnaammee)) which does the opposite;

DDiirrEEnnttrryy **llooookkuupp((ccoonnsstt SSttrriinngg nnaammee)) which  returns  a direc-
     tory entry if it finds the file, or NULL otherwise; and

DDiirrEEnnttrryy **ssccaann((DDiirrEEnnttrryy ** &&ppooss)) which  returns the directory
     entry at _p_o_s_, updating it in the process,  or  NULL  if
     there are no more entries.

DDiirrEEnnttrryy **ssccaann((iinntt &&ppooss)) is  the  same,  except  it  uses an
     integer offset, which is less efficient.
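
To make the four operations concrete, here is a sketch of a
directory kept in a map.  It is not DirInode's
implementation: the Directory class, the DirEntry layout,
the use of std::string in place of the libg++ String class
and the EEXIST/ENOENT return values are all assumptions made
for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <map>
#include <string>

struct Inode { int nlink = 0; };   // hypothetical minimal inode

struct DirEntry {                  // hypothetical entry layout
    std::string name;
    Inode      *inode;
};

// Hypothetical stand-in for the private list inside DirInode.
class Directory {
    std::map<std::string, DirEntry> entries;
public:
    int link(const std::string &name, Inode *ino) {
        if (entries.count(name)) return EEXIST;
        entries[name] = DirEntry{name, ino};
        ++ino->nlink;              // keep the link count in step
        return 0;
    }
    int unlink(const std::string &name) {
        auto it = entries.find(name);
        if (it == entries.end()) return ENOENT;
        --it->second.inode->nlink;
        entries.erase(it);
        return 0;
    }
    DirEntry *lookup(const std::string &name) {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : &it->second;
    }
    // Integer-offset scan: returns entry number pos and advances pos,
    // or returns nullptr when there are no more entries.
    DirEntry *scan(int &pos) {
        if (pos < 0 || static_cast<std::size_t>(pos) >= entries.size())
            return nullptr;
        auto it = entries.begin();
        for (int i = 0; i < pos; ++i) ++it;   // linear walk each call
        ++pos;
        return &it->second;
    }
};
```

The linear walk in scan shows why an integer offset is the
less efficient form: each call has to find its place again,
where a DirEntry pointer position can step straight to the
next entry.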

_5_._3  _C_o_m_m_u_n_i_c_a_t_i_o_n_s _c_l_a_s_s_e_s

There are a number of communications classes in the library,
which provide different ways of multiplexing replies.

The simplest is the Comm class, which takes each
request, passes it to the  filesystem  and  sends  back  the
reply.  There are more complex comms classes though.

_5_._3_._1  _F_i_l_e _D_e_s_c_r_i_p_t_o_r _D_i_s_p_a_t_c_h_e_r

The  CCoommmmBBaassee  class  (base of all comms classes) provides a
dispatcher which allows  classes  to  register  interest  in
activity  on  file  descriptors.  This is used internally to
get input from the kernel, but can be used by  a  filesystem
to  monitor  any  file descriptor for any reason.  To do it,
simply derive a dispatcher class from  DDiissppaattcchhFFDD  and  call
ssttrruucctt  ddiisspp__ffdd  CCoommmmBBaassee::::aaddddDDiissppaattcchh((iinntt ffdd,, DDiissppaattcchhFFDD **,,
iinntt wwhhaatt)), where what can be one or more of  _D_I_S_P___R,  _D_I_S_P___W
or _D_I_S_P___E, for interest in read ready, write ready or excep-
tions.  When an event occurs,  the  DDiissppaattcchhFFDD::::ddiissppaattcchh((iinntt
ffdd,,  iinntt wwhhaatt)) method is called of the registered class.  If
it returns 0 then it is removed from the dispatch list.   If








                           - 18 -



it  returns  -1  it  indicates  an error; it is removed, and
CCoommmmBBaassee::::RRuunn(()) returns.  Returning 1 is a normal return.

CCoommmmBBaassee::::RRuunn(()) returns normally  when  there  are  no  more
entries on the dispatch list.
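
The return value rules can be seen in a small stand-alone
dispatcher.  The sketch below is not the library's code: the
Dispatcher and ByteCounter classes are hypothetical, the
flag values are assumptions, addDispatch returns nothing
here, and the loop is built on poll(2), which the library
may or may not use.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <map>
#include <vector>
#include <poll.h>
#include <unistd.h>

// Interest flags; the names follow the text, the values are assumed.
enum { DISP_R = 1, DISP_W = 2, DISP_E = 4 };

// Hypothetical stand-in for DispatchFD.
struct DispatchFD {
    virtual ~DispatchFD() {}
    // 1: keep watching, 0: remove this entry, -1: error, stop the loop.
    virtual int dispatch(int fd, int what) = 0;
};

// Hypothetical stand-in for the dispatcher part of CommBase.
class Dispatcher {
    struct Entry { DispatchFD *handler; int what; };
    std::map<int, Entry> entries;          // fd -> registration
public:
    void addDispatch(int fd, DispatchFD *h, int what) {
        entries[fd] = Entry{h, what};
    }
    // Returns 0 once the dispatch list is empty, -1 on an error.
    int Run() {
        while (!entries.empty()) {
            std::vector<pollfd> fds;
            for (const auto &e : entries) {
                pollfd p;
                p.fd = e.first;
                p.events = 0;
                p.revents = 0;
                if (e.second.what & DISP_R) p.events |= POLLIN;
                if (e.second.what & DISP_W) p.events |= POLLOUT;
                if (e.second.what & DISP_E) p.events |= POLLPRI;
                fds.push_back(p);
            }
            if (poll(fds.data(), fds.size(), -1) < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            for (const pollfd &p : fds) {
                if (p.revents == 0) continue;
                auto it = entries.find(p.fd);
                if (it == entries.end()) continue;
                int what = 0;
                if (p.revents & (POLLIN | POLLHUP)) what |= DISP_R;
                if (p.revents & POLLOUT) what |= DISP_W;
                if (p.revents & (POLLPRI | POLLERR | POLLNVAL)) what |= DISP_E;
                int r = it->second.handler->dispatch(p.fd, what);
                if (r != 1) entries.erase(it);   // 0 and -1 both remove it
                if (r < 0) return -1;            // -1 also ends Run()
            }
        }
        return 0;
    }
};

// Example handler: counts the bytes read from a descriptor until EOF.
struct ByteCounter : DispatchFD {
    std::size_t total = 0;
    int dispatch(int fd, int) override {
        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) return -1;
        if (n == 0) return 0;      // EOF: drop out of the dispatch list
        total += static_cast<std::size_t>(n);
        return 1;
    }
};
```

With a ByteCounter registered on the read end of a pipe, Run
keeps calling it while data arrives and returns once the
handler has seen EOF and removed itself.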

_5_._3_._2  _D_e_f_e_r_r_i_n_g _R_e_p_l_i_e_s

In normal operation, the filesystem processes one request at
a time, so each operation is replied to before the  next  is
looked  at.   This  is a convention of the way the user code
works, and not something the kernel enforces.  The kernel
just sends requests as processes using the filesystem need
them, and each process blocks until its particular request
is replied to.  Therefore, it is possible for multiple
processes to use the filesystem at once.

The DDeeffeerrCCoommmm and DDeeffeerrFFiilleessyyss classes have a method  called
DDeeffeerrRReeppllyy  (the  DeferFilesys once just calls the DeferComm
one to make it accessable to things within the  filesystem).
DeferReply  forks  the  filesystem;  on  the  child  side it
returns 0 and in the parent it returns the pid of the child.
If  the operation method returns -1 then the Filesystem just
goes on to processing the  next  request  from  the  kernel.
When  the child is ready to reply, it can just return in the
normal way.  The call to DeferReply sets  up  the  DeferComm
class in the child process to reply through the parent rather
than going straight to the kernel, in order to make sure the
replies  from multiple processes don't get jumbled up.  When
the reply has been sent back, the child process just  exits.

Because  the child is really a child process, you have to do
all the changes in filesystem state before calling  DeferRe-
ply,  or arrange for some other mechanism for the parent and
children to talk.
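
The shape of a deferred reply can be sketched with fork(2)
and a pipe.  This is an assumption-laden illustration, not
DeferComm itself: the Deferrer class, reply_and_exit,
collect and do_slow_operation are hypothetical names, the
sketch handles one deferred reply at a time, and in the real
library the child simply returns from the operation method
instead of calling a reply function.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for DeferComm: the child replies through a
// pipe to the parent rather than writing to the kernel itself.
class Deferrer {
    int rfd = -1;   // parent side: read end of the reply pipe
    int wfd = -1;   // child side: write end of the reply pipe
public:
    ~Deferrer() {
        if (rfd >= 0) close(rfd);
        if (wfd >= 0) close(wfd);
    }
    // Like fork(): 0 in the child, the child's pid in the parent,
    // -1 if the pipe or the fork could not be made.
    pid_t DeferReply() {
        int p[2];
        if (pipe(p) < 0) return -1;
        pid_t pid = fork();
        if (pid < 0) { close(p[0]); close(p[1]); return -1; }
        if (pid == 0) { close(p[0]); wfd = p[1]; }
        else          { close(p[1]); rfd = p[0]; }
        return pid;
    }
    // Child side: pass the finished reply to the parent, then exit.
    [[noreturn]] void reply_and_exit(const std::string &reply) {
        ssize_t n = write(wfd, reply.data(), reply.size());
        _exit(n == static_cast<ssize_t>(reply.size()) ? 0 : 1);
    }
    // Parent side: read the child's reply and reap the child.
    std::string collect(pid_t child) {
        std::string reply;
        char buf[256];
        ssize_t n;
        while ((n = read(rfd, buf, sizeof buf)) > 0)
            reply.append(buf, static_cast<std::size_t>(n));
        close(rfd);
        rfd = -1;
        int status = 0;
        waitpid(child, &status, 0);
        return reply;
    }
};

// A slow operation method.  In the parent it returns -1, meaning
// "no reply yet, go on to the next request".
int do_slow_operation(Deferrer &d, pid_t &child) {
    child = d.DeferReply();
    if (child < 0) return EAGAIN;          // could not fork: fail normally
    if (child == 0)
        d.reply_and_exit("slow result");   // child: work, reply, exit
    return -1;                             // parent: the reply is deferred
}
```

Anything the child changes after the fork is invisible to
the parent, which is why only the reply itself travels back
through the pipe in this sketch.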

_5_._3_._3  _M_u_l_t_i_-_t_h_r_e_a_d_e_d _f_i_l_e_s_y_s_t_e_m_s

The TThhrreeaaddCCoommmm class creates a new  lightweight  thread  for
each  request,  using the Rex lwp library (in the lwp direc-
tory).  This allows multiple requests to be  handled  within
the  one  process,  so long as one thread does not block the
whole process in a system call.  The  file  descriptor  dis-
patcher in CommBase is useful for preventing this: see ffttppffss
for a complete example of a multithreaded filesystem.
