








              BBeerrkkeelleeyy SSooffttwwaarree AArrcchhiitteeccttuurree MMaannuuaall
                         44..44BBSSDD EEddiittiioonn


                _M_. _K_i_r_k _M_c_K_u_s_i_c_k_, _M_i_c_h_a_e_l _K_a_r_e_l_s
                   _S_a_m_u_e_l _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _J_o_y
                          _R_o_b_e_r_t _F_a_b_r_y
                 Computer Systems Research Group
                    Computer Science Division
    Department of Electrical Engineering and Computer Science
               University of California, Berkeley
                       Berkeley, CA  94720


                            _A_B_S_T_R_A_C_T

          This document summarizes the system calls provided
     by the 4.4BSD operating system.  It does not attempt to
     act  as  a  tutorial for use of the system, nor does it
     attempt to explain or justify the design of the  system
     facilities.   It gives neither motivation nor implemen-
     tation details, in favor of brevity.

          The first section describes the basic kernel func-
     tions provided to a process: process naming and protec-
     tion, memory management, software interrupts, time  and
     statistics  functions, object references (descriptors),
     and resource controls.  These facilities,  as  well  as
     facilities for bootstrap, shutdown and process account-
     ing, are provided solely by the kernel.

          The second section describes the  standard  system
     abstractions  for files and filesystems, communication,
     terminal handling, and process control  and  debugging.
     These  facilities are implemented by the operating sys-
     tem or by network server processes.
























PSD:5-4                                4.4BSD Architecture Manual


NNoottaattiioonn aanndd TTyyppeess

     The notation used to describe system calls is a variant of a
C language function call, consisting of a prototype call followed
by the declaration of parameters and results.  An additional key-
word  rreessuulltt, not part of the normal C language, is used to indi-
cate which of the declared entities receive results.  As an exam-
ple, consider the _r_e_a_d call, as described in section 2.1.1:

     cc = read(fd, buf, nbytes);
     result ssize_t cc; int fd; result void *buf; size_t nbytes;

The first line shows how the read routine is called, with three
parameters.  As shown on the second line, the return value cc is
an ssize_t and read also returns information in the parameter buf.

     The descriptions of error conditions arising from each
system call are not provided here; they appear in section 2 of the
Programmer's Reference Manual.  In particular, when accessed from
the C language, many calls return a characteristic -1 value when
an error occurs, returning the error code in the global variable
errno.  Other languages may present errors in different ways.

     A number of system standard types are defined by the include
file <sys/types.h> and used in the specifications here and in
many C programs.


Type       Value
--------------------------------------------------------------
caddr_t    char *               /* a memory address */
clock_t    unsigned long        /* count of CLK_TCK's */
gid_t      unsigned long        /* group ID */
int16_t    short                /* 16-bit integer */
int32_t    int                  /* 32-bit integer */
int64_t    long long            /* 64-bit integer */
int8_t     signed char          /* 8-bit integer */
mode_t     unsigned short       /* file permissions */
off_t      quad_t               /* file offset */
pid_t      long                 /* process ID */
qaddr_t    quad_t *
quad_t     long long
size_t     unsigned int         /* count of bytes */
ssize_t    int                  /* signed size_t */
time_t     long                 /* seconds since the Epoch */
u_char     unsigned char
u_int      unsigned int
u_long     unsigned long
u_quad_t   unsigned long long
u_short    unsigned short
uid_t      unsigned long        /* user ID */
uint       unsigned int         /* System V compatibility */
uint16_t   unsigned short       /* unsigned 16-bit integer */
uint32_t   unsigned int         /* unsigned 32-bit integer */
uint64_t   unsigned long long   /* unsigned 64-bit integer */
uint8_t    unsigned char        /* unsigned 8-bit integer */
ushort     unsigned short       /* System V compatibility */


11..  KKeerrnneell pprriimmiittiivveess


     The  facilities  available  to  a user process are logically
divided into two parts: kernel facilities directly implemented by
code  running  in  the  operating  system,  and system facilities
implemented either by the system, or in cooperation with a _s_e_r_v_e_r
_p_r_o_c_e_s_s.  The kernel facilities are described in section 1.

     The  facilities  implemented  in  the kernel are those which
define the _4_._4_B_S_D _v_i_r_t_u_a_l _m_a_c_h_i_n_e in  which  each  process  runs.
Like  many real machines, this virtual machine has memory manage-
ment hardware, an interrupt facility, timers and  counters.   The
4.4BSD  virtual  machine allows access to files and other objects
through a set of _d_e_s_c_r_i_p_t_o_r_s.  Each descriptor resembles a device
controller,  and  supports  a set of operations.  Like devices on
real machines, some of which are internal to the machine and some
of  which  are  external,  parts  of the descriptor machinery are
built-in to the operating system, while other  parts  are  imple-
mented  in  server  processes  on other machines.  The facilities
provided through the descriptor machinery are described  in  sec-
tion 2.

11..11..  PPrroocceesssseess aanndd pprrootteeccttiioonn


11..11..11..  HHoosstt iiddeennttiiffiieerrss


     Each  host  has associated with it an integer host ID, and a
host name of up to MAXHOSTNAMELEN (256) characters (as defined in
_<_s_y_s_/_p_a_r_a_m_._h_>).  These identifiers are set (by a privileged user)
and retrieved using the _s_y_s_c_t_l  interface  described  in  section
1.7.1.   The  host ID is seldom used (or set), and is deprecated.
For convenience and backward compatibility, the following library
routines are provided:

     sethostid(hostid);
     long hostid;


     hostid = gethostid();
     result long hostid;


     sethostname(name, len);
     char *name; int len;


     len = gethostname(buf, buflen);
     result int len; result char *buf; int buflen;


11..11..22..  PPrroocceessss iiddeennttiiffiieerrss

Each host runs a set of _p_r_o_c_e_s_s_e_s.  Each process is largely inde-
pendent of other processes, having  its  own  protection  domain,
address  space,  timers,  and an independent set of references to
system or user implemented objects.

     Each process in a host is named by  an  integer  called  the
_p_r_o_c_e_s_s  _I_D.  This number is in the range 1-30000 and is returned
by the _g_e_t_p_i_d routine:

     pid = getpid();
     result pid_t pid;

On each host this identifier is guaranteed to be unique; in a
multi-host environment, the (hostid, process ID) pairs are
guaranteed unique.  The parent process identifier can be obtained
using the getppid routine:

     pid = getppid();
     result pid_t pid;


11..11..33..  PPrroocceessss ccrreeaattiioonn aanndd tteerrmmiinnaattiioonn


A  new  process  is  created  by making a logical duplicate of an
existing process:

     pid = fork();
     result pid_t pid;

The _f_o_r_k call returns twice, once in the  parent  process,  where
_p_i_d is the process identifier of the child, and once in the child
process where _p_i_d is 0.  The parent-child relationship imposes  a
hierarchical structure on the set of processes in the system.

     For  processes  that  are  forking solely for the purpose of
_e_x_e_c_v_e'ing another program, the  _v_f_o_r_k  system  call  provides  a
faster interface:

     pid = vfork();
     result pid_t pid;

Like  _f_o_r_k,  the  _v_f_o_r_k  call  returns  twice, once in the parent
process, where _p_i_d is the process identifier of  the  child,  and
once  in the child process where _p_i_d is 0.  The parent process is
suspended until the child process calls either _e_x_e_c_v_e or _e_x_i_t.











4.4BSD Architecture Manual                                PSD:5-7


A process may terminate by executing an _e_x_i_t call:

     exit(status);
     int status;

The low-order 8 bits of the exit status are available to the
parent of the process.

     When a child process exits or terminates abnormally, the
parent process receives information about the event which caused
termination of the child process.  The interface allows the
parent to wait for a particular process, process group, or any
direct descendant and to retrieve information about resources
consumed by the process during its lifetime.  The request may be
done either synchronously (waiting for one of the requested
processes to exit), or asynchronously (polling to see if any of the
requested processes have exited):

     pid = wait4(wpid, astatus, options, arusage);
     result pid_t pid; pid_t wpid; result int *astatus;
     int options; result struct rusage *arusage;


     A process can overlay itself with the memory image of
another program, passing the newly loaded program a set of
parameters, using the call:

     execve(name, argv, envp);
     char *name, *argv[], *envp[];

The specified _n_a_m_e must be a file which is in a format recognized
by  the  system,  either a binary executable file or a file which
causes the  execution  of  a  specified  interpreter  program  to
process  its  contents.   If the set-user-ID mode bit is set, the
effective user ID is set to the owner of the file;  if  the  set-
group-ID  mode  bit  is set, the effective group ID is set to the
group of the file.  Whether changed or not, the effective user ID
is  then  copied to the saved user ID, and the effective group ID
is copied to the saved group ID.

11..11..44..  UUsseerr aanndd ggrroouupp IIDDss


     Each process in the system has associated with it three user
IDs:  a  _r_e_a_l _u_s_e_r _I_D, an _e_f_f_e_c_t_i_v_e _u_s_e_r _I_D, and a _s_a_v_e_d _u_s_e_r _I_D,
all unsigned integral types (uuiidd__tt).  Each  process  has  a  _r_e_a_l
_g_r_o_u_p _I_D and a set of _a_c_c_e_s_s _g_r_o_u_p _I_D_s, the first of which is the
_e_f_f_e_c_t_i_v_e _g_r_o_u_p _I_D.  The group IDs are  unsigned  integral  types
(ggiidd__tt).   Each  process  may  be in multiple access groups.  The
maximum concurrent number of access groups is a  system  compila-
tion  parameter,  represented by the constant NGROUPS in the file
_<_s_y_s_/_p_a_r_a_m_._h_>.  It is guaranteed to be at least 16.

The real group ID is used in process accounting  and  in  testing
whether  the  effective  group  ID  may  be  changed;  it  is not









PSD:5-8                                4.4BSD Architecture Manual


otherwise used for access control.  The  members  of  the  access
group ID set are used for access control.  Because the first mem-
ber of the set is the effective group ID, which is  changed  when
executing a set-group-ID program, that element is normally dupli-
cated in the set so that access privileges for the original group
are not lost when using a set-group-ID program.

The  real  and  effective  user IDs associated with a process are
returned by:

     ruid = getuid();
     result uid_t ruid;


     euid = geteuid();
     result uid_t euid;

the real and effective group IDs by:

     rgid = getgid();
     result gid_t rgid;


     egid = getegid();
     result gid_t egid;

The access group ID set is returned by a getgroups call:

     ngroups = getgroups(gidsetsize, gidset);
     result int ngroups; int gidsetsize; result gid_t gidset[gidsetsize];


The user and group IDs are assigned at login time using the
setuid, setgid, and setgroups calls:

     setuid(uid);
     uid_t uid;


     setgid(gid);
     gid_t gid;


     setgroups(gidsetsize, gidset);
     int gidsetsize; gid_t gidset[gidsetsize];

The _s_e_t_u_i_d call sets the real, effective, and saved user IDs, and
is permitted only if the specified _u_i_d is the current  real  user
ID  or if the caller is the super-user.  The _s_e_t_g_i_d call sets the
real, effective, and saved group IDs; it is permitted only if the
specified  _g_i_d  is  the current real group ID or if the caller is
the super-user.  The _s_e_t_g_r_o_u_p_s call sets the access group ID set,
and is restricted to the super-user.










4.4BSD Architecture Manual                                PSD:5-9


The  _s_e_t_e_u_i_d routine allows any process to set its effective user
ID to either its real or saved user ID:

     seteuid(uid);
     uid_t uid;

The _s_e_t_e_g_i_d routine allows any process to set its effective group
ID to either its real or saved group ID:

     setegid(gid);
     gid_t gid;


11..11..55..  SSeessssiioonnss


     When  a user first logs onto the system, they are put into a
session with a controlling process (usually a shell).   The  ses-
sion is created with the call:

     pid = setsid();
     result pid_t pid;

All subsequent processes created by the user (that do not call
setsid) will be part of the session.  The session also has a
login name associated with it which is set using the privileged
call:

     setlogin(name);
     char *name;

The login name can be retrieved using the call:

     name = getlogin();
     result char *name;

Unlike historic systems, the value returned by getlogin is stored
in the kernel and can be trusted.

11..11..66..  PPrroocceessss ggrroouuppss


     Each process in the system is also associated with a _p_r_o_c_e_s_s
_g_r_o_u_p.  The group of processes in a process  group  is  sometimes
referred  to  as a _j_o_b and manipulated by high-level system soft-
ware (such as the shell).  All members of  a  process  group  are
members  of  the  same  session.   The current process group of a
process is returned by the _g_e_t_p_g_r_p call:

     pgrp = getpgrp();
     result pid_t pgrp;

When a process is in a specific process group it may receive
software interrupts affecting the group, causing the group to
suspend or resume execution or to be interrupted or terminated.
In particular, a system terminal has a process group and only
processes which are in the process group of the terminal may read
from the terminal, allowing arbitration of a terminal among
several different jobs.

The process group associated with a process may be changed by the
setpgid call:

     setpgid(pid, pgrp);
     pid_t pid, pgrp;

Newly  created  processes  are assigned process IDs distinct from
all processes and process groups, and the same process  group  as
their parent.  Any process may set its process group equal to its
process ID or to the value of any process group within  its  ses-
sion.

11..22..  MMeemmoorryy mmaannaaggeemmeenntt


11..22..11..  TTeexxtt,, ddaattaa,, aanndd ssttaacckk


     Each  process  begins  execution with three logical areas of
memory called text, data, and stack.  The text area is  read-only
and  shared, while the data and stack areas are writable and pri-
vate to the process.  Both  the  data  and  stack  areas  may  be
extended and contracted on program request.  The call:

     brk(addr);
     caddr_t addr;

sets the end of the data segment to the specified address.  More
conveniently, the end can be extended by incr bytes, and the base
of the new area returned with the call:

     addr = sbrk(incr);
     result caddr_t addr; int incr;

Application programs normally use the library routines malloc and
free, which provide a more convenient interface than brk and
sbrk.

There  is no call for extending the stack, as it is automatically
extended as needed.

11..22..22..  MMaappppiinngg ppaaggeess


     The system supports sharing of  data  between  processes  by
allowing  pages to be mapped into memory.  These mapped pages may
be _s_h_a_r_e_d with other processes or _p_r_i_v_a_t_e to the  process.   Pro-
tection and sharing options are defined in _<_s_y_s_/_m_m_a_n_._h_> as:









4.4BSD Architecture Manual                               PSD:5-11


     Protections are chosen from these bits, or-ed together:

     PROT_READ           /* pages can be read */
     PROT_WRITE          /* pages can be written */
     PROT_EXEC           /* pages can be executed */




     Flags contain sharing type and options.  Sharing options, choose one:

     MAP_SHARED                   /* share changes */
     MAP_PRIVATE                  /* changes are private */




     Option flags[+]:

     MAP_ANON           /* allocated from virtual memory; fd ignored */
     MAP_FIXED          /* map addr must be exactly as requested */
     MAP_NORESERVE      /* don't reserve needed swap area */
     MAP_INHERIT        /* region is retained after exec */
     MAP_HASSEMAPHORE   /* region may contain semaphores */


The size of a page is CPU-dependent, and is returned by the
sysctl interface described in section 1.7.1.  The getpagesize
library routine is provided for convenience and backward
compatibility:

     pagesize = getpagesize();
     result int pagesize;


The call:

     maddr = mmap(addr, len, prot, flags, fd, pos);
     result caddr_t maddr; caddr_t addr; size_t len; int prot, flags, fd; off_t pos;

causes the pages starting at addr and continuing for at most len
bytes to be mapped from the object represented by descriptor fd,
starting at byte offset pos.  If addr is NULL, the system picks
an unused address for the region.  The starting address of the
region is returned; for the convenience of the system, it may
differ from that supplied unless the MAP_FIXED flag is given, in
which case the exact address will be used or the call will fail.
The addr parameter must be a multiple of the pagesize (if
MAP_FIXED is given).  If pos and len are not a multiple of
pagesize, they will be rounded (down and up respectively) to a page
boundary by the system; the rounding will cause the mapped region
to extend past the specified range.  A successful mmap will
delete any previous mapping in the allocated address range.  The
parameter prot specifies the accessibility of the mapped pages.
The parameter flags specifies the type of object to be mapped,
mapping options, and whether modifications made to this mapped
copy of the page are to be kept private, or are to be shared with
other references.  Possible types include MAP_SHARED or
MAP_PRIVATE that map a regular file or character-special device
memory, and MAP_ANON, which maps memory not associated with any
specific file.  The file descriptor used when creating MAP_ANON
regions is not used and should be -1.  The MAP_INHERIT flag allows
a region to be inherited after an execve.  The MAP_HASSEMAPHORE
flag allows special handling for regions that may contain
semaphores.  The MAP_NORESERVE flag allows processes to allocate
regions whose virtual address space, if fully allocated, would
exceed the available memory plus swap resources.  Processes using
such regions may get a SIGSEGV signal if they page fault and
resources are not available to service their request; typically
they would free up some resources via munmap so that when they
return from the signal the page fault could be completed
successfully.
-----------
[+]  In 4.4BSD, only MAP_ANON and MAP_FIXED are implemented.

A  facility  is  provided to synchronize a mapped region with the
file it maps; the call:

     msync(addr, len);
     caddr_t addr; size_t len;

causes any modified pages in the specified region to be
synchronized with their source and other mappings.  If necessary, it
writes any modified pages back to the filesystem, and updates the
file modification time.  If len is 0, all modified pages within
the region containing addr will be flushed; this usage is
provisional, and may be withdrawn.  If len is non-zero, only the pages
containing addr and len succeeding locations will be examined.
Any required synchronization of memory caches will also take
place at this time.

Filesystem operations on a file that is mapped for shared
modifications are currently unpredictable except after an msync.

A mapping can be removed by the call:

     munmap(addr, len);
     caddr_t addr; size_t len;

This  call  deletes the mappings for the specified address range,
and causes further references to addresses within  the  range  to
generate invalid memory references.

11..22..33..  PPaaggee pprrootteeccttiioonn ccoonnttrrooll


A process can control the protection of pages using the call:










4.4BSD Architecture Manual                               PSD:5-13


     mprotect(addr, len, prot);
     caddr_t addr; size_t len; int prot;

This call changes the specified pages to have protection prot.
Not all implementations will guarantee protection on a page
basis; the granularity of protection changes may be as large as
an entire region.

11..22..44..  GGiivviinngg aanndd ggeettttiinngg aaddvviiccee


A process that has knowledge of its memory behavior may  use  the
_m_a_d_v_i_s_e[+] call:

     madvise(addr, len, behav);
     caddr_t addr; size_t len; int behav;

_B_e_h_a_v describes expected behavior, as given in _<_s_y_s_/_m_m_a_n_._h_>:


     MADV_NORMAL       /* no further special treatment */
     MADV_RANDOM       /* expect random page references */
     MADV_SEQUENTIAL   /* expect sequential references */
     MADV_WILLNEED     /* will need these pages */
     MADV_DONTNEED     /* don't need these pages */


The  _m_i_n_c_o_r_e[+]  function  allows a process to obtain information
about whether pages are memory resident:

     mincore(addr, len, vec);
     caddr_t addr; size_t len; result char *vec;

Here the current memory residency of the pages is returned in the
character  array  _v_e_c, with a value of 1 meaning that the page is
in-memory.  _M_i_n_c_o_r_e provides  only  transient  information  about
page  residency.   Real-time processes that need guaranteed resi-
dence over time can use the call:

     mlock(addr, len);
     caddr_t addr; size_t len;

This call locks the pages for the specified address range into
memory (paging them in if necessary) ensuring that further
references to addresses within the range will never generate page
faults.  The amount of memory that may be locked is controlled by
a resource limit, see section 1.6.3.  When the memory is no
longer critical it can be unlocked using:

     munlock(addr, len);
     caddr_t addr; size_t len;

After the munlock call, the pages in the specified address range
are still accessible but may be paged out if memory is needed and
they are not accessed.
-----------
[+]  The entry point for this system call is defined, but is not
implemented, so currently always returns with the error
``Operation not supported.''

11..22..55..  SSyynncchhrroonniizzaattiioonn pprriimmiittiivveess

Primitives  are  provided for synchronization using semaphores in
shared memory.[++] These primitives are expected to be superseded
by the semaphore interface being  specified  by  the  POSIX  1003
Pthread  standard.   They  are  provided  as an efficient interim
solution.  Application programmers  are  encouraged  to  use  the
Pthread interface when it becomes available.

     Semaphores must lie within a MAP_SHARED region with at least
modes PROT_READ and PROT_WRITE.  The MAP_HASSEMAPHORE  flag  must
have  been  specified  when the region was created.  To acquire a
lock a process calls:

     value = mset(sem, wait);
     result int value; semaphore *sem; int wait;

_M_s_e_t indivisibly tests and sets the semaphore _s_e_m.  If the previ-
ous  value  is  zero,  the process has acquired the lock and _m_s_e_t
returns true immediately.  Otherwise, if the _w_a_i_t flag  is  zero,
failure  is  returned.  If _w_a_i_t is true and the previous value is
non-zero, _m_s_e_t relinquishes the processor until notified that  it
should retry.

To release a lock a process calls:

     mclear(sem);
     semaphore *sem;

_M_c_l_e_a_r  indivisibly  tests  and clears the semaphore _s_e_m.  If the
``WANT'' flag is zero in the previous value, _m_c_l_e_a_r returns imme-
diately.  If the ``WANT'' flag is non-zero in the previous value,
_m_c_l_e_a_r arranges for waiting processes to retry before  returning.

     Two routines provide services analogous to the kernel sleep
and wakeup functions interpreted in the domain of shared memory.
A process may relinquish the processor by calling msleep with a
set semaphore:

     msleep(sem);
     semaphore *sem;

If the semaphore is still set when it is checked by the kernel,
the process will be put in a sleeping state until some other
process issues an mwakeup for the same semaphore within the
region using the call:

     mwakeup(sem);
     semaphore *sem;

An mwakeup may awaken all sleepers on the semaphore, or may
awaken only the next sleeper on a queue.
-----------
[++] All currently unimplemented, no entry points exist.

11..33..  SSiiggnnaallss



11..33..11..  OOvveerrvviieeww


     The system defines a set of _s_i_g_n_a_l_s that may be delivered to
a  process.   Signal delivery resembles the occurrence of a hard-
ware interrupt: the signal is blocked  from  further  occurrence,
the  current process context is saved, and a new one is built.  A
process may specify a _h_a_n_d_l_e_r to which a signal is delivered,  or
specify  that  the signal is to be _b_l_o_c_k_e_d or _i_g_n_o_r_e_d.  A process
may also specify that a _d_e_f_a_u_l_t action is to be taken  when  sig-
nals occur.

     Some signals will cause a process to exit if they are not
caught.  This may be accompanied by creation of a core image
file, containing the current memory image of the process for use
in post-mortem debugging.  A process may also choose to have
signals delivered on a special stack, so that sophisticated software
stack manipulations are possible.

     All signals have the same priority.  If multiple signals are
pending, signals that may be generated by the program's action
are delivered first; the order in which other signals are
delivered to a process is not specified.  Signal routines execute with
the signal that caused their invocation blocked, but other
signals may occur.  Multiple signals may be delivered on a single
entry to the system, as if signal handlers were interrupted by
other signal handlers.  Mechanisms are provided whereby critical
sections of code may protect themselves against the occurrence of
specified signals.

11..33..22..  SSiiggnnaall ttyyppeess


     The  signals  defined  by  the  system fall into one of five
classes: hardware conditions, software  conditions,  input/output
notification,  process  control, or resource control.  The set of
signals is defined by the file _<_s_i_g_n_a_l_._h_>.

     Hardware signals are  derived  from  exceptional  conditions
which  may  occur  during execution.  Such signals include SIGFPE
representing floating  point  and  other  arithmetic  exceptions,









PSD:5-16                               4.4BSD Architecture Manual


SIGILL for illegal instruction execution, SIGSEGV for attempts to
access addresses outside the currently assigned area  of  memory,
and SIGBUS for accesses that violate memory access constraints.

     Software signals reflect interrupts generated by user
request: SIGINT for the normal interrupt signal; SIGQUIT for the
more powerful quit signal, which normally causes a core image to
be generated; SIGHUP and SIGTERM that cause graceful process
termination, either because a user has ``hung up'', or by user or
program request; and SIGKILL, a more powerful termination signal
which a process cannot catch or ignore.  Programs may define
their own asynchronous events using SIGUSR1 and SIGUSR2.  Other
software signals (SIGALRM, SIGVTALRM, SIGPROF) indicate the
expiration of interval timers.  When a window changes size, a
SIGWINCH is sent to the controlling terminal process group.

     A process can request notification via a SIGIO signal when
input or output is possible on a descriptor, or when a
non-blocking operation completes.  A process may request to receive a
SIGURG signal when an urgent condition arises.

     A process may be _s_t_o_p_p_e_d by a signal sent to it or the  mem-
bers of its process group.  The SIGSTOP signal is a powerful stop
signal, because it cannot be caught.  Other stop signals SIGTSTP,
SIGTTIN, and SIGTTOU are used when a user request, input request,
or output request respectively is the  reason  for  stopping  the
process.   A  SIGCONT signal is sent to a process when it is con-
tinued from a stopped state.  Processes may receive  notification
with  a SIGCHLD signal when a child process changes state, either
by stopping or by terminating.

     Exceeding resource limits may cause signals to be generated.
SIGXCPU  occurs  when  a  process  nears  its  CPU time limit and
SIGXFSZ when a process reaches the limit on file size.

11..33..33..  SSiiggnnaall hhaannddlleerrss


     A process has a handler associated with  each  signal.   The
handler controls the way the signal is delivered.  The call:


     struct sigaction {
          void       (*sa_handler)();
          sigset_t   sa_mask;
          int        sa_flags;
     };



     sigaction(signo, sa, osa);
     int signo; struct sigaction *sa; result struct sigaction *osa;

assigns interrupt handler address sa_handler to signal signo.
Each handler address specifies either an interrupt routine for
the signal, that the signal is to be ignored, or that a default
action (usually process termination) is to occur if the signal
occurs.  The constants SIG_IGN and SIG_DFL used as values for
sa_handler cause ignoring or defaulting of a condition,
respectively.  The sa_mask value specifies the signal mask to be used
when the handler is invoked; it implicitly includes the signal
which invoked the handler.  Signal masks include one bit for each
signal.  The following macros, defined in signal.h, create an
empty mask, then put signo into it:

     sigemptyset(set);
     sigaddset(set, signo);
     result sigset_t *set; int signo;

_S_a___f_l_a_g_s   specifies  whether  pending  system  calls  should  be
restarted if the signal handler returns (SA_RESTART) and  whether
the handler should operate on the normal run-time stack or a spe-
cial signal stack (SA_ONSTACK; see below).  If _o_s_a  is  non-zero,
the previous signal handler information is returned.
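
As an illustration of the call described above, the following sketch
installs a handler for SIGUSR1 that also blocks SIGUSR2 while it runs
and asks for interrupted system calls to be restarted.  The names
on_usr1, install_usr1_handler, and got_signal are hypothetical and
exist only for this example.  A present-day POSIX system is assumed,
where a handler installed through sa_handler receives only the signal
number.

```c
#define _XOPEN_SOURCE 700   /* expose POSIX signal declarations */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t is safe to store from a handler. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical handler: records which signal arrived. */
static void on_usr1(int signo)
{
    got_signal = signo;
}

/*
 * Install on_usr1 for SIGUSR1.  The previous action is stored in *osa
 * when osa is non-NULL.  Returns 0 on success, -1 on error.
 */
static int install_usr1_handler(struct sigaction *osa)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);          /* start from an empty mask */
    sigaddset(&sa.sa_mask, SIGUSR2);   /* also block SIGUSR2 in the handler */
    sa.sa_flags = SA_RESTART;          /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, osa);
}
```

After install_usr1_handler succeeds, sending SIGUSR1 to the process
runs on_usr1; passing the saved structure back to sigaction restores
the earlier disposition.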

     When a signal condition arises for a process, the signal is
added to a set of signals pending for the process.  If the signal
is not currently blocked by the process it then will be delivered.
The process of signal delivery adds the signal to be
delivered and those signals specified in the associated signal
handler's sa_mask to a set of those masked for the process, saves
the current process context, and places the process in the context
of the signal handling routine.  The call is arranged so
that if the signal handling routine returns normally, the signal
mask will be restored and the process will resume execution in
the original context.

     The mask of blocked signals is independent of handlers for
signals.  It delays signals from being delivered much as a raised
hardware  interrupt  priority  level  delays hardware interrupts.
Preventing an interrupt from occurring by changing the handler is
analogous to disabling a device from further interrupts.

The signal handling routine sa_handler is called by a C call of
the form:

     (*sa_handler)(signo, code, scp);
     int signo; long code; struct sigcontext *scp;

The _s_i_g_n_o gives the number of the signal that occurred,  and  the
_c_o_d_e, a word of signal-specific information supplied by the hard-
ware.  The _s_c_p parameter is  a  pointer  to  a  machine-dependent
structure  containing  the  information for restoring the context
before the signal.  Normally this context will be  restored  when
the  signal handler returns.  However, a process may do so at any
time by using the call:

     sigreturn(scp);
     struct sigcontext *scp;

If the signal handler makes a call to longjmp, the signal mask at
the time of the corresponding setjmp is restored.

11..33..44..  SSeennddiinngg ssiiggnnaallss


A process can send a signal to another process or process group
with the call:

     kill(pid, signo);
     pid_t pid; int signo;

A compatibility routine for old systems is provided to send a
signal to a process group:

     killpg(pgrp, signo);
     pid_t pgrp; int signo;

Unless the process sending the signal is privileged, it must have
the same effective user id as the process receiving the signal.

     Signals also are sent implicitly from a terminal  device  to
the process group associated with the terminal when certain input
characters are typed.

11..33..55..  PPrrootteeccttiinngg ccrriittiiccaall sseeccttiioonnss


The _s_i_g_p_r_o_c_m_a_s_k system call is used to  manipulate  the  mask  of
blocked signals:

     sigprocmask(how, newmask, oldmask);
     int how; sigset_t *newmask; result sigset_t *oldmask;

The actions done by sigprocmask are to add to the list of masked
signals (SIG_BLOCK), delete from the list of masked signals
(SIG_UNBLOCK), or set the mask to a specific set of signals
(SIG_SETMASK).  The sigprocmask call can be used to read the current
mask by specifying SIG_BLOCK with an empty newmask.

     It is possible to check conditions with some signals
blocked, and then to pause waiting for a signal while restoring the
mask in a single step, by using:

     sigsuspend(mask);
     sigset_t *mask;

It is also possible to find out which blocked signals are pending
delivery using the call:

     sigpending(mask);
     result sigset_t *mask;
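
The three calls of this section can be combined as in the sketch
below, which holds SIGUSR1 pending across a critical section, checks
for it with sigpending, and then waits for it with sigsuspend.  The
function and variable names are hypothetical, and the sketch assumes a
single-threaded POSIX process in which SIGUSR1 was not blocked on
entry.

```c
#define _XOPEN_SOURCE 700   /* expose POSIX signal declarations */
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count = 0;

static void count_usr1(int signo)
{
    (void)signo;
    usr1_count++;
}

/*
 * Hold SIGUSR1 pending across a critical section, then wait for it.
 * Returns 1 if the signal was pending while blocked and was delivered
 * exactly once during the wait, 0 otherwise, -1 on error.
 */
static int wait_for_usr1(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask, pending;
    int was_pending;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = count_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    /* Critical section: SIGUSR1 is held pending, not delivered. */
    usr1_count = 0;
    raise(SIGUSR1);
    if (sigpending(&pending) == -1)
        return -1;
    was_pending = sigismember(&pending, SIGUSR1) == 1 && usr1_count == 0;

    /* Unblock SIGUSR1 and pause in one step; the handler runs here. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    while (usr1_count == 0)
        sigsuspend(&wait_mask);

    /* sigsuspend put the blocking mask back; restore the original. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) == -1)
        return -1;
    return was_pending && usr1_count == 1;
}
```

Testing the condition (here usr1_count) while the signal is blocked
and only unblocking it inside sigsuspend avoids the race in which the
signal arrives between the test and the pause.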


11..33..66..  SSiiggnnaall ssttaacckkss


Applications that maintain complex or fixed size stacks  can  use
the call:


     struct sigaltstack {
          caddr_t   ss_sp;
          long      ss_size;
          int       ss_flags;
     };



     sigaltstack(ss, oss);
     struct sigaltstack *ss; result struct sigaltstack *oss;

to provide the system with a stack based at ss_sp of size ss_size
for delivery of signals.  The value ss_flags indicates whether
the process is currently on the signal stack, a notion maintained
in software by the system.

     When a signal is to be delivered to a handler for which  the
SA_ONSTACK flag was set, the system checks whether the process is
on a signal stack.  If not, then the process is switched  to  the
signal  stack for delivery, with the return from the signal doing
a _s_i_g_r_e_t_u_r_n to restore the previous stack.  If the process  takes
a  non-local  exit  from  the  signal  routine, _l_o_n_g_j_m_p will do a
_s_i_g_r_e_t_u_r_n call to switch back to the run-time stack.

11..44..  TTiimmeerrss


11..44..11..  RReeaall ttiimmee


     The system's notion of the current time is kept in Coordinated
Universal Time (UTC, previously GMT).  The current time and time
zone are set and returned by the calls:

     settimeofday(tp, tzp);
     struct timeval *tp;
     struct timezone *tzp;


     gettimeofday(tp, tzp);
     result struct timeval *tp;
     result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:


     struct timeval {
          long   tv_sec;           /* seconds since Jan 1, 1970 */
          long   tv_usec;          /* and microseconds */
     };
     struct timezone {
          int    tz_minuteswest;   /* of Greenwich */
          int    tz_dsttime;       /* type of dst correction to apply */
     };


The timezone information is present only for  historical  reasons
and is unused by the current system.

The precision of the system clock is hardware dependent.  Earlier
versions of UNIX contained only a 1-second resolution version  of
this call, which remains as a library routine:

     time(tvsec);
     result time_t *tvsec;

returning only the tv_sec field from the gettimeofday call.
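
A minimal sketch of reading the clock and doing arithmetic on timeval
values follows.  The helper names timeval_diff_usec and now_usec are
hypothetical.  Because the timezone argument is unused, the sketch
passes NULL for it.

```c
#define _XOPEN_SOURCE 700   /* expose gettimeofday */
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Return end - start, in microseconds. */
static long long timeval_diff_usec(const struct timeval *start,
                                   const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (end->tv_usec - start->tv_usec);
}

/* Current time in microseconds since Jan 1, 1970, or -1 on error. */
static long long now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) == -1)   /* tzp is unused; pass NULL */
        return -1;
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

An interval is measured by calling gettimeofday before and after the
work of interest and passing the two structures to timeval_diff_usec.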

The _a_d_j_t_i_m_e system calls allows for small changes in time without
abrupt changes by skewing the rate at which time advances:

     adjtime(delta, olddelta);
     struct timeval *delta; result struct timeval *olddelta;


11..44..22..  IInntteerrvvaall ttiimmee


The system provides each  process  with  three  interval  timers,
defined in <sys/time.h>:


     ITIMER_REAL      /* real time intervals */
     ITIMER_VIRTUAL   /* virtual time intervals */
     ITIMER_PROF      /* user and system virtual time */


The  ITIMER_REAL timer decrements in real time.  It could be used
by a library routine to  maintain  a  wakeup  service  queue.   A
SIGALRM signal is delivered when this timer expires.

     The ITIMER_VIRTUAL timer decrements in process virtual time.
It runs only when the process is executing.  A  SIGVTALRM  signal
is delivered when it expires.

     The  ITIMER_PROF  timer  decrements  both in process virtual
time and when the system is running on behalf of the process.  It
is  designed  to  be  used  by processes to statistically profile
their execution.  A SIGPROF signal is delivered when it  expires.

A timer value is defined by the itimerval structure:


     struct itimerval {
          struct   timeval it_interval;   /* timer interval */
          struct   timeval it_value;      /* current value */
     };


and a timer is set or read by the calls:

     setitimer(which, value, ovalue);
     int which; struct itimerval *value; result struct itimerval *ovalue;


     getitimer(which, value);
     int which; result struct itimerval *value;

The  _i_t___v_a_l_u_e  specifies  the  time  until  the  next signal; the
_i_t___i_n_t_e_r_v_a_l specifies a new interval that should be  loaded  into
the  timer  on  each expiration.  The third argument to _s_e_t_i_t_i_m_e_r
specifies an optional structure to receive the previous  contents
of  the  interval  timer.   A  timer  can  be disabled by setting
_i_t___v_a_l_u_e and _i_t___i_n_t_e_r_v_a_l to 0.

     The system rounds argument timer intervals to  be  not  less
than  the  resolution of its clock.  This clock resolution can be
determined by loading a very small value into a timer and reading
the timer back to see what value resulted.
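
The following sketch shows one way a periodic ITIMER_REAL timer might
be armed, waited on, and disabled.  The names on_alarm, alarms, and
wait_for_alarms are hypothetical, and the sketch assumes a
single-threaded POSIX process.

```c
#define _XOPEN_SOURCE 700   /* expose setitimer and sigaction */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarms = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarms++;
}

/*
 * Arm ITIMER_REAL to expire every interval_usec microseconds, wait for
 * at least `count` expirations, then disable the timer.  Returns the
 * number of SIGALRM signals counted, or -1 on error.
 */
static int wait_for_alarms(long interval_usec, int count)
{
    struct sigaction sa;
    struct itimerval it, off;
    sigset_t block, old, wait_mask;

    assert(interval_usec > 0 && count > 0);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    /* Block SIGALRM so it is delivered only while in sigsuspend. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);

    alarms = 0;
    it.it_interval.tv_sec = interval_usec / 1000000;
    it.it_interval.tv_usec = interval_usec % 1000000;
    it.it_value = it.it_interval;      /* time until first expiration */
    if (setitimer(ITIMER_REAL, &it, NULL) == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    while (alarms < count)
        sigsuspend(&wait_mask);

    memset(&off, 0, sizeof(off));      /* zero it_value disables the timer */
    setitimer(ITIMER_REAL, &off, NULL);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return alarms;
}
```

A call such as wait_for_alarms(10000, 3) returns after roughly three
10-millisecond periods, subject to the clock resolution noted above.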

     The  _a_l_a_r_m  system  call of earlier versions of UNIX is pro-
vided as a library routine using the ITIMER_REAL timer.

     The process profiling facilities of earlier versions of UNIX
remain  because  it is not always possible to guarantee the auto-
matic restart of system calls after receipt  of  a  signal.   The
_p_r_o_f_i_l  call arranges for the kernel to begin gathering execution
statistics for a process:

     profil(samples, size, offset, scale);
     result char *samples; int size, offset, scale;

This call begins sampling the program  counter,  with  statistics
maintained in the user-provided buffer.

11..55..  DDeessccrriippttoorrss


11..55..11..  TThhee rreeffeerreennccee ttaabbllee


     Each process has access to resources through descriptors.
Each descriptor is  a  handle  allowing  processes  to  reference
objects such as files, devices and communications links.

     Rather than allowing processes direct access to descriptors,
the system introduces a level of indirection, so that descriptors
may be shared between processes.  Each process has a descriptor
reference table, containing pointers to the actual descriptors.
The  descriptors  themselves  therefore  may have multiple refer-
ences, and are reference counted by the system.

     Each process has a limited size descriptor reference  table,
where the current size is returned by the getdtablesize call:

     nds = getdtablesize();
     result int nds;

and guaranteed to be at least 64.  The maximum number of descrip-
tors is a resource limit (see section 1.6.3).  The entries in the
descriptor reference table are referred to by small integers; for
example if there are 64 slots they are numbered 0 to 63.

11..55..22..  DDeessccrriippttoorr pprrooppeerrttiieess


     Each descriptor has a logical set of  properties  maintained
by the system and defined by its type.  Each type supports a set
of operations; some operations, such as reading and writing,  are
common  to  several  abstractions,  while others are unique.  For
those types that support random access, the current  file  offset
is  stored in the descriptor.  The generic operations applying to
many of these types are described in section  2.1.   Naming  con-
texts,  files and directories are described in section 2.2.  Sec-
tion 2.3 describes communications domains and sockets.  Terminals
and  (structured  and unstructured) devices are described in sec-
tion 2.4.

11..55..33..  MMaannaaggiinngg ddeessccrriippttoorr rreeffeerreenncceess


A duplicate of a descriptor reference may be made by doing:

     new = dup(old);
     result int new; int old;

returning a copy of descriptor reference old which is
indistinguishable from the original.  The value of new chosen by the
system will be the smallest unused descriptor reference slot.  A
copy of a descriptor reference may be made in a specific slot by
doing:

     dup2(old, new);
     int old, new;

The _d_u_p_2 call causes the system to deallocate the descriptor ref-
erence  current  occupying  slot _n_e_w, if any, replacing it with a
reference to the same descriptor as old.

Descriptors are deallocated by:

     close(old);
     int old;
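
The sharing of one descriptor among several references can be seen in
the sketch below, which copies the write end of a pipe into a chosen
slot with dup2, closes the original reference, and shows that the copy
still works.  The function name echo_through_dup is hypothetical, and
the caller is assumed to pass a slot number that is not otherwise in
use.

```c
#define _XOPEN_SOURCE 700   /* expose POSIX declarations */
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg through a copy of a pipe's write end placed in `slot`,
 * then read it back into out.  Returns the number of bytes read,
 * or -1 on error.
 */
static ssize_t echo_through_dup(int slot, const char *msg,
                                char *out, size_t outlen)
{
    int fds[2];
    ssize_t n;
    size_t len;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);
    if (pipe(fds) == -1)
        return -1;
    if (dup2(fds[1], slot) == -1) {    /* slot now refers to the pipe */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);     /* descriptor stays open: slot still references it */
    n = write(slot, msg, len);
    close(slot);       /* last reference to the write end is released */
    if (n != (ssize_t)len) {
        close(fds[0]);
        return -1;
    }
    n = read(fds[0], out, outlen);
    close(fds[0]);
    return n;
}
```

Closing fds[1] does not deallocate the underlying descriptor, because
the reference in slot keeps its reference count above zero.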


11..55..44..  MMuullttiipplleexxiinngg rreeqquueessttss


     The system provides a standard way  to  do  synchronous  and
asynchronous multiplexing of operations.  Synchronous multiplexing
is performed by using the select call to examine the state of
multiple  descriptors  simultaneously,  and  to  wait  for  state
changes on those descriptors.  Sets of  descriptors  of  interest
are specified as bit masks, as follows:

     nds = select(nd, in, out, except, tvp);
     result int nds; int nd; result fd_set *in, *out, *except;
     struct timeval *tvp;

     FD_CLR(fd, &fdset);
     FD_COPY(&fdset, &fdset2);
     FD_ISSET(fd, &fdset);
     FD_SET(fd, &fdset);
     FD_ZERO(&fdset);
     int fd; fd_set fdset, fdset2;

The  _s_e_l_e_c_t  call  examines the descriptors specified by the sets
_i_n, _o_u_t and _e_x_c_e_p_t, replacing the specified bit masks by the sub-
sets  that  select true for input, output, and exceptional condi-
tions respectively (_n_d indicates the number of  file  descriptors
specified by the bit masks).  If any descriptors meet the follow-
ing criteria, then the number of such descriptors is returned  in
_n_d_s and the bit masks are updated.

*    A descriptor selects for input if an input oriented operation
     such as read or receive is possible, or if a connection
     request may be accepted (see sections 2.1.3 and 2.3.1.4).

*    A descriptor selects for output if an output oriented operation
     such as write or send is possible, or if an operation
     that was ``in progress'', such as connection establishment,
     has completed (see sections 2.1.3 and 2.3.1.5).

*    A descriptor selects for an exceptional condition if a  con-
     dition  that  would  cause  a  SIGURG signal to be generated
     exists (see section 1.3.2), or other device-specific  events
     have occurred.

For  these  tests, an operation is considered to be possible if a
call to the operation would return without blocking (even if  the
O_NONBLOCK  flag  were not set).  For example, a descriptor would
test as ready for reading if a read call would return immediately
with  data,  an  end-of-file  indication,  or an error other than
EWOULDBLOCK.

If none of the specified conditions is true, the operation  waits
for  one  of the conditions to arise, blocking at most the amount
of time specified by tvp.  If tvp is given as NULL, the select
waits indefinitely.
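
A small sketch of synchronous multiplexing on a single descriptor
follows.  The helper name wait_readable is hypothetical.  The sketch
relies on the usual convention that nd is one greater than the highest
descriptor number placed in any of the sets.

```c
#define _XOPEN_SOURCE 700   /* expose select and the FD_ macros */
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to select for input.
 * Returns 1 if it is readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set in;
    struct timeval tv;
    int nds;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&in);
    FD_SET(fd, &in);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Only the input set is of interest; out and except are NULL. */
    nds = select(fd + 1, &in, NULL, NULL, &tv);
    if (nds == -1)
        return -1;
    return nds > 0 && FD_ISSET(fd, &in);
}
```

Because select overwrites the sets with the subsets that are ready,
a caller that loops must rebuild them (or copy them) before each call.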

Options  affecting I/O on a descriptor may be read and set by the
call:

     dopt = fcntl(d, cmd, arg);
     result int dopt; int d, cmd, arg;



     /* command values */

     F_DUPFD    /* return a new descriptor */
     F_GETFD    /* get file descriptor flags */
     F_SETFD    /* set file descriptor flags */
     F_GETFL    /* get file status flags */
     F_SETFL    /* set file status flags */
     F_GETOWN   /* get SIGIO/SIGURG proc/pgrp */
     F_SETOWN   /* set SIGIO/SIGURG proc/pgrp */
     F_GETLK    /* get blocking lock */
     F_SETLK    /* set or clear lock */
     F_SETLKW   /* set lock with wait */


The F_DUPFD cmd provides functionality similar to dup2, except that
it returns the lowest unused descriptor not less than arg rather than
replacing an existing one; it is provided solely for POSIX
compatibility.  The F_SETFD cmd can be used to set the close-on-exec
flag for a file descriptor.  The F_SETFL cmd may be used to set a
descriptor in non-blocking I/O mode and/or enable signaling when I/O
is possible.  F_SETOWN may be used to specify a process or process
group to be signaled when using the latter mode of operation or when
urgent indications arise.  The fcntl system call also provides
POSIX-compliant byte-range locking on files.  However, the semantics
of unlocking on every close rather than on the last close make these
locks useless.  Much better semantics and faster locking are provided
by the flock system call (see section 2.2.7).  The fcntl and flock
locks can be used concurrently; they will serialize against each
other properly.

     Operations on non-blocking descriptors will either  complete
immediately,  return the error EWOULDBLOCK, partially complete an
input or output operation returning a partial count, or return an
error  EINPROGRESS  noting  that  the  requested  operation is in
progress.  A descriptor which has signaling enabled will cause
the specified process and/or process group to be signaled, with a
SIGIO for input, output, or in-progress operation complete, or a
SIGURG for exceptional conditions.

     For  example,  when writing to a terminal using non-blocking
output, the system will accept only as  much  data  as  there  is
buffer space, then return.  When making a connection on a socket,
the operation may return indicating that the connection establishment
is ``in progress''.  The select facility can be used to
determine when further output is possible  on  the  terminal,  or
when the connection establishment attempt is complete.
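
The flag manipulation described in this section might be wrapped as in
the following sketch.  The helper names set_nonblocking and
set_cloexec are hypothetical.  Each helper reads the current flags
first so that unrelated flags are preserved.

```c
#define _XOPEN_SOURCE 700   /* expose fcntl flags */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the file status flags of fd.  Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Set the close-on-exec descriptor flag on fd.  Returns 0 or -1. */
static int set_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}
```

Once set_nonblocking has succeeded, a read on an empty pipe or
terminal returns -1 with errno set to EWOULDBLOCK instead of waiting
for data.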

11..66..  RReessoouurrccee ccoonnttrroollss


11..66..11..  PPrroocceessss pprriioorriittiieess


     The  system  gives CPU scheduling priority to processes that
have not used CPU time recently.  This tends to favor interactive
processes and processes that execute only for short periods.  The
instantaneous scheduling priority is a function of CPU usage  and
a  settable  priority  value  used in adjusting the instantaneous
priority with CPU usage or inactivity.  It is possible to  deter-
mine the settable priority factor currently assigned to a process
(PRIO_PROCESS), process group (PRIO_PGRP), or the processes of  a
specified  user  (PRIO_USER), or to alter this priority using the
calls:

     prio = getpriority(which, who);
     result int prio; int which, who;


     setpriority(which, who, prio);
     int which, who, prio;

The value _p_r_i_o is in the range -20 to 20.  The  default  priority
is  0; lower priorities cause more favorable execution.  The _g_e_t_-
_p_r_i_o_r_i_t_y call returns  the  highest  priority  (lowest  numerical
value)  enjoyed by any of the specified processes.  The _s_e_t_p_r_i_o_r_-
_i_t_y call sets the priorities of all the  specified  processes  to
the specified value.  Only the super-user may lower priorities.
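
As a sketch of these two calls, the function below makes the calling
process less favored by a given amount.  The names lower_own_priority
and PRIO_ERROR are hypothetical.  The sketch assumes the POSIX
convention that a who value of 0 with PRIO_PROCESS names the calling
process, and it clears errno before getpriority because -1 is a
legitimate priority.

```c
#define _XOPEN_SOURCE 700   /* expose getpriority and setpriority */
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

#define PRIO_ERROR (-100)   /* hypothetical sentinel outside -20..20 */

/*
 * Raise the priority value of the calling process by delta (making
 * it less favored).  Returns the resulting value, or PRIO_ERROR.
 */
static int lower_own_priority(int delta)
{
    int prio;

    assert(delta >= 0);
    errno = 0;              /* -1 can be a valid result, so check errno */
    prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0)
        return PRIO_ERROR;
    if (setpriority(PRIO_PROCESS, 0, prio + delta) == -1)
        return PRIO_ERROR;
    errno = 0;
    prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0)
        return PRIO_ERROR;
    return prio;
}
```

An unprivileged process can make this change but cannot undo it, since
moving back to a lower value requires super-user privilege.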

11..66..22..  RReessoouurrccee uuttiilliizzaattiioonn


     The   _g_e_t_r_u_s_a_g_e  call  returns  information  describing  the
resources used by the current process (RUSAGE_SELF), or  all  its
terminated descendent processes (RUSAGE_CHILDREN):

     getrusage(who, rusage);
     int who; result struct rusage *rusage;

The information is returned in a structure defined in
<sys/resource.h>:


     struct rusage {
          struct   timeval ru_utime;   /* user time used */
          struct   timeval ru_stime;   /* system time used */
          int      ru_maxrss;          /* maximum core resident set size: kbytes */
          int      ru_ixrss;           /* integral shared memory size (kbytes*sec) */
          int      ru_idrss;           /* unshared data memory size */
          int      ru_isrss;           /* unshared stack memory size */
          int      ru_minflt;          /* page-reclaims */
          int      ru_majflt;          /* page faults */
          int      ru_nswap;           /* swaps */
          int      ru_inblock;         /* block input operations */
          int      ru_oublock;         /* block output operations */
          int      ru_msgsnd;          /* messages sent */
          int      ru_msgrcv;          /* messages received */
          int      ru_nsignals;        /* signals received */
          int      ru_nvcsw;           /* voluntary context switches */
          int      ru_nivcsw;          /* involuntary context switches */
     };
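
A minimal sketch of using this structure is shown below.  It adds the
user and system times of the calling process into a single figure.
The function name cpu_time_usec is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* expose getrusage */
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Total CPU time (user plus system) used so far by the calling
 * process, in microseconds, or -1 on error.
 */
static long long cpu_time_usec(void)
{
    struct rusage ru;
    long long total;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;
    total = ((long long)ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
          + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
    assert(total >= 0);
    return total;
}
```

Calling the function before and after a computation and subtracting
gives the CPU time charged to that computation, independent of how
long the process spent waiting.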



11..66..33..  RReessoouurrccee lliimmiittss


     The resources of a process for which limits are controlled
by the kernel are defined in <sys/resource.h>, and controlled by
the getrlimit and setrlimit calls:

     getrlimit(resource, rlp);
     int resource; result struct rlimit *rlp;


     setrlimit(resource, rlp);
     int resource; struct rlimit *rlp;

The resources that may currently be controlled include:

     RLIMIT_CPU       /* cpu time in milliseconds */
     RLIMIT_FSIZE     /* maximum file size */
     RLIMIT_DATA      /* data size */
     RLIMIT_STACK     /* stack size */
     RLIMIT_CORE      /* core file size */
     RLIMIT_RSS       /* resident set size */
     RLIMIT_MEMLOCK   /* locked-in-memory address space */
     RLIMIT_NPROC     /* number of processes */
     RLIMIT_NOFILE    /* number of open files */
     RLIMIT_SBSIZE    /* maximum size of all socket buffers */
     RLIMIT_AS        /* virtual process size (inclusive of mmap) */
     RLIMIT_VMEM      /* alias of RLIMIT_AS */
     RLIMIT_NTHR      /* number of threads */


Each limit has a current value  and  a  maximum  defined  by  the
_r_l_i_m_i_t structure:


     struct rlimit {
          quad_t   rlim_cur;   /* current (soft) limit */
          quad_t   rlim_max;   /* hard limit */
     };



     Only the super-user can raise the maximum limits.  Other
users may only alter rlim_cur within the range from 0 to rlim_max
or (irreversibly) lower rlim_max.  To remove a limit on a
resource, the value is set to RLIM_INFINITY.
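
The rule that an unprivileged process may move rlim_cur only within
the range 0 to rlim_max can be sketched as follows for RLIMIT_NOFILE.
The function name set_soft_nofile is hypothetical, and the rlim_t
type is the one used by present-day POSIX headers in place of quad_t.

```c
#define _XOPEN_SOURCE 700   /* expose getrlimit and setrlimit */
#include <assert.h>
#include <sys/resource.h>

/*
 * Set the soft limit on open descriptors to `want`, clamped to the
 * hard limit, leaving the hard limit untouched.  Returns the soft
 * limit now in force, or -1 on error.
 */
static long long set_soft_nofile(long long want)
{
    struct rlimit rl;

    assert(want >= 0);
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    /* Only rlim_cur changes; it must stay within 0..rlim_max. */
    if (rl.rlim_max != RLIM_INFINITY && (rlim_t)want > rl.rlim_max)
        want = (long long)rl.rlim_max;
    rl.rlim_cur = (rlim_t)want;
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    return (long long)rl.rlim_cur;
}
```

Because rlim_max is left alone, the process can later raise the soft
limit again, up to the hard limit, without any privilege.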

11..77..  SSyysstteemm ooppeerraattiioonn ssuuppppoorrtt


Unless noted otherwise, the calls in this section  are  permitted
only to a privileged user.

11..77..11..  MMoonniittoorriinngg ssyysstteemm ooppeerraattiioonn


     The  _s_y_s_c_t_l  function  allows any process to retrieve system
information and allows processes with appropriate  privileges  to
set system configurations.

     sysctl(name, namelen, oldp, oldlenp, newp, newlen);
     int *name; u_int namelen; result void *oldp; result size_t *oldlenp;
     void *newp; size_t newlen;

The information available from sysctl consists of integers,
strings, and tables.  Sysctl returns a consistent snapshot of the
data requested.  Consistency is obtained by locking the destination
buffer into memory so that the data may be copied out without
blocking.  Calls to sysctl are serialized to avoid deadlock.

     The object to be interrogated or set is named using a
``Management Information Base'' (MIB) style name, listed in name,
which is a namelen length array of integers.  This name is from a
hierarchical name space, with the most significant component in
the first element of the array.  It is analogous to a file pathname,
but with integers as components rather than slash-separated
strings.

     The information is copied into the buffer specified by oldp.
The size of the buffer is given by the location specified by
oldlenp before the call, and that location is filled in with the
amount  of data copied after a successful call.  If the amount of
data available is greater than the size of the  buffer  supplied,
the call supplies as much data as fits in the buffer provided and
returns an error.

     To set a new value, newp is set to point to a buffer of
length newlen from which the requested value is to be taken.  If
a new value is not to be set, newp should be set to NULL and
newlen set to 0.

     The top level names (those used in the first element of the
name array) are defined with a CTL_ prefix in <sys/sysctl.h>, and
are as follows.  The next and subsequent levels down are found in
the include files listed here:


     Name          Next Level Names   Description
     ----------------------------------------------------
     CTL_DEBUG     sys/sysctl.h       Debugging
     CTL_FS        sys/sysctl.h       Filesystem
     CTL_HW        sys/sysctl.h       Generic CPU, I/O
     CTL_KERN      sys/sysctl.h       High kernel limits
     CTL_MACHDEP   sys/sysctl.h       Machine dependent
     CTL_NET       sys/socket.h       Networking
     CTL_USER      sys/sysctl.h       User-level
     CTL_VM        vm/vm_param.h      Virtual memory



11..77..22..  BBoooottssttrraapp ooppeerraattiioonnss


The call:

     mount(type, dir, flags, data);
     int type; char *dir; int flags; caddr_t data;

extends the name space.  The mount call grafts a filesystem object
onto the system file tree at the point specified in dir.  The
argument type specifies the type of filesystem to be mounted.
The argument data describes the filesystem object to be mounted
according to the type.  The contents of the filesystem become
available through the new mount point dir.  Any files in or below
_d_i_r at the time of a successful mount  disappear  from  the  name
space  until the filesystem is unmounted.  The _f_l_a_g_s value speci-
fies generic properties, such as a request to mount the  filesys-
tem read-only.

Information  about  all  mounted filesystems can be obtained with
the call:

     getfsstat(buf, bufsize, flags);
     result struct statfs *buf; long bufsize; int flags;


The call:

     swapon(blkdev);
     char *blkdev;

specifies a device to be made available for paging and  swapping.

11..77..33..  SShhuuttddoowwnn ooppeerraattiioonnss


The call:

     unmount(dir, flags);
     char *dir; int flags;

unmounts the filesystem mounted on dir.  This call will succeed
only if the filesystem is not currently  being  used  or  if  the
MNT_FORCE flag is specified.

The call:

     sync();

schedules  I/O  to flush all modified disk blocks resident in the
kernel.  (This call does not require privileged  status.)   Files
can be selectively flushed to disk using the fsync call (see
section 2.2.6).

The call:

     reboot(how);
     int how;

causes a machine halt or reboot.  The call may request a reboot
by specifying how as RB_AUTOBOOT, or that the machine be halted
with RB_HALT, among other options.  These constants are defined
in <sys/reboot.h>.

11..77..44..  AAccccoouunnttiinngg


     The  system  optionally keeps an accounting record in a file
for each process that exits on the system.  The  format  of  this
record  is beyond the scope of this document.  The accounting may
be enabled to a file path by doing:

     acct(path);
     char *path;

If _p_a_t_h is NULL, then accounting  is  disabled.   Otherwise,  the
named file becomes the accounting file.

22..  SSyysstteemm ffaacciilliittiieess


The system abstractions described are:

Directory contexts
     A  directory  context  is  a position in the filesystem name
     space.  Operations on files and other  named  objects  in  a
     filesystem  are always specified relative to such a context.

Files
     Files are used to store uninterpreted sequences of bytes,
     which may be read and written randomly.  Pages from files
     may also be mapped into the process address space.  A directory
     may be read as a file if permitted by the underlying
     storage facility, though it is usually accessed using
     getdirentries (see section 2.2.3.1).  (Local filesystems permit
     directories to be read, although most NFS implementations do
     not allow reading of directories.)

Communications domains
     A  communications domain represents an interprocess communi-
     cations environment, such as the  communications  facilities
     of the 4.4BSD system, communications in the INTERNET, or the
     resource sharing protocols and access rights of  a  resource
     sharing system on a local network.

Sockets
     A socket is an endpoint of communication and the focal point
     for IPC in a communications domain.  Sockets may be  created
     in  pairs,  or given names and used to rendezvous with other
     sockets in a communications  domain,  accepting  connections
     from  these sockets or exchanging messages with them.  These
     operations  model  a  labeled  or  unlabeled  communications
     graph,  and  can be used in a wide variety of communications
     domains.  Sockets can have different _t_y_p_e_s to  provide  dif-
     ferent  semantics of communication, increasing the flexibil-
     ity of the model.

Terminals and other devices
     Devices include terminals (providing input  editing,  inter-
     rupt generation, output flow control, and editing), magnetic
     tapes, disks, and other peripherals.  They normally  support
     the generic _r_e_a_d and _w_r_i_t_e operations as well as a number of
     _i_o_c_t_l's.

Processes
     Process  descriptors  provide  facilities  for  control  and
     debugging of other processes.

22..11..  GGeenneerriicc ooppeerraattiioonnss


     Many system abstractions support the read, write, and ioctl
operations.  We describe the basics of  these  common  primitives
here.   Similarly,  the  mechanisms  whereby normally synchronous
operations may occur in a non-blocking  or  asynchronous  fashion
are  common  to all system-defined abstractions and are described
here.

22..11..11..  RReeaadd aanndd wwrriittee


     The _r_e_a_d and _w_r_i_t_e system calls can be applied to communica-
tions  channels,  files,  terminals  and  devices.  They have the
form:

     cc = read(fd, buf, nbytes);
     result ssize_t cc; int fd; result void *buf; size_t nbytes;


     cc = write(fd, buf, nbytes);
     result ssize_t cc; int fd; void *buf; size_t nbytes;

The _r_e_a_d call transfers as much data as possible from the  object
defined  by  _f_d to the buffer at address _b_u_f of size _n_b_y_t_e_s.  The
number of bytes transferred is returned in _c_c, which is -1  if  a
return  occurred  before  any  data was transferred because of an
error or use of non-blocking operations.  A return value of 0  is
used to indicate an end-of-file condition.

     The  _w_r_i_t_e call transfers data from the buffer to the object
defined by _f_d.  Depending on the type of _f_d, it is possible  that
the  _w_r_i_t_e call will accept only a portion of the provided bytes;
the user should resubmit the other  bytes  in  a  later  request.
Error  returns  because  of  interrupted  or otherwise incomplete
operations are possible, in which case no  data  will  have  been
transferred.
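
Because a write may accept only part of the data, callers that need
the whole buffer delivered usually loop.  The following is a minimal
sketch of such a loop; the name write_all is hypothetical and is not
part of the system interface, and the decision to retry on EINTR is
an assumption about what the caller wants.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * write_all: hypothetical helper (not a system call).
 * Calls write() repeatedly until all nbytes have been accepted.
 * Returns nbytes on success, or -1 with errno set on error.
 */
ssize_t
write_all(int fd, const void *buf, size_t nbytes)
{
	const char *p = buf;	/* char pointer so arithmetic is legal */
	size_t left = nbytes;

	assert(buf != NULL || nbytes == 0);
	while (left > 0) {
		ssize_t cc = write(fd, p, left);

		if (cc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, nothing sent: retry */
			return (-1);		/* includes EWOULDBLOCK */
		}
		p += cc;		/* skip the portion that was accepted */
		left -= (size_t)cc;
	}
	return ((ssize_t)nbytes);
}
```

On a non-blocking descriptor this sketch returns -1 as soon as the
object would block, so some of the data may already have been
written; a caller that cares must track the count itself.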

     Scattering of data on input, or gathering of data for output
is also possible using an array of input/output  vector  descrip-
tors.  The type for the descriptors is defined in _<_s_y_s_/_u_i_o_._h_> as:

     struct iovec {
          char     *iov_base;   /* base of a component */
          size_t   iov_len;     /* length of a component */
     };



The _i_o_v___b_a_s_e field should be treated as if its type  were  ``void
*''  as  POSIX  and  other versions of the structure may use that
type.  Thus, pointer arithmetic should not use this value without
a cast.

The calls using an array of _i_o_v_e_c structures are:

     cc = readv(fd, iov, iovlen);
     result ssize_t cc; int fd; struct iovec *iov; int iovlen;


     cc = writev(fd, iov, iovlen);
     result ssize_t cc; int fd; struct iovec *iov; int iovlen;

Here _i_o_v_l_e_n is the count of elements in the _i_o_v array.

22..11..22..  IInnppuutt//oouuttppuutt ccoonnttrrooll


Control operations on an object are performed by the _i_o_c_t_l opera-
tion:

     ioctl(fd, request, buffer);
     int fd; u_long request; caddr_t buffer;

This operation causes the specified _r_e_q_u_e_s_t to  be  performed  on
the object _f_d.  The _r_e_q_u_e_s_t parameter encodes the operation itself,
the size of the argument buffer, and whether that buffer is to be
read, written, read and written, or not used.
Different descriptor types and subtypes within  descriptor  types
may  use distinct _i_o_c_t_l requests. For example, operations on ter-
minals control flushing of input and output queues and setting of
terminal  parameters; operations on disks cause formatting opera-
tions to occur; operations on  tapes  control  tape  positioning.
The   names   for   basic   control  operations  are  defined  by
_<_s_y_s_/_i_o_c_t_l_._h_>, or more specifically by files it includes.

22..11..33..  NNoonn--bblloocckkiinngg aanndd aassyynncchhrroonnoouuss ooppeerraattiioonnss


     A process that wishes to do non-blocking operations  on  one
of  its  descriptors  sets the descriptor in non-blocking mode as
described in section 1.5.4.  Thereafter the _r_e_a_d call will return
a specific EWOULDBLOCK error indication if there is no data to be
_r_e_a_d.  The process may _s_e_l_e_c_t the associated descriptor to deter-
mine when a read is possible.










4.4BSD Architecture Manual                               PSD:5-33


     Output  attempted  when a descriptor can accept less than is
requested will either accept some of the provided data, returning
a  shorter than normal length, or return an error indicating that
the operation would block.  More output can be performed as  soon
as a _s_e_l_e_c_t call indicates the object is writable.

     Operations  other than data input or output may be performed
on a descriptor in a non-blocking fashion.  These operations will
return  with  a  characteristic error indicating that they are in
progress if they cannot complete immediately.  The descriptor may
then  be  _s_e_l_e_c_t'ed  for _w_r_i_t_e to find out when the operation has
been  completed.   When  _s_e_l_e_c_t  indicates  the   descriptor   is
writable,  the  operation has completed.  Depending on the nature
of the descriptor and the operation, additional activity  may  be
started or the new state may be tested.
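
The non-blocking read and select behavior described in this section
can be sketched as follows.  Both helper names are hypothetical.
The sketch assumes a system that supplies <sys/select.h>, and it
accepts either EWOULDBLOCK or EAGAIN from read, since many systems
define the two as the same value.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* set_nonblocking: hypothetical helper; adds O_NONBLOCK using fcntl(). */
int
set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return (-1);
	return (fcntl(fd, F_SETFL, flags | O_NONBLOCK));
}

/*
 * wait_readable: hypothetical helper; waits up to `seconds' for fd
 * to become readable.  Returns 1 if readable, 0 on timeout, -1 on error.
 */
int
wait_readable(int fd, int seconds)
{
	fd_set rfds;
	struct timeval tv;

	assert(fd >= 0 && fd < FD_SETSIZE);
	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	return (select(fd + 1, &rfds, NULL, NULL, &tv));
}
```

A typical caller attempts the read first and calls wait_readable
only after the read has failed with the would-block indication.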

22..22..  FFiilleessyysstteemm


22..22..11..  OOvveerrvviieeww


     The filesystem abstraction provides access to a hierarchical
filesystem structure.  The filesystem contains directories  (each
of which may contain sub-directories) as well as files and refer-
ences to other objects such as devices and inter-process communi-
cations sockets.

     Each  file  is  organized  as  a  linear array of bytes.  No
record boundaries or system related information is present  in  a
file.   Files may be read and written in a random-access fashion.
If permitted by the underlying storage mechanism,  the  user  may
read  the  data in a directory as though it were an ordinary file
to determine the names of the contained files, but only the  sys-
tem may write into the directories.

22..22..22..  NNaammiinngg


     The filesystem calls take _p_a_t_h _n_a_m_e arguments.  These consist
of zero or more component _f_i_l_e _n_a_m_e_s separated by ``/''
characters, where each file name is up to NAME_MAX (255) characters
excluding null and ``/''.  Each pathname is up to PATH_MAX
(1024) characters excluding null.

     Each  process  always  has  two naming contexts: one for the
root directory of the filesystem and one for the current  working
directory.  These are used by the system in the filename transla-
tion process.  If a path name begins with a ``/'', it is called a
full  path  name  and  interpreted relative to the root directory
context.  If the path name does not begin  with  a  ``/''  it  is
called  a relative path name and interpreted relative to the cur-
rent directory context.

     The file name ``.'' in each directory refers to that  direc-
tory.   The file name ``..'' in each directory refers to the par-
ent directory of that directory.  The  parent  directory  of  the
root of the filesystem is itself.

The calls:

     chdir(path);
     char *path;


     fchdir(fd);
     int fd;


     chroot(path);
     char *path;

change the current working directory or root directory context of
a process.  Only the super-user can  change  the  root  directory
context of a process.

Information  about  a  filesystem that contains a particular file
can be obtained using the calls:

     statfs(path, buf);
     char *path; struct statfs *buf;


     fstatfs(fd, buf);
     int fd; struct statfs *buf;


22..22..33..  CCrreeaattiioonn aanndd rreemmoovvaall


     The filesystem allows directories, files,  special  devices,
and fifos to be created and removed from the filesystem.

22..22..33..11..  DDiirreeccttoorryy ccrreeaattiioonn aanndd rreemmoovvaall


A directory is created with the _m_k_d_i_r system call:

     mkdir(path, mode);
     char *path; mode_t mode;

where  the  mode  is  defined as for files (see section 2.2.3.2).
Directories are removed with the _r_m_d_i_r system call:

     rmdir(path);
     char *path;

A directory must be empty  (other  than  the  entries  ``.''  and
``..'')  if it is to be deleted.

Although directories can be read as files, the usual interface is
to use the call:

     getdirentries(fd, buf, nbytes, basep);
     int fd; char *buf; int nbytes; long *basep;

The _g_e_t_d_i_r_e_n_t_r_i_e_s system call returns a canonical array of direc-
tory  entries  in  the filesystem independent format described in
_<_d_i_r_e_n_t_._h_>.  Application programs usually use  the  library  rou-
tines  _o_p_e_n_d_i_r, _r_e_a_d_d_i_r, and _c_l_o_s_e_d_i_r which provide a more conve-
nient interface than _g_e_t_d_i_r_e_n_t_r_i_e_s.  The _f_t_s package is  provided
for recursive directory traversal.
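
A minimal sketch of the library interface follows.  It counts the
entries of a directory with opendir, readdir, and closedir; the
helper name count_entries is hypothetical.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/*
 * count_entries: hypothetical helper; returns the number of entries in
 * the directory `path', not counting "." and "..", or -1 on error.
 */
int
count_entries(const char *path)
{
	DIR *dirp;
	struct dirent *dp;
	int n = 0;

	assert(path != NULL);
	dirp = opendir(path);
	if (dirp == NULL)
		return (-1);
	while ((dp = readdir(dirp)) != NULL) {
		/* Every directory holds these two names; skip them. */
		if (strcmp(dp->d_name, ".") == 0 ||
		    strcmp(dp->d_name, "..") == 0)
			continue;
		n++;
	}
	closedir(dirp);
	return (n);
}
```

A directory for which this sketch returns 0 satisfies the emptiness
requirement of rmdir.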

22..22..33..22..  FFiillee ccrreeaattiioonn


Files are opened and/or created with the _o_p_e_n system call:

     fd = open(path, oflag, mode);
     result int fd; char *path; int oflag; mode_t mode;

The  _p_a_t_h  parameter specifies the name of the file to be opened.
The _o_f_l_a_g parameter must include O_CREAT to cause the file to  be
created.  Bits for _o_f_l_a_g are defined in _<_f_c_n_t_l_._h_>:


     O_RDONLY     /* open for reading only */
     O_WRONLY     /* open for writing only */
     O_RDWR       /* open for reading and writing */
     O_NONBLOCK   /* no delay */
     O_APPEND     /* set append mode */
     O_SHLOCK     /* open with shared file lock */
     O_EXLOCK     /* open with exclusive file lock */
     O_ASYNC      /* signal pgrp when data ready */
     O_FSYNC      /* synchronous writes */
     O_CREAT      /* create if nonexistent */
     O_TRUNC      /* truncate to zero length */
     O_EXCL       /* error if already exists */



     One  of  O_RDONLY,  O_WRONLY and O_RDWR should be specified,
indicating what types of operations are desired to be done on the
open  file.   The  operations  will be checked against the user's
access rights to the file before allowing the  _o_p_e_n  to  succeed.
Specifying O_APPEND causes all writes to be appended to the file.
Specifying O_TRUNC causes the file to be truncated  when  opened.
The  flag  O_CREAT  causes  the file to be created if it does not
exist, owned by the current user and the group of the  containing
directory.   The  permissions  for  the new file are specified in
_m_o_d_e as the OR of  the  appropriate  permissions  as  defined  in
_<_s_y_s_/_s_t_a_t_._h_>:









PSD:5-36                               4.4BSD Architecture Manual


     S_IRWXU                      /* RWX for owner */
     S_IRUSR                      /* R for owner */
     S_IWUSR                      /* W for owner */
     S_IXUSR                      /* X for owner */
     S_IRWXG                      /* RWX for group */
     S_IRGRP                      /* R for group */
     S_IWGRP                      /* W for group */
     S_IXGRP                      /* X for group */
     S_IRWXO                      /* RWX for other */
     S_IROTH                      /* R for other */
     S_IWOTH                      /* W for other */
     S_IXOTH                      /* X for other */
     S_ISUID                      /* set user id */
     S_ISGID                      /* set group id */
     S_ISTXT                      /* sticky bit */



Historically,  the  file mode has been used as a four digit octal
number.  The bottom three digits encode read access as  4,  write
access  as  2  and execute access as 1, or'ed together.  The 0700
bits describe owner access, the  070  bits  describe  the  access
rights  for  processes  in the same group as the file, and the 07
bits describe the access rights for other  processes.   The  7000
bits  encode  set  user ID as 4000, set group ID as 2000, and the
sticky bit as 1000.  The mode specified to _o_p_e_n  is  modified  by
the process _u_m_a_s_k; permissions specified in the _u_m_a_s_k are cleared
in the mode of the created file.  The _u_m_a_s_k can be  changed  with
the call:

     oldmask = umask(newmask);
     result mode_t oldmask; mode_t newmask;


     If the O_EXCL flag is set, and the file already exists, then
the _o_p_e_n will fail without affecting the file in any  way.   This
mechanism provides a simple exclusive access facility.  For secu-
rity reasons, if the O_EXCL flag is set and the file  is  a  sym-
bolic link, the open will fail regardless of the existence of the
file referenced by the link.  The  O_SHLOCK  and  O_EXLOCK  flags
allow the file to be atomically _o_p_e_n'ed and _f_l_o_c_k'ed; see section
2.2.7 for the semantics of _f_l_o_c_k style locks.  The  O_ASYNC  flag
enables  the  SIGIO signal to be sent to the process group of the
opening process when I/O is possible, e.g., upon availability  of
data to be read.
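
The interaction of O_CREAT, O_EXCL, and the umask can be sketched
as follows.  The helper name create_exclusive is hypothetical.  With
a umask of 022, the requested mode 0666 yields a file of mode 0644,
because the group and other write bits are cleared.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * create_exclusive: hypothetical helper; creates `path' for writing and
 * fails with EEXIST if it already exists.  The requested mode is
 * rw-rw-rw- (0666); bits set in the process umask are cleared by the
 * system when the file is created.
 */
int
create_exclusive(const char *path)
{
	mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
	    S_IROTH | S_IWOTH;

	assert(path != NULL);
	return (open(path, O_WRONLY | O_CREAT | O_EXCL, mode));
}
```

A second call with the same path returns -1 and leaves the existing
file untouched, which is what makes the call usable as a simple lock.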

22..22..33..33..  CCrreeaattiinngg rreeffeerreenncceess ttoo ddeevviicceess


     The  filesystem  allows  entries  which reference peripheral
devices.  Peripherals are distinguished  as  _b_l_o_c_k  or  _c_h_a_r_a_c_t_e_r
devices according to their ability to support block-oriented
operations.   Devices  are  identified  by  their  ``major''  and
``minor'' device numbers.  The major device number determines the
kind of peripheral it is, while the minor device number indicates
either  one of possibly many peripherals of that kind, or special
characteristics of the peripheral.  Structured devices  have  all
operations done internally in ``block'' quantities while unstruc-
tured devices may have input and output done  in  varying  units,
and  may act as a non-seekable communications channel rather than
a random-access device.  The _m_k_n_o_d call creates special entries:

     mknod(path, mode, dev);
     char *path; mode_t mode; dev_t dev;

where _m_o_d_e is formed from the object type and access permissions.
The  parameter _d_e_v is a configuration dependent parameter used to
identify specific character or block I/O devices.

Fifos can be created in the filesystem using the call:

     mkfifo(path, mode);
     char *path; mode_t mode;

The _m_o_d_e parameter is used solely to specify the  access  permis-
sions of the newly created fifo.

22..22..33..44..  LLiinnkkss aanndd rreennaammiinngg


     Links allow multiple names for a file to exist.  Links exist
independently of the file to which they are linked.

     Two types of links exist, _h_a_r_d links and _s_y_m_b_o_l_i_c links.   A
hard link is a reference counting mechanism that allows a file to
have multiple names within the same filesystem.  Each link  to  a
file  is  equivalent,  referring to the file independently of any
other name.  Symbolic links cause string substitution during  the
pathname  interpretation process, and refer to a file name rather
than referring directly to a file.

     Hard links and symbolic links have different properties.   A
hard link ensures that the target file will always be accessible,
even after its original directory entry is removed; no such guar-
antee  exists  for  a symbolic link.  Unlike hard links, symbolic
links can reference directories and span filesystem boundaries.
An  _l_s_t_a_t (see section 2.2.4) call on a hard link will return the
information about the file (or directory) that  the  link  refer-
ences while an _l_s_t_a_t call on a symbolic link will return informa-
tion about the link itself.  A symbolic link  does  not  have  an
owner,  group,  permissions,  access and modification times, etc.
The only attributes returned from an _l_s_t_a_t that refer to the sym-
bolic  link itself are the file type (S_IFLNK), size, blocks, and
link count (always 1).  The other attributes are filled  in  from
the directory that contains the link.

The following calls create a new link, named _p_a_t_h_2, to _p_a_t_h_1:

     link(path1, path2);
     char *path1, *path2;


     symlink(path1, path2);
     char *path1, *path2;

The _u_n_l_i_n_k primitive may be used to remove either type of link.

If  a  file  is a symbolic link, the ``value'' of the link may be
read with the _r_e_a_d_l_i_n_k call:

     len = readlink(path, buf, bufsize);
     result int len; char *path; result char *buf; int bufsize;

This call returns, in _b_u_f, the string substituted into  pathnames
passing through _p_a_t_h.  (This string is not null terminated.)

Atomic  renaming  of filesystem resident objects is possible with
the _r_e_n_a_m_e call:

     rename(oldname, newname);
     char *oldname, *newname;

where both _o_l_d_n_a_m_e and _n_e_w_n_a_m_e must be in  the  same  filesystem.
If  either _o_l_d_n_a_m_e or _n_e_w_n_a_m_e is a directory, then the other also
must be a directory for the _r_e_n_a_m_e to succeed.  If _n_e_w_n_a_m_e exists
and is a directory, then it must be empty.

22..22..33..55..  FFiillee,, ddeevviiccee,, aanndd ffiiffoo rreemmoovvaall


A reference to a file, special device or fifo may be removed with
the _u_n_l_i_n_k call:

     unlink(path);
     char *path;

The caller must have write access to the directory in  which  the
file  is  located  for this call to be successful.  When the last
name for a file has been removed,  the  file  may  no  longer  be
opened;  the  file itself is removed once any existing references
have been closed.
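
One common use of this behavior is a scratch file that disappears
without explicit cleanup.  The sketch below creates a file and
removes its name at once while keeping the descriptor open; the
helper name open_scratch is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * open_scratch: hypothetical helper; creates `path', removes its name
 * at once, and returns a descriptor for the now-nameless file.  The
 * file itself is removed by the system once the descriptor is closed.
 */
int
open_scratch(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
	if (fd < 0)
		return (-1);
	if (unlink(path) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);	/* still usable for read, write, and lseek */
}
```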

All current access to a file can be revoked using the call:

     revoke(path);
     char *path;

Subsequent operations on any descriptors open at the time of  the
_r_e_v_o_k_e  fail, with the exceptions that a _c_l_o_s_e call will succeed,
and a _r_e_a_d from a character device file which  has  been  revoked
returns  a count of zero (end of file).  If the file is a special
file for a device which is open, the  device  close  function  is
called  as  if  all  open references to the file had been closed.
_O_p_e_n's done after the _r_e_v_o_k_e may succeed.  This call is most use-
ful  for  revoking  access  to  a terminal line after a hangup in
preparation for reuse by a new login session.  Access to  a  con-
trolling  terminal  is  automatically  revoked  when  the session
leader for the session exits.

22..22..44..  RReeaaddiinngg aanndd mmooddiiffyyiinngg ffiillee aattttrriibbuutteess


Detailed information about  the  attributes  of  a  file  may  be
obtained with the calls:

     stat(path, stb);
     char *path; result struct stat *stb;


     fstat(fd, stb);
     int fd; result struct stat *stb;

The _s_t_a_t structure includes the file type, protection, ownership,
access times, size, and a count of hard links.  If the file is  a
symbolic  link,  then  the status of the link itself (rather than
the file the link references) may be  obtained  using  the  _l_s_t_a_t
call:

     lstat(path, stb);
     char *path; result struct stat *stb;
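
The difference between stat and lstat can be sketched with a small
test for symbolic links.  The helper name is_symlink is hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * is_symlink: hypothetical helper; returns 1 if `path' itself is a
 * symbolic link, 0 if it is some other kind of object, -1 on error.
 * lstat() is used because stat() would follow the link.
 */
int
is_symlink(const char *path)
{
	struct stat stb;

	assert(path != NULL);
	if (lstat(path, &stb) < 0)
		return (-1);
	return (S_ISLNK(stb.st_mode) ? 1 : 0);
}
```

Calling stat on the same path reports the type of the object the
link refers to, so it never reports a symbolic link.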


     Newly  created files are assigned the user ID of the process
that created them and the group ID of the directory in which they
were  created.   The ownership of a file may be changed by either
of the calls:

     chown(path, owner, group);
     char *path; uid_t owner; gid_t group;


     fchown(fd, owner, group);
     int fd; uid_t owner; gid_t group;


     In addition to ownership, each  file  has  three  levels  of
access  protection  associated  with  it.  These levels are owner
relative, group relative, and other.  Each level  of  access  has
separate  indicators  for  read permission, write permission, and
execute permission.  The protection bits associated with  a  file
may be set by either of the calls:

     chmod(path, mode);
     char *path; mode_t mode;


     fchmod(fd, mode);
     int fd; mode_t mode;

where  _m_o_d_e is a value indicating the new protection of the file,
as listed in section 2.2.3.2.

     Each file has a set of flags stored as a bit mask associated
with  it.  These flags are returned in the _s_t_a_t structure and are
set using the calls:

     chflags(path, flags);
     char *path; u_long flags;


     fchflags(fd, flags);
     int fd; u_long flags;

The flags specified are formed by or'ing the following values:


     UF_NODUMP      Do not dump the file.
     UF_IMMUTABLE   The file may not be changed.
     UF_APPEND      The file may only be appended to.
     SF_IMMUTABLE   The file may not be changed.
     SF_APPEND      The file may only be appended to.


The UF_NODUMP, UF_IMMUTABLE and UF_APPEND flags  may  be  set  or
unset  by  either  the  owner  of  a file or the super-user.  The
SF_IMMUTABLE and SF_APPEND flags may only be set or unset by  the
super-user.   They  may be set at any time, but normally may only
be unset when the system is in single-user mode.

Finally, the access and modify times on a file may be set by  the
call:

     utimes(path, tvp);
     char *path; struct timeval tvp[2];

This  is  particularly useful when moving files between media, to
preserve file access and modification times.

22..22..55..  CChheecckkiinngg aacccceessssiibbiilliittyy


     A process running with different real and effective user-ids
may  interrogate  the accessibility of a file to the real user by
using the _a_c_c_e_s_s call:

     accessible = access(path, how);
     result int accessible; char *path; int how;

_H_o_w is constructed by  OR'ing  the  following  bits,  defined  in
_<_u_n_i_s_t_d_._h_>:









4.4BSD Architecture Manual                               PSD:5-41


     F_OK   /* file exists */
     X_OK   /* file is executable/searchable */
     W_OK   /* file is writable */
     R_OK   /* file is readable */


The  presence  or  absence  of advisory locks does not affect the
result of _a_c_c_e_s_s.

     The _p_a_t_h_c_o_n_f and _f_p_a_t_h_c_o_n_f functions provide  a  method  for
applications  to  determine  the  current value of a configurable
system limit or option variable associated  with  a  pathname  or
file descriptor:

     ans = pathconf(path, name);
     result long ans; char *path; int name;


     ans = fpathconf(fd, name);
     result long ans; int fd, name;

For  _p_a_t_h_c_o_n_f,  the _p_a_t_h argument is the name of a file or direc-
tory.  For _f_p_a_t_h_c_o_n_f, the _f_d argument is an open file descriptor.
The  _n_a_m_e  argument  specifies the system variable to be queried.
Symbolic constants for each name value are found in  the  include
file _<_u_n_i_s_t_d_._h_>.

22..22..66..  EExxtteennssiioonn aanndd ttrruunnccaattiioonn


     Files  are created with zero length and may be extended sim-
ply by writing or appending to them.  While a file  is  open  the
system  maintains  a pointer into the file indicating the current
location in  the  file  associated  with  the  descriptor.   This
pointer  may  be moved about in the file in a random access fash-
ion.  To set the current offset into a file, the _l_s_e_e_k  call  may
be used:

     newoffset = lseek(fd, offset, type);
     result off_t newoffset; int fd; off_t offset; int type;

where _t_y_p_e is defined by _<_u_n_i_s_t_d_._h_> as one of:


     SEEK_SET   /* set file offset to offset */
     SEEK_CUR   /* set file offset to current plus offset */
     SEEK_END   /* set file offset to EOF plus offset */


The  call  ``lseek(fd,  0, SEEK_CUR)'' returns the current offset
into the file.

     Files may have ``holes'' in them.  Holes are  areas  in  the
linear  extent  of  the  file  where data has never been written.
These may be created by seeking to a location in a file past  the
current end-of-file and writing.  Holes are treated by the system
as zero valued bytes.
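
The creation of a hole can be sketched as follows.  The helper name
make_hole is hypothetical.  Writing one byte at offset 4096 in an
empty file gives a file of 4097 bytes in which the first 4096 bytes
read as zero.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * make_hole: hypothetical helper; writes one byte at offset `where' in
 * the file open on fd.  If `where' is past the current end-of-file,
 * the bytes in between form a hole.  Returns 0 on success, -1 on error.
 */
int
make_hole(int fd, off_t where)
{
	assert(where >= 0);
	if (lseek(fd, where, SEEK_SET) == (off_t)-1)
		return (-1);
	return (write(fd, "x", 1) == 1 ? 0 : -1);
}
```

Whether the hole occupies disk blocks depends on the underlying
filesystem; the sketch demonstrates only the visible contents.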

A file may be extended or truncated with either of the calls:

     truncate(path, length);
     char *path; off_t length;


     ftruncate(fd, length);
     int fd; off_t length;

changing the size of the specified file to _l_e_n_g_t_h bytes.

     Unless opened with the O_FSYNC flag,  writes  to  files  are
held  for  an  indeterminate  period of time in the system buffer
cache.  The call:

     fsync(fd);
     int fd;

ensures that the contents of a file are committed to disk  before
returning.   This feature is used by applications such as editors
that want to ensure the integrity of a new file  before  continu-
ing.

22..22..77..  LLoocckkiinngg


     The  filesystem provides basic facilities that allow cooper-
ating processes to synchronize their access to shared  files.   A
process  may  place  an advisory _r_e_a_d or _w_r_i_t_e lock on a file, so
that other cooperating processes may avoid interfering  with  the
process'  access.   This  simple  mechanism provides locking with
file granularity.  Byte range locking is  available  with  _f_c_n_t_l;
see  section  1.5.4.  The system does not force processes to obey
the locks; they are of an advisory nature only.

Locking can be done  as  part  of  the  _o_p_e_n  call  (see  section
2.2.3.2) or after an _o_p_e_n call by applying the _f_l_o_c_k primitive:

     flock(fd, how);
     int fd, how;

where the _h_o_w parameter is formed from bits defined in _<_f_c_n_t_l_._h_>:


     LOCK_SH   /* shared file lock */
     LOCK_EX   /* exclusive file lock */
     LOCK_NB   /* don't block when locking */
     LOCK_UN   /* unlock file */

Successive lock calls may be used to  increase  or  decrease  the
level  of  locking.   If an object is currently locked by another
process when a _f_l_o_c_k call is made, the  caller  will  be  blocked
until  the  current  lock  owner  releases  the lock; this may be
avoided by including LOCK_NB in the  _h_o_w  parameter.   Specifying
LOCK_UN  removes all locks associated with the descriptor.  Advi-
sory locks held by a process are automatically deleted  when  the
process terminates.
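
A non-blocking lock attempt can be sketched as follows.  The helper
name try_lock is hypothetical.  The sketch includes <sys/file.h> in
addition to <fcntl.h>, on the assumption that some current systems
declare flock and the LOCK_ constants there.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * try_lock: hypothetical helper; attempts to take an exclusive advisory
 * lock on fd without blocking.  Returns 1 if the lock was obtained,
 * 0 if a conflicting lock is held through another open of the file,
 * -1 on any other error.
 */
int
try_lock(int fd)
{
	assert(fd >= 0);
	if (flock(fd, LOCK_EX | LOCK_NB) == 0)
		return (1);
	return (errno == EWOULDBLOCK ? 0 : -1);
}
```

Since the locks are advisory, a process that never calls flock can
still read and write the file while the lock is held.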

22..22..88..  DDiisskk qquuoottaass


     As  an  optional  facility, each local filesystem can impose
limits on a user's or group's disk  usage.   Two  quantities  are
limited: the total amount of disk space which a user or group may
allocate in a filesystem and the total number of files a user  or
group  may  create in a filesystem.  Quotas are expressed as _h_a_r_d
limits and _s_o_f_t limits.  A hard limit is  always  imposed;  if  a
user  or  group  would  exceed  a hard limit, the operation which
caused the resource request will fail.  A soft limit  results  in
the  user  or group receiving a warning message, but with alloca-
tion succeeding.  Facilities are provided  to  turn  soft  limits
into hard limits if a user or group has exceeded a soft limit for
an unreasonable period of time.

The _q_u_o_t_a_c_t_l call enables, disables  and  manipulates  filesystem
quotas:

     quotactl(path, cmd, id, addr);
     char *path; int cmd; int id; char *addr;

A  quota control command given by cmd operates on the given file-
name path for the given user ID. The address of an optional  com-
mand  specific data structure, addr, may be given.  The supported
commands include:


     Q_QUOTAON    /* enable quotas */
     Q_QUOTAOFF   /* disable quotas */
     Q_GETQUOTA   /* get limits and usage */
     Q_SETQUOTA   /* set limits and usage */
     Q_SETUSE     /* set usage */
     Q_SYNC       /* sync disk copy of a filesystem's quotas */



22..22..99..  RReemmoottee ffiilleessyysstteemmss


There are two system calls intended to help  support  the  remote
filesystem implementation.  The call:

     nfssvc(flags, argstructp);
     int flags; void *argstructp;

is  used  by  the NFS daemons to pass information into and out of
the kernel and also to enter the kernel as a server daemon.   The
flags  argument consists of several bits that show what action is
to be taken once in the kernel and _a_r_g_s_t_r_u_c_t_p points  to  one  of
three structures depending on which bits are set in flags.

The call:

     getfh(path, fhp);
     char *path; result fhandle_t *fhp;

returns  a file handle for the specified file or directory in the
file handle pointed to by fhp.  This file handle can then be used
in  future  calls  to  NFS to access the file without the need to
repeat the pathname translation.  This system call is  restricted
to the superuser.

22..22..1100..  OOtthheerr ffiilleessyysstteemmss


The kernel supports many other filesystems.  These include:

+o    The log-structured filesystem.  It provides an alternative
     disk layout to that of the fast filesystem, optimized for
     writing rather than reading.  For further information see the
     mount_lfs(8) manual page.

+o    The ISO-standard 9660 filesystem with Rock Ridge  extensions
     used   for   CD-ROMs.    For  further  information  see  the
     mount_cd9660(8) manual page.

+o    The file descriptor mapping filesystem.  For further  infor-
     mation see the mount_fdesc(8) manual page.

+o    The  /proc  filesystem as an alternative for debuggers.  For
     further   information   see   section    2.5.1    and    the
     mount_procfs(8) manual page.

+o    The  memory-based  filesystem,  used  primarily for fast but
     ephemeral uses such as /tmp.  For further information see the
     mount_mfs(8) manual page.

+o    The  kernel  variable  filesystem, used as an alternative to
     _s_y_s_c_t_l.  For further information see section 1.7.1  and  the
     mount_kernfs(8) manual page.

+o    The  portal  filesystem,  used  to  mount  processes  in the
     filesystem.  For further information see the mount_portal(8)
     manual page.

+o    The  uid/gid remapping filesystem, usually layered above NFS
     filesystems exported to an  outside  administrative  domain.
     For further information see the mount_umap(8) manual page.

+o    The  union  filesystem,  used to place a writable filesystem
     above a read-only filesystem.  This filesystem is useful for
     compiling sources on a CD-ROM without having to copy the CD-
     ROM contents to writable disk.  For further information  see
     the mount_union(8) manual page.

22..33..  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss


22..33..11..  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonn pprriimmiittiivveess


22..33..11..11..  CCoommmmuunniiccaattiioonn ddoommaaiinnss


     The  system provides access to an extensible set of communi-
cation _d_o_m_a_i_n_s.  A communication domain (or protocol  family)  is
identified   by   a   manifest   constant  defined  in  the  file
_<_s_y_s_/_s_o_c_k_e_t_._h_>.  Important standard domains supported by the sys-
tem  are  the  local  (``UNIX'') domain (PF_LOCAL or PF_UNIX) for
communication  within  the  system,   the   ``Internet''   domain
(PF_INET) for communication in the DARPA Internet, the ISO family
of protocols (PF_ISO and PF_CCITT) for providing a check-off  box
on  the  list  of your system capabilities, and the ``NS'' domain
(PF_NS) for communication using the Xerox Network Systems  proto-
cols.  Other domains can be added to the system.

22..33..11..22..  SSoocckkeett ttyyppeess aanndd pprroottooccoollss


     Within  a domain, communication takes place between communi-
cation endpoints known as _s_o_c_k_e_t_s.  Each socket has the potential
to exchange information with other sockets of an appropriate type
within the domain.

     Each socket has an associated abstract type, which describes
the  semantics  of  communication  using that socket.  Properties
such as reliability, ordering, and prevention of  duplication  of
messages  are  determined  by  the type.  The basic set of socket
types is defined in _<_s_y_s_/_s_o_c_k_e_t_._h_>:

     Standard socket types
     --------------------------------------------------
     SOCK_DGRAM       /* datagram */
     SOCK_STREAM      /* virtual circuit */
     SOCK_RAW         /* raw socket */
     SOCK_RDM         /* reliably-delivered message */
     SOCK_SEQPACKET   /* sequenced packets */


The SOCK_DGRAM type models the semantics of datagrams in network
communication: messages may be lost or duplicated and may arrive
out of order.  A datagram socket may send messages to and receive
messages from multiple peers.  The SOCK_RDM type models the
semantics of reliable datagrams: messages arrive unduplicated and
in order, and the sender is notified if messages are lost.  The
send and receive operations (described below) generate reliable
or unreliable datagrams.  The SOCK_STREAM type models
connection-based virtual circuits: two-way byte streams with no
record boundaries.  Connection setup is required before data
communication may begin.  The SOCK_SEQPACKET type models a
connection-based, full-duplex, reliable exchange that preserves
message boundaries; the sender is notified if messages are lost,
and messages are never duplicated or presented out of order.
Users of the last two abstractions may use the facilities for
out-of-band transmission to send out-of-band data.

     SOCK_RAW  is used for unprocessed access to internal network
layers and interfaces;  it  has  no  specific  semantics.   Other
socket types can be defined.

     Each socket may have a specific protocol associated with it.
This protocol is used within the domain to provide the  semantics
required  by the socket type.  Not all socket types are supported
by each domain; support depends on the existence and  the  imple-
mentation of a suitable protocol within the domain.  For example,
within the ``Internet'' domain, the SOCK_DGRAM type may be imple-
mented  by  the  UDP  user datagram protocol, and the SOCK_STREAM
type may be implemented by the TCP transmission control protocol,
while no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET
sockets exist.

22..33..11..33..  SSoocckkeett ccrreeaattiioonn,, nnaammiinngg aanndd sseerrvviiccee eessttaabblliisshhmmeenntt


     Sockets may be _c_o_n_n_e_c_t_e_d  or  _u_n_c_o_n_n_e_c_t_e_d.   An  unconnected
socket descriptor is obtained by the _s_o_c_k_e_t call:

     s = socket(domain, type, protocol);
     result int s; int domain, type, protocol;

The socket domain and type are as described above, and are
specified using the definitions from <sys/socket.h>.  The
protocol may be given as 0, meaning any suitable protocol.  One
of several possible protocols may be selected using identifiers
obtained from a library routine, getprotobyname.
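
     As a minimal sketch of the call just described (not taken
from the original manual), the helper below looks up the protocol
number for ``tcp'' with getprotobyname and passes it to socket.
The name open_tcp_socket is hypothetical, and the fallback to
protocol 0 when the lookup fails is an assumption made for
illustration.

```c
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create an Internet-domain stream socket, naming the protocol
 * explicitly.  If the protocol database has no "tcp" entry, fall
 * back to 0, which asks for any suitable protocol.
 * Returns the new descriptor, or -1 on error.
 */
int open_tcp_socket(void)
{
    struct protoent *pe = getprotobyname("tcp");
    int protocol = (pe != NULL) ? pe->p_proto : 0;

    return socket(PF_INET, SOCK_STREAM, protocol);
}
```

The descriptor returned is unconnected; it must still be bound,
connected, or both before data can flow.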

     An  unconnected  socket  descriptor of a connection-oriented
type may yield a connected socket descriptor in one of two  ways:
either  by  actively connecting to another socket, or by becoming
associated with a name in the communications domain and accepting
a  connection  from  another  socket.   Datagram sockets need not
establish connections before use.

     To accept connections or to receive datagrams, a socket must
first have a binding to a name (or address) within the
communications domain.  Such a binding may be established by a
bind call:

     bind(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Datagram sockets may have default bindings established when first
sending  data if not explicitly bound earlier.  In either case, a
socket's bound name may be retrieved with a getsockname call:

     getsockname(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

while the peer's name can be retrieved with getpeername:

     getpeername(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

Domains may support sockets with several names.
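
     The following sketch, which is illustrative rather than part
of the original text, binds a local-domain datagram socket to a
filesystem name and reads the name back with getsockname.  The
helper name bind_local is hypothetical.  The sketch uses struct
sockaddr_un from <sys/un.h> and the socklen_t length type of later
POSIX systems where this manual shows int.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Bind a new PF_LOCAL datagram socket to `path`, then retrieve the
 * bound name into `out`.  Any existing file at `path` is removed
 * first.  Returns the descriptor, or -1 on error.
 */
int bind_local(const char *path, char *out, size_t outlen)
{
    struct sockaddr_un addr, bound;
    socklen_t len = sizeof(bound);
    int s;

    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;                      /* name too long for the domain */
    if ((s = socket(PF_LOCAL, SOCK_DGRAM, 0)) < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_LOCAL;
    strcpy(addr.sun_path, path);
    unlink(path);                       /* discard a stale name, if any */

    memset(&bound, 0, sizeof(bound));
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(s, (struct sockaddr *)&bound, &len) < 0) {
        close(s);
        return -1;
    }
    snprintf(out, outlen, "%s", bound.sun_path);
    return s;
}
```

The caller is responsible for closing the descriptor and for
unlinking the name when it is no longer needed.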

22..33..11..44..  AAcccceeppttiinngg ccoonnnneeccttiioonnss


Once a binding is made to a  connection-oriented  socket,  it  is
possible to _l_i_s_t_e_n for connections:

     listen(s, backlog);
     int s, backlog;

The  _b_a_c_k_l_o_g  specifies the maximum count of connections that can
be simultaneously queued awaiting acceptance.

An _a_c_c_e_p_t call:

     t = accept(s, name, anamelen);
     result int t; int s; result struct sockaddr *name; result int *anamelen;

returns a descriptor for a new, connected socket from the queue
of pending connections on s.  If no new connections are queued
for acceptance, the call will wait for a connection unless
non-blocking I/O has been enabled (see section 1.5.4).
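
     A minimal sketch of the passive side, under the assumption of
a local-domain stream socket named by a filesystem path: the
socket is bound, marked with listen, and a pending connection is
then taken with accept.  The helper names listen_local and
accept_one are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Create a PF_LOCAL stream socket bound to `path` and willing to
 * queue up to `backlog` connections.  Returns the listening
 * descriptor, or -1 on error.
 */
int listen_local(const char *path, int backlog)
{
    struct sockaddr_un addr;
    int s;

    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;
    if ((s = socket(PF_LOCAL, SOCK_STREAM, 0)) < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_LOCAL;
    strcpy(addr.sun_path, path);
    unlink(path);                       /* discard a stale name, if any */

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, backlog) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/*
 * Take one connection from the queue on `s`.  The peer's name is
 * not wanted here, so null pointers are passed for it.
 */
int accept_one(int s)
{
    return accept(s, NULL, NULL);
}
```

The descriptor returned by accept_one is distinct from the
listening descriptor, which remains available for further
connections.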

22..33..11..55..  MMaakkiinngg ccoonnnneeccttiioonnss


An  active  connection  to  a named socket is made by the _c_o_n_n_e_c_t
call:

     connect(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Although datagram sockets do not establish connections, the
connect call may be used with such sockets to create an
association with the foreign address.  The address is recorded
for use in future send calls, which then need not supply
destination addresses.  Datagrams will be received only from
that peer, and asynchronous error reports may be received.

     It  is  also  possible  to create connected pairs of sockets
without using the domain's name space to rendezvous; this is done
with the _s_o_c_k_e_t_p_a_i_r call[+]:

     socketpair(domain, type, protocol, sv);
     int domain, type, protocol; result int sv[2];

Here the returned _s_v descriptors  correspond  to  those  obtained
with _a_c_c_e_p_t and _c_o_n_n_e_c_t.

The call:

     pipe(pv);
     result int pv[2];

creates a pair of SOCK_STREAM sockets in the PF_LOCAL domain,
with pv[0] only readable and pv[1] only writable.
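
     A short sketch of pipe in use follows.  It writes into pv[1]
and reads from pv[0], which is the direction the pipe call
provides in practice.  The helper name pipe_roundtrip is
hypothetical.

```c
#include <unistd.h>

/*
 * Push `len` bytes through a freshly created pipe and read them
 * back.  Data is written to pv[1] and read from pv[0].
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t pipe_roundtrip(const void *msg, size_t len, void *buf, size_t buflen)
{
    int pv[2];
    ssize_t n;

    if (pipe(pv) < 0)
        return -1;
    n = write(pv[1], msg, len);
    if (n >= 0)
        n = read(pv[0], buf, buflen);
    close(pv[0]);
    close(pv[1]);
    return n;
}
```

The sketch is meant for short messages only: a write larger than
the pipe's buffering would block, since nothing reads until the
write has returned.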

22..33..11..66..  SSeennddiinngg aanndd rreecceeiivviinngg ddaattaa


Messages may be sent from a socket by:

     cc = sendto(s, msg, len, flags, to, tolen);
     result int cc; int s; void *msg; size_t len;
     int flags; struct sockaddr *to; int tolen;

if the socket is not connected or:

     cc = send(s, msg, len, flags);
     result int cc; int s; void *msg; size_t len; int flags;

if the socket is connected.  The corresponding receive primitives
are:

-----------
[+]  4.4BSD supports socketpair creation only in the
PF_LOCAL communication domain.

     msglen = recvfrom(s, buf, len, flags, from, fromlenaddr);
     result int msglen; int s; result void *buf; size_t len; int flags;
     result struct sockaddr *from; result int *fromlenaddr;

and:

     msglen = recv(s, buf, len, flags);
     result int msglen; int s; result void *buf; size_t len; int flags;


     In the unconnected case, the parameters to and tolen specify
the destination of the message, while the from parameter
receives the source of the message; *fromlenaddr initially gives
the size of the from buffer and is updated to reflect the true
length of the from address.

     All calls cause the message to be received in or sent from
the message buffer of length len bytes, starting at address buf
(msg for the send calls).  The flags specify peeking at a message
without reading it, sending or receiving high-priority
out-of-band messages, or other special requests as follows:


     MSG_OOB         /* process out-of-band data */
     MSG_PEEK        /* peek at incoming message */
     MSG_DONTROUTE   /* send without using routing tables */
     MSG_EOR         /* data completes record */
     MSG_TRUNC       /* data discarded before delivery */
     MSG_CTRUNC      /* control data lost before delivery */
     MSG_WAITALL     /* wait for full request or error */
     MSG_DONTWAIT    /* this message should be nonblocking */
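
     As one example of these flags, the sketch below uses MSG_PEEK
to look at the first byte of pending input without consuming it; a
later recv without the flag still returns that byte.  The helper
name peek_byte is hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Return the first byte of the data waiting on socket `s` without
 * removing it from the receive queue, or -1 if nothing could be
 * read.
 */
int peek_byte(int s)
{
    unsigned char c;
    ssize_t n = recv(s, &c, 1, MSG_PEEK);

    return (n == 1) ? (int)c : -1;
}
```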



22..33..11..77..  SSccaatttteerr//ggaatthheerr aanndd eexxcchhaannggiinngg aacccceessss rriigghhttss


     It  is  possible  to scatter and gather data and to exchange
access rights with messages.  When either of these operations  is
involved,  the  number  of  parameters to the call becomes large.
Thus,  the  system  defines  a  message  header   structure,   in
_<_s_y_s_/_s_o_c_k_e_t_._h_>,  which  can  be  used to conveniently contain the
parameters to the calls:

     struct msghdr {
          caddr_t   msg_name;         /* optional address */
          u_int     msg_namelen;      /* size of address */
          struct    iovec *msg_iov;   /* scatter/gather array */
          u_int     msg_iovlen;       /* # elements in msg_iov */
          caddr_t   msg_control;      /* ancillary data */
          u_int     msg_controllen;   /* ancillary data buffer len */
          int       msg_flags;        /* flags on received message */
     };


Here _m_s_g___n_a_m_e and _m_s_g___n_a_m_e_l_e_n specify the source  or  destination
address  if the socket is unconnected; _m_s_g___n_a_m_e may be given as a
null pointer if no names are desired or  required.   The  _m_s_g___i_o_v
and   _m_s_g___i_o_v_l_e_n   describe   the  scatter/gather  locations,  as
described in section 2.1.1.  The data in the  _m_s_g___c_o_n_t_r_o_l  buffer
is  composed  of  an  array  of variable length messages used for
additional information with or about a datagram  not  expressible
by flags.  The format is a sequence of message elements headed by
_c_m_s_g_h_d_r structures:


     struct cmsghdr {
          u_int    cmsg_len;      /* data byte count, including hdr */
          int      cmsg_level;    /* originating protocol */
          int      cmsg_type;     /* protocol-specific type */
          u_char   cmsg_data[];   /* variable length type specific data */
     };


The following macros are provided for use with the msg_control
buffer:


     CMSG_FIRSTHDR(mhdr)       /* given msghdr, return first cmsghdr */
     CMSG_NXTHDR(mhdr, cmsg)   /* given msghdr and cmsghdr, return next cmsghdr */
     CMSG_DATA(cmsg)           /* given cmsghdr, return associated data pointer */


Access rights to be sent along with the message are specified in
one of these cmsghdr structures, with level SOL_SOCKET and type
SCM_RIGHTS.  In the PF_LOCAL domain these are an array of integer
descriptors, copied from the sending process and duplicated in
the receiver.

The msghdr structure is used in the operations sendmsg and
recvmsg:

     sendmsg(s, msg, flags);
     int s; struct msghdr *msg; int flags;


     msglen = recvmsg(s, msg, flags);
     result int msglen; int s; result struct msghdr *msg; int flags;
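
     The sketch below shows one way access rights might be passed
with these calls: a single descriptor travels in an SCM_RIGHTS
control message alongside one byte of ordinary data.  The helper
names send_fd and recv_fd are hypothetical.  The CMSG_SPACE and
CMSG_LEN macros used to size the control buffer come from later
POSIX systems and are not among the macros listed in this manual.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer large enough for one descriptor, suitably aligned. */
union fd_control {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor `fd` over local-domain socket `s`.  Returns 0 or -1. */
int send_fd(int s, int fd)
{
    char byte = 0;
    struct iovec iov = { &byte, 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(s, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from socket `s`.  Returns it, or -1. */
int recv_fd(int s)
{
    char byte;
    struct iovec iov = { &byte, 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(s, &msg, 0) <= 0)
        return -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    return fd;
}
```

The descriptor returned by recv_fd is a new descriptor in the
receiving process that refers to the same open object as the one
sent.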

22..33..11..88..  UUssiinngg rreeaadd aanndd wwrriittee wwiitthh ssoocckkeettss


     The  normal _r_e_a_d and _w_r_i_t_e calls may be applied to connected
sockets and translated into _s_e_n_d and _r_e_c_e_i_v_e calls from or  to  a
single  area  of  memory  and  discarding any rights received.  A
process may operate on a virtual circuit socket, a terminal or  a
file  with blocking or non-blocking input/output operations with-
out distinguishing the descriptor type.

22..33..11..99..  SShhuuttttiinngg ddoowwnn hhaallvveess ooff ffuullll--dduupplleexx ccoonnnneeccttiioonnss


     A process that has a full-duplex socket such  as  a  virtual
circuit and no longer wishes to read from or write to this socket
can give the call:

     shutdown(s, direction);
     int s, direction;

where _d_i_r_e_c_t_i_o_n is 0 to not read further, 1 to not write further,
or  2  to completely shut the connection down.  If the underlying
protocol supports unidirectional or bidirectional shutdown,  this
indication  will  be passed to the peer.  For example, a shutdown
for writing might produce an end-of-file condition at the  remote
end.
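
     As a sketch of a half-close, the helper below sends a final
message and then shuts down only the writing direction, so the
peer sees end-of-file after the message while replies can still be
read on the same descriptor.  The name send_and_finish is
hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send `len` bytes on connected socket `s`, then indicate that no
 * more data will be written.  The descriptor stays open for
 * reading.  Returns 0 on success, -1 on error.
 */
int send_and_finish(int s, const void *msg, size_t len)
{
    if (send(s, msg, len, 0) != (ssize_t)len)
        return -1;
    return shutdown(s, 1);      /* direction 1: no further writes */
}
```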

22..33..11..1100..  SSoocckkeett aanndd pprroottooccooll ooppttiioonnss


     Sockets,  and  their underlying communication protocols, may
support _o_p_t_i_o_n_s.  These options may be used to manipulate  imple-
mentation-  or  protocol-specific facilities.  The _g_e_t_s_o_c_k_o_p_t and
_s_e_t_s_o_c_k_o_p_t calls are used to control options:

     getsockopt(s, level, optname, optval, optlen);
     int s, level, optname; result void *optval; result int *optlen;


     setsockopt(s, level, optname, optval, optlen);
     int s, level, optname; void *optval; int optlen;

The option _o_p_t_n_a_m_e is interpreted at the indicated protocol _l_e_v_e_l
for socket _s.  If a value is specified with _o_p_t_v_a_l and _o_p_t_l_e_n, it
is interpreted by the software operating at the specified  _l_e_v_e_l.
The  _l_e_v_e_l  SOL_SOCKET is reserved to indicate options maintained
by the socket facilities.  Other _l_e_v_e_l values indicate a particu-
lar  protocol which is to act on the option request; these values
are normally interpreted as a ``protocol number'' within the pro-
tocol family.
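
     Two socket-level options serve as a brief illustration here.
In this sketch SO_TYPE is read with getsockopt to recover a
socket's type, and SO_KEEPALIVE is set with setsockopt.  Both
option names are defined in <sys/socket.h>; the helper names
socket_type and set_keepalive are hypothetical, and the socklen_t
length type of later systems is used where this manual shows int.

```c
#include <sys/socket.h>

/* Return the type of socket `s` (SOCK_STREAM, ...), or -1 on error. */
int socket_type(int s)
{
    int type;
    socklen_t len = sizeof(type);

    if (getsockopt(s, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}

/* Turn the SO_KEEPALIVE option on (nonzero) or off (zero) for `s`. */
int set_keepalive(int s, int on)
{
    return setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```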

22..33..22..  PPFF__LLOOCCAALL ddoommaaiinn


     This   section  describes  briefly  the  properties  of  the
PF_LOCAL (``UNIX'') communications domain.

22..33..22..11..  TTyyppeess ooff ssoocckkeettss


     In the local domain, the  SOCK_STREAM  abstraction  provides
pipe-like  facilities,  while SOCK_DGRAM provides (usually) reli-
able message-style communications.

22..33..22..22..  NNaammiinngg


     Socket names are strings and may appear  in  the  filesystem
name space.

22..33..22..33..  AAcccceessss rriigghhttss ttrraannssmmiissssiioonn


     The ability to pass descriptors with messages in this domain
allows migration of service within the  system  and  allows  user
processes to be used in building system facilities.

22..33..33..  IINNTTEERRNNEETT ddoommaaiinn


     This  section  describes  briefly how the Internet domain is
mapped to the model described in this section.  More  information
will  be found in the document describing the network implementa-
tion in 4.4BSD (SMM:18).

22..33..33..11..  SSoocckkeett ttyyppeess aanndd pprroottooccoollss


     SOCK_STREAM is  supported  by  the  Internet  TCP  protocol;
SOCK_DGRAM  by the UDP protocol.  Each is layered atop the trans-
port-level Internet Protocol (IP).  The Internet Control  Message
Protocol  is  implemented  atop/beside IP and is accessible via a
raw socket.  The SOCK_SEQPACKET has  no  direct  Internet  family
analogue;  a  protocol  based on one from the XEROX NS family and
layered on top of IP could be implemented to fill this gap.

22..33..33..22..  SSoocckkeett nnaammiinngg


     Sockets in the Internet domain  have  names  composed  of  a
32-bit Internet address and a 16-bit port number.  Options may be
used to provide IP  source  routing  or  security  options.   The
32-bit address is composed of network and host parts; the network
part is variable in size and is frequency encoded.  The host part
may  optionally be interpreted as a subnet field plus the host on









4.4BSD Architecture Manual                               PSD:5-53


the subnet; this is enabled by setting a network address mask  at
boot time.
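
     To make the naming concrete, the sketch below fills in a
struct sockaddr_in from a dotted-quad string and a port, and
splits an address into network and host parts with a mask.  It is
an illustration only: the helper names are hypothetical, and
inet_pton is a later POSIX routine assumed here for parsing the
address.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Build an Internet-domain socket name from a dotted-quad address
 * and a port given in host byte order.  Returns 0 on success, -1 if
 * the address string is not valid.
 */
int make_inet_name(struct sockaddr_in *sin, const char *dotted, uint16_t port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);        /* ports travel in network order */
    return inet_pton(AF_INET, dotted, &sin->sin_addr) == 1 ? 0 : -1;
}

/* Network (and subnet) part of a host-order address under `mask`. */
uint32_t net_part(uint32_t addr, uint32_t mask)
{
    return addr & mask;
}

/* Host part of a host-order address under `mask`. */
uint32_t host_part(uint32_t addr, uint32_t mask)
{
    return addr & ~mask;
}
```

For example, with the mask 0xFFFFFF00 the address 10.1.2.3 has
network part 10.1.2.0 and host part 3.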

22..33..33..33..  AAcccceessss rriigghhttss ttrraannssmmiissssiioonn


     No access rights transmission facilities are provided in the
Internet domain.

22..33..33..44..  RRaaww aacccceessss


     The Internet domain allows the super-user access to the  raw
facilities of IP.  These interfaces are modeled as SOCK_RAW sock-
ets.  Each raw socket is associated with one IP protocol  number,
and  receives  all  traffic  received  for  that  protocol.  This
approach allows administrative and debugging functions to  occur,
and  enables user-level implementations of special-purpose proto-
cols such as inter-gateway routing protocols.

22..44..  TTeerrmmiinnaallss aanndd DDeevviicceess


22..44..11..  TTeerrmmiinnaallss


     Terminals support _r_e_a_d and _w_r_i_t_e I/O operations, as well  as
a  collection  of  terminal specific _i_o_c_t_l operations, to control
input character interpretation and editing, and output format and
delays.

     A terminal may be used as a controlling terminal (login ter-
minal) for a login session.  A controlling terminal is associated
with a session (see section 1.1.4).  A controlling terminal has a
foreground process group, which must be a member of  the  session
with  which the terminal is associated (see section 1.1.5).  Mem-
bers of the foreground process group are allowed to read from and
write  to  the  terminal  and change the terminal settings; other
process groups from the session may be stopped upon  attempts  to
do these operations.

     A  session  leader  allocates  a terminal as the controlling
terminal for its session using the ioctl

     ioctl(fd, TIOCSCTTY, NULL);
     int fd;

Only a session leader may acquire a controlling terminal.

22..44..11..11..  TTeerrmmiinnaall iinnppuutt


     Terminals are handled according to the underlying communica-
tion characteristics such as baud rate and required delays, and a









PSD:5-54                               4.4BSD Architecture Manual


set of software parameters.  These parameters  are  described  in
the  _t_e_r_m_i_o_s structure maintained by the kernel for each terminal
line:


     struct termios {
          tcflag_t   c_iflag;      /* input flags */
          tcflag_t   c_oflag;      /* output flags */
          tcflag_t   c_cflag;      /* control flags */
          tcflag_t   c_lflag;      /* local flags */
          cc_t       c_cc[NCCS];   /* control chars */
          long       c_ispeed;     /* input speed */
          long       c_ospeed;     /* output speed */
     };


The _t_e_r_m_i_o_s structure is set and retrieved  using  the  _t_c_s_e_t_a_t_t_r
and _t_c_g_e_t_a_t_t_r functions.

     Two  general kinds of input processing are available, deter-
mined by whether the terminal device file is in canonical mode or
noncanonical  mode.  Additionally, input characters are processed
according to the _c___i_f_l_a_g and _c___l_f_l_a_g fields.  Such processing can
include  echoing, which in general means transmitting input char-
acters immediately back to the terminal when  they  are  received
from  the  terminal.   Non-graphic  ASCII input characters may be
echoed as a two-character  printable  representation,  ``^charac-
ter.''

     In  canonical  mode input processing, terminal input is pro-
cessed in units of lines.  A line is delimited by a newline char-
acter  (NL),  an  end-of-file  (EOF) character, or an end-of-line
(EOL) character.  Input is presented  on  a  line-by-line  basis.
Using  this  mode means that a read request will not return until
an entire line has been typed, or a  signal  has  been  received.
Also, no matter how many bytes are requested in the read call, at
most one line is returned.  It is not, however, necessary to read
a  whole  line  at  once;  any  number of bytes, even one, may be
requested in a read without losing information.

     When the terminal is in canonical mode, editing of an  input
line is performed.  Editing facilities allow deletion of the pre-
vious character or word, or deletion of the current  input  line.
In  addition, a special character may be used to reprint the cur-
rent input line.  Certain other characters are  also  interpreted
specially.  Flow control is provided by the stop output and start
output control characters.  Output may be flushed with the flush
output character; and the literal character may be used to force
the following character into the input line,  regardless  of  any
special meaning it may have.

     In noncanonical mode input processing, input bytes are not
assembled into lines, and erase and kill processing does not
occur.  All input is passed through to the reading process
immediately and without interpretation.  Signals and flow control
may be enabled; here the handler interprets input only by looking
for characters that cause interrupts or output flow control; all
other characters are made available.
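
     A minimal sketch of selecting noncanonical input follows.  It
edits a termios structure in memory: line assembly and echo are
turned off, signal-generating characters stay enabled, and a read
is asked to return as soon as one byte is available.  The helper
name set_noncanonical is hypothetical.  ICANON, ECHO, ISIG, VMIN
and VTIME are standard termios names that this section does not
itself describe.

```c
#include <termios.h>

/*
 * Adjust `t` for noncanonical input with signals still enabled.
 * The caller is expected to obtain `t` with tcgetattr and to
 * install the result with tcsetattr.
 */
void set_noncanonical(struct termios *t)
{
    t->c_lflag &= ~(tcflag_t)(ICANON | ECHO);  /* no line editing, no echo */
    t->c_lflag |= ISIG;         /* interrupt characters still send signals */
    t->c_cc[VMIN] = 1;          /* a read returns once one byte is ready */
    t->c_cc[VTIME] = 0;         /* no inter-byte timer */
}
```

A typical use would be tcgetattr(fd, &t), then
set_noncanonical(&t), then tcsetattr(fd, TCSANOW, &t), with the
original settings saved so that they can be restored afterwards.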

     When  interrupt characters are being interpreted by the ter-
minal handler they cause a software interrupt to be sent  to  all
processes  in  the  process  group  associated with the terminal.
Interrupt characters exist to send SIGINT  and  SIGQUIT  signals,
and  to stop a process group with the SIGTSTP signal either imme-
diately, or when all input up to  the  stop  character  has  been
read.

22..44..11..22..  TTeerrmmiinnaall oouuttppuutt


     On output, the terminal handler provides some simple format-
ting services.  These  include  converting  the  carriage  return
character  to the two character return-linefeed sequence, insert-
ing delays after certain standard control characters, and expand-
ing tabs.

22..44..22..  SSttrruuccttuurreedd ddeevviicceess


     Structured devices are typified by disks and magnetic tapes,
but may represent any random-access device.  The system  performs
read-modify-write  type  buffering  actions  on  block devices to
allow them to be read and written in random access  fashion  like
ordinary  files.   Filesystems  are  normally  mounted  on  block
devices.

22..44..33..  UUnnssttrruuccttuurreedd ddeevviicceess


     Unstructured devices are those devices which do not  support
block  structure.  Familiar unstructured devices are raw communi-
cations lines (with no terminal handler), raster  plotters,  mag-
netic tape and disks unfettered by buffering and permitting large
block input/output and positioning and formatting commands.

22..55..  PPrroocceessss ddeebbuuggggiinngg


22..55..11..  TTrraaddiittiioonnaall ddeebbuuggggiinngg


Debuggers traditionally use the _p_t_r_a_c_e interface:

     ptrace(request, pid, addr, data);
     int request, pid, *addr, data;

This interface provides a means by which a parent process may
control the execution of a child process, and examine and change
its core image.  Its primary use is for the implementation of
breakpoint debugging.  There are four arguments; the
interpretation of the last three depends on the request argument.
A process being traced behaves normally until it encounters a
signal (whether internally generated like ``illegal instruction''
or externally generated like ``interrupt'').  Then the traced
process enters a stopped state and its parent is notified via
wait.  When the child is in the stopped state, its core image can
be examined and modified using ptrace.  Another ptrace request
can then cause the child either to terminate or to continue,
possibly ignoring the signal.

     A more general interface is also provided in 4.4BSD; the
process filesystem, mounted with mount_procfs, attaches an
instance of the process name space to the global filesystem name
space.  The conventional mount point is /proc.  The root of the
process filesystem contains an entry for each active process.
These processes are visible as directories named by the process
ID.  In addition, the special entry curproc references the
current process.  Each directory contains several files,
including a ctl file.  The debugger finds (or creates) the
process that it wants to debug and then issues an attach command
via the ctl file.  Further interaction can then be done with the
process through the other files provided by the /proc filesystem.

22..55..22..  KKeerrnneell ttrraacciinngg


Another facility for debugging programs is provided by the _k_t_r_a_c_e
interface:

     ktrace(tracefile, ops, trpoints, pid);
     char *tracefile; int ops, trpoints, pid;

_K_t_r_a_c_e does kernel trace logging  for  the  specified  processes.
The kernel operations that are traced include system calls, path-
name translations, signal processing, and I/O.  This facility can
be  particularly  useful  to  debug programs for which you do not
have the source.

33..  SSuummmmaarryy ooff ffaacciilliittiieess


1    KKeerrnneell pprriimmiittiivveess
1.1  PPrroocceesssseess aanndd pprrootteeccttiioonn
       sethostid     set host identifier
       gethostid     get host identifier
       sethostname   set host name
       gethostname   get host name
       getpid        get process identifier
       getppid       get parent process identifier
       fork          create a new process
       vfork         create a new process
       exit          terminate a process
       wait4         collect exit status of child
       execve        execute a new program
       getuid        get real user identifier
       geteuid       get effective user identifier
       getgid        get real group identifier
       getegid       get effective group identifier
       getgroups     get access group set
       setuid        set real, effective, and saved user identifiers
       setgid        set real, effective, and saved group identifiers
       setgroups     set access group set
       seteuid       set effective user identifier
       setegid       set effective group identifier
       setsid        create a new session
       setlogin      set login name
       getlogin      get login name
       getpgrp       get process group
       setpgid       set process group
1.2  MMeemmoorryy mmaannaaggeemmeenntt
       brk           set data section size
       sbrk          change data section size
       getpagesize   get system page size
       mmap          map files or devices into memory
       msync         synchronize a mapped region
       munmap        remove a mapping
       mprotect      control the protection of pages
       madvise       give advice about use of memory
       mincore       get advice about use of memory
       mlock         lock physical pages in memory
       munlock       unlock physical pages in memory
       mset          acquire and set a semaphore
       mclear        release a semaphore and awaken waiting processes
       msleep        wait for a semaphore
       mwakeup       awaken process(es) sleeping on a semaphore
1.3  Signals
       sigaction     set up software signal handler
       sigreturn     return from a signal
       kill          send signal to a process
       killpg        send signal to a process group
       sigprocmask   manipulate current signal mask
       sigsuspend    atomically release blocked signals and wait for interrupt
       sigpending    get pending signals
       sigaltstack   set and/or get signal stack context
1.4  Timers
       settimeofday  set date and time
       gettimeofday  get date and time
       adjtime       synchronization of the system clock
       setitimer     set value of interval timer
       getitimer     get value of interval timer
       profil        control process profiling
1.5  DDeessccrriippttoorrss
       getdtablesize get descriptor table size
       dup           duplicate an existing file descriptor
       dup2          duplicate an existing file descriptor
       close         delete a descriptor
       select        synchronous I/O multiplexing
       fcntl         file control
1.6  RReessoouurrccee ccoonnttrroollss
       getpriority   get program scheduling priority
       setpriority   set program scheduling priority
       getrusage     get information about resource utilization
       getrlimit     get maximum system resource consumption
       setrlimit     set maximum system resource consumption
1.7  SSyysstteemm ooppeerraattiioonn ssuuppppoorrtt
       sysctl        get or set system information
       mount         mount a filesystem
       getfsstat     get list of all mounted filesystems
       swapon        add  a  swap  device  for  interleaved  pag-
ing/swapping
       unmount       dismount a filesystem
       sync          force  completion  of  pending  disk  writes
(flush cache)
       reboot        reboot system or halt processor
       acct          enable or disable process accounting
2    SSyysstteemm ffaacciilliittiieess
2.1  GGeenneerriicc ooppeerraattiioonnss
       read          read input
       write         write output
       readv         read scattered input
       writev        write gathered output
       ioctl         control device
2.2  Filesystem
       chdir         change current working directory
       fchdir        change current working directory
       chroot        change root directory
       statfs        get file system statistics
       fstatfs       get file system statistics
       mkdir         make a directory file
       rmdir         remove a directory file
       getdirentries get directory entries in a filesystem independent format
       open          open or create a file for reading or writing
       umask         set file creation mode mask
       mknod         make a special file node
       mkfifo        make a fifo file
       link          make a hard file link
       symlink       make a symbolic link to a file
       readlink      read value of a symbolic link
       rename        change the name of a file
       unlink        remove directory entry
       revoke        revoke file access
       stat          get file status
       fstat         get file status
       lstat         get file status
       chown         change owner and group of a file
       fchown        change owner and group of a file
       chmod         change mode of file
       fchmod        change mode of file
       chflags       set file flags
       fchflags      set file flags
       utimes        set file access and modification times
       access        check access permissions of a file or pathname
       pathconf      get configurable pathname variables
       fpathconf     get configurable pathname variables
       lseek         reposition read/write file offset
       truncate      truncate a file to a specified length
       ftruncate     truncate a file to a specified length
       fsync         synchronize in-core state of a file with that on disk
       flock         apply or remove an advisory lock on an open file
       quotactl      manipulate filesystem quotas
       nfssvc        NFS services
       getfh         get file handle
2.3  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss
       socket        create an endpoint for communication
       bind          bind a name to a socket
       getsockname   get socket name
       getpeername   get name of connected peer
       listen        listen for connections on a socket
       accept        accept a connection on a socket
       connect       initiate a connection on a socket
       socketpair    create a pair of connected sockets
       pipe          create descriptor pair for interprocess communication
       sendto        send a message from a socket
       send          send a message from a socket
       recvfrom      receive a message from a socket
       recv          receive a message from a socket
       sendmsg       send a message from a socket
       recvmsg       receive a message from a socket
       shutdown      shut down part of a full-duplex connection
       getsockopt    get options on socket
       setsockopt    set options on socket
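
Several of the calls listed above combine naturally. The sketch below creates a connected pair with socketpair, sends a message with send, closes the sending direction with shutdown, and collects the data with recv. The helper name pair_echo is hypothetical. The AF_UNIX domain, the stream socket type, and the SHUT_WR constant are assumptions taken from current POSIX headers, not from this manual.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the manual): send msg across a
 * connected socket pair and receive it into buf as a C string.
 * Returns the number of bytes received, or -1 on failure.
 */
ssize_t
pair_echo(const char *msg, char *buf, size_t buflen)
{
	int sv[2];
	ssize_t n;

	assert(msg != NULL && buf != NULL && buflen > 0);

	/* Two connected stream sockets in the local domain. */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return (-1);

	if (send(sv[0], msg, strlen(msg), 0) < 0) {
		(void)close(sv[0]);
		(void)close(sv[1]);
		return (-1);
	}

	/* Nothing more will be sent from this end of the connection. */
	(void)shutdown(sv[0], SHUT_WR);

	/* Leave room for the terminating NUL. */
	n = recv(sv[1], buf, buflen - 1, 0);
	if (n >= 0)
		buf[n] = '\0';

	(void)close(sv[0]);
	(void)close(sv[1]);
	return (n);
}
```

A caller supplies the message and a buffer with room for it plus a terminating NUL. A stream socket does not preserve message boundaries, so a robust receiver would loop on recv until it returns 0.
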
2.4  TTeerrmmiinnaallss aanndd DDeevviicceess
2.5  PPrroocceessss ddeebbuuggggiinngg
       ptrace        process trace
       ktrace        process tracing
3    SSuummmmaarryy ooff ffaacciilliittiieess

                            CCoonntteennttss


             NNoottaattiioonn aanndd TTyyppeess                            4
         1   KKeerrnneell pprriimmiittiivveess                             4
       1.1   PPrroocceesssseess aanndd pprrootteeccttiioonn                      5
     1.1.1   Host identifiers                              5
     1.1.2   Process identifiers                           5
     1.1.3   Process creation and termination              5
     1.1.4   User and group IDs                            6
     1.1.5   Sessions                                      7
     1.1.6   Process groups                                7
       1.2   MMeemmoorryy mmaannaaggeemmeenntt                             8
     1.2.1   Text, data, and stack                         8
     1.2.2   Mapping pages                                 8
     1.2.3   Page protection control                      10
     1.2.4   Giving and getting advice                    10
     1.2.5   Synchronization primitives                   10
       1.3   SSiiggnnaallss                                      11
     1.3.1   Overview                                     11
     1.3.2   Signal types                                 11
     1.3.3   Signal handlers                              12
     1.3.4   Sending signals                              13
     1.3.5   Protecting critical sections                 13
     1.3.6   Signal stacks                                14
       1.4   TTiimmeerrss                                       14
     1.4.1   Real time                                    14
     1.4.2   Interval time                                15
       1.5   DDeessccrriippttoorrss                                  16
     1.5.1   The reference table                          16
     1.5.2   Descriptor properties                        16
     1.5.3   Managing descriptor references               16
     1.5.4   Multiplexing requests                        16
       1.6   RReessoouurrccee ccoonnttrroollss                            18
     1.6.1   Process priorities                           18
     1.6.2   Resource utilization                         18
     1.6.3   Resource limits                              19
       1.7   SSyysstteemm ooppeerraattiioonn ssuuppppoorrtt                     19
     1.7.1   Monitoring system operation                  20
     1.7.2   Bootstrap operations                         20
     1.7.3   Shutdown operations                          21
     1.7.4   Accounting                                   21
         2   SSyysstteemm ffaacciilliittiieess                            21
       2.1   GGeenneerriicc ooppeerraattiioonnss                           22
     2.1.1   Read and write                               22
     2.1.2   Input/output control                         23
     2.1.3   Non-blocking and asynchronous operations     23
       2.2   FFiilleessyysstteemm                                   23
     2.2.1   Overview                                     23
     2.2.2   Naming                                       23
     2.2.3   Creation and removal                         24
   2.2.3.1   Directory creation and removal               24
   2.2.3.2   File creation                                24
   2.2.3.3   Creating references to devices               26
   2.2.3.4   Links and renaming                           26
   2.2.3.5   File, device, and fifo removal               27
     2.2.4   Reading and modifying file attributes        27
     2.2.5   Checking accessibility                       28
     2.2.6   Extension and truncation                     29
     2.2.7   Locking                                      29
     2.2.8   Disk quotas                                  30
     2.2.9   Remote filesystems                           30
    2.2.10   Other filesystems                            31
       2.3   IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss                  31
     2.3.1   Interprocess communication primitives        31
   2.3.1.1   Communication domains                        31
   2.3.1.2   Socket types and protocols                   31
   2.3.1.3   Socket creation, naming and service establishment 32
   2.3.1.4   Accepting connections                        33
   2.3.1.5   Making connections                           33
   2.3.1.6   Sending and receiving data                   33
   2.3.1.7   Scatter/gather and exchanging access rights  34
   2.3.1.8   Using read and write with sockets            35
   2.3.1.9   Shutting down halves of full-duplex connections 35
  2.3.1.10   Socket and protocol options                  35
     2.3.2   PF_LOCAL domain                              36
   2.3.2.1   Types of sockets                             36
   2.3.2.2   Naming                                       36
   2.3.2.3   Access rights transmission                   36
     2.3.3   INTERNET domain                              36
   2.3.3.1   Socket types and protocols                   36
   2.3.3.2   Socket naming                                36
   2.3.3.3   Access rights transmission                   36
   2.3.3.4   Raw access                                   36
       2.4   TTeerrmmiinnaallss aanndd DDeevviicceess                        37
     2.4.1   Terminals                                    37
   2.4.1.1   Terminal input                               37
   2.4.1.2   Terminal output                              38
     2.4.2   Structured devices                           38
     2.4.3   Unstructured devices                         38
       2.5   PPrroocceessss ddeebbuuggggiinngg                            38
     2.5.1   Traditional debugging                        38
     2.5.2   Kernel tracing                               38
         3   SSuummmmaarryy ooff ffaacciilliittiieess                        40
