The Brahma virtual machine interface
------------------------------------

Brahma is a low-level interface designed for pSather and Siva, but it
is intended to be useful for other projects as well.  Its interface
provides portable mechanisms for threads, synchronization, and active
messages.

Please send comments or questions to

    <bug-sather@gnu.org>




Compiling Brahma
----------------

Brahma must be compiled separately for each platform, including the
serial, threaded, and distributed/threaded variations.  Presently the
platforms marked with "*" are available:

 * serial		- No concurrency; synchronization operations are no-ops
 * smp_solaris		- Threads, emulated clusters (solaris)
   smp_lwp		- Threads, emulated clusters (linux/lwp)
   shmem		- Using shmem AM, for debugging (solaris)
 * tcpip		- TCP/IP AM (solaris)
   meiko		- Meiko (solaris)
 * myrinet		- Myrinet (solaris)

The Brahma archive file may be created for a given platform "foo":

   make foo INSTALL=xxx.a

Documentation on the specifics of each platform can be found in the
file <platform>.doc.



Using Brahma
------------

All Brahma routines and macros begin with "BR_".  A lowercase routine
name can be relied on to be implemented as a function.

To use Brahma, it is necessary to understand the Active Message model,
which imposes constraints on how messages may be used.  For example,
reply handlers must not request or reply, and it is only possible to
reply to the same node that sent the request.  For more information
about active messages, see

   http://now.cs.berkeley.edu/AM/active_messages.html

The file "brahma.h" of the Brahma distribution defines the following
interface:


Typedefs:

      BR_word_t		- Unsigned.  May be "short", "int", "long", 
      BR_doubleword_t	  or "long long" as appropriate.

      BR_cluster_t	- Integral cluster id type.

      BR_handler_0_t   	- Standard Active Message types, sending
      BR_handler_1_t	  zero through five word payload arguments.
      BR_handler_2_t
      BR_handler_3_t
      BR_handler_4_t
      BR_handler_5_t
      BR_handler_mem_t  - Handler type for bulk transfer messages.

      BR_thread_t	- ID of a thread; unique across all nodes and 8 bytes in size.

      BR_lock_t		- Synchronization types; these must be pointers.
      BR_sema_t

      BR_spinlock_t     - Synchronization type (see below)

      BR_delay_t	- A unit of delay, with "sec" and "nsec" fields.
      BR_delay_handler_t - Handler for delayed function calls.


Starting and stopping:

   void BR_init(int clusters, int argc, char *argv[])

      - Initialization; must be called on one node before any other
	calls.  This creates copies of the code on other nodes, if
	necessary.  On some platforms, the number of clusters is
	determined by hardware; on others, it may be variable.  It is
	an error if the number of clusters requested is not available;
	if zero is passed, the maximum number available is used.

	BR_init(...) is called only once by the programmer, who then
	sees many separate threads of control return as with fork().
	On a network of workstations, this requires that BR_init
	remotely invoke the program with the original arguments.  The
	remote invocations eventually reach their own BR_init,
	synchronize with the original process, and all continue.  For
	this reason, programs should not do nondeterministic things
	until BR_init is called so that all copies of the program will
	be sure to reach BR_init in the same state.

	On platforms which naturally have a single system image (SMPs,
	T3E, etc.) this is not an issue, since there is really only a
	single process calling BR_init.

   void BR_exit()

      - Shuts down Brahma.  This must be called on cluster zero,
	and will shut down all threads and return, even if other nodes
	have not yet reached BR_exit.  Implementations on platforms
	without a single system image such as networks of workstations
	should take care to clean up remote processes when a node dies
	under unusual circumstances (e.g. program fault or Ctrl-C).
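
The start-up and shut-down sequence described above might be sketched
as follows.  This is only an illustration: the BR_* definitions in the
fragment are single-process stand-ins invented so that it compiles
without Brahma (a real program would get them from "brahma.h"), and
"run_program" is a hypothetical name.  What clusters other than zero
should do when they finish is not specified here, so the sketch simply
lets them return.

```c
#include <assert.h>
#include <stdio.h>

/* --- Stand-ins so the sketch compiles without Brahma (assumptions). --- */
typedef unsigned int BR_cluster_t;
static unsigned int stub_clusters = 0;
static int stub_running = 0;
static void BR_init(int clusters, int argc, char *argv[]) {
    (void)argc; (void)argv;
    /* Zero means "maximum available"; this stub only ever has one. */
    stub_clusters = clusters > 0 ? (unsigned int)clusters : 1u;
    stub_running = 1;
}
static void BR_exit(void) { assert(stub_running); stub_running = 0; }
#define BR_CLUSTERS() (stub_clusters)
#define BR_HERE()     ((BR_cluster_t)0)
/* ---------------------------------------------------------------------- */

/* Hypothetical program body; returns the number of clusters it saw. */
static unsigned int run_program(int argc, char *argv[]) {
    unsigned int n;

    /* Do nothing nondeterministic before this point: on a network of
       workstations every remote copy must reach BR_init in the same
       state as the original process. */
    BR_init(0, argc, argv);

    n = BR_CLUSTERS();
    printf("cluster %u of %u\n", (unsigned int)BR_HERE(), n);

    if (BR_HERE() == 0)
        BR_exit();            /* BR_exit must be called on cluster zero */
    return n;
}
```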


Location:

   unsigned int BR_CLUSTERS()

      - Number of clusters.  Often not a compile-time constant.

   unsigned int BR_PROCESSORS()

      - Number of processors at current cluster.

   BR_cluster_t BR_HERE()

      - Local cluster id, always between 0 and BR_CLUSTERS()-1.

   caddr_t BR_CLUSTER_LOCAL()

      - Get the local memory associated with this cluster.  For truly
	distributed systems, this may just be a pointer to a static
	region at the same address everywhere.  For systems which emulate
        distribution in a single address space, this will be a zeroed
        region allocated in BR_init(...).

   size_t BR_CLUSTER_LOCAL_SIZE()

      - Size of the cluster local memory in bytes, at least 1KB.
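
One possible use of the cluster-local region is to overlay a small
struct of per-cluster state on it.  The sketch below assumes the
region is suitably aligned for the struct, which this document does not
state; the two BR_* macros are stand-ins backed by a static buffer, and
"struct cluster_state" and its fields are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins so the sketch compiles without Brahma (assumptions). --- */
static union { long double align; char bytes[1024]; } stub_local;
#define BR_CLUSTER_LOCAL()      (stub_local.bytes)
#define BR_CLUSTER_LOCAL_SIZE() (sizeof stub_local.bytes)
/* ---------------------------------------------------------------------- */

/* Hypothetical per-cluster bookkeeping kept in cluster-local memory. */
struct cluster_state {
    unsigned long messages_handled;
    unsigned long threads_forked;
};

/* Return this cluster's state block, checking that it fits in the
   region (only 1KB is guaranteed). */
static struct cluster_state *cluster_state(void) {
    assert(sizeof(struct cluster_state) <= BR_CLUSTER_LOCAL_SIZE());
    return (struct cluster_state *)(void *)BR_CLUSTER_LOCAL();
}
```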


Messages:

   size_t BR_MAX_XFER()

      - Maximum number of bytes that may be transferred using the
	"BR_STORE", "BR_ASYNC_STORE" and "BR_GET" calls.

   void BR_POLL()

      - Macro for polling the network.  This should be inserted into
	the instruction stream regularly, e.g. at each function call
	and inside long loops.  On some platforms this is not required and
	does nothing.  

   void BR_REQUEST_0(BR_cluster_t c, BR_handler_0_t handler)
   void BR_REQUEST_1(BR_cluster_t c, BR_handler_1_t handler, BR_word_t arg0)
   void BR_REQUEST_2(BR_cluster_t c, BR_handler_2_t handler, BR_word_t arg0, BR_word_t arg1)
   void BR_REQUEST_3(BR_cluster_t c, BR_handler_3_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2)
   void BR_REQUEST_4(BR_cluster_t c, BR_handler_4_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3)
   void BR_REQUEST_5(BR_cluster_t c, BR_handler_5_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3, BR_word_t arg4)

      - Asynchronously send a request active message.  There is no
        failure mode.

   void BR_REPLY_0(BR_cluster_t c, BR_handler_0_t handler)
   void BR_REPLY_1(BR_cluster_t c, BR_handler_1_t handler, BR_word_t arg0)
   void BR_REPLY_2(BR_cluster_t c, BR_handler_2_t handler, BR_word_t arg0, BR_word_t arg1)
   void BR_REPLY_3(BR_cluster_t c, BR_handler_3_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2)
   void BR_REPLY_4(BR_cluster_t c, BR_handler_4_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3)
   void BR_REPLY_5(BR_cluster_t c, BR_handler_5_t handler, 
		     BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3, BR_word_t arg4)

      - Asynchronously send a reply active message.  This must only be
	done from within a request handler.  There is no failure mode.
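
A request/reply round trip under these rules might look like the sketch
below.  The handler signature (sending cluster first, then the word
arguments) is inferred from BR_unlock_handler and is an assumption, as
are the stand-in macros, which emulate a single cluster by running
handlers immediately.  "remote_square" and the two handlers are
hypothetical names.  Waiting on a plain flag is a simplification for
illustration.

```c
#include <assert.h>

/* --- Stand-ins: one-cluster emulation, handlers run immediately. --- */
typedef unsigned long BR_word_t;
typedef unsigned int  BR_cluster_t;
typedef void (*BR_handler_1_t)(BR_cluster_t from, BR_word_t arg0);
#define BR_HERE() ((BR_cluster_t)0)
#define BR_POLL() ((void)0)
#define BR_REQUEST_1(c, h, a0) ((void)(c), (h)(BR_HERE(), (a0)))
#define BR_REPLY_1(c, h, a0)   ((void)(c), (h)(BR_HERE(), (a0)))
/* ------------------------------------------------------------------- */

static volatile int       reply_seen  = 0;
static volatile BR_word_t reply_value = 0;

/* Reply handler: runs back on the requesting cluster.  It only records
   the result; a reply handler must not request or reply. */
static void square_reply(BR_cluster_t from, BR_word_t result) {
    (void)from;
    reply_value = result;
    reply_seen  = 1;
}

/* Request handler: runs on the target cluster.  It may reply, but only
   to the cluster that sent the request ("from"). */
static void square_request(BR_cluster_t from, BR_word_t x) {
    BR_REPLY_1(from, square_reply, x * x);
}

/* Hypothetical helper: ask cluster c to square x, then wait. */
static BR_word_t remote_square(BR_cluster_t c, BR_word_t x) {
    reply_seen = 0;
    BR_REQUEST_1(c, square_request, x);
    while (!reply_seen)
        BR_POLL();            /* keep servicing the network while waiting */
    return reply_value;
}
```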

   void BR_STORE(BR_cluster_t c, caddr_t from, caddr_t to, size_t size,
		     BR_handler_mem_t handler, BR_word_t arg0)

      - Transfer "size" bytes from local memory at address "from" to
	remote address "to" on cluster "c"; on completion, the remote
	handler is invoked with the arguments (<requesting cluster>,
	to, size, arg0).  The sender blocks until the memory transfer
	is complete.  There is no failure mode.

   void BR_ASYNC_STORE(BR_cluster_t c, caddr_t from, caddr_t to, size_t size,
		     BR_handler_mem_t handler, BR_word_t arg0a,
		     BR_handler_mem_t on_completion, BR_word_t arg0b)

      - Like BR_STORE, but the sender does not block.  "on_completion" is
	invoked locally when transfer completes with the arguments
	(c, from, size, arg0b), while "handler" is invoked remotely
	with the arguments (<requesting cluster>, to, size, arg0a).  
	There is no failure mode.

   void BR_GET(BR_cluster_t c, caddr_t from, caddr_t to, size_t size,
		     BR_handler_mem_t handler, BR_word_t arg0)

      - Transfer "size" bytes from address "from" on remote cluster
	"c" to local address "to".  When the transfer is complete,
	"handler" is invoked locally with the arguments (c, to, size,
	arg0).  This may not be called from any handler function.
	There is no failure mode.
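
Because a single call can move at most BR_MAX_XFER() bytes, a larger
fetch has to be split up.  The sketch below is one way this might be
done, waiting on a semaphore that BR_signal_mem_handler signals as each
piece arrives.  All BR_* definitions are stand-ins for a one-cluster
emulation in which transfers complete immediately; the tiny
BR_MAX_XFER() value, the use of "char *" in place of caddr_t, and the
helper name "get_chunked" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* --- Stand-ins: one-cluster emulation, transfers finish at once. --- */
typedef uintptr_t    BR_word_t;
typedef unsigned int BR_cluster_t;
struct stub_sema { unsigned int count; };
typedef struct stub_sema *BR_sema_t;          /* must be a pointer */
typedef void (*BR_handler_mem_t)(BR_cluster_t, void *, int, BR_word_t);
#define BR_MAX_XFER() ((size_t)8)             /* tiny, to force chunking */
static void BR_WAIT(BR_sema_t s) { assert(s->count > 0); s->count--; }
static void BR_signal_mem_handler(BR_cluster_t from, void *a, int size,
                                  BR_word_t sema) {
    (void)from; (void)a; (void)size;          /* first three are ignored */
    ((BR_sema_t)sema)->count++;
}
static void BR_GET(BR_cluster_t c, char *from, char *to, size_t size,
                   BR_handler_mem_t handler, BR_word_t arg0) {
    memcpy(to, from, size);
    handler(c, to, (int)size, arg0);
}
/* ------------------------------------------------------------------- */

/* Hypothetical helper: fetch "size" bytes from cluster c in pieces no
   larger than BR_MAX_XFER(), waiting for each piece before starting
   the next.  Returns the number of pieces.  Must not be called from a
   handler, since BR_GET may not be. */
static size_t get_chunked(BR_cluster_t c, char *from, char *to,
                          size_t size, BR_sema_t done) {
    size_t chunks = 0;
    while (size > 0) {
        size_t n = size < BR_MAX_XFER() ? size : BR_MAX_XFER();
        BR_GET(c, from, to, n, BR_signal_mem_handler, (BR_word_t)done);
        BR_WAIT(done);        /* signalled once this piece has arrived */
        from += n;
        to   += n;
        size -= n;
        chunks++;
    }
    return chunks;
}
```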

   void BR_dummy(...)

      - Does absolutely nothing; convenient no-op handler.

   void BR_freeze()

      - Forcefully halt all threads on all clusters (other than those
	needed by the Brahma implementation itself).  On some platforms,
	this may also imply that the network is drained of user active
	messages.  On return, the executing thread must not be able
	to observe any user activity other than itself.  However, new
	threads and messages may be initiated by the executing thread
	without being affected by the freeze.

	This is useful for debugging and garbage collection, and must be
	called on cluster 0.

   void BR_thaw()

      - Restarts user activity halted by "BR_freeze".  This should be
	executed exactly once, after a freeze, on cluster 0.

   
Threads:

   void BR_FORK_0(BR_cluster_t c, BR_handler_0_t func)
   void BR_FORK_1(BR_cluster_t c, BR_handler_1_t func, BR_word_t arg0)
   void BR_FORK_2(BR_cluster_t c, BR_handler_2_t func, BR_word_t arg0, BR_word_t arg1)
   void BR_FORK_3(BR_cluster_t c, BR_handler_3_t func, 
		  BR_word_t arg0, BR_word_t arg1, BR_word_t arg2)
   void BR_FORK_4(BR_cluster_t c, BR_handler_4_t func, 
		  BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3)
   void BR_FORK_5(BR_cluster_t c, BR_handler_5_t func, 
		  BR_word_t arg0, BR_word_t arg1, BR_word_t arg2, BR_word_t arg3, BR_word_t arg4)

      - Create a new thread on cluster "c" executing "func".  Brahma
	guarantees that thread tear-down times won't overconsume resources
	when there are lots of very short threads forked quickly; there
	is no way to explicitly join.
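
Since there is no explicit join, a parent that needs to wait for its
children has to arrange this itself.  One approach, sketched below, is
to have each forked thread signal a semaphore when it finishes.  The
BR_* definitions are serial stand-ins that run the forked function
immediately; the thread function signature (cluster first, then the
word arguments) is an assumption, and "worker" and "fork_and_wait" are
hypothetical names.  The threads are forked locally because the
semaphore pointer is only meaningful on the cluster that created it.

```c
#include <assert.h>
#include <stdint.h>

/* --- Stand-ins: serial emulation, a "fork" runs the function now. --- */
typedef uintptr_t    BR_word_t;
typedef unsigned int BR_cluster_t;
struct stub_sema { unsigned int count; };
typedef struct stub_sema *BR_sema_t;
typedef void (*BR_handler_2_t)(BR_cluster_t, BR_word_t, BR_word_t);
#define BR_HERE() ((BR_cluster_t)0)
#define BR_FORK_2(c, f, a0, a1) ((void)(c), (f)(BR_HERE(), (a0), (a1)))
static void BR_SIGNAL(BR_sema_t s) { s->count++; }
static void BR_WAIT(BR_sema_t s)   { assert(s->count > 0); s->count--; }
/* -------------------------------------------------------------------- */

static unsigned long work_done = 0;

/* Thread body: do some work, then announce completion. */
static void worker(BR_cluster_t from, BR_word_t item, BR_word_t done) {
    (void)from;
    work_done += (unsigned long)item;  /* real threads would need a lock */
    BR_SIGNAL((BR_sema_t)done);
}

/* Hypothetical helper: fork n local threads and wait for all of them. */
static unsigned long fork_and_wait(unsigned int n, BR_sema_t done) {
    unsigned int i;
    work_done = 0;
    for (i = 0; i < n; i++)
        BR_FORK_2(BR_HERE(), worker, (BR_word_t)(i + 1), (BR_word_t)done);
    for (i = 0; i < n; i++)
        BR_WAIT(done);        /* one wait per thread stands in for a join */
    return work_done;
}
```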

   BR_thread_t BR_THREAD_ID()

      - ID of executing thread

   int BR_SAME_THREAD(BR_thread_t id1, BR_thread_t id2)

      - Returns 1 if the two ids refer to the same thread, otherwise 0.

   BR_thread_t BR_INVALID_ID()

      - Returns a value which can never be an id of an actual thread,
	to use as a sentinel value.

   unsigned int BR_THREAD_HASH(BR_thread_t id)

      - Returns an integer that can be used for hashing thread ids.

   void BR_SET_THREAD_LOCAL(caddr_t x)

      - Set the local memory associated with the executing thread to "x".

   caddr_t BR_GET_THREAD_LOCAL()

      - Get the local memory set with "BR_SET_THREAD_LOCAL".



Timing:

   void BR_delay_function(BR_delay_t n, BR_delay_handler_t func, void *arg)

      - Asynchronously call a function after at least n.sec seconds and
	n.nsec nanoseconds, with arg as argument.  May or may not be
	executed by the thread that called "BR_delay_function".  The
	function should be short and non-blocking.
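
As an illustration, a delay could be built from a millisecond count and
used to arm a simple timeout flag.  The field types of BR_delay_t and
the handler signature (a single "void *" argument) are guesses, the
BR_delay_function stand-in runs the handler immediately instead of
waiting, and "delay_from_ms", "set_flag" and "arm_timeout" are
hypothetical helpers.

```c
#include <assert.h>

/* --- Stand-ins so the sketch compiles without Brahma (assumptions). --- */
typedef struct { long sec; long nsec; } BR_delay_t;
typedef void (*BR_delay_handler_t)(void *arg);
static void BR_delay_function(BR_delay_t n, BR_delay_handler_t func,
                              void *arg) {
    (void)n;
    func(arg);                /* the stub does not actually wait */
}
/* ---------------------------------------------------------------------- */

/* Hypothetical helper: convert milliseconds to a BR_delay_t, keeping
   the nsec field below one second. */
static BR_delay_t delay_from_ms(unsigned long ms) {
    BR_delay_t d;
    d.sec  = (long)(ms / 1000UL);
    d.nsec = (long)(ms % 1000UL) * 1000000L;
    return d;
}

/* Delay handler: short and non-blocking, as required. */
static void set_flag(void *arg) {
    *(int *)arg = 1;
}

/* Hypothetical helper: clear *flag and have it set after "ms". */
static void arm_timeout(unsigned long ms, int *flag) {
    *flag = 0;
    BR_delay_function(delay_from_ms(ms), set_flag, flag);
}
```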


Synchronization:

There are three kinds of synchronization provided by Brahma: mutual
exclusion locks and semaphores (BR_lock_t and BR_sema_t), and a
lightweight spinlock (BR_spinlock_t).  The heavy versions are
equivalent to the synchronization functionality in Solaris threads:
they may block, and the critical regions they are used with may be of
arbitrary length and be nested.  The spinlock tries to use inline
atomic instructions to build very fast spinlock-style mutual exclusion.
It should only be used for very small, nonblocking, non-nested
exclusion.  A heavy version can always safely replace a spinlock, with
some performance penalty.

Using locks and semaphores in request or reply handlers should be avoided.

   BR_lock_t BR_LOCK_CREATE()
   BR_sema_t BR_SEMA_CREATE(unsigned int count)
   void BR_LOCK_DELETE(BR_lock_t l)
   void BR_SEMA_DELETE(BR_sema_t s)

      - Create or delete an instance of the synchronization object.

   void BR_LOCK(BR_lock_t l)
   void BR_UNLOCK(BR_lock_t l)

      - Lock and unlock the lock "l".

   void BR_WAIT(BR_sema_t s)
   void BR_SIGNAL(BR_sema_t s)

      - Wait or signal the semaphore "s".  Wait blocks if the count
	is zero, then atomically decrements the count and unblocks.
	Signal increments the count.

   int BR_TRY_LOCK(BR_lock_t l)
   int BR_TRY_WAIT(BR_sema_t s)

      - Nonblocking versions of BR_LOCK and BR_WAIT; a zero return
	indicates that the lock or semaphore was not acquired.

   void BR_unlock_handler(BR_cluster_t ignored, BR_word_t lock)
   void BR_signal_handler(BR_cluster_t ignored, BR_word_t sema)

      - Useful handlers for unlocking and signalling on remote nodes.
        (The lock or semaphore must be cast to BR_word_t to be passed
        as the second argument.)

   void BR_signal_mem_handler(BR_cluster_t from, void *a, int size, BR_word_t sema)

      - Useful handler for signalling on remote nodes.  The first three
	arguments are ignored.

   BR_SPINLOCK_DEC(s);
   BR_SPINLOCK_INIT(s);

      - Declaration and initialization of a spinlock.  These are macros to allow them to
	be ignored in the serial case.  Spinlocks don't need to be
	explicitly created or deleted, but they do need to be explicitly
	initialized before use.  The special type "BR_spinlock_t" may be
	used for declarations or typedefs, but BR_SPINLOCK_DEC is
	preferred where possible.  (This is because C makes it nonportable
	to have datatypes of size zero, and the BR_SPINLOCK_DEC may simply
	be ignored while BR_spinlock_t must waste at least a byte in the
	serial case.)

   BR_SPINLOCK_LOCK(s);
   BR_SPINLOCK_UNLOCK(s);
   BR_SPINLOCK_TRY(s);

      - Synchronization operations.  These operations should only be
	used around very small critical regions because they use busy
	waiting instead of proper blocking (BR_SPINLOCK_TRY obviously
	doesn't busy wait).
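
A typical use would be guarding a counter with a very short critical
region, as in the sketch below.  The macro definitions are stand-ins
built on C11 atomic_flag purely so the fragment compiles; they are not
Brahma's implementation.  The sketch also assumes that BR_SPINLOCK_TRY
yields nonzero on success, by analogy with BR_TRY_LOCK, which this
document does not state.  "struct counter" and its functions are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* --- Stand-ins built on C11 atomic_flag (assumptions, not Brahma). --- */
#define BR_SPINLOCK_DEC(s)    atomic_flag s
#define BR_SPINLOCK_INIT(s)   atomic_flag_clear(&(s))
#define BR_SPINLOCK_LOCK(s)   do { } while (atomic_flag_test_and_set(&(s)))
#define BR_SPINLOCK_UNLOCK(s) atomic_flag_clear(&(s))
#define BR_SPINLOCK_TRY(s)    (!atomic_flag_test_and_set(&(s)))
/* --------------------------------------------------------------------- */

struct counter {
    BR_SPINLOCK_DEC(lock);    /* declared with the macro, as preferred */
    unsigned long value;
};

/* Spinlocks need no create/delete, but must be initialized before use. */
static void counter_init(struct counter *c) {
    BR_SPINLOCK_INIT(c->lock);
    c->value = 0;
}

/* Add n and return the new value.  The critical region is tiny,
   nonblocking and non-nested. */
static unsigned long counter_add(struct counter *c, unsigned long n) {
    unsigned long v;
    BR_SPINLOCK_LOCK(c->lock);
    v = (c->value += n);
    BR_SPINLOCK_UNLOCK(c->lock);
    return v;
}

/* Add n only if the lock is free; returns 1 on success, 0 otherwise. */
static int counter_try_add(struct counter *c, unsigned long n) {
    if (!BR_SPINLOCK_TRY(c->lock))
        return 0;
    c->value += n;
    BR_SPINLOCK_UNLOCK(c->lock);
    return 1;
}
```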

   BR_BARRIER()

      - A global barrier.  One thread on each cluster must call it, and
	none of the callers proceeds until all of them have done so.

Debugging:

   const char *BR_ASCII_PLATFORM()

      - Return human-readable description of the Brahma platform.

   void BR_ascii_id(BR_thread_t id, char *buf, size_t maxlen)

      - Create human-readable form of the thread id in the buffer "buf",
	for debugging.

