MCL FAQ(1)                USER COMMANDS                MCL FAQ(1)



  NAME
          mclfaq  -  faqs  and  facts about the MCL cluster algo-
          rithm.

          MCL refers to the generic MCL  algorithm  and  the  MCL
          process  on which the algorithm is based. mcl refers to
          the implementation. This FAQ answers questions  related
          to both. In some places MCL is written where MCL or mcl
          can be read. This is the case for  example  in  section
          2, What  kind of graphs.  It should in general be obvi-
          ous from the context.

          This FAQ does not begin to attempt to explain the moti-
          vation  and  mathematics behind the MCL algorithm - the
          internals are not explained. A broad view is  given  in
          faq 1.2, and see also faq 1.5 and section REFERENCES.

          Some   additional   sections  preceed  the  actual  faq
          entries.  The TOC section contains  a  listing  of  all
          questions.

  RESOURCES
          mcl     development     is     discussed     on    mcl-
          devel@lists.mdcc.cx,   this   list   is   archived   at
          http://lists.mdcc.cx/mcl-devel/.

          See  the  REFERENCES Section for publications detailing
          the mathematics behind the MCL algorithm.

  TOC
  1...... General questions
   1.1... For whom is mcl and for whom is this FAQ?
   1.2... What is the relationship between the MCL  process,  the
          MCL algorithm, and the 'mcl' implementation?
   1.3... What do the letters MCL stand for?
   1.4... How could you be so feebleminded to use MCL as abbrevi-
          ation? Why is it labeled 'Markov cluster' anyway?
   1.5... Where can I learn about the innards of  the  MCL  algo-
          rithm/process?
   1.6... For which platforms is mcl available?
   1.7... How does mcl's versioning scheme work?

  2...... What kind of graphs
   2.1... What is legal input for MCL?
   2.2... What is sensible input for MCL?
   2.3... Does MCL work for weighted graphs?
   2.4... Does MCL work for directed graphs?
   2.5... Can  MCL  work for lattices / directed acyclic graphs /
          DAGs?
   2.6... Does MCL work for tree graphs?
   2.7... For what kind of graphs does  MCL  work  well  and  for
          which does it not?
   2.8... What  makes a good input graph?  How do I construct the
          similarities?  How to make  them  satisfy  this  Markov
          condition?
   2.9... My input graph is directed. Is that bad?
   2.10.. Why  does  mcl  like  undirected graphs and why does it
          dislike uni-directed graphs so much?
   2.11.. How do I check that my graph/matrix is  symmetric/undi-
          rected?

  3...... Speed and complexity
   3.1... How fast is mcl/MCL?
   3.2... What statistics are available?
   3.3... Does this implementation need to sort vectors?
   3.4... mcl does not compute the ideal MCL process!
   3.5... I've  read  elsewhere  that  MCL is slow [compared with
          XYZ]

  4...... Resource tuning / accuracy
   4.1... What do you mean by resource tuning?
   4.2... How do I compute the maximum amount of  RAM  needed  by
          mcl?
   4.3... How  much does the mcl clustering differ from the clus-
          tering resulting from a perfectly computed MCL process?
   4.4... How do I know that I am using enough resources?
   4.5... Where  is the mathematical analysis of this mcl pruning
          strategy?
   4.6... What qualitative  statements  can  be  made  about  the
          effect of pruning?
   4.7... At  different  high  resource levels my clusterings are
          not identical.  How can I trust the output  clustering?

  5...... Tuning cluster granularity
   5.1... How do I tune cluster granularity?
   5.2... The effect of inflation on cluster granularity.
   5.3... The  effect  of  similarity distribution homogeneity on
          cluster granularity.
   5.4... The effect of initial centering on cluster granularity.
   5.5... How to implement two-level approaches using mcl.

  6...... Implementing the MCL algorithm
   6.1... How easy is it to implement the MCL algorithm?

  7...... Cluster overlap / MCL iterand cluster interpretation
   7.1... Introduction
   7.2... Can the clusterings returned by mcl contain overlap?
   7.3... How  do  I  obtain  the clusterings associated with MCL
          iterands?

  8...... Miscellaneous
   8.1... How do I find the default settings of mcl?
   8.2... What's next?

  FAQ
                             General questions

   1.1    For whom is mcl and for whom is this FAQ?

          For everybody with an appetite  for  graph  clustering.
          Regarding  the FAQ, I have kept the amount of mathemat-
          ics as low as possible, as far as  matrix  analysis  is
          concerned.   Inevitably,  some  terminology pops up and
          some references are made to  the  innards  of  the  MCL
          algorithm,  especially  in the section on resources and
          accuracy. Graph terminology is used somewhat more care-
          lessly   though.  The  future  might  bring  definition
          entries, right now you have to do without.   Mathemati-
          cally inclined people may be interested in the pointers
          found in the REFERENCES section.

          Given this mention of mathematics,  let  me  point  out
          this   one  time  only  that  using  mcl  is  extremely
          straightforward anyway. You need only mcl and an  input
          graph  (refer  to the mcl manual page), and many people
          trained in something else than  mathematics  are  using
          mcl happily.

   1.2    What  is  the relationship between the MCL process, the
          MCL algorithm, and the 'mcl' implementation?

          mcl is what you use for clustering. It  implements  the
          MCL algorithm, which is a cluster algorithm for graphs.
          The MCL algorithm is basically a shell in which the MCL
          process  is  computed  and interpreted. I will describe
          them in the natural, reverse, order.

          The MCL process generates a row of stochastic  matrices
          given some initial stochastic matrix. The elements with
          even index are obtained by expanding the previous  ele-
          ment,  and  the elements with odd index are obtained by
          inflating the previous  element  given  some  inflation
          constant. Expansion is nothing but normal matrix squar-
          ing, and inflation is a particular way of rescaling the
          entries  of  a  stochastic  matrix such that it remains
          stochastic.

          The row of MCL elements (from the MCL  process)  is  in
          principle  without  end,  but  what happens is that the
          elements converge to  some  specific  kind  of  matrix,
          called the limit of the process. The heuristic underly-
          ing MCL predicts that the interaction of expansion with
          inflation  will  lead  to  a  limit  exhibiting cluster
          structure in the  graph  associated  with  the  initial
          matrix. This is indeed the case, and several mathemati-
          cal results tie MCL iterands and  limits  and  the  MCL
          interpretation together (REFERENCES).

          The MCL algorithm is simply a shell around the MCL pro-
          cess in which an input graph  is  transformed  into  an
          initial  matrix  suitable  for starting the process, in
          which inflation parameters are set, and  in  which  the
          MCL  process  is stopped once the limit is reached, and
          in which the result is interpreted as a clustering.

          The mcl implementation supplies  the  functionality  of
          the  MCL  algorithm,  with  some  extra  facilities for
          manipulation  of  the  input  graph,  interpreting  the
          result, manipulating resources while computing the pro-
          cess, and monitoring the state of these  manipulations.

   1.3    What do the letters MCL stand for?

          For  Markov  Cluster.  The  MCL  algorithm is a cluster
          algorithm that is basically a shell in which  an  alge-
          braic  process  is  computed.  This process iteratively
          generates stochastic matrices,  also  known  as  Markov
          matrices,  named after the famous Russian mathematician
          Andrei Markov.

   1.4    How could you be so feebleminded to use MCL as abbrevi-
          ation? Why is it labeled 'Markov cluster' anyway?

          Sigh.  It  is  a widely known fact that a TLA or Three-
          Letter-Acronym is the canonical self-describing  abbre-
          viation  for the name of a species with which computing
          terminology is infested (quoted from  the  Free  Online
          Dictionary of Computing). Back when I was thinking of a
          nice tag for this cute algorithm, I was totally unaware
          of  this.  I naturally dismissed MC (and would still do
          that today). Then MCL occurred to me, and without  giv-
          ing  it  much  thought  I  started  using it.  A Google
          search (or was I still  using  Alta-Vista  back  then?)
          might have kept me from going astray.

          Indeed, MCL is used as a tag for Macintosh Common Lisp,
          Mission Critical Linux, Monte Carlo  Localization,  MUD
          Client for Linux, Movement for Canadian Literacy, and a
          gazillion other things - refer to the file  mclmcl.txt.
          Confusing.  It seems that the three characters MCL pos-
          sess otherworldly magical powers making them an ever so
          strange  and  strong attractor in the space of TLAs. It
          probably helps that Em-See-Ell  (Em-Say-Ell  in  Dutch)
          has  some  rhythm  to it as well. Anyway MCL stuck, and
          it's here to stay.

          On a more general level, the label  Markov  Cluster  is
          not  an  entirely  fortunate  choice  either.  Although
          phrased in the language  of  stochastic  matrices,  MCL
          theory bears very little relation to Markov theory, and
          is much closer to matrix analysis (including  Hilbert's
          distance)  and  the  theory  of  dynamical  systems. No
          results have been derived in the latter framework,  but
          many conjectures are naturally posed in the language of
          dynamical systems.

   1.5    Where can I learn about the innards of  the  MCL  algo-
          rithm/process?

          Currently,  the most basic explanation of the MCL algo-
          rithm is found in the technical report [2]. It contains
          sections  on  several  other (related) subjects though,
          and it assumes some working knowledge on graphs, matrix
          arithmetic, and stochastic matrices.

   1.6    For which platforms is mcl available?

          It  should  compile and run on virtually any flavour of
          UNIX (including Linux and the BSD variants of  course).
          Following  the instructions in the INSTALL file shipped
          with mcl  should  be  straightforward  and  sufficient.
          Courtesy  to  Joost  van Baal who completely autofooled
          mcl.

          Building MCL on Wintel (Windows on Intel  chip)  should
          be  straightforward if you use the full suite of cygwin
          tools. Install cygwin if you do not have it yet. In the
          cygwin  shell, unpack mcl and simply issue the commands
          ./configure,  make,  make  install,  i.e.  follow   the
          instructions in INSTALL.

          This  MCL  implementation  has not yet been reported to
          run on MAC. For the latest Mac OS X  one  would  expect
          that it is certainly possible to make this happen.

          If you have further questions or news about this issue,
          contact mcl-devel <at> lists <dot> mdcc <dot> cx.

   1.7    How does mcl's versioning scheme work?

          The current setup, which I hope to continue,  is  this.
          All  releases are identified by a date stamp. For exam-
          ple 02-095 denotes day 95 in the year 2002.  This  date
          stamp  agrees  (as of April 2000) with the (differently
          presented) date stamp used in all manual pages  shipped
          with  that release.  For example, the date stamp of the
          FAQ you are reading is  3 Oct 2003,  which  corresponds
          with the MCL stamp 03-276.  The Changelog file contains
          a list of what's changed/added with each release.  Cur-
          rently,   the   date   stamp  is  the  primary  way  of
          identifying an mcl release. When asked for its  version
          by using --version, mcl outputs both the date stamp and
          a version tag (see below).

          In early 2002 it occurred to me  that  mcl  should,  in
          addition  to time stamps, also have something like ver-
          sion numbers, wanting to use those to indicate notewor-
          thy  changes.  The  April  2002 release got version tag
          1.001, in order to celebrate the  then-recent  addition
          of  this  FAQ,  mcl's  new  logging facility --log, and
          clmimac to  the  MCL  distribution.  The  January  2003
          release had its version number bumped to 1.002, marking
          MCL's ability to directly deal with a much more general
          type  of graph encoding.  Currently, the version tag is
          not used in the mcl distribution name - only  the  date
          stamp is used for that.

                            What kind of graphs

   2.1    What is legal input for MCL?

          Any graph (encoded as a matrix of similarities) that is
          nonnegative, i.e. all similarities are greater than  or
          equal to zero.

   2.2    What is sensible input for MCL?

          It  is  ok  for  graphs to be weighted, and they should
          preferably be symmetric.   They  should  certainly  not
          contain  parts that are (almost) cyclic, although noth-
          ing stops you from experimenting with such input.

   2.3    Does MCL work for weighted graphs?

          Yes, unequivocally. They should preferably  be  symmet-
          ric/undirected though.  See entries 2.7 and 2.8.

   2.4    Does MCL work for directed graphs?

          Maybe, with a big caveat. See entries 2.8 and 2.9.

   2.5    Can  MCL  work for lattices / directed acyclic graphs /
          DAGs?

          Such graphs [term] can  surely  exhibit  clear  cluster
          structure. If they do, there is only one way for mcl to
          find out. You have to change all arcs to edges, i.e. if
          there is an arc from i to j with similarity s(i,j) - by
          the DAG property this implies s(j,i) = 0  -  then  make
          s(j,i) equal to s(i,j).

          This  may feel like throwing away valuable information,
          but in  truth  the  information  that  is  thrown  away
          (direction)  is  not  informative  with  respect to the
          presence of cluster structure. This may well deserve  a
          longer discussion than would be justified here.

   2.6    Does MCL work for tree graphs?

          Nah, I don't think so. More info at entry 2.7.

   2.7    For  what  kind  of  graphs  does MCL work well and for
          which does it not?

          Graphs in  which  the  diameter  [term]  of  (subgraphs
          induced   by)   natural  clusters  is  not  too  large.
          Additionally,  graphs  should  preferably  be  (almost)
          undirected (see entry below) and not so sparse that the
          cardinality of the edge set is close to the  number  of
          nodes.

          A  class  of  such  very  sparse graphs is that of tree
          graphs. You might look into graph  visualization  soft-
          ware  and research if you are interested in decomposing
          trees into 'tight' subtrees.

          The diameter criterion could be violated by  neighbour-
          hood  graphs  derived from vector data. In the specific
          case of 2 and 3 dimensional data, you might  be  inter-
          ested in image segmentation and boundary detection, and
          for the general case there is a  host  of  other  algo-
          rithms out there. [add]

          In  case  of weighted graphs, the notion of diameter is
          sometimes  not  applicable.  Generalizing  this  notion
          requires inspecting the mixing properties of a subgraph
          induced by a natural cluster in terms of its  spectrum.
          However,  the  diameter statement is something grounded
          on heuristic  considerations  (confirmed  by  practical
          evidence  [4])  to  begin  with, so you should probably
          forget about mixing properties.

   2.8    What makes a good input graph?  How do I construct  the
          similarities?   How  to  make  them satisfy this Markov
          condition?

          To begin with the last one: you need not and  must  not
          make  the  input  graph  such that it is stochastic aka
          Markovian [term]. What you need to do is make  a  graph
          that  is  preferably  symmetric/undirected,  i.e. where
          s(i,j) = s(j,i) for all nodes i and j. It need  not  be
          perfectly  undirected, see the following faq for a dis-
          cussion of that.  mcl will work with the graph of  ran-
          dom walks that is associated with your input graph, and
          that is the natural state of affairs.

          The input graph should  preferably  be  honest  in  the
          sense  that if s(x,y)=N and s(x,z)=200N (i.e. the simi-
          larities differ by a  factor  200),  then  this  should
          really  reflect  that  the  similarity  of  y  to  x is
          neglectible compared with the similarity of z to x.

          For the rest, anything goes. Try to get  a  feeling  by
          experimenting.   Sometimes  it is a good idea to filter
          out  high-frequency  and/or  low-frequency  data,  i.e.
          nodes with either very many neighbours or extremely few
          neighbours.

   2.9    My input graph is directed. Is that bad?

          It depends. The class of directed graphs can be  viewed
          as  a  spectrum  going  from  undirected graphs to uni-
          directed  graphs.  Uni-directed  is  terminology  I  am
          inventing here, which I define as the property that for
          all node pairs i, j, at least one of s(i,j)  or  s(j,i)
          is  zero. In other words, if there is an arc going from
          i to j in a uni-directed graph, then there  is  no  arc
          going from j to i. I call a node pair i, j, almost uni-
          directed if s(i,j) << s(j,i) or vice versa, i.e. if the
          similarities differ by an order of magnitude.

          If  a  graph  does  not  have (large) subparts that are
          (almost) uni-directed, have a go with  mcl.  Otherwise,
          try  to  make your graph less uni-directed.  You are in
          charge, so do anything with your graph as you see  fit,
          but  preferably  abstain  from feeding mcl uni-directed
          graphs.

   2.10   Why does mcl like undirected graphs  and  why  does  it
          dislike uni-directed graphs so much?

          Mathematically,  the mcl iterands will be nice when the
          input graph is symmetric, where nice is  in  this  case
          diagonally symmetric to a semi-positive definite matrix
          (ignore as needed). For one thing, such  nice  matrices
          can  be interpreted as clusterings in a way that gener-
          alizes the interpretation of the mcl limit as  a  clus-
          tering  (if you are curious to these intermediate clus-
          terings, see faq entry 7.3).  See the  REFERENCES  sec-
          tion for pointers to mathematical publications.

          The reason that mcl dislikes uni-directed graphs is not
          very mcl specific, it has more to do with the  cluster-
          ing  problem  itself.   Somehow, directionality thwarts
          the notion of cluster structure.  [add].

   2.11   How do I check that my graph/matrix is  symmetric/undi-
          rected?

          Whether  your  graph is created by third-party software
          (e.g.  the  TribeMCL  module   (maintained   by   Anton
          Enright))  or  by custom sofware written by someone you
          know (e.g. yourself), it is advisable to  test  whether
          the  software generates symmetric matrices. This can be
          done as follows using the mcx  utility,  assuming  that
          you  want to test the matrix stored in file matrix.mci.
          The mcx utility should be available on your  system  if
          mcl was installed in the normal way.

          mcx /matrix.mci lm tp -1 mul add /check wm

          This  loads  the graph/matrix stored in matrix.mci into
          mcx's memory with the mcx lm primitive. -  the  leading
          slash  is  how strings are introduced in the stack lan-
          guage interpreted by mcx. The transpose of that  matrix
          is  then  pushed on the stack with the tp primitive and
          multiplied by minus one. The two  matrices  are  added,
          and  the  result  is  written  to  the file check.  The
          transposed matrix is the mirrored version of the origi-
          nal  matrix  stored in matrix.mci. If a graph/matrix is
          undirected/symmetric, the mirrored image is necessarily
          the  same,  so  if  you  subtract one from the other it
          should yield an all zero matrix.

          Thus, the file check should look like this:

          (mclheader
          mcltype matrix
          dimensions <num>x<num>
          )
          (mclmatrix
          begin
          )

          Where <num> is the same as in the file  matrix.mci.  If
          this  is  not the case, find out what's prohibiting you
          from feeding mcl  symmetric  matrices.  Note  that  any
          nonzero  entries  found  in  the matrix stored as check
          correspond to node pairs for which the arcs in the  two
          possible directions have different weight.

                           Speed and complexity

   3.1    How fast is mcl/MCL?

          It's fast - here is how and why. Let N be the number of
          nodes in the input graph. A straigtforward  implementa-
          tion  of  MCL  will  have  time  and  space  complexity
          respecively  O(N^3)  (i.e.  cubic  in  N)  and   O(N^2)
          (quadratic in N).  So you don't want one of those.

          mcl  implements a slightly perturbed version of the MCL
          process, as discussed  in  section  Resource  tuning  /
          accuracy.   Refer  to that section for a more extensive
          discussion of all the aspects involved. This section is
          only  concerned  with the high-level view of things and
          the nitty gritty complexity details.

          While computing the square of a matrix (the product  of
          that  matrix  with itself), mcl keeps the matrix sparse
          by allowing a certain maximum number of nonzero entries
          per  stochastic  column.  The maximum is one of the mcl
          parameters, and it is typically set  somewhere  between
          500 and 1000.  Call the maximum K.

          mcl's  time complexity is governed by the complexity of
          matrix squaring.  There are two sub-algorithms to  con-
          sider.  The  first  is  the  algorithm  responsible for
          assembling a new vector during  matrix  multiplication.
          This  algorithm  has  worst case complexity O(K^2). The
          pruning algorithm (which uses heap selection) has worst
          case  complexity  O(L*log(K)),  where  L is how large a
          newly computed matrix  column  can  get  before  it  is
          reduced  to at most K entries. L is bound by the small-
          est of the two numbers N and K^2 (the square of K), but
          on  average  L  will  be much smaller than that, as the
          presence of cluster structure aids in keeping the  fac-
          tor L low. [Related to this is the fact that clustering
          algorithms are actually used to compute  matrix  split-
          tings  that  minimize  the number of cross-computations
          when carrying out matrix multiplication among  multiple
          processors.]   In  actual cases of heavy usage, L is of
          order in the tens of thousands, and K is in  the  order
          of several hundreds up to a thousand.

          It  is  safe to say that in general the worst case com-
          plexity of mcl is  of  order  O(N*K^2);  for  extremely
          tight and dense graphs this might become O(N*N*log(K)).
          Still, these are worst  case  estimates,  and  observed
          running  times  for  actual  usage are much better than
          that.  (refer to faq 3.2).

          In this analysis, the number of iterations required  by
          mcl  was  not  included.  It is nearly always far below
          100. Only the first few iterations are  genuinely  time
          consuming;  the first few iterations (some number below
          10) are usually responsible for more than 95 percent of
          the running time.

          The  process of removing the smallest entries of a vec-
          tor is called pruning. mcl provides  extensive  facili-
          ties for monitoring and controlling the effect of prun-
          ing, and it will output statistics and a  summary  once
          it is done. More information is provided in the pruning
          section of the mcl manual and Section 4 in this FAQ.

          The space complexity is of order O(N*K).

   3.2    What statistics are available?

          Few. Some experiments are described  in  [4],  and  [5]
          mentions  large  graphs being clustered in very reason-
          able time. In protein clustering, mcl has been  applied
          to graphs with up to one million nodes, and on high-end
          hardware such graphs can  be  clustered  within  a  few
          hours.

   3.3    Does this implementation need to sort vectors?

          No,  it  does  not.  You might expect that one needs to
          sort a vector in order to obtain the K largest entries,
          but  a simpler mechanism called heap selection does the
          job nicely.  Selecting the K largest entries from a set
          of  L  by sorting would require O(L*log(L)) operations;
          heap selection requires O(L*log(K)) operations.

   3.4    mcl does not compute the ideal MCL process!

          Indeed it does not. What are the ramifications? Several
          entries  in  section Resource tuning / accuracy discuss
          this issue. For a synopsis,  consider  two  ends  of  a
          spectrum.

          On  the  one  end, a graph that has very strong cluster
          structure, with clearly  (and  not  necessarity  fully)
          separated  clusters.  This mcl implementation will cer-
          tainly retrieve those clusters if the graphs falls into
          the category of graphs for which mcl is applicable.  On
          the other end, consider a  graph  that  has  only  weak
          cluster  structure  superimposed  on  a background of a
          more or less random graph. There might sooner be a dif-
          ference  between  the  clustering  that  should ideally
          result and the one computed by mcl. Such a  graph  will
          have  a  large number of whimsical nodes that might end
          up either here or there, nodes that are of a peripheral
          nature, and for which the (cluster) destination is very
          sensitive to fluctutations in edge weights or algorithm
          parametrizations (any algorithm, not just mcl).

          One can say that the perturbation effect of the pruning
          process applied by mcl is just a small source of noise.
          Additionally,  graphs  at the noisy end of the spectrum
          will  generally  be  very  susceptible  to  changes  in
          parametrization  of  the MCL algorithm and process, and
          the perturbation caused by computing an imperfect  pro-
          cess  will  generally be small compared with the effect
          of changing parametrizations.

          Of course, there is the issue of very  large  and  very
          dense  graphs.   The  act of pruning will have a larger
          impact as graphs grow larger  and  denser.   Obviously,
          mcl  will have trouble dealing with such very large and
          very dense graphs - so will other methods.

          Finally,  there  is  the  engineering  approach,  which
          offers the possibility of pruning a whole lot of specu-
          lation. Do the experiments with mcl, try  it  out,  and
          see what's there to like and dislike.

   3.5    I've  read  elsewhere  that  MCL is slow [compared with
          XYZ]

          Presumably, they did not know mcl, and did not read the
          parts  in [1] and [2] that discuss implementation. Per-
          haps they assume or insist that the only way to  imple-
          ment  MCL  is  to  implement the ideal process. Or they
          have a really stupifyingly fast algorithm.   [One  such
          publication  is  Ulrik  Brandes,  Marco  Gaertler,  and
          Dorothea Wagner: Experiments on Graph Clustering  Algo-
          rithms.  Proc.  11th Europ. Symp. Algorithms (ESA '03),
          Springer LNCS, to appear].

                        Resource tuning / accuracy

   4.1    What do you mean by resource tuning?

          mcl computes a process in which stochastic matrices are
          alternately expanded and inflated. Expansion is nothing
          but standard matrix multiplication, inflation is a par-
          ticular way of rescaling the matrix entries.

          Expansion  causes  problems  in  terms of both time and
          space. mcl works with matrices of dimension N, where  N
          is  the number of nodes in the input graph.  If no pre-
          cautions are taken, the number of entries  in  the  mcl
          iterands  (which  are  stochastic  matrices)  will soon
          approach the square of N. The time it takes to  compute
          such a matrix will be proportional to the cube of N. If
          your input graph has 100.000 nodes, the memory require-
          ments  become  infeasible  and  the  time  requirements
          become impossible.

          What mcl does is perturbing the process it  computes  a
          little  by removing the smallest entries - it keeps its
          matrices sparse.   This  is  a  natural  thing  to  do,
          because  the matrices are sparse in a weighted sense (a
          very high proportion of the  stochastic  mass  is  con-
          tained in relatively few entries), and the process con-
          verges to matrices that are extremely sparse, with usu-
          ally no more than N entries.  It is thus known that the
          MCL iterands are sparse in a  weighted  sense  and  are
          usually  very  close to truly sparse matrices.  The way
          mcl perturbs its matrices is by the strategy  of  prun-
          ing,   selection,  and  recovery  that  is  extensively
          described in the mcl manual page.   The  question  then
          is:  What  is  the  effect  of this perturbation on the
          resulting clustering, i.e.  how  would  the  clustering
          resulting from a perfectly computed mcl process compare
          with the clustering I have on disk?  Faq entry 4.3 dis-
          cusses this issue.

          The amount of resources used by mcl is bounded in terms
          of the maximum number of neighbours a node  is  allowed
          to have during all computations.  Equivalently, this is
          the maximum number of nonzero entries a  matrix  column
          can possibly have. This number, finally, is the maximum
          of the the values corresponding  with  the  -S  and  -R
          options.   The  latter two are listed when using the -z
          option (see faq 8.1).

   4.2    How do I compute the maximum amount of  RAM  needed  by
          mcl?

          It is rougly equal to

          2 * s * K * N

          bytes, where 2 is the number of matrices held in memory
          by mcl, s is the size of a  single  cell  (c.q.  matrix
          entry  or  node/arc  specification), N is the number of
          nodes in the input graph, and where K is the maximum of
          the  values  corresponding  with  the -S and -R options
          (and this assumes that the average node degree  in  the
          input  graph  does not exceed K either). The value of s
          can be found by using the -z option. It  is  listed  in
          one  of  the  first  lines  of  the resulting output. s
          equals the size of an int plus the  size  of  a  float,
          which  will  be  8 on most systems.  The estimate above
          will in most cases be way too pessimistic (meaning  you
          do not need that amount of memory).

          The -how-much-ram option is provided by mcl for comput-
          ing the bound given above. This options takes as  argu-
          ment the number of nodes in the input graph.

          The  theoretically more precise upper bound is slightly
          larger due to overhead. It is something like

          ( 2 * s * (K + c)) * N

          where c is 5 or so, but one should not pay attention to
          such a small difference anyway.

   4.3    How  much does the mcl clustering differ from the clus-
          tering resulting from a perfectly computed MCL process?

          For  graphs  with  up until a few thousand nodes a per-
          fectly computed MCL process can be achieved by abstain-
          ing  from  pruning  and  doing full-blown matrix arith-
          metic. Of  course,  this  still  leaves  the  issue  of
          machine  precision,  but  let  us wholeheartedly ignore
          that.

          Such experiments give evidence (albeit incidental) that
          pruning  is  indeed really what it is thought to be - a
          small perturbation. In many cases,  the  'approximated'
          clustering  is  identical to the 'exact' clustering. In
          other cases, they are very close to each other in terms
          of  the  metric  split/join  distance  as  computed  by
          clmdist.  Some experiments with randomly generated test
          graphs, clustering, and pruning are described in [4].

          On  a different level of abstraction, note that pertur-
          bations of the inflation parameter will  also  lead  to
          perturbations in the resulting clusterings, and surely,
          large changes in the inflation parameter will  in  gen-
          eral   lead   to   large  shifts  in  the  clusterings.
          Node/cluster pairs that are different for the  approxi-
          mated  and the exact clustering will very likely corre-
          spond with nodes that are in a boundary region  between
          two or more clusters anyway, as the perturbation is not
          likely to move a node from one core  of  attraction  to
          another.

          Faq entry 4.6 has more to say about this subject.

   4.4    How do I know that I am using enough resources?

          In  mcl  parlance,  this  becomes how do I know that my
          -scheme parameter is high enough  or  more  elaborately
          how do I know that the values of the {-P, -S, -R, -pct}
          combo are high enough?

          There are several aspects. First, watch the jury  marks
          reported  by  mcl  when  it's done.  The jury marks are
          three grades, each out of 100. They indicate  how  well
          pruning went. If the marks are in the seventies, eight-
          ies, or nineties, mcl is probably doing fine.  If  they
          are in the eighties or lower, try to see if you can get
          the marks  higher  by  spending  more  resources  (e.g.
          increase the parameter to the -scheme option).

          Second,  you  can  do  multiple  mcl runs for different
          resource schemes, and compare the resulting clusterings
          using clmdist. See the clmdist manual for a case study.

   4.5    Where is the mathematical analysis of this mcl  pruning
          strategy?

          There is none. [add]

          Ok, the next entry gives an engineer's rule of thumb.

   4.6    What  qualitative  statements  can  be  made  about the
          effect of pruning?

          The more severe pruning is, the more the computed  pro-
          cess  will tend to converge prematurely. This will gen-
          erally lead to  finer-grained  clusterings.   In  cases
          where  pruning  was  severe,  the  mcl  clustering will
          likely be closer to a clustering ideally resulting from
          another  MCL  process with higher inflation value, than
          to the clustering ideally resulting from the  same  MCL
          process.  Strong support for this is found in a general
          observation illustrated by the following example.  Sup-
          pose u is a stochastic vector resulting from expansion:

          u   =  0.300 0.200 0.200 0.100 0.050 0.050 0.050 0.050

          Applying inflation with inflation value 2.0 to u gives

          v   =  0.474 0.211 0.211 0.053 0.013 0.013 0.013 0.013

          Now suppose we first apply pruning to u such that the 3
          largest  entries 0.300, 0.200 and 0.200 survive, throw-
          ing away 30 percent of the stochastic  mass  (which  is
          quite  a  lot  by  all  means).  We rescale those three
          entries and obtain

          u'  =  0.429 0.286 0.286 0.000 0.000 0.000 0.000 0.000

          Applying inflation with inflation value 2.0 to u' gives

          v'  =  0.529 0.235 0.235 0.000 0.000 0.000 0.000 0.000

          If we had applied inflation with inflation value 2.5 to
          u, we would have obtained

          v'' =  0.531 0.201 0.201 0.038 0.007 0.007 0.007 0.007

          The vectors v' and v'' are much closer  to  each  other
          than  the  vectors  v'  and v, illustrating the general
          idea.

          In practice, mcl should (on  average)  do  much  better
          than in this example.

   4.7    At  different  high  resource levels my clusterings are
          not identical.  How can I trust the output  clustering?

          Did  you  read  all other entries in this section? That
          should have reassured you somewhat, except perhaps  for
          Faq answer 4.5.

          You  need  not  feel uncomfortable with the clusterings
          still being different at high resource levels, if  ever
          so  slightly. In all likelihood, there are anyway nodes
          which are not in any core of attraction, and  that  are
          on  the  boundary between two or more clusterings. They
          may go one way or another,  and  these  are  the  nodes
          which will go different ways even at high resource lev-
          els. Such nodes may be stable in  clusterings  obtained
          for  lower inflation values (i.e. coarser clusterings),
          in which the  different  clusters  to  which  they  are
          attracted are merged.

          By  the  way, you do know all about clmdist, don't you?
          Because the statement that clusterings are not  identi-
          cal should be quantified: How much do they differ? This
          issue is discussed in the clmdist manual page - clmdist
          gives  you  a robust measure for the distance (dissimi-
          larity) between two clusterings.

          There are other means of gaining trust in a clustering,
          and  there  are  different issues at play. There is the
          matter of how accurately this mcl computed the mcl pro-
          cess,  and  there  is the matter of how well the chosen
          inflation parameter fits the data.  The  first  can  be
          judged  by  looking  at  the  jury  marks (faq 4.4) and
          applying clmdist to different clusterings.  The  second
          can  be judged by measurement (e.g. use clminfo) and/or
          inspection (use your judgment).

                        Tuning cluster granularity

   5.1    How do I tune cluster granularity?

          There are several ways for influencing  cluster  granu-
          larity.  These  ways and their relative merits are suc-
          cessively discussed below. The clmdist(1)  manual  con-
          tains an example of doing multiple mcl runs for finding
          granularily different clusterings, using the most  com-
          mon approach, namely that of varying inflation.

   5.2    The effect of inflation on cluster granularity.

          The  main  handle  for  changing  inflation  is  the -I
          option. This is also the principal handle for  regulat-
          ing  cluster  granularity. Unless you are mangling huge
          graphs it could be the only mcl option  you  ever  need
          besides the output redirection option -o.

          Increasing the value of -I will increase cluster granu-
          larity.  Conceivable values are from 1.1 to 5.0 or  so,
          but  the range of suitable values will certainly depend
          on your input graph. For many graphs, 1.1 will  be  far
          too low, and for many other graphs, 5.0 will be far too
          high. You will have to find the right value or range of
          values by experimenting, using your judgment, and using
          measurement tools such  as  clmdist  and  clminfo.  The
          default  2.0  is a good value to begin the experimental
          stage with.

          For experiments that are more subtle  with  respect  to
          inflation,  mcl  provides  the -i option in conjunction
          with the -l (small letter ell) option. Do this only  if
          you  have  the  intention of playing around with mcl in
          order to study the characteristics of the process  that
          it computes, and maybe, just maybe, use it in a produc-
          tion environment if you find it useful.  In  the  first
          vein, you may be interested to know that mcx is a stack
          language/interpreter in which the entire MCL  algorithm
          can be written in three lines of code. It provides com-
          prehensive  access  to  the  MCL   graph   and   matrix
          libraries.  However, the mcx interface to the MCL prun-
          ing facilities is not yet satisfactory at this time.

   5.3    The effect of similarity  distribution  homogeneity  on
          cluster granularity.

          How  similarities in the input graph were derived, con-
          structed, adapted, filtered  (et  cetera)  will  affect
          cluster granularity.  It is important that the similar-
          ities are honest; refer to faq 2.8.

          Another issue is that homogeneous similarities tend  to
          result in more coarse-grained clusterings. You can make
          a set of similarities more homogeneous by applying some
          function to all of them, e.g. for all pairs of nodes (x
          y) replace S(x,y) by the square root, the logarithm, or
          some  other  convex  function.  Note  that you need not
          worry about scaling, i.e. the possibly large changes in
          magnitude  of  the similarities. MCL is not affected by
          absolute magnitudes, it is only affected by  magnitudes
          taken relative to each other.

          Here  is  how  to  make  a  graph more homogeneous with
          respect to the weight function. Given  orig.mci,  clus-
          tering  revised.mci  as constructed below should gener-
          ally lead to coarser clusterings.

          mcx /orig.mci lm 0.5 hdp /revised.mci wm

          The parameter 0.5 can be changed to other values in the
          range  [0..1.0].   The  closer  it is to zero, the more
          clusterings will tend to be coarse.

          If the parameter is chosen larger than 1.0, say in  the
          range  [1.2..5.0] then clusterings will tend to be more
          finer-graine. For example,

          mcx /orig.mci lm 3.0 hdp /revised.mci wm

   5.4    The effect of initial centering on cluster granularity.

          This  refers  to  the -c parameter, which adds loops to
          the input  graph.  Its  default  value  is  1.0,  which
          results  in  loops  of a somehow 'neutral' weight to be
          added. If you need  to  really  fine-tune  granularity,
          this option can be of use, otherwise you should abstain
          from using it.   Increasing  its  value  will  increase
          cluster granularity.

          Conceivable/normal  values are in the range 1.0 to 5.0,
          but nothing stops you from  going  higher  or  slightly
          lower.  Going  lower  than 0.5 is definitely not a good
          idea.

          If you are into clustering at high levels of  granular-
          ity, there is the issue whether to further increase -I,
          or whether to start increasing or further increase  -c.
          It  will  really  depend  on the characteristics of the
          graph you are working with, and at this point in time I
          cannot  even  give advice in terms of a general catego-
          rization.  Experiment,  learn,  and  let  me  know  the
          results if you like.

   5.5    How to implement two-level approaches using mcl.

          If  changing  inflation does not yield clusterings that
          are sufficiently coarse to your liking,  you  may  con-
          sider  trying  a  two-level  approach.  Presumably your
          input graph is very large if you find yourself in  this
          situation.  You should be aware of the possibility that
          the graph you are clustering simply does not posses the
          type  of  coarse-grained structure that you are looking
          for.

          Two-level approaches can be implemented in a variety of
          ways,  and you may wish to invoke tools other than mcl.
          However, it is possible to  experiment  with  two-level
          approaches  using  mcl  and its associated utility mcx.
          Here is how, assuming your  original  graph  is  called
          orig.mci.

          Warning
          This  approach  is  a  little crude, and will suffer if
          (many) small clusters are present.

          mcl orig.mci -I 5.0 -c 3.0 -scheme 5 -o orig.i5.mco

          Cluster it first so that you get a  fine-grained  clus-
          tering.   Since  orig.mci  is  likely  a large graph, I
          opted for a high scheme.

          mcx /orig.i5.mco lm tp exch     # line continues
                      /orig.mci lm exch mul mul tp add /coarse.mci wm

          This transforms the clustering+graph into a  new  graph
          coarse.mci  where the clusters are nodes. You may, upon
          inspection, wish  to  change  the  homogeneity  of  the
          weight distribution by applying the method described in
          faq entry 5.3 - but  that's  something  best  left  for
          optionally  fine-tuning  this method once you decide it
          has merits.

          mcl coarse.mci -I 2.0 -c 0.0 -scheme 5 -o coarse.mco

          Cluster the coarsened graph, and keep the loops as com-
          puted in the coarsening step.

          mcx /orig.i5.mco lm /coarse.mco lm mul /projected.mco wm

          Project the 'coarsened' clustering back onto the origi-
          nal graph.  Now projected.mco should be a coarse  clus-
          ter for orig.mci.

          There  are  a lot of parameters to play with here; e.g.
          the 5.0, 3.0 and 2.0, and 1.0.  These  seem  reasonable
          defaults.

                      Implementing the MCL algorithm

   6.1    How easy is it to implement the MCL algorithm?

          Very  easy, if you will be doing small graphs only, say
          up to a few thousand entries at  most.  These  are  the
          basic ingredients:

          o  Adding  loops  to  the  input graph, conversion to a
             stochastic matrix.
          o  Matrix multiplication and matrix inflation.
          o  The interpretation function mapping MCL limits  onto
             clusterings.

          These  must  be  wrapped  in  a program that does graph
          input and  cluster  output,  alternates  multiplication
          (i.e.  expansion) and inflation in a loop, monitors the
          matrix iterands thus found, quits the loop when conver-
          gence is detected, and interprets the last iterand.

          Implementing  matrix  muliplication is a standard exer-
          cise. Implementing inflation  is  nearly  trivial.  The
          hardest  part  may actually be the interpretation func-
          tion, because you need to cover  the  corner  cases  of
          overlap  and  attractor  systems of cardinality greater
          than one.

          In Mathematica or Maple, this should be  doable  in  at
          most  50  lines of code.  For perl you may need 50 more
          lines - note that MCL does not use intricate and expen-
          sive  operations  such  as  matrix  inversion or matrix
          reductions. In lower level languages such as C a  basic
          MCL  program  may  need  a  few  hundred lines, but the
          largest part will probably be input/output  and  inter-
          pretation.

          It is perhaps even such that implementing the basic MCL
          algorithm makes a nice programming  exercise.  However,
          if  you  need  an implementation that scales to several
          hundreds of thousands of  nodes  and  possibly  beyond,
          then  your  duties become much heavier. This is because
          one needs to prune MCL iterands  (c.q.  matrices)  such
          that  they  remain  sparse. This must be done carefully
          and preferably in such a way that a  trade-off  between
          speed,  memory  usage, and potential losses or gains in
          accuracy can be controlled via monitoring  and  logging
          of  relevant characteristics.  Some other points are i)
          support for threading via  pthreads,  openMP,  or  some
          other  parallel  programming  API.   ii)  a  robust and
          generic interpretation function is written in terms  of
          weakly connected components.

           Cluster overlap / MCL iterand cluster interpretation

   7.1    Introduction

          A  natural  mapping  exists  of  MCL  iterands  to DAGs
          (directed acyclic graphs). This is because MCL iterands
          are  generally  diagonally positive semi-definite - see
          [3].  Such a DAG can be interpreted  as  a  clustering,
          simply  by  taking as cores all endnodes (sinks) of the
          DAG, and by attaching to each core all the  nodes  that
          reach it. This procedure may result in clusterings con-
          taining overlap.

          In the MCL limit, the associated DAG has in  general  a
          very degenerated (no offense meant) form, which induces
          overlap only on very  rare  occasions  (see  faq  entry
          7.2).

          Interpreting  mcl  iterands  as clusterings may well be
          interesting.  Few experiments have been done so far. It
          is  clear  though that early iterands generally contain
          the most overlap  (when  interpreted  as  clusterings).
          Overlap   dissappears   soon   as   the  iterand  index
          increases. For  more  information,  consult  the  other
          entries in this section and the clmimac manual page.

   7.2    Can the clusterings returned by mcl contain overlap?

          No.  Clusterings  resulting from the abstract MCL algo-
          rithm may in theory contain overlap,  but  the  default
          behaviour  in  mcl  is to remove it should it occur, by
          allocating the nodes in overlap to the first cluster in
          which  they are seen. mcl will warn you if this occurs.
          This behaviour is switched off by supplying  --overlap.

          Do  note that overlap is mostly a theoretical possibil-
          ity.  It is conjectured that it requires  the  presence
          of  very  strong  symmetries in the input graph, to the
          extent that there exists an automorphism of  the  input
          graph mapping the overlapping part onto itself.

          It  is  possible  to construct (highly symmetric) input
          graphs leading to cluster overlap. Examples of  overlap
          in  which  a  few  nodes  are involved are easy to con-
          struct; examples with many nodes are exceptionally hard
          to construct.

          Clusterings   associated  with  intermediate/early  MCL
          iterands may very well contain overlap, see the  intro-
          duction in this section and other entries.

   7.3    How  do  I  obtain  the clusterings associated with MCL
          iterands?

          There are two options. If you are interested  in  clus-
          terings  containing overlap, you should go for the sec-
          ond. If not, use the first, but beware that the result-
          ing clusterings may contain overlap.

          The  first  solution  is  to use -dump cls (probably in
          conjunction with either -L or -dumpi in order to  limit
          the number of matrices written). This will cause mcl to
          write the clustering generically associated  with  each
          iterand to file. The -dumpstem option may be convenient
          as well.

          The second solution is  to  use  the  -dump ite  option
          (-dumpi  and  -dumpstem may be of use again). This will
          cause mcl to write the intermediate iterands  to  file.
          After  that, you can apply clmimac (interpret matrix as
          clustering) to those iterands.  clmimac  has  a  -tight
          parameter  which  affects  the  mapping  of matrices to
          clusterings. It takes a value  between  0  and  100  as
          argument.  The  default is 100 and corresponds with the
          strict mapping. Lowering the -tight value  will  gener-
          ally  result  in  clusterings  containing more overlap.
          This will have the largest effect for  early  iterands;
          its   effect   will   diminish  as  the  iterand  index
          increases.

          When set to 0, the  -tight  parameter  results  in  the
          clustering  associated  with the DAG associated with an
          MCL iterand as described in [3].  This  DAG  is  pruned
          (thus  possibly  resulting in less overlap in the clus-
          tering) by increasing the -tight parameter. [add]

                               Miscellaneous

   8.1    How do I find the default settings of mcl?

          Use -z to find out the actual settings - it  shows  the
          settings  as  resulting  from  the command line options
          (e.g. the default settings  if  no  other  options  are
          given).

   8.2    What's next?

          I'd like to port MCL to cluster computing, using one of
          the PVM, MPI, or  openMP  frameworks.   For  the  1.002
          release,  mcl's  internals were rewritten to allow more
          general matrix computations. Among other things,  mcl's
          data  structures  and primitive operations are now more
          suited to be employed in a distributed computing  envi-
          ronment.  However,  much  remains to be done before mcl
          can operate in such an environment.

          At some point in the future a second, xml-based,  ascii
          input format may be introduced.

          If you feel that mcl should support some other standard
          matrix format, let us know.

  BUGS
          This FAQ tries to compromise between being concise  and
          comprehensive. The collection of answers should prefer-
          ably cover the universe  of  questions  at  a  pleasant
          level of semantic granularity without too much overlap.
          It should offer value to people interested in  cluster-
          ing but without sound mathematical training. Therefore,
          if this FAQ has not  failed  somewhere,  it  must  have
          failed.

          Send  criticism and missing questions for consideration
          to mcl-faq at micans.org.

  AUTHOR
          Stijn van Dongen.

  SEE ALSO
          mcxio(5), mcl, mclfaq, mclpipeline,  mcxassemble,  mcx-
          subs,  mcxconvert,  mcxmap,  clmdist, clminfo, clmmeet,
          clmimac, clmresidue, clmformat.

          If enabled in this installation:  mclblastline,  mcxde-
          blast.

          mcl's home at http://micans.org/mcl/.

  REFERENCES
          [1]  Stijn van Dongen. Graph Clustering by Flow Simula-
          tion.  PhD thesis, University of Utrecht, May 2000.
          http://www.library.uu.nl/digia-
          rchief/dip/diss/1895620/inhoud.htm

          [2]  Stijn  van Dongen. A cluster algorithm for graphs.
          Technical Report INS-R0010, National Research Institute
          for  Mathematics  and  Computer  Science in the Nether-
          lands, Amsterdam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z

          [3] Stijn van Dongen. A stochastic  uncoupling  process
          for   graphs.   Technical  Report  INS-R0011,  National
          Research Institute for Mathematics and Computer Science
          in the Netherlands, Amsterdam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z

          [4]  Stijn  van  Dongen. Performance criteria for graph
          clustering and Markov  cluster  experiments.  Technical
          Report INS-R0012, National Research Institute for Math-
          ematics and Computer Science in the  Netherlands,  Ams-
          terdam, May 2000.
          http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z

          [5]  Enright  A.J.,  Van  Dongen  S., Ouzounis C.A.  An
          efficient algorithm for large-scale detection  of  pro-
          tein  families,  Nucleic Acids Research 30(7):1575-1584
          (2002).

  NOTES
          This  page  was  generated  from  ZOEM  manual  macros,
          http://micans.org/zoem. Both html and roff pages can be
          created from the same source without having  to  bother
          with  all  the usual conversion problems, while keeping
          some level of sophistication in the typesetting.



  MCL FAQ 1.003, 03-276       3 Oct 2003                 MCL FAQ(1)
