Building and testing an OpenMosix cluster

Adam Funk

14 February 2005


1 Introduction

This report describes the installation and preliminary testing of an OpenMosix cluster on Debian GNU/Linux using three spare machines from the ``computer graveyard''. The OpenMosix kernel extensions and daemon are designed to distribute processes among networked cluster machines in imitation of a multiprocessor machine.

2 Installation, configuration and expansion

2.1 Installation and initial configuration

This section explains the procedure I discovered and used for building the cluster; I hope anyone else interested in building an OpenMosix cluster will find this information useful. I was pleasantly surprised to find that my lack of experience with kernel hacking and compilation was not a problem, thanks to the excellent instructions given by Buytaert (2), which are incorporated below.
  1. Download the bootable Debian minimal installation CD.
    http://ftp.fi.debian.org/debian-minicd/
  2. Assemble three machines and connect them through a hub to the internet. In our test cluster, each machine has a Pentium II 350 MHz, a CD-ROM drive, 128 MB RAM, a 4.3 GB hard drive and an Intel Pro/100 NIC.
  3. On each node, boot the CD and install a basic Debian stable system. (I skipped tasksel and dselect.) Edit /etc/apt/sources.list (and possibly /etc/apt/apt.conf). Then update, upgrade and add more packages. (Not all of the following packages are necessary to run OpenMosix, but they are useful. Emacs is optional according to your religious preferences.)
    # apt-get update
    # apt-get upgrade
    # apt-get install apt-show-versions dnsutils emacs21 file gcc \
     gnupg kernel-package kernel-patch-openmosix less libncurses5-dev \
     libstdc++2.10-dev lynx-ssl mps ntp-simple ntpdate openmosix \
     openmosix-dev screen ssh tree wget nload
    
    When configuring the ssh package, choose to run the ssh server (daemon). Remote logins will be very useful.
  4. On the first node, install a newer kernel such as the kernel-image-2.4.16-686 package. (I believe this is worthwhile in order to get all the hardware working under this revision of the kernel before the next step.) Adjust /etc/lilo.conf as necessary. Edit /etc/modules.conf and add the module for the network card (e.g. eepro100 for the Intel Pro/100 card; an example alias line is given just after this list). Further modules might be needed for other hardware, because the CD's 2.2.20-idepci installation kernel has far more hardware support compiled in than normal kernels do.

    Get the first node to boot the new kernel successfully--with functional networking!--and then repeat this step to upgrade the kernel on the other nodes.

  5. Fetch the stock kernel source and verify the signature against key 0x517D0F0E. (The patches provided by the Debian stable kernel-patch-openmosix packages are compatible only with the standard 2.4.16 kernel source--not with other versions or the Debian kernel source. The testing and unstable versions of the package have different requirements.)
    $ wget http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.16.tar.bz2
    $ wget http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.16.tar.bz2.sign
    $ gpg linux-2.4.16.tar.bz2.sign
    
  6. Unpack, patch, compile and install the kernel.
    # cd /usr/src
    # tar xjf /path/to/linux-2.4.16.tar.bz2
    # cd linux
    # ../kernel-patches/i386/apply/openmosix
    # make menuconfig
    
    Enable the kernel options recommended by Robbins (5) in the configuration menu. Then build and install the kernel package.
    # make-kpkg kernel_image modules_image
    # cd ..
    # dpkg -i kernel-image-2.4.16_10.00.Custom_i386.deb
    
    Reboot the system with the new kernel.
    # shutdown -r now
    ...
    $ uname -a
    Linux dmtest1 2.4.16 #1 SMP Fri Jan 28 09:05:01 GMT 2005 i686 unknown
    
    Once this works, repeat this step to create and install the OpenMosix kernel on the other nodes.
  7. Edit /etc/openmosix.map. Comment out the loopback address and add the IP addresses of all the nodes in the cluster, as shown in the following example.
    #1 127.0.0.1 1
    1 192.84.82.230  1
    2 192.84.82.231  1
    3 192.84.82.232  1
    
  8. Create the following symlinks. (They might not be necessary, but the daemon complains if they are missing from /usr/bin/.)
    # cd /usr/bin/
    # ln -s /bin/mosctl
    # ln -s /bin/mosmon
    # ln -s /bin/mosrun
    
  9. Enable MFS. Start by adding the following line to /etc/fstab.
    none  /mfs  mfs  noauto,dfsa=1  0 0
    
    Edit /etc/default/openmosix and change the line MFS=no to MFS=yes. Mount MFS and start OpenMosix as follows.
    # mkdir /mfs
    # mount /mfs/
    # /etc/init.d/openmosix restart
    Restarting OPENMOSIX...
    All remote processes were expelled and no further remote processes 
    accepted.
    All local processes were brought back and will stay here.
    Local processes may now leave the node automatically.
    automatic load-balancing already enabled.
    Remote processes now allowed in.
    MFS access to this node now allowed.
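
To illustrate the network-module configuration mentioned in step 4: for the Intel Pro/100 card a single alias line of the following form is enough. (On Debian the canonical method is to put the line in a file under /etc/modutils/ and run update-modules, which regenerates /etc/modules.conf.)

    alias eth0 eepro100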
    

2.2 Expansion

I have not yet tried adding further nodes to our cluster, but I believe the procedure would be to follow the steps above on each new node, add all the new nodes' IP addresses to each node's /etc/openmosix.map file, and then restart the OpenMosix daemon on each node.
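
For example, adding a fourth node with the (hypothetical) address 192.84.82.233 would mean installing it as above, putting the following map on all four nodes, and restarting the daemon on each of them with /etc/init.d/openmosix restart.

    #1 127.0.0.1 1
    1 192.84.82.230  1
    2 192.84.82.231  1
    3 192.84.82.232  1
    4 192.84.82.233  1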

2.3 Configuration options

Our cluster currently uses local /etc/passwd and related files that happen to have the same accounts on every node. Ideally each node should obtain user account details from a central source using LDAP or NIS.
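
As a sketch only (our cluster does not do this yet; it assumes a working NIS domain and the Debian nis package on each node), the client side would mainly amount to pointing the name-service switch at NIS in /etc/nsswitch.conf:

    passwd:  files nis
    group:   files nis
    shadow:  files nis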

By default all nodes in an OpenMosix cluster are equal: users can log in and start programs on any node, and processes can migrate between any nodes. It is, however, possible to designate a master node from which processes can migrate out but to which they cannot migrate in. This would make it less likely that users find themselves logging in and starting programs on an overloaded machine.
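
I have not configured this; as a minimal sketch (assuming mosctl's expel and block operations, which correspond to the messages printed when the init script restarts the daemon), the designated master node would run the following.

    # mosctl expel
    # mosctl block

Here mosctl expel sends guest processes back to their home nodes and mosctl block refuses the arrival of further remote processes, while local processes remain free to migrate away.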

3 Tests

The tests compare the cluster with porthos, a desktop workstation with a Pentium IV 1800 MHz processor. The OpenMosix cluster consists of three dedicated Pentium II 350 MHz machines with 100 Mb/s network cards.

The results for the quantitative tests show real (clock time), user (user CPU time) and sys (system CPU time) as reported by the standard time command.
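
For example, the three figures for test (A) of the 40-thread C++ test in Section 3.2.1 come from output of this form (bash's built-in time):

    $ time ./multifibo 40
    ...
    real    0m13.640s
    user    0m12.400s
    sys     0m0.030s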

3.1 Bash and Perl test

This test consists of a Bash script that launches in the background a large number of Perl programs, each of which takes a random amount of time to run. (Appendix A.1 gives the source code.)

This test qualitatively demonstrates that OpenMosix migrates processes between the nodes in response to the load. Running mosmon in another console (on any node of the cluster) produces a live bar graph of the process load on all three nodes, as shown in Appendix B. The mtop utility produces a live table similar to top; it lists the processes that were started on the local node and indicates the number of the node on which each one is currently running.
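
For reference, the monitoring commands as run on any node of the cluster:

    $ mosmon
    $ mtop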

3.2 C++ tests

3.2.1 Homemade test

The program (listed in Appendix A.2) takes one argument: the number of threads to create. For each integer from 1 to that number, the program creates a thread (a process using fork()) which recursively calculates the nth Fibonacci number and prints it. Each thread takes a different length of time to finish. The main program then waits for all the threads to finish before terminating. In both cluster tests mosmon showed good migration of processes between the nodes.
Tests:
  (A) on porthos
  (B) on the cluster
  (C) on one of the cluster machines with OpenMosix disabled

Results for 40 threads ($ time ./multifibo 40):

  Test                      real        user        sys
  (A) porthos               13.640s     12.400s     0.030s
  (B) cluster               27.104s     48.680s     0.320s
  (C) one cluster machine   48.634s     48.580s     0.050s

Results for 45 threads ($ time ./multifibo 45):

  Test                      real        user        sys
  (A) porthos               3m25.045s   3m09.330s   0m00.240s
  (B) cluster               4m20.371s   8m58.880s   0m01.210s
  (C) one cluster machine   8m58.827s   8m57.670s   0m01.120s

3.2.2 DistKeyGen test

I downloaded DistKeyGen.C from Chen (3); it starts five threads (using fork()), each of which generates 1000 RSA keypairs. In the cluster test mosmon showed good distribution of processes between the nodes.
Tests:
  (A) on porthos
  (B) on the cluster
  (C) on one of the cluster machines with OpenMosix disabled

Results ($ time ./DistKeyGen):

  Test                      real         user         sys
  (A) porthos               30m39.566s   29m30.550s   0m9.250s
  (B) cluster               31m11.394s   86m58.080s   1m36.970s
  (C) one cluster machine   85m23.610s   85m01.460s   0m21.440s

3.3 Java tests

3.3.1 Testing Java itself

The program (listed in Appendix A.3) takes one argument: the number of threads to create. For each integer from 1 to that number, the program creates a thread (using the Java standard library's Thread class) which recursively calculates the nth Fibonacci number and prints it. Each thread takes a different amount of time to finish. The main program then waits for all the threads to finish before terminating.

The C++ and Java programs cannot be fairly compared with each other, since they use different ways of creating threads and determining when the threads have finished.

Various web pages and personal communications with users of OpenMosix indicate that Java programs can be distributed only as green threads (if at all). Green threads are not supported--even as an option--by any version of Java above 1.3.1.

The Java SDK on porthos (1.4.1) does not provide green threads. The Java SDK on the cluster (1.3.1) should--according to the documentation, at least--provide either thread model according to the -green or -native option, or use one of these two models as the default.

Tests:
  (A) on porthos:                        $ java Multifibo 40
  (B) on the cluster using the default:  $ java Multifibo 40
  (C) on the cluster specifying native:  $ java -native Multifibo 40
  (D) on the cluster specifying green:   $ java -green Multifibo 40

Results for 40 threads:

  Test                  real        user        sys
  (A) porthos           14.193s     13.490s     0.050s
  (B) cluster default   44.292s     42.010s     1.040s
  (C) cluster -native   5m41.909s   5m39.980s   0.940s
  (D) cluster -green    5m38.563s   5m37.700s   0.690s

Mysteriously, specifying either threading option caused the program to run much more slowly than taking the default--which should be the same as one of the two options. None of the tests on the cluster worked properly: in all cases mosmon indicated that the Java threads were never distributed across the OpenMosix nodes; all the Java processes remained on the node on which the program was started.

I launched two separate multi-threaded Java programs on node 1 of the cluster and found that both of them stayed entirely on that node.

I also found that fully compiled Java programs (ELF executables produced by the GNU Java Compiler gcj 3.0.4) refused to migrate, as reported by Ghilardi (4).
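
For reference, an executable of this kind would be produced with a gcj command along these lines (the exact invocation used for the test was not recorded; --main names the class whose main method becomes the entry point):

    $ gcj --main=Multifibo -o multifibo Multifibo.java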

My discussions with an OpenMosix administrator and examination of material on the WWW suggest that defects inherent in Java itself prevent its processes from migrating properly.

3.3.2 Testing Java with other processes

Using mosmon and mtop to monitor the cluster, I started several Java programs at the same time on node 2, and then launched the homemade C++ test on node 1. Although the Java processes remained on node 2, OpenMosix took them into account while migrating the C++ processes to balance the processor loads; as the Java programs finished, C++ processes migrated into node 2.

I started the homemade C++ test on node 1 and waited until some of its threads had migrated into node 2, where I then started several Java programs. Although the Java processes stayed on node 2, the C++ processes migrated away from node 2 to balance the loads.

I started both the Java programs and the C++ programs on node 1 several times and found again that although the Java programs did not migrate, the others migrated to balance the loads as well as possible.

3.4 SWI-Prolog test

I compiled and installed SWI-Prolog (11) with the multithreaded option and ran the program listed in Appendix A.4. The processes did not (according to mtop and mosmon) distribute but stayed on the node on which the program was launched. This is not surprising because ``SWI-Prolog multi-threading is based on the POSIX thread standard ... on most popular systems except for MS-Windows'' (11, §8) and Ghilardi (4) states that POSIX threads do not migrate.

4 Conclusion

For the C++ program, the benefit from running the threads concurrently on the cluster exceeds the distribution overheads. The real (clock) time for running the program on the cluster (of three 350 MHz machines) compares reasonably with that on a single, much faster machine (1800 MHz).

Although Java threads and whole programs refuse to migrate between nodes, OpenMosix takes them into account when migrating other processes to balance the processor loads. Java itself does not take advantage of the parallel processing, but nothing prevents the use of Java programs on the cluster.

5 Plans

A. Source code


A.1 The Bash and Perl test

A.1.1 testmosix1.sh

#!/bin/sh
# Launch one background Perl job (testmosix1a.perl) for each file in /etc,
# appending all output to a single dump file.
DUMP="$HOME/dump"
cat /dev/null >$DUMP

for i in `ls /etc/`
  do 
  ~/bin/testmosix1a.perl /etc/$i >> $DUMP &
done

A.1.2 testmosix1a.perl

#!/usr/bin/perl
# Busy-loop for a pseudo-random length of time derived from an mcookie
# (MD5) hash of the file given as the first argument.

$file = shift(@ARGV) ;

if (-r $file) {

    $mc = `mcookie -f $file` ;
    chomp($mc) ;
    $mc1 = $mc ;
    $mc1 =~ s/[^\d]//g ;
    $mc2 = $mc1 ;
    $x = 0 ;
    $mc2 = 9999 + ($mc1 % 999999) ;
    
    for ($i=0 ; $i< $mc2 ; $i++) {
        $x = ($x + rand($mc1)) * ($x - rand($mc1)) ;
    }

    print "$file\n $mc\n $mc1\n $i\n $x\n" ;
}

else {
    print "$file\n not readable\n" ;
}


A.2 The C++ test: multifibo.C

#include<iostream>
#include<cstdio>
#include<sys/types.h>
#include<sys/wait.h>
#include<unistd.h>
#include<stdlib.h>
#include<map>

/* table of PIDs */
std::map<int, pid_t> pid_list ;

/* Recursive definition of Fibonacci numbers */
int fibor (int input) {
  int output = 1 ;
  if (input > 1) {
    output = fibor(input-1) + fibor(input-2) ;
  }
  return (output) ;
}

/***** MAIN *****/
int main (int argc, char *argv[]) {
  int max = 35 ; /* default value */
  if ( argc > 1 ) {
    std::sscanf(argv[1], "%i", &max) ;
  }

  /* start the threads */
  int loop;
  for ( loop=1 ; loop <= max ; ++loop ) {
    pid_t  pid = fork() ;
    if (pid < 0) {
      std::cout << "Fork failed!" ;
      exit (3) ;
    }
    else if (pid == 0 ) {
      std::cout << loop << " -> started\n" ;
      int f = fibor(loop) ;
      std::cout << loop << " -> " << f << "\n" ;
      exit(0) ;
      /* anything after this must be in the main thread */
    }
    else {
      pid_list.insert(std::make_pair(loop, pid)) ;
    }
  }
  
  /* wait for the threads to finish */
  int status ;
  std::map<int, pid_t>::iterator pid_iter ;
  for ( pid_iter = pid_list.begin() ;
        pid_iter != pid_list.end() ; 
        ++pid_iter ) {
    pid_t pid = pid_iter->second ;
    waitpid(pid, &status, 0);
    std::cout << pid_iter->first << " -> done [" << pid << "]\n" ;
  }
  std::cout << "Done!\n" ;
  return(0) ;
}


A.3 The Java test: Multifibo.java

import java.util.* ;

public class Multifibo {
    private Counter counter ;

    public static void main(String args[]) {
        int max = 30 ;  /* default value */
        try {
            max = Integer.parseInt(args[0]) ;
        }
        /* ignore if wrong */
        catch (NumberFormatException x) { }  
        catch (ArrayIndexOutOfBoundsException x) { } 
        catch (Exception x) {
            x.printStackTrace() ;
        }
        Multifibo foo = new Multifibo(max) ;
    }

    public Multifibo(int max) {
        fake_main(max) ;
    }

    public void fake_main(int end) {
        int start  =  1  ;
        counter = new Counter() ;

        make_threads(start,   end, 1) ;
        
        while (counter.value() > 0) {
            counter.advise() ;
            try {
                Thread.sleep(100) ;
            }
            catch(Exception x) {
                x.printStackTrace() ;
            }
        }

        System.out.println("Done!") ;
        System.exit(0) ;
    }

    private void make_threads(int start, int end, int increment) {
        int loop ; 
        for ( loop = start ; loop <= end ; loop = loop+increment ) {
            Fibothread new_thread = new Fibothread(loop, counter) ;
            new_thread.start() ;
            counter.count_up() ;
            System.out.println(loop + " -> started") ;
        }
    }
} // end class

class Fibothread extends Thread {
    int input, output ;
    Counter counter ;

    public Fibothread(int n, Counter counter_ref ) {
        input = n ;
        counter = counter_ref ;
    }

    private int fibor( int n ) {
        int result = 1 ;
        if (n > 1) {
            result = fibor(n-1) + fibor(n-2) ;
        }
        return result ;
    }

    public void run() {
        //      counter.count_up() ;
        output = fibor(input) ;
        System.out.println( input + " -> " + output ) ;
        counter.count_down() ;
    }
} // end class

class Counter {
    int thread_counter ;

    public Counter() {
        thread_counter = 0 ;
        advise() ;
    }

    public synchronized void count_up() {
        thread_counter++ ;
        advise() ;
    }

    public synchronized void count_down() {
        thread_counter-- ;
        advise() ;
    }

    public synchronized void advise() {
        System.out.println("[" + thread_counter +"]") ;
    }

    public synchronized int value() {
        return thread_counter ;
    }

} // end class Counter


A.4 The SWI-Prolog test: multifibo.pl

#!/usr/local/bin/pl -g main -s

/* Recursive definition of Fibonacci */
fibor(IN, 1) :-
        IN < 2,
        !.

fibor(IN, OUT) :-
        M is (IN - 1),
        N is (IN - 2),
        fibor(M, FM),
        fibor(N, FN),
        OUT is (FM + FN).

/* Parse argument and run */
main :-
        current_prolog_flag(argv, Argv),
        append(_, ['--', N | _], Argv),
        atomic(N),
        atom_chars(N, NC),
        number_chars(NN, NC),
        number(NN),
        !, 
        make_threads(NN, TL),
        wait_for_threads(TL),
        halt.

main :-
        make_threads(35, TL),  /* default value */
        wait_for_threads(TL),
        halt.

/* Calculate and print */
fibor_thread(IN) :-
        fibor(IN, OUT),
        format("~w -> ~w~n", [IN, OUT]).

/* Make a thread for each number from 1 to N */
make_threads(0, []) :-
        !.

make_threads(N, [T0|TL1]) :-
        N1 is (N - 1),
        make_threads(N1, TL1),
        thread_create(fibor_thread(N), T0, []),
        format("~w -> started~n", [N]).

/* Wait for threads to finish */
wait_for_threads([]).

wait_for_threads([T|TL]) :-
        thread_join(T, _),
        wait_for_threads(TL).


B. Screenshot of mosmon

[Screenshot mosmon-shot.png: mosmon's live bar graph of the processor load on the three cluster nodes.]

C. This document

This report is available as HTML and PostScript. It was written on 1 February 2005 and last revised on 14 February 2005.

Bibliography

1. Moshe Bar. The OpenMosix project, 2004. URL http://openmosix.sourceforge.net/.

2. Kris Buytaert. The OpenMosix HOWTO, June 2004. URL http://howto.ipng.be/openMosix-HOWTO/.

3. Ying-Hung Chen. Current status for MOSIX clusters, 2001. URL http://ying.yingternet.com/mosix/.

4. Gian Paolo Ghilardi. Consideration on OpenMosix. In Workshop Linux Cluster: the openMosix approach, November 2002. URL http://www.democritos.it/events/openMosix/papers/crema.pdf.

5. Daniel Robbins. OpenMosix: Part 2 of 3. Whitepaper 20445, Intel, 2005. URL http://www.intel.com/cd/ids/developer/asmo-na/eng/20445.htm.

6. Gustavo Noronha Silva. The APT HOWTO, 2004. URL http://www.debian.org/doc/manuals/apt-howto/index.en.html.

7. Miroslav Skoric. LILO mini-HOWTO, October 2004. URL http://www.tldp.org/HOWTO/LILO.html.

8. SPI. Debian GNU/Linux. Software in the Public Interest, Inc., 2005. URL http://www.debian.org/.

9. University of Waterloo Bioinformatics Research Group. What is OpenMosix?, November 2004. URL https://www.bioinformatics.uwaterloo.ca/wiki/index.php?openMosix.

10. Tomas Vinar, Michael Jarrett, and Ross Jordan. Java green threads. Usenet thread on uw.linux and uw.mfcf.questions, November 2003. Message-ID: <Pine.SOL.4.44.0311261432500.28085-100000@fe02.math.uwaterloo.ca>.

11. Jan Wielemaker. SWI-Prolog 5.4.1 Reference Manual. University of Amsterdam, 2004. URL http://hcs.science.uva.nl/projects/SWI-Prolog/Manual/index.html.
