=head1 NAME GRID::Machine::perlparintro - A brief and basic introduction to Parallel Distributed Computing in Perl =head1 SYNOPSIS $ time gridpipes.pl 1 1000000000 Process 0: machine = beowulf partial = 3.141593 pi = 3.141593 Pi = 3.141593. N = 1000000000 Time = 27.058693 real 0m28.917s user 0m0.584s sys 0m0.192s pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000 Process 0: machine = beowulf partial = 1.570796 pi = 1.570796 Process 1: machine = orion partial = 1.570796 pi = 3.141592 Pi = 3.141592. N = 1000000000 Time = 15.094719 real 0m17.684s user 0m0.904s sys 0m0.260s pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 3 1000000000 Process 0: machine = beowulf partial = 1.047198 pi = 1.047198 Process 1: machine = orion partial = 1.047198 pi = 2.094396 Process 2: machine = nereida partial = 1.047198 pi = 3.141594 Pi = 3.141594. N = 1000000000 Time = 10.971036 real 0m13.700s user 0m0.952s sys 0m0.240s =head1 SUMMARY The total computational power of institutions as a whole has dramatically rised in the last decades, but due to distributed ownership and administration restrictions, individuals are not able to capitalize such computing power. Many machines sit idle for very long periods of time while their owners are busy doing other things. Many of them run some sort of UNIX, have Perl installed and provide SSH access. If such is your scenario you can use L to have perl interpreters running in those nodes and make them collaborate to give you more computational power and more fun. All this without having to ask administrators or having to install any additional software. This tutorial introduces the basics of parallel computing by means of a simple program that distributes the evaluation of some mathematical expression between several machines. The computational results show that - when the problem is large enough - a substantial improving is gained in performance: The execution times is reduced to the half by using two machines. =head1 REQUIREMENTS To experiment with the examples in this tutorial you will need at least two Unix machines with Perl and SSH. If you are not familiar with Perl or Linux this module probably isn't for you. If you are not familiar with SSH, see =over 2 =item * I by Daniel J. Barrett and Richard E. Silverman. O'Reilly =item * L =item * Man pages of C, C, C, C, C, C, C =item * L =item * Linux Focus article L by Erdal Mutlu I =back =head1 BUILDING A "FUZZY" PARALLEL CLUSTER SSH includes the ability to authenticate users using public keys. Instead of authenticating the user with a password, the SSH server on the remote machine will verify a challenge signed by the user's I against its copy of the user's I. To achieve this automatic ssh-authentication you have to: =over 2 =item * Generate a public key use the C utility. For example: local.machine$ ssh-keygen -t rsa -N '' The option C<-t> selects the type of key you want to generate. There are three types of keys: I, I and I. The C<-N> option is followed by the I. The C<-N ''> setting indicates that no pasphrase will be used. This is useful when used with key restrictions or when dealing with cron jobs, batch commands and automatic processing which is the context in which this module was designed. If still you don't like to have a private key without passphrase, provide a passphrase and use C to avoid the inconvenience of typing the passphrase each time. C is a program you run once per login sesion and load your keys into. From that moment on, any C client will contact C and no more passphrase typing will be needed. By default, your identification will be saved in a file C. Your public key will be saved in C. =item * Once you have generated a key pair, you must install the public key on the remote machine. To do it, append the public component of the key in /home/user/.ssh/id_rsa.pub to file /home/user/.ssh/authorized_keys on the remote machine. If the C script is available, you can do it using: local.machine$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@remote.machine Alternatively you can write the following command: $ ssh remote.machine "umask 077; cat >> .ssh/authorized_keys" < /home/user/.ssh/id_rsa.pub The C command is needed since the SSH server will refuse to read a C files which have loose permissions. =item * Edit your local configuration file C (see C in UNIX) and create a new section for C connections to that host. Here follows an example: ... # A new section inside the config file: # it will be used when writing a command like: # $ ssh gridyum Host gridyum # My username in the remote machine user my_login_in_the_remote_machine # The actual name of the machine: by default the one provided in the # command line Hostname real.machine.name # The port to use: by default 22 Port 2048 # The identitiy pair to use. By default ~/.ssh/id_rsa and ~/.ssh/id_dsa IdentityFile /home/user/.ssh/yumid # Useful to detect a broken network BatchMode yes # Useful when the home directory is shared across machines, # to avoid warnings about changed host keys when connecting # to local host NoHostAuthenticationForLocalhost yes # Another section ... Host another.remote.machine an.alias.for.this.machine user mylogin_there ... This way you don't have to specify your I name on the remote machine even if it differs from your I name in the local machine, you don't have to specify the I if it isn't 22, etc. This is the I way to work with C. Avoid cluttering the constructor C. =item * Once the public key is installed on the server you should be able to authenticate using your private key $ ssh remote.machine Linux remote.machine 2.6.15-1-686-smp #2 SMP Mon Mar 6 15:34:50 UTC 2006 i686 Last login: Sat Jul 7 13:34:00 2007 from local.machine user@remote.machine:~$ You can also automatically execute commands in the remote server: local.machine$ ssh remote.machine uname -a Linux remote.machine 2.6.15-1-686-smp #2 SMP Mon Mar 6 15:34:50 UTC 2006 i686 GNU/Linux =item * Once you have installed L you can check that C can be executed in that machine using this one-liner: $ perl -e 'use GRID::Machine qw(is_operative); print is_operative("ssh", "beowulf")."\n"' 1 =back =head1 A PARALLEL ALGORITHM We are going to compute an approach to the number Pi (3.14159...) using numerical integration. Namely the area under the curve 1/(1+x**2) between 0 and 1 is C as it shows the following debugger session: pp2@nereida:~/public_html/cgi-bin$ perl -wde 0 main::(-e:1): 0 DB<1> use Math::Integral::Romberg 'integral' DB<2> p integral(sub { my $x = shift; 4/(1+$x*$x) }, 0, 1); 3.14159265358972 The module L provides the function C that allow us to compute the area of a given function in some interval. In fact - if you remember your high school days - it is easy to see the reason: the integral of C<4/(1+$x*$x)> is C<4*arctg($x)> and so its area between 0 and 1 is given by: 4*(arctg(1) - arctg(0)) = 4 * arctg(1) = 4 * Pi / 4 = Pi This is not, in fact, a good way to compute Pi, but makes a good example of how to exploit several machines to fulfill a task. To compute the area under C<4/(1+$x*$x)> we can divide up the interval C<[0,1]> into sub-intervals of size C<1/N> and add up the areas of the small rectangles with base C<1/N> and height the value of the curve C<4/(1+$x*$x)> in the middle of the interval. The following debugger session illustrates the idea: pp2@nereida:~$ perl -wde 0 main::(-e:1): 0 DB<1> use List::Util qw(sum) DB<2> $N = 6 DB<3> @divisions = map { $_/$N } 0..($N-1) DB<4> sub f { my $x = shift; 4/(1+$x*$x) } DB<5> @halves = map { $_+0.5/$N } @divisions DB<6> $area = sum(map { f($_)/$N } @halves) DB<7> p $area 3.14390742722244 Since our goal is to optimize the execution time, we will distribute the sum in line 6 C<$area = sum(map { f($_)/$N } @halves)> among the processors. The machines will be numbered from 0 to C (being C the number of machines) and each machine will sum up the areas of roughly C intervals. To achieve a higher performance the code to execute on each machine is written in C: pp2@nereida:~/LGRID_Machine/examples$ cat -n pi.c 1 #include 2 #include 3 4 main(int argc, char **argv) { 5 int id, N, np, i; 6 double sum, left; 7 8 if (argc != 4) { 9 printf("Usage:\n%s id N np\n",argv[0]); 10 exit(1); 11 } 12 id = atoi(argv[1]); 13 N = atoi(argv[2]); 14 np = atoi(argv[3]); 15 for(i=id, sum = 0; i, identifies the machine with a logical number, the second one, C, is the total number of intervals, the third C is the number of machines being used. Notice the C loop at line 15: Processor C sums up the areas corresponding to intervals C, C, C, etc. The program concludes writing to C the partial sum. Observe that, since we aren't using infinite precision numbers errors introduced by rounding and truncation imply that increasing C would not lead to a more precise evaluation of Pi. To get the executable we have a simple C: pp2@nereida:~/LGRID_Machine/examples$ cat -n Makefile 1 pi: 2 cc pi.c -o pi =head1 COORDINATING A CLUSTER The program C following in the lines below runs C<$np> copies of the former C program in a set C<@machines> of available machines, adding up the partial results as soon as they arrive. pp2@nereida:~/LGRID_Machine/examples$ cat -n gridpipes.pl 1 #!/usr/bin/perl 2 use warnings; 3 use strict; 4 use IO::Select; 5 use GRID::Machine; 6 use Time::HiRes qw(time gettimeofday tv_interval); The first lines load the modules: =over 2 =item * L will be used to open C connections with the remote machines and control the execution environment =item * L will be used to process the results as soon as they start to arrive. =item * L will be used to time the processes so that we can compare times and see if there is any gain in this approach =back 8 my @machine = qw{beowulf orion nereida}; 9 my $nummachines = @machine; 10 my %machine; # Hash of GRID::Machine objects 11 #my %debug = (beowulf => 12345, orion => 0, nereida => 0); 12 my %debug = (beowulf => 0, orion => 0, nereida => 0); 13 14 my $np = shift || $nummachines; # number of processes 15 my $lp = $np-1; 16 17 my $N = shift || 100; 18 19 my @pid; # List of process pids 20 my @proc; # List of handles 21 my %id; # Gives the ID for a given handle 22 23 my $cleanup = 0; 24 25 my $pi = 0; 26 27 my $readset = IO::Select->new(); Variable C<@machine> stores the IP addresses/names of the machines we have SSH access. These machines will constitute our 'virtual' parallel machine. For each of these machines (see the for loop in lines 30-46) a SSH connection is created (line 31) via Cnew>. The resulting C objects will be stored inside the hash C<%machine> (line 44). 29 my $i = 0; 30 for (@machine){ 31 my $m = GRID::Machine->new(host => $_, debug => $debug{$_}, ); 32 33 $m->copyandmake( 34 dir => 'pi', 35 makeargs => 'pi', 36 files => [ qw{pi.c Makefile} ], 37 cleanfiles => $cleanup, 38 cleandirs => $cleanup, # remove the whole directory at the end 39 ) 40 unless $m->_x("pi/pi")->result; 41 42 die "Can't execute 'pi'\n" unless $m->_x("pi")->result; 43 44 $machine{$_} = $m; 45 last unless $i++ < $np; 46 } The call to C at line 33 copies (using C) the files C and C to a directory named C in the remote machine. The directory C will be created if it does not exists. After the file transfer the C specified by the C option make => 'command' will be executed with the arguments specified in the option C. If the C option isn't specified but there is a file named C between the transferred files, the C program will be executed. Set the C option to number 0 or the string C<''> if you want to avoid the execution of any command after the transfer. The transferred files will be removed when the connection finishes if the option C is set. More radical, the option C will remove the created directory and all the files below it. Observe that the directory and the files will be kept if they were'nt created by this connection. The call to C by default sets C as the current directory in the remote machine. Use the option C 1> to one to avoid this. The condition at line 40 checks for the existence of the executable C: No transference will be done if an executable in C already exists. After the involved files are transferred and executables have been built, the program proceeds to open C<$np> processes. The call to C at line 52 executes the C program in the remote machine C<$m>. In a list context returns a handler - that can be used to read from the process - and the PID of the child process. The new handler C<$proc[$_]> is added to the L object C<$readset> at line 53. The hash C<%id> stores the relation between handlers and logical process identifiers. 48 my $t0 = [gettimeofday]; 49 for (0..$lp) { 50 my $hn = $machine[$_ % $nummachines]; 51 my $m = $machine{$hn}; 52 ($proc[$_], $pid[$_]) = $m->open("./pi $_ $N $np |"); 53 $readset->add($proc[$_]); 54 my $address = 0+$proc[$_]; 55 $id{$address} = $_; 56 } During the last stage the master node simply waits in the L object listening on each of the channels. As soon as a result is received it is added to the total sum for C<$pi>: 58 my @ready; 59 my $count = 0; 60 do { 61 push @ready, $readset->can_read unless @ready; 62 my $handle = shift @ready; 63 64 my $me = $id{0+$handle}; 65 66 my ($partial); 67 my $numBytesRead = sysread($handle, $partial, 1024); 68 chomp($partial); 69 70 $pi += $partial; 71 print "Process $me: machine = $machine[$me % $nummachines] partial = $partial pi = $pi\n"; 72 73 $readset->remove($handle) if eof($handle); 74 } until (++$count == $np); 75 76 my $elapsed = tv_interval ($t0); 77 print "Pi = $pi. N = $N Time = $elapsed\n"; =head1 PERFORMANCE: COMPUTATIONAL RESULTS Let us see the time it takes the execution of the I program on each of the involved nodes (nereida, beowulf and orion). To have an idea of how things work for a comptuation large enough we set C<$N> to C<1 000 000 000> intervals: pp2@nereida:~/LGRID_Machine/examples$ time ssh nereida 'pi/pi 0 1000000000 1' 3.141593 real 0m32.534s user 0m0.036s sys 0m0.008s pp2@nereida:~/LGRID_Machine/examples$ time ssh beowulf 'pi/pi 0 1000000000 1' 3.141593 real 0m27.020s user 0m0.036s sys 0m0.008s casiano@beowulf:~$ time ssh orion 'pi/pi 0 1000000000 1' 3.141593 real 0m29.120s user 0m0.028s sys 0m0.003s As you can see, there is some heterogeneity here. Machine C (my desktop) is slower than the others two. C is the fastest. Now let us run the parallel perl program in C using only the C node. The time spent is roughly comparable to the I time. That is nice: The overhead introduced by the coordination tasks is not as large (compare it with the C entry above): pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000 Process 0: machine = beowulf partial = 3.141593 pi = 3.141593 Pi = 3.141593. N = 1000000000 Time = 27.058693 real 0m28.917s user 0m0.584s sys 0m0.192s Now comes the true test: will it be faster using two nodes? how much? pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000 Process 0: machine = beowulf partial = 1.570796 pi = 1.570796 Process 1: machine = orion partial = 1.570796 pi = 3.141592 Pi = 3.141592. N = 1000000000 Time = 15.094719 real 0m17.684s user 0m0.904s sys 0m0.260s We can see that the sequential pure C version took 32 seconds in my desktop (C). By using two machines I have SSH access I have reduced that time to roughly 18 seconds. This a factor of C<32/18 = 1.8> times faster. This factor is even better if I don't consider the set-up time: C<32/15 = 2.1>. The total time decreases if I use the three machines: pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 3 1000000000 Process 0: machine = beowulf partial = 1.047198 pi = 1.047198 Process 1: machine = orion partial = 1.047198 pi = 2.094396 Process 2: machine = nereida partial = 1.047198 pi = 3.141594 Pi = 3.141594. N = 1000000000 Time = 10.971036 real 0m13.700s user 0m0.952s sys 0m0.240s which gives a speed factor of C<32/13.7 = 2.3> or not considering the set-up time C<32/10.9 = 2.9>. What happens if you have multiprocessor machine. The results highly depend on the underlying architecture. My machine C is a dual Xeon: nereida:/tmp/graphviz-2.20.2# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.66GHz stepping : 5 cpu MHz : 2658.041 cache size : 512 KB physical id : 0 ....................................... processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.66GHz stepping : 5 cpu MHz : 2658.041 cache size : 512 KB physical id : 0 ................................... After changing the C to include the C<-O3> option and the line defining the set of machines in C (addresses in the subnetwork 127.0.0 are mapped to localhost): my @machine = qw{127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.4}; We have the following results: pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000 Process 0: machine = 127.0.0.1 partial = 3.141593 pi = 3.141593 Pi = 3.141593. N = 1000000000 Time = 32.968117 real 0m33.858s user 0m0.336s sys 0m0.128s pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000 Process 1: machine = 127.0.0.2 partial = 1.570796 pi = 1.570796 Process 0: machine = 127.0.0.1 partial = 1.570796 pi = 3.141592 Pi = 3.141592. N = 1000000000 Time = 16.552487 real 0m18.076s user 0m0.504s sys 0m0.188s Which gives an speed up near 2. =head1 CONCLUSIONS, LIMITATIONS AND FUTURE WORK This example shows how to take advantage of the computational power of idle stations and the very high level of programming offered by Perl to improve the performance of an application. High Performance Programming (HPP, as provided by Very High Level Languages) and High Performance Computing (HPC) have been always two opposites ends of the spectrum: if you optimize programmer's time (HPP) computing time suffers (HPC) and viceversa. A synergetic combination of HPC and HPP tools can bring the best of both worlds. This example however is too simplistic and does not address some important limitations that point to what must be done. I look forward for CPAN modules filling these gaps: =over 2 =item * The former example does not address the I. The load balancing problem is the problem to find the optimal work distribution among nodes, network links and any other involved resources, in order to get an optimal resource utilization. =item * Fault tolerance in the workers: What happens if one of the worker machines is shutdown in the middle of a computation? or the connection goes down? What mechanisms are provided? =item * Fault tolerance in the master: What if the master node goes down? Has the computation to restart from scratch? =item * Dynamic Resources: What happens if a new machine that previously was down is now available? Can we add it to our pool of resources? =back =head1 SEE ALSO =over 2 =item * L =item * L =item * L =item * L =item * L =item * L =item * L =item * Man pages of C, C, C, C, C, C, C =item * L =back =over 2 =item * L =item * The Wikipedia entry in Cluster Computing L =item * The Wikipedia entry in GRID Computing: L =item * The Wikipedia entry for I L =item * The CPAN module L by Andy Armstrong =item * The State of Parallel Computing in Perl 2007. Perlmonks node at L =back =head1 THE FULL CODE =head2 The Driver: File C $ cat gridpipes.pl #!/usr/bin/perl use warnings; use strict; use IO::Select; use GRID::Machine; use Time::HiRes qw(time gettimeofday tv_interval); my @machine = qw{beowulf orion nereida}; my $nummachines = @machine; my %machine; # Hash of GRID::Machine objects #my %debug = (beowulf => 12345, orion => 0, nereida => 0); my %debug = (beowulf => 0, orion => 0, nereida => 0); my $np = shift || $nummachines; # number of processes my $lp = $np-1; my $N = shift || 100; my @pid; # List of process pids my @proc; # List of handles my %id; # Gives the ID for a given handle my $cleanup = 1; my $pi = 0; my $readset = IO::Select->new(); my $i = 0; for (@machine){ my $m = GRID::Machine->new(host => $_, debug => $debug{$_}, ); $m->copyandmake( dir => 'pi', makeargs => 'pi', files => [ qw{pi.c Makefile} ], cleanfiles => $cleanup, cleandirs => $cleanup, # remove the whole directory at the end keepdir => 1, ); $m->chdir("pi/"); die "Can't execute 'pi'\n" unless $m->_x("pi")->result; $machine{$_} = $m; last unless $i++ < $np; } my $t0 = [gettimeofday]; for (0..$lp) { my $hn = $machine[$_ % $nummachines]; my $m = $machine{$hn}; ($proc[$_], $pid[$_]) = $m->open("./pi $_ $N $np |"); $readset->add($proc[$_]); my $address = 0+$proc[$_]; $id{$address} = $_; } my @ready; my $count = 0; do { push @ready, $readset->can_read unless @ready; my $handle = shift @ready; my $me = $id{0+$handle}; my ($partial); my $numBytesRead = sysread($handle, $partial, 1024); chomp($partial); $pi += $partial; print "Process $me: machine = $machine[$me % $nummachines] partial = $partial pi = $pi\n"; $readset->remove($handle) if eof($handle); } until (++$count == $np); my $elapsed = tv_interval ($t0); print "Pi = $pi. N = $N Time = $elapsed\n"; =head2 The Application. File C $ cat pi.c #include #include main(int argc, char **argv) { int id, N, np, i; double sum, left; if (argc != 4) { printf("Usage:\n%s id N np\n",argv[0]); exit(1); } id = atoi(argv[1]); N = atoi(argv[2]); np = atoi(argv[3]); for(i=id, sum = 0; i