Package edu.rit.pj2.tracker

Package edu.rit.pj2.tracker contains classes for the Parallel Java 2 (PJ2) Tracker and related classes.


A PJ2 Job consists of multiple Tasks that are executed on one or more nodes. Four objects cooperate to run PJ2 jobs: the Job, which runs in the user's pj2 process, launches the job's tasks, and holds the job's tuple space; the Tracker, which queues jobs and schedules their tasks onto available nodes; the Launcher, which runs on each computational node and spawns Backend processes at the Tracker's request; and the Backend, a process in which a single task executes and reports its results back to the Job.

Implementations. Various Tracker and Launcher implementations are available to execute tasks on various kinds of parallel computers. See the individual classes in this package for further information about configuring and operating the various Trackers and Launchers.

PJ2 on a Cluster Parallel Computer

A cluster parallel computer typically consists of a frontend node for controlling the cluster and a number of backend nodes for performing computations. The frontend and backend nodes are linked together by a dedicated high-speed backend local network. The frontend node can connect to the backend nodes over the backend local network, but the backend nodes are typically not accessible from the Internet. The frontend node also typically has another network interface that is connected to the Internet, allowing users to log into the frontend node.

To set up PJ2 on a cluster parallel computer, to be shared by many users, run the Tracker and Launcher programs as follows.

Tracker. Run the Tracker program on the frontend node. It is recommended to run the Tracker program on a separate frontend node, not one of the backend nodes, so that the full processing capacity of the backend nodes is available for computations. For security, it is recommended to run the Tracker program from a special PJ2 account, not from any user's account.

Specify the Tracker program's tracker parameter to be the host name or IP address of the Tracker's node; every Launcher program and every user's Job program must be able to connect to this host name. For security, it is recommended not to use an Internet-accessible host name or IP address; rather, use a host name or IP address that can only be accessed by logging into the node where the Tracker is running (such as the IP address of the backend local network interface).

Specify the Tracker program's web parameter to be the Internet-accessible host name or IP address for the Tracker's web interface.

Specify the Tracker program's name parameter to be the name of the PJ2 system.

Redirect the Tracker program's standard output and standard error to a log file.
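
As an illustration only, the Tracker might be started from the PJ2 account on the frontend node with a command along the following lines; the host names, system name, and log file path are hypothetical, and the exact parameter syntax is given in the Tracker class documentation.

    # Hypothetical sketch -- start the Tracker on the frontend node
    # (assumes the PJ2 Library is on the Java class path).
    # tracker= : host on the backend local network, not Internet-accessible
    # web=     : Internet-accessible host for the Tracker's web interface
    # name=    : name of the PJ2 system
    java edu.rit.pj2.tracker.Tracker tracker=10.10.1.1 \
        web=frontend.example.edu name=ExampleCluster > tracker.log 2>&1 &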

Launchers. Run the Launcher program on each node of the cluster that will perform computations. For security, it is recommended to run the Launcher program from a special PJ2 account, not from any user's account; this will prevent users running jobs on the cluster from accessing other users' accounts.

Specify the Launcher program's tracker parameter to refer to the Tracker's host name (the same as the Tracker's tracker parameter).

Specify the Launcher program's other parameters with the characteristics of the Launcher's node.

Redirect the Launcher program's standard output and standard error to a log file.
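
For example, a Launcher on a backend node with 8 CPU cores and one GPU accelerator might be started as sketched below; the node parameter shown here is only an assumed form mirroring the Tracker's node parameter, and the actual names and format of the node-characteristic parameters are given in the Launcher class documentation.

    # Hypothetical sketch -- start the Launcher from the PJ2 account on a backend node.
    # tracker= : same host name as the Tracker's tracker parameter
    # node=    : node name, CPU cores, GPU accelerators (form assumed here;
    #            see the Launcher class for the actual node-characteristic parameters)
    java edu.rit.pj2.tracker.Launcher tracker=10.10.1.1 node=node01,8,1 \
        > launcher.log 2>&1 &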

Jobs. Users can now run Jobs on the cluster by logging into their accounts on the node where the Tracker is running, and running the pj2 program.

Specify the pj2 program's jar parameter to refer to a Java archive (JAR) file containing all the classes in the user's program. (The JAR file need not and should not include the PJ2 Library classes.)

If necessary, specify the pj2 program's tracker parameter to refer to the Tracker's host name. If necessary, specify the pj2 program's listen parameter to be the Internet-accessible host name or IP address of the user's node; every Launcher program must be able to connect to this host name. (The tracker and listen parameters don't usually need to be specified; the defaults usually work.)
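
For example, a user's job might be run from the Tracker node with a command like the following sketch, in which the JAR file name, job class name, and arguments are hypothetical.

    # Hypothetical sketch -- run a Job on the cluster
    # (assumes the PJ2 Library is on the Java class path).
    # jar= : JAR file containing the user's classes, not the PJ2 Library classes
    java pj2 jar=myprog.jar com.example.MyJob arg1 arg2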

When the Job launches a Task, the task runs on one of the cluster nodes, in a Backend process spawned by that node's Launcher; the Backend process runs in the same account as the Launcher.

Tasks. Users can also run multicore Tasks on the cluster by logging into their accounts on the node where the Tracker is running, and running the pj2 program. In this case the pj2 program automatically wraps the task inside a job and runs the job as described above. If the pj2 program's threads parameter is specified, the task will require the given number of cores; otherwise, the task will require all the cores on the node. The task runs on one of the cluster nodes, in a Backend process spawned by that node's Launcher; the Backend process runs in the same account as the Launcher.
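
For example, a four-core task might be submitted with a command like this sketch (JAR file, class name, and argument are hypothetical):

    # Hypothetical sketch -- run a single multicore Task on the cluster.
    # threads= : cores the task requires; omit it to require all cores on a node
    java pj2 jar=myprog.jar threads=4 com.example.MyTask arg1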

Firewalls. By default, the Job listens for TCP socket connections from Backends on an unused port; the Tracker listens for TCP socket connections from Jobs and Launchers on port 20618; and the Tracker listens for TCP socket connections from web browsers on port 8080. (The defaults can be overridden.) If a firewall is active on the node where a Job or a Tracker is running, the firewall must be configured to allow incoming connections to the appropriate port(s).
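
For example, on a Linux node running firewalld, the Tracker's default ports could be opened with commands along these lines (adjust the port numbers if the defaults were overridden):

    # Allow Job and Launcher connections to the Tracker (default port 20618)
    # and web browser connections to the Tracker's web interface (default port 8080).
    firewall-cmd --permanent --add-port=20618/tcp
    firewall-cmd --permanent --add-port=8080/tcp
    firewall-cmd --reload

Because a Job listens on an unused port chosen at run time, a firewall on a node where jobs run either needs a broader rule admitting connections from the backend nodes or a fixed, explicitly overridden listen port.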

PJ2 on a Multicore Parallel Computer

A multicore parallel computer consists of a single node with multiple cores. To run a Task on a multicore parallel computer, the Tracker is not required. If no Tracker is running, the pj2 program executes the task in the pj2 program's process.
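
For example (task class and argument hypothetical), on a node with no Tracker installed, the following command runs the task immediately in the pj2 program's own process:

    # Hypothetical sketch -- run a Task directly on a multicore node with no Tracker.
    java pj2 com.example.MyTask arg1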

Although not required, a Tracker might still be useful on a multicore parallel computer. A typical situation is when many users want to run tasks on the computer. In this case, a Tracker can queue users' tasks and schedule them to execute when computational resources, such as CPU cores and GPU accelerators, are available.

To set up PJ2 on a multicore parallel computer, to be shared by many users, run the Tracker program as follows.

Tracker. Run the Tracker program on the node. For security, it is recommended to run the Tracker program from a special PJ2 account, not from any user's account.

It is recommended not to specify the Tracker program's tracker parameter. The Tracker will then listen for connections on localhost. This way, the Tracker can only be accessed by logging into the node where the Tracker is running.

Specify the Tracker program's web parameter to be the Internet-accessible host name or IP address for the Tracker's web interface.

Specify the Tracker program's name parameter to be the name of the PJ2 system.

Specify the Tracker program's node parameter to be the name of the node, the number of CPU cores on the node, and the number of GPU accelerators on the node. (The node parameter is required because there are no Launchers.)

Redirect the Tracker program's standard output and standard error to a log file.
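
As an illustrative sketch, the Tracker might be started like this on a node named server with 16 CPU cores and 2 GPU accelerators; the host name, system name, and the comma-separated form of the node parameter are assumptions here, and the exact syntax is given in the Tracker class documentation.

    # Hypothetical sketch -- start the Tracker on a shared multicore node.
    # No tracker= parameter, so the Tracker listens only on localhost.
    # web=  : Internet-accessible host for the Tracker's web interface
    # name= : name of the PJ2 system
    # node= : node name, CPU cores, GPU accelerators (format assumed here)
    java edu.rit.pj2.tracker.Tracker web=server.example.edu name=ExampleServer \
        node=server,16,2 > tracker.log 2>&1 &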

Tasks. Users can now run multicore Tasks on the node by logging into their accounts on the node and running the pj2 program. If the pj2 program's threads parameter is specified, the task will require the given number of cores; otherwise, the task will require all the cores on the node. By default, the pj2 program contacts the Tracker on localhost to schedule the task. When resources are available, the Tracker notifies the pj2 program, which then executes the task in the pj2 program's process. (There are no Launchers, so the task does not run in a separate Backend process.)
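
For example (task class and argument hypothetical), a user logged into the node could submit an eight-core task to the local Tracker's queue with:

    # Hypothetical sketch -- the task is queued by the Tracker on localhost and,
    # once 8 cores are free, executes inside this pj2 process.
    java pj2 threads=8 com.example.MyTask arg1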

Jobs. Users can also run Jobs on the node by logging into their accounts on the node and running the pj2 program. The system acts as a "cluster" with one node. The job's tasks are executed in the job's process.

PJ2 Protocol

The Job, Tracker, Launcher, and Backend objects communicate with one another by exchanging the following messages:

  1. When the Launcher starts on a certain node, the Launcher connects to the Tracker's host and port. The Launcher tells the Tracker that the Launcher started, specifying the characteristics of the node (name, number of CPU cores, number of GPU accelerators).

  2. The Tracker exchanges heartbeat messages with each Launcher periodically. If a heartbeat message does not arrive within the expected time, that tells the Tracker or Launcher that its counterpart has failed. If the Tracker detects that a Launcher failed, the Tracker no longer schedules tasks on that node. If a Launcher detects that the Tracker failed, the Launcher terminates.

  3. The Job connects to the Tracker's host and port. The Job asks the Tracker to start a new job, specifying the Job's host and port. The Tracker records the job.

  4. The Tracker tells the Job that the job was launched, specifying the job ID.

  5. When the Tracker decides the job can start (immediately after Step 4, or later), the Tracker tells the Job that the job was started.

  6. Once the job has launched, the Job exchanges heartbeat messages with the Tracker periodically. If a heartbeat message does not arrive within the expected time, that tells the Job or Tracker that its counterpart has failed. If the Job detects that the Tracker failed, the Job itself fails. If the Tracker detects that the Job failed, the Tracker removes the job and all its tasks.

  7. The Job asks the Tracker to launch a new task group, consisting of one or more tasks. For each task in the group, the Job specifies the task ID and the node requirements (number of CPU cores, number of GPU accelerators, etc.). The Tracker records the task group. The Job can launch multiple task groups.

  8. When an appropriate node is available for a task, the Tracker tells the Job that the task is launching. The Job starts a timeout to detect if the task fails to launch.

    Alternatively, if there is a problem (e.g., there are no nodes with the required resources), the Tracker tells the Job that the task failed and removes the task.

  9. The Tracker tells the chosen node's Launcher to launch a new Backend process, specifying the Job's host and port and the task ID.

  10. The Launcher attempts to create a new Backend process, passing it the Job's host and port and the task ID.

  11. If the Backend process failed to be created, the Launcher tells the Tracker that the launch failed, and the Launcher terminates (see Step 23). The Tracker tells the Job that the task failed and removes the task.

  12. If the Backend process was successfully created, the Backend connects to the Job's host and port. The Backend tells the Job that the task was launched, specifying the task ID.

  13. The Job tells the Backend to start the task, specifying the task parameters (command line arguments, input tuples, etc.). The Backend starts running the task.

  14. While the task is running, the Job exchanges heartbeat messages with the Backend periodically. If a heartbeat message does not arrive within the expected time, that tells the Job or Backend that its counterpart has failed. If the Job detects that the Backend failed, the corresponding task fails. If the Backend detects that the Job failed, the Backend terminates.

  15. If the task prints on the Backend's standard output or standard error, the printed characters are intercepted and sent to the Job, which prints the characters on the Job's own standard output or standard error.

  16. If the task needs to take a tuple out of tuple space while executing, the Backend informs the Job, specifying the template with which to find a matching tuple.

  17. When the Job finds a tuple that matches the template, the Job removes the tuple from tuple space and sends the tuple to the Backend.

  18. If the task writes a tuple into tuple space while executing, the Backend informs the Job, specifying the tuple to be written. The Job writes the tuple into tuple space.

  19. When the task finishes running successfully, the Backend tells the Job that the task finished, specifying the task's results.

    Alternatively, if the task fails (i.e., throws an exception), the Backend tells the Job that the task failed, specifying the exception.

    In either case, the Backend process then terminates.

  20. When the task is done, the Job tells the Tracker that the task is done. The Tracker removes the task.

  21. When all tasks of the Job are done, the Job tells the Tracker that the job is done. The Tracker removes the job.

  22. If the job fails, including if the user kills the job manually, the Job tells any remaining Backends to stop their tasks. The Backends terminate. The Job also tells the Tracker to stop the job, specifying an error message that goes in the Tracker's log. The Tracker removes the job and all its tasks.

  23. If the Launcher decides to stop, it informs the Tracker. The Tracker no longer schedules tasks on that node.

Copyright © 2013–2018 by Alan Kaminsky. All rights reserved. Send comments to ark@cs.rit.edu.