Kuznetsov D., Tarasov E.
Abstract
--------
This paper describes our understanding of parallel
systems. We propose a parallel computation model and
appropriate programming tools. The hardware architecture needed
to support such systems is based on a distributed memory model
and on the concepts of high-level language processors.
Keywords:
---------
parallel systems, distributed memory, computation models,
programming languages, Modula-2, Oberon, Kronos, C, Unix,
transputer, INMOS.
---
INMOS (tm) is a trademark of INMOS corp.
Unix (tm) is a trademark of AT&T.
Kronos (tm) is a trademark of KRG.
Introduction
------------
This paper describes the large multiprocessor project
under development in the Kronos Research Group. The major
prerequisite for this work was the Kronos microprocessor
architecture and the software designed for it in the Novosibirsk
Computer Center over recent years. Kronos is a microprocessor
architecture based on OISC ideas. On the one hand it is
optimized for efficient implementation of high-level languages;
on the other, for simplicity of silicon implementation of the
microprocessor chip set. The original operating system Excelsior
was designed for the Kronos architecture. Excelsior was a
complete, modern operating system. It allowed the Kronos
Research Group to pursue several important ideas in parallel
software design. Optimizing compilers for Modula-2, C and
Fortran-77 were also designed for Kronos. We hope all this will
be a good beginning (and half of a good ending) for tackling the
difficult problem of designing a large multiprocessor system.
The experience of the transputer computation model (introduced
by INMOS) convinced us to look for another way to organize a
flexible and useful parallel system.
General description of parallel system
--------------------------------------
The goal of any parallel computation system is
information processing. In classic models such as the Turing
machine, the atoms of information are processed strictly
sequentially. But even in early implementations of computer
systems, bits of information were grouped into portions (named
"words"). All bits of a word are processed in parallel by a
special device called a "processor". The processor may be a
complex device that includes an arithmetic logic unit,
registers, cache memory, floating-point accelerator and so on.
All these technical details are very interesting and may be the
subject of individual investigation. Some phases of data
processing in the processor may also be executed in parallel.
But our main interest is directed to the architecture of
parallel systems with a huge number of processors.
We assume that many processor elements (nodes)
execute their individual programs for data processing and
communicate with their neighbours for synchronization and data
exchange. Some subsets of processors may actually share common
data (for example, program code and constants). We will discuss
the problems of organizing parallel work in such a system. The
base element of the PARK system will be the Kronos
microprocessor, programmed in the Modula-P language. The
multiprocessor system will be driven by the PARIS operating
environment described below.
Computation Model
-----------------
PARK is a set of nodes linked into a network with an
appropriate topology. Optimization of the network topology is
not a goal of our exploration and will be discussed later for
concrete applications. The reachability (possibly transit
reachability) of each node is the single condition needed for
our reasoning.
Each node of PARK includes a processor and local memory.
Local memory is connected to the processor through a special
memory manager. The memory manager includes table-driven
virtual memory support and facilities for access to the local
memory of other nodes. The node's processor may access any
non-local memory, but with different speed. The maximum speed
is reached for local memory. Non-local memory access speed
depends on the distance from processor to memory, determined by
the network topology and system overhead.
The operating environment is formed as a set of independent
tasks. Task independence means that tasks never process shared
data. The independence property nevertheless allows tasks to
share read-only data, because such data is never written by
these tasks and so is never processed. Thus, PARK is a divided
resource for the task set. The system kernel may schedule the
PARK resources (processors, memories and so on) between tasks.
A task itself is a set of processes sharing the task's
virtual address space. The global virtual address space of any
task is mapped into the physical memories of the system nodes.
The distance between each process of a task and each datum is
determined by the network configuration, by the
process-to-processor mapping, and by the virtual-to-physical
address mapping on the processors used in the task. The last
two factors must be optimized to minimize the distance between
data and processes. This optimization is individual for each
task.
The processes forming a task require synchronization.
Their requirements may be satisfied in different ways. Our
model suggests that the processes of a task may share the
task's data, and so they require means for exclusive access to
common data. The "signal" model of synchronization is
satisfactory. A "signal" is essentially a queue of waiting
processes together with a send-count (used when the queue is
empty). Sending a signal activates the first waiting process or
increments the signal's send-count. Waiting on a signal
decrements the send-count if it is non-zero and continues, or
delays the process until the signal is sent. This general
synchronization method is used not only for processes. Within a
task, a set of special signals represents the processors used
by this task. A signal from this set is the ready-queue for the
appropriate processor. When the operating environment decides
that a processor is ready for the task, it simply takes the
first waiting process from that processor's ready-queue.
The execution unit in the PARK architecture will be the
program, as a collection of algorithms and data structures. The
program determines task execution, process creation and the
algorithms for the processes, resource allocation and usage.
The main assumption about the program properties needed
for efficient execution is based on the locality of processes.
This assumption means that the most significant data traffic in
any process of the task is the traffic between the process and
its local data. A process's local data resides in the local RAM
of the appropriate processor, though points of synchronization,
of accepting and sending data to other processes, and of shared
access to common data structures do occur. In the worst case,
when a program does not satisfy this assumption (for example,
one written for a traditional common-memory multiprocessor), it
may be slow, but in any case it will be executable!
PARIS - parallel instrumental system
------------------------------------
The problem of controlling parallel systems is the
cornerstone of our project. We will not use the words
'operating system', because their semantics has become greatly
overloaded. We will not touch the problems of abnormally
high-reliability systems. We will not touch very high-level
safety systems. This does not mean that we ignore these
problems, but we think that the main goal of a huge parallel
operating environment is to provide instrumental tools for
creating such kinds of systems.
The prefix 'instrumental' means that the tools we foresee
are oriented to professional programmers, future system
developers. A super-high-level system for safe non-professional
parallel programming may be the goal of a later project. We
hope that such systems may be developed from PARIS in the
regular way used by N. Wirth to extract the Oberon language
from Modula-2.
The main instrument for parallel programming in PARIS is
the Modula-2 language and a compiler for it. A program in
Modula-2 is designed as a set of modules. Processes run by the
program use procedures from modules to express their execution
algorithms, and module variables to access global data
structures.
There are three notions in the PARIS system: tasks,
processes and modules. Modules, on the one hand, are
essentially half-finished products from which to construct the
algorithms of individual processes; on the other hand, they are
data managers and provide exclusive or shared access to data
structures.
There are two ways for processes to communicate. The first
is access to guarded common data structures managed by modules.
The second is message passing through sending and waiting on
signals. Fortunately, both types of synchronization may be
expressed with Send/Wait signal operations, and it is necessary
only to add a new standard type SIGNAL, and operations on this
type, to the existing Modula-2 language.
We have no need to design a new language for programming
the kernel of the operating environment. We are sure that
Modula-2 is a good choice for these goals.
For compatibility purposes we will implement system calls
similar to those of the Unix-V operating system. This will
allow us to run the existing C language compiler for the Kronos
microprocessor and to use the portable C library for ordinary
Unix programs. Of course, additional libraries will present the
specific multiprocessor system facilities to newly designed
C programs.
PARK - parallel Kronos architecture
-----------------------------------
Network structure.
A special network of links is used for data transfer
between nodes. Each link is a 32-bit-wide bus connected to
several nodes. Each node can be connected to a few links. If
two nodes are connected to the same link, then each of them can
directly access the memory of the other. Access through a link
uses a global physical memory address consisting of two parts:
the node number and the word address in the node's local
memory. The link controller of each node provides request
retranslation from one link to another, using a map table
loaded with information about the network topology. In this way
a node can access the memory of any other node in the system. A
memory read request is processed in two separate steps: address
transmission and data reception. This is necessary to prevent
network interconnection deadlocks.
Node structure.
The main part of a node is the CPU. It is the single-chip
32-bit microprocessor Kronos, with an architecture optimized
for efficient execution of programs written in high-level
programming languages. All computations are performed on a
hardware arithmetic stack. The addressing modes reflect the
data structures of high-level programming languages. The
instruction set includes commands for access to global, local,
intermediate and external variables, arrays and records. A
number of commands are intended for recursive procedure calls
and for process and signal management.
The CPU provides virtual memory support. The address
translation table is stored in the node's local memory. A
special address translation cache decreases memory access time
by holding the results of recent translations. The CPU uses the
translation table only when the physical address of the
accessed memory page is not in the cache. A physical address
may refer to the local RAM or to the RAM of another node.
Actually, the address translation cache is distributed between
the local memory controller and a number of link controllers.
All of them try to find the translation in their caches in
parallel. If none of them finds the translation, then the CPU
calculates the needed translation and puts it into one of the
caches. The physical address determines the cache in which the
translation will be stored, so only one controller may hold the
translation for a given virtual address. Each task has its own
translation table, and the kernel of the operating environment
must provide synchronous modification of the table on the nodes
executing the task.
A single-chip memory controller manages the array of
dynamic memory chips. It contains an error detection and
correction circuit, a data cache, and a memory refresh circuit.
Disk drives and other peripheral devices can be connected
through an SCSI controller chip. Several nodes may be connected
to a single SCSI bus. The controller provides direct memory
access for input and output data.
Conclusion
----------
The proposed parallel architecture PARK, based on the
ideas of distributed memory, combines the advantages of the
shared memory model with the high-speed local memory of the
message-passing transputer model. So it may be a compromise
that increases the application area of parallel architectures.
The power of the Modula-2 language, based on a modular
style of programming, permits us to hope that Modula-2 with
minimal extensions is a good tool for the design and
implementation of the kernel of the operating environment.
Abstract data types encapsulated in an appropriate set of
modules may provide very simple and elegant tools for efficient
parallel software design.
We are very much obliged to Professor V. Kotov for his
important influence on our understanding of the major problems
of parallelism. We also thank all members of the Kronos
Research Group who actively supported this work.
Thursday August 24 1989
Kronos Research Group
Computer Center
Novosibirsk