Kuznetsov D., Tarasov E.

Abstract
--------

This paper describes our understanding of parallel systems. We propose a parallel computation model and the appropriate programming tools. The hardware architecture needed to support such systems is based on the distributed memory model and on the concepts of high-level-language processors.

Keywords:
---------

parallel systems, distributed memory, computation models, programming languages, Modula-2, Oberon, Kronos, C, Unix, transputer, INMOS.

---
INMOS (tm) is a trademark of INMOS corp.
Unix (tm) is a trademark of AT&T.
Kronos (tm) is a trademark of KRG.

Introduction
------------

This paper describes the huge-multiprocessor project being worked out in the Kronos Research Group. The major prerequisite for this work was the Kronos microprocessor architecture and the software designed for it at the Novosibirsk Computer Center in recent years. Kronos is a microprocessor architecture based on OISC ideas. On the one hand, it is optimized for efficient implementation of high-level languages; on the other, for simplicity of the silicon implementation of the microprocessor chip set. The original operating system Excelsior was designed for the Kronos architecture. Excelsior was a complex, modern operating system; it allowed the Kronos Research Group to follow up several important ideas in parallel software design. Optimizing compilers for Modula-2, C and Fortran-77 were also designed for Kronos. We hope all this will be a good beginning (and a good beginning is half of a good ending) for solving the difficult problem of huge multiprocessor system design. The experience of the transputer computation model (introduced by INMOS) convinced us to look for another way to organize a flexible and useful parallel system.

General description of parallel system
--------------------------------------

The goal of any parallel computation system is information processing. In classic models such as the Turing machine, the atoms of information are processed strictly sequentially.
But even in early implementations of computer systems, bits of information were coupled into portions (named "words"). All bits of a word are processed in parallel by a special device called a "processor". The processor may be a complex device that includes an arithmetic logic unit, registers, cache memory, a floating-point accelerator and so on. All these technical details are very interesting and may be the subject of individual investigation. Some phases of data processing inside the processor may also be executed in parallel. But our main interest is directed at the architecture of parallel systems with a huge number of processors. We assume that many processor elements (nodes) execute their individual data-processing programs and communicate with their neighbours for synchronization and data exchange. Some subsets of processors may really share common data (for example, program code and constants). We will discuss the problems of organizing parallel work in such a system. The base element of the PARK system will be the Kronos microprocessor, programmed in the Modula-P language. The multiprocessor system will be driven by the PARIS operating environment described below.

Computation Model
-----------------

PARK is a set of nodes linked into a net with an appropriate topology. Optimization of the net topology is not a goal of our exploration and will be discussed later for concrete applications. The attainability (possibly transit attainability) of each node is the single condition needed for our reasoning. Each node of PARK includes a processor and local memory. The local memory is connected to the processor through a special memory manager. The memory manager includes table-driven virtual memory support and facilities for access to the local memory of other nodes. The node's processor may access any non-local memory, but at different speeds. Maximum speed is reached for local memory. Non-local memory access speed depends on the distance from processor to memory, which is determined by the net topology and system overhead.
The operating environment is formed as a set of independent tasks. Task independence means that tasks never process shared data. Nevertheless, the independence property allows tasks shared use of read-only data, because such data is never written by these tasks and so is never processed. So PARK is a divided resource for the task set. The system kernel may schedule the PARK resources (processors, memories and so on) between tasks. A task itself is a set of processes sharing the task's virtual address space. The global virtual address space of a task is mapped onto the physical memories of the system nodes. The distance between each process of a task and each datum is determined by the net configuration, the process-to-processor mapping and the virtual-to-physical address mapping on the processors used by the task. The last two factors must be optimized to minimize the distance between data and processes. This optimization is individual for each task.

The processes forming a task demand synchronization. Their requirements may be satisfied in different ways. Our model suggests that the processes of a task may share the task's data, and so they require means for exclusive access to common data. The "signal" model of synchronization is satisfactory. A "signal" is in essence a queue of waiting processes together with a send-count (used when the queue is empty). Sending a signal activates the first waiting process or increments the signal's send-count. Waiting on a signal decrements the send-count if it is non-zero and continues, or delays the process until the signal is sent. This general synchronization method is used not only for processes. Within a task, a set of special signals represents the processors used by this task. A signal from this set is the ready-queue for the appropriate processor. When the operating environment decides that this processor is ready for the task, it simply takes the first waiting process from its ready-queue.

The execution unit in the PARK architecture will be the program, as a collection of algorithms and data structures.
The program determines task execution, the creation of processes and their algorithms, and resource allocation and usage. The main assumption about program properties needed for efficient execution is based on the locality of processes. This assumption means that the most significant data traffic in any process of the task is the traffic between the process and its local data. A process's local data resides in the local RAM of the appropriate processor, while points of synchronization, of accepting and sending data to other processes, and of shared access to common data structures are not very frequent. At worst, when a program does not satisfy this assumption (for example, one written for a traditional common-memory multiprocessor), it may be slow, but in any case it will be executable!

PARIS - parallel instrumental system
------------------------------------

The problem of controlling parallel systems is the corner-stone of our project. We will not use the words 'operating system', because their semantics have been greatly overloaded. We will not touch the problems of abnormally high-reliability systems. We will not touch very high-level safety systems. This does not mean that we ignore these problems, but we think that the main goal of a huge parallel operating environment is to provide instrumental tools for creating such kinds of systems. The prefix "instrumental" means that the tools we foresee are oriented toward professional programmers, the future system developers. A super-high-level system for safe, non-professional parallel programming may be the goal of a later project. We hope that such systems may be developed from PARIS in the regular way used by N. Wirth for extracting the Oberon language from Modula-2.

The main instrument for parallel programming in PARIS is the Modula-2 language and its compiler. A program in Modula-2 is designed as a set of modules. The processes run by the program use procedures from modules to express their execution algorithms, and module variables to access global data structures.
There are three notions in the PARIS system: tasks, processes and modules. Modules, on the one hand, are essentially half-finished products used to construct the algorithms of individual processes; on the other hand, they are data managers and provide exclusive or shared access to data structures. There are two ways for processes to communicate. The first is access to guarded common data structures managed by modules. The second is message passing through sending and waiting on signals. Fortunately, both types of synchronization may be expressed with Send/Wait signal operations, and it is necessary only to add a new standard type SIGNAL, and operations on this type, to the existing Modula-2 language. We have no need to design a new language for programming the kernel of the operating environment. We are sure that Modula-2 is a good choice for these goals. For the purposes of compatibility we will implement system calls similar to those of the Unix-V operating system. This will allow us to run the existing C language compiler for the Kronos microprocessor and to use the portable C library for ordinary Unix programs. Of course, additional libraries will be present to give newly designed C programs access to the specific multiprocessor system facilities.

PARK - parallel Kronos architecture
-----------------------------------

Network structure. A special network of links is used for data transfer between nodes. Each link is a 32-bit-wide bus connected to several nodes. Each node can be connected to a few links. If two nodes are connected to the same link, then each of them can directly access the memory of the other. Access through the link uses a global physical memory address consisting of two parts: the node number and the word address in the node's local memory. The link controller of each node provides request retranslation from one link to another, using a map table loaded with information about the network topology. In this way a node can access the memory of any other node in the system.
A memory-read request is processed in two separate steps: address transmission and data reception. This is necessary to prevent network interconnection deadlocks.

Node structure. The main part of a node is the CPU. It is the single-chip 32-bit microprocessor Kronos, with an architecture optimized for efficient execution of programs written in high-level programming languages. All computations are performed on a hardware arithmetic stack. The addressing modes reflect the data structures of high-level programming languages. The instruction set includes commands for access to global, local, intermediate and external variables, arrays and records. A number of commands are intended for recursive procedure calls and for process and signal management.

The CPU provides virtual memory support. The address translation table is stored in the node's local memory. A special address translation cache decreases memory access time by holding the results of recent translations. The CPU uses the translation table only when the physical address of the accessed memory page is not in the cache. A physical address may refer to the local RAM or to the RAM of another node. In reality the address translation cache is distributed between the local memory controller and a number of link controllers. All of them try to find a translation in their caches in parallel. If none of them finds the translation, then the CPU computes the needed translation and puts it into one of the caches. The physical address determines the cache in which the translation will be stored, so only one controller may hold the translation for a given virtual address. Each task has its own translation table, and the kernel of the operating environment must provide synchronous modification of the table in the nodes executing the task.

A single-chip memory controller manages an array of dynamic memory chips. It contains an error detection and correction circuit, a data cache, and a memory refresh circuit. Disk drives and other peripheral devices can be connected with a SCSI controller chip. Several nodes may be connected to a single SCSI bus.
The SCSI controller provides direct memory access for input and output data.

Conclusion
----------

The proposed parallel architecture PARK, based on the ideas of distributed memory, combines the advantages of the shared-memory model with the high-speed local memory of the message-passing transputer model. So it may be a compromise way of increasing the application area of parallel architectures. The power of the Modula-2 language, based on the modular style of programming, permits us to hope that Modula-2 with minimal extensions is a good tool for the design and implementation of the kernel of the operating environment. Abstract data types encapsulated in an appropriate set of modules may provide very simple and elegant tools for efficient parallel software design.

We are very much obliged to professor V. Kotov for his important influence on our understanding of the major parallel problems. We also thank all members of the Kronos Research Group who really promoted this work.

Thursday, August 24, 1989
Kronos Research Group
Computer Center
Novosibirsk