DPA Survey Paper
Survey of Distributed Programming Models and Abstractions
Given the centrality of distributed applications, and our focus on Programming Models and Abstractions, we believe this will be a unique paper.
The main aims of the Survey paper are:
- Compile and discuss a comprehensive list of distributed applications
- Define and describe the Programming Models and Abstractions used
- Map each application to the Programming Models and perform a GAP analysis
Section 0: Introduction and Outline
An outline and section responsibilities will be posted shortly.
Section 1: Application Areas
What exactly do we mean to encompass by the term "Distributed" particularly when talking about our application pool? A good starting point is the Berkeley Dwarf set. Is there anything in there which we *don't* consider relevant? If not, are we saying that parallel is now effectively a subset of distributed? Alternatively, what isn't in the Berkeley set which should be in our set? Why isn't it in the Berkeley set? For some fundamental "distributed" reason? (Murray)
Preliminary list from the discussions at Workshop I. Different ways of classifying applications exist. We need to converge on a general and widely applicable classification.
- Loosely Coupled Applications
- Loosely Coupled Ensembles of Tightly-coupled Applications
- Tightly-coupled Applications
- Irregular/Dynamic
- Interactive
- File Oriented, DB Oriented
- Stream/Event/Sensor Oriented
- Distributed versus Centralized Data
- Pre-determined vs Application determined distribution
- Other
For each application area:
- What are the Components?
- How are they Coupled?
- Data flows ...
- Different Languages
- Discussion of the hosting environments
- Bootstrapping and deployment
- others to come...
Section 2: Programming Models and Abstractions
The intent of this section is to define and describe the set of programming models and abstractions that the theme believes will capture all of the current work in distributed/grid computing. It appears that the overall set can be grouped into five or six areas:
- Composition
- Messaging
- Component Models
- Grid-Aware (grid libraries)
- Services
- Other (keep this?)
The first group, Composition, includes a number of items, such as:
visual programming, scripting, workflow, dataflow, functional languages, skeletons, and superscalar [ref. needed].
The main idea that ties these together is the concept of a user building an application out of pieces. These pieces are independent units of code that can be subroutines or executables. When a piece is executed, it is instantiated, started, runs to completion, and is then removed from the system. Important issues here are the ordering of the pieces and the flow of data or messages between them. In many cases, the user has a GUI of some sort that can be used to specify the ordering. In other cases, the user specifies only the dependencies, and an automated system then decides the ordering in which the pieces are run to complete the application. Another issue is the choice of resources on which the pieces are run. This may be determined by either the user or an automated system.
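The difference between specifying an ordering and specifying only dependencies can be illustrated with a minimal sketch. It assumes a hypothetical four-piece application and two hypothetical resources (`node0`, `node1`); the round-robin placement is an illustrative stand-in, not the policy of any particular workflow system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule(dependencies, resources):
    """Derive an execution order and a resource placement.

    dependencies maps each piece to the set of pieces it must wait for.
    resources is a non-empty list of (hypothetical) resource names.
    """
    # The user supplied only dependencies; the system picks a valid ordering.
    order = list(TopologicalSorter(dependencies).static_order())
    # Naive round-robin placement; a real system would consider load,
    # data locality, and so on.
    placement = {piece: resources[i % len(resources)]
                 for i, piece in enumerate(order)}
    return order, placement


# Hypothetical application: two independent simulations feed one analysis.
deps = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "analyze": {"simulate_a", "simulate_b"},
}
order, placement = schedule(deps, ["node0", "node1"])
```

Here `prepare` always comes first and `analyze` last, while the relative order of the two simulations is left to the system. In a model where the user specifies the ordering, `order` would instead be written out by hand.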
| model/abstraction | ordering or dependency specified? | Resource decision? |
|---|---|---|
| Visual Programming | Ordering | ? |
| Scripting | Ordering | User |
| Workflow | Dependency | Automatic |
| Dataflow | Ordering | User |
| Functional languages | ? | ? |
| Skeletons | ? | ? |
| Superscalar | Dependency | Automatic |
(what else should go in this table? Currently, scripting and dataflow are the same, and workflow and superscalar are the same. Another column or two might be useful to distinguish between them. And, can someone else fill in the "?"s?)
The second group, Messaging, includes a number of items, such as: two-sided messaging, RPC and other one-sided messaging, event-driven systems, transactional systems, streaming, and asynchronous operations.
The main idea that ties these together is how the pieces interact with each other. Two-sided messaging means that the two communicating pieces have both agreed to exchange data, and both participate in the exchange. Remote procedure call (RPC) means that one piece has exposed a procedure interface to the other piece, and that second piece can then call the exposed procedure within the first piece, eventually leading to a return value from the procedure. One-sided messaging means that one piece can manipulate (get, put, or change) memory that belongs to the other piece, without the other piece being involved or perhaps even aware. Event-driven systems are those in which the activities of each piece are driven by activities in other pieces, and a piece that is not responding to another piece is idle or waiting. In transactional systems, there is a sense that some state exists in each piece, and that this state changes based on activities from another piece. Once the state has finished changing, it becomes a new state. However, until that time (often marked by a commit message from the other piece), the state can easily be rolled back to the previous state without harm to the overall system. Streaming implies that there is a continuous flow of data from one piece to another, and that the second piece does some processing on the data and likely sends one or more streams to other pieces. So why isn't streaming dataflow with continuous data? Asynchronous operations mean that each piece can do other work while communication is occurring, rather than having to wait for the communication to complete.
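As one concrete illustration of the transactional case, the following is a minimal sketch of a piece whose state changes stay tentative until a commit arrives. The class name, the method names, and the `balance` key are all hypothetical, and the messages from the other piece are modeled as plain method calls rather than real network traffic.

```python
class TransactionalPiece:
    """A piece whose state changes become permanent only on commit."""

    def __init__(self, initial_state):
        self._committed = dict(initial_state)  # last agreed-upon state
        self._pending = None                   # tentative state, if any

    def begin(self):
        if self._pending is not None:
            raise RuntimeError("a transaction is already open")
        # Work on a copy so the committed state stays untouched.
        self._pending = dict(self._committed)

    def update(self, key, value):
        if self._pending is None:
            raise RuntimeError("no open transaction")
        self._pending[key] = value

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no open transaction")
        # The tentative state becomes the new state.
        self._committed = self._pending
        self._pending = None

    def rollback(self):
        # Discard tentative changes; the previous state is unharmed.
        self._pending = None

    @property
    def state(self):
        return dict(self._committed)


piece = TransactionalPiece({"balance": 100})

piece.begin()
piece.update("balance", 40)
piece.rollback()               # the other piece never sent a commit
after_rollback = piece.state

piece.begin()
piece.update("balance", 70)
piece.commit()                 # commit message received
after_commit = piece.state
```

After the rollback the piece still reports its original state; only the committed update is visible afterwards.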
The third group is Component Models. This group includes DObjects [DObjects], CORBA [CORBA], CCA [CCA], etc. It is rooted in object-oriented programming, and specifically in distributed objects. The main idea of all of the component models is that each component, which is really an object, can interact with other objects only through an explicitly specified portion of its interface; only the methods in that portion may be used by other components. In CORBA and CCA, this portion of the interface is defined in a language-independent manner, through IDL (Interface Definition Language) and SIDL (Scientific IDL), respectively. DObjects is a purely Java system, so it does not require any language independence. How the location (or distribution) of the components is determined is not specified by CORBA or CCA, although it appears to be automated for DObjects, based on a metacomputing resource brokering mechanism. Is this right? There are also unsolved issues related to parallelism, such as how a parallel component running on 4 processors communicates with a different parallel component running on a different 4 processors. This becomes even more complicated if the numbers of processes for the two components are not the same. One final matter related to component models is that they generally have the concept of a framework. An application is composed of components, and the framework is the structure that actually does the composition. For CORBA, the framework is the object request broker (ORB). For CCA, there are multiple frameworks, some of which handle parallel computing, some of which handle distributed computing, and none of which currently handle both. For DObjects, the framework is H2O [H2O]. (not clear if frameworks should be discussed or not...)
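The idea of components that interact only through an explicitly specified interface, composed by a framework, can be sketched in a few lines. This is a toy illustration and not how CORBA, CCA, or DObjects is implemented: `SolverPort`, `AveragingSolver`, `Driver`, and `Framework` are hypothetical names, everything runs in one process, and Python does not itself enforce that a component stays within the declared interface.

```python
from abc import ABC, abstractmethod


class SolverPort(ABC):
    """Hypothetical interface: the only part of a solver others may use."""

    @abstractmethod
    def solve(self, values):
        ...


class AveragingSolver(SolverPort):
    """A component that provides SolverPort."""

    def __init__(self):
        self._calls = 0  # internal detail, not part of the exposed interface

    def solve(self, values):
        self._calls += 1
        return sum(values) / len(values)


class Driver:
    """A component that uses a SolverPort without knowing its implementation."""

    def __init__(self):
        self.solver = None  # filled in by the framework

    def run(self, data):
        return self.solver.solve(data)


class Framework:
    """Toy framework: holds components and does the composition."""

    def __init__(self):
        self._components = {}

    def add(self, name, component):
        self._components[name] = component

    def get(self, name):
        return self._components[name]

    def connect(self, user, attribute, provider, port_type):
        provider_obj = self._components[provider]
        # Refuse the connection unless the provider implements the interface.
        if not isinstance(provider_obj, port_type):
            raise TypeError(f"{provider} does not provide {port_type.__name__}")
        setattr(self._components[user], attribute, provider_obj)


framework = Framework()
framework.add("driver", Driver())
framework.add("solver", AveragingSolver())
framework.connect("driver", "solver", "solver", SolverPort)
result = framework.get("driver").run([1.0, 2.0, 3.0])
```

The driver never names `AveragingSolver`; swapping in another component that implements `SolverPort` requires changing only the framework calls. A real component framework would additionally handle the location of components and, as noted above, the open questions around parallel components.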
The fourth group is called Grid-Aware, which also includes the idea of using grid libraries. (Shantenu probably should finish this paragraph...) Yes
The fifth group is called Services, and it includes:
- Semantic services
- Groups
- Agents
The common element of this group is ... (Perhaps Omer can work on this paragraph...)
Finally, there are likely other programming models and abstractions that do not fit in any of these groups. An additional group called other is reserved for them. Examples are ...
This categorization was initially undertaken without reference to any particular existing effort, but was afterwards compared with the published work of Lee & Talia (Grid Programming Models: Current Tools, Issues and Directions), Parashar & Brown (Conceptual and implementation models for the grid) and Soh (Grid Programming Models and Environments), and was found to generally agree with these papers, with differences that were well understood. Need to explain differences here, one by one, from each of these three articles... Also, need to change the URL links to paper-style references. Jha: I'll do this.
Section 3: A GAP analysis of the Programming Models through a mapping of Applications
Section 4: References
[DObjects] P. Jurczyk and L. Xiong. DObjects: Metacomputing Framework with Dynamic Query Processing for Distributed Data Networks, Emory University Mathematics and Computer Science Technical Report TR-2007-015, http://mathcs.emory.edu/technical-reports/techrep-00112.pdf
[CORBA] Object Management Group. 2002. CORBA component model, http://www.omg.org/technology/documents/formal/components.htm.
[CCA] D. E. Bernholdt, B. A. Allan, R. Armstrong, F. Bertrand, K. Chiu, T. L. Dahlgren, K. Damevski, W. R. Elwasif, T. G. W. Epperly, M. Govindaraju, D. S. Katz, J. A. Kohl, M. Krishnan, G. Kumfert, J. W. Larson, S. Lefantzi, M. J. Lewis, A. D. Malony, L. C. McInnes, J. Nieplocha, B. Norris, S. G. Parker, J. Ray, S. Shende, T. L. Windus, and S. Zhou. A Component Architecture for High-Performance Scientific Computing, International Journal of High Performance Computing Applications, v. 20(2), pp. 163-202, Summer 2006.
[H2O] D. Kurzyniec, T. Wrzosek, D. Drzewiecki, and V. Sunderam. Towards self-organizing distributed computing frameworks: The H2O approach. Parallel Processing Letters, v. 13(2), pp. 273–290, 2003, http://www.dcl.mathcs.emory.edu/h2o/papers/h2o_ppl03.pdf