This wiki service has now been shut down and archived

Dynamic Distributed Data-Intensive Applications

From ESIWiki



Data-Intensive Research Workshop: Dynamic, Distributed, Data-Intensive Applications: Soliciting comments

Please add any comment you would like to make about this theme after the organiser's, Shantenu Jha's, introduction below. It can be specifically related to a talk or breakout session or be general. Please separate entries with headings (two = signs) that flag new topics, or subheadings (three = signs) and add your signature.

Understanding the Landscape of Dynamic, Distributed Data-Intensive Applications

Two important trends make understanding the landscape of dynamic, distributed, data-intensive (DDDI) applications important. The first is the well-known “data tsunami” problem, which in various incarnations is also referred to as the “Big Data” problem [1]. The second, often less appreciated, is the growing need to manage dynamic and data-driven applications [2]. Such applications generally involve computational activities triggered as a consequence of data creation and arrival. The aim of this sub-plot is to focus on the challenges arising at the confluence of these two trends, as this characterises an emerging set of important applications.

To illustrate some of these challenges, we outline two simple yet representative distributed, dynamic, data-driven applications. (i) In large-scale applications using sensor-driven distributed simulations, many sensors produce large volumes of data, the set of data-generating sensors (sources) changes, and the computational problem under study, or the granularity of data required, changes over time. (ii) Another prominent example involves the processing and handling of large volumes of scientific (e.g., biological) data placed into distributed data-centers (aka clouds), which in turn may need real-time analysis to guide the experiments or source(s) producing the data. In such applications, processing of data often needs to take place “in-flight”, say via streaming, or data-volume reduction may be required to store and manage the data effectively. The characteristics and requirements of such applications are common to many other analytics problems proposed in data-centers, such as information extraction from live video-streams.
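As a rough illustration of "in-flight" data-volume reduction, the sketch below consumes a stream of sensor readings and keeps only one summary per fixed-size window, so the full stream never has to be stored. The names `WindowSummary` and `reduce_in_flight`, and the choice of count/mean/min/max as the retained statistics, are assumptions made for this example and are not taken from any system discussed at the workshop.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class WindowSummary:
    """Hypothetical reduced record kept in place of the raw readings."""
    count: int
    mean: float
    minimum: float
    maximum: float


def reduce_in_flight(readings: Iterable[float], window: int) -> Iterator[WindowSummary]:
    """Yield one summary per `window` readings without buffering the stream."""
    if window <= 0:
        raise ValueError("window must be positive")
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for value in readings:
        count += 1
        total += value
        lo = min(lo, value)
        hi = max(hi, value)
        if count == window:
            yield WindowSummary(count, total / count, lo, hi)
            # Reset the running state for the next window.
            count, total = 0, 0.0
            lo, hi = float("inf"), float("-inf")
    if count:
        # Flush a final, partial window so no readings are silently dropped.
        yield WindowSummary(count, total / count, lo, hi)
```

Because the function is a generator over any iterable, it can in principle be fed from a live source; how much reduction is acceptable depends on the analysis the summaries are meant to support.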

Primacy of Dynamic Data: As illustrated in the applications above, the role of data, both static and dynamic, either as the driver of computation or as a first-class (primary) computational entity, is set to become central. As DDDI applications become pervasive, the importance of dynamic placement, management and scheduling of data will increase. For example, the objectives of a dynamic application, and thus its data requirements, may change during execution; resource availability and infrastructure QoS issues may make the ability to place data dynamically critical; or it may simply be cheaper to recompute the required data than to store or transfer it, something that cannot be determined upfront. Alternatively, data to be operated upon may not be available until an indeterminate later time in the execution life-cycle, thus rendering static scheduling decisions ineffective. In all of these cases, the ability to manage data dynamically is important.
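The recompute-versus-transfer trade-off can be made concrete with a deliberately simple cost comparison evaluated at run time, when the current bandwidth and recomputation cost are known. This is only a sketch under stated assumptions: `DataCosts`, `choose_strategy` and the single-number cost model (transfer time against recompute time) are hypothetical, and a real scheduler would also weigh storage cost, monetary cost and QoS constraints.

```python
from dataclasses import dataclass


@dataclass
class DataCosts:
    """Hypothetical run-time measurements for one required data item."""
    size_bytes: float
    bandwidth_bytes_per_s: float   # currently observed, may change over time
    recompute_seconds: float       # estimated time to regenerate the data
    available_locally: bool = False


def choose_strategy(costs: DataCosts) -> str:
    """Pick how to obtain the data, using only currently observed costs."""
    if costs.available_locally:
        return "use_local"
    if costs.bandwidth_bytes_per_s <= 0:
        # No usable link right now, so transfer is not an option.
        return "recompute"
    transfer_seconds = costs.size_bytes / costs.bandwidth_bytes_per_s
    return "transfer" if transfer_seconds <= costs.recompute_seconds else "recompute"
```

Calling `choose_strategy` again whenever the measurements change is what makes the decision dynamic: the same data item may be transferred at one point in the execution and recomputed at another.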

There are several limitations in the current understanding and handling of dynamic data. For example, to address the challenge of “Big Data”, several programming models, such as MapReduce and its variants, have been developed. However, most of these programming models (and associated tools and services) typically assume that the underlying data-set is “static”, i.e., it does not change during execution. Thus, performance, deployment and execution decisions, once made, are typically assumed to be valid throughout the life-cycle/execution of the application. This situation is analogous to, and reminiscent of, the first generation of distributed applications that inherited the static execution models of legacy applications. It is only with the right tools, abstractions and runtime support that a subsequent class of distributed applications has been able to break free of the static (resource) usage model and dynamically optimize resource utilization, with a concomitant performance enhancement.

The overall aim of the proposed sub-plot is to begin to understand the landscape – as defined by the programming models and abstractions, runtime and middleware services and the computational infrastructure – of dynamic, distributed data-intensive applications. Specifically, the aims of the DDDI sub-plot can be categorised as follows:

  • Explore and analyse the landscape of DDDI applications. Understand the current state of development, deployment and execution of such applications.
  • Programming Models and Abstractions: Not only is there inadequate support for dynamic data, there is also a fundamental lack of programmatic and systems-level support for data-driven and autonomic applications. Abstractions and programming models are required to support the formulation of components and applications that are capable of correctly and consistently adapting their behaviors, interactions and compositions in real time in response to dynamic data. Which programming models are required, and how should they be offered?
  • Runtime Execution and Middleware Services (REMS): What are the crosscutting design objectives for which general-purpose abstractions must be found? How can existing REMS be extended to support these design objectives? How can REMS be developed and integrated with applications to preserve performance and resilience, yet not be tied to a specific infrastructure and support emerging distributed platforms?
  • Computational Infrastructure: How should experimental and production infrastructure be designed to support data-driven dynamic instantiation, configuration and aggregation of distributed resources and varied data sources? Analyse the support infrastructure for DDDI applications. What are the current capabilities and what are the missing pieces?

Understanding the answers to these questions will serve as the critical first step in framing the primary questions that we will ultimately seek to address in the Dynamic Distributed Data-Intensive Abstractions theme that will be proposed as an outcome of this sub-plot and the DPA theme.

References

[1] The Fourth Paradigm: Data-Intensive Scientific Discovery, fourthparadigm/default.aspx
[2] NSF DDDIAS Workshop (2006)


This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.