Capturing e-Science Fundamentals

The research to develop e-Science is a continuous process drawing on advances in Informatics and digital systems technology. It develops and applies new methods and technologies in the context of specific interdisciplinary research to achieve new research capabilities in that context. In an ideal case it then extracts generalisations which are applied in other contexts, i.e. it has multi-disciplinary effects. If these have a well understood foundation, an understood range of application and can be expected to continue until greater insight replaces them, then they could be termed "e-Science Fundamentals".

  • comment (Malcolm Atkinson) The term "e-Science Fundamentals" may not be the best one. For example, there may be pragmatic generalisations that apply given the current state of technology. Being pragmatic and specific to current technology they would not be fundamentals; however, it is feasible that their dependence on specifics of the technology can also be understood. Then they can be reformulated in terms of the relevant properties, so that they always apply. That reformulation would again be a fundamental. Indeed, it could be required that a fundamental be invariant over the evolution of technology.
  • Carole Goble has suggested (22 April 2009) that they should be called Computing Science Fundamentals for e-Science. This resurrects the term used during the e-Science programme for some EPSRC funded projects. My concern with this nomenclature is that fundamentals may be recognised by mathematicians, statisticians and engineers and potentially by other disciplines, though I concur with the view that the progress towards fundamentals will be driven by computer scientists in most cases.
  • comment (Andrew M): the concerns I'd characterize as security cut across many of these topics. Insofar as e-Science covers collaborating people, cooperating machines, systems, and networks, the fidelity and integrity of data, the privacy of data subjects, and so on, it embodies much of the broad topic of information security, but it goes wider to the organizational context and what is known in Govt. circles as information assurance. IA is surely fundamental to the e-Research endeavour, but it isn't by any means constrained to Computer Science. Although these security-related issues can be distilled into concerns under the individual headings, they necessarily inter-relate.
  • comment (Malcolm): I agree with Andrew that my functional decomposition below omits issues best considered as cross cutting. What other cross-cutting issues are there?
  • Suggestions for a name that captures the idea and achieves wide support would be very much appreciated. It is proposed that a theme on this topic be organised (by me) at the e-Science All Hands Meeting in Oxford in December 2009. I would be happy to use a widely accepted name.
  • comment (Malcolm) I like the notion of "Foundations of e-Science" but I can't envisage building them in a workshop, though the two workshops could review progress in building foundations and the plans for building more of them.
  • comment (Richard S) call it foundations, fundamentals or what you like I am still not sure what it actually buys us. I can give you many examples of design patterns that repeat across disciplines but am not convinced that they represent truly fundamental e-Science. They are just examples from years of being involved in building e-Science applications/infrastructures. Thus the security models we have been working on can be generally applied and indeed are being used in nanoCMOS electronics // clinical trials & e-Health // geospatial systems // arts and humanities etc etc. Is it fundamental e-Science or just knowledge of how decentralised and/or centralised security models work? I am not sure...
  • comment (Malcolm) I think those frequently used patterns that are discovered by experience would become fundamentals if they were then described independently from implementation and contextual detail in such a way that anyone in the future wishing to apply them would (a) understand whether they would benefit their work, and (b) be able to apply them successfully without reference to others who have already used the method. I think this extraction from the original context requires intellectual effort and computing science insights/principles.
  • comment (Malcolm) One excellent outcome of the workshop would be research theme proposals that could be considered for funding in the next two years and be selected in the Autumn 2009.

It is often the case that sitting in one theme or workshop we hear the recognition of requirements or the proposition of solutions that are familiar from other themes or workshops.

Some of these may be common sense or a process that each group has to go through. Others may be superficially similar, but when you understand the application they differ fundamentally.

We hypothesise that once these ‘false positives’ have been filtered out there remains a significant body of ideas, concepts, methods and supporting tools that are applicable across several disciplines, research communities or stages of research. Precise formulations of these would be "e-Science fundamentals". Many will be based on informatics insights, some will depend on other disciplines, such as social or behavioural models, mathematics, statistics or engineering. Some may require deep investigation into supporting theory.

We believe that identifying and articulating these fundamentals, thereby supporting their transfer across communities, will have wide benefits.

To structure discussion we can partition the domains in which we look for fundamentals; the following list is a first stab at that structure. It requires refinement and discussion, and each domain will eventually be decomposed into more specific questions. Those of the form * n.m are intended to illustrate an element of such a decomposition. Please add your own examples. Answers to sufficiently explicit questions would deliver the sought-for fundamentals. Progress towards them is expected to be incremental and to require research with the intent of discovering fundamentals.

  • Comment (from Perdita Stevens, added by jcheney): I'm not sure, but I have the feeling that more concrete would be good. E.g., rather than discussing these fundamental issues in the abstract, pick a real e-science case study that demonstrates the problems that concern people in e-science, and invite people to come and talk about techniques from SE or elsewhere that might help with that particular case study. I pessimistically imagine it deteriorating into a session where everyone moans about how hard it all is and how nobody else has ever solved exactly these problems, without ever being precise enough about what the problems are for anyone to contradict that assertion.

A Complexity and Dependability

1. Computational models are built by composing submodels and no individual can understand the whole system. They cannot always be tested for competence as they are used where experimental validation is impossible. How can models be composed so that the validity of a result they produce can be stated with confidence?

  • Comment (jcheney): The above problem is basically a restatement of the central problems of the fields of software engineering and software verification. I doubt that novel solutions will be developed by researchers in other disciplines who are unfamiliar with the vast body of research on these topics.
  • Comment (Malcolm) In my view computer science research would be a major contributor to the solution, but mathematics and engineering may also have relevant strategies. The current scope of software engineering and verification may not address the sort of problems that may occur. I construct a caricature to illustrate: a group of building engineers develop a model for heat and gas flow through windows to improve building design. Another group are modelling fires in buildings. They use (call as a service) the first model to model rooms and doors that contain windows. The original model worked well for domestic ambient temperatures and pressures. However, the iterations of the fire model continue to use it long after temperatures or shock waves would have shattered the window. I am not aware of current S/E techniques that would spot that the building model would take the window model outside its applicable context.
  • Response (jcheney): Sure, of course in general software engineering/verification needs to be complemented by knowledge of the application domain. The problem in your example happens in nearly every large software project, many of which fail because of a failure to take such considerations into account. A famous (and probably tired) example is the Ariane-5 accident, in which code reused from Ariane 4 overflowed when converting a 64-bit floating-point value to a 16-bit integer, leading to the unmanned rocket self-destructing shortly after launch. (The well-known metric vs. imperial unit mismatch was a different failure, the loss of the Mars Climate Orbiter.) Google for "software failure" for many, many other examples - the book "Software Failure: Management Failure" by Stephen Flowers is an excellent source IIRC. My point is the converse, though: I am not claiming that computer scientists have solved the software engineering or verification problems or that mathematics, engineering or other disciplines are irrelevant to these problems, only that researchers in other disciplines risk either re-inventing the wheel or wasting their time if they ignore the 40+ years of research on these extraordinarily difficult problems. I guess what I am really asking for here is a clarification: do new challenges to software verification/engineering arise in eScience settings specifically, or do software engineering or verification problems in eScience differ from more traditional problems only in that more domain knowledge/participation is needed from scientists who will rely on the systems being developed? I don't know any evidence to support the former claim, but would be happy to hear some.
  • Response (Malcolm): I believe that taken in the long term (but not necessarily in a time-constrained project) the researchers who use computational models and data sources would like to avail themselves of the benefits of anything that would help, and that certainly includes "40+ years" of CS research that would be relevant. The nub of the issue is understanding what would help if you invested the time trying it. I think we should invest time doing just that; it is what I have in mind - consider a project where a group of software engineers and a group of researchers with their complex models work together to see if they can exploit the software engineering research in a full-scale model. I would call that a hunt for a particular e-Science fundamental. If they then characterise their solution in papers and others with similar challenges in their modelling try to use it and refine it until it is understood independent of context and application; then we would have captured one of these fundamentals. It remains for me a challenge to understand how to organise that hunt well. I find it difficult to know with confidence how to wisely direct such a hunt. The models exist already (e.g. a leading researcher at Boeing told me it takes 25 years for a new (numerical) model to be trusted by the engineers; e.g. a director from NCAR in a talk a few years ago said they had embarked on building a new atmospheric model and that it would involve >100 organisations and take >10 years). I suspect they will retain many numerical libraries and depend on complex data and dynamic library contexts when the models are used. Most work by most researchers is incremental in the context of existing models. I wore the label "prof of Software Engineering" for 20 years and I am unsure of how to start. 
This may be because I've not kept pace with progress and don't understand how much is now possible; but I also suspect that the current software engineering methods may experience difficulty because the models and understanding co-evolve, because the models are often a primary tool for expressing the understanding, and because they are used in complex environments of evolving software and parameter estimates, i.e. independently stating correctness criteria may be intractable. I am also unsure whether validating systems of this scale is feasible, though I am aware of very large hardware verification achievements. So I would suggest that we will only know whether new challenges arise by pursuing this kind of hunt. But I would not presume that I correctly characterise what would most help complex system modellers. I am personally impressed by people reformulating models in terms of modern computing science notations, e.g. stochastic processes, to make them succinct and more easily understood. I'm not sure how much that has helped the biologists so far, but I feel sure it must have. However, I think many sciences already have very good notations for their phenomena, yet the manifestations of the phenomena are extremely complex. So another question arises: "What is the most important fundamental to hunt?". Not that I would expect there to be only one hunt! I hope that the workshop will make progress on both fronts: (1) deciding what are likely fundamentals that should be hunted because in some sense they are important (and achievable) and (2) sketching how to organise a sample of those hunts.
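
The window/fire caricature above can be made concrete with a small sketch. It assumes a model can publish the input ranges over which it is believed valid, and that a wrapper checks those ranges on every call. The names `BoundedModel`, `OutOfDomainError`, `window_heat_flow` and `fire_step`, and all coefficients and ranges, are hypothetical and purely illustrative; this is one possible way to surface an out-of-context call, not an established technique from the discussion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class OutOfDomainError(ValueError):
    """Raised when a model is called outside its declared domain of competence."""


@dataclass(frozen=True)
class BoundedModel:
    """A model bundled with the input ranges over which it is believed valid."""
    name: str
    func: Callable[..., float]
    domain: Dict[str, Tuple[float, float]]  # parameter name -> (low, high)

    def __call__(self, **inputs: float) -> float:
        # Refuse to evaluate the model when any input leaves its declared range.
        for key, (low, high) in self.domain.items():
            value = inputs[key]
            if not low <= value <= high:
                raise OutOfDomainError(
                    f"{self.name}: {key}={value} is outside [{low}, {high}]"
                )
        return self.func(**inputs)


# Hypothetical window model: heat flow through a pane, proportional to the
# temperature difference. The coefficient and ranges are invented for illustration.
window_heat_flow = BoundedModel(
    name="window_heat_flow",
    func=lambda inside_c, outside_c: 5.0 * (inside_c - outside_c),
    domain={"inside_c": (-10.0, 60.0), "outside_c": (-30.0, 50.0)},
)


def fire_step(room_temp_c: float, outside_c: float = 10.0) -> float:
    """One step of a toy fire model that calls the window model as a service."""
    return window_heat_flow(inside_c=room_temp_c, outside_c=outside_c)
```

With these assumptions, `fire_step(20.0)` evaluates normally, while `fire_step(800.0)` raises `OutOfDomainError` instead of silently returning a number for a window that would no longer exist. The hard part, which the sketch does not address, is knowing what the domain of competence actually is.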

2. Data is collected into collections from many sources, with differing procedures, instruments and standards. Data is composed from many such collections as input to analyses in data-intensive science. How can data be composed so that the validity of results produced when they are analysed can be stated with confidence?

  • Comment (jcheney): The above problem seems to me to be an accurate statement of what work on data provenance is really about.
  • Response (Malcolm): That would make it a candidate for being considered an e-Science fundamental in my mind. Your theme has made remarkable progress. Is it clear what else needs to be done?
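
As a minimal illustration of the provenance idea in question 2, the sketch below assumes each value carries the identifiers of the collections it derives from, and that any derived result takes the union of its inputs' identifiers. The names `Measurement`, `from_collection` and `mean`, and the collection identifiers used in the usage note, are hypothetical; real provenance models record far more (procedures, instruments, versions).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Measurement:
    """A value tagged with the collections it was derived from."""
    value: float
    sources: FrozenSet[str]


def from_collection(collection_id: str, values: Iterable[float]) -> List[Measurement]:
    """Tag raw values with the (hypothetical) identifier of their collection."""
    return [Measurement(v, frozenset({collection_id})) for v in values]


def mean(items: Iterable[Measurement]) -> Measurement:
    """Average the values and propagate the union of their sources.

    Assumes at least one item is supplied.
    """
    items = list(items)
    total = sum(m.value for m in items)
    sources = frozenset().union(*(m.sources for m in items))
    return Measurement(total / len(items), sources)
```

For example, averaging `from_collection("station_A", [1.0, 3.0]) + from_collection("station_B", [5.0])` gives a result whose `sources` field names both collections, so a later reader can see which inputs a conclusion depends on.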

3. Computational models are used to extract significant data from signals. Data are used to calibrate and provide coefficients in models. There are potential feedback loops here that could amplify error. How can models and data be composed so that the results produced have a validity that can be stated with confidence?

B Computation

1. There are many computational models (numerical, stochastic, process, logical, etc.), which were built with current methods and libraries and shaped by assumptions about available computing architectures. Each model is represented by a large body of code that was developed incrementally. Reformulating models and writing replacement code requires a great deal of skilled labour, and as the process is potentially error prone, it takes many years before the replacement is trusted. Is it possible to automate or computationally assist significant parts of the reformulation and replacement process?

2. Are there common patterns in computational models that can be supported by abstractions that can be implemented efficiently across a range of architectures, such that modellers will actually use them? How high-level can these abstractions be made while still delivering utility? Will such a strategy ameliorate the model replacement process in future and can it improve the quality of the resulting models?

3. Models can benefit from reformulation in new forms, e.g. as ensembles, or in new notations, e.g. stochastic process descriptions, i.e. computational thinking can inspire new approaches. Are there ways of recognising when such reformulation will achieve benefits and of informing the choice of notations and meta-models?

4. What are the best ways of describing models to (a) encourage their correct use by humans and (b) enable automated processes such as composition and validation? For example, can a model’s domain of competence be well characterised?

5. Data are increasing in volume and complexity and the repertoire of analyses and transformations is also growing. Many scientific tasks are defined by composing several computational and data handling steps, often by using workflow languages or scripts. Can these composition mechanisms be partially automated, better optimised and more completely validated? Does this require new composition notations or can previous investment in computational skills and scripts be retained?

6. How should the workloads executing computations be mapped onto available hardware resources? How can urgent computing be accommodated, for instance to run models during a medical intervention or emergency response to a fire? How should the resources be organised to make this possible?

7. Computations often suffer from hardware, system and operational errors. Can diagnostic and recovery mechanisms be improved to make results more trustworthy and to minimise the waste of resources?

8. Once a computation has been run how should its results be annotated with provenance data? The challenges here are the large volume of available provenance features and the uncertainty as to which of these features will prove vital for future understanding.

C Data

1. The requirements for scientific data range from very large reference collections, such as those held by NCBI, EBI, PDB, BADC, NCAR, ESA, NASA, SDSS, … to those collected by individual researchers and small groups, some of whom do it as unpaid work. Are there organisational and system principles that would be helpful across this range? Or are there ways of characterising data collections, e.g. published referenced collections, geo-spatial time series, exploratory collection, derived curated collection, etc. such that useful principles and tools that support them can be defined and sustained?

2. What automation is beneficial in the description, structuring, cleaning, standardisation and organisation of scientific data collections?

  • 2.1 Can the algorithms developed by Wenfei Fan for automatically cleaning business data, e.g. removing data entry errors, be deployed to clean research data in preparation for data mining?

3. What descriptions of data best serve (a) scientists finding and interpreting the data, (b) automated access and integration of data and (c) exploitation of data?

4. How should requests for the use of data from multiple, heterogeneous sources be described? To what extent can this relieve the researcher from detailed access mechanisms, data representations and access policies? To what extent does it support optimised distributed execution?

  • comment (Malcolm) There seem to be two prevalent strategies: extend data query languages or develop workflow notations. Are there other serious contenders? Are these branches in some sense equivalent?
  • 4.1 What is the appropriate notation for querying data held in files?
  • 4.2 Large volumes of data may be handled by streaming data between stages of data extraction, transformation, combination and analysis. What are the circumstances when data streaming is beneficial?
  • 4.3 If data streaming is in use, analysis algorithms that work on the stream in one pass are needed. To what extent can research questions be formulated so that streaming algorithms can be used?
  • 4.4 Real research data does not normally correspond either to synthetic distributions or worst-case patterns. How should the complexity of the algorithms being used at the stages of processing be described relative to characteristics of actual workloads?
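
To make "algorithms that work on the stream in one pass" concrete, here is a small sketch using Welford's online algorithm for mean and sample variance. It is a standard textbook technique chosen for illustration, not something proposed in the discussion above; the function name `streaming_mean_variance` is hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over the stream.

    Uses Welford's online update, so only three numbers are kept in memory
    regardless of how long the stream is.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance
```

The input can be a generator that is consumed once, e.g. `streaming_mean_variance(float(line) for line in open_stream)` where `open_stream` is whatever iterable the data service provides. Statistics such as an exact median do not fit this pattern without storing the data or accepting an approximation, which is one way the choice of research question affects whether streaming is usable.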

5. How should requests formulated as in (4) be executed? All levels, from planning the deployment of operations across resources to managing data movement, require consideration. Can this optimisation problem be successfully partitioned?

6. How should the workloads corresponding to the update, archiving, replication and execution of requests be mapped onto available hardware efficiently? How must the data services be organised to permit this?

  • 6.1 If data is stored in a "cloud" of data servers, e.g. a GrayWulf, what are the algorithms that should be applied to distribute (increments of) incoming data over the data servers? What matching algorithms would then be used to distribute data requests and analyses over these data servers to optimise throughput or response times? Note this was done by hand in recent experiments by Szalay et al.
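
As a baseline for question 6.1, the simplest automatic placement is to hash a record key and take it modulo the number of servers. The sketch below shows that baseline only; it is not the scheme used by Szalay et al., and the names `server_for` and `distribute` are hypothetical. It ignores data locality, server load and query patterns, which are exactly the things the question asks about.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def server_for(key: str, n_servers: int) -> int:
    """Map a record key to a server index with a stable hash.

    hashlib is used rather than the built-in hash(), which is randomised
    per process for strings and so would not give repeatable placement.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def distribute(keys: Iterable[str], n_servers: int) -> Dict[int, List[str]]:
    """Group incoming record keys by the server they are assigned to."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        placement[server_for(key, n_servers)].append(key)
    return dict(placement)
```

A request for a known key can be routed with the same `server_for` call. One known weakness of modulo placement is that changing `n_servers` relocates most keys; consistent hashing is a common alternative when servers are added or removed.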

7. Data collections are continuously changing, not only through update against an existing organisation, but also through revisions of structure and representation to reflect advances in understanding and data gathering. How are these essential changes best accommodated while retaining as much of the previous investment in workflows using these collections?

8. How should the results of a data request be annotated with provenance data?

D Collaboration

1. Multi-site and often international collaborations engage in scientific research projects. The partnerships may be dynamic in individuals, organisations and goals. How are these collaborations described and managed? How do they specify their rules and implement them to balance convenience for researchers with obligations of privacy, safety and constraints such as the limits on what resources may be used for?

2. Collaboration depends on communication and sharing information. How can that communication and information sharing be best facilitated? For example, how can computational methods be used to assist in the recognition and adoption of common vocabularies, standards and representations?

3. How should responsibility (leading to credit or blame) be tracked and apportioned in such collaborations? E.g. when many people contribute to a data collection, a body of software or a composition of models and software.

E Human-Computer Interaction

1. Research depends on much highly skilled labour. The goal of the computational support is to increase the effectiveness of that labour. How do we measure the effectiveness of the research into computationally supported aids for researchers?

2. Most scientists find most of the available tools difficult to understand and use. One strategy is to tailor tools and data for their particular work. Such tailoring takes much effort by software and data engineers and so the strategy does not scale. Can this tailoring be automated? If so what requirements does this place on the generic tools and the data services?

  • Comment (jcheney): There seems to be a general problem in computer science that most research into human-computer interaction focuses on broad usability issues and seldom provides help with how to make "high tech" being developed by other researchers more usable. For example, it is hard to publish research about usability and programming languages (or programming environments) in conferences/journals about HCI because the sample sizes are considered too small to draw meaningful conclusions, and it is hard to publish usability research in programming-languages conferences because it is not considered a "hard" (rigorously definable/solvable) problem. This makes it difficult for the best researchers in either field to justify spending time on this kind of problem.
  • Comment (Andrew M): I agree with James. There are many areas in which usability of the APIs, and the conceptual model of the software, is problematic. This is shown in sharp relief in security issues - developers may be diligent, but misunderstand and make unsound design decisions. The result is broken security. It also seems much easier to identify what's wrong with HCI than to work out how to do it right.
  • Comment (Neil Chue Hong): I also believe that much of the practice of HCI focuses very much on the software itself, and does not address the broader context of the development cycle itself. If we consider eScience to be different from previous models of research supported by computational software in that there is a closer interaction between researchers, developers and others in the project team then I believe that the communication of requirements and software development stage is important. I'm not saying that this is a new issue - it has been covered many times by software engineering theory - but that in eScience this becomes more common. My question is how this relates to any fundamental thing in terms of e-Science? Is it actually a fundamental related to e-Science team success?

F Society and Ethics

1. The general public and professionals who might benefit from the results of computational and data intensive science find it difficult to judge results and interpret them in decisions and policy. How should the trustworthiness of computationally enabled methods be established? How can professional decision makers and policy makers be helped?

2. Data used by researchers often have legal and privacy constraints. How can these be reliably and demonstrably met? How should the balance between important discoveries and personal integrity be maintained given global scientific research?

  • 2.1 Many groups look at how to pseudonymise data, e.g. medical records so that they can be used for epidemiology and analysis of the effectiveness of current services, or web access logs to enable social scientists to test hypotheses about user and group behaviour. Many groups develop algorithms for this and take steps to establish their efficacy with ethics committees and watchdog bodies. Is there a set of algorithms that is always sufficient? Can their efficacy be established once and for all?
  • comment (Andrew M) I'm inclined to think that they can, and should. There won't be a single kind, but there can be a small menu of options. I fear - with near certainty - that some of the pseudonymisation solutions currently in use are not cryptographically sound, and could be trivially broken. This is a pretty well-contained problem, and ought to be solvable.
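
Andrew's worry that some schemes "could be trivially broken" can be illustrated with a sketch. It assumes, hypothetically, that identifiers come from a small space (here four-digit numbers) and compares an unkeyed hash with a keyed HMAC from Python's standard library. The function names and the identifier format are invented for illustration, and neither option is claimed to be what any existing project uses.

```python
import hashlib
import hmac
from typing import Iterable, Optional


def naive_pseudonym(identifier: str) -> str:
    """Unkeyed hash: anyone can recompute it for every possible identifier."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def keyed_pseudonym(identifier: str, secret_key: bytes) -> str:
    """HMAC-SHA256: cannot be recomputed without the secret key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def break_naive(pseudonym: str, candidates: Iterable[str]) -> Optional[str]:
    """Recover an identifier from an unkeyed pseudonym by enumeration."""
    for candidate in candidates:
        if naive_pseudonym(candidate) == pseudonym:
            return candidate
    return None
```

Enumerating `f"{n:04d}" for n in range(10000)` recovers any identifier hidden by `naive_pseudonym`, whereas the same enumeration fails against `keyed_pseudonym` unless the key leaks. A keyed pseudonym still does not prevent re-identification through the other attributes in a record, so this addresses only one part of the question in 2.1.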

Monday 11 May

  • 10:30 - Registration
  • 11:00 - Plenary to hear/review a priori collected views, organise them and fill in gaps. (steps 1 & 2 and first plenary of step 3)
  • 12:30 - Lunch
  • 13:15 - Breakouts discussing questions 2 & 3.
  • 15:00 - Tea/coffee break
  • 15:15 - Plenary reporting back; review whether consensus is emerging. Attempt to achieve agreement on items 4a, 4b & 4c.
  • 17:00 - Reception, with visiting panel and SAB members
  • 19:30 - Dinner in the Raeburn Room, Old College with visiting panel, members of SAB and selected guests, extract their eSF, sell them yours!

Tuesday 12 May

  • 09:30 - Plenary: Recap and analysis of previous day
  • 11:00 - Break
  • 11:15 - Return to meeting with panel or to independent small-group meetings.
  • 12:30 - Lunch + further opportunity to meet panel
  • 13:30 - Plenary &/or breakout sessions
    • a) to review and revise yesterday's conclusions in the light of discussions and intervening thought.
    • b) to address step 5.
  • 16:00 - End of Meeting (coffee/tea available from 15:00)

Purpose of Meeting

Engage the expertise of the gathered community of e-Scientists in answering the following questions:

  • 1. Is the existence of "e-Science fundamentals"
    • 1. a mirage
    • 2. unproven, or
    • 3. demonstrated already?
  • 2. What are the characteristics of e-Science fundamentals?
  • 3. What are examples of ideas that may emerge as e-Science fundamentals?
  • 4. What are good strategies for discovering and exploiting e-Science fundamentals?

Suggested process for conducting the meeting

Step 1: Participants prepare their a priori views

  • 1. What e-Science fundamentals has my theme/my work/my workshop derived from work in other themes/work/workshops?
  • 2. What e-Science fundamentals can I extract from my theme/my work/my workshop and propagate?
  • 3. What is my view on the "e-Science Fundamentals exist" hypothesis?

Step 2: These are collated (preferably before the meeting) on a wiki or as a collection of presentations.

Step 3: Plenary; break out into smaller groups; plenary addressing questions 2 & 3 together, with each group producing a report (as a presentation)

Step 4: Attempt in plenary to get a consensus on 1, 2 and 3, retrying in smaller groups if consensus not emerging, delivering:

  • 1. characteristics (what to look out for)
  • 2. features and criteria that would lead to exclusion from putative e-Science fundamentals list.
  • 3. list of putative e-Science fundamental topics for further study.

Step 5: Discussion of good strategies for discovering and handling e-Science fundamentals:

  • 1. e.g. Choose between laissez faire, where we do nothing more than hold the usual meetings
  • 2. Wanted posters asking the TLs and workshop leaders to be on the lookout for e-Science fundamentals
  • 3. Employ "Researchers in Residence" trained to spot, capture and harness e-Science fundamentals.
  • 4. any other good ideas.