This theme aims to build a multi-disciplinary community that will launch a programme of data-intensive research. The programme will focus on discovering and implementing the best strategies for supporting data-intensive researchers, using the latest advances in relational DBMS technology. Many projects aim to advance data-intensive methods for specific applications; our theme differs in that it seeks to chart a path to methods and technologies that are relevant across multiple disciplines. We draw on database experience but expect results of wide applicability.
The theme was spawned by the recent Data-Intensive Research workshop at eSI; it focuses on a tractable subset of the topics covered by that workshop.
MEETINGS & WORKSHOPS
6th-7th October 2010 during the 4th Extremely Large Databases conference held at SLAC National Accelerator Laboratory, Menlo Park, California. Participants: Malcolm Atkinson, Martin Kersten, Alex Szalay and Chee Sun Liew.
Meeting 1: Biomedical imaging – 3rd-5th November 2010 at the e-Science Institute. Organiser: Malcolm Atkinson.
13th-18th November 2010 during SC10. Various activities.
Meetings 2 and 3: Earth systems, Environmental and Geophysical data (including seismology) – 22nd-25th November 2010 at the e-Science Institute. Organisers: Alex Szalay and Malcolm Atkinson.
Technology Workshop 1: 15th-17th February 2011 in the School of Informatics, University of Edinburgh. Subject: Data streaming strategies. Organiser: Malcolm Atkinson.
Technology Workshop 2: 16th-18th March 2011 at Johns Hopkins University. Subject: Data-Intensive Research - Statistical Databases. Organiser: Alex Szalay.
Technology Workshop 3: 12th-14th April 2011 at CWI. Subject: array databases. Organiser: Martin Kersten.
Final Outreach Workshop and Closing lecture: afternoon of 7th June to morning of 9th June 2011 at the e-Science Institute. Organiser: Malcolm Atkinson.
That event will be followed immediately by XLDB-Europe 2011 from 8th-10th June 2011. PC chair: Malcolm Atkinson.
INTRODUCTION AND BACKGROUND
There is today very wide agreement that data-intensive methods are key to advancing research in many disciplines (see Realising the power of data-intensive research, draft report). Such methods are expected to play an increasing role in providing evidence for well-informed decisions and policies. They are therefore of great scientific and social importance.
The growing wealth of data is manifest as increasing numbers of collections of data, varying from the major curated collections to localised assemblies of files. The former provide reference resources, preservation and computational access, whilst the latter are often structured as spreadsheets and stored on individual researchers' computers. Many of these collections are growing both in size and complexity. As digital technology and laboratory automation increase in speed and reduce in cost, more and more primary sources of data are deployed and the flow of data to and from each one is increased.
At the same time, a growing number of researchers and decision makers are both contributing to the data and expecting to exploit this cornucopia of data for their own work. They require new combinations of data, new and ever more sophisticated data-analysis methods, and substantial improvements in the ways in which results are presented. As more governmental data are released and as citizens become aware of the data-intensive methods, they expect to access and review both the data and methods, and to contribute their own data and studies.
This pervasive change, part of the global digital revolution, introduces a wave of data-driven approaches. It is so transformative that Jim Gray termed it ‘The Fourth Paradigm’. Current strategies for supporting it demonstrate the power and potential of the new methods. However, they are not sustainable, as they demand far too much expertise and help in addressing each new data-intensive task. This is reminiscent of the situation in IT that triggered R&D in Software Engineering: there were organisations that could undertake the tasks effectively and repeatedly, but others could not replicate their success. It was necessary to tease out the principles that underpinned the successes and, through clarification, articulation, education and tools, make it possible to replicate such successes widely with fewer demands for exceptional talents.
This theme will initiate such clarification, articulation, education and tools for data-intensive methods by taking a particular viewpoint and pursuing it aggressively. There are many approaches to supporting data-intensive methods, such as workflow technologies, augmented familiar tools and map-reduce algorithms over large-scale distributed systems. These will undoubtedly continue to contribute to data-intensive research and we will study our relationship with them. Our viewpoint is relational technology because it has very well developed scalable engineering, has been shown to be an effective platform for the reference resources, and has a mathematically sound theoretical foundation. We have access to expertise and open-source technologies in academia, on which we can draw for experiments and longer-term research. We expect to engage industry, which has leading players in relational databases and a significant interest in data-intensive methods for business, commerce, engineering, healthcare and government.
CURRENT RESEARCH CHALLENGES
The data used are complex for many reasons:
- the phenomena studied are themselves complex, having many different properties that are observed separately;
- the analysis is complex in terms of the set of steps required and the computational complexity of some of these steps;
- the means of collecting data and of modelling phenomena are many and varied; and
- data originate from different autonomous groups who collect data without knowledge of their eventual use.
There are several forms of change that need to be accommodated:
- the continuous or periodic acquisition of new data takes a different form from transactional systems - it is predominantly an addition process of primary data with complex consequential changes to derived data;
- the required structures evolve as the understanding and scope of the data-intensive research evolves; and
- new users, new uses and new alliances change the set of data that must be accommodated and the patterns of use.
The scale of the data raises further challenges:
- There is a straightforward challenge of storing sufficient data economically.
- There are significant challenges in moving large volumes of data.
- There is a pressing need for effective incremental systems for maintaining replicas and derivatives.
- Whenever the analysis algorithms require access to large proportions of the data, it is necessary to use balanced architectures that give the best throughput for a given expenditure and energy budget.
- Many of today's analysis algorithms cannot be applied to large data volumes, and a family of incremental alternatives is needed.
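As one small illustration of what an incremental alternative can look like, the sketch below maintains a mean and variance as new values arrive, rather than recomputing over the whole collection. It uses Welford's online method, chosen here purely as an example; the theme does not name a specific algorithm, and the class name `RunningStats` and its `add` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean and variance, updated one value at a time.

    Illustrative only: Welford's online method, not a technique
    prescribed by the theme.
    """

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        """Fold one new observation into the derived statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Second factor uses the updated mean, as Welford's method requires.
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; reported as 0.0 with fewer than two values."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


# Primary data arriving in batches; earlier batches are never re-read.
stats = RunningStats()
for batch in ([2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]):
    for value in batch:
        stats.add(value)
```

Each arrival costs constant time and memory, so the derived statistic stays current without a second pass over data that has already been stored.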
The increase in demand takes two forms:
- more users are making requests against the collections of data, and
- as users become familiar with the power of data-intensive methods they ask more and more sophisticated questions.
Data inevitably contain errors, which raises questions of quality:
- How to build technology that is robust in the presence of such errors;
- How to discover and communicate issues regarding the quality of the data that may have affected a user's request; and
- How to compose quality information to indicate the reliability of derived results.
The present strategies for data storage and data-intensive computation and communication will hit an energy wall if the increasing volumes and demands are met by scaling out the existing technology.
- e-Science Institute
- Background information to this Theme
- Data-Intensive Research meeting held in Leeds on 12th January 2010
- Data-Intensive Research: how should we improve our ability to use data? Meeting held in Edinburgh from 15th-19th March 2010
- Realising the power of data-intensive research – draft report of March 2010 meeting
- 4th Extremely Large Databases Conference held in California from 6th-7th October 2010. Presentations can be downloaded from the programme page