This wiki service has now been shut down and archived

Thinking about DIR


Data-Intensive Research Workshop: Soliciting information and Ideas

This is an invitation to anyone planning to participate in the DIR workshop, already participating in it, or joining us by webcast.

  1. We would like you to supply us with information about yourself here so that we can shape the DIR workshop, particularly the breakout groups.
  2. We would like you to give us ideas so that they can be taken up in working groups and in the workshop reports. We undertake to properly acknowledge contributions.
  3. We invite you to bring an A3 poster about your work to the workshop. Posters can be displayed from Tuesday to Friday in the areas used for breaks and mingling; we hope they will attract interest from, and interaction with, other participants. (If you have a poster, please upload it and link to it from an item here, so that those not at the workshop can see it.)

If you are doing research that uses data-intensive methods or may require those methods, please add a short statement (<= 500 words) below that indicates what already works well for you and, if you can, identifies new data-intensive capabilities that would significantly advance your domain of research, perhaps even yielding a breakthrough. For example, would you like better ways of coping with complex data, heterogeneous data, poor-quality data, high-volume data, etc.? Would you need better methods of designing data collection systems? Would you like a new class of analyses or visualisations? Would you like better tools for finding data or for describing data? …

If you are doing research that is creating or enabling data-intensive methods, please add a short statement (<= 500 words) below that indicates what you have already achieved and what DIR it makes possible. Identify what you hope it will achieve in the future and why this will advance data-intensive research. Identify breakthroughs that would enhance the power or impact of your work. Provide examples of the new DIR opportunities this may open up.

Please use a format similar to the examples below, so that the items appear in the content table and have URLs. Feel free to include citations and links to more information about your work or the challenges. Please sign the items you add (you may add more than one) so that we can acknowledge your contribution.

Malcolm Atkinson Understanding Data Use

A key issue is to improve data use in balance with investment in data creation or collection, and data preservation or curation. These activities are important but have no value to research and society unless data is well used. I would like to promote and achieve research which creates a better understanding of how we may use data and a significant advance in the tools which use data.

Today’s challenges are urgent and intellectually demanding; making the best use of the world’s growing wealth of data is a crucial strategy for addressing them. Data are the catalysts in research, engineering and diagnosis. Data underpin scholarship and understanding. Data fuel analysis to produce key evidence for decisions and supply the information for compelling communication. Data connect computational systems and enable global collaborative endeavours.

So data use is important (even the Economist thinks so), but precisely what should we do? It is possible to carry on working on each contributing area: data-analysis methods, data-query and workflow languages, data-cleaning and transformation methods, data visualisation, and so on. But these will not automatically integrate to achieve an overall benefit. They also need translation, through considerable engineering effort, to be deployed for all research needs at all scales, all levels of complexity and all rates of change. They need to co-evolve with advances in data users' understanding, skills and practices, so that they are actually used. This combination seems to require new ways of thinking. We need to analyse and characterise the phenomena critical to successful data use. We need to understand how computing theory addresses those phenomena and how the thinking tools that the theory provides can be translated into genuinely improved practice. The goal is to move from one-off successes, based on clusters of well-integrated, intense intellectual effort, to data-intensive methods and tools that are widely applicable. This is reminiscent of the early days of software engineering, when there were beacons of success in a sea of growing requirements that couldn't be addressed reliably.

To get started, Dave De Roure and I went on a three-week tour of the USA to see what others were doing. That led to a report we're still writing; a draft is here. It also led to this workshop. --MalcolmAtkinson 12:17, 5 March 2010 (UTC)

Malcolm Atkinson Distributed Data Use

Data is frequently complex and, as the use of data becomes more pervasive, we can anticipate its complexity growing in several ways:

  • the information captured and encoded by each data user, group or project gets more complex as they understand more about the real-world system they study, model or manage;
  • the number of people, groups and organisations collecting and using data is growing rapidly; they address their specialist needs first and consequently develop, independently and hence differently, the ways in which information is recorded, organised and managed. This autonomous activity is necessary for rapid progress and agility, and it can be key to their success; and
  • more and more problems require multiple viewpoints of multiple interacting systems to be understood, modelled and managed; this is reflected in a requirement to use data from multiple, distributed and independently evolving data sources.

Initiatives that encourage common representations or mutually agreed interchange formats, such as the GO ontology, the INSPIRE directive and the IVOA, are an important element in addressing this challenge, as they provide common reference frameworks to which the diverse data can be related and a key means of communicating unambiguously. They never, however, remove all the diversity and autonomy. That remains because it is a legacy of past decisions, and a consequence both of the independent development of further insights in many locations and of requirements for focus and agility that postpone consideration of existing (often complex) agreements. Fundamentally, the common agreements can never keep pace with the frontiers of innovation and research, as they depend on consensus.

Therefore, working with diverse, distributed and independently managed (owned) data resources will always be a requirement for data use. To help with this we have been developing two technologies that provide frameworks for using such distributed, heterogeneous data:

  • OGSA-DAI, which provides a data-streaming architecture for accessing and integrating distributed data. It provides and automates all of the data plumbing, and supplies a library of ready-made components to plug together, tools to help application developers build and deploy graphs of these components, and a number of extensibility mechanisms that allow the range of applications to be progressively extended.
  • ADMIRE, which is a research project into how to make the whole process of accessing, obtaining, cleaning, integrating, analysing and presenting data easier and more scalable. Its prototypes are built using many other things. Like Meandre, it aspires to drive many of the details of these processes, and their optimisation, from high-level descriptions. We're not there yet!
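
The idea of plugging streaming components together into a graph can be illustrated with a minimal Python sketch. This is not OGSA-DAI's or ADMIRE's actual API; every name here (`source`, `rename`, `keep_if`, `union`, `connect`) and the sample records are hypothetical, chosen only to show two independently organised sources being mapped to shared field names and integrated one record at a time.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def source(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical stand-in for a remote data resource: yields one record at a time."""
    for row in rows:
        yield dict(row)


def rename(mapping: Dict[str, str]) -> Stage:
    """Build a component that maps a source's local field names to shared ones."""
    def stage(records: Iterator[Record]) -> Iterator[Record]:
        for rec in records:
            yield {mapping.get(key, key): value for key, value in rec.items()}
    return stage


def keep_if(predicate: Callable[[Record], bool]) -> Stage:
    """Build a component that passes on only the records satisfying the predicate."""
    def stage(records: Iterator[Record]) -> Iterator[Record]:
        for rec in records:
            if predicate(rec):
                yield rec
    return stage


def union(*streams: Iterator[Record]) -> Iterator[Record]:
    """Integrate several streams by emitting their records one stream after another."""
    for stream in streams:
        yield from stream


def connect(records: Iterator[Record], *stages: Stage) -> Iterator[Record]:
    """Plug components together: each stage consumes the previous stage's output."""
    for stage in stages:
        records = stage(records)
    return records


# Two independently managed sources that record the same kind of information
# under different field names (illustrative data only).
site_a = [{"station": "A1", "temp_c": 11.5}, {"station": "A2", "temp_c": -3.0}]
site_b = [{"id": "B7", "temperature": 8.25}]

integrated = connect(
    union(
        connect(source(site_a), rename({"temp_c": "temperature"})),
        connect(source(site_b), rename({"id": "station"})),
    ),
    keep_if(lambda rec: rec["temperature"] > 0),
)
result = list(integrated)
```

Because each component is a generator, no record is processed until `list(integrated)` pulls it through the graph, which is the sense in which the plumbing is streamed rather than staged in full at each step.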

OGSA-DAI, ADMIRE and Meandre will be on show in the Research Village on Monday. They raise three important questions:

  1. Should we be eating our own dog food and agreeing common ways of describing all of the constructs in these distributed data systems?
  2. Is there a way of re-using the huge effort in user-defined extensions in multiple distributed data systems?
  3. Will we be able to arrange this so that researchers in an application domain can do all the DIR they want without having to understand how the engine works?

--MalcolmAtkinson 13:03, 5 March 2010 (UTC)

Malcolm Atkinson What is data

Apparently lawyers try to differentiate between data and things like text, documents and images. For me as a computer scientist that doesn't make sense. I'd define data as:

Data are any digitally encoded information that can be stored, processed and transmitted by computers. They include:

  • collections of data from instruments, observatories, surveys and simulations;
  • results from previous research and earlier surveys;
  • data from engineering and built-environment design, planning and production processes;
  • data from diagnostic, laboratory, personal and mobile devices;
  • streams of data from sensors in the built and natural environment;
  • data from monitoring digital communications;
  • the data transferred during the transactions that enable business, administration, healthcare and government;
  • digital material produced by news feeds, publishing, broadcasting and entertainment;
  • documents in collections and held privately, the texts and multi-media ‘images’ in web pages, wikis, blogs, emails and tweets; and

  • digitised representations of diverse collections of objects, e.g. of museums’ curated objects.

I suspect I'll have missed some! The IDC estimates that we'll use (not store) 1.8 Zettabytes of data in 2011+. Only a tiny fraction of that will be research data. This means we have to go with the flow and adapt and exploit what is happening in the larger digital ecosystem during the digital revolution. How do we pick the relevant data for research and the useful commercial?

+ John F. Gantz, Christopher Chute, Alex Manfrediz, Stephen Minton, David Reinsel, Wolfgang Schlichting, and Anna Toncheva. The diverse and exploding digital universe. Technical report, IDC, March 2008. --MalcolmAtkinson 14:14, 5 March 2010 (UTC)

This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.