This wiki service has now been shut down and archived

Analysis Paradigms

From ESIWiki

Data-Intensive Research: Data Analysis: Soliciting comments

Please add any comments you would like to make about this theme after the introduction below by the organiser, Chris Williams. Comments can relate to a specific talk or breakout session, or be general. Please separate entries with headings (two = signs) that flag new topics, or subheadings (three = signs), and add your signature.

Data-Intensive Research: Data Analysis

Generally, the most important reason for collecting data is to analyze it, either to understand its structure or to make predictions. Although different disciplines have different specific analysis goals, common problems and solutions arise in the organization and analysis of data across a wide range of application domains. Data-centric thinking allows us to understand and gain strength from these similarities. Statistical and algorithmic analysis methods help us address the problem of “drowning in information, but starving for knowledge” (Naisbitt, 1982).

The analysis of data typically begins with data “cleaning”, for example dealing with missing or corrupted data. It is recognized that this step can take up 50-80% of the time in real-world data mining problems. Even where manual correction of the data is possible, it is prohibitively expensive for large datasets. This gives rise to challenges such as building tools that improve the automation of data cleaning, or integrating data cleaning steps into the analysis itself.
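As an illustration of what an automated cleaning step might look like, the sketch below parses raw text readings, marks unparseable or implausible values as missing, and fills the gaps with the median of the valid values. The function names, the plausibility bounds and the sample readings are all hypothetical, and median imputation is only one simple choice among many.

```python
import math
from statistics import median

def clean_readings(raw, lower, upper):
    """Parse raw text readings; mark corrupted values as missing (None).

    A value counts as corrupted if it cannot be parsed as a number, is NaN,
    or falls outside the plausible range [lower, upper] (an assumed rule).
    """
    cleaned = []
    for item in raw:
        try:
            value = float(item)
        except (TypeError, ValueError):
            cleaned.append(None)
            continue
        if math.isnan(value) or not (lower <= value <= upper):
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned

def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no valid observations to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical temperature readings with blanks, text codes and a sentinel value.
raw = ["20.5", "", "21.0", "n/a", "-999", "19.5"]
cleaned = clean_readings(raw, lower=-50.0, upper=60.0)
filled = impute_median(cleaned)
```

Keeping the "mark as missing" and "fill in" steps separate means the analysis can later choose to handle missing values itself rather than rely on imputed ones.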

Beyond data cleaning, data analysis can be divided into exploratory data analysis, descriptive modelling and predictive modelling. These techniques derive from statistics, machine learning and data mining. Exploratory data analysis refers mainly to visualization of datasets, descriptive modelling to methods such as clustering where the aim is to understand the inter-relationships between measurements (unsupervised learning), and predictive modelling to the supervised task of predicting one or more variables on the basis of others.
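The distinction between descriptive and predictive modelling can be made concrete with a small sketch. Below, a one-dimensional k-means routine stands in for clustering (no labels are used), and an ordinary least squares line fit stands in for supervised prediction. Both are deliberately minimal illustrations with made-up data and hypothetical function names, not methods prescribed by the workshop.

```python
from statistics import mean

def kmeans_1d(points, centres, iterations=10):
    """Descriptive modelling: move k centres towards groups of unlabelled 1-D points."""
    centres = list(centres)
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            # Assign each point to its nearest current centre.
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Recompute each centre as its cluster mean; keep the old centre if empty.
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres

def fit_line(xs, ys):
    """Predictive modelling: least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Unsupervised: only the measurements themselves are available.
points = [1.0, 1.5, 0.5, 5.0, 5.5, 4.5]
centres = kmeans_1d(points, centres=[0.0, 6.0])

# Supervised: each input x comes paired with a target y to be predicted.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 4.0  # predicted y for an unseen x
```

The clustering call takes only the points, whereas the line fit needs the paired targets; that difference in inputs is what separates the unsupervised and supervised settings.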

There are several different dimensions to consider for these analysis tasks, e.g. complexity, data quality, the incorporation of prior knowledge, and scale. Here complexity can refer to the complexity of the model being used for analysis (e.g. network/circuit models in systems biology or social network analysis), or to complexity in the data, e.g. arising from the integration of multiple data sources. With regard to data quality, there can be a trade-off between large amounts of “dirty” data and smaller amounts of cleaner data, which might arise through data curation. The incorporation of prior knowledge is very important in scientific applications; one way to achieve this is to use structured probabilistic graphical models to encode (at least some of) the domain knowledge. With regard to data scale, there is often an abundance of raw data, and technological advances mean that the volume will grow rapidly. In some cases it is possible to deal with the volume simply by sampling the data rather than processing it all.
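One standard way to sample from data that is too large to hold in memory, or that arrives as a stream of unknown length, is reservoir sampling. The sketch below implements the classic single-pass variant (often called Algorithm R) as an example of the sampling idea mentioned above; the function name and parameters are illustrative, and the text does not say which sampling scheme any particular project uses.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable in one pass.

    Only k items are held in memory at any time, so the stream may be far
    larger than RAM. If the stream has fewer than k items, all are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; the new item survives with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir

# Draw 5 records from a (hypothetical) stream of 10,000 record identifiers.
sample = reservoir_sample(range(10_000), k=5, seed=0)
```

Statistics computed on such a sample are estimates of the full-data values, so the sample size should be chosen with the required accuracy in mind.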

Over the course of the meeting we seek to gain an understanding of the data analysis challenges faced in different domains, and to identify effective ways to train personnel in data-centric analysis techniques in order to make scientific progress. Often this will require data experts to work with domain experts to develop new techniques that address the specific structure or features of the problem domain.

Over the week there are a number of talks that address different aspects of data analysis, including: Thore Graepel (Microsoft Research) on analyzing large-scale complex data streams from online services; Jonty Rougier (University of Bristol) on incorporating parameter uncertainty into the data assimilation process; Chris Williams (University of Edinburgh) on the complexity dimension in data analysis; and Andrew McCallum (University of Massachusetts Amherst) on discovering patterns in text and relational data with Bayesian latent-variable models.

March 8, 2010

Add further comments here with headings like this

and subtopics with headings like this

This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.