This wiki service has now been shut down and archived

Wednesday DIR Breakouts

From ESIWiki



Break-out Sessions on Wednesday

Introduction to break-out sessions

After the morning talks there will be small groups meeting in "break-out sessions" in order to give everyone an opportunity to think about the issues and develop ideas. These breakouts will be on a mixture of topics that originate from the day's domain of interest, the morning's talks and the cross-cutting themes. There will normally be between four and eight concurrent themes. They may start at or after lunch and will finish at 16:00 or at a time specified by the day organiser during the morning. Refreshments will be available in the Chapterhouse during the last 30 minutes of a breakout period.

Each break-out group should identify someone to chair their session, someone to record their session and someone to report back to the plenary session at the end of the afternoon.

The reporting back should include:

  • What was the focus of the group?
  • What aspects of DIR were already working well? or Which of their methods or technology were already serving DIR well?
  • What are the current mission-critical challenges?
  • What strategy would the group use to address them?

(as reporting-back time will be short, the reporter should digest this into no more than 5 slides.)

We ask that one or more people record what happened during their group's session by adding an entry in the record below.

Programming Paradigms Break-out

See Programming Paradigms for background information.

The aim of this session will be to take stock, mid-workshop, of how the exchanges, discussions and proceedings so far have influenced our perception of Programming Paradigms for data-intensive research. Many of the issues laid out in Geoffrey's opening talk (on Programming Paradigms) will be revisited.

Arts, Humanities and Qualitative Data Intensive Research

The talks have been quite focused on quantitative research, but what about Data Intensive Research using other forms of data - e.g. text, audio, video, transcriptions of in-depth interviews? Is there a "computational turn" set to change research practice and methods in these disciplines?

Participants: Sheila Anderson, Beth Plale, Bernie Acs, Neil Chue Hong, Dave De Roure


  • Reflected on nature of research practices
  • Different paradigms: individualistic vs (and?) large projects
  • sharing - but only on my terms
  • Recognise existing work: digital humanities dates back to the 1940s
  • Social and cultural change is key - by and for researcher

Changing Practices

  • Recognise, understand and model changing practices - scholarly primitives
  • AHRC grants increasingly collaborative
  • Impact of European funding with focus on e-Content and research infrastructures - practices follow the money...?
  • Publication - slower but discernible shift towards new forms of digital publication? Do we need new publication models - a digital "book"? As compared to monograph, journal article

Data Intensive Research

  • Humanities work is data/source-driven and always has been (except maybe practice based arts)
  • but data/sources highly dispersed
  • impact of million books initiative / increasing digitisation of collections
  • Silos to break down
  • significant role for linked data
  • create social networks around linked data
  • facilitate easy to use minimal linked data environment
  • play the impact card to provide incentives

Data Intensive Methods

  • Some suspicion of computational methods - can't see what the computer is doing
  • But activity increasing: music information retrieval, visualisation, text mining, simulations
  • Build on these successes - case studies etc, fund more work in these areas
  • Training and safe (physical and virtual) places to experiment with their data

New forms of Social Sciences Data

Today's talks have introduced survey-based social research. However there is a growing movement that argues that social science research could better be based on the growing body of "social transactional" and "naturally occurring" data sources; e.g. supermarket loyalty cards, administrative records, traffic cameras, Facebook, smart electricity meters - including realtime data. Which data intensive research methods need to be used or developed to successfully harness this data deluge for social research that has impact? Possible lead: Mark Birkin

One of the points discussed at Wednesday’s breakout was that methods that use 'social transactional' and 'naturally occurring' data to research the social (by social scientists and others) are not the same as either qualitative or quantitative social research methods, as conventionally understood. Or to put it another way, both quantitative and qualitative research method specialists are critical of them, on methodological grounds, although for different reasons.

It may be that they are a new paradigm and adherents to the old don’t necessarily change their ways and adopt them; the new paradigm is just adopted by a different (and possibly new) set of practitioners.

Can I (RMcNally) also recommend a short publication called ‘The End of the Virtual’ by Professor Richard Rogers. Available at:


[Paul Lambert, 19/MAR/2010, 0951]

Points noted from the group discussion included:

- Attention to likely users of such Administrative, Transactional and other Born-digital data ('ATB' data??) was important, with a particular view to training and capacity building - accordingly a priority within any research initiative should be supporting training resources for new researchers to exploit ATB data. As above, the user community may not be traditional research communities who already have relevant data analysis skills. They may well be sceptical of naturally occurring data (to qualitative researchers, it may seem over-simplifying and lacking in context; to quantitative researchers, it can seem to feature an insufficient depth of data, such as in the absence of adequate explanatory variables or longitudinal trajectories, and low standards of sampling).

- Major hurdles to access to ATB data exist in many domains, such as ethics boards imposing restrictions on access to e-Health data, or companies requiring payment for transactional data. At best these have implications for scientific models of replicability (i.e. allowing other researchers access to the same data); at worst these lead to relevant data being denied to researchers altogether. A strong effort should be made to highlight the benefits of research analysis of relevant data (and even perhaps the moral obligation to share data, in some domains), so that more balanced considerations of data access are made by the data owners (compared to the present situation in which it often seems concerns are weighted towards negative risks).

- Notwithstanding the above, some ATB data undoubtedly raise disclosure risks, and facilities for monitoring research use of data and potential individual-level disclosures seem desirable.

- Scale of some ATB data is enormous, and automated metadata generation and storage, along with data storage, is required to curate it (e.g. the example of collecting and analysing telecommunications interactions)

- It would be desirable if the possibilities of ATB data were not set up in apparent opposition to other forms of data resource, especially survey data analysis. Different resources ought to be complementary (and e-science paradigms have important contributions to both). However this opposition is often presented (especially with surveys due to their shared quantitative character) - as in the framing of the intro above, and in the interpretation of the influential article 'The coming crisis of empirical sociology', by Savage and Burrows, in the Sociology journal from 2007.

- There was discussion of the 'Toronto agreement' on data sharing and coordination in Biomedical data(?); a comparable agreement regarding ATB data seemed desirable and feasible


What were you thinking when Hugh was talking?

When Hugh Glaser presented Linked Data, were you thinking "this is interesting, it solves a problem for me" (if so what?) or were you thinking "this isn't going to work for me?" (if so why not?)

Many research questions in data intensive research require linkage between data, perhaps across space (google maps mashups) or time (longitudinal studies). The purpose of the breakout is to explore how well Linked Data helps and/or what are the issues for the Linked data community to address to make it (even) more useful?

Leader: Hugh Glaser

Attendees: James Cheney, Jeremy Cohen, Hugh Glaser, Ally Hume, Kashif Iqbal, Martin Kersten, Chris Rusbridge

There are many misconceptions about Linked Data, and some time was spent trying to clear some of these up, without complete success. Some had thought of Linked Data as a rebranding of Semantic Web, but not so; Linked Data at its most basic level does not include Ontologies and Inference.

The group agreed that we were not sufficiently expert (although some were) to draw strong conclusions. We suspected (and later confirmed) that even this workshop of people interested in data-intensive research were generally aware of Linked Data, but mostly lacked expertise or a real understanding of how Linked Data could be used for research.

The breakout group spent some time thinking about that question of how Linked Data could be used for research. Answers included

  • dataset, collection and other kinds of science metadata
  • linking data to external resources
  • enable re-use of other people’s data (but this doesn’t abrogate responsibility for having evaluated those data)
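The "linking data to external resources" answer can be illustrated with a minimal sketch in plain Python. It assumes nothing beyond the idea that Linked Data, at its most basic level, is a set of statements whose subjects and objects are URIs, with no ontologies or inference involved. Every URI, vocabulary term and helper name below (`EX`, `describe`, `external_links`) is hypothetical and invented for illustration; the group did not discuss any particular implementation.

```python
# A minimal sketch of basic Linked Data: (subject, predicate, object) statements
# whose identifiers are URIs. All URIs and names here are hypothetical.
EX = "http://example.org/"

triples = {
    (EX + "dataset/42", EX + "vocab/title", "Example image collection"),
    (EX + "dataset/42", EX + "vocab/creator", EX + "person/alice"),
    # A link from our dataset to a resource published by somebody else.
    (EX + "dataset/42", EX + "vocab/about", "http://external.example.net/species/123"),
    (EX + "person/alice", EX + "vocab/name", "Alice Example"),
}


def describe(subject, graph):
    """Collect {predicate: [objects]} for one subject by plain lookup (no inference)."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


def external_links(graph, local_prefix):
    """Return object URIs that point outside our own namespace."""
    return sorted(
        o for _, _, o in graph
        if o.startswith("http") and not o.startswith(local_prefix)
    )


if __name__ == "__main__":
    print(describe(EX + "dataset/42", triples))
    print(external_links(triples, EX))
```

The dataset metadata and the link to somebody else's resource live in the same structure, which is roughly what makes re-use of other people's data possible; evaluating the quality of the linked data remains the re-user's responsibility.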

The group did not believe this was anywhere near complete. The To Do list therefore included

  • a workshop to document use of Linked Data in research, based on successful use cases (implying agreement on criteria of success). The workshop to lead to a clear report.
  • a “meshup workshop” or barcamp on Linked Data in Research, maximum 2 days, for IT/techie support, bio-informatics (and xxx-informatics) types. This would be along the lines of JISC Dev8D “Developer happiness days”, with problems set and solved by the participants working together, to increase their real, hands-on understanding of what Linked Data could do for their researchers.

Biological and Image Data

Following on today's talks and integrating with the wide variety of biomedical challenges and strategies.


  • Alastair Droop, genetics perspective
  • Dave Liewald, cognitive ageing with genetic data
  • Jerome Avondo, live plants quantification
  • Simon Wong, supporting reasonable base of bioinformatics users, lots of data coming
  • Graham Kemp, bioinformatics, protein structures, docking, drug design, broad interests
  • Jason Swedlow, imaging, cell division, open microscopy
  • Richard Baldock, capturing biology data, larger scale data analysis
  • Liangxiu Han, data mining on developmental biology data
  • Jano van Hemert, leads the research in the UK National e-Science Centre
  • Rob Kitchen, technical reliability of several biological analyses (microarray, qPCR, next gen sequencing)
  • Jos Koetsier, web portals for data-intensive research, quick generation of user interfaces


  • Institutional constraints exist that do not allow appropriate provision of research services and tools, which prevents data-intensive research
  • Biologists (and many scientists) need the appropriate learning ramps to get to a sufficient level to do the job required
  • Biologists (and many scientists) should know the fundamental method that lies underneath, to make sure they do not misuse the tool that implements the method
  • A formal description with a record of what happened should be attached to publications
  • "Small" data already break existing "shop-floor" tools in the lab (R, Excel)
  • Collecting data from the "long tail" (many small data sets) is difficult because 1) getting people to share is almost impossible, as only sticks exist and not many carrots; 2) the statistical uncertainties are unknown (not modelled), which prevents integration and useful interpretation


  • (Alastair) No interaction with Human-Computer Interaction (HCI) people in the past 6 years
  • (Jason) Has interaction with the HCI group in Dundee, which was tremendously useful for the development of their software
  • (Jason) Software developers are very smart, want their software used, and will use methods they see beneficial, thereby making the HCI folk obsolete quickly.
  • (Jason) ICT companies spending fortunes on employing HCI people
  • (Alastair) Lab equipment has very poor interfaces
  • (Rob) No need for HCI when doing your own analysis
  • (Alastair) But HCI is really important for visualisation
  • (Jason) Can we note that we include HCI in any next development plans?
  • (Richard) Pretty commercial software, but it does not necessarily do what you need.
  • (Jason) Very different goals for the developers in commercial and academic environments.
  • (Jano) Breaking the silo barrier, HCI group must report to RAE2015 in Computer Science, what do they get out of it? Find a group that is willing to do interdisciplinary projects.
  • (Jano) Need to have equal partnerships in projects to allow time for CS to publish as well.
  • (Graham) Secondary use of data, what are we doing about that in the future?
  • (Alastair) Is a data scavenger of microarray data. Next gen seq will scale up in volume, but will the use go up though?
  • (Alastair) Has problems using R: as a biologist he is not able to work with his data, which others would see as small. Simple, traditional "shop-floor" applications for data exploration should work as well. Not just the new fancy database-enabled applications.
  • (Jason) Why?
  • (Alastair) Problem is that the tsunami of facts in molecular biology, combined with lab work, leaves no time for new tools. So they use what they know how to use, not what works
  • (Dave) University policies make this often hard with managed desktops.
  • (Jason) Specific services are not for information services, they do the mean.
  • (Dave) We need reduction of barriers to allow subnets that provide research services.
  • (Richard) In the MRC that is also the problem, but setting up own resource is always possible.
  • (Jano) But not for Jo-biologist
  • (Alastair) As Carole said, with services most biologists will use PubMed, and maybe BLAST.
  • (Jason) Has good experience, when driven by necessity, will learn more such as working with canned Python scripts
  • (Alastair) When things break you can move people on to other tools
  • (Alastair) Feels he is sitting in between three expertises (CS, MolBio, Stats).
  • (Alastair) Jargon is a barrier.
  • (Jano) Depends on the mode of interaction. Chemists know what they want and they want to enable their tool/task to other chemists. We just make that happen. Biologists know less accurately what they want, and if they need assistance with their analysis, it takes more time to get a common language.
  • (Jason) A lot of experimentation to see what is possible, this data should not be used afterwards.
  • (Dave) Setting up data collections without a hypothesis, integrating data sets to make large consortia
  • (Richard) Traditional biology, observational science, looking through data then have the bright idea
  • (Dave) But now we can measure so much that this is not possible
  • (Jano) That is the data deluge
  • (Jano) And the Fourth Paradigm, no hypothesis
  • (Alastair) There are models of small parts of the system that need to be brought together
  • (Richard) You need an awful number of words to remove overloading of terms
  • (Jano) No! URIs!
  • (Richard) Is this really a barrier? (Rob) It only takes seconds to pin down
  • (Graham) Should we discuss the challenges of the long tail, as identified in a previous talk?
  • (Graham) We will not get the data from the long tail, too hard
  • (Jano) Carrot vs stick: microarray data in a repository before publication is a stick, BBSRC data policy is a stick, having a tool that gets the data at the start and makes it quicker to get the publication out is a carrot. Then you can get the raw data.
  • (Jason) Do we need to collect all, what data is useful? Why not release all and find out what is.
  • (Richard) We can predict the lifetime of data based on new machines coming out
  • (Dave) There are still problems with data that has privacy restrictions (human data)
  • (Alastair) Many reasons to not publish data, errors and gazumping
  • (Jano) First reason invalid, as that should be proper science
  • (Richard) It is not science if people cannot reproduce the result
  • (Jano) Astronomy good example of how agreements about open and free data access work
  • (Alastair) Danger of moving the discussion to how we are doing science and how we should be doing science
  • (Jano) Difficult to change behaviour in whole disciplines, good examples of agreements made from the top down, but this is hard for the long tail
  • (Jano) Do you trust the usefulness and quality of data that is the result of experiments in the long tail?
  • (Rob) Comparing results in the same lab as duplicates is already hard, let alone over many labs
  • (Jason) Has example of data they know is from a faulty experiment, handing that data around would not be good practice
  • (Jason) Raw statistical uncertainty depends on domain, lab, equipment, experiment setting, without knowing these, integration or interpretation is dangerous
  • (Jason) We need a way of placing data (image) in a genomic framework (or other framework)
  • (Alastair) We need tools to help us integrate data to make it meaningful
  • (Jano) We have tried to integrate 12 or 13 biological databases; that was going to fail from the start, everything is too noisy and changing all the time
  • (Richard) Has curated (imaging) data for more than a decade; it is impossible to take the human curation out of the loop, as any automatic method fails when new techniques are used. Automated mappings/interpretations may work for a specific set of data given a carefully designed algorithm, but this algorithm must be updated for any new data that is slightly different.
  • (Richard) What data do you keep that is worth the while keeping live? Mostly data that is used directly in publication
  • (Alastair) We only get credit for data that works
  • (Jano) There are a few examples where the 95% of data not used in publications is shared openly
  • (Richard) In medical trials publication of bad data is good
  • (Rob) Sweave is a tool that integrates R and LaTeX to make sure that the whole process from analysis to publication is automatic and can be replicated.
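The idea behind Rob's last point, that the numbers in a publication should be regenerated by rerunning the analysis rather than copied by hand, can be sketched as follows. This is not the tool he mentioned: it is a small Python analogue that writes a LaTeX table directly from computed summary statistics. The measurements are invented purely for illustration, and the function names `summarise` and `to_latex` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical measurements, invented for illustration only.
measurements = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 4.0, 6.0],
}


def summarise(data):
    """Compute (mean, sample standard deviation) for each named group."""
    return {name: (mean(values), stdev(values)) for name, values in data.items()}


def to_latex(summary):
    """Render the summary as a LaTeX tabular, so the table is never typed by hand."""
    lines = [r"\begin{tabular}{lrr}", r"Group & Mean & SD \\", r"\hline"]
    for name in sorted(summary):
        m, s = summary[name]
        lines.append(rf"{name} & {m:.2f} & {s:.2f} \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


table = to_latex(summarise(measurements))

if __name__ == "__main__":
    print(table)
```

Rerunning the script after the data change regenerates the table, so the published figures and the analysis cannot silently drift apart.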

Analysing time series data

Several of today's talks have discussed the analysis of time series data. However, the other breakouts were so exciting that this one didn't occur. Which is perhaps ironic for a time series breakout.

What to tell your government

This question was posed by Carole Goble

You are about to meet someone in the ministry that funds research and higher education and you have the opportunity to tell them in a few succinct phrases how they should support data-intensive research in the next funding round. You can have at most 5 priority items if you want to be taken seriously; what are they?

From today's Earth Systems data breakout: Free and easy access to data (particularly geo data), otherwise Europe will suffer relative to USA.

Things Carole mentioned as potentially on the priority list: Data, Data repositories, Data citation and Software.

Leader: Carole Goble

--MalcolmAtkinson 19:02, 16 March 2010 (UTC)

  • David McAlister
  • Bob Mann
  • Scott Jensen
  • Jim Austin
  • Joel Saltz
  • Max Wilkinson
  • Carole Goble (Chair/Facilitator)

BRIEF: Responding to the OSI report on eInfrastructure for science and innovation, that was lost and now is found.

The new Government department is revisiting the report by interviewing community members. The Government Office for Science formed an expert group to define an action plan; the only academic present is the chair, Carole.

Four focus areas for the action plan:

  • Capacity Building
  • Sustainability
  • Adoption
  • Interoperability (particularly with data policies)

Data remains the focus.

What was requested were suggestions and defined objectives for each of these.

Need a leadership model. Coordination is key but it is difficult to reach agreement on how; most people agree coordination is useful but no-one wants to be coordinated.

Suggestion is to identify and allocate a budget based on expert requirements rather than peer review. Perhaps though a more tender-oriented mechanism was required?

Academic models are starved of funds and short term. Businesses deliver products but are detached from the academic concepts, which remain essential; in addition they can be excessively expensive (e.g. NPfIT), or vulnerable to ‘political influence’. Neither is ideal for identifying the correct goals and reaching them.

Scope: Software development doesn’t belong within a research project, and is not covered by sufficient funding. eScience is international rather than national.

Charities, or other third sector organisations, are not really involved in this activity. And there is no incentive for them to be influenced. The same could be said for standards movements, e.g. HL7, DICOM


Funds will be limited, and it is unclear where they will come from:

  • What should they be used for? Specific work, identified by experts
  • Do we need a new model? Yes, allocated rather than peer review
  • How do we implement this? Not sure
  • Who are the expert group? This is fraught

For allocation by expert group

  • For the development of standards (however adoption or even realisation of discipline standards activity has a very low take up).
  • Where will the funding come from
  • For adoption

Interoperability: e.g. standards and their adoption? Appear to simplify the problem of interoperability but difficult to implement and encourage adoption. Is this a useful use of funds? Possibly if working at the interface between academic and industry with particular regard to standards.

Sustainability: What is the provenance for sustainability? Generic OS, transferability, immunising or hedging. It was suggested that sustainability is proportional to community presence or conscience, e.g. UNIX.

  • Expert group/government should be involved in market creation.

But who decides what is sustainable. Some technology goes out of date. The issue came up again about a role in between industry and academic, specifically in making software products more mainstream to aid dissemination.

  • Specifically reduce the burden of deployability and supportability.
  • The idea of a standing army able to undertake this activity but this could be risky.
  • There is a model for this in the BBSRC that has ring fenced money for sustainability. Could develop this across research councils but also between research councils.

Capability: Career paths. There continue to be unrecognised career paths: non-academic careers in academic frameworks that are not rewarding in the long term

  • Idea of a ‘research technologist’
  • Need to retain resource but also supply industry. Balance is not right.

Employment time frames are inconsistent with retaining skills or transferring skills (forced churn). Centres of excellence are problematic. How do we have standing armies and not favour monolithic infrastructures? These are conflicting concepts, partially addressed with platform grants in the UK and in silico centres in the US. These fund computational scientists to undertake science that delivers tool prototypes, designed to seed other projects.

Below here are ones not used yesterday

Look: a wheel!

Do we really need to re-invent the wheel each time we move to a different platform, or the next time a new paradigm comes along? Oftentimes, we are simply reproducing our data-flows over the new paradigm.

David and Goliath

Is it better to be smart or big? What are the problem instances that allow us to be smart, and when do we have to use raw power? And how do we develop the tools for each processing paradigm?

Your blessing, my curse

The curse of dimensionality has always been around. However, many phenomena are only observed or simulated at sequences of multi-dimensional snapshots. What do researchers need? How can it be delivered?

Who's chasing who?

Are the software folk chasing the hardware folk, or vice-versa? Are software developers trying to find a way to make better use of existing infrastructure, or are new infrastructures being built because there is a software need?


This is an archived website, preserved and hosted by the School of Physics and Astronomy at the University of Edinburgh. The School of Physics and Astronomy takes no responsibility for the content, accuracy or freshness of this website. Please email webmaster [at] ph [dot] ed [dot] ac [dot] uk for enquiries about this archive.