
Tuesday DIR Breakouts



Break-out Sessions on Tuesday

Introduction to break-out sessions

After the morning talks there will be small groups meeting in "break-out sessions" in order to give everyone an opportunity to think about the issues and develop ideas. These breakouts will be on a mixture of topics that originate from the day's domain of interest, the morning's talks and the cross-cutting themes. There will normally be between four and eight concurrent themes. They may start at or after lunch and will finish at 16:00 or at a time specified by the day organiser during the morning. Refreshments will be available in the Chapterhouse during the last 30 minutes of a breakout period.

Each break-out group should identify someone to chair their session, someone to record their session and someone to report back to the plenary session at the end of the afternoon.

The reporting back should include:

  • What was the focus of the group?
  • What aspects of DIR were already working well? Or which of their methods or technologies were already serving DIR well?
  • What are the current mission-critical challenges?
  • What strategy would the group use to address them?

(As reporting-back time will be short, the reporter should digest this into no more than five slides.)

We ask that one or more people record what happened during their group's session by adding an entry in the record below.

Look: a wheel!

Do we really need to re-invent the wheel each time we move to a different platform, or the next time a new paradigm comes along? Oftentimes, we are simply reproducing our data-flows over the new paradigm.

David and Goliath

Is it better to be smart or big? What are the problem instances that allow us to be smart, and when do we have to use raw power? And how do we develop the tools for each processing paradigm?

Your blessing, my curse

The curse of dimensionality has always been around. However, many phenomena are only observed or simulated as sequences of multi-dimensional snapshots. What do researchers need? How can it be delivered?

Language matchmaking

Does some research match query languages and some match workflows? Or is it just who you first meet when you're ready to fall in love?

Who's chasing who?

Are the software folk chasing the hardware folk, or vice-versa? Are software developers trying to find a way to make better use of existing infrastructure, or are new infrastructures being built because there is a software need?

Can I have your data?

Most research data is held in proprietarily formatted flat files, a condition that does not lend itself to easy sharing or to experimental reproducibility. Is this a permanent condition for research data? Does it matter? If it does, what do we do to accommodate or change it?

Attended by: Ally Hume, Dave Liewald, Magnus Hagdorn, Ruth McNally, Max Wilkinson, Kashif Iqbal, Bernie Acs, Jerome Avondo.

  • Files will not go away.
  • The system must recognise data citations as first-class citizens.
  • Authoritative unique IDs.
  • Low barrier to data publication.
  • Crowdsourcing of annotations:
      • Community defines metadata.
      • Metadata is in the eye of the beholder.
  • Standards evolve.
  • Privacy issues.
  • Collaborative Web 2.0 data sharing/tagging system.


Revised report on this session by Ruth McNally

DIR - Tuesday 17 March - Break-out session "Can I have your data?" - Report

Ruth McNally, ESRC Cesagen; Ally Hume, EPCC; Dave Liewald, Univ Edinburgh; Magnus Hagdorn, Univ Edinburgh; Max Wilkinson, British Library; Kashif Iqbal, Irish Centre for High-End Computing; Bernie Acs, Univ Illinois at Urbana-Champaign; Jerome Avondo, John Innes Centre.

‘Most research data is held in proprietarily formatted flat files, a condition that does not lend itself to easy sharing or to experimental reproducibility. Is this a permanent condition for research data? Does it matter? If it does, what do we do to accommodate or change it?’

The main points were:

1 PROPRIETARY FORMATTED FLAT FILES ARE A PROBLEM THAT WON’T GO AWAY. This is likely to be a permanent condition for research data. Therefore it has to be worked around so that the data becomes shareable and usable by others. This requires two main changes: a citation system for data, and improved metadata descriptions. Points from our discussion of each of these follow.

2 A CITATION SYSTEM FOR DATA. Why don’t we start citing data? The paradigm is: data sits behind the knowledge base. The British Library is examining a way of implementing a system for data citation, modelled on the principles of the citation system for academic publications. Data citation is to datasets what cross-referencing is to publications. (A minimal sketch of such a citation record follows the lettered points below.)

a. Make data provision a first-class citizen, moving towards a situation where one could build a career on good data.

b. Data needs identification if it is to be useful. It needs a stable identification mechanism and a catalogue. We need to find an authoritative unique identifier for the data files. But there are sustainability issues: the system needs to be supported by a stable institution.

c. Medical, legal and ethical issues arise with, e.g., genetics data. Certain parts of the data may need to be kept confidential, for example data about human research participants. Could there be filters over access to certain fields? Controlled versus open sharing. Use of mechanisms for data de-identification. A series of constructs to say who is allowed to see the data? But who would arbitrate that? Or would the communities be self-policing?

d. Intellectual property. Similar issues arise with intellectual property rights over data. Data sensitivity shifts over time; it is more sensitive before publication. This could be managed through a system of data rights. The provenance of data needs to be acknowledged. We need to look at existing systems. Perhaps base it on copyleft?

e. Some data can be deposited in UK data centres, but not all: the volumes make it prohibitive. Imaging and sequencing data are enormous. Some data files can be stored locally. Different models suit different data types; is one model suitable for all types of dataset? Take the example of a telescope in Chile that produces a terabyte of data per night: we know where it is going to be archived, but how do we formalise an identifier for a terabyte of data? Seismology, environment and genomics all have huge data volumes.

f. Wikis: it is hard to get data out of them, and they are not structured. At the other end are full-on repositories, but they are too heavy-handed. Is there a middle platform, a "Wiki plus", a medium ground between the unstructured Wiki and the heavily structured repository?

g. There need not be direct access to the data; research queries could be sent to the data producers. But then how do I know where the data lives and what is in it? We need a public way to point to it.

h. Sustainability. In the longer term we need to preserve the data after the project ends and the project team dissipates. Who has responsibility? That problem needs to be addressed.
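As promised above, a minimal sketch (in Python) of what a citation record for a dataset might contain, by analogy with a reference to a publication. The field names, values and identifier are illustrative assumptions, not the British Library's actual schema:

 # Sketch of a dataset citation record, by analogy with publication
 # citations. All field names and values are hypothetical.
 dataset_record = {
     "identifier": "10.1234/example-dataset-2010",  # hypothetical persistent ID
     "creators": ["Smith, J.", "Jones, A."],
     "title": "Example phenotype dataset",
     "publisher": "Example Data Centre",
     "publication_year": 2010,
     "version": "1.2",
 }

 def format_citation(rec):
     """Render a human-readable citation, as one would cite a paper."""
     authors = "; ".join(rec["creators"])
     return (f"{authors} ({rec['publication_year']}). {rec['title']} "
             f"(Version {rec['version']}) [Data set]. {rec['publisher']}. "
             f"https://doi.org/{rec['identifier']}")

 print(format_citation(dataset_record))

Such a record gives the dataset the stable, authoritative identifier discussed in point b, and something concrete for a citation system to resolve.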

3 METADATA DESCRIPTIONS

a. The metadata is as valuable as the data.

b. Files can be annotated to describe the data inside and the reason for creating it.

c. Metadata files should be separate from the data files. There are lessons from the digital humanities community, which has a standard text encoding initiative (TEI) and an XML standard; but the files got enormous because of the metadata added inline. And where are the metadata files going to be situated?

d. Incentives. Under present conditions the potential beneficiaries of metadata are colleagues and other users, not primarily the data producer. We need incentives for data producers to create and make available metadata descriptions of their data, so that its existence can be discovered by potential users. This incentive could be the potential of reward through data citation.

e. Support. Another way to increase provision of metadata is through the provision of resources as a recognised category for research council funding. It needs to be a consideration in all projects: big science with big data and small projects with smaller datasets.

f. Minimum requirements. The barriers to making data and metadata public must be minimal. But at the same time there need to be some minimum metadata requirements that the data producer must provide. These could be data-producer community-specific minima, combined with a system which supports extensibility by re-users, including those from other communities.

g. Extensibility. Linked to the above point. Trying to write comprehensive metadata would be too onerous on the data producer, and also impossible: different communities of data re-users have community-specific metadata needs. Metadata cannot be comprehensive, and neither can it be future-proof. Because of these pragmatic considerations, once data is published, others should be allowed to add annotations to it. The Web could be used to support a system for crowdsourcing of metadata: mass collaboration, Web 2.0 ideology. But annotations should live in separate files that ‘pollute’ neither the original data files nor the original metadata. There would also need to be a tracking system for revisions. Could this be supported by myExperiment?

h. XML assumes a schema definition, which is like flat-file lock-in; concepts cannot be mapped. If RDF is used, the extension process is simple. The annotation method could be semantic: RDF was suggested because it is loosely defined and reasonable for doing open-ended descriptions and data annotations, and it can also be self-describing (a minimal sketch follows this list). If a community has a dispersed collection of data, then metadata descriptions could point to related datasets; that could be part of the annotation. One knows one's colleagues; that is a starting point, like Facebook: become a community. They have adopted the ideology of metadata. Facebook is centralised, but it doesn't have to be.

i. Discipline-specific vocabularies. Language is dynamic and context-specific, qualities which are uneasy bedfellows with computing. Is there a need for a controlled vocabulary or an ontology?
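As a concrete illustration of points c, g and h, here is a minimal sketch of open-ended RDF annotation kept in a separate file, using the Python rdflib library; the dataset URI and the EX vocabulary are assumptions made up for the example:

 # Open-ended RDF annotation kept separate from the data it describes.
 # Uses rdflib; the dataset URI and EX vocabulary are hypothetical.
 from rdflib import Graph, Literal, Namespace, URIRef
 from rdflib.namespace import DCTERMS, RDF

 EX = Namespace("http://example.org/vocab/")           # assumed community vocabulary
 dataset = URIRef("http://example.org/data/survey42")  # assumed dataset identifier

 g = Graph()
 g.add((dataset, RDF.type, EX.Dataset))
 g.add((dataset, DCTERMS.creator, Literal("A. Researcher")))
 # A re-user's annotation: point to a related dataset and add a note.
 g.add((dataset, EX.relatedTo, URIRef("http://example.org/data/survey41")))
 g.add((dataset, EX.note, Literal("Sampling rate doubled from June 2009.")))

 # The annotations serialise to their own file, leaving the original
 # data and metadata files 'unpolluted'.
 g.serialize(destination="survey42-annotations.ttl", format="turtle")

Because RDF statements simply accumulate, a second community can add its own triples about the same URI in yet another file, which is the extensibility argued for in point g.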

The breakout discussion also considered:

I. WHAT IS A FLAT FILE? It is not a binary file; it is human readable. It may not have delimited categories or fields. Flat files may be structured or unstructured.

II. DATA EXCHANGE STANDARDS. What about community data exchange standards, such as the Proteomics Standards Initiative's mass spectrometry data standard, that are designed to standardise instrument output? Does anyone think these are a solution to proprietarily formatted data files? Standards do not match practices, and standards go out of date. But different imaging platforms have different proprietary formats, and the community wants open formats. Are converters the alternative to standards? (A converter sketch follows this list.)

III. WHAT IS THE PURPOSE OF SHARING DATA? Is it to build communities around datasets? In some fields the challenge is not to find datasets, because the platforms for sharing the data are in place; the challenge is to find collaborators.

IV. HISTORICAL DATA. Some data is historical, e.g. the P3G group is studying phenotypic data, some of which goes back over a century. How applicable are the above suggestions for these types of data?

V. INTERCHANGEABILITY OF DATA. The above does not address this issue specifically.
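On the question in II of converters as an alternative to standards, a minimal sketch in Python: it assumes a hypothetical proprietary flat-file layout (tab-delimited values under '#'-prefixed instrument header lines) and rewrites it as plain CSV:

 # Converter sketch: a hypothetical proprietary flat file (tab-delimited,
 # '#'-prefixed header lines) rewritten as plain CSV. The input layout
 # is an assumption for illustration only.
 import csv
 import io

 SAMPLE = "# instrument=MS-9000\n# date=2010-03-16\nmz\tintensity\n212.1\t1042\n"

 def convert(src, dst):
     writer = csv.writer(dst)
     for line in src:
         if line.startswith("#"):  # instrument header: skip (or map to metadata)
             continue
         writer.writerow(line.rstrip("\n").split("\t"))

 out = io.StringIO()
 convert(io.StringIO(SAMPLE), out)
 print(out.getvalue())

The fragility is apparent even at this scale: every new proprietary layout needs its own converter, which is exactly the maintenance burden open formats aim to remove.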

I'll tidy up tomorrow, honest mum.

The real world we observe is very messy. All our data is in a mess because our understanding evolved while we were collecting it. Should we try to be tidier and have better metadata, etc.? Or should we learn to cope with the mess?

Initial notes by --Alastair droop 20:22, 16 March 2010 (UTC);

Amended by Beth Plale 09:12, 17 March 2010; amended by --Neil Chue Hong 09:42, 17 March 2010 (UTC);

This discussion group was very active, and covered a wide range of very interesting points. Attendees: Alastair Droop, Beth Plale, Mark Baker, Mike Batty, Neil Chue Hong, Oscar Corcho, Hugh Glaser, Carole Goble, Scott Jensen, David Rodriguez, Stratis Viglas, Simon Wong and many more (who will add their names)

Cost-benefit analysis of metadata?

There is a consensus that the creation and addition of metadata to any dataset costs quite a lot (time, money, effort), and this often goes unrewarded. The process is often seen as a post-hoc activity, either forced upon researchers as a necessity before publication, or prompted when someone else interested in the published data asks for it. We need to increase the status and rewards associated with adding metadata.

Who uses our (meta)data and who generates it?

An important point was made about the people who use any (meta)data we produce. It might never be used, be used only in the lab, or become part of a public dataset. The expected audience will necessarily dictate how much time and money is invested in metadata creation.

Data generators fall into 3 general categories:

1. Research lab data: often transient data that doesn't get out to anyone, much of it not of value. It flows from being mine, to being shared with collaborators, to being shared with the public; the quantity of data gets smaller as it flows outwards. The level of curation should be proportional to the data's value, so public data is more curated than local lab data.

2. Instrument-based research lab: a person collecting data using an instrument for reason X may want to collect for reasons X and Y, because someone (including themselves) may be interested in Y later.

3. Data drawn in from sensors: interpretation and quality assessment of data pulled in from sensors depend on the instrument and how it was deployed. We need to capture metadata on the spot, because use is always temporally distant from generation and the data is ephemeral. For example, an instrument can be configured to sample at high frequency when something interesting happens; that is a real-time attribute, not a static attribute of the instrument configuration (sketched below).
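To illustrate category 3, a minimal Python sketch of capturing metadata "on the spot": each reading carries its own acquisition-time attributes. The field names are hypothetical:

 # Capturing metadata 'on the spot' with each sensor sample. The sampling
 # rate is recorded per reading because it is a real-time attribute, not
 # a static property of the instrument configuration. Field names are
 # hypothetical.
 from dataclasses import dataclass, field
 from datetime import datetime, timezone

 @dataclass
 class Reading:
     value: float
     sample_rate_hz: float             # real-time attribute, may change per reading
     instrument_id: str = "sensor-07"  # static configuration attribute
     timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

 # A burst sampled at high frequency when something interesting happens.
 burst = [Reading(value=v, sample_rate_hz=200.0) for v in (0.12, 0.97, 0.95)]
 print(burst[0])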

Do we need to bother, or does the published paper catch all of the metadata needed?

This resulted in interesting discussion. There was a consensus that text mining papers could reveal a lot of very useful information, but that this process is difficult. A lot more work could be done to make papers inherently more amenable to text mining. This datasource, however useful, is not the perfect solution to the metadata problem. Papers should not, therefore, be seen as the only way that we should be annotating our datasets.

Should we bother or just recreate the experiment?

The example of microarray data in biological research was discussed. In this field a lot of data has been collected, but the level of metadata often prevents reuse of experimental results. A question raised was: "should we attempt to re-attach these (meta)data, or simply re-do the experiment from scratch?" It was suggested that emerging technologies (new, next-generation sequencing) will replace microarray technologies and make the older experimental methods effectively obsolete. Further discussion suggested that re-creation of some types of experiment is easy (or at least possible), whilst other datasets are truly unique.

Curation cost is an investment decision based on a sense of the value of the data. Sometimes a forward-looking view of the problem puts us into the right mindset (that of the data generator instead of the tool-building problem solver).

Curating data may not be valuable because five years from now it will be obsoleted by new technology (see microarray data above). Seismic data about the Haiti earthquake or Hurricane Katrina cannot be replicated, however. Likewise, sustainability-science datasets get more valuable as they age (particularly as collection longevity increases).

The value of data changes as our understanding evolves. This argues for tying data to process, with high-quality process metadata, and for layering in metadata and interpretations as time goes on.

There are simple missing steps we're not taking:

1. When you query a database, you don't get the version of the database back. It is a simple thing and we're not doing it ourselves (sketched below).

2. Small semantics are better than no semantics. No self-respecting computer scientist would have developed tags: too unstructured.

3. Use accepted data models. 63% of biology data does not use data models; 47% don't use controlled vocabularies.
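As a sketch of missing step 1, a query helper that returns the database version alongside every result, using an in-memory SQLite database; the db_version table and its contents are a hypothetical convention, not a SQLite feature:

 # Missing step 1: every query result carries the version of the database
 # it came from. The db_version table is a hypothetical convention.
 import sqlite3

 con = sqlite3.connect(":memory:")
 con.executescript("""
     CREATE TABLE db_version (release TEXT);
     INSERT INTO db_version VALUES ('GeneDB-2010.03');
     CREATE TABLE genes (name TEXT, chromosome TEXT);
     INSERT INTO genes VALUES ('BRCA2', '13');
 """)

 def versioned_query(sql, params=()):
     """Return (rows, version) so the version can be cited with the result."""
     version = con.execute("SELECT release FROM db_version").fetchone()[0]
     return con.execute(sql, params).fetchall(), version

 rows, version = versioned_query("SELECT * FROM genes WHERE name = ?", ("BRCA2",))
 print(rows, "from", version)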

Publish early, publish often?

There was a discussion around whether techniques from software engineering, specifically open development, would work for the generation and publication of data and metadata. Two initial straw men were posited: the truly open model (where all data is published openly as soon as it is generated) and the embargo model (where "draft/working" data is collected but not published until some future point in time when the original researcher has had time to gain impact from the work).

The reasoning behind these approaches is to ensure that the community as a whole can benefit from both the data and the process, and that there are more "eyes" on the work, improving its quality and accuracy. The open model has the advantage of removing the stigma of publishing "messy data", but the drawback of potentially losing the ability to exploit the data first. A good example was the human genome sequencing projects, with one group following the open model and the other the embargo model.


Achievable steps

Model 1: the open development model, with data in the open from the start. It is OK for data/metadata not to be perfect, but one may lose first-access rights, hence the need for an embargo model. E.g., in the human genome race the public group followed this model and published reads of chromosome map files; the data was then used by the private group to do a better job.

Model 2: at the point where a peer-reviewed paper is published, publish the data and supplementary notes. Seeing a recording of work in progress (the supplementary notes) can be useful. The paper vehicle essentially benefits the academic.

In either model we need a credit system for data, and a propagation mechanism so that attribution flows with the data. Integration houses want to give credit, and need it at the point of ingest.

Is the paper a good enough hook? There is an assumption of monitoring going on underneath: at the point of the paper, the data is gathered and published. Keywords could be expanded to include ontological or controlled-vocabulary keywords; Microsoft has worked on integrating tagging into paper writing. Text mining strips out context (hedging) language, so it results in assertions that are not true. We need to be able to get from the paper to the data.

Next generation data-intensive researchers

How are we going to train them if we don't know what they will need? What are the underlying fundamental aspects of data-intensive research that will be relevant years from now?

Down to earth

A lot of today's talks have revolved around Earth Sciences. What are the common problems, ideas, and approaches practitioners in the area face and use? Is there a pattern that leads to a more general problem and maybe its solution? --Sviglas 07:47, 16 March 2010 (UTC)

Notes by Jano van Hemert --Jvhemert 16:08, 16 March 2010 (UTC)

Goal: the breakout should provide input for the report on this workshop and report on the challenges that earth sciences face.


Main outcomes

  • We should put together a portfolio of the expertise of participants of this workshop and then use this to form small projects where methods meet high-profile challenges
  • Earth science is a data-driven science; much of it means waiting for more data, and there are many opportunities for data integration (example: waiting for seismic monitors installed in faults)
  • We need commitments from domain scientists that they do not expect results tomorrow while methods and technology are being set up
  • Need tools for better metadata creation and management
  • A point for Carole Goble's activity of reporting on e-Infrastructure for Data-Intensive Research: in Europe, as compared to the US, much data is hard to access because of legislation issues; more data should be free in the public domain and easily accessible!
  • Crowdsourcing is an important approach to improving the quality of both data and metadata
  • We need non-intrusive methods for logging scientific research to enable reproducibility and inform further development of methods and tools


Introductions

  • Malcolm Atkinson, UK e-Science Envoy, organiser of the workshop.
  • Michael Batty, interested in social data and link to spatial aspects on the earth's surface, how human and physical come together
  • Mike Mineter, worked on environmental systems and on e-Science systems, supports SAGES http://www.sages.ac.uk/
  • Gerard Devine, works on the METAFOR project, describing climate models, experiments, simulations and data output
  • Keith Haines, earth observation data, how to interpret in the human environment, who owns the data, overlay with other data, GIS contexts, missing metadata, historical data, creating our own metadata from people studying historical data
  • Ian Main, seismologist, statistical inference over large data, prediction quality of earthquake models, risk assessment, forecasting web portal
  • Jano van Hemert, leads research at the UK National e-Science Centre; data-intensive research, mapping challenges into concrete systems, enabling scalability of use of these systems and preventing the knowledge deluge that may arise from solving the data deluge.
  • Torild van Eck, leads ORFEUS http://www.orfeus-eu.org/ coordinating research efforts in seismology in Europe; sustainability of development, as seismology is a small community and must be able to support solutions beyond the end of projects. NERIES http://www.neries-eu.org/ and http://www.seismicportal.eu/ Potential for getting better data quality, interaction with Geo people (as Keith is)
  • Chris Higgins, works for the UK national data centre EDINA http://edina.ac.uk/ Topographic data with much work on standards; chairs the open geospatial working group for Edinburgh. Focussed on providing services: what is needed here for research? Have security and scaling issues and a lack of tooling
  • Milena Ivanova, CWI http://www.cwi.nl/ MonetDB; several years of experience with astronomy. What are the problems in earth sciences? Where can database technology be applied?
  • Martin Kersten, CWI http://www.cwi.nl/ MonetDB; old relational databases are out because of performance. Has experience with GIS systems (TomTom).

Scientific questions / challenges

  • (Ian) Integration of seismic and deformation data to build models on top of that data; we need to wait for more faults to be monitored directly. Very much a data-driven science.
  • (Gerry) From the metadata point of view for climate modelling, the challenge is in the collection of good-quality metadata. Automatic generation of metadata output from models, to be read by other tools: metadata standards. The IPCC had to create a web-based questionnaire to collect the metadata of data.
  • (Malcolm) Could you exclude models in the next generation if they do not comply with metadata collection practices? Learning ramps are required to get people to generate good-quality and useful metadata.
  • (Mike) Many small groups have different modelling techniques, many have no standards for metadata capture. Integration of models hampered by the engineering behind how these models are implemented. We need to break isolation of models, such as via services.
  • (Keith) Metadata even more urgent in reporting impact. "How did you come to that conclusion using what climate models?"
  • (Jano) This must resonate also in predicting impact of earthquakes.
  • (Malcolm) Could an approach such as Martin showed in his talk, where they recorded the workflows used on the data, be a practical approach to collecting metadata? (See the sketch after this list.)
  • (Keith) That could bring reproducibility, which is essential.
  • (Torild) There is already a discussion and some schemas setup to capture workflow for tomography.
  • (Chris) If you use (ISO) standards then you can benefit from tools that use those standards
  • (Malcolm) You must operate without changing the behaviour of the scientists or their applications. Can we do something quickly for Keith's problem?
  • (Martin) Can work if 1. a model exists that people agree on, and 2. a log is in operation that we can harvest from. We need an incentive: the PI says do it this way, or they provide a benefit (beer) to the users.
  • (Jano) This is the carrot-and-stick issue.
  • (Keith) Can we automatically store everything scientists do? Surely this cannot be a lot of information even if we record raw keystrokes (or other input devices).
  • (Malcolm) How do we get a (simple) probe in the right place without disrupting the flow of science?
  • (Torild) Can you realise this on a small scientific problem, then publish it and show that it works.
  • (Michael) The reward system that drives us is not helping us to look at new and useful tools. There is an ignorance factor: we are all in fields where there are no right answers. We may end up not knowing how to do things best, just acting sensibly. Probably start with a narrow focus.
  • (Chris) Integrating data to ask climate impact questions. Already work underway with services in this space.
  • (Michael) We need a group to help us push forward multidisciplinary challenges.
  • (Malcolm) Hard to get and retain these people.
  • (Martin) An idea taken from the US: put together all the expertise of the participants of the workshop and derive focussed projects from it. For example, one plan was to have relational views of NetCDF data, but that requires the right people working together.
  • (Keith) User should not care about technology underneath.
  • (Martin) Important that the user does not expect solutions tomorrow.
  • (Jano) What about the challenge in the US of 100 seismographs delivered per some time unit?
  • (Martin) We are very fragmented in Europe, but in many (database) ways we are ahead of the game. They shout louder; we can deliver high-quality pilots in focussed areas.
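On the recurring suggestion above of recording what scientists do without disrupting them (Malcolm's probe, Keith's automatic recording, the workflow capture from Martin's talk), a minimal sketch of a non-intrusive provenance probe as a Python decorator; the analysis step and log format are hypothetical:

 # A non-intrusive provenance 'probe': a decorator that logs each analysis
 # step (function name, arguments, timestamp) to a file without changing
 # how the scientist's code is written or called.
 import functools
 import json
 import time

 PROVENANCE_LOG = "provenance.jsonl"

 def record_provenance(func):
     @functools.wraps(func)
     def wrapper(*args, **kwargs):
         entry = {"step": func.__name__,
                  "args": repr(args),
                  "kwargs": repr(kwargs),
                  "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}
         with open(PROVENANCE_LOG, "a") as log:
             log.write(json.dumps(entry) + "\n")
         return func(*args, **kwargs)
     return wrapper

 @record_provenance
 def filter_events(events, min_magnitude):
     """Hypothetical seismology step: keep events above a magnitude cut-off."""
     return [e for e in events if e >= min_magnitude]

 print(filter_events([2.1, 4.7, 5.3], min_magnitude=4.0))

The resulting log can later be harvested, as Martin suggests, without the scientist having changed their behaviour beyond adding a one-line decorator.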


Sidebar discussion on copyright

  • (Ian) Are there copyright issues around EDINA? (Chris) No, as contracts are negotiated with JISC, which is the major funder of EDINA.
  • (Michael) Ordnance Survey copyright is a big problem. Via Tim Berners-Lee, the government promises public data will be free. There is a problem with value-added retailers that benefit from Ordnance Survey data.
  • (Ian) Lack of access to data is a problem for earth scientists in Europe, as compared to the US. POINT TO CAROLE GOBLE.
  • (Chris) Must take the difference in quality into account, UK has very high quality survey data.
  • (Torild) Point: seismology data is open and accessible.
  • (Chris R) I wasn't there, sorry. However, there are definitely licence restrictions on data accessible via EDINA, and these have caused and may in future cause difficulties in interoperability. E.g. it's not clear to me how you could do linked data in a closed-access system.

Sidebar discussion SciLens

  • (Martin) SciLens is a proposal to the EU for a European activity to compete with the SciDB http://scidb.org/ activity in the US
  • (Martin) Has several groups that will provide the traction for getting the technology going
  • (Malcolm) This requires commitment from researchers to accept no (or significantly less) scientific output during the setup

