2019 Workshop:Geospace Data Science


The challenge, opportunity, and art of data science for geospace

Location, Date/Time and Duration

2 hours

Conveners

Ryan McGranaghan
Bharat Kunduri
Jade Morton
Eric Donovan
Asti Bhatt

Workshop Categories

Altitudes: IT - Latitudes: polar - Other:

Format of the Workshop

Two sessions:

  1. divided between geospace science presentations and an invited presentation from a 'sister discipline', followed by short presentations;
  2. panel discussion followed by breakout groups


Agenda

Thursday 20 June 2019 10 AM - 12 PM (Foundations of data science in CEDAR science)

- 10:00 - 11:40 AM: Talks (12-minute talks, 2-minute discussions)

    - Tomoko Matsuo (data assimilation; what it takes to fuse observations in geospace)
    - Farzad Kamalabadi (what is data science and how has it evolved in CEDAR science)
    - Steve Morley (machine learning and relationship to traditional statistical approaches)
    - Kristina Lynch (contributed talk slotted in here to accommodate Kristina's meeting schedule: prediction versus ‘interpolation’/‘filling in the blanks’ approaches and what we need to know)
    - Russell Stoneback (the data wrangling side of data science; Pysat)
    - Jim Ahrens LANL “Data Science at Scale” (big data in CEDAR/space science and tools to navigate it)
    - Yun-Ju Chen UT Dallas (applied data science in CEDAR applications)

- 11:40 AM - 12:00 PM: Contributed Talks (5 minutes - focused on provocation) & Open Discussion:

    - Foci:
        - Surface questions and topics for the afternoon panel session
        - Illustrate a concrete application or use case of data science in CEDAR science
    - Jenny Yang - Data and Machine Learning Challenges via an analysis of GNSS Network Position Errors during the March 2015 St. Patrick Storm
    - Muhammad Rafiq - Google Summer of Code and benefits of non-traditional partnerships
    - Asti Bhatt: Machine learning results from Frontier Development Laboratory
    - Gonzalo Cucho-Padin - “Optical tomography in CEDAR science and the data challenges and solutions”


Thursday 20 June 2019 1:30 - 3:30 PM (Identifying the trends and gaps for data science in CEDAR science and creating the needed new connections)

- 1:30 - 2:15 PM Panel

    - Short introduction by Ryan McGranaghan followed by 2-minute introduction by each panel member
        - Questions will be solicited from the audience
    - Nathaniel Frissell (citizen science)
    - Seebany Datta-Barua (CEDAR science at intersection of physics and engineering)
    - Susan Skone (advanced instruments and intelligent operation for CEDAR science; Transition Region Explorer (TREx))
    - Enrico Camporeale (trends in machine learning)
    - Laura Mazzaro (Descartes Labs - the utility of data science for the geosciences)

- 2:15 - 3:00 PM Breakout groups

    - Note: each moderator is responsible for coming up with a set of provocative questions that drive the topical conversation toward the session goals; the more visual and concrete, the better
    - Machine learning applications in geospace (success stories, lessons learned, and trends)
        - Moderator: Bharat Kunduri
    - Data provenance; Modernization of geospace science workflows using community recommended best practices (e.g., the use of open source software and cloud computing)
        - Moderator: Asti Bhatt
    - Interdisciplinary efforts (best practices, potential applications)
        - Moderator: Eric Donovan
    - Intersection of physics-based and data-driven methods; Validation
        - Moderator: Jade Morton
    - Common misconceptions about data science, machine learning, and artificial intelligence & ML adoption
        - Moderator: Ryan McGranaghan
    - Potential: ‘Going beyond accuracy’: robust evaluation of ML models
        - Moderator: TBD

- 3:00 - 3:30 PM Regroup and group discussion about the cross-cutting themes from the breakout session and make plans to move forward (i.e., create a directed community)



Session Outcomes

Synopses of Contributed talks

- Tomoko Matsuo discussed the methods and considerations for fusing data with other data and with models. She illustrated various approaches to data assimilation, classified according to problem characteristics. Tomoko introduced two distinct mindsets toward geospace specification: deduction and induction. A resounding message from her discussion was the importance of quantifying representativeness error.

- Farzad Kamalabadi outlined the evolution of data science in CEDAR science. He began by noting that data science has been around since the 1990s, but that the term was popularized in the 2010-2012 time frame, potentially with the advent of big data and computational capability. Given this history, Farzad illustrated the breadth of data science, while acknowledging that the small sub-component of the field focused on analytics is the one most frequently emphasized. He chose to highlight two relevant topics of data science analytics: 1) data assimilation (statistical estimation) and 2) learning theory (learning the system from a set of observations and the corresponding system states, i.e., outputs). His message about the wide spectrum of ‘learning’ techniques resonated.

- Steve Morley offered an insightful, provocative presentation exploring the relationship between traditional statistical learning and machine learning (ML). He posed the question, “What is the difference between applied statistics and ML?” The question remained a theme and topic of debate throughout both sessions. He compared linear regression with neural networks (NNs), intriguing the audience with an example where NNs essentially emulate a linear regression approach, and concluded with the statement that ‘NNs fit functions.’ Steve covered many forms of ML, which led to a discussion of how best to become familiar with and learn new methods.

- Kristina Lynch gave a short talk essentially focused on how to get the most out of every data point available to her. She presented fascinating new concepts to study auroral arcs, both existing and planned for the future. She emphasized that the ‘goodness of reconstruction’ of the arcs is critical to quantify.

- Russell Stoneback gave an illuminating talk about the data wrangling side of data science in geospace, specifically highlighting a new tool to bring together diverse data: Pysat. Pysat offers a foundation on which a capable space science data ecosystem can be built and one that will be interoperable with existing efforts for space science data wrangling.

- Jim Ahrens brought a fresh perspective to the CEDAR meeting, coming from the Los Alamos National Laboratory’s Data Science at Scale Group. Jim does not focus specifically on the geospace environment, but his methods and techniques offer incredible potential. He specifically pointed the audience to new visualization and interactive exploration tools that could be created using tools like ParaView (paraview.org) from the Kitware group. Jim offered a new definition of Big Data: “data that are too big for you and your colleagues to process.” He stressed the importance of databases, specifically mentioning that relational databases are most common, and advised that our databases of the future need to offer query services. He recommended the use of SQLite. Consideration of databases will be a prerequisite for delivering the same functionality as our current tools for data that are constantly growing by orders of magnitude. He concluded with a message that visualization and interactivity allow scientists to use their own intuition to generate data-driven discovery, an important message for our community.

- Yun-Ju Chen closed out our invited talks by providing an early career, application-oriented perspective on geospace data science. Yun-Ju’s message, which will hopefully have lasting impact, was that data science applications MUST start with a well-formulated question and an understanding of precisely what is important to one’s application. She provided three examples in the closing slide of her talk that serve as fantastic references for the interested audience.
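
To make the point from Tomoko Matsuo's talk about representativeness error concrete, here is a minimal sketch of a scalar data assimilation (optimal interpolation / Kalman) update. It is not taken from her slides. It assumes the common textbook treatment in which the observation error variance is the sum of an instrument error variance and a representativeness error variance, with the two assumed independent. The function name, argument names, and numbers are illustrative only.

```python
def assimilate(background, background_var, observation,
               instrument_var, representativeness_var=0.0):
    """Scalar optimal-interpolation update (illustrative sketch).

    Assumption: observation error variance = instrument error variance
    + representativeness error variance (independent errors).
    """
    obs_var = instrument_var + representativeness_var
    # Weight given to the observation relative to the background (model) state.
    gain = background_var / (background_var + obs_var)
    analysis = background + gain * (observation - background)
    analysis_var = (1.0 - gain) * background_var
    return analysis, analysis_var, gain


# Synthetic numbers: same observation, with and without representativeness error.
ignored = assimilate(10.0, 4.0, 14.0, instrument_var=1.0)
included = assimilate(10.0, 4.0, 14.0, instrument_var=1.0,
                      representativeness_var=3.0)
print("ignoring representativeness error:", ignored)
print("including representativeness error:", included)
```

With these synthetic numbers, leaving out the representativeness term pulls the analysis much closer to the observation and reports a smaller analysis variance than the full error budget supports, which is one way to see why quantifying it matters.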
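
Steve Morley's comparison of linear regression with neural networks can be illustrated with a small sketch. It is not his example. It assumes the simplest possible "network": a single neuron with no activation function, trained by gradient descent on mean squared error, and it uses synthetic data. Under those assumptions the trained weights should approach the closed-form least-squares fit. All names and hyperparameters are hypothetical.

```python
import random


def least_squares(xs, ys):
    """Closed-form ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """One neuron, identity activation, full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(2.0 * r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(2.0 * r for r in residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data: a noisy straight line (values are made up for illustration).
rng = random.Random(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

print("least squares :", least_squares(xs, ys))
print("linear neuron :", train_linear_neuron(xs, ys))
```

The two printed pairs should agree to many decimal places. Adding hidden layers and nonlinear activations changes the family of functions being fitted, not the fact that a function is being fitted.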
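
Jim Ahrens' recommendation of SQLite and of databases that offer query services can be sketched with Python's standard-library sqlite3 module. The table name, column names, station labels, and values below are hypothetical and synthetic. They only show what querying observations looks like compared with scanning files by hand.

```python
import sqlite3

# In-memory database for illustration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (station TEXT, time_utc TEXT, tec REAL)"
)

# Synthetic rows (hypothetical stations and values).
rows = [
    ("STN_A", "2019-06-20T06:00:00", 12.0),
    ("STN_A", "2019-06-20T12:00:00", 18.0),
    ("STN_B", "2019-06-19T12:00:00", 7.0),
    ("STN_B", "2019-06-20T06:00:00", 9.0),
    ("STN_B", "2019-06-20T12:00:00", 15.0),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
conn.commit()


def mean_by_station(connection, since):
    """Mean value and sample count per station for times at or after `since`."""
    query = (
        "SELECT station, AVG(tec), COUNT(*) FROM measurements "
        "WHERE time_utc >= ? GROUP BY station ORDER BY station"
    )
    return connection.execute(query, (since,)).fetchall()


for station, mean_tec, count in mean_by_station(conn, "2019-06-20T00:00:00"):
    print(station, mean_tec, count)
```

Storing timestamps as ISO 8601 text keeps the comparison in the WHERE clause a plain string comparison. A larger archive would likely want an index on the time column, which is omitted here for brevity.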


Panel Members' Charges to the Audience

- Seebany Datta-Barua: Be quantitative about uncertainties from early project stages

- Laura Mazzaro: Be very careful about what can and cannot be done with ML (useful, but NOT magic)

- Enrico Camporeale: Blur differences between statistics and machine learning (i.e., the grey box)

- Nathaniel Frissell: It takes time to play around with these techniques and develop new ideas well. Give it the requisite time and space. Do not get entrenched in traditional approaches

- Susan Skone: Aspire to be ‘power users’ of these trends and technologies. It is currently a struggle to find the requisite talent - so the question is ‘who will do this work?’ What is the phenotype of the CEDAR scientist of the future and what developments are required of the education and funding paths?


Additional Resources & Opportunity to Contribute

Please see the [Google Drive folder](https://drive.google.com/drive/u/1/folders/1fT33pNYT0fL6c7HADspnQUOZS2sgQEmK) for presentation slides, discussion notes, and other useful resources.

Estimated attendance

100

Requested Specific Days

Preferably Thursday or Friday, due to a desire to involve the GEM community, which will meet the following week, and to convener McGranaghan's limited availability to attend CEDAR outside of these dates.

Special technology requests

Justification

Data to advance the scientific understanding of the geospace environment are growing across the four V’s of ‘big data’: 1) Volume; 2) Variety; 3) Veracity (i.e., uncertainty); and 4) Velocity. This growth represents both a challenge, to efficiently and comprehensively utilize these data, and an opportunity for new discovery by embracing new technologies and analysis capabilities that scale well to the geospace environment. These developments have revolutionized the creation of new scientific insights from data through the union of statistics, computer science, applied mathematics, and visualization (i.e., data science).

This session will respond to several thrusts of the Decadal Survey:

  • Determine the origins of the Sun’s activity and predict the variations of the space environment,
  • Enable effective space weather and climatology capabilities, and
  • The need to establish a space weather research program to effectively transition research to operations;

and the CEDAR Strategic Plan:

  • Strategic Thrust 6: Manage, Mine, and Manipulate Geoscience Data and Models, and
  • Strategic Thrust 1: Encourage and Undertake a Systems Perspective to Geospace;

which collectively emphasize a need to embrace data science.

Additionally, the National Science Foundation announced new investments that will be made toward their 10 ‘big ideas’, particularly focusing on two ideas that together embody radically interdisciplinary work and data science across the scientific landscape:

Description

The time is ripe for the CEDAR community to embrace data science and the NSF big ideas. Therefore, this session will create a new conversation around increasing capability to address data challenges and opportunities and growing convergence in the CEDAR community.

Our specific objectives will be to:

  1. Identify problems and challenges that can immediately be addressed using data science tools (i.e., the compelling and transformational ‘use cases’);
  2. Promote interaction and collaboration between the CEDAR community and related disciplines (e.g., Earth Science);
  3. Improve agility and capability within CEDAR science; and
  4. Grow methodology transfer to enhance CEDAR science.

Outcomes:

  • Progress toward these objectives will increase our community’s competitiveness in the NSF big ideas and usher in a New Frontier of CEDAR research (McGranaghan et al., 2017). Additional outcomes will include:
  • Identify the powerful use cases to advance data science capabilities within CEDAR;
  • Sustain and amplify earlier data science efforts for CEDAR science applications;
  • Encourage and facilitate the adoption of data science in the CEDAR community; and
  • Curate a community to develop a new CEDAR Grand Challenge Workshop in 2020, including targeted objectives, roadmap, and a draft proposal.

The proposed workshop is a timely effort to sustain and amplify momentum from several previous workshops with a data science focus that the conveners have planned or been central contributors to, including:
