Conference Report: Gordon Conference “Grand Challenges in Data-Intensive Discovery”

Posted on November 3, 2010 by mdl in Science

The Grand Challenges in Data-Intensive Discovery conference was held at the San Diego Supercomputer Center (SDSC) on October 26 – 28, 2010. “Gordon” is a new machine funded by NSF with a 20 M$ grant to support data-intensive research applications. The machine will go into production in 3Q2011. The conference inaugurated a small predecessor system called “Dash”.

Both machines are built by Appro. They combine low-power processors, flash disk, and virtual shared memory technology. A Gordon compute node will consist of two low-power CPUs (Intel Atom or Sandy Bridge), 64 GB of RAM, and 256 GB of flash disk. An I/O node provides 4 TB of flash disk. 32 compute nodes and 2 I/O nodes form a so-called supernode with virtual shared memory via ScaleMP. The machine has a total of 32 supernodes, connected by a 3D torus network built with IB QDR, and is equipped with 4 PB of standard rotating disk. It is expected to provide an I/O rate of 35 million IOPS and a disk I/O bandwidth of > 100 GB/s, which should make it very suitable for applications with a lot of random disk access (see Gordon Architecture).
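To get a feeling for the size of the full machine, the per-node figures quoted above can be multiplied out. The following is only a back-of-the-envelope sketch: the constants are the numbers from this report (not a final machine specification), the function name `gordon_totals` is made up for illustration, and 1 TB = 1024 GB is an assumption.

```python
# Per-node figures as quoted in this report; all totals are plain arithmetic.
SUPERNODES = 32
COMPUTE_NODES_PER_SUPERNODE = 32
IO_NODES_PER_SUPERNODE = 2
RAM_PER_COMPUTE_NODE_GB = 64
FLASH_PER_COMPUTE_NODE_GB = 256
FLASH_PER_IO_NODE_TB = 4
GB_PER_TB = 1024  # assumption: binary units


def gordon_totals():
    """Aggregate node counts, memory and flash capacity (hypothetical helper)."""
    compute_nodes = SUPERNODES * COMPUTE_NODES_PER_SUPERNODE
    io_nodes = SUPERNODES * IO_NODES_PER_SUPERNODE
    return {
        "compute_nodes": compute_nodes,
        "io_nodes": io_nodes,
        # Memory visible inside one supernode through virtual shared memory.
        "ram_per_supernode_tb": COMPUTE_NODES_PER_SUPERNODE * RAM_PER_COMPUTE_NODE_GB / GB_PER_TB,
        "total_ram_tb": compute_nodes * RAM_PER_COMPUTE_NODE_GB / GB_PER_TB,
        "compute_flash_tb": compute_nodes * FLASH_PER_COMPUTE_NODE_GB / GB_PER_TB,
        "io_flash_tb": io_nodes * FLASH_PER_IO_NODE_TB,
    }


if __name__ == "__main__":
    for name, value in gordon_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions a single supernode offers about 2 TB of shared memory, and the whole machine has 1024 compute nodes with 64 TB of RAM.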

The talks during the conference were a “tour d’horizon” of different potential application fields like CFD, astrophysics/astronomy, climate, seismology, and biological sciences (see agenda).

Some of the essential topics in this area

  1. Discussing data
    There appears to be a way of doing science with generated data that goes beyond the usual “I analyse and then visualise what I computed myself”. Researchers take their simulated data to workshops, seminars, and summer schools to discuss the results with other scientists from their field, i.e. they only understand what they computed once they talk with others about their results (e.g. the summer program at the Center for Turbulence Research at Stanford). Clearly you cannot do this with in-situ visualisation; you need archived results, because the discussion constantly goes forward and backward in time.
  2. Feature extraction instead of (or in addition to) visualisation
    If you have very complex time-dependent 3D output with several fields, you may not be able to see effects that you would easily identify in a 2D simulation. To detect these you need techniques like automatic feature extraction, which may also detect features you have never identified visually before. Such mining techniques may be as compute-intensive as the simulation itself. One example of a feature extraction package for which results were presented at the conference is MineTool.
  3. Combine output of different simulations or of observations
    A series of talks showed how you can gain new insight and detect new features or events by combining the results of different simulations, observations, or both. You store the result of such processing as well as the source data, and the processed data may be several factors larger than the raw data (buzz term “value added federated data sets”).
  4. Graph problems
    The biggest computational obstacles in data analytics are graph problems whose graphs have little or no locality and therefore cannot be parallelised (easily). Such problems seem to be common in economics and social sciences. Because they do not fit the standard distributed memory architecture of today’s machines, they require special system designs in order to run efficiently. An example is the massively multithreaded Cray XMT architecture, which also provides a single shared memory for the whole machine. PNNL employs the Cray XMT for such problems and showed very promising results. In other words, you may need a different computer architecture for your data analysis facility.
  5. Streaming large observational data
    If you run large machinery that produces data volumes beyond what can be stored, you basically cache the data on a system and try to identify online whether something interesting has been observed which a) needs to be stored long term and b) may require you to adjust your scientific instrument, e.g. to point the telescope at a certain part of the sky.
  6. Networking
    For some data-intensive applications it was claimed that you have to architect the network end-to-end, provide QoS or dedicated lambdas, and bring glass to the desk of the researcher (“end-to-end lightpath”).
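The locality argument from the graph problems item can be made a bit more tangible with a toy experiment. The sketch below is not from any of the talks: it assumes a naive partition of the vertices into contiguous blocks (one block per node of a distributed memory machine) and counts how many edges cross block boundaries, i.e. how many edge traversals would turn into remote accesses. The helper names `cut_fraction`, `ring_edges` and `random_edges` are made up for illustration.

```python
import random


def cut_fraction(edges, num_vertices, parts):
    """Fraction of edges whose endpoints lie in different contiguous vertex blocks."""
    block = -(-num_vertices // parts)  # ceiling division: vertices per block
    crossing = sum(1 for u, v in edges if u // block != v // block)
    return crossing / len(edges)


def ring_edges(n):
    """Graph with strong locality: each vertex is linked to its successor."""
    return [(i, (i + 1) % n) for i in range(n)]


def random_edges(n, m, seed=0):
    """Graph without locality: both endpoints are drawn uniformly at random."""
    rng = random.Random(seed)
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(m)]


if __name__ == "__main__":
    n, parts = 10_000, 16
    print("ring graph  :", cut_fraction(ring_edges(n), n, parts))
    print("random graph:", cut_fraction(random_edges(n, n), n, parts))
```

For the ring almost every edge stays inside one block, whereas for the random graph the crossing fraction is expected to be close to 1 − 1/16. A smarter partitioner would help for structured graphs, but a graph with no locality leaves it little to exploit, which is the motivation for shared memory designs like the Cray XMT.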

Other random pieces of information

  • Commodity flash is 5x slower (bandwidth, IOPS) than enterprise technology, and has a large difference between read and write speed.
  • There is now an I/O component in IPM, the performance tool from NERSC. It was used to accelerate the performance of HDF5 by an order of magnitude.
  • Today’s technology combination for image processing and data analytics is GPUs, CUDA, and SSDs. GPUs and CUDA seem to be the technology that students in the research institutes adopt quickest.
  • Priorities for data-intensive computing from a scientist’s point of view:
    1. Total storage
    2. Cost
    3. Sequential I/O bandwidth (this was challenged by the audience because other application types require random I/O and high IOPS)
    4. Fast stream processing with GPU
    5. Low power
  • Activities are under way to bring down the cost of storing data on disk: a project at Johns Hopkins aims to create 5 PB of mass storage for 1 M$.
  • Don’t move data. Transferring 100 TB over a dedicated 10 Gbit/s link takes about 1 day; better to drive the disk drives around. Generating data at one location and analysing it at a different location is too expensive an execution model.
  • 100 TB seems to be the barrier for managing your data within the institute. Beyond that you need centralised facilities, but make sure that you keep the agility of the local solutions.
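The one-day figure for moving 100 TB is easy to check. The following is a minimal sketch assuming decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bit/s); the function name `transfer_time_hours` and the `efficiency` parameter (the fraction of the nominal line rate actually achieved) are illustrative assumptions, not something presented at the conference.

```python
def transfer_time_hours(size_tb, link_gbit_s, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link of link_gbit_s Gbit/s.

    efficiency is the assumed fraction of the nominal bandwidth that is usable.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    print(f"ideal link    : {transfer_time_hours(100, 10):.1f} h")
    print(f"80% efficiency: {transfer_time_hours(100, 10, efficiency=0.8):.1f} h")
```

At the full line rate the transfer takes a little over 22 hours, so “1 day” is the best case; any protocol overhead or shared use of the link pushes it beyond that.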