Conference Report: Gordon Conference “Grand Challenges in Data-Intensive Discovery”

The Grand Challenges in Data-Intensive Discovery conference was held at the San Diego Supercomputer Center (SDSC) on October 26 – 28, 2010. “Gordon” is a new machine funded by the NSF with a 20 M$ grant to support data-intensive research applications. The machine will go into production in 3Q2011. The conference also inaugurated a small predecessor system called “Dash”.

The two machines are built by Appro. They combine low-power processors, flash disk, and virtual shared-memory technology. A Gordon compute node will consist of two low-power CPUs (Intel Atom or Sandy Bridge), 64 GB of RAM, and 256 GB of flash disk. An I/O node provides 4 TB of flash disk. 32 compute nodes and 2 I/O nodes form a so-called supernode with virtual shared memory via ScaleMP. The machine has a total of 32 supernodes, connected by a 3D torus network built with QDR InfiniBand, and equipped with 4 PB of standard rotating disk. It is expected to provide an I/O rate of 35 million IOPS and a disk I/O bandwidth of > 100 GB/s. This should make it very suitable for applications with a lot of random disk access (see Gordon Architecture).
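A rough aggregation of these per-node figures, derived only from the numbers quoted above (the final production configuration may of course differ):

    # Back-of-the-envelope aggregation of the Gordon figures quoted in this
    # report; the actual production configuration may differ.
    supernodes = 32
    compute_nodes_per_supernode = 32
    io_nodes_per_supernode = 2

    ram_per_compute_node_gb = 64
    flash_per_compute_node_gb = 256
    flash_per_io_node_tb = 4

    compute_nodes = supernodes * compute_nodes_per_supernode    # 1024 compute nodes
    io_nodes = supernodes * io_nodes_per_supernode              # 64 I/O nodes

    total_ram_tb = compute_nodes * ram_per_compute_node_gb / 1024        # 64 TB of RAM
    compute_flash_tb = compute_nodes * flash_per_compute_node_gb / 1024  # 256 TB of flash
    io_flash_tb = io_nodes * flash_per_io_node_tb                        # 256 TB of flash

    print(compute_nodes, io_nodes, total_ram_tb, compute_flash_tb + io_flash_tb)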

The talks during the conference were a “tour d’horizon” of different potential application fields like CFD, astrophysics/astronomy, climate, seismology, and biological sciences (see agenda).

Some of the essential topics in this area were:

  1. Discussing data
    There seems to be a way of doing science with generated data that goes beyond the usual “I analyse and then visualise myself what I computed”. Researchers take their simulated data to workshops, seminars, and summer schools to discuss the results with other scientists from their field, i.e. they only understand what they computed when they talk with others about their results (e.g. the summer program at the Center for Turbulence Research at Stanford). It is clear that you cannot do this with in-situ visualisation; you need archived results, because the discussion goes constantly forward and backward in time.
  2. Feature extraction instead of (or in addition to) visualisation
    If you have very complex time-dependent 3D output with several fields, you may not be able to see effects that you would easily identify in a 2D simulation. In order to detect these you need techniques like automatic feature extraction, which may also detect features you have never identified visually before. Such mining techniques may be as compute-intensive as the simulation itself. An example of a feature-extraction package for which results were presented at the conference is MineTool. A minimal sketch of the idea is given after this list.
  3. Combine output of different simulations or of observations
    A series of talks showed how you can gain new insight and detect new features or events by combining the results of different simulations, of observations, or of both. You store the result of such processing as well as the source data, and the processed data may be several times larger than the raw data (buzz term “value-added federated data sets”).
  4. Graph problems
    The biggest computational obstacle in data analytics is graph problems where the graphs have little or no locality and therefore cannot be parallelised (easily). Such problems seem to be common in economics and the social sciences. Because they do not fit the standard distributed-memory architecture of today’s machines, they require special system designs in order to run efficiently. An example is the massively multithreaded Cray XMT architecture, which also provides a single shared memory for the whole machine. PNNL employs the Cray XMT for such problems and showed very promising results. In other words, you may need a different computer architecture for your data-analysis facility.
  5. Streaming large observational data
    If you run large machinery producing data volumes beyond what can be stored, you basically cache the data on a system and try to identify online whether something interesting has been observed that a) needs to be stored long term and b) may require adjusting your scientific instrument, e.g. directing the telescope to a certain part of the sky. A sketch of this cache-and-trigger pattern is given after this list.
  6. Networking
    For some data-intensive applications it was claimed that you have to architect the network end-to-end, provide QoS or dedicated lambdas, and bring glass to the desk of the researcher (“end-to-end lightpath”).
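On topic 2: a minimal sketch of what automatic feature extraction can look like. This is not MineTool but an illustrative stand-in: threshold a 2D scalar field (say, vorticity magnitude) and label connected regions as candidate features; the field and the threshold are made up.

    # Illustrative feature-extraction sketch (not MineTool): threshold a 2D
    # scalar field and label connected regions as candidate features.
    import numpy as np
    from scipy import ndimage

    def extract_features(field, threshold):
        """Return bounding boxes of connected regions where |field| > threshold."""
        mask = np.abs(field) > threshold
        labels, n_features = ndimage.label(mask)   # connected-component labelling
        return ndimage.find_objects(labels), n_features

    # Synthetic data standing in for one time step of a simulation:
    rng = np.random.default_rng(0)
    vorticity = rng.normal(size=(512, 512))
    boxes, n = extract_features(vorticity, threshold=3.0)
    print(n, "candidate features")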
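On topic 5: a minimal sketch of the cache-and-trigger pattern for streaming observational data. All names, the trigger criterion, and the buffer size are placeholders, not any particular instrument’s pipeline.

    # Cache-and-trigger sketch for streaming data that cannot all be stored.
    from collections import deque

    BUFFER_SIZE = 1000                  # how many recent chunks we can afford to cache
    recent = deque(maxlen=BUFFER_SIZE)  # rolling cache; old chunks fall out and are lost

    def interesting(chunk):
        """Placeholder online trigger, e.g. a transient exceeding a threshold."""
        return max(chunk) > 5.0

    def archive(chunks):
        """Placeholder for long-term storage of the event plus its cached context."""
        pass

    def adjust_instrument(chunk):
        """Placeholder for feedback to the instrument, e.g. retargeting a telescope."""
        pass

    def process_stream(stream):
        for chunk in stream:
            recent.append(chunk)            # cache: most data is never stored long term
            if interesting(chunk):          # online detection
                archive(list(recent))       # (a) store the event long term
                adjust_instrument(chunk)    # (b) possibly adjust the instrument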

Other random pieces of information

  • Commodity flash is 5x slower (bandwidth, IOPS) than enterprise technology, and has a large difference between read and write speed.
  • There is now an I/O component in IPM, the performance tool from NERSC. It was used to improve the performance of HDF5 by an order of magnitude.
  • Today’s combination of technologies for image processing and data analytics is GPUs, CUDA, and SSDs. GPUs and CUDA seem to be the technologies that students at research institutes adopt the quickest.
  • Priorities for data-intensive computing from a scientist’s point of view:
  1. Total storage
  2. Cost
  3. Sequential I/O bandwidth (this was challenged in the audience because other application types require random I/O and high IOPS)
  4. Fast stream processing with GPU
  5. Low power
  • Activities are under way to bring down the cost of storing data on disk: a project at Johns Hopkins aims to create 5 PB of mass storage for 1 M$.
  • Don’t move data. Transferring 100 TB over a dedicated 10 Gbit/s link takes about a day (see the calculation after this list); rather drive the disk drives around. Generating data at one location and analysing it at a different location is too expensive an execution model.
  • 100 TB seems to be the barrier for managing your data within the institute. Beyond that you need centralised facilities, but make sure that you keep the agility of the local solutions.
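The back-of-the-envelope calculation behind the “don’t move data” advice, using only the numbers quoted above and assuming the link is fully utilised:

    # Time to move 100 TB over a dedicated 10 Gbit/s link (no protocol overhead).
    data_bits = 100e12 * 8           # 100 TB expressed in bits
    link_bps = 10e9                  # 10 Gbit/s
    seconds = data_bits / link_bps   # 80,000 s
    print(seconds / 3600, "hours")   # ~22 hours, i.e. roughly one day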