Tutorial: Programming with Big Data in R

George Ostrouchov, Oak Ridge National Laboratory
Drew Schmidt, University of Tennessee


This tutorial will introduce attendees to High Performance Computing (HPC) concepts for dealing with big data using R. The content is particularly well suited to large distributed platforms, but it is also accessible on small multicore platforms.

Parallel programming for distributed platforms is most naturally done with a Single Program/Multiple Data (SPMD) viewpoint. This programming model is used by the vast majority of the HPC community. A major focus of this tutorial will be introducing attendees to this viewpoint, and contrasting it with R's usual manager/worker viewpoint and map-reduce variants.
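To make the SPMD viewpoint concrete, here is a minimal sketch using the pbdMPI package. It is not taken from the tutorial materials: the toy task (summing the integers 1 to 1000) and the file name spmd_sum.r are our own assumptions for illustration. Every process runs the same script; each one uses its rank to pick its own share of the work, and a collective operation combines the partial results.

```r
# spmd_sum.r (hypothetical file name): every MPI rank runs this same script.
library(pbdMPI)

init()                      # start MPI communication

rank <- comm.rank()         # this process's id: 0, 1, ..., size - 1
size <- comm.size()         # total number of processes

# Toy problem (an assumption for illustration): sum the integers 1..n.
n <- 1000

# Each rank selects its own share of the indices in round-robin fashion.
idx <- which((seq_len(n) - 1) %% size == rank)
local_sum <- sum(idx)

# Collective operation: every rank contributes its partial sum and
# every rank receives the global total. No manager process is involved.
total <- allreduce(local_sum, op = "sum")

# By default only rank 0 prints, so the total (500500) appears once.
comm.print(total)

finalize()                  # shut down MPI cleanly
```

A script like this is typically launched in batch with something like `mpirun -np 4 Rscript spmd_sum.r`, where the process count of 4 is arbitrary. In contrast to the manager/worker style, no process hands out tasks or gathers results on behalf of the others.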

In this tutorial we will:


The tutorial aims to introduce the basics of parallel programming in the SPMD programming model using MPI and the pbdR system of packages.

We also hope to encourage package developers to build pbdR support into their packages, so that more R analytics scale to large computational platforms, and to let specific user needs guide our further development of pbdR.



We assume intermediate knowledge of R. No prior parallel programming experience is necessary. If you wish to follow along on your multicore laptop during the tutorial, please install (or check that you have):

Please see our installation instructions for each major platform.

Intended Audience

The R programmer with an interest in parallel programming and a need to handle very large data.

Workshop Materials

Slides and source code for the tutorial will be made available by the first week of July 2013 on the pbdR website.

Thank you for registering to participate in the "Programming with Big Data in R" tutorial. The tutorial is structured so that you can follow along "lecture style" or you can engage with the examples "hands on." Here are a few suggestions that will allow you to get the most out of this tutorial.

Related Links