Data

This page contains links to some of the data sets used in the book for demonstration purposes.

USPS handwritten digit data

The USPS handwritten image data are contained in the file usps_resampled.mat, available as bz2 (7.0 Mb, unpack using "tar -xjf usps_resampled.tar.bz2") or zip (8.3 Mb) archives. Besides the data file, the archive also contains a tiny script, loadBinaryUSPS.m, for conveniently loading pairs of digits, suitable for binary classification tasks.

The data have traditionally been split into 7291 cases for training and 2007 cases for testing. However, these two sets were actually collected in slightly different ways, and are thus not very suitable for demonstrating learning algorithms. (In fact, it is well known among machine learning practitioners with experience of this data that the cases in the test set are much harder to classify than the cases in the training set.) To avoid these problems, we concatenated both sets (9298 cases in total), randomly reshuffled the cases, and divided them anew into training and test sets containing 4649 cases each. This also avoids the problem of the previous partition having too few test cases to say very much about the statistical significance of differences in performance between algorithms. The file contains 4 variables:

  test_labels         10x4649                371920  double array
  test_patterns      256x4649               9521152  double array
  train_labels        10x4649                371920  double array
  train_patterns     256x4649               9521152  double array

The *_patterns variables contain a raster scan of the 16 by 16 grey level pixel intensities, which have been scaled such that the range is [-1, 1]. The *_labels variables contain a one-of-k encoding of the classification with values -1 and +1, with exactly one +1 per column.
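As an illustration of the layout just described, the following Python/NumPy sketch decodes the one-of-k labels, reshapes the pattern columns into images, and selects a pair of classes for binary classification. It is not a port of loadBinaryUSPS.m; the function names are hypothetical, and two points are assumptions not stated on this page: that row i of the label matrix (counting from zero) corresponds to digit i, and that the raster scan is row-major.

```python
import numpy as np

def decode_labels(labels):
    """Convert a (k, n) one-of-k matrix of -1/+1 values into n class indices."""
    labels = np.asarray(labels)
    # Each column must contain exactly one +1.
    if not np.all((labels == 1).sum(axis=0) == 1):
        raise ValueError("expected exactly one +1 per column")
    # Assumption: row index i (zero-based) stands for digit i.
    return labels.argmax(axis=0)

def to_images(patterns, side=16):
    """Reshape a (side*side, n) pattern matrix into n images of side x side."""
    patterns = np.asarray(patterns)
    # Assumption: row-major raster scan; transpose each image if it looks rotated.
    return patterns.T.reshape(-1, side, side)

def binary_subset(patterns, labels, class_a, class_b):
    """Keep only two classes, mapping class_a to +1 and class_b to -1."""
    classes = decode_labels(labels)
    keep = (classes == class_a) | (classes == class_b)
    y = np.where(classes[keep] == class_a, 1, -1)
    return np.asarray(patterns)[:, keep], y

# Tiny synthetic stand-in for the real arrays: 3 classes, 4 cases, 2x2 "images".
demo_labels = np.array([[ 1, -1, -1,  1],
                        [-1,  1, -1, -1],
                        [-1, -1,  1, -1]])
demo_patterns = np.linspace(-1.0, 1.0, 16).reshape(4, 4)

demo_classes = decode_labels(demo_labels)
demo_x, demo_y = binary_subset(demo_patterns, demo_labels, 0, 2)
```

For the real data, scipy.io.loadmat("usps_resampled.mat") should return a dictionary holding the four arrays, assuming the file is in a MAT format that SciPy can read; the same functions would then apply to train_patterns and train_labels unchanged.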

The USPS digits data were gathered at the Center of Excellence in Document Analysis and Recognition (CEDAR) at SUNY Buffalo, as part of a project sponsored by the US Postal Service. The dataset is described in A Database for Handwritten Text Recognition Research, J. J. Hull, IEEE PAMI 16(5) 550-554, 1994.

The SARCOS data

The data relate to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Following previous work, we present results for just one of the seven mappings, from the 21 input variables to the first of the seven torques. There are two data files: the training data sarcos_inv.mat (9.9 Mb) and the test data sarcos_inv_test.mat (1.0 Mb). The two arrays they contain are of size:

  sarcos_inv        44484x28                  9964416  double array
  sarcos_inv_test    4449x28                   996576  double array

There are 44,484 training examples and 4,449 test examples. The first 21 columns are the input variables, and the 22nd column is used as the target variable. We thank Sethu Vijayakumar for providing the data.
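The column layout above can be sketched in Python/NumPy as follows. The function name and the torque argument are hypothetical; the page only states that the first 21 columns are inputs and the 22nd is the target, so treating columns 23 to 28 as the remaining six torques in order is an assumption.

```python
import numpy as np

N_INPUTS = 21   # 7 joint positions + 7 velocities + 7 accelerations
N_TORQUES = 7   # assumption: the 7 remaining columns are the torques, in order

def split_inputs_target(data, torque=0):
    """Split an (n, 28) SARCOS array into inputs X and one torque column y."""
    data = np.asarray(data)
    if data.ndim != 2 or data.shape[1] != N_INPUTS + N_TORQUES:
        raise ValueError("expected a 2-D array with 28 columns")
    if not 0 <= torque < N_TORQUES:
        raise ValueError("torque index must be between 0 and 6")
    X = data[:, :N_INPUTS]
    # torque=0 selects the 22nd column, the target used in the book.
    y = data[:, N_INPUTS + torque]
    return X, y

# Synthetic stand-in with the same number of columns as sarcos_inv.
demo = np.arange(3 * 28, dtype=float).reshape(3, 28)
X_demo, y_demo = split_inputs_target(demo)
```

With the default torque=0 this reproduces the split described above; applied to the real training array it would give X of shape (44484, 21) and y of length 44484.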

These data have previously been used in the following papers:

LWPR: An O(n) Algorithm for Incremental Real Time Learning in High Dimensional Space, S. Vijayakumar and S. Schaal, Proc ICML 2000, 1079-1086 (2000).
Statistical Learning for Humanoid Robots, S. Vijayakumar, A. D'Souza, T. Shibata, J. Conradt, S. Schaal, Autonomous Robots, 12(1) 55-69 (2002).
Incremental Online Learning in High Dimensions, S. Vijayakumar, A. D'Souza, S. Schaal, Neural Computation 17(12) 2602-2634 (2005).

Go back to the web page for Gaussian Processes for Machine Learning.

Last modified: Tue Mar 28 20:48:13 CEST 2006