Data
This page contains links to some of the data sets used in the book for
demonstration purposes.
USPS handwritten digit data
The USPS handwritten digit image data are contained in the file usps_resampled.mat,
available as bz2 (7.0 MB, unpack using
"tar -xjf usps_resampled.tar.bz2") or zip
(8.3 MB) archives. Besides the data file, the archive also contains a tiny
script loadBinaryUSPS.m for conveniently loading
pairs of digits, suitable for binary classification tasks. The data have
traditionally been split into 7291 cases for training and 2007
cases for testing. However, these two sets were actually collected in slightly
different ways, and are thus not very suitable for demonstrating learning
algorithms. (In fact, it is well known among machine learning practitioners
with experience of this data that the cases in the test set are much harder to
classify than the cases in the training set.) To avoid these problems, we
concatenated both sets, randomly reshuffled the cases, and divided them anew
into training and test sets containing 4649 cases each. (This also avoids the
problem of the previous partition having too few test cases to
say very much about the statistical significance of the difference between the
performance of different algorithms.) The file contains 4 variables:
  Name             Size        Bytes     Class
  test_labels      10x4649     371920    double array
  test_patterns    256x4649    9521152   double array
  train_labels     10x4649     371920    double array
  train_patterns   256x4649    9521152   double array
The *_patterns variables contain a raster scan of the 16 by 16 grey
level pixel intensities, which have been scaled such that the range is [-1, 1]. The
*_labels variables contain a one-of-k encoding of the classification, with
values -1 and +1; there is exactly one +1 per column.
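As an illustration of the encoding described above, the following is a minimal Python sketch, not part of the distributed archive and not a port of loadBinaryUSPS.m. The function names are hypothetical. It assumes that row i of a *_labels variable corresponds to class index i (the mapping from row index to digit is not stated on this page), and that the raster scan is stored row by row (the scan order is likewise an assumption). The optional loader relies on the third-party SciPy function scipy.io.loadmat.

```python
def load_usps(path="usps_resampled.mat"):
    """Load the four variables from the .mat file (requires SciPy)."""
    from scipy.io import loadmat  # third-party; imported lazily
    data = loadmat(path)
    return (data["train_patterns"], data["train_labels"],
            data["test_patterns"], data["test_labels"])


def decode_one_of_k(labels):
    """Turn a k-by-n one-of-k array (-1/+1) into n class indices.

    Assumption: the class index is the row holding the single +1.
    """
    k, n = len(labels), len(labels[0])
    classes = []
    for j in range(n):
        hits = [i for i in range(k) if labels[i][j] > 0]
        if len(hits) != 1:
            raise ValueError(f"column {j}: expected one +1, found {len(hits)}")
        classes.append(hits[0])
    return classes


def select_binary_pair(patterns, classes, pos, neg):
    """Keep only the cases of two classes, with targets +1 (pos) and -1 (neg).

    patterns is d-by-n (one case per column); the result X is a list of
    d-dimensional vectors, one per selected case.
    """
    X, y = [], []
    for j, c in enumerate(classes):
        if c == pos or c == neg:
            X.append([row[j] for row in patterns])
            y.append(1 if c == pos else -1)
    return X, y


def to_image(vector, side=16):
    """Reshape a 256-vector into 16 rows of 16 pixels (row-major assumed)."""
    return [list(vector[r * side:(r + 1) * side]) for r in range(side)]
```

The helpers work on plain nested lists as well as on the arrays returned by loadmat, so the decoding logic can be checked on a toy example before the real file is loaded.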
The USPS digits data were gathered at the Center of Excellence in Document
Analysis and Recognition (CEDAR) at SUNY Buffalo, as part of a project
sponsored by the US Postal Service. The dataset is described in A Database
for Handwritten Text Recognition Research, J. J. Hull, IEEE PAMI 16(5)
550-554, 1994.
The SARCOS data
The data relate to an inverse dynamics problem for a seven
degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to
map from a 21-dimensional input space (7 joint positions, 7 joint
velocities, 7 joint accelerations) to the corresponding 7 joint
torques. Following previous work, we present results for just one of
the seven mappings, from the 21 input variables to the first of the
seven torques.
There are two data files: the training data
sarcos_inv.mat (9.9 MB) and the test data
sarcos_inv_test.mat (1.0 MB).
These two objects are of size
  Name              Size        Bytes     Class
  sarcos_inv        44484x28    9964416   double array
  sarcos_inv_test   4449x28     996576    double array
There are 44,484 training examples and 4,449 test examples.
The first 21 columns are the input variables, and the 22nd
column (the first of the seven torques) is used as the target variable.
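The column layout described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the loader assumes that each .mat file stores its array under the object name listed above (sarcos_inv and sarcos_inv_test) and that the third-party SciPy function scipy.io.loadmat is available.

```python
N_INPUTS = 21        # 7 joint positions, 7 velocities, 7 accelerations
TARGET_COLUMN = 21   # zero-based index of the 22nd column (first torque)


def load_sarcos(train_path="sarcos_inv.mat", test_path="sarcos_inv_test.mat"):
    """Load the training and test arrays (requires SciPy)."""
    from scipy.io import loadmat  # third-party; imported lazily
    train = loadmat(train_path)["sarcos_inv"]
    test = loadmat(test_path)["sarcos_inv_test"]
    return train, test


def split_inputs_target(rows):
    """Split each 28-column row into 21 inputs and the single target."""
    X = [list(row[:N_INPUTS]) for row in rows]
    y = [row[TARGET_COLUMN] for row in rows]
    return X, y
```

Applied to the training array, split_inputs_target should return 44,484 input vectors of length 21 together with 44,484 scalar targets.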
We thank Sethu Vijayakumar for providing the data.
These data had previously been used in the papers
- LWPR: An O(n) Algorithm for Incremental Real Time Learning in
  High Dimensional Space, S. Vijayakumar and S. Schaal,
  Proc. ICML 2000, 1079-1086 (2000).
- Statistical Learning for Humanoid Robots,
  S. Vijayakumar, A. D'Souza, T. Shibata, J. Conradt, S. Schaal,
  Autonomous Robots 12(1) 55-69 (2002).
- Incremental Online Learning in High Dimensions,
  S. Vijayakumar, A. D'Souza, S. Schaal, Neural Computation
  17(12) 2602-2634 (2005).
Go back to the web page for Gaussian Processes for Machine
Learning.
Last modified: Tue Mar 28 20:48:13 CEST 2006