Massive-Scale Online Feature Selection for
Sparse Ultra-high Dimensional Data

Yue Wu, Steven C.H. Hoi, Tao Mei

Abstract

Feature selection is an important technique in machine learning and data mining, especially for mining high-dimensional data. Most existing studies are restricted to batch learning, which is often inefficient and scales poorly when handling big data in the real world, especially when data arrives sequentially. Recent years have witnessed some promising feature selection techniques based on online learning. Despite a significant advantage in efficiency and scalability over batch methods, existing online feature selection methods are not always sufficiently accurate, and are still not fast enough for massive-scale data of ultra-high dimensionality. To address these limitations, we propose a novel online feature selection method that exploits second-order information with optimized implementations, which not only improves learning efficacy but also significantly enhances computational efficiency. We conduct extensive experiments evaluating both the learning accuracy and the time cost of different algorithms on massive-scale synthetic and real-world datasets, including a dataset with billion-scale features. Our results show that our technique achieves accuracy highly competitive with state-of-the-art batch feature selection methods, at a computational cost orders of magnitude lower than both state-of-the-art batch and online feature selection methods. On the billion-scale dataset (1B dimensions, 1B nonzero features, and 1M samples), our algorithm takes only eight minutes on a single ordinary PC.

Implementation

We implemented the program in C++ with two parallel threads, one for data loading and the other for learning. The code was compiled with Visual Studio 2012. We will soon release the source code, which builds across compilers and platforms.

Click here to download the executable

For the usage of the executable, please refer to "ReadMe.txt" in the downloaded files.

Links to External Resources

Datasets

We use 3 synthetic datasets and 10 real-world datasets. All datasets are publicly available and can be downloaded from the links below.

Synthetic Data

Details of synthetic data are given in the following table.
Dataset  #Train  #Test  Dim   IDim  NDim  #Feat
X1       100K    10K    10K   100   200   30M
X2       100K    10K    20K   200   400   60M
X3       1M      100K   1B    500   500   1B
  • IDim is the number of informative feature dimensions of each sample
  • NDim is the number of noise feature dimensions of each sample
  • Please refer to the paper for more details about the synthetic data

Real-world Data for Binary Classification

We evaluate the performance on both medium- and large-scale real-world data. The datasets can be downloaded from the Feature Selection Repository of Arizona State University, SVMLin, or LIBSVM. Details of the datasets are shown in the following table.
Dataset   Dim        #Train     #Test    #Feat
relathe   4,322      1,000      427      87,352
pcmac     7,510      1,000      946      55,470
basehock  4,862      1,500      493      101,974
ccat      47,236     13,149     10,000   994,133
aut       20,707     40,000     22,581   1,969,407
real-sim  20,958     50,000     22,309   2,560,340
news      1,355,191  10,000     9,996    5,513,533
rcv1      47,236     781,265    23,149   59,155,144
url       3,231,961  2,000,000  396,130  231,249,028

Caltech101 for Object Recognition

We use the Caltech-101 dataset, which contains 9,197 images from 101 object categories plus a background category. In each category, we randomly select 30 images for training and another 30 images for testing. Each image is represented by a 4,075-dimensional feature vector as introduced in Multiclass Object Recognition with Sparse, Localized Features.
Dataset   Dim    #Train  #Test  #Feat       #Class
caltech*  4,075  3,060   2,293  12,469,500  102
  • We run the online algorithms for 10 passes over the dataset.
  • *: the linked page provides the code to extract the features.
