Massive-Scale Online Feature Selection for
Sparse Ultra-high Dimensional Data

Yue Wu, Steven C.H. Hoi, Tao Mei

Abstract

Feature selection is an important technique in machine learning and data mining, especially for mining high-dimensional data. Most existing studies are restricted to batch learning, which is often inefficient and scales poorly when handling big data in the real world, especially when data arrives sequentially. Recent years have witnessed some promising feature selection techniques based on online learning. Despite a significant advantage in efficiency and scalability over batch methods, existing online feature selection methods are not always sufficiently accurate, and are still not fast enough for massive-scale data of ultra-high dimensionality. To address these limitations, we propose a novel online feature selection method that exploits second-order information with optimized implementations, which not only improves learning efficacy but also significantly enhances computational efficiency. We conduct extensive experiments evaluating both the learning accuracy and the time cost of different algorithms on massive-scale synthetic and real-world datasets, including a dataset with billion-scale features. Our results show that our technique achieves accuracy highly competitive with state-of-the-art batch feature selection methods, at a computational cost orders of magnitude lower than both state-of-the-art batch and online feature selection methods. On the billion-scale dataset (1B dimensions, 1B nonzero features, and 1M samples), our algorithm takes only eight minutes on a single ordinary PC.

Implementation

We implemented the program in C++ with two parallel threads, one for data loading and the other for learning. The code was compiled with Visual Studio 2012. We will soon release the source code, which builds across compilers and platforms.

Click here to download the executable

For the usage of the executable, please refer to "ReadMe.txt" in the downloaded files.

Links to External Resources

Datasets

We use 3 synthetic datasets and 10 real-world datasets. All datasets are publicly available and can be downloaded from the links below.

Synthetic Data

Details of synthetic data are given in the following table.
Dataset  #Train  #Test  Dim   IDim  NDim  #Feat
X1       100K    10K    10K   100   200   30M
X2       100K    10K    20K   200   400   60M
X3       1M      100K   1B    500   500   1B
  • IDim is the number of informative feature dimensions of each sample
  • NDim is the number of noise feature dimensions of each sample
  • Please refer to the paper for more details about the synthetic data

Real-world Data for Binary Classification

We evaluate the performance on both medium- and large-scale real-world data. The datasets can be downloaded from the Feature Selection Repository of Arizona State University, SVMLin, or LIBSVM. Details of the datasets are shown in the following table.
Dataset   Dim        #Train     #Test    #Feat
relathe   4,322      1,000      427      87,352
pcmac     7,510      1,000      946      55,470
basehock  4,862      1,500      493      101,974
ccat      47,236     13,149     10,000   994,133
aut       20,707     40,000     22,581   1,969,407
real-sim  20,958     50,000     22,309   2,560,340
news      1,355,191  10,000     9,996    5,513,533
rcv1      47,236     781,265    23,149   59,155,144
url       3,231,961  2,000,000  396,130  231,249,028

Caltech101 for Object Recognition

We use the Caltech-101 dataset, which contains 9,197 images from 101 object categories plus a background category. In each category, we randomly select 30 images for training and another 30 images for testing. Each image is represented by a 4,075-dimensional feature vector as introduced in Multiclass Object Recognition with Sparse, Localized Features.
Dataset   Dim    #Train  #Test  #Feat       #Class
caltech*  4,075  3,060   2,293  12,469,500  102
  • We run the online algorithms for 10 passes over the dataset.
  • *: the linked page provides the code to extract the features.
