Abstract
Feature selection is an important technique in machine learning and data mining, especially for mining high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and scales poorly when handling big data in the real world, especially when data arrives sequentially. Recent years have witnessed some promising emerging feature selection techniques based on online learning. Despite enjoying a significant advantage in efficiency and scalability over batch learning methods, existing online feature selection methods are not always sufficiently accurate, and are still not fast enough when handling massive-scale data with ultra-high dimensionality. To address these limitations, we propose a novel online feature selection method that exploits second-order information with optimized implementations, which not only improves learning efficacy but also significantly enhances computational efficiency. We conduct extensive experiments evaluating both the learning accuracy and time cost of different algorithms on massive-scale synthetic and real-world datasets, including a dataset with billion-scale features. Our results show that our technique achieves accuracy highly competitive with state-of-the-art batch feature selection methods, while incurring computational cost that is orders of magnitude lower than both state-of-the-art batch and online feature selection methods. On the billion-scale dataset (1B dimensions, 1B nonzero features, and 1M samples), our algorithm took only eight minutes on an ordinary single PC.

Implementation
We implemented the program in C++ with two parallel threads: one for data loading and the other for learning. The code was compiled with Visual Studio 2012. We will soon release the source code, which will support cross-compilers and cross-platform builds.

Click here to download the executable
For the usage of the executable, please refer to "ReadMe.txt" in the downloaded files.
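To illustrate the general idea of second-order online feature selection described above, the following is a minimal, self-contained sketch (not the released implementation). It assumes an AROW-style diagonal second-order update, where each weight carries a variance (confidence) term, combined with a hard feature budget B: after each update, only the B most confident (lowest-variance) features are retained. The class name, hyper-parameter names (B, r), and the truncation rule are illustrative assumptions, not the paper's exact algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// A sparse example: list of (feature index, value) pairs; labels are +/-1.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

// Minimal diagonal second-order online learner with a feature budget B.
// Each weight keeps a variance term (initialized to 1); after every update
// the model is truncated to the B lowest-variance (most confident) features.
class SecondOrderOFS {
 public:
  SecondOrderOFS(std::size_t budget, double r) : B_(budget), r_(r) {}

  // Prediction margin w . x under the current (truncated) model.
  double Margin(const SparseVec& x) const {
    double m = 0.0;
    for (const auto& [i, v] : x) {
      auto it = w_.find(i);
      if (it != w_.end()) m += it->second.first * v;
    }
    return m;
  }

  // AROW-style update on one labeled example (y in {+1, -1}).
  void Update(const SparseVec& x, int y) {
    double m = y * Margin(x);
    if (m >= 1.0) return;  // margin satisfied: no update needed
    // Confidence of this example under the diagonal covariance.
    double conf = 0.0;
    for (const auto& [i, v] : x) conf += Var(i) * v * v;
    double beta = 1.0 / (conf + r_);
    double alpha = (1.0 - m) * beta;  // second-order step size
    for (const auto& [i, v] : x) {
      auto it = w_.find(i);
      if (it == w_.end())
        it = w_.emplace(i, std::make_pair(0.0, 1.0)).first;
      auto& [w, var] = it->second;
      w += alpha * y * var * v;          // weight update scaled by variance
      var -= beta * var * var * v * v;   // shrink variance (gain confidence)
    }
    Truncate();
  }

  std::size_t NumFeatures() const { return w_.size(); }

 private:
  double Var(std::size_t i) const {
    auto it = w_.find(i);
    return it == w_.end() ? 1.0 : it->second.second;
  }

  // Keep only the B features with the smallest variance.
  void Truncate() {
    if (w_.size() <= B_) return;
    std::vector<std::pair<double, std::size_t>> vars;
    vars.reserve(w_.size());
    for (const auto& [i, wv] : w_) vars.push_back({wv.second, i});
    std::nth_element(vars.begin(), vars.begin() + B_, vars.end());
    for (std::size_t k = B_; k < vars.size(); ++k) w_.erase(vars[k].second);
  }

  std::size_t B_;  // feature budget
  double r_;       // regularization of the confidence term
  // feature index -> (weight, variance)
  std::unordered_map<std::size_t, std::pair<double, double>> w_;
};
```

In practice the released executable additionally pipelines data loading and learning in separate threads, which this single-threaded sketch omits for clarity.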
Links to External Resources
DataSets
We use 3 synthetic datasets and 7 real-world datasets. All datasets are publicly available and can be downloaded from the links below.

Synthetic Data
Details of the synthetic data are given in the following table.

Real-world Data for Binary Classification
We evaluate performance on both medium- and large-scale real-world data. The datasets can be downloaded from the Feature Selection website of Arizona State University, SVMLin, or LibSVM. Details of the datasets are shown in the following table.

| DataSet | Dim | #Train | #Test | #Feat |
| relathe | 4,322 | 1,000 | 427 | 87,352 |
| pcmac | 7,510 | 1,000 | 946 | 55,470 |
| basehock | 4,862 | 1,500 | 493 | 101,974 |
| ccat | 47,236 | 13,149 | 10,000 | 994,133 |
| aut | 20,707 | 40,000 | 22,581 | 1,969,407 |
| real-sim | 20,958 | 50,000 | 22,309 | 2,560,340 |
| news | 1,355,191 | 10,000 | 9,996 | 5,513,533 |
| rcv1 | 47,236 | 781,265 | 23,149 | 59,155,144 |
| url | 3,231,961 | 2,000,000 | 396,130 | 231,249,028 |
Caltech101 for Object Recognition
We use the Caltech-101 dataset, which contains 9,197 images spanning 101 object categories and a background category. We randomly select 30 images for training and another 30 images for testing in each category. Each image is represented by a 4,075-dimensional feature vector, as introduced in Multiclass Object Recognition with Sparse, Localized Features.

| DataSet | Dim | #Train | #Test | #Feat | #Class |
| caltech* | 4,075 | 3060 | 2293 | 12,469,500 | 102 |
- We run the online algorithms for 10 passes over the dataset.
- *: the linked page provides the code to extract the features.