OTL: Online Transfer Learning

Instructions for Source Code and Data usage


OTL, "Online Transfer Learning", addresses an online learning task on a target domain by transferring knowledge from a source domain. We do not assume that data in the target domain follow the same distribution as data in the source domain; the motivation of our work is to enhance a supervised online learning task on a target domain by exploiting knowledge learned from training data in source domains.

In the source code package, there are two folders: Classification and Concept Drift.

Classification

In the Classification folder, there are two sub-folders: Homogeneous and Heterogeneous.

Homogeneous

In the Homogeneous folder, there are

· all 5 binary classification algorithms: PA1_K_M (i.e., PA-I), PAIO_K_M, HomOTLf_K_M (i.e., HomOTL (fixed)), HomOTL1_K_M, and HomOTL2_K_M;

· avePA1_K_M, which outputs the average of all the online classifiers produced by PA1_K_M;

· the main procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C;

· the procedure EObeta, which evaluates the effect of parameter beta on HomOTL2.

To compare the HomOTL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the books_dvd dataset (please refer to the description of books_dvd below), type Experiment_OTL_K_M('books_dvd') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online support vector (SV) size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. You can use EOC('books_dvd') to evaluate the effect of parameter C on the books_dvd dataset, and use EObeta to evaluate the effect of parameter beta for HomOTL2 on all the datasets.
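The PA-I baseline named above follows the standard Passive-Aggressive update rule. Below is a minimal Python sketch of a linear (non-kernel) PA-I pass over a stream, with made-up toy data; the actual package implements kernelized MATLAB versions, so this only illustrates the update, not the shipped code:

```python
import numpy as np

def pa1_online(X, y, C=1.0):
    """One pass of linear PA-I over a stream, counting online mistakes.

    At each round: predict sign(w . x); on a margin violation
    (loss = max(0, 1 - y * (w . x)) > 0) update w by tau * y * x,
    where tau = min(C, loss / ||x||^2)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x, label in zip(X, y):
        score = w @ x
        if np.sign(score) != label:
            mistakes += 1
        loss = max(0.0, 1.0 - label * score)
        if loss > 0:
            tau = min(C, loss / (x @ x))
            w += tau * label * x
    return w, mistakes

# Hypothetical toy stream, not one of the package datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1)
w, m = pa1_online(X, y)
print(m, m / len(X))  # online mistake count and online mistake rate
```

The "online mistake rate" reported by the experiments is exactly this running count of prediction errors divided by the number of rounds.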

Heterogeneous

In the Heterogeneous folder, there are

· all 5 binary classification algorithms: PA1_K_M (i.e., PA-I), PAIO_K_M, HetOTL0_K_M, Ensemble_K_M, and HetOTL_K_M;

· avePA1_K_M, which outputs the average of all the online classifiers produced by PA1_K_M;

· the procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C.

To compare the HetOTL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the books_dvd dataset, type Experiment_OTL_K_M('books_dvd') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online SV size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. You can use EOC('books_dvd') to evaluate the effect of parameter C on the books_dvd dataset.


Concept Drift

In the Concept Drift folder, there are

· all 6 online algorithms: PE_K_M (i.e., Perceptron), PA1_K_M (i.e., PA-I), ShiftPE_K_M (i.e., Shifting Perceptron), ModiPE_K_M (i.e., Modified Perceptron), CDOLfix_K_M (i.e., CDOL (fixed)), and CDOL_K_M;

· the main procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C.

To compare the CDOL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the emaildata dataset (please refer to the description of emaildata below), type Experiment_OTL_K_M('emaildata') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online SV size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. Moreover, you can evaluate the effect of parameter C on the emaildata dataset by using EOC('emaildata').

 

Dataset Descriptions

In the zip file, there are two folders: Data for Classification and Data for Concept Drift.

 

Data for Classification

In this folder, there are 6 binary-class datasets used for testing the HomOTL algorithms, the HetOTL algorithms, and the other baselines. These datasets are: books_dvd, dvd_books, ele_kit (i.e., electronics-kitchen), kit_ele (i.e., kitchen-electronics), landmine1, and landmine2.

 

Take books_dvd for example: it is in .mat format and consists of three matrices: data, ID_old, and ID_new. The dimension of data is 4000-by-473857, which means there are 4000 training examples and each row has 473857 entries, consisting of one label and 473856 instance features. Take the first row for example: the first number is the label 1, while the remaining 473856 numbers form the instance vector.

 

The structure of data for books_dvd:

data               label   1-st feature   ...   473856-th feature
1-st example         1          0         ...          0
...                 ...        ...        ...         ...
4000-th example     -1          0         ...          0

 

The dimension of ID_old is 1-by-2000; it is a permutation of 1, 2, ..., 2000.

 

The dimension of ID_new is 20-by-2000. Every row of ID_new is a permutation of 2001, 2002, ..., 4000.
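The index matrices ID_old and ID_new ship precomputed inside the .mat files; the following Python sketch only mirrors the shapes and contents described above (ID_old as one permutation of the 2000 source-domain indices, ID_new as 20 independent permutations of the 2000 target-domain indices, one row per run):

```python
import numpy as np

rng = np.random.default_rng(0)

# ID_old: a single permutation of the first 2000 (source-domain) indices.
ID_old = rng.permutation(np.arange(1, 2001)).reshape(1, -1)

# ID_new: 20 independent permutations of indices 2001..4000
# (target-domain), one per row / experimental run.
ID_new = np.stack([rng.permutation(np.arange(2001, 4001)) for _ in range(20)])

print(ID_old.shape, ID_new.shape)  # (1, 2000) (20, 2000)
```

Each row of ID_new defines one random presentation order of the target-domain stream, so reported results can be averaged over 20 runs.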

 

When you would like to use these binary datasets, please put them into the folder 1. Classification\1. Homogeneous\data or 1. Classification\1. Heterogeneous\data.

 

Data for Concept Drift

In this folder, there are 6 binary-class datasets used for testing the binary CDOL algorithm and the other binary baselines. These datasets are: emaildata, mitface, newsgroup4, usenet1, usenet2, and usps.

 

Take emaildata for example: its structure is similar to that of the binary classification datasets. The only difference is that ID_ALL is designed to permute the indices of instances only within one period.

 

The structure of ID_ALL for emaildata:

ID_ALL      Columns 1-300                            Columns 301-600                              ...   Columns 1201-1500
1-st row    random permutation of {1, 2, ..., 300}   random permutation of {301, 302, ..., 600}   ...   random permutation of {1201, 1202, ..., 1500}
...         ...                                      ...                                          ...   ...
20-th row   random permutation of {1, 2, ..., 300}   random permutation of {301, 302, ..., 600}   ...   random permutation of {1201, 1202, ..., 1500}
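ID_ALL is likewise precomputed inside emaildata; the Python sketch below only reproduces the layout in the table above (assuming, per the column headers, 5 drift periods of 300 instances and 20 rows), permuting indices within each period but never across periods:

```python
import numpy as np

rng = np.random.default_rng(0)
period, n_periods, n_rows = 300, 5, 20  # 5 periods of 300 -> columns 1..1500

# Each row shuffles indices only inside each 300-instance period, so
# the order of the concept-drift periods themselves is preserved.
ID_ALL = np.stack([
    np.concatenate([
        rng.permutation(np.arange(p * period + 1, (p + 1) * period + 1))
        for p in range(n_periods)
    ])
    for _ in range(n_rows)
])

print(ID_ALL.shape)  # (20, 1500)
```

Keeping the permutation within-period means every run sees the same sequence of concepts, while the instance order inside each concept varies across the 20 runs.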

 


Please refer to the OTL paper for more details.