OTL: Online Transfer Learning

Instructions for Source Code and Data usage


OTL, "Online Transfer Learning", addresses an online learning task on a target domain by transferring knowledge from a source domain. We do not assume that data in the target domain follow the same distribution as data in the source domain; the motivation of our work is to enhance a supervised online learning task on a target domain by exploiting knowledge learned from training data in source domains.

In the source code package, there are two folders: Classification and Concept Drift.

Classification

In the Classification folder, there are two sub-folders: Homogeneous and Heterogeneous.

Homogeneous

In the Homogeneous folder, there are

· all 5 binary classification algorithms: PA1_K_M (i.e., PA-I), PAIO_K_M, HomOTLf_K_M (i.e., HomOTL (fixed)), HomOTL1_K_M, and HomOTL2_K_M;

· avePA1_K_M, which outputs the average of all the online classifiers produced by PA1_K_M;

· the main procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C;

· the procedure EObeta, which evaluates the effect of parameter beta on HomOTL2.

To compare the HomOTL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the books_dvd dataset (please refer to the description of books_dvd below), type Experiment_OTL_K_M('books_dvd') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online support vector (SV) size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. You can use EOC('books_dvd') to evaluate the effect of parameter C on the books_dvd dataset, and use EObeta to evaluate the effect of parameter beta for HomOTL2 on all the datasets.
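The PA-I baseline named above follows the standard Passive-Aggressive update rule. Below is a minimal Python sketch of a linear (non-kernel) PA-I pass over a stream, with made-up toy data; the actual package implements kernelized MATLAB versions, so this only illustrates the update, not the shipped code:

```python
import numpy as np

def pa1_online(X, y, C=1.0):
    """One pass of linear PA-I over a stream, counting online mistakes.

    At each round: predict sign(w . x); on a margin violation
    (loss = max(0, 1 - y * (w . x)) > 0) update w by tau * y * x,
    where tau = min(C, loss / ||x||^2)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x, label in zip(X, y):
        score = w @ x
        if np.sign(score) != label:
            mistakes += 1
        loss = max(0.0, 1.0 - label * score)
        if loss > 0:
            tau = min(C, loss / (x @ x))
            w += tau * label * x
    return w, mistakes

# Hypothetical toy stream, not one of the package datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1)
w, m = pa1_online(X, y)
print(m, m / len(X))  # online mistake count and online mistake rate
```

The "online mistake rate" reported by the experiments is exactly this running count of prediction errors divided by the number of rounds.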

Heterogeneous

In the Heterogeneous folder, there are

· all 5 binary classification algorithms: PA1_K_M (i.e., PA-I), PAIO_K_M, HetOTL0_K_M, Ensemble_K_M, and HetOTL_K_M;

· avePA1_K_M, which outputs the average of all the online classifiers produced by PA1_K_M;

· the procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C.

To compare the HetOTL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the books_dvd dataset, type Experiment_OTL_K_M('books_dvd') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online SV size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. You can use EOC('books_dvd') to evaluate the effect of parameter C on the books_dvd dataset.


Concept Drift

In the Concept Drift folder, there are

· all 6 online algorithms: PE_K_M (i.e., Perceptron), PA1_K_M (i.e., PA-I), ShiftPE_K_M (i.e., Shifting Perceptron), ModiPE_K_M (i.e., Modified Perceptron), CDOLfix_K_M (i.e., CDOL (fixed)), and CDOL_K_M;

· the main procedure Experiment_OTL_K_M, which compares all the online algorithms;

· the procedure EOC, which evaluates the effect of parameter C.

To compare the CDOL algorithms with the other baselines, run Experiment_OTL_K_M. For example, to compare these online algorithms' performance on the emaildata dataset (please refer to the description of emaildata below), type Experiment_OTL_K_M('emaildata') in the MATLAB command window and press Enter. You will get a figure of online mistake rates, a figure of online SV size, a figure of online time consumption, and a table of the final mistake rates, SV sizes, and time consumption for all the algorithms. Moreover, you can evaluate the effect of parameter C on the emaildata dataset by using EOC('emaildata').

 

Dataset Descriptions

In the zip file, there are two folders: Data for Classification and Data for Concept Drift.

 

Data for Classification

In this folder, there are 6 binary-class datasets used for testing the HomOTL algorithms, the HetOTL algorithms, and the other baselines. These datasets are: books_dvd, dvd_books, ele_kit (i.e., electronics-kitchen), kit_ele (i.e., kitchen-electronics), landmine1, and landmine2.

 

Take books_dvd for example: it is in .mat format and consists of three matrices: data, ID_old, and ID_new. The dimension of data is 4000-by-473857, which means there are 4000 training examples and each row has 473857 entries, consisting of one label and 473856 instance features. Take the first row for example: the first number is the label 1, while the remaining 473856 numbers form the instance vector.

 

The structure of data for books_dvd:

data               label   1-st feature   ...   473856-th feature
1-st example         1          0         ...          0
...                 ...        ...        ...         ...
4000-th example     -1          0         ...          0

 

The dimension of ID_old is 1-by-2000; it is a permutation of 1, 2, ..., 2000.

 

The dimension of ID_new is 20-by-2000. Every row of ID_new is a permutation of 2001, 2002, ..., 4000.
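The index matrices ID_old and ID_new ship precomputed inside the .mat files; the following Python sketch only mirrors the shapes and contents described above (ID_old as one permutation of the 2000 source-domain indices, ID_new as 20 independent permutations of the 2000 target-domain indices, one row per run):

```python
import numpy as np

rng = np.random.default_rng(0)

# ID_old: a single permutation of the first 2000 (source-domain) indices.
ID_old = rng.permutation(np.arange(1, 2001)).reshape(1, -1)

# ID_new: 20 independent permutations of indices 2001..4000
# (target-domain), one per row / experimental run.
ID_new = np.stack([rng.permutation(np.arange(2001, 4001)) for _ in range(20)])

print(ID_old.shape, ID_new.shape)  # (1, 2000) (20, 2000)
```

Each row of ID_new defines one random presentation order of the target-domain stream, so reported results can be averaged over 20 runs.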

 

When you would like to use these binary datasets, please put them into the folder 1. Classification\1. Homogeneous\data or 1. Classification\1. Heterogeneous\data.

 

Data for Concept Drift

In this folder, there are 6 binary-class datasets used for testing the binary CDOL algorithm and the other binary baselines. These datasets are: emaildata, mitface, newsgroup4, usenet1, usenet2, and usps.

 

Take emaildata for example: its structure is similar to that of the binary classification datasets. The only difference is that ID_ALL is designed to permute the indices of instances only within one period.

 

The structure of ID_ALL for emaildata:

ID_ALL      Columns 1-300                            Columns 301-600                              ...   Columns 1201-1500
1-st row    random permutation of {1, 2, ..., 300}   random permutation of {301, 302, ..., 600}   ...   random permutation of {1201, 1202, ..., 1500}
...         ...                                      ...                                          ...   ...
20-th row   random permutation of {1, 2, ..., 300}   random permutation of {301, 302, ..., 600}   ...   random permutation of {1201, 1202, ..., 1500}
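ID_ALL is likewise precomputed inside emaildata; the Python sketch below only reproduces the layout in the table above (assuming, per the column headers, 5 drift periods of 300 instances and 20 rows), permuting indices within each period but never across periods:

```python
import numpy as np

rng = np.random.default_rng(0)
period, n_periods, n_rows = 300, 5, 20  # 5 periods of 300 -> columns 1..1500

# Each row shuffles indices only inside each 300-instance period, so
# the order of the concept-drift periods themselves is preserved.
ID_ALL = np.stack([
    np.concatenate([
        rng.permutation(np.arange(p * period + 1, (p + 1) * period + 1))
        for p in range(n_periods)
    ])
    for _ in range(n_rows)
])

print(ID_ALL.shape)  # (20, 1500)
```

Keeping the permutation within-period means every run sees the same sequence of concepts, while the instance order inside each concept varies across the 20 runs.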

 


Please refer to the OTL paper for more details.