Feature-Selection

Feature selection algorithm based on a self-selected algorithm, loss function, and validation method

Home Page: https://pypi.org/project/MLFeatureSelection/

Feature Selection

This code performs general feature selection based on a machine learning algorithm and evaluation method of your choice.

More feature selection methods will be included in the future!

More examples are available in the example folder, including:

  • A simple Titanic example with 5-fold validation, evaluated by accuracy

  • A demo of S1 score improvement in the JData 2018 purchase-time prediction competition

New features

  • Set the sample ratio for large datasets

  • Set the maximum number of features

  • Set the maximum running time

  • Set a certain feature library (the sketch after this list shows the corresponding API calls)
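
These options correspond to configuration calls on the Select searcher that also appear in the walkthrough below. A minimal sketch, assuming an illustrative file name, label column, and argument values; the loss function, algorithm, and run call from the walkthrough are still required:

from MLFeatureSelection import FeatureSelection as FS
import pandas as pd

sf = FS.Select(Sequence=True, Random=False, Cross=False)  # choose the search procedure
sf.ImportDF(pd.read_csv('train.csv'), label='is_trade')   # illustrative dataset and label column
sf.SetSample(0.1, samplemode=0)           # run on a 10% sample, reusing the same subset
sf.SetFeaturesLimit(40)                   # cap the number of selected features at 40
sf.SetTimeLimit(100)                      # cap the running time at 100 minutes
sf.GenerateCol(key='mean', selectstep=2)  # build the feature library from matching column names
# ImportLossFunction, sf.clf and sf.run(...) are still needed, as shown in the walkthrough below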

To run the demo, please install via pip3:

pip3 install MLFeatureSelection

Version update

  • version 0.0.4.1: fixed a sampling bug when the sample ratio equals 1

Demo is here!

How to run

This demo is based on the IJCAI-2018 data mining competition.

  • Import FeatureSelection and the other necessary libraries
from MLFeatureSelection import FeatureSelection as FS 
from sklearn.metrics import log_loss
import lightgbm as lgbm
import pandas as pd
import numpy as np
  • Generate the dataset
def prepareData():
    df = pd.read_csv('data/train/trainb.csv')
    df = df[~pd.isnull(df.is_trade)]  # keep only rows with a label
    # encode item_category_list as consecutive integers
    item_category_list_unique = list(np.unique(df.item_category_list))
    df.item_category_list.replace(item_category_list_unique, list(np.arange(len(item_category_list_unique))), inplace=True)
    return df
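
The CSV above is the IJCAI-2018 competition data, which is not shipped with this repository. Purely for illustration, a tiny made-up frame with the column names this walkthrough relies on (is_trade as the label, day for the validation split, and the item-level features) could look like:

toy = pd.DataFrame({
    'instance_id':          [1, 2, 3, 4],      # identifier, listed as non-trainable below
    'day':                  [23, 23, 24, 24],  # used by the validation split
    'item_category_list':   [0, 1, 0, 2],
    'item_price_level':     [3, 5, 2, 7],
    'item_sales_level':     [10, 4, 8, 6],
    'item_collected_level': [5, 6, 7, 8],
    'item_pv_level':        [12, 9, 11, 10],
    'is_trade':             [0, 1, 0, 1],      # the label column
})  # values are invented; the real data has many more columns and rows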
  • Define your loss function
def modelscore(y_test, y_pred):
    return log_loss(y_test, y_pred)
  • Define the way to validate
def validation(X, y, features, clf, lossfunction):
    totaltest = 0
    for D in [24]:  # hold out day 24 as the validation set
        T = (X.day != D)
        X_train, X_test = X[T], X[~T]
        X_train, X_test = X_train[features], X_test[features]
        y_train, y_test = y[T], y[~T]
        clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
                eval_metric='logloss', verbose=False, early_stopping_rounds=200) # the train method must match your selected algorithm
        totaltest += lossfunction(y_test, clf.predict_proba(X_test)[:, 1])
    totaltest /= 1.0  # average over the held-out days (only one here)
    return totaltest
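
As a quick sanity check (not part of the original demo), the validation function can be called directly with the prepared data, the initial feature combination used later, and the same classifier:

df = prepareData()
baseline_features = ['item_category_list', 'item_price_level', 'item_sales_level',
                     'item_collected_level', 'item_pv_level']
clf = lgbm.LGBMClassifier(random_state=1, num_leaves=6, n_estimators=5000,
                          max_depth=3, learning_rate=0.05, n_jobs=8)
print('baseline log loss:', validation(df, df.is_trade, baseline_features, clf, modelscore))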
  • Define the cross method (required when Cross = True)
def add(x,y):
    return x + y

def substract(x,y):
    return x - y

def times(x,y):
    return x * y

def divide(x,y):
    return (x + 0.001)/(y + 0.001)  # the small offset avoids division by zero

CrossMethod = {'+':add,
               '-':substract,
               '*':times,
               '/':divide,}
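
Each entry is an ordinary element-wise binary function on two columns. How the searcher names and applies the crossed features is handled inside the library; the operations themselves can be illustrated directly:

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([4.0, 0.0, 6.0])
for symbol, func in CrossMethod.items():
    print(symbol, list(func(a, b)))  # element-wise +, -, * and smoothed /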
  • Initialize the searcher with a customized procedure (sequence + random + cross)
sf = FS.Select(Sequence = False, Random = True, Cross = False) #select the way you want to process searching
  • Import loss function
sf.ImportLossFunction(modelscore, direction = 'descend')
  • Import dataset
sf.ImportDF(prepareData(), label = 'is_trade')
  • Import cross method (required when Cross = True)
sf.ImportCrossMethod(CrossMethod)
  • Define non-trainable features
sf.InitialNonTrainableFeatures(['used','instance_id', 'item_property_list', 'context_id', 'context_timestamp', 'predict_category_property', 'is_trade'])
  • Define the initial feature combination
sf.InitialFeatures(['item_category_list', 'item_price_level','item_sales_level','item_collected_level', 'item_pv_level'])
  • Generate the feature library; you can specify a certain keyword and selection step
sf.GenerateCol(key = 'mean', selectstep = 2) #can iterate different features set
  • Set the maximum number of features
sf.SetFeaturesLimit(40) #maximum number of features
  • Set maximum time limit (in minutes)
sf.SetTimeLimit(100) #maximum running time in minutes
  • Set the sample ratio of the total dataset; when samplemode equals 0, the same subset is used every run, and when samplemode equals 1, a different subset is drawn each time
sf.SetSample(0.1, samplemode = 0)
  • Define the algorithm
sf.clf = lgbm.LGBMClassifier(random_state=1, num_leaves = 6, n_estimators=5000, max_depth=3, learning_rate = 0.05, n_jobs=8)
  • Define log file name
sf.SetLogFile('record.log')
  • Run with the self-defined validation method
sf.run(validation)

See the complete code in demo.py.

  • This code takes a while to run; you can stop it at any time and restart by placing the best feature combination found so far into sf.InitialFeatures(), for example as in the sketch below.
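
A minimal restart sketch (the feature list is illustrative; substitute the best combination recorded in record.log):

sf.InitialFeatures(['item_category_list', 'item_price_level', 'item_sales_level',
                    'item_collected_level', 'item_pv_level'])  # best combination found so far
sf.run(validation)                                             # resume the search from there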

This feature selection method achieved

Algorithm details

Procedure


License: MIT License

