Evolutionary-HFT

Classification of Buy or Sell in HFT data with ensemble model of LightGBM and Random Forest.

Procedure

Data Explanation: Feature Columns

  • timestamp str, datetime string.
  • bid_price float, price of current bid in the market.
  • bid_qty float, quantity currently available at the bid price.
  • ask_price float, price of current ask in the market.
  • ask_qty float, quantity currently available at the ask price.
  • trade_price float, last traded price.
  • sum_trade_1s float, sum of quantity traded over the last second.
  • bid_advance_time float, seconds since bid price last advanced.
  • ask_advance_time float, seconds since ask price last advanced.
  • last_trade_time float, seconds since last trade.
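
For reference, a minimal sketch of loading these columns with pandas. The file name hft_data.csv and the use of pandas.read_csv are assumptions for illustration, not part of the repository's documented interface.

```python
# Illustrative loading of the order-book snapshots described above.
# The file name is a placeholder; the column names follow the list above.
import pandas as pd

df = pd.read_csv(
    "hft_data.csv",
    parse_dates=["timestamp"],
    dtype={
        "bid_price": float, "bid_qty": float,
        "ask_price": float, "ask_qty": float,
        "trade_price": float, "sum_trade_1s": float,
        "bid_advance_time": float, "ask_advance_time": float,
        "last_trade_time": float,
    },
)
print(df.dtypes)
```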

Labels

  • 1s side int

  • 3s side int

  • 5s side int

  • Each label records the kind of the first event that occurs within the following x seconds (x = 1, 3, or 5), where (a labelling sketch follows this list):

  • 0 -- No price change.

  • 1 -- Bid price decreased.

  • 2 -- Ask price increased.
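
A hedged sketch of how such labels could be derived from the snapshots. The function name label_side, the column name pattern "1s_side", and the brute-force look-ahead loop are illustrative assumptions, not the repository's actual labelling code.

```python
# Label each row with the first price event within a look-ahead horizon:
# 0 = no price change, 1 = bid price decreased, 2 = ask price increased.
import numpy as np
import pandas as pd

def label_side(df: pd.DataFrame, horizon_s: float) -> pd.Series:
    # Elapsed seconds since the first snapshot, used for the look-ahead window.
    ts = (df["timestamp"] - df["timestamp"].iloc[0]).dt.total_seconds()
    labels = np.zeros(len(df), dtype=int)
    for i in range(len(df)):
        window = df[(ts > ts.iloc[i]) & (ts <= ts.iloc[i] + horizon_s)]
        for _, row in window.iterrows():
            if row["bid_price"] < df["bid_price"].iloc[i]:
                labels[i] = 1  # bid price decreased first
                break
            if row["ask_price"] > df["ask_price"].iloc[i]:
                labels[i] = 2  # ask price increased first
                break
    return pd.Series(labels, index=df.index, name=f"{int(horizon_s)}s_side")

# Example: df["1s_side"] = label_side(df, 1)
```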

Preprocessing

  • Data preprocessing: The first step in the machine learning pipeline is to convert the input data into a format the model can understand; in this case, the raw Python dictionary is converted to JSON.

  • Data check: Before proceeding with further processing, it is important to check for any missing or null values in the dataset. The check_null() function reports any missing values in the dataset.

  • Missing value handling: After identifying missing or null values, the next step is to handle them. The fill_null() function fills them in based on domain assumptions: missing values in the 'sum_trade_1s' column are likely to be 0 when 'last_trade_time' is larger than 1 second, so they are filled with 0. The 'last_trade_time' column can likewise be filled with the previous record's 'last_trade_time' plus the elapsed interval when records are less than 1 second apart (see the sketch after this list).
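
A minimal sketch of the check/fill logic described above, assuming pandas DataFrames. The signatures of check_null() and fill_null() here are guesses at the intent, not the repository's exact implementation.

```python
import pandas as pd

def check_null(df: pd.DataFrame) -> pd.Series:
    # Count missing values per column.
    return df.isna().sum()

def fill_null(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # If no trade happened within the last second, the 1-second traded volume is 0.
    no_recent_trade = df["last_trade_time"] > 1.0
    df.loc[no_recent_trade, "sum_trade_1s"] = (
        df.loc[no_recent_trade, "sum_trade_1s"].fillna(0.0)
    )
    # Carry last_trade_time forward by the elapsed interval when records
    # arrive less than 1 second apart.
    dt = df["timestamp"].diff().dt.total_seconds()
    carried = df["last_trade_time"].shift(1) + dt
    df["last_trade_time"] = df["last_trade_time"].where(
        df["last_trade_time"].notna() | (dt >= 1.0), carried
    )
    return df
```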

Feature Engineering

  • Correlation filter: To reduce data redundancy and improve the efficiency of the model, it is important to remove columns that are highly correlated. The correlation_filter.filter() function identifies and removes highly correlated columns from the dataset.

  • Logical feature engineering: To improve the performance of the model, it is important to create new features that capture the underlying trading logic. The feature_eng.basic_features() function creates these features.

  • Time-rolling feature engineering: In time-series data, it is important to create features that capture the temporal dependencies between observations. The feature_eng.lag_rolling_features() function creates new features by lagging and rolling the time-series data (see the sketch after this list).
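
An illustrative version of the logical and lag/rolling steps, assuming features live in a pandas DataFrame. The parameter names (lags, windows) and derived feature names such as spread and book_imbalance are assumptions, not the repository's exact feature_eng API.

```python
import pandas as pd

def basic_features(df: pd.DataFrame) -> pd.DataFrame:
    # Simple "logical" features built from the order-book columns.
    out = df.copy()
    out["spread"] = out["ask_price"] - out["bid_price"]
    out["mid_price"] = (out["ask_price"] + out["bid_price"]) / 2
    out["book_imbalance"] = out["bid_qty"] / (out["bid_qty"] + out["ask_qty"])
    return out

def lag_rolling_features(df: pd.DataFrame, cols, lags=(1, 2, 3), windows=(5, 20)) -> pd.DataFrame:
    # Lagged and rolling versions of selected columns to capture temporal structure.
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)        # value `lag` ticks ago
        for w in windows:
            out[f"{col}_mean{w}"] = out[col].rolling(w).mean()  # rolling mean
            out[f"{col}_std{w}"] = out[col].rolling(w).std()    # rolling volatility
    return out
```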

Feature Selection

  • Feature selection: To improve the performance of the model, it is important to select the most relevant features from the dataset. The feature_selection.select() function uses a hybrid of genetic-algorithm selection and feature-importance selection.

  • Genetic algorithm selection: The feature_selection.GA_features() function uses a genetic algorithm to select the feature subset that maximizes the model's performance.

  • Feature importance selection: The feature_selection.rf_imp_features() function uses feature importance scores from a random forest to select relevant features (a sketch of both steps follows this list).
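
A sketch of the hybrid idea, assuming scikit-learn models and pandas inputs: importance-based pruning plus a very small genetic search over boolean feature masks. Population size, mutation rate, and scoring here are illustrative choices, not the repository's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_imp_features(X, y, top_k=20):
    # Keep the top_k features ranked by random-forest importance.
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return X.columns[order[:top_k]]

def GA_features(X, y, n_gen=10, pop_size=20, rng=np.random.default_rng(0)):
    # Evolve boolean masks over columns, scoring each subset by CV accuracy.
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n), dtype=bool)

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
        return cross_val_score(clf, X.loc[:, mask], y, cv=3).mean()

    for _ in range(n_gen):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]           # keep the fittest half
        children = parents.copy()
        cross = rng.integers(0, 2, size=children.shape, dtype=bool)  # uniform crossover
        children = np.where(cross, parents[::-1], children)
        mutate = rng.random(children.shape) < 0.05                   # small mutation rate
        children ^= mutate
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(m) for m in pop])]
    return X.columns[best]
```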

Modelling

  • Modelling: The model is an ensemble of LightGBM and random forest classifiers.

  • Random Forest: The model.random_forest() function trains a random forest model.

  • LightGBM: The model.lightgbm() function trains a LightGBM model (an ensemble sketch follows this list).
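
A sketch of such an ensemble using scikit-learn's VotingClassifier to soft-vote over the two models; the repository's model.random_forest() and model.lightgbm() wrappers may combine them differently.

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

def build_ensemble():
    rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
    # Average predicted class probabilities from both models.
    return VotingClassifier([("rf", rf), ("lgbm", lgbm)], voting="soft")

# ensemble = build_ensemble().fit(X_train, y_train)
# preds = ensemble.predict(X_test)
```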

Parameter Tuning

  • Parameter tuning: To improve the performance of the model, it is important to fine-tune its parameters. Depending on the size of the search space, either grid search or genetic search is used to tune the LightGBM model.

  • Grid search: The model.GS_tune_lgbm() function uses grid search to tune the LightGBM model's parameters.

  • Genetic search: The model.GA_tune_lgbm() function uses genetic search to tune the LightGBM model's parameters (a sketch follows this list).
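
A sketch of the decision between search strategies, assuming a scikit-learn-style workflow. The parameter grid, the size threshold, and the use of RandomizedSearchCV as a stand-in for the genetic search are illustrative assumptions, not the repository's tuning code.

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Illustrative LightGBM search space.
param_grid = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500],
}

def tune_lgbm(X, y, max_grid_size=50):
    # Count the number of parameter combinations in the grid.
    n_combos = 1
    for values in param_grid.values():
        n_combos *= len(values)
    base = LGBMClassifier(random_state=0)
    if n_combos <= max_grid_size:
        # Small space: exhaustive grid search.
        search = GridSearchCV(base, param_grid, cv=3, scoring="accuracy")
    else:
        # Large space: a randomized search stands in here for the genetic search.
        search = RandomizedSearchCV(base, param_grid, n_iter=20, cv=3,
                                    scoring="accuracy", random_state=0)
    return search.fit(X, y).best_params_
```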


License: BSD 3-Clause "New" or "Revised" License


Languages

Language: Python 100.0%