【Data Analysis(5)】XGBoost Algorithm Predicts Returns (Part 1)

Use an algorithm to learn investment factors and predict returns.

TEJ 台灣經濟新報

Published in

TEJ-API Financial Data Analysis

5 min read · Oct 12, 2021

Highlights

Difficulty：★★★☆☆
Setting Virtual Environment
XGBoost Introduction and Installation

Preface

In recent years many algorithms have emerged, and a variety of mathematical models have been developed to solve problems. The classic model is regression. As technology has advanced, algorithms that learn and improve on their own (machine learning) have been developed, and today the most popular family is the neural network model (deep learning).

This article introduces the tree-based model XGBoost and is divided into two parts. Part 1 covers setting up the environment and installing the required modules. Part 2 covers data preprocessing, training, prediction, and visualization.

XGBoost Introduction

First, let’s introduce the popular algorithm XGBoost. Boosting is an ensemble technique that combines many weak learners into one stronger learner, which usually yields more accurate final predictions.

XGBoost (Extreme Gradient Boosting) is an implementation of gradient-boosted decision trees (GBDT). Each learning step builds on the errors of the previous steps: the existing model is kept, and a new function (a tree) is added to correct the remaining error, so the final model is a collection of many weak learners. It is mainly used for supervised learning and handles both classification and regression problems.
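To make the "learn from the previous errors" idea concrete, here is a minimal pure-Python sketch of boosting with one-split trees (stumps) under squared error. It is only an illustration of the general idea and is not how the xgboost library is implemented; the toy data and the names fit_stump and boost are made up for this example.

```python
# Toy gradient boosting with one-split "stumps" (illustration only,
# NOT the xgboost implementation). The data below is made up.

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Fit a weak learner: one threshold on x, one mean value per side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Keep the current model and add one stump per round to correct its errors."""
    base = sum(ys) / len(ys)          # start from the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]        # training error after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the current model
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
predict, history = boost(xs, ys)
print("training MSE before boosting:", history[0])
print("training MSE after boosting: ", history[-1])
```

Each stump on its own is a poor predictor, but the training error shrinks as stumps accumulate, which is the sense in which many weak learners form a stronger one.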

The Editing Environment and Modules Required

macOS and Jupyter Notebook

Virtual Environment

Because XGBoost depends on many modules, inconsistent versions can cause a long chain of errors. We therefore create a new environment in which to install these modules. There are many ways to install them; this tutorial uses a relatively simple, easy-to-follow approach that minimizes errors.

Step 1. Install Anaconda

Anaconda is a convenient all-in-one distribution for beginners. It addresses the installation difficulties caused by differences between systems. It bundles more than 1,000 installable packages, supports Windows, Linux, and macOS, and includes a virtual environment manager, which makes setting up and running a machine learning environment simple and fast.

Step 2. Open a terminal

On Windows, open Anaconda Prompt instead.

Enter the following command

conda create -n <env_name> python==3.8

Conda will ask whether you want to proceed with the installation. Type y and press Enter. In this article the new environment is named test, but you can choose any name you like.

conda env list

This command lists all the environments we have created.

Step 3. Activate the environment

conda activate <env_name>

The (base) prefix at the start of the terminal prompt now changes to the environment name, (test), which means the environment was activated successfully. If a later installation step fails and you need to start over, simply remove the environment with the command below.
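Besides looking at the prompt, the active environment can also be checked from Python itself. The sketch below assumes the environment was activated with conda activate, which sets the CONDA_DEFAULT_ENV variable; the helper name describe_environment is made up for this example.

```python
import os
import sys


def describe_environment():
    """Collect basic facts about the interpreter that is currently running."""
    return {
        # Name of the active conda environment; None when conda is not active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Directory the running interpreter belongs to.
        "prefix": sys.prefix,
        # Major.minor version, e.g. "3.8" in the environment created above.
        "python": "{}.{}".format(*sys.version_info[:2]),
    }


info = describe_environment()
for key, value in info.items():
    print(key, "=", value)
```

If conda_env shows the name you chose and python shows 3.8, the terminal is using the new environment.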

conda env remove -n <env_name>

Install XGBoost

Step 1. Activate the environment

conda activate <env_name>

Step 2. Enter the install command

conda install py-xgboost

As before, you will be asked whether to install these modules. Type y and press Enter to start; the installation is complete once the command finishes running. Simple, isn't it?

Install XGBoost visualization module graphviz

Step 1. Install Homebrew (with the new environment activated)

Homebrew is a package manager, comparable to using pip to install Python modules. On macOS, it is the most widely used package management tool.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Enter the command above in the terminal to install Homebrew.

Step 2. Install Graphviz

brew install graphviz
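One way to confirm that the Graphviz install worked is to look for its dot executable on the PATH. This is a minimal sketch, assuming the Homebrew install puts dot on the PATH; the helper name find_graphviz_dot is made up for this example.

```python
import shutil


def find_graphviz_dot():
    """Return the full path of the Graphviz 'dot' executable, or None if absent."""
    return shutil.which("dot")


dot_path = find_graphviz_dot()
if dot_path is None:
    print("Graphviz 'dot' was not found on PATH")
else:
    print("Graphviz 'dot' found at", dot_path)
```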

These are the main modules used in this article. However, the new environment does not yet include a few other modules we need (pandas, matplotlib, tejapi), so we install them separately. Separate the package names with spaces in the command.

pip install pandas matplotlib tejapi

Install jupyter notebook

Step 1. Open Anaconda and select the environment we just created

Step 2. Under Jupyter Notebook, click Install

Final Result

Finally, check in Jupyter whether the installation succeeded!
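A quick way to run this check is a notebook cell that looks up each module without importing it. The sketch below is one possible approach; the module list follows the packages installed earlier in this article, and the helper name missing_modules is made up for this example.

```python
import importlib.util

# Modules installed earlier in this article.
REQUIRED = ["xgboost", "pandas", "matplotlib", "tejapi"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed")
```

If any module is reported as missing, activate the environment in the terminal and install it again before moving on.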

Database

This article uses the TWN/AFF_RAW database, which provides trading factors for the algorithm to learn from. The database is built with reference to Kenneth R. French and the top three finance journals (JF, RFS, JFE). The indicators are calculated from Taiwan market data, and all of them are compiled at a monthly frequency.

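import tejapi
tejapi.ApiConfig.api_key = "your_key"  # placeholder: replace with your own TEJ API key

# Monthly factor data for stock 9921, from 2015 through 2020
df = tejapi.get('TWN/AFF_RAW',
                coid='9921',
                mdate={'gte': '2015-01-01', 'lte': '2020-12-31'},
                chinese_column_name=True,
                paginate=True)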

Conclusion

Part 1 of this article covers module installation. Most people run into a variety of installation problems when they first start programming, and setting up the environment is a programmer's first lesson. Once everything is installed, Part 2 will start using the database: we will process the data, feed it to the model, and predict returns as a reference for our investments.