LUCIDA: How to Build a Robust Crypto Asset Portfolio with Multi-Factor Strategies (Data Preprocessing)
Preamble
In the previous installment we published the first article in this series, "Building a Robust Crypto Asset Portfolio with Multi-Factor Strategies: Theoretical Fundamentals"; this is the second article, on data preprocessing.
The data needs to be processed before and after the factor values are calculated, and before the validity of any single factor is tested. Specifically, preprocessing covers the handling of duplicate values, outliers/missing values/extreme values, standardization, and data frequency.
I. Duplicate values
Data-related definitions:
Diagnosing duplicate values starts with understanding what the data "should" look like. The data usually takes one of the following forms:
Principle: Once you have determined the index (key) of the data, you can know at what level the data should have no duplicate values.
Check methods (a usage sketch follows below):
pd.DataFrame.duplicated(subset=[key1, key2, ...])
pd.merge(df1, df2, on=[key1, key2, ...], indicator=True, validate='1:1')
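A minimal sketch of both checks, assuming hypothetical tables keyed by (token, date); all column names and values here are illustrative only:

import pandas as pd

# Hypothetical factor table: one row per (token, date) is expected
df = pd.DataFrame({
    'token': ['BTC', 'BTC', 'ETH', 'ETH', 'ETH'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-02'],
    'factor': [0.5, 0.7, 0.2, 0.9, 0.9],
})

# Rows that repeat the (token, date) key are duplicates
dupes = df[df.duplicated(subset=['token', 'date'], keep=False)]

# Hypothetical price table with unique keys
df2 = pd.DataFrame({
    'token': ['BTC', 'BTC', 'ETH', 'ETH'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'price': [42000, 42500, 2500, 2550],
})

# validate='1:1' raises MergeError if either side has duplicate keys
merged = pd.merge(df.drop_duplicates(subset=['token', 'date']), df2,
                  on=['token', 'date'], indicator=True, validate='1:1')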
II. Outliers/Missing Values/Extreme Values
Common causes of outliers:
Principles for handling outliers and missing values:
* Delete. Outliers that cannot be reasonably corrected can be considered for deletion.
* Replace. Often used for extreme values, e.g. Winsorizing or taking logarithms (the latter is less common).
* Fill. Missing values can be filled in a reasonable way; common methods include the mean (or a moving average), interpolation, filling with 0 via df.fillna(0), and forward/backward filling via df.ffill() / df.bfill(). Always check whether the assumption behind the chosen fill method holds; a short sketch follows the note below.
Machine-learning imputation should be used with caution, as backfilling in this way risks introducing look-ahead bias.
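A minimal sketch of the fill methods listed above, assuming a hypothetical daily factor Series with gaps (names and values are illustrative):

import numpy as np
import pandas as pd

# Hypothetical daily factor series with missing values
idx = pd.date_range('2023-01-01', periods=6, freq='D')
factor = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)

filled_mean = factor.fillna(factor.mean())   # fill with the overall mean
filled_interp = factor.interpolate()         # linear interpolation
filled_zero = factor.fillna(0)               # fill with 0
filled_ffill = factor.ffill()                # carry the last observation forward
filled_bfill = factor.bfill()                # uses future values: beware look-ahead bias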
Handling of extreme values:
1. Percentile method
Sort the data from smallest to largest and replace the values beyond the lower and upper percentile thresholds with the threshold values. For data with abundant history this method is relatively crude and may not be appropriate, since forcing a fixed proportion of observations to the thresholds can discard useful information.
2. 3σ / Three-standard-deviation method
Make the following adjustment to every factor in the data range: clip values to the interval [mean - 3σ, mean + 3σ], i.e. observations more than three standard deviations from the mean are replaced by the corresponding boundary value.
The drawback of this method is that data commonly used in quantitative finance, such as stock prices and token prices, typically exhibit sharp peaks and fat tails and do not satisfy the normality assumption; in that case the 3σ method will incorrectly flag a large amount of data as outliers.
3. Median Absolute Deviation (MAD) method
This method is based on the median and the absolute deviation from it, which makes the processed data less sensitive to extreme values or outliers. It is more robust than methods based on the mean and standard deviation.
Handling extreme values in factor data:

import numpy as np

class Extreme(object):
    def __init__(self, this_data):
        self.ini_data = this_data

    def three_sigma(self, n=3):
        # Clip values to [mean - n*std, mean + n*std]
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        low = mean - n * std
        high = mean + n * std
        return np.clip(self.ini_data, low, high)

    def mad(self, n=3):
        # Clip values to [median - n*MAD, median + n*MAD]
        median = self.ini_data.median()
        mad_median = abs(self.ini_data - median).median()
        high = median + n * mad_median
        low = median - n * mad_median
        return np.clip(self.ini_data, low, high)

    def quantile(self, l=0.025, h=0.975):
        # Clip values to the [l, h] quantile range
        low = self.ini_data.quantile(l)
        high = self.ini_data.quantile(h)
        return np.clip(self.ini_data, low, high)
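A minimal usage sketch, assuming a hypothetical pandas Series of raw factor values (the values and variable names are illustrative only):

import pandas as pd

# Hypothetical factor values with one obvious outlier
factor_values = pd.Series([0.10, 0.20, 0.15, 0.18, 5.00])

ext = Extreme(factor_values)
clipped_sigma = ext.three_sigma(n=3)                # 3σ clipping
clipped_mad = ext.mad(n=3)                          # MAD clipping
clipped_quantile = ext.quantile(l=0.025, h=0.975)   # percentile clipping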
III. Standardization
2. Min-Max Scaling
Converting each factor to the (0, 1) interval allows data of different magnitudes or ranges to be compared. It does not change the shape of the distribution within the data, nor does it make the values sum to 1.
3. Rank Scaling (rank percentile)
Convert the data to its ranks, and then convert the ranks into scores between 0 and 1, typically their percentile positions within the dataset. Since ranks are unaffected by outliers, this method is not sensitive to extreme values.
Standardizing factor data:

class Scale(object):
    def __init__(self, this_data, date):
        self.ini_data = this_data
        self.date = date

    def zscore(self):
        # Subtract the mean and divide by the standard deviation
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        return self.ini_data.sub(mean).div(std)

    def maxmin(self):
        # Rescale linearly to the [0, 1] interval
        min_ = self.ini_data.min()
        max_ = self.ini_data.max()
        return self.ini_data.sub(min_).div(max_ - min_)

    def normRank(self):
        # Rank the values; method='min' gives tied values the same
        # (minimum) rank rather than the average rank
        ranks = self.ini_data.rank(method='min')
        return ranks.div(ranks.max())
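In factor models, standardization is usually applied cross-sectionally, i.e. separately for each date. A minimal sketch under that assumption, using a hypothetical long-format table factor_df with 'date', 'token' and 'factor' columns (all names illustrative):

import pandas as pd

# Hypothetical long-format factor table: one row per (date, token)
factor_df = pd.DataFrame({
    'date': ['2023-01-01'] * 3 + ['2023-01-02'] * 3,
    'token': ['BTC', 'ETH', 'SOL'] * 2,
    'factor': [0.5, 0.2, 0.9, 0.4, 0.6, 0.1],
})

# Apply z-score standardization within each date's cross-section
scaled_parts = []
for date, group in factor_df.groupby('date'):
    scaled_parts.append(Scale(group['factor'], date).zscore())
factor_df['factor_z'] = pd.concat(scaled_parts)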
IV. Data Frequency
Sometimes the data we obtain is not at the frequency required by the analysis. For example, if the analysis is monthly but the raw data is daily, we need to "downsample", i.e. aggregate the daily data to a monthly frequency.
Downsampling
This refers to aggregating a set of rows into a single row, for example aggregating daily data into one monthly row. The characteristics of each aggregated indicator need to be considered, and the usual operations are (a sketch follows below):
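A minimal downsampling sketch with pandas resample; the column names and the choice of aggregation per column are assumptions for illustration:

import numpy as np
import pandas as pd

# Hypothetical daily data
idx = pd.date_range('2023-01-01', periods=90, freq='D')
daily = pd.DataFrame({
    'close': np.random.rand(90).cumsum(),
    'volume': np.random.randint(100, 1000, size=90),
}, index=idx)

# Aggregate to monthly: last close of each month, total volume over the month
monthly = daily.resample('M').agg({'close': 'last', 'volume': 'sum'})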
Upsampling
This refers to splitting one row of data into multiple rows, for example using annual data in a monthly analysis. Usually the annual value is simply repeated for each month; sometimes it must instead be allocated to the months in proportion. A sketch follows below.
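A minimal upsampling sketch, assuming a hypothetical annual series; here the annual value is simply repeated for every month via forward fill:

import pandas as pd

# Hypothetical annual data indexed at the start of each year
annual = pd.Series([100.0, 120.0],
                   index=pd.to_datetime(['2022-01-01', '2023-01-01']))

# Repeat each annual value for every month-start within the range
monthly = annual.resample('MS').ffill()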
Link to original article