LUCIDA: How to Build a Robust Crypto Asset Portfolio with Multi-Factor Strategies (Data Preprocessing)
Preamble
In the previous installment we published the first article in this series, "Building a Robust Crypto Asset Portfolio with Multi-Factor Strategies: Theoretical Fundamentals"; this is the second article, on data preprocessing.
The data needs to be processed before and after the factor values are calculated, and before the validity of any single factor is tested. Specifically, preprocessing covers the handling of duplicate values, outliers/missing values/extreme values, standardization, and data frequency.
I. Duplicate values
Data-related definitions:
Diagnosing duplicate values starts with understanding what the data "should" look like. The data usually takes one of the following forms:
Principle: Once you have determined the index (key) of the data, you can know at what level the data should have no duplicate values.
Check methods (a usage sketch follows below):
pd.DataFrame.duplicated(subset=[key1, key2, ...])
pd.merge(df1, df2, on=[key1, key2, ...], indicator=True, validate='1:1')
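A minimal sketch of both checks, assuming hypothetical tables keyed by (token, date); all column names and values here are illustrative only:

import pandas as pd

# Hypothetical factor table: one row per (token, date) is expected
df = pd.DataFrame({
    'token': ['BTC', 'BTC', 'ETH', 'ETH', 'ETH'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-02'],
    'factor': [0.5, 0.7, 0.2, 0.9, 0.9],
})

# Rows that repeat the (token, date) key are duplicates
dupes = df[df.duplicated(subset=['token', 'date'], keep=False)]

# Hypothetical price table with unique keys
df2 = pd.DataFrame({
    'token': ['BTC', 'BTC', 'ETH', 'ETH'],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'price': [42000, 42500, 2500, 2550],
})

# validate='1:1' raises MergeError if either side has duplicate keys
merged = pd.merge(df.drop_duplicates(subset=['token', 'date']), df2,
                  on=['token', 'date'], indicator=True, validate='1:1')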
II. Outliers/Missing Values/Extreme Values
Common causes of outliers:
Principles for handling outliers and missing values:
* Delete. Outliers that cannot be reasonably corrected can be considered for deletion.
* Replace. Often used for extreme values, e.g. Winsorizing or taking logarithms (the latter is less common).
* Fill. Missing values can be filled in a reasonable way; common methods include the mean (or a moving average), interpolation, filling with 0 via df.fillna(0), and forward/backward filling via df.ffill() / df.bfill(). Always check whether the assumption behind the chosen fill method holds; a short sketch follows the note below.
Machine-learning imputation should be used with caution, as backfilling in this way risks introducing look-ahead bias.
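A minimal sketch of the fill methods listed above, assuming a hypothetical daily factor Series with gaps (names and values are illustrative):

import numpy as np
import pandas as pd

# Hypothetical daily factor series with missing values
idx = pd.date_range('2023-01-01', periods=6, freq='D')
factor = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)

filled_mean = factor.fillna(factor.mean())   # fill with the overall mean
filled_interp = factor.interpolate()         # linear interpolation
filled_zero = factor.fillna(0)               # fill with 0
filled_ffill = factor.ffill()                # carry the last observation forward
filled_bfill = factor.bfill()                # uses future values: beware look-ahead bias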
Handling of extreme values:
1. Percentile method
Sort the data from smallest to largest and replace the values beyond the lower and upper percentile thresholds with the threshold values. For data with abundant history this method is relatively crude and may not be appropriate, since forcing a fixed proportion of observations to the thresholds can discard useful information.
2. 3σ / Three-standard-deviation method
Make the following adjustment to every factor in the data range: clip values to the interval [mean - 3σ, mean + 3σ], i.e. observations more than three standard deviations from the mean are replaced by the corresponding boundary value.
The drawback of this method is that data commonly used in quantitative finance, such as stock prices and token prices, typically exhibit sharp peaks and fat tails and do not satisfy the normality assumption; in that case the 3σ method will incorrectly flag a large amount of data as outliers.
3. Median Absolute Deviation (MAD) method
This method is based on the median and the absolute deviation from it, which makes the processed data less sensitive to extreme values or outliers. It is more robust than methods based on the mean and standard deviation.
Handling extreme values in factor data:

import numpy as np

class Extreme(object):
    def __init__(self, this_data):
        self.ini_data = this_data

    def three_sigma(self, n=3):
        # Clip values to [mean - n*std, mean + n*std]
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        low = mean - n * std
        high = mean + n * std
        return np.clip(self.ini_data, low, high)

    def mad(self, n=3):
        # Clip values to [median - n*MAD, median + n*MAD]
        median = self.ini_data.median()
        mad_median = abs(self.ini_data - median).median()
        high = median + n * mad_median
        low = median - n * mad_median
        return np.clip(self.ini_data, low, high)

    def quantile(self, l=0.025, h=0.975):
        # Clip values to the [l, h] quantile range
        low = self.ini_data.quantile(l)
        high = self.ini_data.quantile(h)
        return np.clip(self.ini_data, low, high)
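A minimal usage sketch, assuming a hypothetical pandas Series of raw factor values (the values and variable names are illustrative only):

import pandas as pd

# Hypothetical factor values with one obvious outlier
factor_values = pd.Series([0.10, 0.20, 0.15, 0.18, 5.00])

ext = Extreme(factor_values)
clipped_sigma = ext.three_sigma(n=3)                # 3σ clipping
clipped_mad = ext.mad(n=3)                          # MAD clipping
clipped_quantile = ext.quantile(l=0.025, h=0.975)   # percentile clipping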
III. Standardization
2. Min-Max Scaling
Converting each factor to the (0, 1) interval allows data of different magnitudes or ranges to be compared. It does not change the shape of the distribution within the data, nor does it make the values sum to 1.
3. Rank Scaling (rank percentile)
Convert the data to its ranks, and then convert the ranks into scores between 0 and 1, typically their percentile positions within the dataset. Since ranks are unaffected by outliers, this method is not sensitive to extreme values.
Standardizing factor data:

class Scale(object):
    def __init__(self, this_data, date):
        self.ini_data = this_data
        self.date = date

    def zscore(self):
        # Subtract the mean and divide by the standard deviation
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        return self.ini_data.sub(mean).div(std)

    def maxmin(self):
        # Rescale linearly to the [0, 1] interval
        min_ = self.ini_data.min()
        max_ = self.ini_data.max()
        return self.ini_data.sub(min_).div(max_ - min_)

    def normRank(self):
        # Rank the values; method='min' gives tied values the same
        # (minimum) rank rather than the average rank
        ranks = self.ini_data.rank(method='min')
        return ranks.div(ranks.max())
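In factor models, standardization is usually applied cross-sectionally, i.e. separately for each date. A minimal sketch under that assumption, using a hypothetical long-format table factor_df with 'date', 'token' and 'factor' columns (all names illustrative):

import pandas as pd

# Hypothetical long-format factor table: one row per (date, token)
factor_df = pd.DataFrame({
    'date': ['2023-01-01'] * 3 + ['2023-01-02'] * 3,
    'token': ['BTC', 'ETH', 'SOL'] * 2,
    'factor': [0.5, 0.2, 0.9, 0.4, 0.6, 0.1],
})

# Apply z-score standardization within each date's cross-section
scaled_parts = []
for date, group in factor_df.groupby('date'):
    scaled_parts.append(Scale(group['factor'], date).zscore())
factor_df['factor_z'] = pd.concat(scaled_parts)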
IV. Data Frequency
Sometimes the data we obtain is not at the frequency required by the analysis. For example, if the analysis is monthly but the raw data is daily, we need to "downsample", i.e. aggregate the daily data to a monthly frequency.
Downsampling
This refers to aggregating a set of rows into a single row, for example aggregating daily data into one monthly row. The characteristics of each aggregated indicator need to be considered, and the usual operations are (a sketch follows below):
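A minimal downsampling sketch with pandas resample; the column names and the choice of aggregation per column are assumptions for illustration:

import numpy as np
import pandas as pd

# Hypothetical daily data
idx = pd.date_range('2023-01-01', periods=90, freq='D')
daily = pd.DataFrame({
    'close': np.random.rand(90).cumsum(),
    'volume': np.random.randint(100, 1000, size=90),
}, index=idx)

# Aggregate to monthly: last close of each month, total volume over the month
monthly = daily.resample('M').agg({'close': 'last', 'volume': 'sum'})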
Upsampling
This refers to splitting one row of data into multiple rows, for example using annual data in a monthly analysis. Usually the annual value is simply repeated for each month; sometimes it must instead be allocated to the months in proportion. A sketch follows below.
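A minimal upsampling sketch, assuming a hypothetical annual series; here the annual value is simply repeated for every month via forward fill:

import pandas as pd

# Hypothetical annual data indexed at the start of each year
annual = pd.Series([100.0, 120.0],
                   index=pd.to_datetime(['2022-01-01', '2023-01-01']))

# Repeat each annual value for every month-start within the range
monthly = annual.resample('MS').ffill()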
Link to original article