Overview

The introductory guide to participating in the NestQuant Tournament

Introduction

The NestQuant Tournament revolves around building models that use raw data to make predictions for the cryptocurrency market. Models that perform exceptionally well in the competition can earn rewards in the form of cryptocurrency.

Guideline

  1. Sign up here.

  2. Download the dataset containing the raw data and labels from this page.

  3. Perform exploratory data analysis (EDA) on the raw data in order to generate the necessary features.

  4. Construct your model using the provided dataset and subsequently submit your predictions back to NestQuant.

  5. Stake tokens on your model to earn or burn them based on its performance: tokens are rewarded or deducted in accordance with the model's results in the evaluation process. (Currently not supported.)

  6. Automate the process of submitting your predictions to NestQuant, allowing your stake to grow over time; a minimal sketch of such a loop is shown below.
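
Since the submission API itself is documented elsewhere, the endpoint, authentication header, and payload below are hypothetical placeholders; this is only a sketch of what an automated submission loop could look like in Python.

import time

import requests

# Hypothetical endpoint and key: substitute the real values from your
# NestQuant account / the API documentation.
SUBMIT_URL = "https://api.nestquant.example/submissions"
API_KEY = "YOUR_API_KEY"

def submit_predictions(csv_path):
    # Upload a predictions file and fail loudly on HTTP errors.
    with open(csv_path, "rb") as f:
        response = requests.post(
            SUBMIT_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            timeout=30,
        )
    response.raise_for_status()

if __name__ == "__main__":
    # Re-run on your preferred schedule (cron, a CI job, ...); a bare loop here.
    while True:
        submit_predictions("predictions.csv")
        time.sleep(60 * 60 * 24)  # once per day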

Data

NestQuant provides free, non-obfuscated datasets because we believe data is most valuable when users understand it best: the deeper your understanding of the data, the more efficiently you can build models, and both users and NestQuant benefit as a result.

We provide data from three different financial markets: cryptocurrency, forex, and stocks. We also provide economic indicators such as US inflation and interest rates, the US unemployment rate, and US GDP and CPI, and all of this data is well documented. You can download it and start mining the data, which is the starting point for your AI modeling journey.
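
As a quick start, the snippet below loads one downloaded file with pandas and inspects it. The file name and parquet format are assumptions for illustration; use whichever file and format you actually downloaded.

import pandas as pd

# 'btc_ohlcv.parquet' is a placeholder name for a downloaded dataset file.
raw = pd.read_parquet('btc_ohlcv.parquet')

print(raw.shape)   # number of timeframes and columns
print(raw.dtypes)  # column types
print(raw.head())  # first few rows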

For more information, go to the Raw Data section.

Feature Engineering

Feature engineering is a crucial step in building machine learning and deep learning models. It involves selecting and transforming the input data to create a set of features that can be used to train a model.

Some notes on time series feature engineering

In time series modeling, feature engineering works differently. Time series data is sequential, and with the historical batch data that we provide, care must be taken to avoid data leakage: any leakage can produce a model that looks very efficient in the training phase but is worthless in the testing phase.

Therefore, users must be careful when performing feature engineering with libraries: some library functions treat all the data passed to them as known and perform computations directly on it. For instance, the pandas.DataFrame.rolling method has a closed parameter which, by default, equals None (equivalent to 'right'). By executing the following code...

import pandas as pd

df = pd.DataFrame([[1,1.2],[2,1.4],[3,1.3]], columns=['Timeframe','Price'])
#       Timeframe  Price
#    0          1    1.2
#    1          2    1.4
#    2          3    1.3
print(df['Price'].rolling(2).mean())

We will get the following result:

0     NaN
1    1.30
2    1.35
Name: Price, dtype: float64

This means that the average price computed by the pandas.DataFrame.rolling method with a window of 2 at timeframe 3 includes the current price and the price of the previous timeframe (timeframe 2), not the last two observed timeframes (timeframes 1 and 2). To average only the two previous observations, we must pass closed='left' to the rolling method.

print(df['Price'].rolling(2, closed='left').mean())

The result will now be:

0    NaN
1    NaN
2    1.3
Name: Price, dtype: float64

Other libraries may have default parameters like center=True in their methods. The calculation is then taken over a window whose midpoint sits on the current timeframe, so subsequent timeframes are included as well, and this creates a leaky calculation.
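
To make this concrete, pandas itself offers such an option: with center=True, the mean at timeframe 2 averages in the price of timeframe 3, i.e. a value from the future.

import pandas as pd

df = pd.DataFrame([[1,1.2],[2,1.4],[3,1.3]], columns=['Timeframe','Price'])

# center=True centers the window on the current row, so the mean at
# timeframe 2 uses timeframes 1, 2 AND 3 -- the future leaks in.
print(df['Price'].rolling(3, center=True).mean())
# 0     NaN
# 1    1.30   <- (1.2 + 1.4 + 1.3) / 3 includes timeframe 3
# 2     NaN
# Name: Price, dtype: float64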

Feature creation

Standard original financial data contains only 5 columns per timeframe, OPEN, HIGH, LOW, CLOSE and VOLUME, which do not capture some characteristics of time series data such as momentum or trends. Therefore, to obtain a better-performing model, data scientists often create additional features based on these 5 raw columns, typically derived from the technical and financial indicators used by traders.
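
As a minimal sketch, simple trend and momentum features can be built directly with pandas. The column names below assume the OPEN/HIGH/LOW/CLOSE/VOLUME layout just described, and the file name is a placeholder; note the closed='left' from the previous section to keep the rolling window leak-free.

import pandas as pd

ohlcv = pd.read_parquet('btc_ohlcv.parquet')  # placeholder file name

features = pd.DataFrame(index=ohlcv.index)
# One-step return; uses only the current and previous close, which is
# safe if the current close is already known when you predict.
features['return_1'] = ohlcv['CLOSE'].pct_change()
# Trend: moving average of the 14 previous closes (closed='left'
# excludes the current timeframe from the window).
features['sma_14'] = ohlcv['CLOSE'].rolling(14, closed='left').mean()
# Momentum: close relative to the close 14 timeframes earlier.
features['momentum_14'] = ohlcv['CLOSE'] / ohlcv['CLOSE'].shift(14) - 1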

For more involved indicators, we can use the TALIpp library to generate, for instance, the RSI indicator:

import random

from talipp.ohlcv import OHLCVFactory
from talipp.indicators import RSI

if __name__ == "__main__":

    # Random values stand in for real close prices in this illustration.
    close = random.sample(range(1, 10000), 1000)

    # Build an OHLCV container (open, high, low, close, volume columns);
    # indicators that need more than the close price take this as input.
    ohlcv = OHLCVFactory.from_matrix2([
        random.sample(range(1, 10000), 1000),
        random.sample(range(1, 10000), 1000),
        random.sample(range(1, 10000), 1000),
        random.sample(range(1, 10000), 1000),
        random.sample(range(1, 10000), 1000)]
    )

    # 14-period RSI over the close series; print the most recent value.
    print(f'RSI: {RSI(14, close)[-1]}')

Modeling

The objective of the model is to predict future outcomes using real-time features created from raw data that reflects the current cryptocurrency market.

This is a simple illustration in which we train a LightGBM model on features generated from historical raw data. Participants are free to use any programming language or framework of their choice to construct their models.

import pandas as pd
import lightgbm as lgb

# Training data contains features (extracted in your feature engineering step)
# and the label (created by NestQuant)
training_data = pd.read_parquet('training_data.parquet')

# Future data to predict contains only the features
# (extracted in your feature engineering step)
future_data = pd.read_parquet('future_data.parquet')

feature_names = [f for f in training_data.columns if "feature" in f]

# Model parameters
params = {'objective': 'regression',
          'metric': ['mse'],
          'boosting': 'gbdt',
          'learning_rate': 0.01,
          'max_depth': 1,
          'num_leaves': 2}

# Training step
train_data = lgb.Dataset(training_data[feature_names], training_data['label'])
model = lgb.train(params, train_data, num_boost_round=100)

# Predict the future data and save the results for submission
predictions = model.predict(future_data[feature_names])
pd.Series(predictions, name='prediction').to_csv('predictions.csv')
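
Before submitting, it is worth validating the model in a way that respects time order: a random train/test split would leak future information, for the same reasons discussed in the feature engineering section. Below is a minimal sketch using scikit-learn's TimeSeriesSplit, assuming training_data is sorted by time and reusing params and feature_names from above.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Every fold trains on an earlier slice and validates on a later one,
# so the model never trains on data from after its validation window.
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, valid_idx in tscv.split(training_data):
    fold_train = training_data.iloc[train_idx]
    fold_valid = training_data.iloc[valid_idx]
    fold_model = lgb.train(
        params,
        lgb.Dataset(fold_train[feature_names], fold_train['label']),
        num_boost_round=100,
    )
    preds = fold_model.predict(fold_valid[feature_names])
    scores.append(mean_squared_error(fold_valid['label'], preds))

print(f'MSE per fold: {scores}, mean: {np.mean(scores):.6f}')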

Diagnostics

In this section, there are two tools available.

  1. Users can assess model performance, estimate risk indicators, analyze impact, and evaluate how well predictions correlate with the actual data using the provided tools.

  2. The backtest engine calculates the return on investment that would be achieved by applying the model's results to a specific strategy. It allows users to simulate and evaluate the performance of their investment strategy based on the model's predictions; a toy illustration follows below.
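
The actual diagnostics tools and backtest engine are provided on the platform; the toy sketch below only illustrates both ideas, with assumed column names (one prediction per timeframe and the realized return over the following timeframe).

import numpy as np
import pandas as pd

# Assumed inputs for illustration: one prediction per timeframe and the
# realized return of the asset over the next timeframe.
df = pd.DataFrame({
    'prediction':      [0.2, -0.1, 0.4, -0.3],
    'realized_return': [0.01, -0.02, 0.015, 0.005],
})

# Correlation between predictions and outcomes: a basic skill measure.
print('corr:', df['prediction'].corr(df['realized_return']))

# Toy strategy: long when the prediction is positive, short otherwise.
position = np.sign(df['prediction'])
strategy_returns = position * df['realized_return']

# Compound the per-timeframe strategy returns into a total return.
print('cumulative return:', (1 + strategy_returns).prod() - 1)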
