Optiver Realized Volatility Prediction

Data Description

This dataset contains stock market data relevant to the practical execution of trades in the financial markets. In particular, it includes order book snapshots and executed trades. With one-second resolution, it provides a uniquely fine-grained look at the microstructure of modern financial markets.

This is a code competition where only the first few rows of the test set are available for download. The visible rows are intended to illustrate the hidden test set's format and folder structure; the remainder will only be available to your notebook when it is submitted. The hidden test set contains data that can be used to construct features to predict roughly 150,000 target values. Loading the entire dataset will take slightly more than 3 GB of memory, by our estimation.

This is also a forecasting competition: the final private leaderboard will be determined using data gathered after the training period closes, which means that the public and private leaderboards will have zero overlap. During the active training stage of the competition, a large fraction of the test data will be filler, intended only to ensure the hidden dataset has approximately the same size as the actual test data. The filler data will be removed entirely during the forecasting phase of the competition and replaced with real market data.

Files

book_[train/test].parquet - A parquet file partitioned by stock_id. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book will be more competitive in price terms, and will therefore receive execution priority over the second level.
- stock_id - ID code for the stock. Not all stock IDs exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8 (a loading sketch follows these file descriptions).
- time_id - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
- seconds_in_bucket - Number of seconds from the start of the bucket, always starting from 0.
- bid_price[1/2] - Normalized prices of the most/second most competitive buy level.
- ask_price[1/2] - Normalized prices of the most/second most competitive sell level.
- bid_size[1/2] - The number of shares on the most/second most competitive buy level.
- ask_size[1/2] - The number of shares on the most/second most competitive sell level.

trade_[train/test].parquet - A parquet file partitioned by stock_id. Contains data on trades that actually executed. In the market there are usually many more passive buy/sell intention updates (book updates) than actual trades, so one may expect this file to be sparser than the order book.
- stock_id - Same as above.
- time_id - Same as above.
- seconds_in_bucket - Same as above. Note that since trade and book data are taken from the same time window and trade data is sparser in general, this field does not necessarily start from 0.
- price - The average price of executed transactions happening in one second. Prices have been normalized, and the average has been weighted by the number of shares traded in each transaction.
- size - The total number of shares traded.
- order_count - The number of unique trade orders taking place.
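As the description notes, these parquet files are partitioned by stock_id and that column loads as categorical. Below is a minimal loading sketch, assuming the Kaggle folder layout (book_train.parquet/stock_id=0/...) and the pyarrow engine; the path is illustrative, not from this notebook:

import pandas as pd

# Load the book snapshots for a single stock. Filtering on the partition
# column reads only that partition instead of the full multi-GB dataset.
book = pd.read_parquet(
    "book_train.parquet",                # illustrative path to the partitioned file
    filters=[("stock_id", "=", 0)],
)

# The partition column comes back as categorical; convert it to int8,
# as the data description suggests.
book["stock_id"] = book["stock_id"].astype("int8")

Reading a single partition directory directly (pd.read_parquet("book_train.parquet/stock_id=0")) also works, but the resulting frame then has no stock_id column at all, since that value lives only in the path.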
train.csv - The ground truth values for the training set.
- stock_id - Same as above, but since this is a csv the column will load as an integer rather than categorical.
- time_id - Same as above.
- target - The realized volatility computed over the 10-minute window following the feature data under the same stock/time_id. There is no overlap between feature and target data. You can find more info in our tutorial notebook; a sketch of the computation also appears at the end of this notebook.

test.csv - Provides the mapping between the other data files and the submission file. As with the other test files, most of the data is only available to your notebook upon submission, with just the first few rows available for download.
- stock_id - Same as above.
- time_id - Same as above.
- row_id - Unique identifier for the submission row. There is one row for each existing time ID/stock ID pair; each time window does not necessarily contain every individual stock.

sample_submission.csv - A sample submission file in the correct format.

In [202]:
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [203]:
# Importing the training set
Optiver_train = pd.read_csv("C:/Users/HP/Desktop/Upgrad Case Study/Optiver Realized Volatility Prediction/train.csv")
Optiver_train.head()

Out[203]:
   stock_id  time_id    target
0         0        5  0.004136
1         0       11  0.001445
2         0       16  0.002168
3         0       31  0.002195
4         0       62  0.001747

In [204]:
# Importing the test set
Optiver_test = pd.read_csv("C:/Users/HP/Desktop/Upgrad Case Study/Optiver Realized Volatility Prediction/test.csv")
Optiver_test.head()

Out[204]:
   stock_id  time_id row_id
0         0        4    0-4
1         0       32   0-32
2         0       34   0-34

In [205]:
Optiver_train.dtypes

Out[205]:
stock_id      int64
time_id       int64
target      float64
dtype: object

In [206]:
Optiver_test.dtypes

Out[206]:
stock_id     int64
time_id      int64
row_id      object
dtype: object

Inspecting the Null Values

In [207]:
Optiver_train.isnull().sum()

Out[207]:
stock_id    0
time_id     0
target      0
dtype: int64

In [208]:
Optiver_test.isnull().sum()

Out[208]:
stock_id    0
time_id     0
row_id      0
dtype: int64

Rescaling the Features

We will use MinMax scaling.

In [210]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Apply the scaler to the numeric columns
num_vars = ["stock_id", "time_id", "target"]
Optiver_train[num_vars] = scaler.fit_transform(Optiver_train[num_vars])
Optiver_train.head()

Out[210]:
   stock_id   time_id    target
0       0.0  0.000000  0.057402
1       0.0  0.000183  0.019075
2       0.0  0.000336  0.029380
3       0.0  0.000794  0.029766
4       0.0  0.001740  0.023385

Checking for Outliers

In [211]:
# Checking for outliers in the continuous variables
num_Optiver_train = Optiver_train[["stock_id", "time_id", "target"]]

In [212]:
# Checking outliers at the 25%, 50%, 75%, 90%, 95% and 99% percentiles
num_Optiver_train.describe(percentiles=[.25, .5, .75, .90, .95, .99])

Out[212]:
            stock_id        time_id         target
count  428932.000000  428932.000000  428932.000000
mean        0.495539       0.489408       0.053765
std         0.294654       0.285853       0.041817
min         0.000000       0.000000       0.000000
25%         0.238095       0.239576       0.027359
50%         0.500000       0.483731       0.041910
75%         0.761905       0.732220       0.065974
90%         0.896825       0.892342       0.101616
95%         0.952381       0.947866       0.133132
99%         0.992063       0.990813       0.215036
max         1.000000       1.000000       1.000000

In [213]:
import seaborn as sns
sns.boxplot(x=Optiver_train.stock_id)

Out[213]: [boxplot of the scaled stock_id]

In [214]:
sns.boxplot(x=Optiver_train.time_id)

Out[214]: [boxplot of the scaled time_id]

In [215]:
sns.boxplot(x=Optiver_train.target)

Out[215]: [boxplot of the scaled target]

In [216]:
# Removing (statistical) outliers falling outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
Q1 = Optiver_train.stock_id.quantile(0.25)
Q3 = Optiver_train.stock_id.quantile(0.75)
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.stock_id >= Q1 - 1.5*IQR) & (Optiver_train.stock_id <= Q3 + 1.5*IQR)]

Q1 = Optiver_train.time_id.quantile(0.25)
Q3 = Optiver_train.time_id.quantile(0.75)
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.time_id >= Q1 - 1.5*IQR) & (Optiver_train.time_id <= Q3 + 1.5*IQR)]

Q1 = Optiver_train.target.quantile(0.25)
Q3 = Optiver_train.target.quantile(0.75)
IQR = Q3 - Q1
Optiver_train = Optiver_train[(Optiver_train.target >= Q1 - 1.5*IQR) & (Optiver_train.target <= Q3 + 1.5*IQR)]
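The target described above is the realized volatility of the 10-minute window that follows the feature data. As a reference for later feature building, here is a minimal sketch of that computation from one stock's book snapshots. It assumes the weighted-average-price (WAP) definition from the competition's tutorial notebook and the book columns described earlier (bid_price1, ask_price1, bid_size1, ask_size1); the book frame is assumed to be loaded as in the earlier sketch.

import numpy as np
import pandas as pd

def realized_volatility(book: pd.DataFrame) -> pd.Series:
    """Realized volatility per time_id from top-of-book snapshots."""
    book = book.sort_values(["time_id", "seconds_in_bucket"]).copy()

    # Weighted average price: a mid-price weighted by opposite-side sizes
    book["wap"] = (
        book["bid_price1"] * book["ask_size1"]
        + book["ask_price1"] * book["bid_size1"]
    ) / (book["bid_size1"] + book["ask_size1"])

    # Log returns of the WAP, computed within each time bucket
    book["log_return"] = np.log(book["wap"]).groupby(book["time_id"]).diff()

    # Realized volatility = sqrt of the sum of squared log returns per bucket
    # (nansum skips the NaN produced by diff() at the start of each bucket)
    return book.groupby("time_id")["log_return"].apply(
        lambda r: np.sqrt(np.nansum(np.square(r)))
    )

vol_by_bucket = realized_volatility(book)

The same statistic computed over the feature window, as here, is a common baseline predictor for the target, which Optiver computed over the following, non-overlapping 10-minute window.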