Date: October 13, 2021
Table of Contents
- Project Planning
- Executive Summary
- Acquire Data
- Prepare Data
- Data Exploration
✓ 🟢 Plan ➜ ☐ Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver
- For this project we will be working with historical Bitcoin price and volume data from 01-01-2012 to 03-31-2021. These are Bitstamp prices, and all are denominated in USD.
- The primary focus is to see whether Bitcoin price can be predicted with any reliability, and whether there are any cyclical patterns in Bitcoin pricing or volume.
- The csv data can be downloaded from Kaggle here.
- Create models that are better at predicting Bitcoin price than the baseline.
- Put these models into a Jupyter notebook and make the project replicable.
- Data science professionals as well as any curious cat.
- A clearly named final notebook. This notebook will be what you present and should contain plenty of markdown documentation and cleaned up code.
- A README that explains what the project is, how to reproduce your work, and your notes from project planning.
- A Python module or modules that automate the data acquisition and preparation process. These modules should be imported and used in your final notebook.
- This project uses machine learning modeling to predict the average price of Bitcoin better than the baseline.
- Abstract the functions into separate Python scripts to keep the presentation clean, and document them thoroughly.
- The dataset had quite a few null values, but utilizing a forward fill method offered an easy and effective remedy. Otherwise, this dataset was quite clean.
- Another key aspect of predicting Bitcoin’s price was the percent change within a time interval.
- The best performing model was Holt’s Linear Trend, fine-tuned for the dataset via its slope and level smoothing parameters, with an RMSE of 53.
✓ Plan ➜ 🟢 Acquire ➜ ☐ Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver
|Feature|Non-null count: dtype|Definition|
|---|---|---|
|Timestamp|4857377 non-null: datetime64[ns]|Start time of time window (60s window), in Unix Time|
|Open|3613769 non-null: float64|Open price at start time window|
|High|3613769 non-null: float64|High price within the time window|
|Low|3613769 non-null: float64|Low price within the time window|
|Close|3613769 non-null: float64|Close price at the end of the time window|
|Volume_(BTC)|3613769 non-null: float64|Volume of BTC transacted in this window|
|Volume_(Currency)|3613769 non-null: float64|Volume of corresponding currency transacted in this window|
|Weighted_Price|3613769 non-null: float64|VWAP - Volume Weighted Average Price|
Takeaways from Acquire:
- Target variable:
- This dataframe currently has 4,857,377 rows and 8 columns.
- There are 1,243,608 missing values in each of the seven price and volume columns; Timestamp has none.
- All columns other than Timestamp are float64 types of data.
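The row, column, and null counts above can be checked with a short summary helper. This is a minimal sketch, assuming the Kaggle CSV has already been loaded into a pandas DataFrame; `summarize_nulls` is a hypothetical helper name, and the three rows below are made-up stand-in data rather than values from the real dataset.

```python
import numpy as np
import pandas as pd


def summarize_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and dtype of every column."""
    return pd.DataFrame({
        "null_count": df.isna().sum(),
        "dtype": df.dtypes.astype(str),
    })


# Made-up stand-in rows; the real data comes from the Kaggle CSV.
sample = pd.DataFrame({
    "Timestamp": [1325317920, 1325317980, 1325318040],  # Unix seconds
    "Open": [4.39, np.nan, 4.40],
    "Close": [4.39, np.nan, 4.41],
})

# Convert the Unix-seconds column to a datetime column.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], unit="s")

summary = summarize_nulls(sample)
print(sample.shape)
print(summary)
```

Run against the full dataset, the same helper should report the per-column null counts listed in the data dictionary.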
✓ Plan ➜ ✓ Acquire ➜ 🟢 Prepare ➜ ☐ Explore ➜ ☐ Model ➜ ☐ Deliver
- Add additional columns of
- Filling the null values with the most recent value will likely be the best course of action.
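Filling nulls with the most recent value is a forward fill. The sketch below shows the idea on a made-up five-minute sample. The column names follow the data dictionary, but the values and the minute index are illustrative assumptions, not the project's actual prepare code.

```python
import numpy as np
import pandas as pd

# Made-up minute bars with gaps; the real data comes from the Kaggle CSV.
index = pd.date_range("2012-01-01 00:00", periods=5, freq="min")
prices = pd.DataFrame(
    {
        "Open": [4.39, np.nan, np.nan, 4.41, np.nan],
        "Close": [4.40, np.nan, np.nan, 4.42, np.nan],
    },
    index=index,
)

# Forward fill: each missing row takes the most recent observed value.
filled = prices.ffill()

print(filled)
```

Forward fill only looks backward in time, so it does not leak future prices into earlier rows. Any nulls that occur before the first observation would remain null.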
New Data Dictionary
|Feature|Non-null count: dtype|Definition|
|---|---|---|
|Open|4857377 non-null: float64|Open price at start time window|
|High|4857377 non-null: float64|High price within the time window|
|Low|4857377 non-null: float64|Low price within the time window|
|Close|4857377 non-null: float64|Close price at the end of the time window|
|Volume_(BTC)|4857377 non-null: float64|Volume of BTC transacted in this window|
|Volume_(Currency)|4857377 non-null: float64|Volume of corresponding currency transacted in this window|
|Weighted_Price|4857377 non-null: float64|VWAP - Volume Weighted Average Price|
|day_of_week|4857377 non-null: object|Verbose name of the day of the week|
|day_of_week_num|4857377 non-null: int64|Number representing the day of the week|
|month|4857377 non-null: object|Month number and month name|
|month_num|4857377 non-null: int64|Number representing the month of the year|
|price_diff|4857377 non-null: float64|Delta between the Close and Open (Close - Open)|
|price_delta|4857377 non-null: float64|Delta between the High and Low (High - Low)|
|day_num|4857377 non-null: int64|Numeric day of the month|
|avg_price|4857377 non-null: float64|Avg price for the time period ([Open + Close] / 2)|
|percent_change|4857377 non-null: float64|Price difference / Open price represented as a percentage (price_diff / Open)|
- The data is now prepared for the explore stage of the pipeline, where we evaluate which features to use for time-series analysis.
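The derived columns in the new data dictionary follow directly from their definitions. The sketch below is one way they could be computed, not the project's actual `prepare.py`. `add_features` is a hypothetical name, a DatetimeIndex is assumed, Monday is assumed to be day 0, and `percent_change` is assumed to be scaled by 100. The `month` column is left out because its exact format is not specified here.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add derived columns; assumes a DatetimeIndex and OHLC columns."""
    out = df.copy()
    out["day_of_week"] = out.index.day_name()
    out["day_of_week_num"] = out.index.dayofweek  # assumption: Monday = 0
    out["month_num"] = out.index.month
    out["day_num"] = out.index.day
    out["price_diff"] = out["Close"] - out["Open"]
    out["price_delta"] = out["High"] - out["Low"]
    out["avg_price"] = (out["Open"] + out["Close"]) / 2
    # Assumption: expressed in percent, so the ratio is multiplied by 100.
    out["percent_change"] = out["price_diff"] / out["Open"] * 100
    return out


# Made-up stand-in rows, not values from the real dataset.
bars = pd.DataFrame(
    {
        "Open": [100.0, 200.0],
        "High": [110.0, 220.0],
        "Low": [90.0, 190.0],
        "Close": [105.0, 190.0],
    },
    index=pd.to_datetime(["2021-03-29 00:00", "2021-03-31 00:01"]),
)

features = add_features(bars)
print(features.T)
```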
✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ 🟢 Explore ➜ ☐ Model ➜ ☐ Deliver
Bitcoin Price over Time
Hour vs Percent Change on a monthly basis
Correlations of Average Price
- It is apparent that the most correlated features with
- It also appears that the avg_price tends to trend upward during October and downward in September.
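Both observations above come from simple aggregations. As a hedged sketch, the correlation ranking and the month-by-month comparison could be computed as follows. The column names follow the data dictionary, but every value is made up for illustration and says nothing about the real dataset.

```python
import pandas as pd

# Made-up stand-in values; real inputs would be the prepared DataFrame.
df = pd.DataFrame({
    "avg_price": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
    "Volume_(BTC)": [5.0, 4.0, 6.0, 3.0, 2.0, 1.0],
    "price_delta": [1.0, 1.5, 1.2, 2.0, 1.8, 2.5],
    "percent_change": [-0.2, -0.1, -0.3, 0.4, 0.2, 0.3],
    "month_num": [9, 9, 9, 10, 10, 10],
})

# Pearson correlation of every other numeric column with avg_price.
raw = df.corr()["avg_price"].drop("avg_price")

# Rank by absolute strength while keeping each correlation's sign.
correlations = raw.reindex(raw.abs().sort_values(ascending=False).index)

# Mean percent change per calendar month, one way to compare months.
monthly_change = df.groupby("month_num")["percent_change"].mean()

print(correlations)
print(monthly_change)
```

Ranking by absolute value keeps strongly negative relationships near the top, which a plain descending sort would push to the bottom.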
✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ 🟢 Model ➜ ☐ Deliver
Last Observed Value
Holt’s Linear Trend
Previous Cycle (6 months)
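To make the three approaches concrete, here is a minimal pure-Python sketch of each. These are written out by hand for illustration and are not necessarily how the notebook fits its models. The function names, the initialisation of Holt's level and slope, and the simple "repeat the last cycle" reading of Previous Cycle are all assumptions.

```python
from typing import List, Sequence


def last_observed_value(train: Sequence[float], horizon: int) -> List[float]:
    """Baseline: repeat the final training value for every future step."""
    return [train[-1]] * horizon


def holt_linear_trend(
    train: Sequence[float], horizon: int, alpha: float, beta: float
) -> List[float]:
    """Holt's linear trend: alpha smooths the level, beta smooths the slope."""
    # Assumed initialisation: first value as level, first difference as slope.
    level = train[0]
    slope = train[1] - train[0]
    for y in train[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + slope)
        slope = beta * (level - prev_level) + (1 - beta) * slope
    # The h-step-ahead forecast extends the last level along the last slope.
    return [level + h * slope for h in range(1, horizon + 1)]


def previous_cycle(train: Sequence[float], horizon: int, cycle: int) -> List[float]:
    """Simplified previous-cycle forecast: replay the most recent cycle."""
    last_cycle = list(train[-cycle:])
    return [last_cycle[h % cycle] for h in range(horizon)]


# Made-up, perfectly linear series used only to show the call pattern.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
print(last_observed_value(train, 3))
print(holt_linear_trend(train, 3, alpha=0.5, beta=0.5))
print(previous_cycle(train, 3, cycle=2))
```

Tuning Holt's model amounts to searching over `alpha` and `beta` and keeping the pair with the lowest error on held-out data.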
✓ Plan ➜ ✓ Acquire ➜ ✓ Prepare ➜ ✓ Explore ➜ ✓ Model ➜ 🟢 Deliver
Root Mean Square Deviation Results
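The metric used to compare the models is the root mean square error: the square root of the mean squared difference between actual and predicted values. A minimal sketch follows, with `rmse` as a hypothetical helper name and made-up inputs.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: every prediction is off by exactly 2.
print(rmse([10.0, 12.0, 14.0, 16.0], [12.0, 14.0, 16.0, 18.0]))
```

RMSE is expressed in the same units as the target, so for this project it reads as a typical prediction error in USD.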
Conclusions & Next Steps
- I found that there were large variations in certain models’ RMSE, but the models that could be tuned (such as Holt’s and Previous Cycle) performed best after optimization.
- If I had more time to work on this project, I’d continue doing more performance tuning and possibly aggregating other external data factors that could provide a higher level of correlation or model accuracy.
- Download the CSV data from Kaggle here.
- Download the
prepare.py files and then run the final notebook.