import itertools
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import fix_yahoo_finance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import portfolioopt
import copy
Before we even begin to explain the problem at hand, it is essential that we first understand some basic finance terminology. The following points give a brief outline of the concepts we are interested in, and they should serve as a solid foundation for understanding the rest of the analysis to come.
Provided you have some understanding of these terms, you should be more than comfortable continuing through this project and being guided through our analysis and simulation process.
Perhaps shocked by the low interest rates offered by the banks, a client has approached us with $10 million and the simple goal of investing their money into a stock portfolio, spreading their wealth across an array of companies in the hope that its value will grow. The only caveat is that our client stipulated that their money must only go into companies listed within the S&P 500 stock index. Their questions are simple: given this restriction, which stocks should they invest in, and how much money should be put into each stock?
Simple enough, right? But our problem doesn't end there; we still have two more (perhaps obvious to some) factors we must keep in mind when pushing forward. The portfolio as a whole must remain relatively low risk; there's no point in us advising our client to put their money into a portfolio which could well lose almost all of its value within a matter of weeks. Risk is of course quite a subjective measure; for some people the rush of taking a heavy risk in the hope of a larger reward might just be a worthwhile endeavour. Greed can be a powerful ally, but we will instead err on the side of caution, assuming our client takes a more conservative approach when it comes to investing their hard earned pennies.
This is of course a double edged sword we're dealing with; we can't be too relaxed with our portfolio because, in most cases, opting for a lack of risk comes with a severe lack of reward. And if the rewards aren't sufficient then our client might as well have put their money in a bank to begin with, skipping this whole process. As a result, our goal is to best balance this risk & return trade-off, something which can be achieved by maximising the previously mentioned Sharpe ratio.
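For reference, the Sharpe ratio is conventionally defined as a portfolio's excess return per unit of risk,

$$ S \;=\; \frac{E[R_p] - R_f}{\sigma_p}, $$

where $E[R_p]$ is the portfolio's expected return, $R_f$ the risk-free rate and $\sigma_p$ the portfolio's volatility; later in this report we compute it with the risk-free rate taken as zero.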
Before we begin jumping into our analysis, or even forming a plan to deal with this problem, it's imperative that we understand what has been done before us. For as long as markets have existed, people have been trying to figure out strategies which best balance their portfolios; generations of investors before us have developed tried and tested methods of doing just that, and we aren't intending to reinvent the finance wheel.
Modern-day finance models are built on an array of assumptions, one of the base assumptions being that the market being invested in is efficient[1]. What this assumption essentially says is that investors will view information in the same way across the board and will make similarly pragmatic decisions when investing in the market. This, however, is far from the truth: in actual fact many investors are irrational and quite often make arbitrary decisions when it comes to buying or selling one or more stocks. We all view information in different ways, and the conclusion arrived at by one investor given some piece of information is not generally the same for the entire market[2]. These general assumptions surrounding the market and investors are without a doubt a main contributing factor to the variability in the accuracy of these models; although the models are still very useful, these are major flaws which should be kept in mind.
Portfolio theory first came to light in the 1950s when Modern Portfolio Theory (MPT) was introduced through contributions from Harry Markowitz and William Sharpe[3]. A staple in the field, MPT laid the foundations and is the basis for the portfolio theory seen today. MPT was introduced as a way for investors to weight their assets across particular stocks in order to maximise their expected return for some assumed level of risk. Years later, in the early 1990s, Post-Modern Portfolio Theory (PMPT) was introduced; in actuality it is an extension of MPT which relies on the same assumptions but provides a different perspective on one of the main ideas used within MPT[4]. PMPT is still a way for investors to optimally weight their assets in order to achieve an expected return based on some level of assumed risk, but the key difference between the two is in how the risk is calculated. Rather than associating risk with both positive and negative returns as done in MPT, i.e. making money and losing money both carry a degree of risk, PMPT instead defines risk as a value purely based on negative returns, meaning that risk is only associated with losing money. It seems like a sensible change, as people are generally only worried about losing money when investing, although it all depends on how you wish to define your risk as to which model you would use.
In more recent years, due to an exponential increase in computational affordability, access and power, online portfolio selection (OPS) has been developed; it is the modern-day culmination of the past 60 years of research in quantitative finance[5]. Unlike its predecessors, OPS doesn't have a single way of defining risk and optimally weighting assets. Instead, using machine learning and simulation techniques, it draws on an assortment of algorithms that encompass numerous different ways of optimally weighting assets based on actual stock performance and historical trends[6].
Through our own analysis we’ve tried to include these traditional finance ideas while also combining them with the computational power at our disposal, all in an attempt to best balance our own portfolio.
To explain the process in which we went about solving this problem we've presented the report using the following structure:

- Introduction: the background and finance concepts the report is based on
- Analysis: data collection, wrangling, cleaning and exploration
- Basic portfolio: a proof of concept to verify our calculations
- Simulation & optimisation: extending the basic portfolio in search of the optimal one
- Evaluation: Value-at-Risk analysis and the portfolio's performance over the first 5 months of 2018
- Conclusion: tying the results back to the client's question and suggesting future improvements
This structure hopes to guide you from the very beginning all the way to the very end of our process, showcasing our thought process and all the steps we've taken along the way. The introduction lays out the foundation of information on which the report is based. The analysis documents our struggle to get our hands on the data, and then presents the wrangling, cleaning and exploratory process which followed. We then implemented quite a basic portfolio to make sure that our calculations were correct; a proof of concept of sorts. With that being a success, our simulation could then be extended to a much larger portfolio, in the hope that by tuning the parameters we could work towards finding the most optimal portfolio. With our optimal portfolio found, we then pushed forward to evaluating its performance using a risk management technique (Value-at-Risk), and simply demonstrating its prospective real-life change in value through the first 5 months of 2018. Finally, we tie our whole process back to the question at hand by judging how happy we believe our client would be with this portfolio, and offer some future improvements to make this whole analysis and simulation a more useful and accurate process.
Before we even began querying an API for stock price data, we needed to know which companies are listed within the S&P 500 index, as its component companies are what we're required to build our portfolio from. Thankfully, it wasn't hard to find a table hosted by Wikipedia which contains all the information we need and then some. Most importantly, the table contains a column holding each company's ticker symbol; this identifies each company uniquely across the stock exchange and is the vital piece of information we needed to begin this process.
At first we managed to scrape each individual column using a Python package called BeautifulSoup, however with a little digging we realised it was significantly easier to automatically parse this HTML table through a prewritten pandas function which took care of all the heavy lifting for us. As you can see, not only were we left with the column of ticker symbols, which was vital, but we also had a lot of extra, useful information that we could use to group and filter companies: GICS Sector, GICS Sub Industry, the date the company was first added to the S&P 500 index, and so on.
# Contacts the Wikipedia page listing the S&P 500 companies. Gets the table which lists each company,
# their sector, industry, date added to the S&P 500 and date founded. Pandas then converts this HTML table
# into a pandas data frame which is ready to be used
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
wiki_data = pd.read_html(str(table), header = 0)
wiki_data[0].head()
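As an aside, pandas can also fetch and parse the page in a single call, skipping the explicit requests/BeautifulSoup step. A minimal sketch, assuming the constituents table is still the first table on the page:
# one call: let pandas download the URL and parse the first HTML table on the page directly
sp500_table = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", header=0)[0]
sp500_table.head()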
Interestingly enough, when we explored this table we noticed that there weren't 500 rows/companies as expected, but there were instead 505 rows/companies listed within the table. We found that this is because five companies listed in the index have two different classes of equity, therefore doubling their listings on the index and adding an extra five entries to the table. Although subtle, it's an important fact to notice as these stocks are almost guaranteed to have high correlation - possibly skewing the results if we analyse the index as a whole.
# download stocks if you want from here (sometimes NaN values differ)
# this however may take a few minutes, and may result in some slight data differences
# stocks = yf.download(list(wiki_data[0]['Ticker symbol']), start="2000-01-03", end="2017-12-29")
# save your downloaded file to a destination of your choosing (if you decide to download at all)
# stocks.to_hdf("/Users/troyengelhardt/Desktop/data/stocks00-17.h5", "stocks")
# so our analysis is repeatable we've saved OUR stock data to a .h5 file
# we will upload/attach a link to download so you can use the same data
# and follow our analysis EXACTLY, at your leisure
stocks = pd.read_hdf("/Users/troyengelhardt/Desktop/data/stocks00-17.h5")
We had all our stock tickers now, but the next hurdle we had to overcome in this process was actually gathering some company specific information; 505 sets of stock price information to be exact. Our first thought was to use Quandl, a service which we know to have a simple Python based API through which we could query data using our set of stock tickers. However, while the popular large stocks (TSLA, AAPL, MSFT, etc) were available for us to download in as much quantity as we could get our hands on, attempting to grab the data for some of the more obscure stocks among the 505 seemed to require premium access to their databases - a $350 yearly privilege.
With barely enough to afford instant noodles we searched far and wide for a free alternative, and alas our saviour Yahoo Finance, or so we thought. We found a Python package called yahoo-finance which seemed to be used quite frequently and was nicely documented, the only problem being that every time we tried to query it we were bombarded with errors no matter what we tried. This turned out to be because the package is built on an API which Yahoo has since stopped supporting, and therefore the code was broken and useless. Back to the drawing board, and with a little more digging we stumbled upon another Python package, this time called fix-yahoo-finance. This package was designed, as the name suggests, in an effort to fix its dysfunctional predecessor, the caveat being that instead of contacting a nice neat API it now scrapes the data from the Yahoo website; a process that can sometimes yield messy results.
Whatever the accuracy, it almost seemed too simple: with one line of code (and a few minutes of waiting) we passed our list of ticker symbols through a function. What was returned was a pandas Panel object full of all the possible data, for all 505 stocks, from Jan 2000 all the way to Dec 2017. As can be seen below, for each day in this date range we had the adjusted close, close, high, low, open and volume values. For the majority of our analysis we decided to use the adjusted close price, as it is adjusted for significant corporate actions, making it in our opinion the better reflection of a stock's true value.
# Showing we can filter all equity quantities, across all companies, by a specific date
stocks.major_xs('2000-01-03')[98:103]
However, in our first step of just looking at our data, calling the .head() method, we realised we had a collection of stocks with missing (NaN) values for some set of days within our 17 year date range.
# Showing we can filter all companies, across all dates, for a specific equity quantity, i.e. 'Adj Close'
stocks['Adj Close'].head()
This brought us to the problem of how we were going to deal with missing values. With more time we could have done so in a much more sophisticated manner: if the range of missing values was small, we could have filled the gap using the average of the daily prices leading up to the missing chunk, and having these artificial values might have been much more valuable than completely excluding the company from our analysis. However, in the interest of time we decided it'd be best to simply get rid of any company with a missing value, using the .dropna(axis = 2) method. This means that even if a company has only one missing entry across all 4527 days, perhaps due to a bad scrape of the data on request, we're throwing it completely out the window; quite an extreme action to take. On the flip side, companies which have a complete data set from 2000 to 2017 likely have an intact data set stretching back decades before our date range, making the companies we're left with (likely) much safer bets due to their long lasting nature.
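A minimal sketch of the gentler alternative described above (not what we actually ran): forward-fill short gaps rather than averaging, applied here to the 'Adj Close' slice only, and drop only the tickers that still contain longer runs of missing values.
# forward-fill gaps of up to 5 trading days with the last observed price,
# then drop only the tickers that are still incomplete afterwards
adj_close_filled = stocks['Adj Close'].fillna(method='ffill', limit=5)
adj_close_partially_cleaned = adj_close_filled.dropna(axis=1)
adj_close_partially_cleaned.shape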
Assumptions aside, we went ahead with the simpler .dropna() approach, dropping the 119 tickers with NaN values, which left us with a much more refined and polished data set, now containing only 386 stocks instead of 505 - a much easier data set for us to deal with.
stocks_clean = stocks.dropna(axis = 2)
stocks_clean['Adj Close'].head()
stocks_clean['Adj Close'].shape
After we had this more structured set of data we could move on to actually exploring relationships - correlations - between each stock. Our first, and perhaps most obvious, line of thinking was to blindly create a correlation matrix for every single company using the original data we'd gathered: their daily adjusted close prices. As the following figure shows, when looking at the absolute price almost everything is very highly correlated, the majority being upwards of 80% positive correlation. We expected much looser correlations overall, and as a result of this matrix we decided that perhaps absolute price isn't the most even playing field for assessing the relationships between stocks.
# Taking a look at the correlation between every single stock's daily price
stock_corr_price = stocks_clean['Adj Close'].corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_price,
xticklabels=stock_corr_price.columns,
yticklabels=stock_corr_price.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Daily Adjusted Close Price", fontsize = 17)
After some thinking we instead opted to take the percentage change from day to day of each stock's price. This meant that instead of showing that a stock moved by some dollar amount, we could see its movement proportionate to the previous day. We made this change and recalculated the correlation matrix (as seen in the following figure), and were presented with a result much closer to what we were initially expecting. All companies are still, overall, positively correlated, but now the correlations are much weaker; there still exist some clear dark red dots - highly correlated pairs.
# Taking a look at the correlation between every single stock's daily pct change
stock_corr_all = stocks_clean['Adj Close'].pct_change()[1:].corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_all,
xticklabels=stock_corr_all.columns,
yticklabels=stock_corr_all.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Daily % Change in Adj Close Price", fontsize = 17)
plt.show()
To extend this line of thinking we decided to look at the correlations on a monthly basis instead of a daily basis. Our reasoning was that stocks, especially stocks on the S&P 500, generally don't move too much from day to day; perhaps by grouping their movement into larger chunks, like months, we would have a more accurate representation of the overall trend in their prices. To get these monthly values we simply took the mean of the daily percentage changes within each month, and then used this monthly average as a new basis for comparing stock movements and measuring correlations. This is of course quite a crude way of grouping, as a blind average isn't the most sophisticated measure of a stock's movement within a month, however it was quick, easy and gave us a reasonable baseline for monthly comparison.
As can be seen below, this yields a similar looking correlation matrix to the previous daily figure, with only some subtle differences. It seems that overall the correlations have lessened, with a large majority being close to 0%, or sitting in the very low, weakly positive (<30%) range. The stronger positive correlations, the darker red, seem to have become more obvious in this new figure - cementing the idea that those stocks' movements are indeed directly linked in some way.
# Finding and visualising the correlation matrix made from the
# resampled monthly dataset rather than the daily dataset
stock_corr_monthly = stocks_clean['Adj Close'].pct_change()[1:].resample('M').mean().corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_monthly,
xticklabels=stock_corr_monthly.columns,
yticklabels=stock_corr_monthly.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Monthly % Change in Adj Close Price", fontsize = 17)
Assessing correlation through each individual stock is interesting, but we can only really see so much - it's next to impossible to pick out any actual correlated stock pairings within these cluttered figures. As a way to overcome this problem we decided to use the information given to us in the original S&P 500 table that we scraped, to associate each individual stock with its sector. With a little work we managed to reindex our data, making use of the .groupby() method to sort it in a sensible manner. In the end we were left with the following 11 sectors, each containing some subgroup of the 386 stocks we are working with.
# Setting our original wikipedia df's index to the 'Ticker symbol' for better sorting
better_wiki = wiki_data[0].set_index('Ticker symbol', drop=False)
# Removing the SEC filings and CIK columns (unnecessary for our purpose)
better_wiki = better_wiki.iloc[:,[0,1,3,4,5,6,8]]
# Removing the tickers from this table in which we DON'T have full data for
# (changed to .reindex as it's more fitting than passing labels through .loc)
better_wiki = better_wiki.reindex(stocks_clean['Adj Close'].columns)
# Grouping data by 'GICS Sector'
sector_groups = better_wiki.groupby('GICS Sector').groups
# All unique sectors
sector_groups.keys()
For example, just by looking at the Energy sector we can see that it contains the following 25 stock tickers, and by filtering these specific tickers through our larger data set we can explore relationships held purely within this specific sector.
# Which stocks are contained within the 'Energy' sector
sector_groups['Energy']
As has been alluded to, this meant that we could now start looking at sector specific correlations. It makes sense that all stocks within one specific sector should move in the same way; if one energy company was to take a heavy loss then it'd seem likely that all other energy companies might follow suit - especially if it was a result of industry wide news or implications.
Here we decided to look at the correlation just within the energy sector, and as we expected, by and large everything is highly positively correlated. However, looking closely we can see there are some lighter patches of colour relating to a couple of stocks, particularly WMB & ANDV, indicating that they might be outliers as they tend not to move with the rest of the group.
# Finding the pairwise correlation within the 'Energy' sector and visualising correlation
energy_stocks = stocks_clean['Adj Close'].loc[:,list(sector_groups['Energy'])].pct_change().corr()
plt.figure(figsize=(13,10))
sns.heatmap(energy_stocks,
xticklabels=energy_stocks.columns,
yticklabels=energy_stocks.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Energy Sector Stock's Average Monthly % Change", fontsize = 17)
plt.show()
As an extension of this thinking we decided to remove these outliers which see little correlation, and recalculate the correlation matrix. As can be seen in the following figure, the expected behaviour took place: across the board everything is much more strongly correlated, and just by eyeballing it, it's clear that almost every stock within this pruned energy sector sees well and truly above 75% correlation.
# Finding the pairwise correlation within the 'Energy' sector and visualising correlation
energy_no_outliers = stocks_clean['Adj Close'].loc[:,list(list(sector_groups['Energy'])[1:-2][:-6] + list(['XOM', 'OXY', 'PXD', 'SLB']))].pct_change().corr()
plt.figure(figsize=(13,10))
sns.heatmap(energy_no_outliers,
xticklabels=energy_no_outliers.columns,
yticklabels=energy_no_outliers.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Monthly % Change in Adj Close Price (Removed Outliers)", fontsize = 15)
plt.show()
It would have been nice to go through each sector in a similar fashion and remove these outliers, but due to how our code is currently implemented it's a manual process even to produce one outlier-free sector. Again, in the interest of saving time, we opted to skip this step and simply look at the correlation between each sector in the following figure - prospective outliers included.
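As a pointer for that future improvement, here is a hypothetical helper (not part of our actual pipeline) sketching how the pruning could be automated: within a sector, keep only the stocks whose average correlation with the rest of the sector clears a chosen threshold.
# hypothetical helper: drop any stock whose mean correlation with the rest of
# its sector falls below `threshold` (0.3 here, chosen arbitrarily)
def prune_sector_outliers(prices, tickers, threshold=0.3):
    corr = prices.loc[:, list(tickers)].pct_change()[1:].corr()
    # mean off-diagonal correlation of each stock with the rest of its sector
    mean_corr = (corr.sum() - 1) / (len(corr) - 1)
    return list(mean_corr[mean_corr >= threshold].index)

# e.g. prune_sector_outliers(stocks_clean['Adj Close'], sector_groups['Energy'])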
# Creating a dictionary where each key is a sector and each value is the set
# of daily mean price change for all stocks within that sector
# OLD CODE
# values = []
# for i in sector_groups.keys():
# values.append(stocks_clean['Adj Close'].loc[:,list(sector_groups[i])].pct_change()[1:].mean(1))
values = [stocks_clean['Adj Close'].loc[:,list(sector_groups[i])].pct_change()[1:].mean(1) for i in sector_groups.keys()]
all_sector_means = dict(zip(sector_groups.keys(), values))
#all_sector_means.keys()
# Turning dictionary into a dataframe so correlations can easily be taken
# and then visualising the correlation matrix
plt.figure(figsize=(13,10))
df_sector_means = pd.DataFrame(all_sector_means).corr()
sns.heatmap(df_sector_means,
xticklabels=df_sector_means.columns,
yticklabels=df_sector_means.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Sector's Mean Daily % Change", fontsize = 17)
plt.show()
To our initial surprise, a good half of the sectors are quite strongly correlated with one another, the dark red squares indicating that these pairs show roughly 60% positive correlation or more, some much higher. The other half still shows some positive correlation, but for the most part they sit a lot closer to zero, the majority being insignificant pairs below 20% correlation. As this figure is much less cluttered it is also easy to pick out some quite peculiar pairs. For instance, we'd expect the Telecommunication Services and Energy sectors to be quite correlated, as both seem to revolve around cable based infrastructure as a large part of their business; however, they see some of the lowest correlation across the board. Of course, as we mentioned before, we have made no effort to remove outliers from each of these sectors; taking the effort to do so would likely result in much more meaningful correlations and is definitely a space for improvement.
As a way of confirming our analysis up to this point, we decided it would be a good idea to visualise two of the sectors we found to be highly correlated, to make sure that they do in fact exhibit similar movement. To achieve this we simply took the average across all stocks within a particular sector on any given day, and used this as a measure of the mean performance of that sector - a basis for comparison. In the following figure we've done just that, and as expected the industrials sector (black line) quite closely follows the trend of the materials sector (red line) from month to month; further evidence of their high correlation.
# Grabbing the daily percentage price change for 'Industrials' and 'Materials', resampled to monthly means
sector1 = stocks_clean['Adj Close'].loc[:,list(sector_groups['Industrials'])].pct_change()[1:].resample('M').mean()
sector2 = stocks_clean['Adj Close'].loc[:,list(sector_groups['Materials'])].pct_change()[1:].resample('M').mean()
# Plotting the mean percentage change for each sector against one another
plt.figure(figsize=(20,8))
ax1 = sector1.mean(1).plot(color='black', alpha = 1, grid=True, label='Industrials')
ax2 = sector2.mean(1).plot(color='red', alpha = 0.5, grid=True, secondary_y=True, label='Materials')
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
plt.legend(h1+h2, l1+l2, loc=2, fontsize = 12)
ax1.set_ylabel('Mean Monthly % Change', fontsize = 14)
ax1.set_xlabel('Date', fontsize = 14)  # label the primary axes directly; plt.xlabel was targeting the secondary axes created by secondary_y
plt.title('Industrials vs Materials Sector-based Mean Monthly % Change', fontsize = 17)
plt.show()
We decided that a more dynamic way of assessing this relationship would be to calculate the rolling correlation between these two sectors; again we used the mean across each sector as the basis for comparison, but this time only a year's worth of data at a time was used to calculate the correlation (hence why the following figure starts at 2001 and not 2000). As we can see, the correlation does change through time, but on average it hovers around 75% (or higher) for the majority of the period. There is a clear drop in correlation when only the most recent year's (2017) data is taken into account. This drop can be confirmed by taking a closer look at our previous plot and seeing how extreme the difference in movement between the two sectors becomes - a lack of correlation. It's hard for us to comment on this recent extreme dip, but it would definitely be a point of interest for future analysis.
# Finding, and plotting, the ROLLING correlation between
# each sector's monthly mean percentage price change
sector1.mean(1).rolling(13).corr(sector2.mean(1)).plot(figsize=(20,8))
plt.ylabel('Correlation', fontsize = 14)
plt.xlabel('Date', fontsize = 14)
plt.title('Industrials vs Materials Sector-based Mean Rolling Correlation', fontsize = 17)
plt.show()
Unfortunately, our whole reason for carrying out this heavy correlation based analysis was to search for negative correlations, whether between specific stocks, or even better, between different sectors. This is because negative correlations in finance are a perfect way of hedging your bets: if you hold two negatively correlated stocks within your portfolio and one of them takes a huge loss, it is likely that, due to their inverse relationship, the other stock will rise in value to make up for part of that loss. As a result, these pairs act as a natural way to reduce your portfolio's exposure, or in other words, a method of minimising your overall risk.
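The appeal of negative correlation is clearest in the two-asset case, where the portfolio variance is

$$ \sigma_p^2 \;=\; w_1^2\sigma_1^2 \;+\; w_2^2\sigma_2^2 \;+\; 2\,w_1 w_2\,\rho_{12}\,\sigma_1\sigma_2, $$

so a negative $\rho_{12}$ directly subtracts from the portfolio's overall risk; in the limiting case of $\rho_{12} = -1$ (and suitably chosen weights) the risk can in principle be cancelled out entirely.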
The fact that we didn't find any negative correlations, not even insignificant ones, was at first a shock and quite baffling. However, after thinking it through we believe we've made some logical sense of it. Saying it's not easy to get onto the S&P 500 is an understatement: a company essentially has to be one of the 500 largest publicly listed companies in the United States in terms of its market capitalisation - its market value. The pure scale of these companies is what we believe to be the driving factor behind their surprisingly strong correlations.
Diving deeper into this thought; big companies attract big interest. The only stakeholders who hold a significant amount of stock in these companies are those who can afford it, which in a lot of cases ends up being large hedge funds; sometimes these companies even attract government interest too. With all these large hedge funds having significant pieces of the pie, and almost always acting on similar information, it stands to reason that a lot of these companies might show similar characteristics - despite belonging to completely different sectors.
Another contributing factor could be the fact that the S&P 500 index is seen by most people as one of the safest places to put their money. For their investment to take a hit, the top 500 companies have to take a significant hit too - a prospect which seems unlikely. This means that investors, sometimes quite blindly, just throw their money at the index. And when bad news strikes, its reach is widespread, causing the majority to act in a similar manner. As all this trading is happening on the S&P 500 index, which is simply the weighted average of all its components, it stands to reason that its components will move in sync as the index rises or falls as a whole - thus leading to this overall correlation.
Even having not found any clear negative correlations, it still would've been nice to carry out our simulations keeping each individual sector in mind. However, again in the interest of saving time, we thought it best to keep things simple for now and push forward keeping all 386 stocks as potential candidates for our optimal portfolio.
Once we had explored the data, we decided that before we did optimisation of any sort, we needed to get a basic portfolio functioning with all of the fundamental calculations producing correct outputs. All of this was an attempt to make the transition from a basic portfolio to a more complex and optimised one as smooth as possible for future steps.
To begin this process we picked the first four stocks from our 386 that caught our eye, singled out their adjusted closing prices and put them into a table which can be seen below. The stocks themselves don't matter at all because we aren't concerned with the result; this whole process is simply a proof of concept to make sure things are functioning as we would like them to.
selected = ['YUM', 'AAPL', 'MSFT', 'AMZN']
table = stocks_clean['Adj Close'].loc[:,selected]
table.head()
As you can see we have the table of our chosen stocks with their adjusted close price for each day. The next step in producing this basic portfolio was to create an array of four weights with each weight representing the proportion of money that would be placed into each stock. To keep things simple we decided to evenly distribute the weights among all four stocks; placing 25% in each individual asset.
weights = np.array([0.25, 0.25, 0.25, 0.25])
weights
After we decided on the weights array we needed to find the daily percentage change of each stock's adjusted closing price. This was simple enough and achieved by using the .pct_change
method across all four columns of our table.
returns_daily = table.pct_change()
returns_daily.head()
After finding the daily percentage changes we needed some way of finding the annual percentage change; a plot of the daily percentage change for the YUM stock can be seen below. To get the annual figure we took the mean of each stock's daily percentage change across the 17 years of data and multiplied each result by 252 (252 trading days in a year); the resulting four values were our expected annual percentage changes for each stock.
In doing this we are really just taking a blind average of the percentage changes; we are not taking into account any trends in the historical data, it is in essence just a flat average. This is an area that could be improved to give a more accurate annual percentage change, however, due to time constraints we opted to take the unsophisticated route.
returns_daily.iloc[:,0].plot(figsize=(13,10))
plt.title("Daily % Change of YUM (KFC, Taco Bell & Pizza Hut's Parent Company)", fontsize = 17)
plt.xlabel("Date", fontsize = 14)
plt.ylabel("Daily % Change (Adj Close)", fontsize = 14)
Here you can see our results for each stock's annual percentage change, or rather their individual expected annual returns.
returns_annual = returns_daily.mean() * 252
returns_annual
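As a rough illustration of the kind of improvement mentioned above (not something we used in our results), an exponentially weighted mean would let recent history carry more weight than the distant past; a sketch assuming a one-year half-life:
# exponentially weighted mean of the daily % changes (half-life of ~1 trading year),
# taking the most recent smoothed value and annualising it with 252 trading days
returns_annual_ewm = returns_daily.ewm(halflife=252).mean().iloc[-1] * 252
returns_annual_ewm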
Once we had the stocks' expected annual returns, we simply took the dot product of these values with the weights we'd previously defined. The result was a scalar value which represents the expected return of this basic portfolio.
total_portfolio_return = np.dot(weights, returns_annual)
total_portfolio_return
The next step in the process was to calculate the daily covariance of each stock, this was done in a similar fashion and required us to make use of the .cov()
method. The output seen below is the daily covariance matrix of each stock's daily percentage returns.
cov_daily = returns_daily.cov()
cov_daily
Similar to what has been done previously when finding the annual returns, to find the annual covariance of each stock we simply took the daily covariance matrix and multiplied it by 252. The result is the annual covariance matrix which can be seen below.
cov_annual = cov_daily * 252
cov_annual
Once we had this annual covariance matrix we performed some basic linear algebra to find the volatility, or overall risk, of our portfolio. First we took the dot product of the annual covariance matrix with our weights, leaving us with a vector containing four values, as can be seen below.
np.dot(cov_annual, weights)
We then took the dot product again with this one-dimensional vector, but this time against the transposed weights vector. This resulted in the scalar value which can be seen below; a representation of our portfolio's variance.
np.dot(weights.T, np.dot(cov_annual, weights))
Although we have the variance, what we are really after is the volatility, the risk, the standard deviation. To find this we simply take the square root of the variance and we're left with the standard deviation or in other words, the total expected risk of our portfolio.
total_portfolio_risk = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
total_portfolio_risk
What we're actually working towards through calculating these values is our portfolio's Sharpe ratio. This is important because it's the ratio between our portfolio's expected return and its expected risk. Our Sharpe ratio is calculated quite trivially: we simply divide our portfolio's expected return by its risk. We've followed through with this exact calculation, as can be seen below. The Sharpe ratio we see here is by no means optimal, as we've taken no steps towards maximising it. However, that's not the point; the point of these calculations is to ensure they are performed accurately and have given us results that make sense, are logical, and are what we were expecting.
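In matrix form, the three quantities we've just computed are (with the risk-free rate implicitly taken as zero, since we divide the return directly by the volatility):

$$ E[R_p] \;=\; \mathbf{w}^\top \boldsymbol{\mu}, \qquad \sigma_p \;=\; \sqrt{\mathbf{w}^\top \Sigma\, \mathbf{w}}, \qquad S \;=\; \frac{E[R_p]}{\sigma_p}, $$

where $\boldsymbol{\mu}$ is the vector of annualised mean returns (daily mean $\times$ 252) and $\Sigma$ the annualised covariance matrix (daily covariance $\times$ 252).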
test_sharpe_ratio = total_portfolio_return / total_portfolio_risk
test_sharpe_ratio
After we'd confirmed that we can create a portfolio with our data, and logically find the portfolio-wide expected returns, risk and Sharpe ratio, it was just a matter of extending this process into some sort of automated simulation. Our simulation works as follows:

- randomly select a handful of stocks (5 by default) from the 386 candidates;
- calculate their annualised expected returns and covariance matrix from the historical daily percentage changes;
- generate thousands of random weight vectors (normalised to sum to 1) and, for each, compute the portfolio's expected return, volatility and Sharpe ratio;
- record the maximum-Sharpe and minimum-volatility portfolios for that set of stocks;
- repeat the whole procedure for a new random selection of stocks.
Once this whole simulation has been completed, which depending on the input parameters can take hours, we're left with a huge data frame containing all the simulated portfolios, and each of their associated values.
best_portfolios = []
simulated_sharpes = []
# !!!!
# change the number in `np.arange(Z)` for how many times you want to run the simulation
for i in np.arange(100):
# !!!!
# change the number in `np.random.choice(tickers, X, replace=False)` to
# choose how many stocks you want to populate your portfolio with (taken at
# random from the whole possible set of 386)
selected = np.random.choice(list(stocks_clean['Adj Close']), 5, replace = False)
table = stocks_clean['Adj Close'].loc[:,selected]
# stock specific annual returns
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 252
# stock specific annual covariance
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 252
# setting up simulation lists
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set to correct number for amount of stocks picked
num_assets = len(selected)
# !!!!
# change the number `num_portfolios = Y` to the desired amount of different
# weighting-pairs you wish to simulate
num_portfolios = 5000
# portfolio specific random weights, exp returns, exp risk & sharpe ratio
for single_portfolio in range(num_portfolios):
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(weights, returns_annual)
volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
sharpe = returns / volatility
sharpe_ratio.append(sharpe)
port_returns.append(returns)
port_volatility.append(volatility)
stock_weights.append(weights)
# save results into dictionary
portfolio = {'Returns': port_returns,
'Volatility': port_volatility,
'Sharpe Ratio': sharpe_ratio}
# extends dictionary to include ticker names for each weight
for counter,symbol in enumerate(selected):
portfolio[symbol+' Weight'] = [Weight[counter] for Weight in stock_weights]
# convert dictionary to dataframe
df = pd.DataFrame(portfolio)
# order column names in dataframe
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' Weight' for stock in selected]
df = df[column_order]
# find which simulated portfolios had min risk & max sharpe
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
simulated_sharpes.append(max_sharpe)
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# add the best portfolios to a separate list
best_portfolios.append([sharpe_portfolio, min_variance_port])
Of course, once we'd simulated these portfolios, the next step in our process was to find some way of filtering through this huge set of potential portfolios to find the portfolio(s) we deem to be the top performers; this is where we made use of the Sharpe ratio. We had three sets of potential portfolios we were most concerned with. First, and perhaps most obviously, was the portfolio which achieved the maximum Sharpe ratio; this is the largest ratio of expected return to expected risk, and as a result seems to be a strong indicator of good portfolio performance overall. The next set was simply the portfolios which achieved a Sharpe ratio of 1.0 or higher, indicating that their expected return at least matches their expected risk. The final set was just an extension of the previous one: portfolios which achieved a Sharpe ratio of 1.2 or higher. This is a way of separating the boys from the men, as a Sharpe ratio above 1.2 generally has quite a significant expected profit margin associated with it.
The two numbers outputted below are simply the number of portfolios contained within the latter two sets: the first being the number of simulated portfolios which achieved the 1.0 or higher status, and the second the number which achieved 1.2 or higher.
# collect, from every simulation run, the best-Sharpe portfolio that cleared each threshold
sharpe_one = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s >= 1]
sharpe_two = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s >= 1.2]
sharpe_max = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s == max(simulated_sharpes)]
len(sharpe_one), len(sharpe_two)
Below we find a list of all of these portfolios: their expected returns, expected risk, their exact Sharpe ratio, the exact weight held in each stock, and which stocks these weights belong to...
print('\n')
print('Maximum Sharpe Portfolio')
print('\n')
display(sharpe_max[0])
print('\n')
print('Sharpe ratio >= 1.0')
print('\n')
for i in range(len(sharpe_one)):
display(sharpe_one[i])
print('\n')
print('Sharpe ratio >= 1.2')
print('\n')
for i in range(len(sharpe_two)):
display(sharpe_two[i])
As has been mentioned, we ran the above simulation literally hundreds of times, each time assessing the output and taking note of which stocks tended to show up frequently, especially those being favoured (i.e. weighted highly within the portfolio). It would have been really nice to collect these stock tickers in an automated way, so that after running the simulation hundreds of times we could see a histogram of how often each appears in these top performing portfolios. However, due to how the simulation is set up, and the fact that these tickers are stored within the column names of a data frame, it would have been a lengthy process. Again, in the interest of saving time, we opted to simply assess the output by eye and take note of what we were commonly seeing. Of course this is far from a scientific method, but it was the best we could do in the time that we had - definitely room for future improvement in this process.
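For what it's worth, a rough sketch of how that tallying could be automated (hypothetical, not something we ran): it assumes the sharpe_one list from the previous cell and relies on every weight column being named '<TICKER> Weight'.
from collections import Counter

# count how often each ticker appears in the top-performing simulated portfolios
# by stripping the ' Weight' suffix from each dataframe's column names
ticker_counts = Counter(
    col.replace(' Weight', '')
    for port in sharpe_one
    for col in port.columns
    if col.endswith(' Weight')
)
pd.Series(ticker_counts).sort_values(ascending=False).plot(kind='bar', figsize=(13, 5))
plt.title('Ticker Frequency Across Top-Performing Simulated Portfolios')
plt.ylabel('Appearances')
plt.show()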
From the companies we kept noticing appear quite frequently, we started to develop and test another separate portfolio. As we kept running our simulation and noticing more stocks, we would add these stocks into this prospective perfect portfolio and assess whether its performance got significantly better or worse. As the hours went on, our list became bigger and bigger until we got to the point where we had the 13 stocks listed below within our portfolio. No matter what stock we added, it didn't seem to improve our Sharpe ratio; it would either remain roughly the same, or it would just diminish. As a result we decided to push through with our analysis, concluding that this was the best combination of 13 stocks we could find (in the time we had to work with).
a = ['MO', 'AAPL', 'GILD', 'ABC', 'COO', 'AOS', 'MNST', 'ESRX', 'BLL', 'CHD', 'HRL', 'LH', 'STZ']
print(a)
Once we had these 13 stocks we decided to run our simulation again, but this time, as the choice of stocks was fixed, we ran as many weight simulations as our computers could muster. In most cases we would end up with roughly the same weights, and the portfolio as a whole would achieve roughly a ~1.658 Sharpe ratio. The figure below is a visualisation of our simulation; every single dot is one of the thousands of possible portfolios we could make, each having a different weight combination. The red diamond showcases the portfolio which achieved the maximum Sharpe ratio; this can also be seen as the darkest green on the figure, as the colour scale simply represents the magnitude of the Sharpe ratio. The blue diamond, on the other hand, is the portfolio which achieved the minimum variance, and as a result is the expected safest, least risky way to balance the portfolio.
Having money on our minds we of course chose to push forward with the maximum Sharpe ratio based portfolio, as after these hundreds of thousands of simulations, it in essence is showcasing the best balance between our expected returns and our expected risk.
# !!!!
# here the stock selection is fixed to our 13 chosen stocks (the list `a` above),
# so only the weight combinations are being simulated
selected = a
table = stocks_clean['Adj Close'].loc[:,selected]
# stock specific annual returns
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 252
# stock specific annual covariance
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 252
# setting up simulation lists
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set to correct number for amount of stocks picked
num_assets = len(selected)
# !!!!
# change the number `num_portfolios = NUMBER` to the desired amount of different
# weighting-pairs you wish to simulate
num_portfolios = 100000
# portfolio specific random weights, exp returns, exp risk & sharpe ratio
for single_portfolio in range(num_portfolios):
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(weights, returns_annual)
volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
sharpe = returns / volatility
sharpe_ratio.append(sharpe)
port_returns.append(returns)
port_volatility.append(volatility)
stock_weights.append(weights)
# save results into dictionary
portfolio = {'Returns': port_returns,
'Volatility': port_volatility,
'Sharpe Ratio': sharpe_ratio}
# extends dictionary to include ticker names for each weight
for counter,symbol in enumerate(selected):
portfolio[symbol+' Weight'] = [Weight[counter] for Weight in stock_weights]
# convert dictionary to dataframe
df = pd.DataFrame(portfolio)
# order column names in dataframe
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' Weight' for stock in selected]
df = df[column_order]
# find which simulated portfolios had min risk & max sharpe
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
simulated_sharpes.append(max_sharpe)
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# add the best portfolios to a separate list
best_portfolios.append([sharpe_portfolio, min_variance_port])
# plot all simulated portfolios, with marker on MAX SHARPE & MIN RISK
plt.style.use('seaborn-dark')
df.plot.scatter(x='Volatility', y='Returns', c='Sharpe Ratio', cmap='RdYlGn',
edgecolors='black', figsize=(13, 10), grid=True, sharex=False)
plt.xlabel('Volatility (Std. Deviation)', fontsize = 14)
plt.ylabel('Expected Returns', fontsize = 14)
plt.scatter(x=sharpe_portfolio['Volatility'], y=sharpe_portfolio['Returns'], c='red', marker='D', s=100)
plt.scatter(x=min_variance_port['Volatility'], y=min_variance_port['Returns'], c='blue', marker='D', s=100)
plt.title('Optimal Portfolio Simulation', fontsize = 17)
plt.show()
display(sharpe_portfolio)
display(min_variance_port)
Once we had this portfolio which was performing so well, we decided, out of curiosity, to quickly have a look at the correlations between its component stocks - maybe there was something we missed in our analysis, and by chance we might now stumble across negative correlations? This was unfortunately not the case, and to our surprise everything was largely uncorrelated. The absolute highest value within the following matrix looks to be about 30%, and even that is nowhere near a significant correlation. After a bit of thinking, this does make some sense. We tried our best to find negative correlations during our exploration process but had no luck, and if negative correlations aren't available then the next best thing for a portfolio is probably to be completely uncorrelated - that is, if one stock takes a huge hit, the rest of the portfolio will (hopefully) not even budge.
best_companies = stocks_clean['Adj Close'].loc[:,a].pct_change().resample('W').mean().corr()
plt.figure(figsize=(13,10))
sns.heatmap(best_companies,
xticklabels=best_companies.columns,
yticklabels=best_companies.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
Despite these stocks' lack of correlation, and our lack of a concrete explanation for it, together they are still maximising our Sharpe ratio, consistently achieving the highest ratio we've ever seen across the hundreds of simulation instances we've run.
Up until this point we've been using the Sharpe ratio as a means of measuring a portfolio's prospective performance. However, this measure can only tell us so much, as it's based on our expected returns and expected risk. This means that we need a better way of evaluating our portfolio's performance that isn't just based on expected values. The first thing that came to mind was to take a page from financial risk management and conduct a Value at Risk (VaR) analysis. To carry this process through, we needed the portfolio's total return for each day, which was simply calculated by taking the dot product of our portfolio weights with each stock's daily percentage returns.
good_portfolio = stocks_clean['Adj Close'].loc[:,a].pct_change()[1:]
good_portfolio['Portfolio'] = np.dot(stocks_clean['Adj Close'].loc[:,a].pct_change()[1:].values, sharpe_portfolio.values[0][3:])
# thanks Felix
def HS_VaR(s, subsample, alpha=0.01, fixedwindows=False):
"""
Function to construct VaR via historical simulation assuming constant mean and variance.
Inputs:
s: (T,) numpy array containing financial returns data.
subsample: float in [0,1]. Proportion of the sample to be used for constructing the initial VaR.
alpha: float in [0,1]. Significance level.
fixedwindows: boolean. If True, a fixed-size rolling window is used to calculate the critical value; otherwise the window expands recursively.
Output:
sT: int. The observation where the forecast starts.
VaR: (T,) numpy array containing the fitted and forecast VaR.
"""
T = s.shape[0]
sT = int(np.floor(subsample*T))
r = copy.copy(s[0:sT])
VaR = np.zeros(T)
r.sort()
index = int(np.floor(alpha*sT))
VaR[0:sT] = r[index]
for t in np.arange(sT,T):
if fixedwindows is False:
r = copy.copy(s[0:t])
r.sort()
index = int(np.floor(alpha*t))
else:
r = copy.copy(s[t-sT:t])
r.sort()
index = int(np.floor(alpha*sT))
VaR[t] = r[index]
return sT, VaR
Once we've worked out these total daily portfolio return values, we can simply pass this vector through the VaR function in order to estimate the worst 1% case scenario for each day. We do this using two methods: rolling and recursive. Each set of values is placed into a new column, and these are then plotted against the actual returns we saw each day, as can be seen in the plot below.
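In symbols, the historical-simulation VaR produced by the function above is just an empirical quantile of past portfolio returns:

$$ \mathrm{VaR}^{\alpha}_{t} \;=\; Q_{\alpha}\bigl(\{\,r_s : s \in W_{t}\,\}\bigr), \qquad \alpha = 0.01, $$

where the window $W_t$ is either the most recent $sT$ observations (the 'rolling' case, `fixedwindows=True`) or every observation up to $t$ (the 'recursive' case, `fixedwindows=False`), and $Q_{\alpha}$ denotes the empirical $\alpha$-quantile of the sorted returns.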
sub = 252/good_portfolio.shape[0]
sT, good_portfolio['VaR_Portfolio_rolling'] = HS_VaR(good_portfolio['Portfolio'].values, subsample=sub, fixedwindows=True)
sT, good_portfolio['VaR_Portfolio_recursive'] = HS_VaR(good_portfolio['Portfolio'].values, subsample=sub, fixedwindows=False)
good_portfolio[['Portfolio', 'VaR_Portfolio_rolling', 'VaR_Portfolio_recursive']].plot(figsize=(13, 10))
plt.xlabel('Date', fontsize = 14)
plt.ylabel('Daily % Change in Price', fontsize = 14)
plt.title('Simulated Rolling & Recursive Value at Risk (VaR) of Optimal Portfolio', fontsize = 17)
plt.legend(fontsize = 12)
Visually, you can see that both lines do quite well at predicting these worst case, extreme losses. Our rolling scenario hugs the data much more closely; this becomes obvious during the extreme losses seen around 2009-2012, where it dramatically moves to fit the downward trend. Our recursive method, on the other hand, barely nudges during the extreme losses of the financial crisis, yet on average it performs slightly better. The table below shows that the rolling VaR is breached 1.22% of the time, while the recursive VaR does a little better, being breached only 0.94% of the time (against a nominal rate of 1%).
backtest = pd.DataFrame(index=['VaR Violations'])
backtest['rolling'] = 100*sum(good_portfolio['Portfolio'].values[sT:] < good_portfolio['VaR_Portfolio_rolling'].values[sT:])/(good_portfolio.shape[0]-sT)
backtest['recursive'] = 100*sum(good_portfolio['Portfolio'].values[sT:] < good_portfolio['VaR_Portfolio_recursive'].values[sT:])/(good_portfolio.shape[0]-sT)
backtest.transpose()
Instead of using the same data to evaluate our portfolio's performance we instead opted to collect a new data set containing the prices from Jan, 2018 to Jun, 2018. This allows us to have a testing set of sorts which showcases the real-life performance that our portfolio would have had, had we invested at the beginning of 2018.
# download 2018 stock data here if you want
# stocks2018 = yf.download(list(stocks_clean['Adj Close']), start="2018-01-01", end="2018-06-4")
# save the 2018 stock data to a destination on your computer if you want
# stocks2018.to_hdf("/Users/troyengelhardt/Desktop/data/stocks18.h5", "stocks")
# here we're reading in a file, we'll attach this file to the report somehow
# so our results are reproducible
stocks2018 = pd.read_hdf("/Users/troyengelhardt/Desktop/data/stocks18.h5")
# some stocks were missing data from the start of 2018, or at the end of our
# date range, so we simply dropped these (three) companies; we did check and
# they weren't any companies being used in our optimal portfolio
# np.where([stocks2018['Adj Close'].isna().any(axis=0)])[True]
stocks_clean2018 = stocks2018.dropna(axis = 2)
# np.where([stocks_clean2018['Adj Close'].isna().any(axis=0)])[True]
stocks_clean2018['Adj Close'].loc[:,a].pct_change().head()
As you can see from the above data, we've got the same daily percentage changes as before, but this time our data is of course from 2018, as made obvious by the row index on the left. Given that we now have the 2018 price data, and of course our optimal portfolio weights from the previous steps of this analysis, we can simulate the actual prospective performance of our portfolio and visualise its change in value.
Below we are doing just that; we're visualising the answer to the question: "if we were to invest $10 million into our optimal stock portfolio at the beginning of 2018, how would its total value have changed over those months? How much profit, if any, would we have made?"
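Concretely, the cell below compounds the daily percentage changes: stock $i$ starts at its weighted share of the \$10 million and each day's value is

$$ V_{i,0} \;=\; 10{,}000{,}000 \times w_i, \qquad V_{i,t} \;=\; V_{i,t-1}\,(1 + r_{i,t}), $$

while the total portfolio value is obtained by compounding the weighted daily portfolio return $\mathbf{w}^\top \mathbf{r}_t$ from the full \$10 million.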
# 2018 daily percentage change for each stock in our optimal portfolio
test1 = stocks_clean2018['Adj Close'].loc[:,a].pct_change()[1:]
values = []
for j,z in enumerate(list(test1)):
    # start each stock at its weighted share of the $10 million
    test2 = []
    test2.append(list(sharpe_portfolio.values[0][3:] * 10000000)[j])
    # compound each daily percentage change onto the previous day's value
    for x,i in enumerate(list(test1.loc[:,z].values)):
        if x == 104:
            break
        test2.append((i * test2[x]) + test2[x])
    values.append(test2)
# keep the ticker order aligned with the order we iterated over above
test3 = dict(zip(list(test1), values))
test4 = pd.DataFrame(test3)
test4 = test4.set_index(stocks_clean2018['Adj Close'].loc[:,a].pct_change().index)
# total portfolio value: compound the weighted 2018 daily portfolio returns from $10 million
portfolio_returns_2018 = np.dot(test1.values, sharpe_portfolio.values[0][3:])
test6 = []
test6.append(10000000)
for x,i in enumerate(portfolio_returns_2018):
    if x == 104:
        break
    test6.append((i * test6[x]) + test6[x])
test7 = pd.DataFrame(test6)
test7 = test7.set_index(stocks_clean2018['Adj Close'].loc[:,a].pct_change().index)
test7.columns = ['Portfolio']
test7.plot(figsize=(13, 10))
plt.ylabel('Total Portfolio Value ($)', fontsize = 14)
plt.title('Total Portfolio Performance', fontsize = 17)
plt.xlabel('Date', fontsize = 14)
The following figure is the same as above, but instead of the total portfolio value, we've broken it down into each individual stock's performance - the individual stock values - over the first five or so months of 2018. As you can see, the majority of the stocks hold steady throughout 2018, but any downward trend in the most heavily weighted stocks has a large influence on the total portfolio's value, as we'd expect.
test4.plot(figsize=(13, 10))
plt.ylabel('Stock Holding Value ($)', fontsize = 14)
plt.title('Individual Company Stock Performance', fontsize = 17)
plt.xlabel('Date', fontsize = 14)
plt.legend(loc = 'upper right', bbox_to_anchor=(0.971, 1), ncol = 13, columnspacing = 0.3)
Our original question from our client was simply which stocks they should invest in, and how much of their $10 million should be put into each stock. Based on our whole analysis and simulation, the best answer to these questions would be to spread their money across the following companies:
The amount of money the client should put into each stock is based on the following weights:
The actual dollar values are simply each weight multiplied by $10 million, and can be seen below:
Not only did this portfolio consistently provide the highest Sharpe ratio, but we can clearly see its behaviour throughout 2018: although the client might not have been happy in March, I think we'd all be celebrating considering that just a few days ago the portfolio (in this form) would have been up 10%, which is roughly $1 million of profit made for our client in just over 5 months.