import itertools
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import fix_yahoo_finance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import portfolioopt
import copy
Before we even begin to explain the problem at hand, it is essential that we first understand some basic finance terminology. The following points give a brief outline of the concepts we are interested in, and they should serve as a solid foundation for understanding the rest of the analysis to come.
Provided you have some understanding of these terms, you should be more than comfortable continuing through this project and being guided through our analysis and simulation process.
Perhaps shocked by the low interest rates offered by the banks, a client has approached us with $10 million and the simple goal of investing their money into a stock portfolio, spreading their wealth across an array of companies in the hope that its value will grow. The only caveat is that our client stipulated that their money must only go into companies listed within the S&P 500 stock index. Their questions are simple: given this restriction, which stocks should they invest in, and how much money should be put into each stock?
Simple enough, right? But our problem doesn't end there; we still have two more (perhaps obvious to some) factors we must keep in mind when pushing forward. The portfolio as a whole must remain relatively low risk; there's no point in us advising our client to put their money into a portfolio which could well lose almost all of its value within a matter of weeks. Risk is of course quite a subjective measure; for some people the rush of taking a heavy risk in the hope of a larger reward might just be a worthwhile endeavour. Greed can be a powerful ally, but we will instead err on the side of caution, assuming our client takes a more conservative approach when it comes to investing their hard earned pennies.
This is of course a double edged sword we're dealing with; we can't be too relaxed with our portfolio because, in most cases, opting for a lack of risk comes with a severe lack of reward. And if the rewards aren't sufficient then our client might as well have put their money in a bank to begin with, skipping this whole process. As a result, our goal is to best balance this risk & return trade-off, something which can be achieved by maximising the previously mentioned Sharpe ratio.
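For reference, the Sharpe ratio is conventionally defined as a portfolio's excess return per unit of risk,

$$ S \;=\; \frac{E[R_p] - R_f}{\sigma_p}, $$

where $E[R_p]$ is the portfolio's expected return, $R_f$ the risk-free rate and $\sigma_p$ the portfolio's volatility; later in this report we compute it with the risk-free rate taken as zero.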
Before we begin jumping into our analysis, or even forming a plan to deal with this problem, it's imperative that we understand what has been done before us. For as long as markets have existed, people have been trying to figure out strategies which best balance their portfolios; generations of investors before us have developed tried and tested methods of doing just that, and we aren't intending to reinvent the finance wheel.
Modern-day finance models are built on an array of assumptions, one of the base assumptions being that the market being invested in is efficient[1]. What this assumption essentially says is that investors will view information in the same way across the board and will make similarly pragmatic decisions when investing in the market. This, however, is far from the truth: in actual fact many investors are irrational and quite often make arbitrary decisions when it comes to buying or selling one or more stocks. We all view information in different ways, and the conclusion arrived at by one investor given some piece of information is not generally the same for the entire market[2]. These general assumptions surrounding the market and investors are without a doubt a main contributing factor to the variability in the accuracy of these models; although the models are still very useful, these are major flaws which should be kept in mind.
Portfolio theory first came to light in the 1950s when Modern Portfolio Theory (MPT) was introduced through contributions from Harry Markowitz and William Sharpe[3]. A staple in the field, MPT laid the foundations and is the basis for the portfolio theory seen today. MPT was introduced as a way for investors to weight their assets across particular stocks in order to maximise their expected return for some assumed level of risk. Years later, in the early 1990s, Post-Modern Portfolio Theory (PMPT) was introduced; in actuality it is an extension of MPT which relies on the same assumptions but provides a different perspective on one of the main ideas used within MPT[4]. PMPT is still a way for investors to optimally weight their assets in order to achieve an expected return based on some level of assumed risk, but the key difference between the two is in how the risk is calculated. Rather than associating risk with both positive and negative returns as done in MPT, i.e. making money and losing money both carry a degree of risk, PMPT instead defines risk as a value purely based on negative returns, meaning that risk is only associated with losing money. It seems like a sensible change, as people are generally only worried about losing money when investing, although it all depends on how you wish to define your risk as to which model you would use.
In more recent years, due to an exponential increase in computational affordability, access and power, online portfolio selection (OPS) has been developed; it is the modern-day culmination of the past 60 years of research in quantitative finance[5]. Unlike its predecessors, OPS doesn't have a single way of defining risk and optimally weighting assets. Instead, using machine learning and simulation techniques, it draws on an assortment of algorithms that encompass numerous different ways of optimally weighting assets based on actual stock performance and historical trends[6].
Through our own analysis we’ve tried to include these traditional finance ideas while also combining them with the computational power at our disposal, all in an attempt to best balance our own portfolio.
To explain the process in which we went about solving this problem we've presented the report using the following structure:

- Introduction: the background and finance concepts the report is based on
- Analysis: data collection, wrangling, cleaning and exploration
- Basic portfolio: a proof of concept to verify our calculations
- Simulation & optimisation: extending the basic portfolio in search of the optimal one
- Evaluation: Value-at-Risk analysis and the portfolio's performance over the first 5 months of 2018
- Conclusion: tying the results back to the client's question and suggesting future improvements
This structure hopes to guide you from the very beginning all the way to the very end of our process, showcasing our thought process and all the steps we've taken along the way. The introduction lays out the foundation of information on which the report is based. The analysis documents our struggle to get our hands on the data, and then presents the wrangling, cleaning and exploratory process which followed. We then implemented quite a basic portfolio to make sure that our calculations were correct; a proof of concept of sorts. With that being a success, our simulation could then be extended to a much larger portfolio, in the hope that by tuning the parameters we could work towards finding the most optimal portfolio. With our optimal portfolio found, we then pushed forward to evaluating its performance using a risk management technique (Value-at-Risk), and simply demonstrating its prospective real-life change in value through the first 5 months of 2018. Finally, we tie our whole process back to the question at hand by judging how happy we believe our client would be with this portfolio, and offer some future improvements to make this whole analysis and simulation a more useful and accurate process.
Before we even began querying an API for stock price data, we needed to know which companies are listed within the S&P 500 index, as its component companies are what we're required to build our portfolio from. Thankfully, it wasn't hard to find a table hosted by Wikipedia which contains all the information we need and then some. Most importantly, the table contains a column holding each company's ticker symbol; this identifies each company uniquely across the stock exchange and is the vital piece of information we needed to begin this process.
At first we managed to scrape each individual column using a Python package called BeautifulSoup, however with a little digging we realised it was significantly easier to automatically parse this HTML table through a prewritten pandas function which took care of all the heavy lifting for us. As you can see, not only were we left with the column of ticker symbols, which was vital, but we also had a lot of extra, useful information that we could use to group and filter companies: GICS Sector, GICS Sub Industry, the date the company was first added to the S&P 500 index, and so on.
# Contacts the Wikipedia page listing the S&P 500 companies. Gets the table which lists each company,
# their sector, industry, date added to the S&P 500 and date founded. Pandas then converts this HTML table
# into a pandas data frame which is ready to be used
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
wiki_data = pd.read_html(str(table), header = 0)
wiki_data[0].head()
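As an aside, pandas can also fetch and parse the page in a single call, skipping the explicit requests/BeautifulSoup step. A minimal sketch, assuming the constituents table is still the first table on the page:
# one call: let pandas download the URL and parse the first HTML table on the page directly
sp500_table = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", header=0)[0]
sp500_table.head()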
Interestingly enough, when we explored this table we noticed that there weren't 500 rows/companies as expected, but there were instead 505 rows/companies listed within the table. We found that this is because five companies listed in the index have two different classes of equity, therefore doubling their listings on the index and adding an extra five entries to the table. Although subtle, it's an important fact to notice as these stocks are almost guaranteed to have high correlation - possibly skewing the results if we analyse the index as a whole.
# download stocks if you want from here (sometimes NaN values differ)
# this however may take a few minutes, and may result in some slight data differences
# stocks = yf.download(list(wiki_data[0]['Ticker symbol']), start="2000-01-03", end="2017-12-29")
# save your downloaded file to a destination of your choosing (if you decide to download at all)
# stocks.to_hdf("/Users/troyengelhardt/Desktop/data/stocks00-17.h5", "stocks")
# so our analysis is repeatable we've saved OUR stock data to a .h5 file
# we will upload/attach a link to download so you can use the same data
# and follow our analysis EXACTLY, at your leisure
stocks = pd.read_hdf("/Users/troyengelhardt/Desktop/data/stocks00-17.h5")
We had all our stock tickers now, but the next hurdle we had to overcome in this process was actually gathering some company specific information; 505 sets of stock price information to be exact. Our first thought was to use Quandl, a service which we know to have a simple Python based API through which we could query data using our set of stock tickers. However, while the popular large stocks (TSLA, AAPL, MSFT, etc) were available for us to download in as much quantity as we could get our hands on, attempting to grab the data for some of the more obscure stocks among the 505 seemed to require premium access to their databases - a $350 yearly privilege.
With barely enough to afford instant noodles we searched far and wide for a free alternative, and alas our saviour Yahoo Finance, or so we thought. We found a Python package called yahoo-finance which seemed to be used quite frequently and was nicely documented, the only problem being that every time we tried to query it we were bombarded with errors no matter what we tried. This turned out to be because the package is built on an API which Yahoo has since stopped supporting, and therefore the code was broken and useless. Back to the drawing board, and with a little more digging we stumbled upon another Python package, this time called fix-yahoo-finance. This package was designed, as the name suggests, in an effort to fix its dysfunctional predecessor, the caveat being that instead of contacting a nice neat API it now scrapes the data from the Yahoo website; a process that can sometimes yield messy results.
Whatever the accuracy, it almost seemed too simple: with one line of code (and a few minutes of waiting) we passed our list of ticker symbols through a function. What was returned was a pandas Panel object full of all the possible data, for all 505 stocks, from Jan 2000 all the way to Dec 2017. As can be seen below, for each day in this date range we had the adjusted close, close, high, low, open and volume values. For the majority of our analysis we decided to use the adjusted close price, as it is adjusted for significant corporate actions, making it in our opinion the better reflection of a stock's true value.
# Showing we can filter all equity quantities, across all companies, by a specific date
stocks.major_xs('2000-01-03')[98:103]
However, in our first step of just looking at our data, calling the .head() method, we realised we had a collection of stocks with missing (NaN) values for some set of days within our 17 year date range.
# Showing we can filter all companies, across all dates, for a specific equity quantity, i.e. 'Adj Close'
stocks['Adj Close'].head()
This brought us to the problem of how we were going to deal with missing values. With more time we could have done so in a much more sophisticated manner: if the range of missing values was small, we could have filled the gap using the average of the daily prices leading up to the missing chunk, and having these artificial values might have been much more valuable than completely excluding the company from our analysis. However, in the interest of time we decided it'd be best to simply get rid of any company with a missing value, using the .dropna(axis = 2) method. This means that even if a company has only one missing entry across all 4527 days, perhaps due to a bad scrape of the data on request, we're throwing it completely out the window; quite an extreme action to take. On the flip side, companies which have a complete data set from 2000 to 2017 likely have an intact data set stretching back decades before our date range, making the companies we're left with (likely) much safer bets due to their long lasting nature.
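A minimal sketch of the gentler alternative described above (not what we actually ran): forward-fill short gaps rather than averaging, applied here to the 'Adj Close' slice only, and drop only the tickers that still contain longer runs of missing values.
# forward-fill gaps of up to 5 trading days with the last observed price,
# then drop only the tickers that are still incomplete afterwards
adj_close_filled = stocks['Adj Close'].fillna(method='ffill', limit=5)
adj_close_partially_cleaned = adj_close_filled.dropna(axis=1)
adj_close_partially_cleaned.shape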
Assumptions aside, we went ahead with the simpler .dropna() approach, dropping the 119 tickers with NaN values, which left us with a much more refined and polished data set, now containing only 386 stocks instead of 505 - a much easier data set for us to deal with.
stocks_clean = stocks.dropna(axis = 2)
stocks_clean['Adj Close'].head()
stocks_clean['Adj Close'].shape
After we had this more structured set of data we could move on to actually exploring relationships - correlations - between each stock. Our first, and perhaps most obvious, line of thinking was to blindly create a correlation matrix for every single company using the original data we'd gathered: their daily adjusted close prices. As the following figure shows, when looking at the absolute price almost everything is very highly correlated, the majority being upwards of 80% positive correlation. We expected much looser correlations overall, and as a result of this matrix we decided that perhaps absolute price isn't the most even playing field for assessing the relationships between stocks.
# Taking a look at the correlation between every single stock's daily price
stock_corr_price = stocks_clean['Adj Close'].corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_price,
xticklabels=stock_corr_price.columns,
yticklabels=stock_corr_price.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Daily Adjusted Close Price", fontsize = 17)
After some thinking we instead opted to take the percentage change from day to day of each stock's price. This meant that instead of showing that a stock moved by some dollar amount, we could see its movement proportionate to the previous day. We made this change and recalculated the correlation matrix (as seen in the following figure), and were presented with a result much closer to what we were initially expecting. All companies are still, overall, positively correlated, but now the correlations are much weaker; there still exist some clear dark red dots - highly correlated pairs.
# Taking a look at the correlation between every single stock's daily pct change
stock_corr_all = stocks_clean['Adj Close'].pct_change()[1:].corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_all,
xticklabels=stock_corr_all.columns,
yticklabels=stock_corr_all.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Daily % Change in Adj Close Price", fontsize = 17)
plt.show()
To extend this line of thinking we decided to look at the correlations on a monthly basis instead of a daily basis. Our reasoning was that stocks, especially stocks on the S&P 500, generally don't move too much from day to day; perhaps by grouping their movement into larger chunks, like months, we would have a more accurate representation of the overall trend in their prices. To get these monthly values we simply took the mean of the daily percentage changes within each month, and then used this monthly average as a new basis for comparing stock movements and measuring correlations. This is of course quite a crude way of grouping, as a blind average isn't the most sophisticated measure of a stock's movement within a month, however it was quick, easy and gave us a reasonable baseline for monthly comparison.
As can be seen below, this yields a similar looking correlation matrix to the previous daily figure, with only some subtle differences. It seems that overall the correlations have lessened, with a large majority being close to 0%, or sitting in the very low, weakly positive (<30%) range. The stronger positive correlations, the darker red, seem to have become more obvious in this new figure - cementing the idea that those stocks' movements are indeed directly linked in some way.
# Finding and visualising the correlation matrix made from the
# resampled monthly dataset rather than the daily dataset
stock_corr_monthly = stocks_clean['Adj Close'].pct_change()[1:].resample('M').mean().corr()
plt.figure(figsize=(13,10))
sns.heatmap(stock_corr_monthly,
xticklabels=stock_corr_monthly.columns,
yticklabels=stock_corr_monthly.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Monthly % Change in Adj Close Price", fontsize = 17)
Assessing correlation through each individual stock is interesting, but we can only really see so much - it's next to impossible to pick out any actual correlated stock pairings within these cluttered figures. As a way to overcome this problem we decided to use the information given to us in the original S&P 500 table that we scraped, to associate each individual stock with its sector. With a little work we managed to reindex our data, making use of the .groupby() method to sort it in a sensible manner. In the end we were left with the following 11 sectors, each containing some subgroup of the 386 stocks we are working with.
# Setting our original wikipedia df's index to the 'Ticker symbol' for better sorting
better_wiki = wiki_data[0].set_index('Ticker symbol', drop=False)
# Removing the SEC filings and CIK columns (unnecessary for our purpose)
better_wiki = better_wiki.iloc[:,[0,1,3,4,5,6,8]]
# Removing the tickers from this table in which we DON'T have full data for
# (changed to .reindex as it's more fitting than passing labels through .loc)
better_wiki = better_wiki.reindex(stocks_clean['Adj Close'].columns)
# Grouping data by 'GICS Sector'
sector_groups = better_wiki.groupby('GICS Sector').groups
# All unique sectors
sector_groups.keys()
For example, just by looking at the Energy sector we can see that it contains the following 25 stock tickers, and by filtering these specific tickers through our larger data set we can explore relationships held purely within this specific sector.
# Which stocks are contained within the 'Energy' sector
sector_groups['Energy']
As has been alluded to, this meant that we could now start looking at sector specific correlations. It makes sense that all stocks within one specific sector should move in the same way; if one energy company was to take a heavy loss then it'd seem likely that all other energy companies might follow suit - especially if it was a result of industry wide news or implications.
Here we decided to look at the correlation just within the energy sector, and as we expected, by and large everything is highly positively correlated. However, looking closely we can see there are some lighter patches of colour relating to a couple of stocks, particularly WMB & ANDV, indicating that they might be outliers as they tend not to move with the rest of the group.
# Finding the pairwise correlation within the 'Energy' sector and visualising correlation
energy_stocks = stocks_clean['Adj Close'].loc[:,list(sector_groups['Energy'])].pct_change().corr()
plt.figure(figsize=(13,10))
sns.heatmap(energy_stocks,
xticklabels=energy_stocks.columns,
yticklabels=energy_stocks.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Energy Sector Stock's Average Monthly % Change", fontsize = 17)
plt.show()
As an extension of this thinking we decided to remove these outliers which see little correlation, and recalculate the correlation matrix. As can be seen in the following figure, the expected behaviour took place: across the board everything is much more strongly correlated, and just by eyeballing it, it's clear that almost every stock within this pruned energy sector sees well and truly above 75% correlation.
# Finding the pairwise correlation within the 'Energy' sector and visualising correlation
energy_no_outliers = stocks_clean['Adj Close'].loc[:,list(list(sector_groups['Energy'])[1:-2][:-6] + list(['XOM', 'OXY', 'PXD', 'SLB']))].pct_change().corr()
plt.figure(figsize=(13,10))
sns.heatmap(energy_no_outliers,
xticklabels=energy_no_outliers.columns,
yticklabels=energy_no_outliers.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Stock's Monthly % Change in Adj Close Price (Removed Outliers)", fontsize = 15)
plt.show()
It would have been nice to go through each sector in a similar fashion and remove these outliers, but due to how our code is currently implemented it's a manual process even to produce one outlier-free sector. Again, in the interest of saving time, we opted to skip this step and simply look at the correlation between each sector in the following figure - prospective outliers included.
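As a pointer for that future improvement, here is a hypothetical helper (not part of our actual pipeline) sketching how the pruning could be automated: within a sector, keep only the stocks whose average correlation with the rest of the sector clears a chosen threshold.
# hypothetical helper: drop any stock whose mean correlation with the rest of
# its sector falls below `threshold` (0.3 here, chosen arbitrarily)
def prune_sector_outliers(prices, tickers, threshold=0.3):
    corr = prices.loc[:, list(tickers)].pct_change()[1:].corr()
    # mean off-diagonal correlation of each stock with the rest of its sector
    mean_corr = (corr.sum() - 1) / (len(corr) - 1)
    return list(mean_corr[mean_corr >= threshold].index)

# e.g. prune_sector_outliers(stocks_clean['Adj Close'], sector_groups['Energy'])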
# Creating a dictionary where each key is a sector and each value is the set
# of daily mean price change for all stocks within that sector
# OLD CODE
# values = []
# for i in sector_groups.keys():
# values.append(stocks_clean['Adj Close'].loc[:,list(sector_groups[i])].pct_change()[1:].mean(1))
values = [stocks_clean['Adj Close'].loc[:,list(sector_groups[i])].pct_change()[1:].mean(1) for i in sector_groups.keys()]
all_sector_means = dict(zip(sector_groups.keys(), values))
#all_sector_means.keys()
# Turning dictionary into a dataframe so correlations can easily be taken
# and then visualising the correlation matrix
plt.figure(figsize=(13,10))
df_sector_means = pd.DataFrame(all_sector_means).corr()
sns.heatmap(df_sector_means,
xticklabels=df_sector_means.columns,
yticklabels=df_sector_means.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Each Sector's Mean Daily % Change", fontsize = 17)
plt.show()
To our initial surprise, a good half of the sectors are quite strongly correlated with one another, the dark red squares indicating that these pairs show roughly 60% positive correlation or more, some much higher. The other half still shows some positive correlation, but for the most part they sit a lot closer to zero, the majority being insignificant pairs below 20% correlation. As this figure is much less cluttered it is also easy to pick out some quite peculiar pairs. For instance, we'd expect the Telecommunication Services and Energy sectors to be quite correlated, as both seem to revolve around cable based infrastructure as a large part of their business; however, they see some of the lowest correlation across the board. Of course, as we mentioned before, we have made no effort to remove outliers from each of these sectors; taking the effort to do so would likely result in much more meaningful correlations and is definitely a space for improvement.
As a way of confirming our analysis up to this point, we decided it would be a good idea to visualise two of the sectors we found to be highly correlated, to make sure that they do in fact exhibit similar movement. To achieve this we simply took the average across all stocks within a particular sector on any given day, and used this as a measure of the mean performance of that sector - a basis for comparison. In the following figure we've done just that, and as expected the industrials sector (black line) quite closely follows the trend of the materials sector (red line) from month to month; further evidence of their high correlation.
# Grabbing the daily percentage price change for 'Industrials' and 'Materials', resampled to monthly means
sector1 = stocks_clean['Adj Close'].loc[:,list(sector_groups['Industrials'])].pct_change()[1:].resample('M').mean()
sector2 = stocks_clean['Adj Close'].loc[:,list(sector_groups['Materials'])].pct_change()[1:].resample('M').mean()
# Plotting the mean percentage change for each sector against one another
plt.figure(figsize=(20,8))
ax1 = sector1.mean(1).plot(color='black', alpha = 1, grid=True, label='Industrials')
ax2 = sector2.mean(1).plot(color='red', alpha = 0.5, grid=True, secondary_y=True, label='Materials')
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
plt.legend(h1+h2, l1+l2, loc=2, fontsize = 12)
ax1.set_ylabel('Mean Monthly % Change', fontsize = 14)
ax1.set_xlabel('Date', fontsize = 14)  # label the primary axes directly; plt.xlabel was targeting the secondary axes created by secondary_y
plt.title('Industrials vs Materials Sector-based Mean Monthly % Change', fontsize = 17)
plt.show()
We decided that a more dynamic way of assessing this relationship would be to calculate the rolling correlation between these two sectors; again we used the mean across each sector as the basis for comparison, but this time only a year's worth of data at a time was used to calculate the correlation (hence why the following figure starts at 2001 and not 2000). As we can see, the correlation does change through time, but on average it hovers around 75% (or higher) for the majority of the period. There is a clear drop in correlation when only the most recent year's (2017) data is taken into account. This drop can be confirmed by taking a closer look at our previous plot and seeing how extreme the difference in movement between the two sectors becomes - a lack of correlation. It's hard for us to comment on this recent extreme dip, but it would definitely be a point of interest for future analysis.
# Finding, and plotting, the ROLLING correlation between
# each sector's monthly mean percentage price change
sector1.mean(1).rolling(13).corr(sector2.mean(1)).plot(figsize=(20,8))
plt.ylabel('Correlation', fontsize = 14)
plt.xlabel('Date', fontsize = 14)
plt.title('Industrials vs Materials Sector-based Mean Rolling Correlation', fontsize = 17)
plt.show()
Unfortunately, our whole reason for carrying out this heavy correlation based analysis was to search for negative correlations, whether between specific stocks, or even better, between different sectors. This is because negative correlations in finance are a perfect way of hedging your bets: if you hold two negatively correlated stocks within your portfolio and one of them takes a huge loss, it is likely that, due to their inverse relationship, the other stock will rise in value to make up for part of that loss. As a result, these pairs act as a natural way to reduce your portfolio's exposure, or in other words, a method of minimising your overall risk.
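The appeal of negative correlation is clearest in the two-asset case, where the portfolio variance is

$$ \sigma_p^2 \;=\; w_1^2\sigma_1^2 \;+\; w_2^2\sigma_2^2 \;+\; 2\,w_1 w_2\,\rho_{12}\,\sigma_1\sigma_2, $$

so a negative $\rho_{12}$ directly subtracts from the portfolio's overall risk; in the limiting case of $\rho_{12} = -1$ (and suitably chosen weights) the risk can in principle be cancelled out entirely.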
The fact that we didn't find any negative correlations, not even insignificant ones, was at first a shock and quite baffling. However, after thinking it through we believe we've made some logical sense of it. Saying it's not easy to get onto the S&P 500 is an understatement: a company essentially has to be one of the 500 largest publicly listed companies in the United States in terms of its market capitalisation - its market value. The pure scale of these companies is what we believe to be the driving factor behind their surprisingly strong correlations.
Diving deeper into this thought; big companies attract big interest. The only stakeholders who hold a significant amount of stock in these companies are those who can afford it, which in a lot of cases ends up being large hedge funds; sometimes these companies even attract government interest too. With all these large hedge funds having significant pieces of the pie, and almost always acting on similar information, it stands to reason that a lot of these companies might show similar characteristics - despite belonging to completely different sectors.
Another contributing factor could be the fact that the S&P 500 index is seen by most people as one of the safest places to put their money. For their investment to take a hit, the top 500 companies have to take a significant hit too - a prospect which seems unlikely. This means that investors, sometimes quite blindly, just throw their money at the index. And when bad news strikes, its reach is widespread, causing the majority to act in a similar manner. As all this trading is happening on the S&P 500 index, which is simply the weighted average of all its components, it stands to reason that its components will move in sync as the index rises or falls as a whole - thus leading to this overall correlation.
Even having not found any clear negative correlations, it still would've been nice to carry out our simulations keeping each individual sector in mind. However, again in the interest of saving time, we thought it best to keep things simple for now and push forward keeping all 386 stocks as potential candidates for our optimal portfolio.
Once we had explored the data, we decided that before we did optimisation of any sort, we needed to get a basic portfolio functioning with all of the fundamental calculations producing correct outputs. All of this was an attempt to make the transition from a basic portfolio to a more complex and optimised one as smooth as possible for future steps.
To begin this process we picked the first four stocks from our 386 that caught our eye, singled out their adjusted closing prices and put them into a table which can be seen below. The stocks themselves don't matter at all because we aren't concerned with the result; this whole process is simply a proof of concept to make sure things are functioning as we would like them to.
selected = ['YUM', 'AAPL', 'MSFT', 'AMZN']
table = stocks_clean['Adj Close'].loc[:,selected]
table.head()
As you can see we have the table of our chosen stocks with their adjusted close price for each day. The next step in producing this basic portfolio was to create an array of four weights with each weight representing the proportion of money that would be placed into each stock. To keep things simple we decided to evenly distribute the weights among all four stocks; placing 25% in each individual asset.
weights = np.array([0.25, 0.25, 0.25, 0.25])
weights
After we decided on the weights array we needed to find the daily percentage change of each stock's adjusted closing price. This was simple enough and achieved by using the .pct_change
method across all four columns of our table.
returns_daily = table.pct_change()
returns_daily.head()
After finding the daily percentage changes we needed some way of finding the annual percentage change; a plot of the daily percentage change for the YUM stock can be seen below. To get the annual figure we took the mean of each stock's daily percentage change across the 17 years of data and multiplied each result by 252 (252 trading days in a year); the resulting four values were our expected annual percentage changes for each stock.
In doing this we are really just taking a blind average of the percentage changes; we are not taking into account any trends in the historical data, it is in essence just a flat average. This is an area that could be improved to give a more accurate annual percentage change, however, due to time constraints we opted to take the unsophisticated route.
returns_daily.iloc[:,0].plot(figsize=(13,10))
plt.title("Daily % Change of YUM (KFC, Taco Bell & Pizza Hut's Parent Company)", fontsize = 17)
plt.xlabel("Date", fontsize = 14)
plt.ylabel("Daily % Change (Adj Close)", fontsize = 14)
Here you can see our results for each stock's annual percentage change, or rather their individual expected annual returns.
returns_annual = returns_daily.mean() * 252
returns_annual
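As a rough illustration of the kind of improvement mentioned above (not something we used in our results), an exponentially weighted mean would let recent history carry more weight than the distant past; a sketch assuming a one-year half-life:
# exponentially weighted mean of the daily % changes (half-life of ~1 trading year),
# taking the most recent smoothed value and annualising it with 252 trading days
returns_annual_ewm = returns_daily.ewm(halflife=252).mean().iloc[-1] * 252
returns_annual_ewm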
Once we had the stocks' expected annual returns, we simply took the dot product of these values with the weights we'd previously defined. The result was a scalar value which represents the expected return of this basic portfolio.
total_portfolio_return = np.dot(weights, returns_annual)
total_portfolio_return
The next step in the process was to calculate the daily covariance of each stock, this was done in a similar fashion and required us to make use of the .cov()
method. The output seen below is the daily covariance matrix of each stock's daily percentage returns.
cov_daily = returns_daily.cov()
cov_daily
Similar to what has been done previously when finding the annual returns, to find the annual covariance of each stock we simply took the daily covariance matrix and multiplied it by 252. The result is the annual covariance matrix which can be seen below.
cov_annual = cov_daily * 252
cov_annual
Once we had this annual covariance matrix we performed some basic linear algebra to find the volatility, or overall risk, of our portfolio. First we took the dot product of the annual covariance matrix with our weights, leaving us with a vector containing four values, as can be seen below.
np.dot(cov_annual, weights)
We then took the dot product again with this one-dimensional vector, but this time against the transposed weights vector. This resulted in the scalar value which can be seen below; a representation of our portfolio's variance.
np.dot(weights.T, np.dot(cov_annual, weights))
Although we have the variance, what we are really after is the volatility, the risk, the standard deviation. To find this we simply take the square root of the variance and we're left with the standard deviation or in other words, the total expected risk of our portfolio.
total_portfolio_risk = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
total_portfolio_risk
What we're actually working towards through calculating these values is our portfolio's Sharpe ratio. This is important because it's the ratio between our portfolio's expected return and its expected risk. Our Sharpe ratio is calculated quite trivially: we simply divide our portfolio's expected return by its risk. We've followed through with this exact calculation, as can be seen below. The Sharpe ratio we see here is by no means optimal, as we've taken no steps towards maximising it. However, that's not the point; the point of these calculations is to ensure they are performed accurately and have given us results that make sense, are logical, and are what we were expecting.
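In matrix form, the three quantities we've just computed are (with the risk-free rate implicitly taken as zero, since we divide the return directly by the volatility):

$$ E[R_p] \;=\; \mathbf{w}^\top \boldsymbol{\mu}, \qquad \sigma_p \;=\; \sqrt{\mathbf{w}^\top \Sigma\, \mathbf{w}}, \qquad S \;=\; \frac{E[R_p]}{\sigma_p}, $$

where $\boldsymbol{\mu}$ is the vector of annualised mean returns (daily mean $\times$ 252) and $\Sigma$ the annualised covariance matrix (daily covariance $\times$ 252).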
test_sharpe_ratio = total_portfolio_return / total_portfolio_risk
test_sharpe_ratio
After we'd confirmed that we can create a portfolio with our data, and logically find the portfolio-wide expected returns, risk and Sharpe ratio, it was just a matter of extending this process into some sort of automated simulation. Our simulation works as follows:

- randomly select a handful of stocks (5 by default) from the 386 candidates;
- calculate their annualised expected returns and covariance matrix from the historical daily percentage changes;
- generate thousands of random weight vectors (normalised to sum to 1) and, for each, compute the portfolio's expected return, volatility and Sharpe ratio;
- record the maximum-Sharpe and minimum-volatility portfolios for that set of stocks;
- repeat the whole procedure for a new random selection of stocks.
Once this whole simulation has been completed, which depending on the input parameters can take hours, we're left with a huge data frame containing all the simulated portfolios, and each of their associated values.
best_portfolios = []
simulated_sharpes = []
# !!!!
# change the number in `np.arange(Z)` for how many times you want to run the simulation
for i in np.arange(100):
# !!!!
# change the number in `np.random.choice(tickers, X, replace=False)` to
# choose how many stocks you want to populate your portfolio with (taken at
# random from the whole possible set of 386)
selected = np.random.choice(list(stocks_clean['Adj Close']), 5, replace = False)
table = stocks_clean['Adj Close'].loc[:,selected]
# stock specific annual returns
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 252
# stock specific annual covariance
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 252
# setting up simulation lists
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set to correct number for amount of stocks picked
num_assets = len(selected)
# !!!!
# change the number `num_portfolios = Y` to the desired amount of different
# weighting-pairs you wish to simulate
num_portfolios = 5000
# portfolio specific random weights, exp returns, exp risk & sharpe ratio
for single_portfolio in range(num_portfolios):
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(weights, returns_annual)
volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
sharpe = returns / volatility
sharpe_ratio.append(sharpe)
port_returns.append(returns)
port_volatility.append(volatility)
stock_weights.append(weights)
# save results into dictionary
portfolio = {'Returns': port_returns,
'Volatility': port_volatility,
'Sharpe Ratio': sharpe_ratio}
# extends dictionary to include ticker names for each weight
for counter,symbol in enumerate(selected):
portfolio[symbol+' Weight'] = [Weight[counter] for Weight in stock_weights]
# convert dictionary to dataframe
df = pd.DataFrame(portfolio)
# order column names in dataframe
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' Weight' for stock in selected]
df = df[column_order]
# find which simulated portfolios had min risk & max sharpe
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
simulated_sharpes.append(max_sharpe)
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# add the best portfolios to a separate list
best_portfolios.append([sharpe_portfolio, min_variance_port])
Of course, once we'd simulated these portfolios, the next step in our process was to find some way of filtering through this huge set of potential portfolios to find the portfolio(s) we deem to be the top performers; this is where we made use of the Sharpe ratio. We had three sets of potential portfolios we were most concerned with. First, and perhaps most obviously, was the portfolio which achieved the maximum Sharpe ratio; this is the largest ratio of expected return to expected risk, and as a result seems to be a strong indicator of good portfolio performance overall. The next set was simply the portfolios which achieved a Sharpe ratio of 1.0 or higher, indicating that their expected return at least matches their expected risk. The final set was just an extension of the previous one: portfolios which achieved a Sharpe ratio of 1.2 or higher. This is a way of separating the boys from the men, as a Sharpe ratio above 1.2 generally has quite a significant expected profit margin associated with it.
The two numbers outputted below are simply the number of portfolios contained within the latter two sets: the first being the number of simulated portfolios which achieved the 1.0 or higher status, and the second the number which achieved 1.2 or higher.
# collect, from every simulation run, the best-Sharpe portfolio that cleared each threshold
sharpe_one = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s >= 1]
sharpe_two = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s >= 1.2]
sharpe_max = [port[0] for port, s in zip(best_portfolios, simulated_sharpes) if s == max(simulated_sharpes)]
len(sharpe_one), len(sharpe_two)
Below we find a list of all of these portfolios: their expected returns, expected risk, their exact Sharpe ratio, the exact weight held in each stock, and which stocks these weights belong to...
print('\n')
print('Maximum Sharpe Portfolio')
print('\n')
display(sharpe_max[0])
print('\n')
print('Sharpe ratio >= 1.0')
print('\n')
for i in range(len(sharpe_one)):
display(sharpe_one[i])
print('\n')
print('Sharpe ratio >= 1.2')
print('\n')
for i in range(len(sharpe_two)):
display(sharpe_two[i])
As has been mentioned, we ran the above simulation literally hundreds of times, each time assessing the output and taking note of which stocks tended to show up frequently, especially those being favoured (i.e. weighted highly within the portfolio). It would have been really nice to collect these stock tickers in an automated way, so that after running the simulation hundreds of times we could see a histogram of how often each appears in these top performing portfolios. However, due to how the simulation is set up, and the fact that these tickers are stored within the column names of a data frame, it would have been a lengthy process. Again, in the interest of saving time, we opted to simply assess the output by eye and take note of what we were commonly seeing. Of course this is far from a scientific method, but it was the best we could do in the time that we had - definitely room for future improvement in this process.
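For what it's worth, a rough sketch of how that tallying could be automated (hypothetical, not something we ran): it assumes the sharpe_one list from the previous cell and relies on every weight column being named '<TICKER> Weight'.
from collections import Counter

# count how often each ticker appears in the top-performing simulated portfolios
# by stripping the ' Weight' suffix from each dataframe's column names
ticker_counts = Counter(
    col.replace(' Weight', '')
    for port in sharpe_one
    for col in port.columns
    if col.endswith(' Weight')
)
pd.Series(ticker_counts).sort_values(ascending=False).plot(kind='bar', figsize=(13, 5))
plt.title('Ticker Frequency Across Top-Performing Simulated Portfolios')
plt.ylabel('Appearances')
plt.show()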
From the companies we kept noticing appear quite frequently, we started to develop and test another separate portfolio. As we kept running our simulation and noticing more stocks, we would add these stocks into this prospective perfect portfolio and assess whether its performance got significantly better or worse. As the hours went on, our list became bigger and bigger until we got to the point where we had the 13 stocks listed below within our portfolio. No matter what stock we added, it didn't seem to improve our Sharpe ratio; it would either remain roughly the same, or it would just diminish. As a result we decided to push through with our analysis, concluding that this was the best combination of 13 stocks we could find (in the time we had to work with).
a = ['MO', 'AAPL', 'GILD', 'ABC', 'COO', 'AOS', 'MNST', 'ESRX', 'BLL', 'CHD', 'HRL', 'LH', 'STZ']
print(a)
Once we had these 13 stocks we decided to run our simulation again, but this time, as the choice of stocks was fixed, we ran as many weight simulations as our computers could muster. In most cases we would end up with roughly the same weights, and the portfolio as a whole would achieve roughly a ~1.658 Sharpe ratio. The figure below is a visualisation of our simulation; every single dot is one of the thousands of possible portfolios we could make, each having a different weight combination. The red diamond showcases the portfolio which achieved the maximum Sharpe ratio; this can also be seen as the darkest green on the figure, as the colour scale simply represents the magnitude of the Sharpe ratio. The blue diamond, on the other hand, is the portfolio which achieved the minimum variance, and as a result is the expected safest, least risky way to balance the portfolio.
Having money on our minds we of course chose to push forward with the maximum Sharpe ratio based portfolio, as after these hundreds of thousands of simulations, it in essence is showcasing the best balance between our expected returns and our expected risk.
# !!!!
# here the stock selection is fixed to our 13 chosen stocks (the list `a` above),
# so only the weight combinations are being simulated
selected = a
table = stocks_clean['Adj Close'].loc[:,selected]
# stock specific annual returns
returns_daily = table.pct_change()
returns_annual = returns_daily.mean() * 252
# stock specific annual covariance
cov_daily = returns_daily.cov()
cov_annual = cov_daily * 252
# setting up simulation lists
port_returns = []
port_volatility = []
sharpe_ratio = []
stock_weights = []
# set to correct number for amount of stocks picked
num_assets = len(selected)
# !!!!
# change the number `num_portfolios = NUMBER` to the desired amount of different
# weighting-pairs you wish to simulate
num_portfolios = 100000
# portfolio specific random weights, exp returns, exp risk & sharpe ratio
for single_portfolio in range(num_portfolios):
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(weights, returns_annual)
volatility = np.sqrt(np.dot(weights.T, np.dot(cov_annual, weights)))
sharpe = returns / volatility
sharpe_ratio.append(sharpe)
port_returns.append(returns)
port_volatility.append(volatility)
stock_weights.append(weights)
# save results into dictionary
portfolio = {'Returns': port_returns,
'Volatility': port_volatility,
'Sharpe Ratio': sharpe_ratio}
# extends dictionary to include ticker names for each weight
for counter,symbol in enumerate(selected):
portfolio[symbol+' Weight'] = [Weight[counter] for Weight in stock_weights]
# convert dictionary to dataframe
df = pd.DataFrame(portfolio)
# order column names in dataframe
column_order = ['Returns', 'Volatility', 'Sharpe Ratio'] + [stock+' Weight' for stock in selected]
df = df[column_order]
# find which simulated portfolios had min risk & max sharpe
min_volatility = df['Volatility'].min()
max_sharpe = df['Sharpe Ratio'].max()
simulated_sharpes.append(max_sharpe)
sharpe_portfolio = df.loc[df['Sharpe Ratio'] == max_sharpe]
min_variance_port = df.loc[df['Volatility'] == min_volatility]
# add the best portfolios to a separate list
best_portfolios.append([sharpe_portfolio, min_variance_port])
# plot all simulated portfolios, with marker on MAX SHARPE & MIN RISK
plt.style.use('seaborn-dark')
df.plot.scatter(x='Volatility', y='Returns', c='Sharpe Ratio', cmap='RdYlGn',
edgecolors='black', figsize=(13, 10), grid=True, sharex=False)
plt.xlabel('Volatility (Std. Deviation)', fontsize = 14)
plt.ylabel('Expected Returns', fontsize = 14)
plt.scatter(x=sharpe_portfolio['Volatility'], y=sharpe_portfolio['Returns'], c='red', marker='D', s=100)
plt.scatter(x=min_variance_port['Volatility'], y=min_variance_port['Returns'], c='blue', marker='D', s=100)
plt.title('Optimal Portfolio Simulation', fontsize = 17)
plt.show()
display(sharpe_portfolio)
display(min_variance_port)
Once we had this portfolio which was performing so well, we decided, out of curiosity, to quickly have a look at the correlations between its component stocks - maybe there was something we missed in our analysis, and by chance we might now stumble across negative correlations? This was unfortunately not the case, and to our surprise everything was largely uncorrelated. The absolute highest value within the following matrix looks to be about 30%, and even that is nowhere near a significant correlation. After a bit of thinking, this does make some sense. We tried our best to find negative correlations during our exploration process but had no luck, and if negative correlations aren't available then the next best thing for a portfolio is probably to be completely uncorrelated - that is, if one stock takes a huge hit, the rest of the portfolio will (hopefully) not even budge.
best_companies = stocks_clean['Adj Close'].loc[:,a].pct_change().resample('W').mean().corr()
plt.figure(figsize=(13,10))
sns.heatmap(best_companies,
xticklabels=best_companies.columns,
yticklabels=best_companies.columns,
cmap="RdBu_r", vmin=-1, vmax=1)
Despite these stocks' lack of correlation, and our lack of a concrete explanation for it, together they are still maximising our Sharpe ratio, consistently achieving the highest ratio we've ever seen across the hundreds of simulation instances we've run.
Up until this point we've been using the Sharpe ratio as a means of measuring a portfolio's prospective performance. However, this measure can only tell us so much, as it's based on our expected returns and expected risk. This means that we need a better way of evaluating our portfolio's performance that isn't just based on expected values. The first thing that came to mind was to take a page from financial risk management and conduct a Value at Risk (VaR) analysis. To carry this process through, we needed the portfolio's total return for each day, which was simply calculated by taking the dot product of our portfolio weights with each stock's daily percentage returns.
good_portfolio = stocks_clean['Adj Close'].loc[:,a].pct_change()[1:]
good_portfolio['Portfolio'] = np.dot(stocks_clean['Adj Close'].loc[:,a].pct_change()[1:].values, sharpe_portfolio.values[0][3:])
# thanks Felix
def HS_VaR(s, subsample, alpha=0.01, fixedwindows=False):
"""
Function to construct VaR via historical simulation assuming constant mean and variance.
Inputs:
s: (T,) numpy array containing financial returns data.
subsample: float in [0,1]. Proportion of the sample to be used for constructing the initial VaR.
alpha: float in [0,1]. Significance level.
fixedwindows: boolean. If True, a fixed-size rolling window is used to calculate the critical value; otherwise the window expands recursively.
Output:
sT: int. The observation where the forecast starts.
VaR: (T,) numpy array containing the fitted and forecast VaR.
"""
T = s.shape[0]
sT = int(np.floor(subsample*T))
r = copy.copy(s[0:sT])
VaR = np.zeros(T)
r.sort()
index = int(np.floor(alpha*sT))
VaR[0:sT] = r[index]
for t in np.arange(sT,T):
if fixedwindows is False:
r = copy.copy(s[0:t])
r.sort()
index = int(np.floor(alpha*t))
else:
r = copy.copy(s[t-sT:t])
r.sort()
index = int(np.floor(alpha*sT))
VaR[t] = r[index]
return sT, VaR
Once we've worked out these total daily portfolio return values, we can simply pass this vector through the VaR function in order to estimate the worst 1% case scenario for each day. We do this using two methods: rolling and recursive. Each set of values is placed into a new column, and these are then plotted against the actual returns we saw each day, as can be seen in the plot below.
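In symbols, the historical-simulation VaR produced by the function above is just an empirical quantile of past portfolio returns:

$$ \mathrm{VaR}^{\alpha}_{t} \;=\; Q_{\alpha}\bigl(\{\,r_s : s \in W_{t}\,\}\bigr), \qquad \alpha = 0.01, $$

where the window $W_t$ is either the most recent $sT$ observations (the 'rolling' case, `fixedwindows=True`) or every observation up to $t$ (the 'recursive' case, `fixedwindows=False`), and $Q_{\alpha}$ denotes the empirical $\alpha$-quantile of the sorted returns.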
sub = 252/good_portfolio.shape[0]
sT, good_portfolio['VaR_Portfolio_rolling'] = HS_VaR(good_portfolio['Portfolio'].values, subsample=sub, fixedwindows=True)
sT, good_portfolio['VaR_Portfolio_recursive'] = HS_VaR(good_portfolio['Portfolio'].values, subsample=sub, fixedwindows=False)
good_portfolio[['Portfolio', 'VaR_Portfolio_rolling', 'VaR_Portfolio_recursive']].plot(figsize=(13, 10))
plt.xlabel('Date', fontsize = 14)
plt.ylabel('Daily % Change in Price', fontsize = 14)
plt.title('Simulated Rolling & Recursive Value at Risk (VaR) of Optimal Portfolio', fontsize = 17)
plt.legend(fontsize = 12)
Visually, you can see that both lines do quite well at predicting these worst case, extreme losses. Our rolling scenario hugs the data much more closely; this becomes obvious during the extreme losses seen around 2009-2012, where it dramatically moves to fit the downward trend. Our recursive method, on the other hand, barely nudges during the extreme losses of the financial crisis, yet on average it performs slightly better. The table below shows that the rolling VaR is breached 1.22% of the time, while the recursive VaR does a little better, being breached only 0.94% of the time (against a nominal rate of 1%).
backtest = pd.DataFrame(index=['VaR Violations'])
backtest['rolling'] = 100*sum(good_portfolio['Portfolio'].values[sT:] < good_portfolio['VaR_Portfolio_rolling'].values[sT:])/(good_portfolio.shape[0]-sT)
backtest['recursive'] = 100*sum(good_portfolio['Portfolio'].values[sT:] < good_portfolio['VaR_Portfolio_recursive'].values[sT:])/(good_portfolio.shape[0]-sT)
backtest.transpose()
Instead of using the same data to evaluate our portfolio's performance we instead opted to collect a new data set containing the prices from Jan, 2018 to Jun, 2018. This allows us to have a testing set of sorts which showcases the real-life performance that our portfolio would have had, had we invested at the beginning of 2018.
# download 2018 stock data here if you want
# stocks2018 = yf.download(list(stocks_clean['Adj Close']), start="2018-01-01", end="2018-06-4")
# save the 2018 stock data to a destination on your computer if you want
# stocks2018.to_hdf("/Users/troyengelhardt/Desktop/data/stocks18.h5", "stocks")
# here we're reading in a file, we'll attach this file to the report somehow
# so our results are reproducible
stocks2018 = pd.read_hdf("/Users/troyengelhardt/Desktop/data/stocks18.h5")
# some stocks were missing data from the start of 2018, or at the end of our
# date range, so we simply dropped these (three) companies; we did check and
# they weren't any companies being used in our optimal portfolio
# np.where([stocks2018['Adj Close'].isna().any(axis=0)])[True]
stocks_clean2018 = stocks2018.dropna(axis = 2)
# np.where([stocks_clean2018['Adj Close'].isna().any(axis=0)])[True]
stocks_clean2018['Adj Close'].loc[:,a].pct_change().head()
As you can see from the above data, we've got the same daily percentage changes as before, but this time our data is of course from 2018, as made obvious by the row index on the left. Given that we now have the 2018 price data, and of course our optimal portfolio weights from the previous steps of this analysis, we can simulate the actual prospective performance of our portfolio and visualise its change in value.
Below we are doing just that; we're visualising the answer to the question: "if we were to invest $10 million into our optimal stock portfolio at the beginning of 2018, how would its total value have changed over those months? How much profit, if any, would we have made?"
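Concretely, the cell below compounds the daily percentage changes: stock $i$ starts at its weighted share of the \$10 million and each day's value is

$$ V_{i,0} \;=\; 10{,}000{,}000 \times w_i, \qquad V_{i,t} \;=\; V_{i,t-1}\,(1 + r_{i,t}), $$

while the total portfolio value is obtained by compounding the weighted daily portfolio return $\mathbf{w}^\top \mathbf{r}_t$ from the full \$10 million.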
# 2018 daily percentage change for each stock in our optimal portfolio
test1 = stocks_clean2018['Adj Close'].loc[:,a].pct_change()[1:]
values = []
for j,z in enumerate(list(test1)):
    # start each stock at its weighted share of the $10 million
    test2 = []
    test2.append(list(sharpe_portfolio.values[0][3:] * 10000000)[j])
    # compound each daily percentage change onto the previous day's value
    for x,i in enumerate(list(test1.loc[:,z].values)):
        if x == 104:
            break
        test2.append((i * test2[x]) + test2[x])
    values.append(test2)
# keep the ticker order aligned with the order we iterated over above
test3 = dict(zip(list(test1), values))
test4 = pd.DataFrame(test3)
test4 = test4.set_index(stocks_clean2018['Adj Close'].loc[:,a].pct_change().index)
# total portfolio value: compound the weighted 2018 daily portfolio returns from $10 million
portfolio_returns_2018 = np.dot(test1.values, sharpe_portfolio.values[0][3:])
test6 = []
test6.append(10000000)
for x,i in enumerate(portfolio_returns_2018):
    if x == 104:
        break
    test6.append((i * test6[x]) + test6[x])
test7 = pd.DataFrame(test6)
test7 = test7.set_index(stocks_clean2018['Adj Close'].loc[:,a].pct_change().index)
test7.columns = ['Portfolio']
test7.plot(figsize=(13, 10))
plt.ylabel('Total Portfolio Value ($)', fontsize = 14)
plt.title('Total Portfolio Performance', fontsize = 17)
plt.xlabel('Date', fontsize = 14)
The following figure is the same as above, but instead of the total portfolio value, we've broken it down into each individual stock's performance - the individual stock values - over the first five or so months of 2018. As you can see, the majority of the stocks hold steady throughout 2018, but any downward trend in the most heavily weighted stocks has a large influence on the total portfolio's value, as we'd expect.
test4.plot(figsize=(13, 10))
plt.ylabel('Stock Holding Value ($)', fontsize = 14)
plt.title('Individual Company Stock Performance', fontsize = 17)
plt.xlabel('Date', fontsize = 14)
plt.legend(loc = 'upper right', bbox_to_anchor=(0.971, 1), ncol = 13, columnspacing = 0.3)
Our original question from our client was simply which stocks they should invest in, and how much of their $10 million should be put into each stock. Based on our whole analysis and simulation, the best answer to these questions would be to spread their money across the following companies:
The amount of money the client should put into each stock is based on the following weights:
The actual dollar values are simply each weight multiplied by $10 million, and can be seen below:
Not only did this portfolio consistently provide the highest Sharpe ratio, but we can clearly see its behaviour throughout 2018: although the client might not have been happy in March, I think we'd all be celebrating considering that just a few days ago the portfolio (in this form) would have been up 10%, which is roughly $1 million of profit made for our client in just over 5 months.