The main purpose of my final project was to analyze a selected data set using the tools I have built over this semester, such as Linear Regression, PCA, and Classification. I chose the Dow Jones Index Data Set from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index. The primary question regarding this dataset was as follows:
- 'Can I use machine learning methods to predict the performance of stocks (or at least classify them into sub-categories) given some information about them?'
Instructions for GUI:
Before explaining my final project further, I want to ensure that the user can use my GUI (visualization/analysis app) to follow and understand my work. Thus, I will introduce the steps needed to use my program specifically with my Dow Jones dataset.
- Image 1 shows the full picture of my GUI.
There are three new buttons I created in order to analyze my dataset: 'Classify_PCA', 'Create Dow-Jones Labels', and 'Split Dow-Jones'. These buttons can be found in the left control frame under the title 'Final Project'. The first thing the user has to do is click the 'Create Dow-Jones Labels' button in order to write a csv file with category labels for my dataset. This needs to be the first step because my dataset, dow_jones.csv, does not contain category labels. I thus use the dependent variable in the dataset to create labels, which will be used for classification, i.e. supervised learning. When the user clicks the button, the dialog in image 2 will pop up; the user should then simply select the file dow_jones.csv (this is the cleaned version of dow_jones_index.csv). When labeling is done, the user should see a file named dow_jones_5_Y.csv in the current directory, i.e. where display.py is located. The number 5 indicates the number of unique labels.
- After creating labels for the dataset, the user now needs to click the button 'Split Dow-Jones', which involves selecting two files in a row: 1) the data set (dow_jones.csv) and 2) the file with category labels (dow_jones_5_Y.csv). The button splits these files into four new files: dow_jones_train.csv (training set), dow_jones_test.csv (test set), dow_jones_5_train_Y.csv (training category labels), and dow_jones_5_test_Y.csv (test category labels).
- With the train/test sets and category labels, the user is now able to conduct classification analysis. To do so, the user needs to click the button 'Perform Classify', and a dialog window (image 3) will pop up for selecting the files for the train/test sets, the category labels, the classification method, and the necessary parameters for KNN if selected. Make sure to select the correct file for each listbox in order to perform the analysis successfully. Once the selections are made, another dialog window (image 4) will pop up so that the user can choose which headers to use for the classification analysis. If the user does not select any header, every header is used when separate category-label files were provided; otherwise every header but the last one is used. This is where the user needs to be very careful: for the dow_jones dataset, the user must select every header but the last one. The reason is that the last column is the dependent variable I try to predict. If this variable is included in the classification analysis, the classifier sees the very value the labels were derived from, and the reported accuracy would be misleading.
(Image 3) (Image 4)
- When all selections are made correctly and the analysis executes successfully, the user should see a new entry in the second listbox of the left control frame, as in image 5. Once the analysis is loaded, the user can select the entry and click the button 'Classify Results' to see the confusion matrices and accuracy values (image 6), or click the button 'Project Classify' to project the newly generated analysis file into my GUI. In order to project the data, the user needs to select the analysis file first (such as dow_jones_test_Naive Bayes_5_analysis.csv) and then choose headers from a dialog window. Note that the name of the analysis file indicates the classification method and the number of unique category labels. After the user performs a classification analysis, another new file is generated, such as dow_jones_test_Naive Bayes_5_results.csv (image 7), which contains the confusion matrices and accuracy values for the training and test datasets. I generate this results file so that the user won't have to run my GUI again just to see the results of the analysis. I create separate analysis and results files because the former is used for projection.
(Image 5) (Image 6) (Image 7)
- Although I will discuss the results of my analysis in later sections, it is quite obvious from image 6 that the accuracy of the Naive Bayes analysis was very low. In an effort to improve the accuracy, I enabled my program to conduct PCA on the training set and project the test set onto the PCA-transformed space. To use this feature, the user needs to click the button 'Classify_PCA', and a dialog window (image 8) will pop up for selecting the files for the training and test datasets and the category labels. Note that this button only puts the training and test sets into PCA-transformed space. To conduct classification, the user still needs to click the button 'Perform Classify' and follow the steps described above (using the same files for category labels).
- When projecting the analysis, please note that the number of plotted data points may be smaller than the actual number. I modified my GUI so that when there are more than 200 data points, it randomly chooses 200 of them and plots only those.
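The 200-point plotting cap mentioned in the last step can be sketched as follows. This is only an illustration with numpy, not the code in display.py; the function name subsample_for_plot and the made-up demo array are assumptions.

```python
import numpy as np

MAX_POINTS = 200  # plotting cap described in the text


def subsample_for_plot(points, max_points=MAX_POINTS, seed=None):
    """Return at most max_points rows, chosen at random without replacement."""
    points = np.asarray(points)
    if len(points) <= max_points:
        return points  # small data sets are plotted in full
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(points), size=max_points, replace=False)
    return points[np.sort(keep)]  # keep the original row order among chosen rows


# 225 made-up rows, about the size of a 30% test split of 750 rows.
demo = np.arange(225 * 2).reshape(225, 2)
plotted = subsample_for_plot(demo, seed=0)
print(plotted.shape)  # (200, 2)
```

Because the rows are drawn at random, two projections of the same analysis file can look slightly different unless a fixed seed is used.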
The Dow Jones Index Data Set from the UCI Machine Learning Repository contains weekly data for the Dow Jones Industrial Index. The index tracks the stock prices of 30 large, public companies based in the U.S. In the dataset, each row holds one week of data for one stock. There are 750 data points (rows), which is not large for machine learning but not too small either. The weekly data was recorded over the first and second quarters of 2011 (January to June) and covers the following companies:
- 3M MMM
- American Express AXP
- Alcoa AA
- AT&T T
- Bank of America BAC
- Boeing BA
- Caterpillar CAT
- Chevron CVX
- Cisco Systems CSCO
- Coca-Cola KO
- DuPont DD
- ExxonMobil XOM
- General Electric GE
- Hewlett-Packard HPQ
- The Home Depot HD
- Intel INTC
- IBM IBM
- Johnson & Johnson JNJ
- JPMorgan Chase JPM
- Kraft KRFT
- McDonald's MCD
- Merck MRK
- Microsoft MSFT
- Pfizer PFE
- Procter & Gamble PG
- Travelers TRV
- United Technologies UTX
- Verizon VZ
- Wal-Mart WMT
- Walt Disney DIS
The dataset contains 16 attributes that can be used for different purposes. For this project, I used six of those attributes as independent variables to classify the data into 5 category labels, which were generated from the dependent variable, percent_change_next_weeks_price. Due to my limited ability and lack of knowledge in data cleaning and machine learning, I was unable to use the independent variables to determine which stock will produce the greatest rate of return next week. Instead, I decided to explore whether 'machine learning methods can classify the stocks into sub-categories and thus predict their performance in the following week based on some data from the previous week.' In short, I found that my machine learning methods and dimension reduction (PCA) failed to classify the stocks accurately given the data. I will discuss these results in more detail later.
The dataset used for this project is the Dow Jones Index Data Set from the UCI Machine Learning Repository. It originally has the following 16 features:
- quarter: the yearly quarter (1 = Jan-Mar; 2 = Apr-Jun)
- stock: the stock symbol (see above)
- date: the last business day of the week (this is typically a Friday)
- open: the price of the stock at the beginning of the week
- high: the highest price of the stock during the week
- low: the lowest price of the stock during the week
- close: the price of the stock at the end of the week
- volume: the number of shares of stock that traded hands in the week
- percent_change_price: the percentage change in price throughout the week
- percent_change_volume_over_last_wk: the percentage change in the number of shares of stock that traded hands for this week compared to the previous week
- previous_weeks_volume: the number of shares of stock that traded hands in the previous week
- next_weeks_open: the opening price of the stock in the following week
- next_weeks_close: the closing price of the stock in the following week
- percent_change_next_weeks_price: the percentage change in price of the stock in the following week
- days_to_next_dividend: the number of days until the next dividend
- percent_return_next_dividend: the percentage of return on the next dividend
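The percentage-change features in this list can be illustrated with a small sketch. It assumes that the change is measured from the opening price to the closing price, relative to the opening price; the list above only describes these features in words, so this reading is an assumption. The function name and the prices are made up for illustration and are not rows from the dataset.

```python
def percent_change(start_price, end_price):
    """Percentage change from start_price to end_price, relative to start_price."""
    return (end_price - start_price) / start_price * 100.0


# Made-up prices for illustration only.
week_open, week_close = 20.00, 21.00   # this week
next_open, next_close = 21.00, 20.58   # the following week

# Under the stated assumption, these would correspond to
# percent_change_price and percent_change_next_weeks_price.
print(round(percent_change(week_open, week_close), 2))   # 5.0
print(round(percent_change(next_open, next_close), 2))   # -2.0
```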
The original research (see the references at the bottom) used percent_change_price, percent_change_volume_over_last_wk, days_to_next_dividend, and percent_return_next_dividend to determine which stock will yield the highest rate of return in the following week. Unfortunately, I could not use all of these features for the reasons mentioned in the abstract, and there were some missing values that I was not sure how to deal with. Thus, I modified the dataset according to the objective of my analysis. This involved eliminating specific attributes and enabling my program to read in values with a $ sign. The main tasks to achieve my goal were 1) create category labels from the dependent variable, 2) split the data set and the newly generated category labels into training and test sets appropriately, 3) modify several related methods in display.py, analysis.py, and data.py in order to conduct classification and project it correctly, and 4) write a new method that performs PCA on the training and test sets for classification.
- I modified the original dataset (image 9, named dow_jones_index.csv) using Excel and saved the result as dow_jones.csv (image 10). In detail, I eliminated the following columns (attributes): quarter, stock, date, percent_change_volume_over_last_wk, previous_weeks_volume, next_weeks_open, next_weeks_close, days_to_next_dividend, and percent_return_next_dividend. The first three were removed because I did not know how to read and interpret the data (especially the string and date types) as independent variables. The next two were deleted because they had missing values at consistent intervals in the first-quarter data. Although they were missing for a reason (data over the last week cannot be gathered in the first week of data collection), I removed them for simplicity of my analysis. The last four were removed because I wanted to focus on just percent_change_next_weeks_price as the dependent variable for category labels.
(Image 9) (Image 10)
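I did this cleaning by hand in Excel. For readers who prefer a script, the same idea can be sketched with Python's csv module: keep the seven remaining columns and strip the $ sign from price fields. The names KEEP, to_float, and clean_rows, the ticker XYZ, and the two sample rows are all made up for this sketch, and the sample header is abbreviated.

```python
import csv
import io

# The six independent variables plus the dependent variable (last).
KEEP = ["open", "high", "low", "close", "volume",
        "percent_change_price", "percent_change_next_weeks_price"]


def to_float(text):
    """Parse a numeric field, tolerating a leading '$' and thousands separators."""
    return float(text.strip().lstrip("$").replace(",", ""))


def clean_rows(csv_file):
    """Yield one list of floats per row, restricted to the KEEP columns."""
    for row in csv.DictReader(csv_file):
        yield [to_float(row[name]) for name in KEEP]


# Two made-up rows in the style of dow_jones_index.csv.
sample = io.StringIO(
    "quarter,stock,date,open,high,low,close,volume,percent_change_price,"
    "percent_change_next_weeks_price\n"
    "1,XYZ,1/7/2011,$10.00,$10.50,$9.80,$10.20,1000000,2.0,-1.5\n"
    "1,XYZ,1/14/2011,$10.20,$10.90,$10.10,$10.80,1200000,5.88,0.4\n"
)
cleaned = list(clean_rows(sample))
print(cleaned[0])
```

Selecting columns by name rather than by position keeps the dependent variable in the last column regardless of how the source file is ordered.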
- After cleaning the dataset, I needed to create category labels for classification (supervised learning). To do so, I read in the last column of dow_jones.csv, which holds percent_change_next_weeks_price, and calculated its 20th, 40th, 60th, and 80th percentile values with a numpy function to divide the variable into five category labels. I assigned the numbers 1 to 5 in the following order: below the 20th percentile, below the 40th, below the 60th, below the 80th, and above the 80th. I decided to create five labels to test whether machine learning algorithms can predict more than just positive and negative rates of return.
- After creating the labels, I then needed to split the data set and category labels into training and test sets for classification analysis. This step required me to write new code. The pre-existing feature, the 'Split Data' button, would not work for this data set because it has a separate file of category labels. Thus, I created a new button, 'Split Dow-Jones', and a new method that uses scikit-learn's split method (train: 70%, test: 30%) with the dataset and the category labels as parameters. The split method for the Dow Jones Index Data Set produces four csv files: the training and test sets and the corresponding category labels. The files of category labels have 'Y' in their names (see image 11 for reference).
- Once the split method (splitDowJones() in display.py) was done, I had to modify the classify() method in analysis.py for two reasons: 1) to let the user select specific headers for classification analysis so that the dependent variable won't be part of the analysis, and 2) to write a results file that contains the confusion matrices and accuracy values, in addition to the analysis file that contains the values of the test set along with the corresponding category labels. While changing the classify() method involved a tremendous amount of testing and debugging, the biggest challenge was reading in the data properly. After a struggle, I learned to pass in the encoding parameter when reading in a dataset in data.py's read() method.
- When I finished changing the classify() method, I was ready to conduct classification analyses. For this analysis, I used two supervised learning methods: Naive Bayes and KNN. I did not need to customize these two classifiers for my dataset. After using them and seeing from the confusion matrices that the accuracy values were very low, I decided to create a method to conduct PCA on the training and test sets. This involved creating a new button and method in display.py and a pcaClassify() method in analysis.py. pcaClassify() conducts PCA on the training set first, applies the same transformation to the test set, and projects the test set into the PCA-transformed space of the training set.
- After enabling PCA for classification analysis, I used the Naive Bayes and KNN algorithms again and compared the results to observe the effect of PCA on the accuracy. Lastly, I modified the projectClassify() method of display.py so that it reads in the data and category labels from the analysis file to project the classification.
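The labeling and splitting steps above can be sketched as follows. This is a minimal illustration, not the code behind the 'Create Dow-Jones Labels' and 'Split Dow-Jones' buttons: the function names are made up, random numbers stand in for the real columns, and the split is written with plain numpy as a stand-in for scikit-learn's split method. How values exactly on a percentile boundary are labeled is also an assumption.

```python
import numpy as np


def percentile_labels(y, cuts=(20, 40, 60, 80)):
    """Map each value of y to a label 1..5 by its percentile band."""
    edges = np.percentile(y, cuts)  # four cut points
    # side="right" counts how many edges are <= the value (0..4), so a value
    # strictly below the 20th percentile gets 1 and one above the 80th gets 5.
    return np.searchsorted(edges, y, side="right") + 1


def split_70_30(X, y, train_fraction=0.7, seed=0):
    """Shuffle the rows once and cut them into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train, test = order[:n_train], order[n_train:]
    return X[train], X[test], y[train], y[test]


rng = np.random.default_rng(1)
X = rng.normal(size=(750, 6))   # stand-in for the six independent variables
target = rng.normal(size=750)   # stand-in for percent_change_next_weeks_price
labels = percentile_labels(target)

X_train, X_test, y_train, y_test = split_70_30(X, labels)
print(np.bincount(labels)[1:])        # label counts: the five bands are equal in size
print(X_train.shape, X_test.shape)    # (525, 6) (225, 6)
```

Cutting at percentiles makes the five classes equally frequent, which is why random guessing would be right about 20% of the time.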
When discussing the results for the datasets without PCA, I will present the images of the confusion matrices and accuracy values for both training and test sets and summarize the results. For the PCA-transformed sets, I will present the projected data sets as well.
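For reference, this is how a confusion matrix and its accuracy value can be computed. It is a sketch under the assumption that rows hold the true labels and columns the predicted labels; the layout in my results files may differ, and the labels below are made up.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, n_classes=5):
    """Rows are true labels, columns are predicted labels (labels run 1..n_classes)."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t - 1, p - 1] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. classified correctly."""
    return np.trace(matrix) / matrix.sum()


true = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]   # made-up true labels
pred = [1, 2, 2, 4, 1, 3, 2, 3, 5, 5]   # made-up predictions
cm = confusion_matrix(true, pred)
print(cm)
print(accuracy(cm))   # 0.6, since 6 of the 10 predictions match
```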
- Naive Bayes: Note that the accuracy for the test set is lower than the accuracy for the training set (results 1). This indicates that Naive Bayes did not do a good job of learning from the training data and predicting the dependent variable. However, the test accuracy, 0.22, is slightly better than predicting the label at random (which would be correct with probability 0.2 for five equally sized classes).
- KNN (Default): Again, results 2 shows that the accuracy for the test set is lower than that for the training set. Here the gap is striking: the training accuracy is much higher than the test accuracy. This may indicate overfitting.
- KNN (K=10, Neighbors=3, Distance=L2): According to results 3, changing the number of exemplar points increased the accuracy of the test set's confusion matrix by 0.02 while decreasing that of the training set's confusion matrix by nearly half. Regardless, the KNN algorithm is not doing a good job of learning from the training set and predicting the dependent variable. (Results 3)
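The gap between training and test accuracy for KNN can be reproduced with a small experiment. The sketch below is a plain nearest-neighbor classifier that keeps every training point, which may differ from the exemplar-based KNN in analysis.py. The data and labels are pure noise, made up to show what happens when the features carry no information about the labels.

```python
import numpy as np


def knn_predict(train_X, train_y, query_X, n_neighbors=1):
    """Majority vote among the n_neighbors nearest training points (L2 distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((query_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


rng = np.random.default_rng(0)
train_X = rng.normal(size=(300, 6))
test_X = rng.normal(size=(100, 6))
train_y = rng.integers(1, 6, size=300)   # random labels 1..5, unrelated to the features
test_y = rng.integers(1, 6, size=100)

train_acc = (knn_predict(train_X, train_y, train_X) == train_y).mean()
test_acc = (knn_predict(train_X, train_y, test_X) == test_y).mean()
print(train_acc)   # 1.0: every training point is its own nearest neighbor
print(test_acc)    # close to chance (0.2), since the labels carry no signal
```

A high training accuracy from a neighbor-based classifier therefore says little by itself; only the test accuracy shows whether anything was learned.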
Naive Bayes and KNN yielded relatively high accuracy on previous datasets, such as Iris and Wine in the past project, but were not effective here on their own. I therefore conducted PCA on the training and test sets and performed classification again to observe any changes.
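The idea of fitting PCA on the training set and then projecting the test set with the same transformation can be sketched in numpy as follows. This is not the pcaClassify() method itself: the function names are made up, random data stands in for the real sets, and standardizing each column with the training mean and standard deviation is an assumption of this sketch.

```python
import numpy as np


def fit_pca(train_X, n_components=2):
    """Learn mean, std, and principal directions from the training set only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0, ddof=1)
    Z = (train_X - mean) / std
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]


def project(X, mean, std, components):
    """Apply the training-set normalization and rotation to any data set."""
    return ((X - mean) / std) @ components.T


rng = np.random.default_rng(0)
scales = np.array([1, 5, 1, 1, 100, 1])               # columns on very different scales
train_X = rng.normal(size=(525, 6)) * scales
test_X = rng.normal(size=(225, 6)) * scales

mean, std, components = fit_pca(train_X)
train_P = project(train_X, mean, std, components)
test_P = project(test_X, mean, std, components)       # no statistics taken from the test set
print(train_P.shape, test_P.shape)                    # (525, 2) (225, 2)
```

Reusing the training statistics for the test set keeps both sets in the same coordinate system, so a classifier trained on train_P can be applied to test_P directly.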
- PCA_Naive Bayes: Results 4 clearly indicates that the PCA transformation had barely any effect on the accuracy of classification. Compare the test set's accuracy in results 4 and 1 and note that they are the same. Plot 1 shows the projection onto the two most significant eigenvectors. Given that the plotted points are dominated by two labels, we can infer that the classification does not reflect the true labels. (Results 4) (Plot 1)
- PCA_KNN (Default): Results 5 shows no improvement over results 2 in terms of accuracy. Moreover, the accuracy for the training set is high, which is similar to results 2. Perhaps the KNN algorithm is prone to overfitting here. Although plot 2 shows a more even distribution of labels, the labels show no pattern. (Results 5) (Plot 2)
- PCA_KNN (K=10, Neighbors=3, Distance=L2): Results 6 and plot 3 show essentially no improvement over the previous results and plots. Unfortunately, Naive Bayes and KNN, even combined with PCA, do not appear to be effective tools for predicting the performance of stock prices. (Results 6) (Plot 3)
First of all, my results and plots clearly demonstrate my failure to predict the dependent variable percent_change_next_weeks_price (more precisely, the category labels derived from it) despite using two different machine learning algorithms (Naive Bayes and KNN) and dimension reduction (PCA). To be concise and clear, I suggest the following list of possible reasons for this failure.
- Inappropriate data cleaning
: Before performing classification analysis on my dataset, I removed several independent variables from the data. Perhaps those eliminated attributes play significant roles in predicting the dependent variable percent_change_next_weeks_price. In other words, I may have needed to figure out how to read in the 'string' and 'date' types of data and interpret them properly to answer the primary question of this project. Given the low accuracy values throughout my analyses, I possibly made my model too simple by reducing the number of input features, which decreased its ability to describe the target feature.
- Improper train/test splitting
: Although I split the data into a 70% training set and a 30% test set using scikit-learn's splitting method (whose own default, for train_test_split, is 75%/25%), I may have needed to train on 90% and test on 10% so that the algorithms would have had more data to learn from.
- Implement other machine learning methods
: Since Naive Bayes and KNN proved to be ineffective algorithms for predicting the stocks' performance, I should have considered using other algorithms, such as decision trees and neural networks, instead.
- Use unsupervised learning
: After realizing that supervised learning may not be appropriate for this data set, I used the clustering feature of my GUI with the modified data set to see if unsupervised learning would perform better. Plot 4 shows a good clustering of the data set. Perhaps clustering, rather than classification, is more suitable for this data set.
- Fundamental challenge in predicting the stock market
: While my analysis had many other issues, this may be the biggest factor in my failure to predict the dependent variable. The stock market is influenced by many factors, and it would require a highly sophisticated model to predict its performance. Given the results from the original research (the referenced paper), it seems possible to build such a model that accurately predicts a stock's rate of return in the future.
Despite the excitement and hope for successful results while planning this final project, I observed that some machine learning algorithms, such as Naive Bayes and KNN, may not be effective tools for predicting the performance of stocks.
Using a dimension reduction method, such as PCA, was also a futile attempt to improve the accuracy of my model. This stands out because PCA was effective in past projects at improving the performance of different algorithms, such as clustering, on the previous datasets (Iris and Wine). More analysis is needed before drawing conclusions about the effectiveness of PCA.
In summary, this project illustrated that machine learning algorithms do not work equally well for every kind of problem and that particular data sets require specific algorithms and data cleaning approaches to produce successful results.
Although I worked alone on this project, professor Taylor and Maxwell gave vital help in understanding the data and how to use it to build a model.
Brown, M. S., Pelosi, M. & Dirska, H. (2013). Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks. Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41.