AI Project #2: Learning the Market

Due Monday 2/11/13, 11:59 PM

In this project, you will be using Adaboost and decision stumps to discover connections between different pieces of the stock market. Specifically, you will see how well you can predict whether the NASDAQ Composite Index went up or down on a given day, given the performance of the 30 stocks in the Dow Jones Industrial Average.

To assist you in this project, I have located and massaged (somewhat!) the data you will need. From here I obtained daily data for the stocks in the S&P 500 (which is a superset of the Dow stocks). That file, whose format is described on the linked page, is here [Caution! 5MB text file! Right-click and Save-as!]. Note that you will need to filter it down (either by hand or in a program somewhere) to just the 30 stocks you are allowed to use; the 30 stocks currently in the Dow are listed here. For the output, I got historical data for the NASDAQ Composite from here; I have modified the date format to match the other file, and you can get the modified file here. Note that this file does not directly contain the value you need to learn (i.e., whether the index went up or down between the open and the close), but that can easily be determined from the given data.
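One way to start on that preprocessing is sketched below in Python. Everything about the file layout here is an assumption (comma-separated, with date, ticker, open, and close columns), since the real format is described on the linked pages; the ticker set is a placeholder you would replace with the actual 30 Dow symbols.

    import csv

    # Placeholder only: substitute the real 30 Dow ticker symbols from the linked list.
    DOW_TICKERS = {"AA", "AXP", "BA"}

    def load_dow_rows(sp500_path):
        # Keep only the rows for the 30 Dow stocks from the S&P 500 file.
        # Assumed columns: date, ticker, open, close -- adjust to the real layout.
        rows = []
        with open(sp500_path, newline="") as f:
            for date, ticker, open_px, close_px in csv.reader(f):
                if ticker in DOW_TICKERS:
                    rows.append((date, ticker, float(open_px), float(close_px)))
        return rows

    def nasdaq_labels(nasdaq_path):
        # Label each day +1 if the NASDAQ Composite closed above its open, else -1.
        # Assumed columns: date, open, close -- adjust to the real layout.
        labels = {}
        with open(nasdaq_path, newline="") as f:
            for date, open_px, close_px in csv.reader(f):
                labels[date] = 1 if float(close_px) > float(open_px) else -1
        return labels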

To learn the relationship, you should use Adaboost, with decision stumps (one-level decision trees) as your individual models. Note that the textbook has an error in its Adaboost pseudocode: z[k] should be log((1-error)/error). Also, the more common way of doing the weight update is to use the z value and update all of the weights, as follows: w[j] = w[j]*exp(-z*y(j)*h(j)), where y(j) is the correct class for example j, and h(j) is the class predicted for that example by the current model. If your two classes are represented by +1 and -1, this will increase the weight when the model is wrong (exp of a positive number) and decrease it when the model is right (exp of a negative number).
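To make the loop concrete, here is a minimal Python sketch of boosting decision stumps with that z and that weight update. The data representation is assumed (each example x is a vector with one value per Dow stock, and label y in {+1, -1}), and the stump search here only thresholds a single feature at zero, i.e. "did that stock go up or down"; the stump options you actually offer your learner can be richer.

    import math

    def stump_predict(stump, x):
        # A stump is (feature index, polarity); it predicts +1 or -1 from one feature.
        f, polarity = stump
        return polarity * (1 if x[f] > 0 else -1)

    def best_stump(xs, ys, w):
        # Return the stump with the lowest weighted error, plus that error.
        best, best_err = None, float("inf")
        for f in range(len(xs[0])):
            for polarity in (+1, -1):
                err = sum(wj for xj, yj, wj in zip(xs, ys, w)
                          if stump_predict((f, polarity), xj) != yj)
                if err < best_err:
                    best, best_err = (f, polarity), err
        return best, best_err

    def adaboost(xs, ys, rounds):
        n = len(xs)
        w = [1.0 / n] * n                 # start with uniform example weights
        ensemble = []                     # list of (stump, z) pairs
        for _ in range(rounds):
            stump, err = best_stump(xs, ys, w)
            err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0) or division by 0
            z = math.log((1 - err) / err)           # the corrected z from above
            # Misclassified examples get heavier, correctly classified ones lighter.
            w = [wj * math.exp(-z * yj * stump_predict(stump, xj))
                 for xj, yj, wj in zip(xs, ys, w)]
            total = sum(w)
            w = [wj / total for wj in w]            # renormalize to sum to 1
            ensemble.append((stump, z))
        return ensemble

    def classify(ensemble, x):
        # Weighted vote of all stumps; the sign gives the predicted class.
        vote = sum(z * stump_predict(stump, x) for stump, z in ensemble)
        return 1 if vote >= 0 else -1

The renormalization step is not part of the formula above, but it keeps the weights behaving as a distribution from round to round and is the usual practice.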

Your code must be written in C, C++, Java, or Python. Regardless of language, you must provide three entry points, each independently callable from the command line (whether as functions, classes, or executables):

In addition, you must provide a discussion of your training process. Explain what options you gave to your learner for creating different stumps, and how you decided which options worked best. What is the best single stump that you found, in terms of correct classification? What are the best results you got for collections of 3, 5, and 10 stumps? Did you get any improved results with more than 10 stumps?
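Because boosting is sequential, the first k rounds of a longer run are exactly the model you would have gotten by training for only k rounds, so one training pass can produce all of the numbers asked for above. A hypothetical sketch, reusing the adaboost/classify helpers from the earlier sketch, with train_x, train_y, test_x, test_y standing in for your own data split:

    def accuracy(ensemble, xs, ys):
        hits = sum(1 for x, y in zip(xs, ys) if classify(ensemble, x) == y)
        return hits / len(ys)

    ensemble = adaboost(train_x, train_y, rounds=25)   # train once with plenty of rounds
    for k in (1, 3, 5, 10, 25):
        # The first k (stump, z) pairs are the k-stump ensemble.
        print(k, accuracy(ensemble[:k], test_x, test_y))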

Submit all of your code, including any data files if you modified them (so that we can test your learner), via submit:

submit zjb-grd project2 all-file-names

Grading: