AI Project #2: Learning the Market
Due Monday 2/11/13, 11:59 PM
In this project, you will use AdaBoost and decision stumps
to discover connections between different pieces of the stock
market. Specifically, you will see how well you can predict
whether the NASDAQ Composite Index went up or down on a given
day, given the performance of the 30 stocks in the Dow Jones
Industrial Average.
To assist you in this project, I have located and massaged
(somewhat!) the data you will need. From here I obtained daily data for the stocks in the S&P
500 (which is a superset of the Dow stocks). This file, which
is described on the page in the link given, is
here [Caution! 5MB text file!
Right-click and Save-as!]. Note that you will need to filter out
(either by hand or in a program somewhere) just the 30 stocks that you
are allowed to use. The 30 stocks currently in the Dow are
listed here. For the output, I got historical data for the NASDAQ
Composite from here. I have modified the date format to match the other
file, and you can get the modified file here. Note that
this file does not include the exact data you need to learn
(i.e. whether it went up or down between the open and the close) but
it can be easily determined from the given data.
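As a minimal sketch of that last step, the label for a day can be computed from its open and close values. The function name and the sample prices below are made up for illustration, and treating an unchanged day as "down" is an assumption, since the assignment does not say how to handle ties.

```python
def nasdaq_direction(open_price: float, close_price: float) -> int:
    """Return +1 if the index closed above its open, otherwise -1."""
    # Assumption: a day that closes exactly at its open counts as "down".
    return 1 if close_price > open_price else -1


# Made-up prices, only to show the call.
print(nasdaq_direction(3100.0, 3125.5))
print(nasdaq_direction(3125.5, 3100.0))
```

Using +1 and -1 as the two class values keeps the labels compatible with the AdaBoost weight update described in this handout.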
To learn the relationship, you should use AdaBoost, with decision
stumps (one-level decision trees) as your individual models.
Note that the textbook has an error in its Adaboost pseudocode:
z[k] should be log((1-error)/error). Also, the more
common way of doing the weight update is using the z value and
updating all weights, as follows: w[j] =
w[j]*exp(-z*y(j)*h(j)), where
y(j) is the correct class for example j, and h(j) is
the predicted class for example j under the current model. If your
two classes are represented by +1 and -1 and z is positive (that is,
the error is below 0.5), this will increase the weight when the model
is wrong (exp of a positive number) and decrease it when the model is
right (exp of a negative number).
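The formulas above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference solution: the stump representation (a feature index, a threshold, and a polarity), the exhaustive threshold search, the clamping of the error, the renormalization of the weights after each round, and the toy data are all choices made here, not requirements of the assignment. The sketch uses z and the weight update exactly as written above; some presentations of AdaBoost put a factor of 1/2 in z when using this symmetric update.

```python
import math


def stump_predict(stump, x):
    """A stump is (feature_index, threshold, polarity); it returns +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wj for xj, yj, wj in zip(X, y, w)
                          if stump_predict(stump, xj) != yj)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds):
    """Return a list of (z, stump) pairs learned from labels in {+1, -1}."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    model = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        # Clamp the error so the log stays finite when a stump is perfect.
        err = min(max(err, 1e-10), 1 - 1e-10)
        z = math.log((1 - err) / err)
        # Weight update from the text: w[j] = w[j] * exp(-z * y(j) * h(j)).
        w = [wj * math.exp(-z * yj * stump_predict(stump, xj))
             for xj, yj, wj in zip(X, y, w)]
        total = sum(w)
        w = [wj / total for wj in w]  # renormalize so the weights sum to 1
        model.append((z, stump))
    return model


def predict(model, x):
    """Weighted vote of all stumps; a tied vote goes to +1."""
    score = sum(z * stump_predict(stump, x) for z, stump in model)
    return 1 if score >= 0 else -1


# Toy data (made up): two features per example, labels in {+1, -1}.
X_toy = [[0.5, -1.0], [1.2, 0.3], [-0.7, 0.8], [-1.5, -0.2]]
y_toy = [1, 1, -1, -1]
toy_model = adaboost(X_toy, y_toy, rounds=3)
print([predict(toy_model, x) for x in X_toy])
```

In the real project each example would presumably be one trading day, with features derived from the 30 Dow stocks; which features and thresholds to allow is exactly the design question the discussion section asks about.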
Your code must be written in C, C++, Java, or Python. Regardless of
language, you must provide three entry points (independently callable
from the command line, whether functions, classes or executables):
- learn should not take any parameters, but simply
read in data files (you may provide massaged data files) and
produce a set of weighted stumps. Please provide some sort of
human-readable output for the stumps.
- testtraining should not take any parameters, but should show
the results of your best set of stumps on the training data
(the year of data given above). Note that this must not do any
learning, just read in some previously learned best stumps, either
from a file or hard-coded values in the program.
- testoneday should take one parameter. This parameter
will be the name of a file with stock information for one day.
The file will consist of 30 lines in the same format as the S&P
data, one for each of the Dow stocks. This function should use
your best stumps to give a prediction for the NASDAQ Composite (up
or down) for that day, again doing no learning.
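One possible way to separate learning from testing is to have learn write its weighted stumps to a file that the other two entry points read back. The sketch below assumes a hypothetical stump representation (a vote weight z plus a feature index, threshold, and polarity), a JSON file layout, and placeholder feature names; none of these are prescribed by the assignment, and parsing of the actual data files is left out because it depends on their format.

```python
import json
import os
import tempfile


def save_stumps(model, path):
    """Write (z, (feature, threshold, polarity)) pairs as readable JSON."""
    records = [{"z": z, "feature": f, "threshold": t, "polarity": p}
               for z, (f, t, p) in model]
    with open(path, "w") as fh:
        json.dump(records, fh, indent=2)


def load_stumps(path):
    """Read the stumps back in the same (z, stump) form; no learning here."""
    with open(path) as fh:
        return [(r["z"], (r["feature"], r["threshold"], r["polarity"]))
                for r in json.load(fh)]


def describe(model, names):
    """Produce one human-readable line per weighted stump."""
    lines = []
    for z, (f, t, p) in model:
        up, down = ("up", "down") if p == 1 else ("down", "up")
        lines.append(
            f"weight {z:.3f}: if {names[f]} > {t} predict {up}, else {down}")
    return lines


def vote(model, x):
    """Weighted vote of the stored stumps; a tied vote goes to 'up'."""
    score = sum(z * (p if x[f] > t else -p) for z, (f, t, p) in model)
    return "up" if score >= 0 else "down"


# Made-up stumps and feature values, only to show the round trip.
model = [(1.2, (0, 0.0, 1)), (0.4, (1, 0.5, -1))]
path = os.path.join(tempfile.mkdtemp(), "stumps.json")
save_stumps(model, path)
restored = load_stumps(path)
for line in describe(restored, ["stock_0", "stock_1"]):
    print(line)
print(vote(restored, [0.3, 0.9]))
```

With this split, testtraining and testoneday would only call something like load_stumps and vote, which makes it easy to confirm that they do no learning.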
In addition, you must provide a discussion of your training
process. Explain what options you gave to your learner in terms
of creating different stumps, and how you decided what options
worked best. What is the best single stump that you found, in
terms of correct classification? What is the best result you
got for a collection of 3, 5 and 10 stumps? Did you get any
improved results with more than 10 stumps?
Submit all of your code, including any data files if you modified
them (so that we can test your learner), via submit:
submit zjb-grd project2 all-file-names
Grading:
- Decision stump implementation: 40 points
- AdaBoost implementation: 25 points
- Implementation of all three entry points: 5 points
- Success rate on training data: 9 points
- Success rate on the 3 days following the submission: 6 points
- Discussion: 15 points