CS229 Machine Learning
Link to old page: http://wiki.hackerdojo.com/MachineLearning
Starting 4/22 at 7pm at Hacker Dojo.
This class is based on the Stanford CS229 material developed by Professor Andrew Ng; we have his permission to use the course materials.
We are trying some things differently to emphasize the work-related nature of the student population. We have sponsorship from Amazon for Elastic MapReduce (EMR) and AWS, so students can implement versions of the algorithms presented in class on a cluster. We should have something to report back to Professor Ng at the end of the class. We have a wide variety of people from industry; the goal is SHDH with some structure, so people can meet other people to do some cool machine learning projects. Free compute time.
The course videos are on YouTube, or they can be downloaded from this site. The assignments, handouts, and lecture notes are available from the course website: http://www.stanford.edu/class/cs229/
We will meet once a week for ~10 weeks to discuss the lecture material and problem sets.
We also have volunteers willing to lead and teach the class: people who have a background in this area and who have taken the class before.
Please sign up in advance. We are limiting enrollment because of limited resources (time of volunteer instructors).
Volunteer Instructors:
Mike Bowles: http://www.linkedin.com/in/mikebowles
Patricia Hoffman, PhD
The first meeting on 4/22 will cover administrative details, HW1, and a review of Lecture 1 from the CS229 YouTube videos.
http://www.youtube.com/results?search_query=stanford+cs229&search_type=&aq=1m&oq=cs229
Lecture 1: http://www.youtube.com/watch?v=UzxYlbK2c7E (useless, skip it)
Lecture 2: http://www.youtube.com/watch?v=5u4G23_OohI
Lecture 3: http://www.youtube.com/watch?v=HZ4cvaztQEs
Lecture 4: http://www.youtube.com/watch?v=nLKOQfKLUks
Lecture 5: http://www.youtube.com/watch?v=qRJ3GKMOFrE
Lecture 6: http://www.youtube.com/watch?v=qyyJKd-zXRE
Lecture 7: http://www.youtube.com/watch?v=s8B4A5ubw6c&feature=channel (SVMs)
Lecture 8: http://www.youtube.com/watch?v=bUv9bfMPMb4&feature=channel (SVMs)
Lecture 9: http://www.youtube.com/watch?v=tojaGtMPo5U&feature=PlayList&p=A89DCFA6ADACE599&playnext_from=PL (SVMs)
CS229 lectures
Stanford Online - 9 21 2009.rm, Lecture 1
Stanford Online - 9 23 2009.rm, Lecture 2
Stanford Online - 9 25 2009.rm, PS1 Linear Algebra Review
Stanford Online - 9 28 2009.rm, Lecture 3
Stanford Online - 9 30 2009.rm, Lecture 4
Stanford Online - 10 5 2009.rm, Lecture 5
Stanford Online - 10 7 2009.rm, Lecture 6
Stanford Online - 10 12 2009.rm, Lecture 7 (this one is truncated; you can replace it with YouTube Lectures 7-9 listed above)
Stanford Online - 10 14 2009.rm, Lecture 8
Stanford Online - 10 19 2009.rm
Stanford Online - 10 21 2009.rm
Stanford Online - 10 26 2009.rm
Stanford Online - 10 28 2009.rm
Stanford Online - 10 31 2009.rm
Stanford Online - 11 2 2009.rm
Stanford Online - 11 4 2009.rm
Stanford Online - 11 9 2009.rm
Stanford Online - 11 11 2009.rm
Stanford Online - 11 16 2009.rm
Stanford Online - 11 18 2009.rm
Stanford Online - 11 30 2009.rm
Stanford Online - 12 2 2009.rm
4/21/2010: 20 people signed up
HW #1 Notes:
To install Octave under Windows, you don't need to download additional packages: install Cygwin for Windows and check both the Octave and Gnuplot packages under Math when running Cygwin's setup.exe.
If Octave doesn't work for you (as it didn't for me), try R: http://cran.r-project.org/
Public CS229 course page: http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
Past hw1: http://see.stanford.edu/materials/aimlcs229/problemset1.pdf
Past Solutions: http://see.stanford.edu/materials/aimlcs229/ps1_solution.pdf
Past hw2: http://see.stanford.edu/materials/aimlcs229/problemset2.pdf
Past Solutions: http://see.stanford.edu/materials/aimlcs229/ps2_solution.pdf
HW1:
Problem 1a Solutions: Problem 1a.pdf
Problem 1b,c Solutions: cs229-public_hw1_1
Problem 1b,c & LWLR implementation in python: cs229-hw1_1b_py
"Public" 2a solution in matlab: cs229-public_hw1_2
Problem 2a,b Solutions: cs2292abc.pdf
Problem 2d Solutions (Matlab): cs229_hw1_2
Problem 3a,b,c Solutions: Problem 3abc.pdf
Converted Peter Harrington's cs229-public_hw1_1 to R: http://machinelearning123.pbworks.com/f/cs229_hw_1_R.R
I uploaded my XL solutions for Probs 1 & 2. I also uploaded a couple of small text files that explain how to make the spreadsheets work. If you've got any questions, send me an email: mike@mbowles.com.
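For anyone coding Problem 1's locally weighted linear regression (LWLR) themselves, here is a minimal NumPy sketch of the idea. This is a generic illustration, not any of the uploaded solutions, and it assumes X already carries an intercept column:

import numpy as np

def lwlr(x_query, X, y, tau=0.8):
    # Gaussian kernel weights: training points near x_query count heavily,
    # distant points barely count; tau is the bandwidth
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # weighted normal equations: theta = (X' W X)^{-1} X' W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# toy usage: 50 points on a sine curve, intercept column prepended
X = np.column_stack([np.ones(50), np.linspace(0, 10, 50)])
y = np.sin(X[:, 1])
print(lwlr(np.array([1.0, 5.0]), X, y, tau=0.5))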
HW2:
I uploaded a Python function for de-sparsifying the input matrix given by Professor Ng. I don't have Matlab, so I converted the Matlab de-sparsifier that Prof. Ng provides to Python. Others of you who don't have Matlab may find it handy.
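If you want to roll your own, the concept is simple in any language. Here is a hedged Python sketch; the triplet-per-line format and 1-based indexing below are assumptions for illustration, not the actual layout of Prof. Ng's data file:

import numpy as np

def desparsify(path, n_rows, n_cols):
    # expand a sparse (row, col, value) triplet file into a dense matrix
    M = np.zeros((n_rows, n_cols))
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or header lines
            i, j, v = fields
            M[int(i) - 1, int(j) - 1] = float(v)  # assumes 1-based indices
    return M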
You'll also find a single-sheet version of Platt's SMO algorithm in the uploads. In the fall, people seemed to have trouble with the simplified version given in class; I found this version easy to code, and it worked satisfactorily for me. -Mike Bowles
Python DeSparsifier for Prob Set 2.txt
smo-algo on a sheet.pdf
Using Mike's XL soln for Prob2.txt
Prob 2 soln.xls
data set 1 with solution 1.2.xls
Using Mike's XL soln for Prob 1.txt
Matlab Solution to cs229_hw2_3abc
R Solution using naiveBayes in R package e1071: cs229_homework_2_3
Matlab Solution to cs229_hw2_3de
Matlab solution to hw2 3de using SMO2
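Whichever SMO variant you use for training, prediction is the same: training yields multipliers alpha_i and a bias b, and the classifier only consults the support vectors. Here is a textbook sketch of the dual decision function in Python (generic, not taken from any of the uploads above):

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel; swap in np.dot(a, b) for a linear SVM
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, sv_X, sv_y, alphas, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the label is sign(f(x))
    k = np.array([kernel(xi, x) for xi in sv_X])
    return np.sign(np.dot(alphas * sv_y, k) + b)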
Patricia Hoffman has found a nice SVM applet:
http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
Generative and Discriminative Learning Notes
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
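The chapter's point in one line: naive Bayes is generative (model p(x|y) and p(y), then apply Bayes' rule), while logistic regression is discriminative (model p(y|x) directly). A tiny sketch of the two side by side, using scikit-learn purely for convenience (the library choice is ours, not the paper's):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# two Gaussian blobs, one per class
X = np.vstack([rng.randn(100, 2) + 2, rng.randn(100, 2) - 2])
y = np.array([1] * 100 + [0] * 100)

for model in (GaussianNB(), LogisticRegression()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))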
Amazon AWS/EMR Resources
Anything written by Jinesh Varia from Amazon. His documentation is extremely well written. He will be here to talk to the class on 6/17.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1633
Hadoop MR by Jinesh Varia:
http://developer.yahoo.net/blogs/theater/archives/2009/07/amazon_elastic_mapreduce.html
You have a choice: you can either use Amazon EMR (Elastic MapReduce), or you can run Hadoop yourself on AWS; see Cloudera.
EC2 Resources:
http://www.cs.washington.edu/education/courses/490h/08au/ec2.htm
HackerDojoAmazonHelloWorld.pdf
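The hello-world PDF above covers the cluster setup; for orientation, here is the shape of the canonical Hadoop Streaming job (word count) as two small Python scripts. This is a generic sketch, not the contents of that PDF. Hadoop pipes input lines through the mapper, sorts everything by key, and pipes the sorted stream through the reducer:

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: stdin arrives sorted by word, so counts for a word are adjacent
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

You can test the pair locally before paying for a cluster: cat input.txt | python mapper.py | sort | python reducer.py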
Map Reduce Assignments
Below is a list of 4 assignments for MapReduce. You can use either Amazon EMR or Hadoop MR for the assignments.
http://code.google.com/edu/submissions/uwspr2007_clustercourse/listing.html
http://code.google.com/edu/submissions/uwashington-scalable-systems/
The UW 490H class materials, 2008 are very good.
Assignment 1: Inverted Index: assignment1.pdf
Assignment 2: Run Page Rank on Wikipedia: assignment2.pdf
Assignment 3: Create a tiled series of rendered map images from public TIGER data: assignment3.pdf geosource.zip
Assignment 4: Push data from Assignment 3 onto Amazon EC2 and create servers to publish data. assignment4.pdf ec2source.zip
UC Berkeley Using Hadoop for Machine Learning
A lot of the Hadoop examples are written for older versions of Hadoop, or assume you run an older version (0.18 and 0.20 have different APIs). Cloudera's whole business is making Hadoop easy to use. They have some good free training videos here: http://www.cloudera.com/resources/?type=Training There is also a machine image you can download to experiment with Hadoop without installing it on your own system.
Doug Chang
doug.chang@hackerdojo.com
Mapreducable k-Nearest Neighbors
aka locality-sensitive hashing (LSH) for real vectors
Here are the slides from the talk I gave on June 10th: LSH_slides
Here is the paper (pdf): [A locality-sensitive hash for real vectors, SODA'10]
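For the general flavor (this is the standard random-hyperplane LSH sketch, not the specific hash from Tyler's paper): each hash bit records which side of a random hyperplane a vector falls on, so vectors separated by a small angle usually land in the same bucket, and exact k-NN is then run only within the query's bucket.

import numpy as np

def lsh_buckets(X, n_bits=8, seed=0):
    # one random hyperplane per bit; a bit is the sign of the projection
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_bits)
    bits = (X.dot(planes) >= 0).astype(np.int64)
    # pack each row of bits into a single integer bucket id
    return bits.dot(1 << np.arange(n_bits, dtype=np.int64))

# usage: hash data and query with the same seed, then search the bucket
X = np.random.randn(1000, 20)
q = np.random.randn(20)
codes = lsh_buckets(X)
q_code = lsh_buckets(q.reshape(1, -1))[0]
candidates = np.where(codes == q_code)[0]
# in practice several hash tables are used so the candidate set isn't empty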
Related links:
k-Nearest neighbors (k-NN) on wikipedia
Locality-sensitive hashes on wikipedia
Kevin Murphy's slides on k-NN (pdf)
- Tyler Neylon
tyler@bynomial.com
Machine Learning Challenges
Predictive Data Analysis (PDF)
KNIME Data Mining UI
Greased Lightnin' Talk
http://www.knime.org
7/07/10 Files
Stephens Project
Comments (18)
LanceNorskog said
at 4:38 pm on Jun 5, 2010
The first lines of code in cs229_hw_1_R.R are:
# CLEANUP R BEFORE STARTING
rm(list = ls())
May I request that contributed programs not start with the line "remove all files in the current directory" :)
stephen.oconnell said
at 11:10 pm on Jun 15, 2010
This is an R command which removes all the current R objects from the current working set. It doesn't remove any files from any directories.
For example:
> a <- 1
> b <- 2
> c <- 3
> ls()
[1] "a" "b" "c"
> rm(list = ls())
> ls()
character(0)
>
This is useful when starting a new analysis in R to make sure you have not carried any data from a prior session into the new analysis.
LanceNorskog said
at 1:08 pm on Jun 16, 2010
Apologies. Maybe I should learn R before spouting off.
mike@mbowles.com said
at 10:07 am on Jun 22, 2010
At the next class, we're going to consider what we should do when we grow up. So far, we've got 4 ideas for what the class might do next: 1. work a DM competition as a group; 2. select compact topics and cover them one at a time in two or three sessions (for example, using trees for regression); 3. work on a platform for trading securities; 4. some fraud detection projects.
I'll get someone to give a pitch on each of these things and we can consider them. If anyone would like to volunteer to pitch one of these, please speak up. Also, be thinking about these and about other potential topics that you would find interesting and/or useful.
ben.lipkowitz said
at 10:29 am on Jun 22, 2010
I'd like to learn about tree regression. (sorry, I don't know anything about it yet.)
I tried option #1 at noisebridge (ACM KDD 2010) and it wasn't very fun or helpful, just frustrating.
wroscoe said
at 10:37 pm on Jun 22, 2010
Compact topics that include some theory and an application could be great. How to make money wouldn't be a waste of time either.
This is a project I've been working on: daduce.com (you need a Gmail account, and testing the predictors won't work in IE). Let me know what you guys think.
fenn said
at 1:35 pm on Jun 24, 2010
I probably won't be at class tonight, so here are some more suggestions for compact topics from the peanut gallery:
how to use k-nearest neighbors, decision trees, boosting, bagging, ANN, HMM, MCMC, a Mahout tutorial
doug chang said
at 1:52 pm on Jun 24, 2010
Ideas:
We have $100 in credits which expire at the end of August. Here are some projects which don't require us to learn new material:
a) Implement autoscaling and spot instances in scripts for Mahout; hook it into Mahout.
b) Implement Tyler's algorithm and run it on a public data set.
c) Run the naive Bayes and SVM SMO on larger data sets, the ones in the public repository for spam filtering. Measure performance on larger data sets; does it scale linearly?
mike@mbowles.com said
at 12:45 pm on Jun 26, 2010
I created a new page, "Stephens CPU classification Prob". Let's use that to collect data, discussion, etc. about the problem that Stephen described at Thursday's meeting. I've uploaded the files, etc., that I have.
stephen.oconnell said
at 1:16 pm on Jun 29, 2010
Just came across this data mining competition that is underway right now: http://kaggle.com/informs2010 Here is an R solution: http://www.or-exchange.com/questions/492/ideas-for-the-informs-data-mining-contest
Thoughts on participating as a group? As noted the end result could be quite valuable to day traders...
This could make for a good discussion at a minimum, talking about how to approach a problem like this; strategy, algorithms, interpretation, iterative refinement, etc.
stephen.oconnell said
at 1:18 pm on Jun 29, 2010
I am at the Hadoop 2010 Summit and about to get two hours of machine learning using Hadoop, http://developer.yahoo.com/events/hadoopsummit2010/agenda.html, see the Research track. I'll try and summarize next week.
stephen.oconnell said
at 11:58 pm on Jul 1, 2010
Here is the paper on "Robust De-anonymization of Large Sparse Datasets", specifically on finding individuals in the Netflix dataset used in the 2006 Netflix recommendation competition.
http://userweb.cs.utexas.edu/%7Eshmat/shmat_oak08netflix.pdf
Peter Harrington said
at 8:37 am on Jul 4, 2010
I am very interested in this Kaggle INFORMS DATA MINING CONTEST and in doing a project as a group.
James Salsman said
at 3:33 pm on Jul 6, 2010
Will this class be covering category k-means, too? http://cran.at.r-project.org/web/packages/knncat/index.html
Stephen O'Connell said
at 5:41 am on Jul 9, 2010
Here is a link to the paper that apparently killed the netflix contest.
http://userweb.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
mike@mbowles.com said
at 2:27 pm on Jul 13, 2010
We're planning to meet this Thursday (July 15). We're going to consider what data mining competition we're going to enter. Stephen is going to make a pitch for the INFORMS data mining contest, and Will is going to pick another contest for us to consider (the KDD or ACM websites are good places to look). Oh baby, this sounds like fun to me!
wroscoe said
at 6:51 pm on Jul 15, 2010
I will not be able to make it today. I will be happy working on any contest with this group. The INFORMS contest would give us more time to develop a solution.
Stephen O'Connell said
at 10:04 am on Jul 21, 2010
Our future meetings have been approved and added to the schedule for Thursday evenings at 7:00pm. I have arranged for the deck as our meeting place.
I will not be able to attend this week, however, will look forward to seeing everyone next week.
Thanks,
Stephen...