cs229_hw2_3abc

Page history last edited by peter.harrington 12 years, 10 months ago

Below are two Matlab files for the Bayesian classifier used in problem 3abc.

 

*********************

nb_train.m

*********************

 

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN.1400');

 

trainMatrix = full(spmatrix);

numTrainDocs = size(trainMatrix, 1);

numTokens = size(trainMatrix, 2);

 

% trainMatrix is now a (numTrainDocs x numTokens) matrix.

% Each row represents a unique document (email).

% The j-th column of row i gives the number of times the j-th

% token appeared in email i.

 

% tokenlist is a long string containing the list of all tokens (words).

% Each token can be identified by its position in the file TOKENS_LIST.

 

% trainCategory is a (numTrainDocs x 1) vector containing the true

% classifications for the documents just read in. The i-th entry gives the

% correct class for the i-th email (which corresponds to the i-th row in

% the document word matrix).

 

% Spam documents are indicated as class 1, and non-spam as class 0.

% Note that for the SVM, you would want to convert these to +1 and -1.

 

%Peter Harrington HW2 #3

%Our job is to calculate the probability of a class (spam or not spam here)

%we need to calculate three things in order to make a classifier

%1. the probability of spam

%2. the prob. of token j appearing given the document is spam

%3. the prob. of token j appearing given the document is not spam

 

%to calculate p(spam) we can sum up the 1's in trainCategory

p_spam = sum(trainCategory)/numTrainDocs;

 

%to complete steps 2&3 we need vectors to hold the probabilities

p_0_num = ones(1,numTokens);        % to use Laplace smoothing we initialize

p_1_num = ones(1,numTokens);        % each numerator count to 1

p_0_denom = numTokens;              % and each denominator to the number of

p_1_denom = numTokens;              % values the RV can take on

% in this multinomial event model the RV is the token at a word position,

% so it takes numTokens values; a 2-valued (Bernoulli) RV would use 2

 

%we loop through the m documents; each row update covers all tokens at once

for i=1:numTrainDocs

    if (trainCategory(i) == 1)

        p_1_num = p_1_num + trainMatrix(i,:);

        p_1_denom = p_1_denom + sum(trainMatrix(i,:));

    else

        p_0_num = p_0_num + trainMatrix(i,:);

        p_0_denom = p_0_denom + sum(trainMatrix(i,:));

    end

end

p_0 = log(p_0_num / p_0_denom);      %vector of length(numTokens)

p_1 = log(p_1_num / p_1_denom);

%the log is used to prevent underflow: rather than taking the product

%of the probabilities, we will take the sum of the logs of the probs
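
The training loop above can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy matrix (4 documents, 3 tokens), not the submitted solution. It assumes the multinomial event model and adds numTokens to each denominator (one pseudo-count per token), so that the smoothed token probabilities in each class sum to 1. The variable isSpam and the toy data are introduced here for illustration only.

```matlab
% Toy data (hypothetical): each row is a document, each column a token count.
trainMatrix   = [2 0 1; 0 3 0; 1 0 2; 0 1 1];
trainCategory = [1; 0; 1; 0];            % 1 = spam, 0 = non-spam
numTokens     = size(trainMatrix, 2);

isSpam = (trainCategory == 1);           % logical mask selecting spam rows

% Prior probability of spam: the fraction of training documents labeled 1.
p_spam = mean(isSpam);

% Per-token counts within each class, with add-one (Laplace) smoothing.
p_1_num = 1 + sum(trainMatrix(isSpam, :), 1);
p_0_num = 1 + sum(trainMatrix(~isSpam, :), 1);

% Total token count within each class, plus one pseudo-count per token
% (assumption of this sketch: multinomial event model).
p_1_denom = numTokens + sum(sum(trainMatrix(isSpam, :)));
p_0_denom = numTokens + sum(sum(trainMatrix(~isSpam, :)));

% Log-probabilities, stored as row vectors of length numTokens.
p_1 = log(p_1_num / p_1_denom);
p_0 = log(p_0_num / p_0_denom);
```

A quick sanity check on any smoothed estimate of this form is that sum(exp(p_1)) and sum(exp(p_0)) both come out to 1, up to rounding.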

 

*********************

nb_test.m

*********************

 

[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

 

testMatrix = full(spmatrix);

numTestDocs = size(testMatrix, 1);

numTokens = size(testMatrix, 2);

 

% Assume nb_train.m has just been executed, and all the parameters computed/needed

% by your classifier are in memory through that execution. You can also assume

% that the columns in the test set are arranged in exactly the same way as for the

% training set (i.e., the j-th column represents the same token in the test data

% matrix as in the original training data matrix).

 

% Write code below to classify each document in the test set (ie, each row

% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

 

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry

% of this vector is the predicted class (1/0) for the i-th  email (i-th row

% in testMatrix) in the test set.

output = zeros(numTestDocs, 1);

 

%---------------

% Peter Harrington HW2 #3a

 

%p_0, p_1, and p_spam are assumed to be generated in nb_train.m

 

for i=1:numTestDocs         %loop through all the test documents

    p1 = 0.0;

    p0 = 0.0;               %reset the probability predictions

 

    p1 = testMatrix(i,:)*p_1' + log(p_spam);   %calculate the log-score for spam

    p0 = testMatrix(i,:)*p_0' + log(1 - p_spam); %log-score for not spam

 

    if (p1 > p0)            %compare the two

        output(i) = 1;      %set the output if it's spam, otherwise do nothing

    end

    

end

%---------------

 

 

% Compute the error on the test set

numErrors=0;                %avoid shadowing the built-in error() function

for i=1:numTestDocs

  if (category(i) ~= output(i))

    numErrors=numErrors+1;

  end

end

 

%Print out the classification error on the test set

numErrors/numTestDocs
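
The classification and error loops above reduce to a few matrix operations. The sketch below assumes small hypothetical values for p_spam, p_0, and p_1 (standing in for whatever nb_train.m leaves in memory) and a made-up 3-document test set. The names logScore1, logScore0, and errorRate are illustrative and do not appear in the original scripts.

```matlab
% Hypothetical parameters, in the form nb_train.m would leave them in memory.
p_spam = 0.5;
p_1 = log([4 1 4] / 9);      % log P(token j | spam)
p_0 = log([1 5 2] / 8);      % log P(token j | non-spam)

% Hypothetical test set: 3 documents, 3 tokens, with true labels.
testMatrix = [3 0 1; 0 2 0; 1 1 1];
category   = [1; 0; 0];

% One log-score per document and class: token counts times token
% log-probabilities, plus the log of the class prior.
logScore1 = testMatrix * p_1' + log(p_spam);
logScore0 = testMatrix * p_0' + log(1 - p_spam);

% Predict spam (1) wherever the spam score is larger, else non-spam (0).
output = double(logScore1 > logScore0);

% Fraction of test documents whose prediction differs from the true label.
errorRate = mean(output ~= category);
```

Comparing log-scores gives the same decision as comparing the products of the probabilities, because the logarithm is monotonic, while avoiding underflow on long documents.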

 

 

 
