cs229_hw2_3abc

Page history last edited by peter.harrington 13 years, 11 months ago

Below are two MATLAB files for the naive Bayes classifier used in problem 3abc.

 

*********************

nb_train.m

*********************

 

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN.1400');

 

trainMatrix = full(spmatrix);

numTrainDocs = size(trainMatrix, 1);

numTokens = size(trainMatrix, 2);

 

% trainMatrix is now a (numTrainDocs x numTokens) matrix.

% Each row represents a unique document (email).

% The j-th column of the row $i$ represents the number of times the j-th

% token appeared in email $i$.

 

% tokenlist is a long string containing the list of all tokens (words).

% These tokens are easily known by position in the file TOKENS_LIST

 

% trainCategory is a (numTrainDocs x 1) vector containing the true

% classifications for the documents just read in. The i-th entry gives the

% correct class for the i-th email (which corresponds to the i-th row in

% the document word matrix).

 

% Spam documents are indicated as class 1, and non-spam as class 0.

% Note that for the SVM, you would want to convert these to +1 and -1.
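
% A minimal sketch (an illustration, not part of the original file) of the
% label conversion mentioned above, assuming the SVM expects labels in
% {-1,+1}. demoCategory and demoSvmLabels are hypothetical names.
demoCategory = [1; 0; 1; 0];            % toy 0/1 labels
demoSvmLabels = 2*demoCategory - 1;     % maps 1 -> +1 and 0 -> -1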

 

%Peter Harrington HW2 #3

% Our job is to compute the probability of each class (spam or not spam here)

% for a document. To build the classifier we need to estimate three things:

%1. the probability of spam

%2. the prob. of token j appearing given the document is spam

%3. the prob. of token j appearing given the document is not spam

 

% To estimate p(spam), sum the 1's in trainCategory and divide by numTrainDocs.

p_spam = sum(trainCategory)/numTrainDocs;

 

% To complete steps 2 & 3 we need vectors to hold the probabilities.

p_0_num = ones(1,numTokens);        % Laplace smoothing: every token count

p_1_num = ones(1,numTokens);        % (numerator) starts at 1 ...

p_0_denom = numTokens;              % ... and each denominator starts at the

p_1_denom = numTokens;              % number of values the RV can take on.

% The RV here is the token at a given position in a document, which can take

% on numTokens values; a 5-value multinomial would start with 5 in the denom.

 

% Loop over the numTrainDocs documents; each update handles all numTokens tokens at once.

for i=1:numTrainDocs

    if (trainCategory(i) == 1)

        p_1_num = p_1_num + trainMatrix(i,:);

        p_1_denom = p_1_denom + sum(trainMatrix(i,:));

    else

        p_0_num = p_0_num + trainMatrix(i,:);

        p_0_denom = p_0_denom + sum(trainMatrix(i,:));

    end

end
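
% A possible loop-free version of the counting above, shown on toy data as a
% sketch (an illustration, not part of the original solution). All demo*
% names are hypothetical. It computes the raw per-class token counts and
% totals; the smoothing constants still have to be added to them.
demoMatrix = [2 0 1; 0 3 0; 1 1 0; 0 0 4];   % 4 toy documents, 3 tokens
demoCategory = [1; 0; 1; 0];                 % 1 = spam, 0 = non-spam
demoIsSpam = (demoCategory == 1);            % logical row mask
demoSpamCounts = sum(demoMatrix(demoIsSpam,:), 1);    % per-token counts, spam
demoHamCounts  = sum(demoMatrix(~demoIsSpam,:), 1);   % per-token counts, non-spam
demoSpamTotal = sum(demoSpamCounts);         % total tokens seen in spam
demoHamTotal  = sum(demoHamCounts);          % total tokens seen in non-spam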

p_0 = log(p_0_num / p_0_denom);      %vector of length(numTokens)

p_1 = log(p_1_num / p_1_denom);

% The log is used to prevent underflow: rather than taking the product

% of the probabilities, we will take the sum of their logs.
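
% A small illustration of the underflow mentioned above (a sketch with
% made-up numbers; demo* names are hypothetical). Multiplying 100
% probabilities of 1e-5 gives 1e-500, which is below the smallest positive
% double, so the product becomes exactly 0, while the sum of logs is fine.
demoProbs = 1e-5 * ones(1,100);
demoProduct = prod(demoProbs);          % underflows to 0
demoLogSum = sum(log(demoProbs));       % 100*log(1e-5), about -1151.3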

 

*********************

nb_test.m

*********************

 

[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

 

testMatrix = full(spmatrix);

numTestDocs = size(testMatrix, 1);

numTokens = size(testMatrix, 2);

 

% Assume nb_train.m has just been executed, and all the parameters computed/needed

% by your classifier are in memory through that execution. You can also assume

% that the columns in the test set are arranged in exactly the same way as for the

% training set (i.e., the j-th column represents the same token in the test data

% matrix as in the original training data matrix).

 

% Write code below to classify each document in the test set (ie, each row

% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

 

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry

% of this vector is the predicted class (1/0) for the i-th  email (i-th row

% in testMatrix) in the test set.

output = zeros(numTestDocs, 1);

 

%---------------

% Peter Harrington HW2 #3a

 

% p_0, p_1, and p_spam are assumed to have been computed by nb_train.m

 

for i=1:numTestDocs         %loop through all the test documents

    p1 = 0.0;

    p0 = 0.0;               %reset the probability predictions

 

    p1 = testMatrix(i,:)*p_1' + log(p_spam);   %log-score for spam (unnormalized log posterior)

    p0 = testMatrix(i,:)*p_0' + log(1 - p_spam); %log-score for not spam

 

    if (p1 > p0)            %compare the two

        output(i) = 1;      %set the output if it's spam, otherwise do nothing

    end

    

end

%---------------
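
% A sketch of a loop-free version of the classification above, on toy data
% (an illustration, not part of the original solution; all demo* names and
% numbers are made up). Each row of demoTest is scored against both classes
% with one matrix-vector product per class.
demoLogP1 = log([0.6 0.2 0.2]);         % toy log p(token | spam)
demoLogP0 = log([0.1 0.4 0.5]);         % toy log p(token | non-spam)
demoPSpam = 0.5;                        % toy prior p(spam)
demoTest = [3 0 0; 0 1 2];              % 2 toy documents, 3 tokens
demoScore1 = demoTest*demoLogP1' + log(demoPSpam);
demoScore0 = demoTest*demoLogP0' + log(1 - demoPSpam);
demoOutput = double(demoScore1 > demoScore0);   % 1 where spam scores higher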

 

 

% Compute the error on the test set

numErrors=0;                % named numErrors to avoid shadowing MATLAB's built-in error()

for i=1:numTestDocs

  if (category(i) ~= output(i))

    numErrors=numErrors+1;

  end

end

 

%Print out the classification error on the test set

numErrors/numTestDocs
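
% The same error rate can also be computed without a loop. A sketch on toy
% data (an illustration only; the demo* names are hypothetical):
demoTruth = [1; 0; 1; 0];                      % toy true labels
demoPred  = [1; 1; 1; 0];                      % toy predicted labels
demoErrorRate = mean(demoTruth ~= demoPred);   % fraction of mismatched entries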

 

 

 
