Below are the two Matlab files for the naive Bayes classifier used in problem 3abc.
*********************
nb_train.m
*********************
[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN.1400');
trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);
% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$.
% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST
% trainCategory is a (numTrainDocs x 1) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).
% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.
% Peter Harrington HW2 #3
% Our job is to predict the class (spam or not spam) of each document.
% We need to estimate three things in order to build the classifier:
% 1. the prior probability of spam, p(spam)
% 2. the probability of token j appearing given the document is spam
% 3. the probability of token j appearing given the document is not spam
% To estimate p(spam) we can sum up the 1's in trainCategory:
p_spam = sum(trainCategory)/numTrainDocs;
%to complete steps 2 & 3 we need vectors to accumulate the token counts
p_0_num = ones(1,numTokens);  % Laplace smoothing: start every token count at 1
p_1_num = ones(1,numTokens);
p_0_denom = numTokens;        % and each denominator at numTokens, the number of
p_1_denom = numTokens;        % values the token RV can take on (the vocabulary size)
% in general the denominator starts at the number of outcomes of the
% multinomial, e.g. 5 for a 5-valued multinomial
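As a quick sanity check on the smoothing rule (an illustrative NumPy sketch, not part of the assignment; the counts here are made up):

```python
import numpy as np

# Hypothetical raw counts of each of V = 4 token outcomes in one class
counts = np.array([3, 0, 5, 2])
V = len(counts)

# Laplace smoothing: add 1 to each count and V to the total, so an
# unseen token (count 0) still gets a nonzero probability
phi = (counts + 1) / (counts.sum() + V)

print(phi)        # smoothed probabilities, [4, 1, 6, 3] / 14
print(phi.sum())  # they still sum to 1
```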
%we are going to loop through the m documents, accumulating counts per token
for i = 1:numTrainDocs
    if (trainCategory(i) == 1)
        p_1_num = p_1_num + trainMatrix(i,:);
        p_1_denom = p_1_denom + sum(trainMatrix(i,:));
    else
        p_0_num = p_0_num + trainMatrix(i,:);
        p_0_denom = p_0_denom + sum(trainMatrix(i,:));
    end
end
p_0 = log(p_0_num / p_0_denom); % vector of length numTokens
p_1 = log(p_1_num / p_1_denom);
%the log is used to prevent underflow: rather than taking the product
%of the probabilities, we will take the sum of the logs of the probs
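The underflow that the last comment guards against is easy to demonstrate (a NumPy sketch with made-up probabilities, not part of the assignment):

```python
import numpy as np

# A document with 500 tokens whose per-token probabilities are all 0.01
probs = np.full(500, 0.01)

# The direct product underflows to exactly 0.0 in double precision:
# 0.01^500 = 1e-1000 is far below the smallest double (~1e-308)
direct = np.prod(probs)

# ...while the sum of logs stays perfectly representable
log_score = np.sum(np.log(probs))  # 500 * log(0.01), about -2302.6

print(direct, log_score)
```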
*********************
nb_test.m
*********************
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');
testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);
% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).
% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.
% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);
%---------------
% Peter Harrington HW2 #3a
%p_spam, p_0, and p_1 are assumed to be in memory from nb_train.m
for i = 1:numTestDocs % loop through all the test documents
    p1 = testMatrix(i,:)*p_1' + log(p_spam);     %calculate the log-prob it's spam
    p0 = testMatrix(i,:)*p_0' + log(1 - p_spam); %log-prob it's not spam
    if (p1 > p0) %compare the two
        output(i) = 1; %mark as spam; otherwise leave 0 (non-spam)
    end
end
%---------------
% Compute the error on the test set
numErrors = 0; % avoid shadowing Matlab's built-in error() function
for i = 1:numTestDocs
    if (category(i) ~= output(i))
        numErrors = numErrors + 1;
    end
end
%Print out the classification error on the test set
numErrors/numTestDocs
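The per-document loop in nb_test.m can also be collapsed into a single matrix product. A sketch of the same decision rule in NumPy (toy numbers and hypothetical variable names, not the assignment's data):

```python
import numpy as np

# Toy model: 3 tokens, made-up log conditional probabilities per class
log_p_spam_tokens    = np.log(np.array([0.1, 0.6, 0.3]))
log_p_nonspam_tokens = np.log(np.array([0.5, 0.2, 0.3]))
log_prior_spam    = np.log(0.5)
log_prior_nonspam = np.log(0.5)

# Two test documents as token-count rows, like testMatrix in the Matlab code
X = np.array([[0, 4, 1],    # heavy on token 2 -> should score as spam
              [5, 0, 1]])   # heavy on token 1 -> should score as non-spam

# One matrix product scores every document at once
score_spam    = X @ log_p_spam_tokens    + log_prior_spam
score_nonspam = X @ log_p_nonspam_tokens + log_prior_nonspam
output = (score_spam > score_nonspam).astype(int)
print(output)
```

The same vectorization works in Matlab (`testMatrix * p_1' + log(p_spam)` for all rows at once), which avoids the explicit for-loop entirely.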