Below are the two Matlab files for the naive Bayes classifier used in problem 3abc.

*********************

nb_train.m

*********************

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN.1400');

trainMatrix = full(spmatrix);

numTrainDocs = size(trainMatrix, 1);

numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.

% Each row represents a unique document (email).

% The j-th column of the row $i$ represents the number of times the j-th

% token appeared in email $i$.

% tokenlist is a long string containing the list of all tokens (words).

% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (numTrainDocs x 1) vector containing the true

% classifications for the documents just read in. The i-th entry gives the

% correct class for the i-th email (which corresponds to the i-th row in

% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.

% Note that for the SVM, you would want to convert these to +1 and -1.

% Peter Harrington HW2 #3

% Our job is to calculate the probability of each class (spam or not spam
% here) given a document. We need to estimate three things in order to
% build the classifier:
% 1. the prior probability of spam
% 2. the prob. of token j appearing given the document is spam
% 3. the prob. of token j appearing given the document is not spam

% to estimate p(spam) we can sum up the 1's in trainCategory

p_spam = sum(trainCategory)/numTrainDocs;

% to complete steps 2 and 3 we need vectors to hold the smoothed counts
p_0_num = ones(1,numTokens);  % we initialize the numerators to 1
p_1_num = ones(1,numTokens);  % (Laplace smoothing)
p_0_denom = numTokens;        % and the denominators to numTokens
p_1_denom = numTokens;

% the value added to the denominator is the number of values our RV can
% take on; here the token identity can take numTokens values, so we add
% numTokens (if we had a 5-value multinomial it would be 5 on the denom)
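As a sanity check on the smoothing, here is a toy example (a sketch with made-up counts, not part of the assignment data):

```
% Toy Laplace smoothing example: vocabulary of V = 3 tokens, and one class
% whose documents contain the tokens with total counts [2 0 1]
counts = [2 0 1];
V = numel(counts);
p_smoothed = (counts + 1) / (V + sum(counts));  % = [3 1 2] / 6
% every token now has nonzero probability, and sum(p_smoothed) is still 1
```

The unseen token (count 0) gets probability 1/6 instead of 0, which keeps the log-probabilities finite at test time.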

% we are going to have to loop through the m documents, summing the token
% counts into the numerator and denominator of the appropriate class

for i=1:numTrainDocs
    if (trainCategory(i) == 1)
        p_1_num = p_1_num + trainMatrix(i,:);
        p_1_denom = p_1_denom + sum(trainMatrix(i,:));
    else
        p_0_num = p_0_num + trainMatrix(i,:);
        p_0_denom = p_0_denom + sum(trainMatrix(i,:));
    end
end
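For reference, the per-class counting above can also be done without the explicit loop (a sketch; `spamIdx`, `p_1_counts`, and `p_0_counts` are names introduced here, not from the assignment):

```
spamIdx = (trainCategory(:) == 1);             % logical index of the spam rows
p_1_counts = sum(trainMatrix(spamIdx,:), 1);   % token counts over spam docs
p_0_counts = sum(trainMatrix(~spamIdx,:), 1);  % token counts over non-spam docs
% adding these to the smoothed initial values gives the same numerators
% and denominators as the loop
```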

p_0 = log(p_0_num ./ p_0_denom);  % log p(token j | not spam), length numTokens
p_1 = log(p_1_num ./ p_1_denom);  % log p(token j | spam)

% the log is used to prevent underflow: rather than taking the product
% of the probabilities, we will take the sum of the logs of the probs
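To see why the log matters, consider a quick illustration (made-up numbers, not the assignment data):

```
% 400 token probabilities of 1e-5 each: the raw product underflows,
% but the sum of logs stays in a comfortable range
p = 1e-5 * ones(1,400);
prod(p)      % underflows to exactly 0 in double precision
sum(log(p))  % about -4605.2, perfectly representable
```

A long email easily has hundreds of tokens, so the product of per-token probabilities underflows well before the comparison between classes is made.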

*********************

nb_test.m

*********************

[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);

numTestDocs = size(testMatrix, 1);

numTokens = size(testMatrix, 2);

% Assume classify.m has just been executed, and all the parameters computed/needed

% by your classifier are in memory through that execution. You can also assume

% that the columns in the test set are arranged in exactly the same way as for the

% training set (i.e., the j-th column represents the same token in the test data

% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row

% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry

% of this vector is the predicted class (1/0) for the i-th email (i-th row

% in testMatrix) in the test set.

output = zeros(numTestDocs, 1);

%---------------

% Peter Harrington HW2 #3a

% p_0, p_1, and p_spam are assumed to be in memory from running nb_train.m

for i=1:numTestDocs  % loop through all the test documents
    p1 = testMatrix(i,:)*p_1' + log(p_spam);      % calculate the prob it's spam
    p0 = testMatrix(i,:)*p_0' + log(1 - p_spam);  % prob it's not spam
    if (p1 > p0)  % compare the two log posteriors
        output(i) = 1;  % flag as spam, otherwise leave the 0
    end
end
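The same decision rule can be written without the loop (a sketch using the variables above; `logp1` and `logp0` are names introduced here):

```
% Vectorized equivalent of the classification loop: each row's dot product
% with the log-probability vectors gives that document's class log posterior
% (up to a constant shared by both classes)
logp1 = testMatrix*p_1' + log(p_spam);
logp0 = testMatrix*p_0' + log(1 - p_spam);
output = double(logp1 > logp0);
```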

%---------------

% Compute the error on the test set

error = 0;
for i=1:numTestDocs
    if (category(i) ~= output(i))
        error = error + 1;
    end
end

%Print out the classification error on the test set

error/numTestDocs
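Equivalently, the test error can be computed in one line (`test_error` is a name introduced here):

```
% fraction of test documents whose predicted class differs from the truth
test_error = mean(category(:) ~= output(:))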
