Below are the two Matlab files for the naive Bayes classifier used in problem 3abc.
*********************
nb_train.m
*********************
[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN.1400');
trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);
% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$.
% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST
% trainCategory is a (numTrainDocs x 1) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).
% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.
% Peter Harrington HW2 #3
% Our job is to predict the class (spam or not spam) of each document.
% We need to estimate three things in order to build the classifier:
% 1. the prior probability of spam, p(spam)
% 2. the probability of token j appearing given the document is spam
% 3. the probability of token j appearing given the document is not spam
% To estimate p(spam) we can sum up the 1's in trainCategory:
p_spam = sum(trainCategory)/numTrainDocs;
%to complete steps 2 & 3 we need vectors to accumulate the token counts
p_0_num = ones(1,numTokens);  % Laplace smoothing: start every token count at 1
p_1_num = ones(1,numTokens);
p_0_denom = numTokens;        % and each denominator at numTokens, the number of
p_1_denom = numTokens;        % values the token RV can take on (the vocabulary size)
% in general the denominator starts at the number of outcomes of the
% multinomial, e.g. 5 for a 5-valued multinomial
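As a quick sanity check on the smoothing rule (an illustrative NumPy sketch, not part of the assignment; the counts here are made up):

```python
import numpy as np

# Hypothetical raw counts of each of V = 4 token outcomes in one class
counts = np.array([3, 0, 5, 2])
V = len(counts)

# Laplace smoothing: add 1 to each count and V to the total, so an
# unseen token (count 0) still gets a nonzero probability
phi = (counts + 1) / (counts.sum() + V)

print(phi)        # smoothed probabilities, [4, 1, 6, 3] / 14
print(phi.sum())  # they still sum to 1
```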
%we are going to loop through the m documents, accumulating counts per token
for i = 1:numTrainDocs
    if (trainCategory(i) == 1)
        p_1_num = p_1_num + trainMatrix(i,:);
        p_1_denom = p_1_denom + sum(trainMatrix(i,:));
    else
        p_0_num = p_0_num + trainMatrix(i,:);
        p_0_denom = p_0_denom + sum(trainMatrix(i,:));
    end
end
p_0 = log(p_0_num / p_0_denom); % vector of length numTokens
p_1 = log(p_1_num / p_1_denom);
%the log is used to prevent underflow: rather than taking the product
%of the probabilities, we will take the sum of the logs of the probs
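The underflow that the last comment guards against is easy to demonstrate (a NumPy sketch with made-up probabilities, not part of the assignment):

```python
import numpy as np

# A document with 500 tokens whose per-token probabilities are all 0.01
probs = np.full(500, 0.01)

# The direct product underflows to exactly 0.0 in double precision:
# 0.01^500 = 1e-1000 is far below the smallest double (~1e-308)
direct = np.prod(probs)

# ...while the sum of logs stays perfectly representable
log_score = np.sum(np.log(probs))  # 500 * log(0.01), about -2302.6

print(direct, log_score)
```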
*********************
nb_test.m
*********************
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');
testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);
% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).
% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.
% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);
%---------------
% Peter Harrington HW2 #3a
%p_spam, p_0, and p_1 are assumed to be in memory from nb_train.m
for i = 1:numTestDocs % loop through all the test documents
    p1 = testMatrix(i,:)*p_1' + log(p_spam);     %calculate the log-prob it's spam
    p0 = testMatrix(i,:)*p_0' + log(1 - p_spam); %log-prob it's not spam
    if (p1 > p0) %compare the two
        output(i) = 1; %mark as spam; otherwise leave 0 (non-spam)
    end
end
%---------------
% Compute the error on the test set
numErrors = 0; % avoid shadowing Matlab's built-in error() function
for i = 1:numTestDocs
    if (category(i) ~= output(i))
        numErrors = numErrors + 1;
    end
end
%Print out the classification error on the test set
numErrors/numTestDocs
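The per-document loop in nb_test.m can also be collapsed into a single matrix product. A sketch of the same decision rule in NumPy (toy numbers and hypothetical variable names, not the assignment's data):

```python
import numpy as np

# Toy model: 3 tokens, made-up log conditional probabilities per class
log_p_spam_tokens    = np.log(np.array([0.1, 0.6, 0.3]))
log_p_nonspam_tokens = np.log(np.array([0.5, 0.2, 0.3]))
log_prior_spam    = np.log(0.5)
log_prior_nonspam = np.log(0.5)

# Two test documents as token-count rows, like testMatrix in the Matlab code
X = np.array([[0, 4, 1],    # heavy on token 2 -> should score as spam
              [5, 0, 1]])   # heavy on token 1 -> should score as non-spam

# One matrix product scores every document at once
score_spam    = X @ log_p_spam_tokens    + log_prior_spam
score_nonspam = X @ log_p_nonspam_tokens + log_prior_nonspam
output = (score_spam > score_nonspam).astype(int)
print(output)
```

The same vectorization works in Matlab (`testMatrix * p_1' + log(p_spam)` for all rows at once), which avoids the explicit for-loop entirely.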