Algorithm::NaiveBayes(3)  User Contributed Perl Documentation  Algorithm::NaiveBayes(3)

NAME
        Algorithm::NaiveBayes - Bayesian prediction of categories

SYNOPSIS
          use Algorithm::NaiveBayes;
          my $nb = Algorithm::NaiveBayes->new;

          $nb->add_instance
            (attributes => {foo => 1, bar => 1, baz => 3},
             label => 'sports');

          $nb->add_instance
            (attributes => {foo => 2, blurp => 1},
             label => ['sports', 'finance']);

          ... repeat for several more instances, then:

          $nb->train;

          # Find results for unseen instances
          my $result = $nb->predict
            (attributes => {bar => 3, blurp => 2});

DESCRIPTION
        This module implements the classic "Naive Bayes" machine
        learning algorithm.  It is a well-studied probabilistic
        algorithm often used in automatic text categorization.
        Compared to other algorithms (kNN, SVM, Decision Trees), it's
        pretty fast and reasonably competitive in the quality of its
        results.

        A paper by Fabrizio Sebastiani, "Machine Learning in
        Automated Text Categorization" (ACM Computing Surveys, 2002),
        provides a really good introduction to text categorization.

METHODS
        new()
            Creates a new "Algorithm::NaiveBayes" object and returns
            it.  The following parameters are accepted:

            purge
                If set to a true value, the "do_purge()" method will
                be invoked during "train()".  The default is true.
                Set this to a false value if you'd like to be able to
                add additional instances after training and then call
                "train()" again.

        add_instance( attributes => HASH, label => STRING|ARRAY )
            Adds a training instance to the categorizer.  The
            "attributes" parameter contains a hash reference whose
            keys are string attributes and whose values are the
            weights of those attributes.  For instance, if you're
            categorizing text documents, the attributes might be the
            words of the document, and the weights might be the
            number of times each word occurs in the document.

           The "label" parameter can contain a single string or an
           array of strings, with each string representing a label
           for this instance.  The labels can be any arbitrary
           strings.  To indicate that a document has no applicable
           labels, pass an empty array reference.
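
            For example, one might build the attributes hash from a
            document's word counts (a sketch; $text is a hypothetical
            variable holding the raw text, and lowercasing and
            splitting on whitespace are illustrative choices, not
            requirements of this module):

              my %freq;
              $freq{$_}++ for split /\s+/, lc $text;   # $text: raw document text
              $nb->add_instance(attributes => \%freq, label => 'sports');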

        train()
            Calculates the probabilities that will be necessary for
            categorization using the "predict()" method.

       predict( attributes => HASH )
           Use this method to predict the label of an unknown
           instance.  The attributes should be of the same format as
           you passed to "add_instance()".  "predict()" returns a
           hash reference whose keys are the names of labels, and
           whose values are the score for each label.  Scores are
           between 0 and 1, where 0 means the label doesn't seem to
           apply to this instance, and 1 means it does.

            In practice, scores using Naive Bayes tend to be very
            close to 0 or 1 because of the way normalization is
            performed.  I might try to alleviate this in future
            versions of the code.
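
            For example, to pick the single top-scoring label (a
            sketch; choosing only the best label is an illustrative
            policy, not something "predict()" prescribes):

              my $scores = $nb->predict(attributes => {bar => 3, blurp => 2});
              my ($best) = sort { $scores->{$b} <=> $scores->{$a} }
                           keys %$scores;
              print "best label: $best (score $scores->{$best})\n";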

        labels()
            Returns a list of all the labels the object knows about
            (in no particular order), or the number of labels if
            called in a scalar context.
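
            For example (label names taken from the SYNOPSIS above):

              my @labels = $nb->labels;   # e.g. ('sports', 'finance')
              my $count  = $nb->labels;   # scalar context: number of labels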

        do_purge()
            Purges training instances and their associated
            information from the NaiveBayes object.  This can save
            memory after training.

        purge()
            Returns true or false depending on the value of the
            object's "purge" property.  An optional boolean argument
            sets the property.

        save_state($path)
            This object method saves the object to disk for later
            use.  The $path argument indicates the place on disk
            where the object should be saved:

              $nb->save_state($path);
        restore_state($path)
            This class method reads the file specified by $path and
            returns the object that was previously stored there
            using "save_state()":

              $nb = Algorithm::NaiveBayes->restore_state($path);
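
            Together, these two methods allow a train-once,
            predict-later pattern (a sketch; the file name "nb.state"
            is arbitrary):

              $nb->train;
              $nb->save_state('nb.state');
              # ... later, possibly in another process:
              my $saved  = Algorithm::NaiveBayes->restore_state('nb.state');
              my $result = $saved->predict(attributes => {bar => 3});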

THEORY
        Bayes' Theorem is a way of inverting a conditional
        probability.  It states:

                        P(y|x) P(x)
              P(x|y) = -------------
                           P(y)

        The notation "P(x|y)" means "the probability of "x" given
        "y"."  See also
        "http://forum.swarthmore.edu/dr.math/problems/battisfore.03.22.99.html"
        for a simple but complete example of Bayes' Theorem.
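
        As a made-up numeric illustration (not from the original
        text): suppose 1% of documents are about sports, 80% of
        sports documents contain the word "goal", and 5% of all
        documents contain "goal".  Then

          P(sports|goal) = P(goal|sports) P(sports) / P(goal)
                         = 0.80 * 0.01 / 0.05
                         = 0.16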

        In this case, we want to know the probability of a given
        category given a certain string of words in a document, so
        we have:

                            P(words | cat) P(cat)
          P(cat | words) = -----------------------
                                  P(words)

        We have applied Bayes' Theorem because "P(cat | words)" is a
        difficult quantity to compute directly, but "P(words | cat)"
        and "P(cat)" are accessible (see below).

       The greater the expression above, the greater the probability
       that the given document belongs to the given category.  So we
       want to find the maximum value.  We write this as

                                         P(words | cat) P(cat)
          Best category =   ArgMax      -----------------------
                           cat in cats          P(words)

       Since "P(words)" doesn't change over the range of categories,
       we can get rid of it.  That's good, because we didn't want to
       have to compute these values anyway.  So our new formula is:

         Best category =   ArgMax      P(words ⎪ cat) P(cat)
                          cat in cats

       Finally, we note that if "w1, w2, ... wn" are the words in
       the document, then this expression is equivalent to:

          Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                           cat in cats

       That's the formula I use in my document categorization code.
       The last step is the only non-rigorous one in the derivation,
       and this is the "naive" part of the Naive Bayes technique.
       It assumes that the probability of each word appearing in a
       document is unaffected by the presence or absence of each
       other word in the document.  We assume this even though we
       know this isn't true: for example, the word "iodized" is far
       more likely to appear in a document that contains the word
       "salt" than it is to appear in a document that contains the
       word "subroutine".  Luckily, as it turns out, making this
       assumption even when it isn't true may have little effect on
       our results, as the following paper by Pedro Domingos argues:
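
        As a concrete illustration of the final ArgMax formula, here
        is a minimal sketch in plain Perl (not this module's
        internals; the %prior and %p_word tables are hypothetical
        stand-ins for a trained model, and the sums are done in log
        space to avoid numeric underflow from long products):

          use strict;
          use warnings;

          # Hypothetical trained model: P(cat) and P(word|cat)
          my %prior  = (sports => 0.5, finance => 0.5);
          my %p_word = (
            sports  => {foo => 0.6, bar => 0.3, blurp => 0.1},
            finance => {foo => 0.2, bar => 0.2, blurp => 0.6},
          );

          my %count = (bar => 3, blurp => 2);   # words of an unseen document

          # Score each category: log P(cat) + sum of n(w) * log P(w|cat)
          my %score;
          for my $cat (keys %prior) {
              my $s = log $prior{$cat};
              $s += $count{$_} * log $p_word{$cat}{$_} for keys %count;
              $score{$cat} = $s;
          }
          my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
          print "best category: $best\n";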

HISTORY
        My first implementation of a Naive Bayes algorithm was in
        the now-obsolete AI::Categorize module, first released in
        May 2001.  I replaced it with the Naive Bayes implementation
        in AI::Categorizer (note the extra 'r'), first released in
        July 2002.  I then extracted that implementation into its
        own module that could be used outside the framework, and
        that's what you see here.

AUTHOR
        Ken Williams

COPYRIGHT AND LICENSE
        Copyright 2003-2004 Ken Williams.  All rights reserved.

        This library is free software; you can redistribute it
        and/or modify it under the same terms as Perl itself.

SEE ALSO
        AI::Categorizer(3), perl.

perl v5.8.8                  2007-06-08     Algorithm::NaiveBayes(3)