MAILCROSS(1)                                            MAILCROSS(1)



NAME
       mailcross - a cross-validation simulator for use with dbacl.

SYNOPSIS
       mailcross command [ command_arguments ]

DESCRIPTION
       mailcross  automates  the task of cross-validating email fil‐
       tering and classification programs such as dbacl(1).  Given a
       set  of categorized documents, mailcross initiates simulation
       runs to estimate the classification errors and  thereby  per‐
       mits fine tuning of the parameters of the classifier.

       Cross-validation  is a method which is widely used to compare
       the quality of classification and learning algorithms, and as
       such  permits  rudimentary  comparisons between those classi‐
       fiers which make use of dbacl(1) and  bayesol(1),  and  other
       competing classifiers.

       The  mechanics  of  cross-validation are as follows: A set of
       pre-classified email messages is first split into a number of
       roughly equal-sized subsets.  For each subset, the filter (by
       default, dbacl(1)) is used to classify  each  message  within
       this  subset,  based  upon having learned the categories from
       the remaining subsets. The  resulting  classification  errors
       are then averaged over all subsets.
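
       As an illustration of the mechanics just described, the following
       shell sketch splits a few placeholder messages into subsets, lists
       what is classified and what is learned for each held-out subset,
       and averages some per-subset error rates. It does not reproduce
       mailcross internals: the names folds, messages, subset_of, plan
       and average are hypothetical, and round-robin assignment is used
       instead of the random distribution performed by mailcross so that
       the output is reproducible.

```shell
#!/bin/sh
# Illustration only: hold out each subset in turn, learn from the rest.
folds=3
messages="m1 m2 m3 m4 m5 m6 m7"

# Print the subset index (0..folds-1) for the i-th message (round-robin).
subset_of() {
    echo $(( ($1 - 1) % folds ))
}

# For each held-out subset, list the messages to classify and to learn from.
plan() {
    k=0
    while [ "$k" -lt "$folds" ]; do
        test_set=""
        train_set=""
        i=0
        for m in $messages; do
            i=$((i + 1))
            if [ "$(subset_of "$i")" -eq "$k" ]; then
                test_set="$test_set $m"
            else
                train_set="$train_set $m"
            fi
        done
        echo "subset $k: classify$test_set | learn from$train_set"
        k=$((k + 1))
    done
}

# Average the per-subset error rates given as arguments.
average() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s / NR }'
}

plan
```

       Calling average with one error rate per subset then gives the
       kind of figure that the averaging step produces.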

       The results obtained by cross-validation essentially do not
       depend upon the ordering of the sample emails. Other methods
       (see mailtoe(1), mailfoot(1)) attempt to capture the behaviour
       of classification errors over time.

       mailcross uses the  environment  variables  MAILCROSS_LEARNER
       and MAILCROSS_FILTER when executing, which permits the cross-
       validation of arbitrary filters, provided these  satisfy  the
       compatibility  conditions  stated  in the ENVIRONMENT section
       below.

       For convenience, mailcross implements a  testsuite  framework
       with predefined wrappers for several open source classifiers.
       This permits the direct comparison of dbacl(1) with competing
       classifiers  on  the same set of email samples. See the USAGE
       section below.

       During preparation, mailcross  builds  a  subdirectory  named
       mailcross.d  in  the  current  working directory.  All needed
       calculations are performed inside this subdirectory.

EXIT STATUS
       mailcross returns 0 on success, 1 if a problem occurred.

COMMANDS
       prepare size
              Prepares a subdirectory named mailcross.d in the  cur‐
              rent  working  directory,  and populates it with empty
              subdirectories for exactly size subsets.

       add category [FILE]...
              Takes a set of emails from either FILE  if  specified,
              or  STDIN,  and  associates  them  with category.  All
              emails are distributed randomly into  the  subdirecto‐
              ries  of mailcross.d for later use. For each category,
              this command can be repeated several times, but should
              be executed at least once.

       clean  Deletes  the  directory  mailcross.d  and all its con‐
              tents.

       learn  For every previously built subset of  email  messages,
              pre-learns all the categories based on the contents of
              all the subsets except this  one.   The  command_argu‐
              ments are passed to MAILCROSS_LEARNER.

       run    For  every  previously built subset of email messages,
              performs the classification based upon the pre-learned
              categories  associated  with all but this subset.  The
              command_arguments are passed to MAILCROSS_FILTER.

       summarize
              Prints statistics for the latest cross-validation run.

       review truecat predcat
              Scans the last run statistics  and  extracts  all  the
              messages  which  belong  to  category truecat but have
              been classified into category predcat.  The  extracted
              messages   are   copied   to   the   directory   mail‐
              cross.d/review for perusal.

       testsuite list
              Shows a  list  of  available  filters/wrapper  scripts
              which can be selected.

       testsuite select [FILTER]...
              Prepares  the  filter(s)  named  FILTER to be used for
              simulation. The filter name is the name of  a  wrapper
              script located in the directory /usr/share/dbacl/test‐
              suite.  Each filter has a rigid  interface  documented
              below,  and  the  act of selecting it copies it to the
              mailcross.d/filters directory.  Only  filters  located
              there are used in the simulations.

       testsuite deselect [FILTER]...
              Removes  the  named filter(s) from the directory mail‐
              cross.d/filters so that they are not used in the simu‐
              lation.

       testsuite run
              Invokes  every  selected  filter on the datasets added
              previously, and calculates misclassification rates.

       testsuite status
              Describes the scheduled simulations.

       testsuite summarize
              Shows the cross-validation results for all filters.
              Only makes sense after the run command.

USAGE
       The normal usage pattern is the following: first, you should
       separate your email collection into several categories (manually
       or otherwise). Each category should be associated with one or
       more folders, but no folder should contain more than one
       category. Next, you should decide how many subsets to use, say
       10. Note that the calculations slow down rapidly as the number
       of subsets grows. Now you can type

       % mailcross prepare 10

       Next,  for  every category, you must add every folder associ‐
       ated with this category. Suppose you  have  three  categories
       named  spam,  work,  and  play, which are associated with the
       mbox files spam.mbox, work.mbox, and play.mbox  respectively.
       You would type

       % mailcross add spam spam.mbox
       % mailcross add work work.mbox
       % mailcross add play play.mbox

       You can now perform as many simulations as desired. Every
       cross-validation consists of a learning, a running and a
       summarizing stage. These operations are performed on the
       classifier specified in the MAILCROSS_FILTER and
       MAILCROSS_LEARNER variables. By setting these variables
       appropriately, you can compare classification performance as
       you vary the command line options of your classifier(s).

       % mailcross learn
       % mailcross run
       % mailcross summarize
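
       The three stages above can be scripted when several settings are
       to be compared. The sketch below is one possible arrangement, not
       part of mailcross: cv_cycle, compare_hash_sizes and the
       MAILCROSS_CMD override are hypothetical names, and it assumes that
       the dbacl -H option (set to 19 in the default MAILCROSS_LEARNER
       shown on this page) controls the hash table size. It relies on the
       documented exit status to stop at the first failing stage, and
       leaves MAILCROSS_FILTER undefined so that the default filter
       behaviour applies.

```shell
#!/bin/sh
# MAILCROSS_CMD is a hypothetical override, handy for dry runs with "echo".
: "${MAILCROSS_CMD:=mailcross}"

# Run one learn/run/summarize cycle, stopping at the first failing stage.
cv_cycle() {
    for stage in learn run summarize; do
        $MAILCROSS_CMD "$stage" || {
            echo "stage '$stage' failed; check mailcross.d/log" >&2
            return 1
        }
    done
}

# Repeat the cycle once per learner setting given on the command line.
# Assumption: -H sets the dbacl hash table size.
compare_hash_sizes() {
    for h in "$@"; do
        MAILCROSS_LEARNER="dbacl -H $h -T email -T xml -l"
        export MAILCROSS_LEARNER
        echo "== learner: $MAILCROSS_LEARNER"
        cv_cycle || return 1
    done
}
```

       After the prepare and add steps, a call such as
       compare_hash_sizes 18 19 20 would run one full cycle per value.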

       The  testsuite  commands  are  designed to simplify the above
       steps and allow comparison of a wide range of  email  classi‐
       fiers,  including  but not limited to dbacl.  Classifiers are
       supported through wrapper scripts, which are located  in  the
       /usr/share/dbacl/testsuite directory.

       The  first  stage  when using the testsuite is deciding which
       classifiers to compare.  You can view  a  list  of  available
       wrappers by typing:

       % mailcross testsuite list

       Note  that the wrapper scripts are NOT the actual email clas‐
       sifiers, which must be installed separately  by  your  system
       administrator  or  otherwise.   Once  this  is  done, you can
       select one or more wrappers for the simulation by typing, for
       example:

       % mailcross testsuite select dbaclA ifile

       If  some  of  the selected classifiers cannot be found on the
       system, they are not selected. Note also that  some  wrappers
       can  have  hard-coded  category names, e.g. if the classifier
       only supports binary classification. Heed  the  warning  mes‐
       sages.

       It  remains only to run the simulation. Beware, this can take
       a long time (several hours depending on the classifier).

       % mailcross testsuite run
       % mailcross testsuite summarize

       Once you are all done with simulations, you  can  delete  the
       working files, log files etc. by typing

       % mailcross clean

       The progress of the cross-validation is written silently to
       various log files which are located in the mailcross.d/log
       directory. Check these in case of problems.

SCRIPT INTERFACE
       mailcross  testsuite  takes  care of learning and classifying
       your prepared email corpora  for  each  selected  classifier.
       Since  classifiers  have  widely  varying interfaces, this is
       only possible by wrapping those interfaces individually  into
       a standard form which can be used by mailcross testsuite.

       Each  wrapper  script  is a command line tool which accepts a
       single command followed by zero or more  optional  arguments,
       in the standard form:

       wrapper command [argument]...

       Each  wrapper  script also makes use of STDIN and STDOUT in a
       well defined way. If no behaviour is described, then no  out‐
       put  or  input  should  be  used.   The possible commands are
       described below:

       filter In this case, a single email is expected on STDIN, and
              a  list  of  category filenames is expected in $2, $3,
              etc. The script writes the category name corresponding
              to  the  input email on STDOUT. No trailing newline is
              required or expected.

       learn  In this case, a standard mbox stream  is  expected  on
              STDIN, while a suitable category file name is expected
              in $2. No output is written to STDOUT.

       clean  In this case, a directory is expected in $2, which  is
              examined  for  old  database  information.  If any old
              databases are found, they are purged or reset. No out‐
              put is written to STDOUT.

       describe
              In this case, a single line of text is written to
              STDOUT, describing the filter's functionality. The line
              should be kept short to prevent line wrapping on a
              terminal.

       bootstrap
              In this case, a directory is expected in $2. The wrap‐
              per script first checks for the existence of its asso‐
              ciated classifier, and  other  prerequisites.  If  the
              check  is  successful, then the wrapper is cloned into
              the  supplied  directory.   A  courtesy   notification
              should be given on STDOUT to express success or
              failure. It is also permissible to give longer
              descriptions or caveats.

       toe    Used by mailtoe(1).

       foot   Used by mailfoot(1).
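
       The commands above can be sketched as a skeleton wrapper. This is
       a toy under stated assumptions, not one of the wrappers shipped in
       /usr/share/dbacl/testsuite: the function name wrapper, the
       CLASSIFIER_BIN variable and the "always pick the first category"
       logic are hypothetical placeholders, and it assumes that the
       category name is the base name of the category file.

```shell
#!/bin/sh
# Stand-in for the real classifier binary a wrapper would depend on.
CLASSIFIER_BIN=cat

wrapper() {
    case "$1" in
        filter)
            # One email on STDIN; category file names in $2, $3, ...
            cat > /dev/null
            # Placeholder decision: always answer with the first category.
            # Assumption: the category name is the file's base name.
            printf '%s' "$(basename "$2")"
            ;;
        learn)
            # mbox stream on STDIN; category file name in $2.
            # Placeholder "learning": store the stream. Nothing on STDOUT.
            cat > "$2"
            ;;
        clean)
            # Directory in $2; this toy keeps no databases to purge.
            :
            ;;
        describe)
            echo "toy wrapper: always picks the first category"
            ;;
        bootstrap)
            # Directory in $2; check prerequisites, then copy this script.
            if command -v "$CLASSIFIER_BIN" > /dev/null 2>&1; then
                cp "$0" "$2/" && echo "toy wrapper selected"
            else
                echo "toy wrapper: $CLASSIFIER_BIN not found, not selected"
            fi
            ;;
        toe|foot)
            # Used by mailtoe(1) and mailfoot(1); not sketched here.
            :
            ;;
        *)
            echo "usage: wrapper command [argument]..." >&2
            return 1
            ;;
    esac
}
```

       A real wrapper would be a standalone script that dispatches on its
       own first argument; the function form is used here only to keep
       the sketch self-contained.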

ENVIRONMENT
       Right  after  loading, mailcross reads the hidden file .mail‐
       crossrc in the $HOME directory, if it exists, so  this  would
       be a good place to define custom values for environment vari‐
       ables.

       MAILCROSS_FILTER
              This variable contains a shell command to be  executed
              repeatedly  during  the  running  stage.   The command
              should accept an email message on STDIN and  output  a
              resulting  category name. It should also accept a list
              of category file names on the command line. If
              undefined, mailcross uses the default value
              MAILCROSS_FILTER="dbacl -T email -T xml -v" (and also
              automatically adds the -c option before each category).

       MAILCROSS_LEARNER
              This  variable contains a shell command to be executed
              repeatedly during  the  learning  stage.  The  command
              should  accept  a  mbox type stream of emails on STDIN
              for learning, and the file name of the category on the
              command   line.   If  undefined,  mailcross  uses  the
              default value MAILCROSS_LEARNER="dbacl -H 19 -T  email
              -T xml -l".

       TEMPDIR
              This  directory is exported for the benefit of wrapper
              scripts. Scripts which need to create temporary  files
              should place them at the location given in TEMPDIR.
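
       A .mailcrossrc file might look like the following sketch. It
       assumes that the file is read as ordinary shell variable
       assignments, which this page does not state explicitly. Only
       MAILCROSS_LEARNER is set, using the default value quoted above as
       a starting point for edits; MAILCROSS_FILTER is deliberately left
       undefined so that mailcross keeps its default filter handling,
       including the added -c options.

```shell
# ~/.mailcrossrc (sketch; assumes the file is read as shell assignments)
# Start from the documented default learner and adjust options as needed.
MAILCROSS_LEARNER="dbacl -H 19 -T email -T xml -l"
export MAILCROSS_LEARNER
```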

NOTES
       The  subdirectory  mailcross.d  can grow quite large. It con‐
       tains a full copy of the training corpora, as well as  learn‐
       ing  files for size times all the added categories, and vari‐
       ous log files.

WARNING
       Cross-validation is a widely  used,  but  ad-hoc  statistical
       procedure,  completely unrelated to Bayesian theory, and sub‐
       ject to controversy.  Use this at your own risk.

SOURCE
       The source code for the latest version  of  this  program  is
       available at the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A. Breyer <laird@lbreyer.com>

SEE ALSO
       bayesol(1), dbacl(1), mailinspect(1), mailtoe(1), mailfoot(1),
       regex(7)




Version 1.12     Bayesian Text Classification Tools     MAILCROSS(1)