<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel about="http://blog.gmane.org/gmane.mail.bogofilter.general">
    <title>gmane.mail.bogofilter.general</title>
    <link>http://blog.gmane.org/gmane.mail.bogofilter.general</link>
    <description/>
    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>1</syn:updateFrequency>
    <syn:updateBase>1901-01-01T00:00+00:00</syn:updateBase>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11308"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11307"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11306"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11305"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11304"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11303"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11302"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11301"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11300"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11299"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11298"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11297"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11296"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11295"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11294"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11293"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11292"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11291"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11290"/>
        <rdf:li rdf:resource="http://permalink.gmane.org/gmane.mail.bogofilter.general/11289"/>
      </rdf:Seq>
    </items>
    <image rdf:resource="http://gmane.org/img/gmane-25t.png"/>
    <textinput rdf:resource=""/>
  </channel>
  <image rdf:about="http://gmane.org/img/gmane-25t.png">
    <title>Gmane</title>
    <url>http://gmane.org/img/gmane-25t.png</url>
    <link>http://gmane.org</link>
  </image>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11308">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11308</link>
    <description>_______________________________________________
Bogofilter mailing list
Bogofilter&lt; at &gt;bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter
</description>
    <dc:creator>Anne Wilson</dc:creator>
    <dc:date>2008-11-19T11:55:08</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11307">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11307</link>
    <description>Don't know for sure but I certainly get HEAPS of Russian and Japanese (and 
even some German) spam.
Many hundreds of messages per week plus all the usual English crap.

"Real" mail is in the tens per week.

Stephen

On Wednesday 19 November 2008 19:38:42 Matthias Andree wrote:



</description>
    <dc:creator>Stephen Davies</dc:creator>
    <dc:date>2008-11-19T10:49:05</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11306">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11306</link>
    <description>

This looks rather unbalanced - does it roughly reflect your
junk/solicited mail ratio?

</description>
    <dc:creator>Matthias Andree</dc:creator>
    <dc:date>2008-11-19T09:08:42</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11305">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11305</link>
    <description>On Wed, 19 Nov 2008 11:49:44 +1030
Stephen Davies wrote:


Glad you had a backup!  Due to occasional Berkeley DB problems in
bogofilter's early days (years ago), I have a cron job that dumps the
database daily and copies each day's wordlist to a day-of-week
directory.  No doubt it's overkill, but I'd rather be safe :-&gt;
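
A day-of-week rotation like the one described can be sketched in Python. This is only an illustration, not the actual cron job: the function names, the `wordlist.txt` file name, and the directory layout are assumptions. The only bogofilter command used is `bogoutil -d`, which dumps a wordlist as text.

```python
import datetime
import pathlib
import subprocess

# Day names indexed by date.weekday() (Monday == 0); a fixed list so the
# result does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def backup_path(backup_root, day):
    """Return the dump location for a given date, e.g. root/Wed/wordlist.txt."""
    return pathlib.Path(backup_root) / DAY_NAMES[day.weekday()] / "wordlist.txt"


def dump_wordlist(wordlist_db, backup_root, day=None):
    """Dump wordlist_db as text into the directory for the given weekday.

    Each weekday overwrites last week's copy, so seven dumps are kept.
    """
    day = day or datetime.date.today()
    target = backup_path(backup_root, day)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as out:
        # Equivalent to running "bogoutil -d wordlist.db" with its
        # standard output redirected to the target file.
        subprocess.run(["bogoutil", "-d", str(wordlist_db)], stdout=out, check=True)
    return target
```

Called once a day from cron, for example as `dump_wordlist("wordlist.db", "/var/backups/bogofilter")` (a hypothetical path), this keeps one text dump per weekday.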
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-11-19T04:06:11</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11304">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11304</link>
    <description>OK. I brought back a backup wordlist and now get:

bogoutil -w wordlist.db .MSG_COUNT
                                 spam   good
.MSG_COUNT                     266713  12982

(Assuming that the previous spam count was correct, this implies nearly 20000 
spams per month. Bogofilter has been busy!!)

I'll look at the change to TRANSACTIONAL also.

Cheers,
Stephen

On Wednesday 19 November 2008 11:33:36 David Relson wrote:



</description>
    <dc:creator>Stephen Davies</dc:creator>
    <dc:date>2008-11-19T01:19:44</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11303">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11303</link>
    <description>On Wed, 19 Nov 2008 10:47:25 +1030
Stephen Davies wrote:


To compute a token's spamicity, bogofilter needs to know how
many spam and ham messages have been registered (in the
wordlist).  .MSG_COUNT is the special token that provides this info.

The numbers 312870 and 1 indicate that 312870 spam messages and 1 ham
message have been registered.  The value 312870 is reasonable while the
value 1 seems unreasonably low.

FWIW, "bogoutil -d wordlist.db &gt; wordlist.txt" will dump your wordlist
as a text file.  Each line has a token, its spam and ham counts, and a
timestamp.  .MSG_COUNT's "good" value _should_ be greater than any ham
count.
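
That sanity check can be scripted. The sketch below is an illustration under assumptions: it assumes each dump line is `token spam good timestamp` separated by whitespace (the same spam/good column order that `bogoutil -w` prints), and the function name is made up.

```python
def check_msg_count(dump_lines):
    """Return tokens whose ham count exceeds .MSG_COUNT's "good" value.

    dump_lines: iterable of text lines as produced by "bogoutil -d".
    Assumed layout per line: token, spam count, ham count, timestamp.
    """
    counts = {}
    for line in dump_lines:
        parts = line.split()
        if len(parts) not in (3, 4):
            continue  # skip blank or unexpected lines
        counts[parts[0]] = (int(parts[1]), int(parts[2]))

    if ".MSG_COUNT" not in counts:
        raise ValueError("dump has no .MSG_COUNT line")
    good_msgs = counts[".MSG_COUNT"][1]

    # A token cannot have been seen in more ham messages than were
    # registered; max() differs from good_msgs exactly when good exceeds it.
    return sorted(
        token
        for token, (spam, good) in counts.items()
        if token != ".MSG_COUNT" and max(good, good_msgs) != good_msgs
    )
```

An empty result means the invariant holds; any token returned points to a damaged or inconsistent wordlist.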

It might be time to start a new wordlist and register all the ham and
spam you have available.  I'd also recommend backing up your wordlist
periodically in case of future problems.  Lastly, switching from
NON-TRANSACTIONAL bogofilter to TRANSACTIONAL bogofilter will provide a
more secure database environment.

HTH,

David
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-11-19T01:03:36</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11302">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11302</link>
    <description>Thanks for the feedback David.

I get:

bogoutil -w wordlist.db .MSG_COUNT
                                 spam   good
.MSG_COUNT                     312870      1

What does this actually mean?

Cheers,
Stephen


On Wednesday 19 November 2008 10:32:40 David Relson wrote:



</description>
    <dc:creator>Stephen Davies</dc:creator>
    <dc:date>2008-11-19T00:17:25</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11301">
    <title>Re: nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11301</link>
    <description>On Wed, 19 Nov 2008 09:38:25 +1030
Stephen Davies wrote:

...[snip]...

Hello Stephen,

It sounds like your wordlist is b0rked. "nan" shows up when .MSG_COUNT
has 0 for either the spam or the ham count.  Run

"bogoutil -w wordlist.db .MSG_COUNT"

to test this.  If a zero shows up, then it's time to replace your
wordlist with a backup copy.
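
Why a zero message count turns into "nan" can be seen in a small sketch. It assumes Robinson's f(w) formula with s = 0.0178 and x = 0.52 (the values printed at the bottom of the scan output), and it mimics C floating-point division in Python. The function names are made up and bogofilter's real code may differ in detail.

```python
import math


def ieee_div(num, den):
    """Divide like C doubles do: x/0 is inf and 0/0 is nan (non-negative inputs)."""
    if den != 0:
        return num / den
    return math.nan if num == 0 else math.inf


def token_score(good, bad, good_msgs, bad_msgs, s=0.0178, x=0.52):
    """Return (pgood, pbad, fw) for one token.

    good, bad: how many ham and spam messages contained the token.
    good_msgs, bad_msgs: the two .MSG_COUNT values.
    """
    n = good + bad
    if n == 0:
        return None, None, x  # unknown token: falls back to x
    pgood = ieee_div(good, good_msgs)
    pbad = ieee_div(bad, bad_msgs)
    p = ieee_div(pbad, pbad + pgood)
    fw = (s * x + n * p) / (s + n)
    return pgood, pbad, fw
```

With a ham message count of 0, a token seen only in spam gives 0/0, so pgood and f(w) are both nan. A token with a nonzero ham count gives pgood = inf and an f(w) close to 0. Both patterns appear in the scan output.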

HTH,

David
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-11-19T00:02:40</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11300">
    <title>nan in bogofilter stats</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11300</link>
    <description>Recently I have been seeing large numbers of false Ham results from my well-trained 
bogofilter.

The following is the output of a bogofilter scan of an obvious spam mail.

The Ham result seems to result from the "nan" values.

Where do these come from and how do I fix it?

Cheers and thanks,
Stephen Davies


[scldad&lt; at &gt;mustang bogofilter]$ bogofilter --version
bogofilter version 1.1.6
    Database: Berkeley DB 4.6.21: (December 28, 2007) NON-TRANSACTIONAL

X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.1.6
                                        n    pgood     pbad      fw     U
  "$59.95"                           2160       nan  0.006905       nan -
  "Viagra"                          12385       nan  0.039592       nan -
  "buy"                             13387       nan  0.042795       nan -
  "childrencloud.com"                  11       nan  0.000035       nan -
  "from:Blackburn"                     23       nan  0.000074       nan -
  "from:Destin"                        11       nan  0.000035       nan -
  "from:collocationsai8"               11       nan  0.000035       nan -
  "from:tiavacams.com"                 11       nan  0.000035       nan -
  "head:Content-transfer-encoding"    1616       inf  0.003887  0.000006 +
  "head:Content-type"                   0  --------  --------  0.520000 i
  "head:Date"                      161813       nan  0.517277       nan -
  "head:From"                           0  --------  --------  0.520000 i
  "head:MIME-Version"              147105       nan  0.470259       nan -
  "head:Mail"                        2280       nan  0.007289       nan -
  "head:Message-ID"                145482       nan  0.465071       nan -
  "head:Microsoft"                  98282       nan  0.314184       nan -
  "head:MimeOLE"                    95795       nan  0.306233       nan -
  "head:Normal"                    111608       nan  0.356784       nan -
  "head:Nov"                            0  --------  --------  0.520000 i
  "head:Produced"                   96908       nan  0.309791       nan -
  "head:Status"                         0  --------  --------  0.520000 i
  "head:T8!!"                          11       nan  0.000035       nan -
  "head:V6.0.6001.18049"              589       nan  0.001883       nan -
  "head:Wed"                            0  --------  --------  0.520000 i
  "head:Windows"                     7409       nan  0.023685       nan -
  "head:X-KMail-EncryptionState"        0  --------  --------  0.520000 i
  "head:X-KMail-MDN-Sent"               0  --------  --------  0.520000 i
  "head:X-KMail-SignatureState"         0  --------  --------  0.520000 i
  "head:X-MIMEOLE"                    976       nan  0.003120       nan -
  "head:X-MSMail-priority"            730       nan  0.002334       nan -
  "head:X-Mailer"                  130528       nan  0.417266       nan -
  "head:X-Priority"                116917       nan  0.373755       nan -
  "head:X-Status"                       0  --------  --------  0.520000 i
  "head:X-UIDL"                     17414       nan  0.055668       nan -
  "head:X-Virus-Scanned"            23900       nan  0.076402       nan -
  "head:amavisd-new"                23214       nan  0.074210       nan -
  "head:bit"                        67432       nan  0.215564       nan -
  "head:charset"                    70412       nan  0.225090       nan -
  "head:collocationsai8"                5       nan  0.000016       nan -
  "head:flowed"                     17461       nan  0.055819       nan -
  "head:format"                     17403       nan  0.055633       nan -
  "head:hnP!!S"                        11       nan  0.000035       nan -
  "head:iso-8859-1"                     0  --------  --------  0.520000 i
  "head:original"                   13479       nan  0.043089       nan -
  "head:plain"                      52824       nan  0.168866       nan -
  "head:reply-type"                 13580       nan  0.043412       nan -
  "head:sdc.com.au"                 26556       nan  0.084893       nan -
  "head:text"                       75118       nan  0.240134       nan -
  "head:tiavacams.com"                  5       nan  0.000016       nan -
  "http"                           224709       nan  0.718340       nan -
  "now"                             42561       inf  0.126982  0.000000 +
  "pills"                           10378       nan  0.033176       nan -
  "rcvd:ESMTP"                          0  --------  --------  0.520000 i
  "rcvd:Nov"                        22848       nan  0.073040       nan -
  "rcvd:Wed"                        19712       nan  0.063014       nan -
  "rcvd:andrada"                       11       nan  0.000035       nan -
  "rcvd:for"                        61212       nan  0.195680       nan -
  "rcvd:forged"                      8385       nan  0.026805       nan -
  "rcvd:from"                      119609       nan  0.382361       nan -
  "rcvd:may"                         8386       nan  0.026808       nan -
  "rcvd:mustang.sdc.com.au"             0  --------  --------  0.520000 i
  "rcvd:scldad"                         0  --------  --------  0.520000 i
  "rcvd:sdc.com.au"                 59654       nan  0.190699       nan -
  "rcvd:with"                       76798       nan  0.245505       nan -
  "rtrn:collocationsai8"               11       nan  0.000035       nan -
  "rtrn:tiavacams.com"                 11       nan  0.000035       nan -
  "subj:$89.95"                        96       nan  0.000307       nan -
  "subj:Price"                        438       nan  0.001400       nan -
  "subj:Sildenafil"                   687       nan  0.002196       nan -
  "subj:Viagra"                      3559       nan  0.011377       nan -
  "subj:for"                        17960       nan  0.057414       nan -
  "subj:pills"                       3815       nan  0.012196       nan -
  "to:scldad"                      150032       nan  0.479616       nan -
  "to:sdc.com.au"                  279676       nan  0.894056       nan -
  "url:89"                           2872       nan  0.009181       nan -
  "url:89.165"                         25       nan  0.000080       nan -
  "url:89.165.243"                     11       nan  0.000035       nan -
  "url:89.165.243.217"                 11       nan  0.000035       nan -
  N_P_Q_S_s_x_md                        2  1.000000  0.000000  0.000000
                                           0.017800  0.520000  0.375000

[scldad&lt; at &gt;mustang bogofilter]$ bogoutil -w wordlist.db Viagra
                                 spam   good
Viagra                          12385      0

</description>
    <dc:creator>Stephen Davies</dc:creator>
    <dc:date>2008-11-18T23:08:25</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11299">
    <title>Re: Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11299</link>
    <description>Message: 1
Date: Mon, 20 Oct 2008 21:18:47 -0400
From: Tom Anderson &lt;tanderson&lt; at &gt;orderamidchaos.com&gt;
Subject: Re: Tuning bogofilter
To: bf-users &lt;bogofilter&lt; at &gt;bogofilter.org&gt;
Message-ID: &lt;48FD2DF7.9010303&lt; at &gt;orderamidchaos.com&gt;
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

David Relson wrote:

&gt; We recommend against training over and over with the same messages as
&gt; it biases the wordlist.  Training with ham and spam yields a wordlist
&gt; that indicates how often individual words (tokens) occur in ham and in
&gt; spam.  For a simplified example:  if "xyzw" occurs in 10% of your ham
&gt; and in 20% of your spam, then a message with "xyzw" is twice as likely
&gt; to be spam as it is to be ham.  If you keep training with the same spam
&gt; you skew the results -- which is not recommended.

I don't really see biasing a spam message as spam to be particularly
problematic.  If indeed a particular word ought to be hammier, then it
will become so in the course of training your hams.  My experience has
been that sometimes you don't receive enough spams to make some tokens
spammy enough, and I therefore train these spams multiple times until
bogofilter recognizes them appropriately as spams.  Otherwise, I will
keep receiving them as false negatives.  For instance, if "xyzw" has
only occurred twice, but it is absolutely 100% always spammy and you
never ever want to see it again, then just keep training on that spam
until bogofilter recognizes it as such.  This is as if you have received
the spam many times, but without the inconvenience of actually having
done so.  When you later train a ham message which contains another
token "abcd" which may have appeared alongside "xyzw" and is now
spammier than it should have been, "abcd" will become hammier while
"xyzw" remains very spammy.  In the end, the result you want.

As I see it, most of the English language should wind up essentially
neutral in your wordlist, with only the truly hammy and spammy words
standing out, like a whitelist and blacklist respectively.  If some
portion of the general language moves slightly hammy or spammy due to
"over-training" some particular emails, it shouldn't have a large effect
on the classification since it is largely only the trigger tokens which
will ultimately decide it.  If a message is so wishy-washy as to contain
no such trigger tokens which are obviously hammy or spammy, or perhaps
well-crafted enough to contain equal numbers of each, then it deserves
to be marked unsure so that you can manually determine it.

Tom


And in a sense, Tom, I did exactly as you suggested.  Using  
bogominitrain.pl does exactly that.

However, if I continued to use the corpus AND new spam e-mail, I'd be
concerned that I was skewing results so that ham messages might start
to look like spam.

That was my main concern.

After thinking about what happened in my case, and because I was on
such an old version, it made better sense for me to wipe my entire db
and retrain from my saved corpus.

Additionally, old procmail scripts that I had put in place were
causing messages not to be filtered correctly.

So in the end, for me anyway, doing the initial training with  
bogominitrain.pl worked exactly how I'd expect it to.

Now I'll save messages that didn't get marked as spam and then
reprocess them with bogominitrain.pl.

If anyone can definitively say that re-running my old corpus using  
bogominitrain.pl won't affect my spam scoring in a negative way, then  
it will be OK to continue adding any new missed spam to my main corpus  
and continue using bogominitrain.pl to train the db.

Thanks for your post.

Mike B.


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

</description>
    <dc:creator>barsalou</dc:creator>
    <dc:date>2008-10-22T23:22:20</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11298">
    <title>Re: Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11298</link>
    <description>
I don't really see biasing a spam message as spam to be particularly 
problematic.  If indeed a particular word ought to be hammier, then it 
will become so in the course of training your hams.  My experience has 
been that sometimes you don't receive enough spams to make some tokens 
spammy enough, and I therefore train these spams multiple times until 
bogofilter recognizes them appropriately as spams.  Otherwise, I will 
keep receiving them as false negatives.  For instance, if "xyzw" has 
only occurred twice, but it is absolutely 100% always spammy and you 
never ever want to see it again, then just keep training on that spam 
until bogofilter recognizes it as such.  This is as if you have received 
the spam many times, but without the inconvenience of actually having 
done so.  When you later train a ham message which contains another 
token "abcd" which may have appeared alongside "xyzw" and is now 
spammier than it should have been, "abcd" will become hammier while 
"xyzw" remains very spammy.  In the end, the result you want.

As I see it, most of the English language should wind up essentially 
neutral in your wordlist, with only the truly hammy and spammy words 
standing out, like a whitelist and blacklist respectively.  If some 
portion of the general language moves slightly hammy or spammy due to 
"over-training" some particular emails, it shouldn't have a large effect 
on the classification since it is largely only the trigger tokens which 
will ultimately decide it.  If a message is so wishy-washy as to contain 
no such trigger tokens which are obviously hammy or spammy, or perhaps 
well-crafted enough to contain equal numbers of each, then it deserves 
to be marked unsure so that you can manually determine it.

Tom


</description>
    <dc:creator>Tom Anderson</dc:creator>
    <dc:date>2008-10-21T01:18:47</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11297">
    <title>Re: Berkeley DB vs Sqlite3</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11297</link>
    <description/>
    <dc:creator>Gour</dc:creator>
    <dc:date>2008-10-21T06:25:01</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11296">
    <title>Re: Berkeley DB vs Sqlite3</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11296</link>
    <description>Gour schrieb:


bf_compact was fixed in release 1.1.7. Since then, it is supposed to
support Oracle/SleepyCat Berkeley DB, sqlite3, TokyoCabinet, and its
predecessor, QDBM.
We forgot to update the bf_compact documentation. Sorry for that.

Best regards
Matthias

</description>
    <dc:creator>Matthias Andree</dc:creator>
    <dc:date>2008-10-20T13:16:58</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11295">
    <title>Re: Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11295</link>
    <description>On Sun, 19 Oct 2008 20:21:12 -0800
barsalou wrote:

...[snip]...
 

Goodness, you must have been using an _old_ version of bogofilter.  The
change in default wording from "Yes, No" to "Spam, Ham, Unsure" was
_years_ ago.

In any case, 'tis good you found the problem!


Yes.
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-10-20T11:37:01</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11294">
    <title>Re: Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11294</link>
    <description>
Thanks David,

I cleared my wordlist and retrained it from my corpus.  Also found  
that the newer version I was using used the word 'Spam' instead of  
'Yes' in the X-Bogosity line.  So procmail wasn't doing the sorting  
properly.
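
One way to make a sorting step tolerant of both wordings is to normalise the verdict before acting on it. This Python sketch is an assumption, not part of bogofilter or procmail: it parses a header line such as `X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.1.6` and maps the old `Yes`/`No` verdicts to `Spam`/`Ham`. The function name is made up.

```python
# Old verdict words mapped to the newer ones.
OLD_TO_NEW = {"Yes": "Spam", "No": "Ham"}


def parse_bogosity(header_line):
    """Return (verdict, spamicity) from an X-Bogosity header line."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "x-bogosity":
        raise ValueError("not an X-Bogosity header")
    # The first comma-separated field is the verdict, the rest are key=value.
    fields = [part.strip() for part in value.split(",")]
    verdict = OLD_TO_NEW.get(fields[0], fields[0])
    params = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    spamicity = float(params["spamicity"]) if "spamicity" in params else None
    return verdict, spamicity
```

A filter built on this would sort on the returned verdict ("Spam", "Ham" or "Unsure") and would keep working across the wording change.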

So things are working more like I'd expect them to.

You said using the same spam corpus will skew the results...but it  
isn't clear to me in what way it will do that.  I assume you mean that  
it will mark words as being more spammy than they should be...is that  
right?

Thanks for your response.

Mike B.

</description>
    <dc:creator>barsalou</dc:creator>
    <dc:date>2008-10-20T04:21:12</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11293">
    <title>Re: Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11293</link>
    <description>On Sat, 18 Oct 2008 12:09:24 -0800
barsalou wrote:


H'lo Mike,

We recommend against training over and over with the same messages as
it biases the wordlist.  Training with ham and spam yields a wordlist
that indicates how often individual words (tokens) occur in ham and in
spam.  For a simplified example:  if "xyzw" occurs in 10% of your ham
and in 20% of your spam, then a message with "xyzw" is twice as likely
to be spam as it is to be ham.  If you keep training with the same spam
you skew the results -- which is not recommended.
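
The arithmetic behind that example, and the skew from retraining, can be sketched as follows. The counts are invented for illustration (a token in 10 of 100 hams and 20 of 100 spams), and the model is deliberately simplified: registering a spam that contains the token k more times adds k to both the token's spam count and the spam message count.

```python
from fractions import Fraction


def spam_to_ham_ratio(ham_hits, ham_msgs, spam_hits, spam_msgs):
    """How many times more often the token occurs in spam than in ham."""
    return Fraction(spam_hits, spam_msgs) / Fraction(ham_hits, ham_msgs)


def retrain_same_spam(spam_hits, spam_msgs, repeats):
    """Counts after registering one spam containing the token `repeats` more times."""
    return spam_hits + repeats, spam_msgs + repeats


# "xyzw" in 10% of ham and 20% of spam: twice as likely to be spam.
print(spam_to_ham_ratio(10, 100, 20, 100))  # 2

# Register the same spam 50 more times: 70 of 150 spams now contain "xyzw".
hits, msgs = retrain_same_spam(20, 100, 50)
print(float(spam_to_ham_ratio(10, 100, hits, msgs)))  # roughly 4.67
```

Nothing about the incoming mail changed, yet the token now looks more than twice as spammy as before. That is the bias being warned about.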

If you're seeing 0.52000 scores for both ham and spam, then there's
something wrong.  

Bogofilter has flags that will show you how/why it's scoring a message
as ham or spam.  Look in the FAQ for the writeup on "-vv" and "-vvv",
then give these flags a try with sample ham and spam messages to see
what you learn.  Also, bogoutil has a "-p" flag that will show the ham
and spam scores of tokens passed to it.  That is likely to be helpful.

HTH,

David
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-10-18T20:38:32</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11292">
    <title>Tuning bogofilter</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11292</link>
    <description>If I repeatedly use the same set of initial spam messages to train  
bogofilter, will that cause it to work less well?

I have a spam corpus to which I continually add messages.  Then, using
bogominitrain.pl, I occasionally retrain.

I'm wondering if this could cause a problem.

My concern is born out of looking at the bogosity header and that both  
my "ham" messages and "spam" messages get a spamicity of .52000

Thanks for any guidance.

Mike B.

</description>
    <dc:creator>barsalou</dc:creator>
    <dc:date>2008-10-18T20:09:24</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11291">
    <title>Re: Berkeley DB vs Sqlite3</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11291</link>
    <description>
Matthias&gt; which one has given you problems?

None (so far).

Matthias&gt; Not all of the bf_* utilities are required for sqlite3
Matthias&gt; databases though, most of them came into existence to help
Matthias&gt; users handle Berkeley DB idiosyncrasies.

I see.

Matthias&gt; If there are contrib/* scripts that don't work with sqlite3,
Matthias&gt; then please tell us which one you wanted to use and how it
Matthias&gt; failed (error messages perhaps) so we can ask the contributor
Matthias&gt; if he wants to update/fix them.

I was looking at e.g. man for bf_compact which says it is meant for
Berkeley, but it probably does not apply for sqlite3.

Matthias&gt; I do /not/ recommend the Berkeley DB version over an
Matthias&gt; established sqlite3. In your situation, I'd suggest sticking
Matthias&gt; with sqlite3.

Thank you for the hint.

Matthias&gt; The speed difference to SQLite3 does not, IMO, matter on
Matthias&gt; personal computers (Berkeley DB was faster when I
Matthias&gt; compared them long ago, but sqlite3 has been optimized since).

Right and I'm fetching mail via cron (getmail).

Matthias&gt; sqlite3 is easier to handle (there are fewer points on the
Matthias&gt; checklists to care about) - read through README.db and see for
Matthias&gt; yourself.

OK. Will do.


Matthias&gt; Yes, it is. The procedure is:

Matthias&gt; - stop your mail system
Matthias&gt; - dump all your *.db files to text files with bogoutil -d.
Matthias&gt; - install the berkeleydb-based bogofilter
Matthias&gt; - load the text files back into *.db files
Matthias&gt; - test
Matthias&gt; - if everything works, start your mail system and remove the old
Matthias&gt;   *.db files - backup the dumped text files instead.

Thank you for giving it although I may not need it ;)


Sincerely,
Gour

</description>
    <dc:creator>Gour</dc:creator>
    <dc:date>2008-10-18T13:34:25</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11290">
    <title>Re: Berkeley DB vs Sqlite3</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11290</link>
    <description>

Hi Gour,

which one has given you problems?

Not all of the bf_* utilities are required for sqlite3 databases though,
most of them came into existence to help users handle Berkeley DB
idiosyncrasies.

If there are contrib/* scripts that don't work with sqlite3, then please
tell us which one you wanted to use and how it failed (error messages
perhaps) so we can ask the contributor if he wants to update/fix them.

I do /not/ recommend the Berkeley DB version over an established
sqlite3. In your situation, I'd suggest sticking with sqlite3.

The speed difference to SQLite3 does not, IMO, matter on personal
computers (Berkeley DB was faster when I compared them long ago,
but sqlite3 has been optimized since).

sqlite3 is easier to handle (there are fewer points on the checklists to
care about) - read through README.db and see for yourself.


Yes, it is. The procedure is:

- stop your mail system
- dump all your *.db files to text files with bogoutil -d.
- install the berkeleydb-based bogofilter
- load the text files back into *.db files
- test
- if everything works, start your mail system and remove the old *.db
  files - backup the dumped text files instead.
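
The dump and reload steps can be scripted. The sketch below only wraps the two `bogoutil` invocations (`-d` to dump, `-l` to load from standard input). The helper names and the `.txt` naming scheme are assumptions, and stopping the mail system, installing the other bogofilter build, and testing remain manual steps.

```python
import pathlib
import subprocess


def text_name(db_path):
    """wordlist.db is dumped to wordlist.txt next to it (naming is an assumption)."""
    return pathlib.Path(db_path).with_suffix(".txt")


def dump_db(db_path):
    """Run with the OLD bogoutil installed: dump db_path to its text file."""
    target = text_name(db_path)
    with open(target, "wb") as out:
        subprocess.run(["bogoutil", "-d", str(db_path)], stdout=out, check=True)
    return target


def load_db(text_path, db_path):
    """Run with the NEW bogoutil installed: load the text dump into db_path."""
    with open(text_path, "rb") as src:
        subprocess.run(["bogoutil", "-l", str(db_path)], stdin=src, check=True)
```

Loading into a new file name or directory keeps the old *.db files available until testing is done.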

</description>
    <dc:creator>Matthias Andree</dc:creator>
    <dc:date>2008-10-18T08:22:40</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11289">
    <title>Berkeley DB vs Sqlite3</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11289</link>
    <description/>
    <dc:creator>Gour</dc:creator>
    <dc:date>2008-10-18T07:05:15</dc:date>
  </item>
  <item rdf:about="http://permalink.gmane.org/gmane.mail.bogofilter.general/11288">
    <title>Re: re-training with -Ns and -Sn</title>
    <link>http://permalink.gmane.org/gmane.mail.bogofilter.general/11288</link>
    <description>On Mon, 6 Oct 2008 12:26:01 -0500
Bill McClain wrote:


"-N" and "-S" are the options to undo the registration of an
incorrectly registered message.

"-Ns" is used when spam was incorrectly registered as ham.
Bogofilter's action for "-Ns" is to lower each token's ham count and
raise the spam count.  For "-Sn" the actions are lower spam count and
raise ham count.

To register an "Unsure" as ham, you should just use "-n" (to tell
bogofilter that the message is ham, _not_ spam).  To register an
"Unsure" as spam, use "-s" (to tell bogofilter that it _is_ spam).
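
The effect of the four flags on a token's counts can be modelled in a few lines. This is a toy model based only on the description above, not bogofilter's code. The function name, counting each token once per message, and clamping counts at zero are all assumptions.

```python
def apply_flag(counts, tokens, flag):
    """Update {token: [spam, ham]} counts for one message.

    flag is one of "s", "n", "Ns", "Sn":
      s   register as spam           n   register as ham
      Ns  undo ham, register spam    Sn  undo spam, register ham
    """
    deltas = {"s": (1, 0), "n": (0, 1), "Ns": (1, -1), "Sn": (-1, 1)}
    if flag not in deltas:
        raise ValueError("unknown flag: " + flag)
    d_spam, d_ham = deltas[flag]
    for token in set(tokens):  # assumed: each token counts once per message
        spam, ham = counts.get(token, [0, 0])
        # Assumed: counts never drop below zero.
        counts[token] = [max(0, spam + d_spam), max(0, ham + d_ham)]
    return counts
```

Registering a message with "n" and then correcting it with "Ns" leaves each of its tokens with one spam hit and no ham hit, the same as if it had been registered with "s" in the first place.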

HTH,

David
</description>
    <dc:creator>David Relson</dc:creator>
    <dc:date>2008-10-06T23:10:34</dc:date>
  </item>
  <textinput about="http://search.gmane.org/?group=$group=gmane.mail.bogofilter.general">
    <title>Search Engine</title>
    <description>Search the mailing list at Gmane</description>
    <name>query</name>
    <link>http://search.gmane.org/?group=$group=gmane.mail.bogofilter.general</link>
  </textinput>
</rdf:RDF>
