
			KERNEL_NOTES FOR DIABLO

(0) Location of options

    Diablo compilation options mainly appear in two files: lib/config.h and
    lib/vendor.h.  lib/config.h is supposed to hold only permanent
    configuration options.  The more advanced options are usually left
    disabled unless they can be enabled safely via preprocessor
    conditionals on the OS version.

    Generally speaking, any option overrides you make should be done in
    lib/vendor.h.
 
(I) Use of mmap()

    Diablo requires at least shared read-only file maps to work properly.
    This is known to work on SunOS, Solaris, IRIX, AIX, and FreeBSD.

    BSDI releases, including 3.0, are known to have serious problems with
    mmap() and running diablo on them is not recommended.

    Once you get past shared read-only file maps, you get into shared
    read-write file maps, shared read-write anonymous maps, and sys-v
    shared memory maps.  These are optional.  I believe SunOS, Solaris,
    IRIX, and FreeBSD all support shared r/w file maps, but SunOS does not
    support anonymous maps (Solaris does).  Most systems support sys-v
    shared memory.  I have only tested the advanced mmap features on FreeBSD.

    Diablo will work fine on systems which do not have a unified buffer
    cache for read+write mmaps, which means all mmap features will work
    just fine with FreeBSD 2.2.x or greater.

    Memory allocation features:

    USE_ANON_MMAP	Allows diablo to use an anonymous private r/w mmap
			to allocate memory.  This will cause the least 
			memory fragmentation.

    USE_FALL_MMAP	Diablo uses a temporary file private mmap which it
			then remove()s to allocate memory.  May or may not
			work well depending on how the filesystem works.

			The default is to simply use malloc().

    USE_SPAM_RW_MAP	Use a read+write mmap() for the spam cache file, 
			otherwise uses a read-only mmap and seek+write to
			write.

    USE_SPAM_SHM	Use sysv-shared memory to map the spam cache.  The
			spam cache will be read from its file into shared
			memory on diablo startup and written back on final
			exit.  This is the most efficient spam-cache memory
			option in diablo and should be used whenever possible.

    USE_PCOMMIT_RW_MAP	Use a read+write mmap() for the precommit cache,
			otherwise uses a read-only mmap and seek+write to
			write.

    USE_PCOMMIT_SHM	Use sysv-shared memory to map the precommit cache.
			This can double dhistory lookup performance and lead
			to better stability under extreme loading conditions
			when used with DO_PCOMMIT_POSTCACHE.  This option is
			recommended.

    DO_PCOMMIT_POSTCACHE Use the precommit cache to hold recent dhistory file
			hits.  Recommended only if USE_PCOMMIT_RW_MAP or
			USE_PCOMMIT_SHM is set.
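    As section (0) notes, overrides belong in lib/vendor.h.  A hypothetical
    excerpt enabling the recommended shared-memory options might look like
    the following (the option names come from the list above; whether your
    platform actually supports them is for you to verify):

```c
/*
 * lib/vendor.h -- local option overrides (sketch only; confirm that
 * your OS supports sysv shared memory before enabling these)
 */
#define USE_SPAM_SHM            1   /* sysv shm for the spam cache      */
#define USE_PCOMMIT_SHM         1   /* sysv shm for the precommit cache */
#define DO_PCOMMIT_POSTCACHE    1   /* cache recent dhistory file hits  */
```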

(IIa) memory, disk, and cpu

    A 100 MIPS class cpu is suggested for up to 40 feeds; a 200 MIPS class
    cpu is suggested otherwise.  Nominally, a Pentium Pro 200 running Linux
    or FreeBSD, a Sun Ultra running Solaris, or a 150MHz R4400 or better SGI
    box running IRIX is suggested.  I use FreeBSD boxes.

    A minimum of 128MB of ram is required (mainly to maintain the dhistory
    file efficiently).  If you have more than 30 feeds, 192MB of ram is
    suggested.  If you have more than 70 feeds, 256MB of ram is suggested.
    The more memory the merrier.

    The minimum recommended disk configuration is three fast 4G disks.
    sd0 would be used as the root disk, but half of it (2G) would be the
    /news partition.  sd1 and sd2 would be striped together to make an 8G 
    spool.  A stripe size of 2048 sectors (1 MByte) is suggested.  NOTE that
    a large /news partition is required.  It must not only hold the dhistory
    file and a backup of the dhistory file, it must also hold outgoing queue
    files and not blow up if outgoing feeds have problems and start to
    back up.  /news/dqueue can easily take a gig all by itself.

    The nominally recommended disk configuration is two fast 2G disks and
    two or more fast 4G disks, with /news striped on the first two disks and
    the spool striped on the second two.

    An ultra-wide SCSI controller is recommended.  One will generally be
    sufficient, but if you intend to run more than 80 feeds you should
    consider having two.  UW is suggested for the transaction rate, not
    the disk throughput.  Well-cooled Seagate drives are recommended.

    The machine should not ever have to swap, but swap should be configured
    to allow the machine to retire idle processes.  I suggest configuring
    128MB of swap on every disk to spread any swap activity around.
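    A hypothetical /etc/fstab fragment along those lines (the device and
    partition names are examples only; use whatever swap partitions your
    disks actually have):

```
/dev/sd0b	none	swap	sw	0	0
/dev/sd1b	none	swap	sw	0	0
/dev/sd2b	none	swap	sw	0	0
```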

(IIb) sysctl tuning on FreeBSD

    FreeBSD has VM tuning parameters accessible via sysctl.  The most
    critical number is vm.v_cache_min.  This controls the minimum number of
    pages that MUST be kept in the cache (clean unreferenced pages), and it
    has a cascade effect on the other VM tuning parameters.  You generally
    want to set vm.v_cache_min to a relatively low number, 5000 pages or so
    (20MB with 4K pages).  This reduces the pressure on the kernel to
    deactivate active pages (which it will do to maintain the cache).

    Having a high v_cache_min results in a double-whammy: not only are you
    enforcing a high minimum cache size for clean pages, you are also
    indirectly enforcing a similar cache size for inactive (dirty
    unreferenced) pages.  Nor do you want it too low, or the VM system
    will have to page-in more.  But if you are going to err, err on
    the low side, because too high a v_cache_min will result in thrashing.
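    The adjustment itself is a one-liner (sketch; 5000 pages assumes the
    usual 4K page size, i.e. roughly 20MB):

```
	# inspect the current minimum cache page count
	sysctl vm.v_cache_min
	# lower it to roughly 20MB worth of 4K pages
	sysctl -w vm.v_cache_min=5000
```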

(III) file descriptors, process limits, datasize resource limits

    Configure the system to support a minimum of 512 descriptors per process
    and at least 8192 descriptors for the system as a whole.  The system
    must support at least 512 processes per user and 1024 total processes.
    This may involve both kernel configuration and resource limit settings.

    The number of descriptors used by Diablo will increase six-fold if you
    turn on reader expiration (variable expire) versus feeder expiration
    (straight FIFO expire).

    The datasize limit should be at least 128MB, though 64MB will work.

    NOTE:  FreeBSD has an /etc/login.conf file.  You must ensure that 
    sufficient limits are set for daemon, default, standard, root, and
    news.  Specifically, do not set a small hard datasize limit in daemon
    or cron will not be able to re-limit the process to a higher datasize
    limit.  'datasize=...' is a HARD limit.  'datasize-cur=...' is a
    soft limit.
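    A sketch of the relevant /etc/login.conf entries (the values shown are
    illustrative, not tuned recommendations):

```
# /etc/login.conf (sketch) -- keep the hard datasize limit large in
# daemon/default so cron can re-limit processes to a higher datasize
default:\
	:datasize-cur=64M:\
	:datasize=infinity:\
	:openfiles-cur=512:\
	:tc=standard:
```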

(IV) NBUFs - kernel filesystem buffers

    On kernels for which filesystem buffers are static, configure a large
    number of buffers.  If you have 256MB of ram, I would dedicate half
    of it to filesystem buffers.

    On kernels which have a dynamic buffer cache (FreeBSD, for example), but
    do not have a unified buffer cache, NBUF should be configured to at least
    6144 (around 24 MBytes of KVM) because it is implemented on top of the
    primary buffer cache, which is dynamic.  If you configure too much, you
    will reduce the system's ability to manage its memory.

    The typical FreeBSD kernel config line is:

	options "NBUF=6144"

(V) DHistory file tuning

    Diablo should be able to handle upwards of 3000 accepted articles/min
    and message-id history lookups (check/ihave) rates between 40,000 and
    100,000 lookups/minute.  The actual performance depends heavily on
    the amount of memory you have and the number of diablo processes 
    in contention with each other.

    Many kernels will bog down on internal filesystem locks as the number
    of incoming feeds rises.  You need to worry once you get over 35 or so
    simultaneous diablo processes.   Adding memory or reducing the size of
    the dhistory file will help here.

    The dhistory file defaults to a 14 day retention and will stabilize
    at between 350 and 400 MBytes given an article rate of 800,000 articles/day
    (a full feed as of this writing).  You can configure a lower expiration
    by setting the 'remember' variable in diablo.config to a lower number,
    such as 7 or 3.  If you do, it is recommended that you not go below 7.

    The DHistory file hash table size is programmable, but not dynamic.
    The default is 4 million entries.  You can change this by editing
    diablo.config or using the -h option in diload.  For example, '-h 8m'.
    The hash table size must be a power of 2.  The new hash table size will
    then take effect when you next run biweekly.trim.  Either 4m or 8m
    is recommended.  NOTE: if you make a mistake specifying the hash table
    size, you can blow up the news system so be careful.
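    Since a bad hash table size can wreck the news system, it is worth
    sanity-checking the number before handing it to diload.  A small sketch
    (the 8m -> 8388608 expansion here is our own arithmetic, not diload's
    actual suffix parser):

```shell
# verify a proposed dhistory hash table size is a power of 2
is_pow2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

size=$((8 * 1024 * 1024))       # what '-h 8m' would mean
if is_pow2 "$size"; then
    echo "ok: $size is a power of 2"
else
    echo "BAD: $size is not a power of 2"
fi
```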

(VI) Tuning outgoing feeds to INN

    Please examine the samples/dnewsfeeds file.  Generally speaking, you need
    to tune any outgoing feeds to INN reader boxes.

    You want to do two things:  First, you want to make sure the spam filter
    is configured properly and turned on.  The spam filter is turned on by
    default in Diablo 1.13 or greater.  The sample dnewsfeeds file contains a
    spam filter starter which you should use.

    Second, you should consider cutting control messages in front of articles
    and then delaying non-control messages by 5 minutes.  This will allow
    cancel controls to leap ahead of articles and reduce INN's article write
    overhead (which is usually the big bottleneck in INN).

    Typically, you separate control messages out by creating two separate
    feeds to your reader box.  The first one has a 'delgroupany control.*',
    and the second one has a 'requiregroup control.*'.  Taking the example
    from the sample dnewsfeeds file:

	# dnewsfeeds
	#
	label   nntp2a
	    alias       nntp2.best.com
	    ... other add and delgroups ...
	    delgroupany control.*
	end

	label   nntp2c
	    alias       nntp2.best.com
	    ... other add and delgroups ...
	    requiregroup control.*
	end

    Then, in dnntpspool.ctl you program the normal feed for queue-delayed,
    to delay it by 5 minutes (assuming you run dspoolout from cron every 5
    minutes), and you program the control feed as realtime.  Also, if you
    don't mind slightly longer delays, q2 may be a better choice than q1.

	# dnntpspool.ctl
	#
	nntp2a          oldnntp.best.com                500     n4 q1
	nntp2c          nntp1x.ba.best.com              500     n4 realtime

(VII) Tuning Incoming feeds

    The main thing to remember when tuning incoming feeds is that the 
    load on your news system is related to the number of message-id check
    or ihave requests you receive.  You do not have to go overboard taking
    full feeds... three or four incoming full feeds are quite sufficient.
    Most other incoming feeds will be from smaller sites and having them
    ship you just the local postings is good enough.  The message-id load
    determines how quickly your news box can catch up after prolonged
    downtime or loss of network connectivity and it may be a good idea to
    test this by purposefully taking the machine offline for an hour, just
    to see where you stand.

    There is more than one way to ensure incoming feed redundancy.  Due to
    the way the precommit cache works, if you get offered the same article
    from N different feeds at the same time, Diablo will return a duplicate
    reply code to all but one of those incoming feeds.  If diablo were to
    crash at that point, you would wind up relying on that one incoming
    feed to retry the article, because the others have already marked it off.

    It may be beneficial to purposefully lag one of your incoming full
    feeds to provide added redundancy.  This is something your peer must
    set up for you, and it isn't easy unless they are running news software
    that can do it.  Diablo 1.12 or greater can do it through the 'q2' or
    'q3' option in dnntpspool.ctl on the feeder.  While this virtually
    guarantees
    that you will never accept an article from that particular site under
    normal conditions, it gives your system added redundancy by ensuring
    that the same message-id will be offered at two different times.  If
    something does go wrong, the time delay may help you recover more quickly
    without any article loss.

(VIII) Tuning dexpire

    There are two cron jobs that deal with dexpire.  The first is called
    quadhr.expire and nominally runs dexpire every four hours (6 times a day).
    The second is called hourly.expire and attempts to rerun dexpire if
    the quadhr cron fails.
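    The news user's crontab entries might look like this (the paths are
    illustrative; point them at wherever the adm/ scripts are installed):

```
# dexpire every four hours, plus an hourly retry if the quadhr run failed
0 0,4,8,12,16,20 * * *	/news/adm/quadhr.expire
30 * * * *		/news/adm/hourly.expire
```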

    DExpire in Diablo is very fast.  Since diablo stores multiple articles
    per spool file, DExpire is able to free up disk space very quickly and
    you should not be scared of running it often.  DExpire's biggest cost
    is that it must scan the dhistory file.  Unlike INN's expire, dexpire
    does not rewrite the dhistory file.  Instead, it expires entries in
    place, which is considerably faster.

    The sample expiration cron jobs adm/quadhr.expire and adm/hourly.expire
    set a free space target of 2 gigabytes.  This is the suggested free space
    target if you run expire every 4 hours and is designed to deal with
    large influxes of data that may occur in a 4 hour period.  You can run
    a tighter free space target if you run dexpire more often.  You can
    probably get away with a 1 gigabyte (1000 megabyte) free space target
    if you run dexpire every 2 or 3 hours, but I suggest leaving the free
    space target alone.

    You can manually retire articles from the spool if you like without running
    dexpire.  The history file will still be completely synchronized when
    dexpire does run.  The only legal way to retire articles from the spool is
    to remove an entire spool directory.  You MUST RENAME the directory before
    rm -rf'ing it or you risk creating corrupted article files due to the way
    diablo's article create/append spooling works.  Manually retiring articles
    is so quick you can do it on an hourly or even a ten-minute basis without
    loading the machine down.  This allows you to use a much tighter free space
    target but should only be considered if you have a very small spool to work
    with.
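    The rename-then-remove step can be sketched as follows (this demo runs
    in a scratch directory; on a real box the spool would be something like
    /news/spool/news and the directory name is whatever you are retiring):

```shell
# retire one spool directory safely: rename it FIRST so that running
# diablo processes cannot create or append article files inside it
# while rm -rf is in progress, then remove the renamed copy.
SPOOL=$(mktemp -d)              # stand-in for /news/spool/news
DIR=D.00a1                      # hypothetical spool directory name

mkdir "$SPOOL/$DIR"
echo "article data" > "$SPOOL/$DIR/B.0000"

mv "$SPOOL/$DIR" "$SPOOL/$DIR.gone"     # rename first
rm -rf "$SPOOL/$DIR.gone"               # then remove at leisure
```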

(IX) Typical Performance from news1.best.com

    news1.best.com is a FreeBSD 2.2.x box running on a PPro 200 with 192 MB
    of ram, one 2940UW SCSI controller, and three 4G Seagate ST34371W's.
    It is partitioned as follows:

	Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
	/dev/sd0a      127151    49473    67506    42%    /
	/dev/sd0d       63567     1369    57113     2%    /var
	/dev/sd0e      465940    43490   385175    10%    /var/log
	/dev/sd0f      232474        5   213872     0%    /var/tmp
	/dev/sd0g     1017327   432274   503667    46%    /usr
	/dev/sd0h     1705391   720650   848310    46%    /news	<--- too small
	/dev/ccd0c    8176355  5596859  1925387    74%    /news/spool/news
	procfs              4        4        0   100%    /proc

    The ccd partition is configured with a 4M stripe, designed to
    optimally handle a large number of diablo processes each writing
    to its own private, but large, spool file.   Under FreeBSD,
    ccd must usually be configured such that the filesystem partition
    you use is offset 16 sectors from the base, in order to leave
    room for the disklabel.  It's just the way ccd works.  Be sure
    to adjust the filesystem offset and size accordingly.  NOTE!  To
    prevent the filesystem from hogging one disk, either try to make the stripe
    as large as a single cylinder group, or make the stripe oddly-sized
    so different cylinder groups put their inodes on different physical disks.

	ccd0    8192    none    /dev/sd1d       /dev/sd2d

    The machine is currently configured with 95 feeds, of which 15 are
    'official' fully transited backbone feeds and another 10 are
    fully transited backup backbone feeds.  Another 20 send me message-id's
    equivalent to nearly full feeds.  Most of the remainder are mainly 
    outgoing feeds to T1 customers and their incoming component is minor.

    When news1.best.com is taken down for 30 minutes, then brought
    up again, it gets pounded by about half of its feeds and is 
    able to put away around 25 articles/sec and around 500 
    message-id lookups/sec.  What this means, basically, is that
    although the machine is able to catch up in real articles,
    many of the feeds continue to get behind for a short period of time
    (500/45 = 11 checks/sec/full-incoming-feed, not quite enough).

    The reason the check rate is so low is basically the load on the
    system.  90+ diablo processes all pounding away on the caches
    and the disks reduce efficiency all around.  Half the feeds would
    result in almost triple the efficiency, due not only to the lower
    level of pounding, but also to the greater amount of memory available.
    The real issue is one of message-id load.  I run news1.best.com
    with a high message-id load on purpose... most news admins do not need
    45 full incoming message-id loads to get good news coverage... four
    or five will do just as well.

    In any case, with news1.best.com, the caches start to recover once the
    articles have begun to catch up and get back in sync.  The message-id
    lookup capability increases from 500/sec to 10000/sec and the incoming
    feeds catch up very quickly after that.

    Disk I/O is limited by seeking, so the transfers/sec statistic is often
    more useful than the throughput statistic.  Once caught up, news1.best.com
    stabilizes at around 30 tps on each of its three disks.  When catching up,
    under its heaviest load, sd0 hits around 90 tps, which is basically
    saturation, while sd1 and sd2 stabilize at between 60 and 70 tps.  The
    disks are theoretically capable of around 100 tps (averaged).

    There are a number of ways to reduce the dhistory file load.  Reducing
    the number of full incoming feeds to a reasonable number (4 or 5) is one
    way.  Another is to stripe /news AND the spool rather than just the
    spool.  A third is to simply pack in more memory for better caching.
    A fourth is to reduce the default history retention (see the release
    notes for setting REMEMBER_DAYS) from 14 days to 9, which significantly
    reduces the size of the dhistory file.  Probably the best way to reduce
    the dhistory file load is better management of incoming feeds; only
    a few actually need to be full feeds.

    After you handle the dhistory file load, tuning realtime vs non-realtime
    feeds comes next.  Realtime feeds should only be used under certain
    conditions.  If you are a large ISP providing feeds to your T1 customers,
    making those feeds realtime gets news to them directly rather than having
    them get it over your internet backhaul from someone else.  If you peer
    at a MAE, where you do not pay on a bandwidth basis, making the feeds
    that go over that link realtime will reduce the load on other feeds that
    go over more expensive links, especially if your MAE peers return the
    favor.  Local feeds to newsreader boxes do not have to be realtime, nor
    do most other feeds.  Why make a feed over a costly internet backhaul
    realtime when all it does is increase your outgoing bandwidth?

(X) SOLARIS SPECIFIC NOTES

    The shared memory defaults in /etc/system may have to be tuned due to
    having too low a maximum segment size; the following is suggested:

	set shmsys:shminfo_shmmni = 100
	set shmsys:shminfo_shmseg = 16
	set shmsys:shminfo_shmmax = 16777216

    The file descriptor limits may also be too low; the following is
    suggested:

	set rlim_fd_max = 4096
	set rlim_fd_cur = 1024


(XI) FREEBSD SPECIFIC NOTES

    


