NuzeBot Documentation

The NuzeBot is a robot that is designed to find interesting new headlines. The headlines are in the form of hyperlinks, allowing further reading at the source. The output of the NuzeBot is an HTML file that you can conveniently view with your favorite web browser.

The NuzeBot is designed to remember the hyperlinks that it sees. Old links are penalized, moving them down the list until they are no longer shown on the page.

Though the NuzeBot is functional without editing anything, you should probably customize a few files so the bot can provide the kinds of results you want. The bot is designed to use plain text files for configuration, so when you're editing these files, you should only use a simple text editor that doesn't add any formatting or markup.

Comments in the configuration files have the '#' character at the beginning of the line.

The NuzeBot is free and open source software.

Purpose

The NuzeBot can serve different purposes:

"You should be on top of all the news within your industry, and beyond that all local, national, and global news as well." ~ Donald Trump, Think Like a Billionaire

Usage

To use the NuzeBot, simply run the run.sh (on Linux) or run.bat (on Windows) file. The NuzeBot might take a few minutes if you have many sites in your sites.txt file. The output file will be named nuze.htm by default. Open that file with your favorite web browser when the bot is finished.

It may be good to observe how the NuzeBot works before continuing, but for the kind of results you really want, you will need to edit a couple of files that are used for controlling the NuzeBot.

If you want to create multiple output pages of headlines, use the -oh <file.ext> option to store the headlines in a file. The NuzeBot can then reuse them, which avoids unnecessary network activity and load on servers. Use the -ih <file.ext> option to load the stored headlines when generating each page. Where zlib is enabled in the build options, headlines are stored in a compressed format.

Using the option -r nw will run the bot now ('n') and start the web interface ('w') on the default port 8888 (or change via the -wp option), which you may access at address 127.0.0.1:8888 via your web browser. To be available from other computers on the LAN, use the appropriate LAN address such as 192.168.0.8 that is found by checking your connection information. To be available from behind a router, enable port forwarding. Running the web interface is not required to use the NuzeBot, but it provides a built-in server and some extra features.

Automation

Version 2 of NuzeBot includes a built-in scheduler / timer system, if you choose to use it. The additional configuration file "sched.txt" uses a cron-like syntax except ',' and '/' are not yet supported. There are only four numbers per schedule entry telling when to run the bot: minute, hour, day (of month), and the day of the week. You can use a '*' as a wildcard to match any value.

To edit scheduled times to run the bot, edit the sched.txt file.

To run every day at 5 PM:

#Minute Hour Day Weekday
0 17 * *

To activate this scheduler, you must use the -r t option. To run the NuzeBot now and also run the scheduler, provide -r nt as an argument.
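
To make the four-field format concrete, here is a minimal Python sketch of how a schedule entry could be checked against the current time. It is not the NuzeBot's own code (the bot is written in C), and the function names are made up for illustration. It also assumes that weekdays are numbered as in cron, with 0 meaning Sunday, which this documentation does not confirm.

```python
from datetime import datetime

def load_schedule(text):
    """Keep the entries, skipping blank lines and '#' comment lines."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]

def entry_matches(entry, when):
    """Return True if a 'minute hour day weekday' entry matches `when`."""
    fields = entry.split()
    if len(fields) != 4:
        raise ValueError("expected four fields: minute hour day weekday")
    # Assumption: weekdays are numbered as in cron, with 0 = Sunday.
    actual = (when.minute, when.hour, when.day, when.isoweekday() % 7)
    # A '*' matches anything; otherwise the number must be equal.
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))

entries = load_schedule("#Minute Hour Day Weekday\n0 17 * *\n")
print(entry_matches(entries[0], datetime(2021, 3, 1, 17, 0)))
```

The sketch handles only plain numbers and '*', which mirrors the note above that ',' and '/' are not yet supported.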

Otherwise, to run NuzeBot automatically every day on Linux, you could use cron. Assuming cron is installed and running, type crontab -e to edit tasks, and append the following line:

0 11 * * * cd /full/path/to/nuzebot;./run.sh

That will run the NuzeBot every day at 11 AM. To run daily at 5 PM and 9 PM, use this:

0 17,21 * * * cd /full/path/to/nuzebot;./run.sh

Edit the path to match the path of the NuzeBot folder on your computer.

Web Interface

The web interface uses a simple built-in HTTP server to eliminate the need for a separate web server to share the news found by the NuzeBot on your LAN or WAN. The web interface also provides additional features such as search.

The web interface starts on port 8888 by default, so if it is running on your computer, go to http://127.0.0.1:8888 in your favorite web browser to use it.

The web interface is designed with security in mind, but HTTPS is not yet supported, so you must use an unencrypted/unsecured HTTP connection for now.

Files

The NuzeBot package that you download might contain these files:

config.txt         configuration file
sites.txt          web addresses to scan for headlines
words.txt          words for scoring
reg.txt            regular expressions for scoring
nuze.htm           the output file
nuze.css           css file
nuze-lib.c         source code specific to NuzeBot
nuze-lib.h         source code specific to NuzeBot
std.c              source code, general
std.h              source code, general
nuze.c             source code, main
compile.sh         Linux script for compiling
crossCompile.sh    Linux script for compiling for Win32
mem.dat            memory file
index.htm          home page for the web interface
mimes.txt          mime types for files for the web interface

Below, more explanation is provided for some of the files.

sites.txt

This file lists the addresses of all the pages that you want scanned for news headlines. Customize it so that it contains pages with interesting headlines in the form of hyperlinks.

words.txt

This file contains the words used to score the headlines. The format has recently changed to be more efficient. The first line of each list is a number (specifically an integer) specifying a score. Each line after it contains a word with that score, until an empty line is reached. Then comes another integer score and more words, and so on until the end of the file.

Scoring of words allows headlines to be ranked, placing the most interesting headlines at the top of the output page. Scoring also allows unwanted results to be penalized, moving them down the page or even out of the results entirely. When scanning web pages, the bot doesn't know the difference between links that are news versus links that are ads or even links to those boring "terms of service" pages, so you must use the words.txt file to tell the bot what kind of content you want.

Of course, the words.txt file must be customized if NuzeBot is to find the headlines that suit your personal interests, so simply edit the words.txt to reflect your interests. When the bot has finished running, check the output page (nuze.htm) to see which headlines should have scored higher and which links should have been penalized. Then you have some clues about how to improve your words.txt file.

For example, if you happen to like cheese and butter, you might append the following lines to your words.txt file, ending with an empty line to close the list:

50
cheese
butter

Any headline containing "cheese" or "butter" is given fifty points. A headline containing both words "cheese" and "butter" gets a hundred points. Neat, eh?
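
If it helps to see the format spelled out, here is a minimal Python sketch that reads the words.txt layout described above and adds up a score. It is an illustration only, not the NuzeBot's actual C code. The function names are invented, and the sketch assumes case-insensitive substring matching and that each word counts once per headline. It also scores only the headline text, while the real bot looks at the address part of the link too.

```python
def parse_words(text):
    """Parse lists of the form: integer score, then words, then a blank line."""
    scores = {}
    current = None                      # score of the list being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None              # a blank line closes the list
        elif line.startswith("#"):
            continue                    # comment line
        elif current is None:
            current = int(line)         # first line of a list is the score
        else:
            scores[line.lower()] = current
    return scores

def score_headline(headline, scores):
    """Sum the points of every word found in the headline (assumed rule)."""
    text = headline.lower()
    return sum(points for word, points in scores.items() if word in text)

scores = parse_words("50\ncheese\nbutter\n\n")
print(score_headline("Butter and cheese festival opens", scores))
```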

As another example, if you want to get rid of all links about toadstools, you could simply append the following two lines to your words.txt file:

-9999
toadstool

This gives "toadstool" a score of -9999, so any link containing that string of characters is penalized by 9999 points.

Every headline/hyperlink starts with a score of zero, and depending on the words in the name and address parts of the link, its score is increased or decreased. Headlines with negative scores are usually not shown, but you will be able to change that via the "limit" option later on. It can be useful during testing to see which headlines are being penalized.

As you add more and more words and their scores, the words.txt file can become too big and disorganized to handle. That's why we've made it so you can use the following syntax to include another file:

@otherfile.txt

The '@' tells the bot to look for more words in the file with the name that is specified directly after the @. This way, you can organize your words by topic, with a separate file for each topic.
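
As a rough picture of how such includes can be resolved, here is a small Python sketch that splices included files into one list of lines. It is not taken from the NuzeBot source. The function name is hypothetical, and two behaviors are assumptions: that included names are looked up relative to the folder of the including file, and that a file already loaded is skipped so that two files including each other cannot loop forever.

```python
from pathlib import Path

def read_with_includes(path, seen=None):
    """Return the lines of a words file with '@' includes spliced in."""
    seen = set() if seen is None else seen
    path = Path(path).resolve()
    if path in seen:
        return []                       # already loaded: avoid include loops
    seen.add(path)
    lines = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        if raw.startswith("@"):
            # Assumption: the name is relative to the including file's folder.
            included = path.parent / raw[1:].strip()
            lines.extend(read_with_includes(included, seen))
            lines.append("")            # make sure the included list is closed
        else:
            lines.append(raw)
    return lines
```

Calling read_with_includes("words.txt") would then give a single list of lines, as if all the topic files had been pasted into one file.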

Sometimes, you might want to match a phrase containing multiple words. But headlines in HTML might contain multiple spaces or even a newline between words, making precise matching difficult with normal string matching. Furthermore, web addresses used for links often contain clues about their content, but they typically contain dashes or other characters between words instead of spaces. That's why we've designed the NuzeBot so that the '-' character in the words.txt file will match any number (including zero) of non-alphanumeric characters.

8
peter-pan

That will boost any headline about Peter Pan by 8 points, whether it appears as Peter Pan or Peter-Pan or Peter.Pan or PeterPan. However, it will also match Peter Panda and Peter Panama, and that's why we've made it so that the '_' character matches a single character of a non-alphanumeric type.

8
_peter-pan_

That will exclude Peter Panda, Peter Panama, and other false positives. The '_' at the beginning and the end will help make sure the bot only boosts relevant links. You could use a space in place of any '_' character, but spaces are hard to see in most text editors, so we prefer the '_' character.
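
One way to picture the '-' and '_' rules is to translate each entry into an ordinary regular expression. The Python sketch below does that. It is only an illustration of the matching rules as described here, not the NuzeBot's implementation, and the function name is made up. It assumes matching ignores upper and lower case, and it treats every other character literally.

```python
import re

def word_to_regex(word):
    """Translate a words.txt entry into a compiled regular expression."""
    parts = []
    for ch in word:
        if ch == "-":
            parts.append("[^A-Za-z0-9]*")   # any run of non-alphanumerics, or none
        elif ch == "_":
            parts.append("[^A-Za-z0-9]")    # exactly one non-alphanumeric
        else:
            parts.append(re.escape(ch))     # everything else matches itself
    return re.compile("".join(parts), re.IGNORECASE)

loose = word_to_regex("peter-pan")
strict = word_to_regex("_peter-pan_")
for text in ["See Peter Pan live", "/news/peter-pan/", "See Peter Panda live"]:
    print(text, bool(loose.search(text)), bool(strict.search(text)))
```

In this sketch, a '_' always needs a real character to match, so a headline that begins or ends exactly with the phrase would not match the strict form. Whether the NuzeBot treats the start or end of a link as a boundary is not stated here.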

reg.txt

This is the file for scoring headlines via regular expressions. You can leave this file empty if you don't like regular expressions. The format is slightly different from the words.txt file: The score is given on one line, then all regular expressions with that score are given on consecutive lines. A blank line must be found at the end of each list of regular expressions.

For example:

5
\b(f|do|bur)rito
\bcheeto
\btacos?\b

3
\b(bubble|chewing)\s*gum

Regular expressions are a new feature for NuzeBot and have not been tested.
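
To try out expressions before putting them in reg.txt, a small Python sketch like the following may help. It is not the NuzeBot's code, and Python's regular expression syntax can differ in details from whatever library your NuzeBot build uses, so treat the results as a rough guide. The function names are invented, and the sketch assumes matching ignores case and that each expression counts at most once per headline.

```python
import re

def parse_regexes(text):
    """Parse lists of the form: integer score, then expressions, then a blank line."""
    rules = []
    current = None                      # score of the list being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None              # a blank line closes the list
        elif line.startswith("#"):
            continue                    # comment line
        elif current is None:
            current = int(line)         # first line of a list is the score
        else:
            rules.append((re.compile(line, re.IGNORECASE), current))
    return rules

def regex_score(headline, rules):
    """Add the score of every expression that matches somewhere in the headline."""
    return sum(points for pattern, points in rules if pattern.search(headline))

rules = parse_regexes(r"""5
\b(f|do|bur)rito
\bcheeto
\btacos?\b

3
\b(bubble|chewing)\s*gum

""")
print(regex_score("Burrito and tacos for lunch", rules))
```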

Command Line Options

The command line options allow you to change how NuzeBot works without your needing to figure out how to edit the C programming code and recompile. For boolean yes/no true/false options, use 1 for yes or true and 0 for no or false. The same options can be used in the config.txt file. Options that are given on the command line will override the options in the config.txt file.

Input options:
Input options begin with the letter 'i'.

-iw wordfile.txt
The specified file contains the words that will be used for scoring headlines.

-is sitefile.txt
The file contains the web addresses that will be scanned for headlines.

-ih headlines.txt
This tells NuzeBot to load the headlines from the headlines.txt file. This is useful when you want to create multiple output pages about different topics.

-ip "mycmd -myoption"
This tells NuzeBot to use a pipe instead of a library function to load pages. You could specify a custom command, perhaps using wget or curl, or specify 'c' or 'w' to use generic curl or wget command lines.

Output options:
Output options begin with the letter 'o'.

-of outfile.htm
name of the output file (nuze.htm by default)
Specify - for stdout.

-ot "My News Page"
title of your news page

-oh headlines.txt
This tells NuzeBot to save the headlines to a file and quit. These headlines can be loaded later using the -ih headlines.txt option. When creating multiple news pages, save time and avoid unnecessary HTTP requests by saving headlines to a file to use for generating all news pages.

-ox cal
This tells NuzeBot to execute a command when done.

-om 100
maximum number of headlines to show

-ol -1
lower limit for scores of headlines to show

-op 0
whether to show the full page or just headlines
Set to zero if you only want the headlines.

-oi 0
whether to include informative stats at the bottom of the output page
Set to zero to disable stats.

-oe 1
whether to show extra information about hyperlinks on mouseover

-ov 1
set verbosity
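
To show how the -ol and -om options could work together, here is a minimal Python sketch of picking the headlines for the page. It is an illustration, not the NuzeBot's actual logic. The function name is hypothetical, the default values simply mirror the example values shown above, and the sketch assumes that a headline scoring exactly the lower limit is still shown.

```python
def select_headlines(scored, limit=-1, maximum=100):
    """Pick headlines to show from a list of (score, headline) pairs."""
    # Assumption: a score equal to the limit is still shown.
    kept = [item for item in scored if item[0] >= limit]
    # Highest scores first; ties keep their original order.
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:maximum]

scored = [(50, "Cheese prices rise"), (-9999, "Toadstool season"),
          (0, "Terms of service"), (100, "Butter and cheese festival")]
for score, headline in select_headlines(scored, limit=0, maximum=2):
    print(score, headline)
```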

Web interface options:
Web interface options begin with the letter 'w'.

-wp 8008
port to use for web interface

Other options:

-r ntw
There are three ways to run the NuzeBot: now, timer, and web interface. Specify any combination of 'n', 't', and 'w'.

-d 5
delay in seconds to wait between page loads

-c 9
sets compression level (0-9) for the mem.dat file

-?
shows help

Compiling

To compile the NuzeBot on Linux, simply run the new ./setup.sh script.

In an effort to make it easy for you to get started with minimal hassle, we've made it so you can compile the NuzeBot without any of the libraries that add extra features. To change these build options, just define the string in capital letters. The best way is to use the -D arg when calling GCC, which is normally done via the compile.sh file when compiling NuzeBot. For example, to define NOCURL when compiling with GCC, add -DNOCURL to the command line. Build options work on Linux or Windows.

NOCURL
Define this to build without libcurl support. You will need to use the -ip option to specify a pipe.

NOZLIB
Define this to build without zlib support. This option simply disables compression of the memory file.

NOREG
Define this to build without support for regular expressions. You can use the pre-existing syntax for matching with the words.txt file as explained above.

NOWEB
Define this to build without the web interface.

STRNDUP
Define this to build the bot with its own strndup function in case you get a compile error for a missing strndup.

LITE
Define this to build the lite NuzeBot, which uses no unusual DLLs. Defining LITE is the same as defining NOREG, NOCURL, and NOZLIB.

Build options are a new feature for NuzeBot and have not been thoroughly tested.

Websites

Keep up with the NuzeBot project at the following web address:

https://sourceforge.net/p/nuzebot/

Contact

If you don't want to use the websites, you can send your comments, bug reports, and feature requests to the following email address:
wrspain@gmx.us

License

The files in the NuzeBot project are Copyright ©2021 Ron Spain and are provided under the MIT license, a comparatively permissive license for open source projects.

Share and enjoy.