From d12811c6f65afa228ac43916f6e78461c855cbeb Mon Sep 17 00:00:00 2001
From: wibyweb <49052850+wibyweb@users.noreply.github.com>
Date: Wed, 13 Jul 2022 22:07:17 -0400
Subject: [PATCH] Update guide.html

---
 html/about/guide.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/html/about/guide.html b/html/about/guide.html
index 5044f48..386127e 100755
--- a/html/about/guide.html
+++ b/html/about/guide.html
@@ -245,13 +245,13 @@ You may want to run this on startup, easiest way to set that is with a cron job
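As a minimal sketch of the cron approach mentioned in the hunk header above (the command path is a placeholder, since only a fragment of that paragraph is visible here):

# hypothetical @reboot crontab entry; replace /path/to/command with whatever should run on startup
@reboot /path/to/command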

Start the Crawler

It is best to run the crawler in a screen session so that you can monitor its output. You can have more than one crawler running as long as you keep them in separate directories, include a symlink to the same robots folder, and also set the correct parameters on each.
-To view the parameters, type './cr -h'. Without any parameters set, you can only run one crawler (which is probably all you need anyway).
+To view the parameters, type './cr -h'. Without any parameters set, you can only run one crawler (which is probably all you need anyway). If necessary, you can change the connection from 'localhost' to a different IP from inside cr.c, then rebuild.
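A rough shell sketch of the multi-crawler layout described above (the directory names are assumptions, not taken from the guide):

# hypothetical second crawler instance; directory names are assumptions
mkdir ../crawler2 && cp cr ../crawler2/
ln -s "$(pwd)/robots" ../crawler2/robots   # symlink to the same robots folder
cd ../crawler2 && ./cr -h                  # review the parameters, then start this copy in its own screen session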

Note that you may need to change the crawler's user-agent if you have issues indexing some websites. Pages that fail to index are noted inside of abandoned.txt.

-Make sure the robots folder exists. robots.txt files are stored in the robots folder and are downloaded once and then referenced from that folder on future updates. Clear this folder every few weeks to ensure robots.txt files get refreshed from time to time.
+Make sure the robots folder exists. All robots.txt files are stored in the robots folder. They are downloaded once and then referenced from that folder on future updates. Clear this folder every few weeks to ensure robots.txt files get refreshed from time to time. You can turn off checking for robots.txt files by commenting out the line calling the "checkrobots" function inside of cr.c.
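One way to automate the periodic cleanup mentioned in the added line is a monthly cron job that empties the robots folder, forcing fresh robots.txt downloads on the next update. A sketch, assuming a hypothetical install path:

# hypothetical monthly crontab entry; /home/user/crawler is an assumed path
0 3 1 * * rm -f /home/user/crawler/robots/*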