@@ -382,7 +382,7 @@ Explanation of the above options:
<b>Skip</b> - Selecting this option will skip indexing the page and it will reappear on the review form after you submit the rest of the pages for crawling.
<br>
<br>
-<b>Bury</b> - Selecting this will move the page to a grave yard (/grave/), a holding place with the same options as /review/ for websites that might have stopped working but that you suspect may come back online. The crawler will detect this automatically and send the page back into review. When you click on the link and see a 404, you can be assured the crawler sent it back to review after failing two update cycles. This also happens if the title of the page changes. The crawler will only do this for pages directly submitted by people. This curtesy is not given to websites that are automatically crawled but then fail to work later on. For those sites, after two failed update cycles, the page will be removed.
+<b>Bury</b> - Selecting this will move the page to a graveyard (/grave/), a holding place with the same options as /review/ for websites that might have stopped working but that you suspect may come back online. The crawler will detect this automatically and send the page back into review. When you click on the link and see a 404, you can be assured the crawler sent it back to review after failing two update cycles. This also happens if the title of the page changes. The crawler will only do this for pages directly submitted by people. This courtesy is not given to websites that are automatically crawled but then fail to work later on. For those sites, after two failed update cycles, the page will be removed.
<br>
<br>
<b>Deny</b> - Select this to drop the page from being indexed. If the page does not meet your submission criteria, this would be the option to remove it from the queue.
@@ -456,10 +456,10 @@ If you need to stop the web crawler in a situation where it was accidently queue
You can help ensure sub-second search queries as your index grows by building MySQL replica servers on a local network close to each other. Run the core application AND the replication tracker (rt) on one or more replica servers and point your reverse proxy to use it. Edit the servers.csv file for rt to indicate all available replica IPs.
<br>
<br>
-There are two methods of scaling, the default method reads different sections of the 'windex' table, dividing the work of searching those sections between replicas. The second method is a little more complicated, where the crawler can store different sections of the 'windex' table into shard tables (ws0 to wsX) where all or some of them can be duplicated on a replica server. The servers.csv file includes the corresponding shard table (ws0 to wsX) names to use, although they won't be referenced if you stick with the default method. You can include made-up shard table names in that case.
+There are two methods of scaling. The default method reads different sections of the 'windex' table, dividing the work of searching those sections between replicas. The second method is more complicated: the crawler can store different sections of the 'windex' table in shard tables (ws0 to wsX), all or some of which can be duplicated on a replica server. The servers.csv file includes the corresponding shard table (ws0 to wsX) names to use, although they won't be referenced if you stick with the default method. You can include made-up shard table names in that case.
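To make the default method more concrete, here is a minimal Go sketch of splitting a table into contiguous sections, one per replica. It is an illustration only and not the core application's actual code: the `Section` type, the `splitSections` function, and the row counts are all hypothetical.

```go
package main

import "fmt"

// Section is a hypothetical half-open row range [Start, End) of the 'windex' table.
type Section struct {
	Start, End int
}

// splitSections divides totalRows into one contiguous section per replica.
// Earlier sections absorb the remainder, so sizes differ by at most one row.
// This is an assumed partitioning rule, not the one the real application uses.
func splitSections(totalRows, replicas int) []Section {
	if replicas <= 0 || totalRows < 0 {
		return nil
	}
	sections := make([]Section, 0, replicas)
	base, extra := totalRows/replicas, totalRows%replicas
	start := 0
	for i := 0; i < replicas; i++ {
		size := base
		if i < extra {
			size++
		}
		sections = append(sections, Section{Start: start, End: start + size})
		start += size
	}
	return sections
}

func main() {
	// Example: 10 rows searched by 3 replicas.
	for i, s := range splitSections(10, 3) {
		fmt.Printf("replica %d searches rows [%d, %d)\n", i, s.Start, s.End)
	}
}
```

Each replica searches only its own section, which is the division of work the default method relies on. With the sharding method, the same kind of section would instead live in its own shard table (ws0 to wsX).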
<br>
<br>
-To try out the sharding method, indicate the number of shards in the 'shards' file that the crawler references. Also set 'shards := true' on line 48 of the core application. You also have to initially balance out the shard tables, which is explained <a href="guide.html#balance">here</a>. This method offers some slight speed improvements and less hard-drive storage across replicas if you use <a href="https://mydbops.wordpress.com/2021/09/24/replication-filters-in-mysql-an-overview/">replication filtering</a>, but also requires the primary database and full replicas be twice as big.
+To try out the sharding method, indicate the number of shards in the 'shards' file that the crawler references. Also set 'shards := true' on line 48 of the core application. You also have to initially balance out the shard tables, which is explained <a href="guide.html#balance">here</a>. This method offers a speed advantage for exact searches and uses less hard-drive storage across replicas if you use <a href="https://mydbops.wordpress.com/2021/09/24/replication-filters-in-mysql-an-overview/">replication filtering</a>, but it also requires the primary database and full replicas to use double the storage.
<br>
<br>
If you have a machine with a huge amount of resources and cores, entering multiple duplicate entries to the same server inside servers.csv (e.g. one for each CPU core) also works. With the sharding method, you must include additional shard tables on that replica for each duplicate connection.
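As a rough sketch of the one-entry-per-core idea, the Go snippet below builds a list of duplicate connections to a single replica and gives each one its own shard table name. The `Entry` type, the `duplicateEntries` function, and the host address are hypothetical, and the snippet does not reproduce the real column layout of servers.csv, which this section does not describe.

```go
package main

import (
	"fmt"
	"runtime"
)

// Entry is a hypothetical in-memory stand-in for one servers.csv row.
// The real file's columns are not shown here.
type Entry struct {
	Host       string
	ShardTable string
}

// duplicateEntries returns one entry per core for the same host.
// Each duplicate gets a distinct shard table, starting at ws<firstShard>,
// mirroring the rule that every duplicate connection needs its own shard
// table when the sharding method is used.
func duplicateEntries(host string, cores, firstShard int) []Entry {
	if cores <= 0 || firstShard < 0 {
		return nil
	}
	entries := make([]Entry, cores)
	for i := range entries {
		entries[i] = Entry{
			Host:       host,
			ShardTable: fmt.Sprintf("ws%d", firstShard+i),
		}
	}
	return entries
}

func main() {
	// runtime.NumCPU reports the logical CPUs usable by this process.
	for _, e := range duplicateEntries("10.0.0.2", runtime.NumCPU(), 0) {
		fmt.Printf("%s -> %s\n", e.Host, e.ShardTable)
	}
}
```

With the default method the shard table names are not referenced, so the duplicates could carry made-up names instead.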