Add files via upload

wibyweb 2023-04-16 00:52:25 -04:00 committed by GitHub
parent ec8a162133
commit 256f367f7f


@@ -453,20 +453,13 @@ If you need to stop the web crawler in a situation where it was accidentally queue
<hr>
<h2><a name="scale">Scaling the Search Engine</a></h2>
<br>
You can help ensure sub-second search queries as your index grows by building MySQL replica servers on a local network close to each other, running the core application AND the replication tracker (rt) on one or more replica servers, and pointing your reverse proxy to use it. Edit the servers.csv file for rt to indicate all available replica IPs.
<br>
<br>
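For illustration only (the exact column layout is an assumption here, not taken from this guide), a servers.csv listing three replicas on a LAN, each paired with a shard table name, might look like the following. With the default method the shard names are never referenced, so placeholders are fine:
<br>
<pre>
10.0.0.2,ws0
10.0.0.3,ws1
10.0.0.4,ws2
</pre>
<br>
<br>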
There are two methods of scaling. The default method reads different sections of the 'windex' table, dividing the work of searching those sections between replicas. The second method is more complicated: the crawler can store different sections of the 'windex' table into shard tables (ws0 to wsX), where all or some of them can be duplicated on a replica server. The servers.csv file includes the corresponding shard table (ws0 to wsX) names to use, although they won't be referenced if you stick with the default method; you can include made-up shard table names in that case. If you have a machine with a huge amount of resources and cores, entering multiple duplicate entries for the same server inside servers.csv (e.g. one for each CPU core) also works.
<br>
<br>
To try out the sharding method, indicate the number of shards in the 'shards' file that the crawler references. Also set 'shards := true' on line 48 of the core application. You also have to initially balance out the shard tables, which is explained <a href="guide.html#balance">here</a>. This method offers a speed advantage for exact searches and requires less hard-drive storage across replicas if you use <a href="https://mydbops.wordpress.com/2021/09/24/replication-filters-in-mysql-an-overview/">replication filtering</a>, but it also requires the primary database and full replicas to use double the storage. The core application checks the replication tracker (rt) output to determine if any replicas are online; it then initiates a connection to those replicas and tasks each one with searching a different section of the index, drastically speeding up searches, especially for multi-word queries. By default, single-word queries will not initiate multiple connections across replicas. To enable that on single-word queries, comment out the IF statement on line 396 and rebuild the core application.
<br>
<br>
If you have a machine with a huge amount of resources and cores, entering multiple duplicate entries for the same server inside servers.csv (e.g. one for each CPU core) also works; with the sharding method you must include additional shard tables on that replica for each duplicate connection.
<br>
<br>
The core application checks the replication tracker (rt) output to determine if any replicas are online; it then initiates a connection to those replicas and tasks each one with searching a different section of the index, drastically speeding up searches, especially for multi-word queries. By default, single-word queries will not initiate multiple connections across replicas. To enable that on single-word queries, comment out the IF statement on line 396 and rebuild the core application.
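<br>
<br>
To illustrate the fan-out idea only, here is a hypothetical sketch; it is not the actual core application code, and the table/column names in the query are assumptions. Each online replica is queried concurrently over its own slice of the index:
<br>
<pre>
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// searchReplicas splits the index into one ID range per online replica
// and queries the ranges concurrently, merging the results.
func searchReplicas(dsns []string, query string, maxID int) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []string
	)
	span := maxID / len(dsns) // section of the index per replica
	for i, dsn := range dsns {
		wg.Add(1)
		go func(i int, dsn string) {
			defer wg.Done()
			db, err := sql.Open("mysql", dsn)
			if err != nil {
				return
			}
			defer db.Close()
			// Hypothetical query; the real windex schema may differ.
			rows, err := db.Query(
				"SELECT url FROM windex WHERE id BETWEEN ? AND ? AND MATCH(body) AGAINST(?)",
				i*span, (i+1)*span, query)
			if err != nil {
				return
			}
			defer rows.Close()
			for rows.Next() {
				var url string
				if rows.Scan(&url) == nil {
					mu.Lock()
					results = append(results, url)
					mu.Unlock()
				}
			}
		}(i, dsn)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(searchReplicas([]string{"user:pass@tcp(10.0.0.2)/wiby"}, "example", 1000000))
}
</pre>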
<br>
<br>
The search results per page limit is 12, and the total number of replicas defined in servers.csv must either divide evenly into 12 or be an even multiple of it (e.g. 2, 3, 4, 6, 12, or 24 replicas). If there is an excess of available replicas such that
@@ -475,7 +468,15 @@ they do not divide evenly, those will remain in sync but will not be used for se
<br>
The reverse proxy and replica servers can be connected through a VPN such as WireGuard or OpenVPN; however, the IPs in servers.csv should be the local IPs for the LAN
the replicas are all connected on. <a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-replication-in-mysql">Here</a> is a tutorial for setting up MySQL replicas.
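<br>
<br>
As a rough sketch of the replica side (the option names are standard MySQL, but the values and the 'wiby' database name are assumptions here), each replica's my.cnf needs at minimum a unique server ID, and with the sharding method a replication filter can skip the shard tables a replica doesn't serve:
<br>
<pre>
[mysqld]
server-id = 2     # must be unique per replica
read_only = 1
# sharding method only: replicate just the shard tables this replica serves
# replicate-do-table = wiby.ws0
</pre>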
<br>
<br>
There are two methods of scaling. The default method reads different sections of the 'windex' table, dividing the work of searching those sections between replicas. The second method is more complicated and was developed after the first as an experiment: the crawler can store different sections of the 'windex' table into shard tables (ws0 to wsX), and all or some of them can be duplicated on or across replica servers. The servers.csv file includes the corresponding shard table (ws0 to wsX) names to use, although they won't be referenced if you stick with the default method; you can include made-up shard table names in that case.
<br>
<br>
To try out the sharding method, indicate the number of shards in the 'shards' file that the crawler references. Also set 'shards := true' on line 48 of the core application. You also have to initially balance out the shard tables, which is explained <a href="guide.html#balance">here</a>. This method offers a speed advantage for exact searches and requires less hard-drive storage across replicas if you use <a href="https://mydbops.wordpress.com/2021/09/24/replication-filters-in-mysql-an-overview/">replication filtering</a>, but it also requires the primary database and full replicas to use double the storage.
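<br>
<br>
As an illustration (the file contents are my assumption; only the file name and the line-48 toggle come from the text above), a 'shards' file for three shard tables would contain just the count:
<br>
<pre>
$ cat shards
3
</pre>
<br>
and line 48 of the core application would be set to:
<br>
<pre>
shards := true
</pre>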
<br>
<br>
<br>
<b>Instructions for Building a MySQL Replica:</b>
<br>
<br>
@@ -607,6 +608,7 @@ Make sure to <a href="guide.html#accessshards">give access</a> to the new shard
You will need to rebalance the shards: follow the steps below, then restart the crawler. Going forward it will round-robin insert into those shards as new pages are crawled.
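<br>
<br>
Purely as an illustration of the round-robin idea (this is not the crawler's actual code), new pages cycle through the shard tables like so:
<br>
<pre>
package main

import "fmt"

// shardFor cycles through shard tables ws0..ws(numShards-1) as pages
// are crawled, so inserts are spread evenly across the shards.
func shardFor(pageCount, numShards int) string {
	return fmt.Sprintf("ws%d", pageCount%numShards)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(shardFor(i, 3)) // ws0 ws1 ws2 ws0 ws1
	}
}
</pre>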
<br>
<br>
<br>
<a name="balance"><b>Balancing Additional Shards</b> <a name="balance"><b>Balancing Additional Shards</b>
<br>
<br>
@@ -636,6 +638,7 @@ Repeat those steps for each shard table.
These changes will propagate down to the replicas, and the core application will be able to use them as long as permissions to those tables were added.
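<br>
<br>
For example (the 'wiby' database and 'remote' user names are placeholders, not necessarily the ones this guide uses elsewhere), granting read access to a new shard table on the primary looks like this, and the grant replicates down like any other change:
<br>
<pre>
GRANT SELECT ON wiby.ws4 TO 'remote'@'%';
FLUSH PRIVILEGES;
</pre>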
<br>
<br>
<br>
<a name="accessshards"><b>Accessing Additional Shards</b></a> <a name="accessshards"><b>Accessing Additional Shards</b></a>
<br> <br>
<br> <br>