our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble that of the average user. forking tantivy allows us to customize it directly for our use case
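
a minimal sketch of the overflow that motivates the change. the helper names are hypothetical and not part of tantivy; the only point is that a byte count above `u32::MAX` (about 4 GiB) cannot be represented without widening to u64.

```rust
/// Hypothetical helper: total bytes needed for positions in a segment,
/// summed from per-term byte counts.
pub fn total_position_bytes(per_term_bytes: &[u64]) -> u64 {
    per_term_bytes.iter().sum()
}

/// Returns the count as a u32 when it fits, and `None` once it exceeds
/// `u32::MAX`, which is the case the old assumption did not handle.
pub fn as_u32_offset(total: u64) -> Option<u32> {
    if total <= u32::MAX as u64 {
        Some(total as u32)
    } else {
        None
    }
}
```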
• delay between robots.txt requests
• retry robots.txt requests 3 times
• if /robots.txt request returns anything except 404 or 200 (times out etc.), don’t crawl site
• respect Crawl-delay (up to max limit)
• respect 429 Retry-After header (up to max limit)
• increase timeout in reqwest client
• only visit sites on port 80 and 443
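
the rules above can be sketched as a few pure functions. this is an illustration only, not the crawler's actual code: the type and function names are hypothetical, and `MAX_DELAY` is an assumed value since the list only says "max limit".

```rust
use std::time::Duration;

/// Assumed upper bound for any server-requested delay.
pub const MAX_DELAY: Duration = Duration::from_secs(180);

/// What to do with a site after requesting /robots.txt.
#[derive(Debug, PartialEq, Eq)]
pub enum RobotsDecision {
    /// 200: parse the file and obey its rules.
    ParseRules,
    /// 404: there is no robots.txt, so nothing is disallowed.
    AllowAll,
    /// Any other outcome (5xx, timeout, ...): don't crawl the site.
    SkipSite,
}

/// `status` is `None` when no response arrived at all (e.g. a timeout).
pub fn robots_decision(status: Option<u16>) -> RobotsDecision {
    match status {
        Some(200) => RobotsDecision::ParseRules,
        Some(404) => RobotsDecision::AllowAll,
        _ => RobotsDecision::SkipSite,
    }
}

/// Caps a Crawl-delay or Retry-After value at `MAX_DELAY`.
pub fn capped_delay(requested: Duration) -> Duration {
    requested.min(MAX_DELAY)
}

/// Only the default HTTP and HTTPS ports are visited.
pub fn port_allowed(port: u16) -> bool {
    matches!(port, 80 | 443)
}
```

capping the delay keeps a single misconfigured site from stalling a worker indefinitely while still honoring reasonable requests.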
while it scales low term frequencies, it also adds a positive score for pages even though they don't match the query at all. that seems like a really bad idea...
e.g. the term 'the' might not be very common in titles, but it should still be weighted as a less important term than the other terms in the query. instead of duplicating all text in the index, we approximate the BM25F IDF weight as the highest IDF across the fields
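
a rough sketch of the approximation, assuming the common non-negative BM25 IDF form ln(1 + (N - n + 0.5) / (n + 0.5)) and assuming each field's document frequency does not exceed the total document count. the function names are hypothetical and the real ranking code may differ.

```rust
/// BM25 IDF for a term that occurs in `doc_freq` of `total_docs` documents.
pub fn idf(doc_freq: u64, total_docs: u64) -> f64 {
    let n = doc_freq as f64;
    let total = total_docs as f64;
    (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
}

/// Approximates a single BM25F IDF weight for a term as the highest
/// per-field IDF, given the term's document frequency in each field.
/// Returns 0.0 when no fields are given.
pub fn approx_bm25f_idf(doc_freq_per_field: &[u64], total_docs: u64) -> f64 {
    doc_freq_per_field
        .iter()
        .map(|&df| idf(df, total_docs))
        .fold(0.0, f64::max)
}
```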
the edges were compared by their sort_key, but some nodes don't have a centrality value (sort_key), so they were mistaken for other nodes and skipped. it's better to compare directly on the node id
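
a hypothetical illustration of this class of bug, not the actual graph code: the `Edge` struct, the deduplication step, and the use of 0 as the sort_key for nodes without a centrality value are all assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Edge {
    /// Id of the node the edge points to.
    pub to_node: u64,
    /// Centrality-derived key; assumed to be 0 when the node has no centrality.
    pub sort_key: u64,
}

/// Buggy variant: treats edges with equal sort_key as the same target,
/// so distinct nodes without a centrality value collapse into one.
pub fn dedup_by_sort_key(mut edges: Vec<Edge>) -> Vec<Edge> {
    edges.sort_by_key(|e| e.sort_key);
    edges.dedup_by_key(|e| e.sort_key);
    edges
}

/// Fixed variant: identifies the target by node id only.
pub fn dedup_by_node(mut edges: Vec<Edge>) -> Vec<Edge> {
    edges.sort_by_key(|e| e.to_node);
    edges.dedup_by_key(|e| e.to_node);
    edges
}
```

with two distinct nodes that both lack a centrality value, the first function keeps only one of them while the second keeps both.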
this allows us to skip links to tag pages etc. when calculating harmonic centrality, which should greatly improve the centrality values for the page graph
we might know about more urls for each domain than the ones that got a page centrality > 0.0. these urls won't be scheduled unless we convert some of the wander budget back to scheduled urls
* ranking diff tool structure
* fix missing icon types
* add admin for queries and experiments
* minor cleanup
* show experiment progress
* upgrade node adapter for svelte
* hopefully fix ci
* display common queries between experiments
* display serp diffs with top signals for each result
* like experiments and show overview in queries
* settings to toggle experiment shuffle and show/hide signals
* keyboard shortcuts
* visualise improvements by query category
* document how to use tool