Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
ba14aaab68 remove optional and multivalued columns from tantivy as we only use full columnar indices 2024-07-04 14:12:14 +02:00
Mikkel Denker
8af2144898 mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stract's 2024-07-02 09:08:21 +02:00
Mikkel Denker
f46abd0511 move tantivy dependencies up to workspace for consistent versioning between crates 2024-07-01 20:40:15 +02:00
Mikkel Denker
297f79d46b store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment 2024-07-01 15:11:52 +02:00
Mikkel Denker
28d2eff2c2 remove unused feature flags from tantivy 2024-07-01 14:32:23 +02:00
Mikkel Denker
3e5875839b use workspace fst in tantivy 2024-07-01 14:22:34 +02:00
Mikkel Denker
b15261b003 remove some unused dependencies 2024-07-01 14:12:58 +02:00
Mikkel Denker
4fb6af6fed remove aggregations for simplicity 2024-07-01 13:25:32 +02:00
Mikkel Denker
4306586763 fix clippy warnings 2024-07-01 13:09:38 +02:00
Mikkel Denker
a3292143d3 fix clippy warnings 2024-07-01 12:42:02 +02:00
Mikkel Denker
454774dfa7 fork tantivy
our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user's. forking tantivy allows us to customize it directly for our use case
2024-07-01 11:26:41 +02:00
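As a rough illustration of the limit behind the fork: a byte count held in a u32 tops out just under 4 GiB, while a u64 holds it without loss. The helper name below is hypothetical and is not part of tantivy.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: try to fit a position-byte count into a u32,
/// as a format with a 32-bit length field would require.
/// Returns `None` when the count does not fit.
fn fits_in_u32(num_position_bytes: u64) -> Option<u32> {
    u32::try_from(num_position_bytes).ok()
}
```

A segment with 5 GiB of positions makes this return `None`, which is the case the u64 change is meant to handle.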
Mikkel Denker
3ab9c301e4 update config files to match new dual_encoder path 2024-06-28 15:21:56 +02:00
Mikkel Denker
e5cc6f8442 add total num hosts and domains to crawlplan stats 2024-06-28 14:05:55 +02:00
Mikkel Denker
f6995402a1 bump crawler version 2024-06-28 13:50:58 +02:00
Mikkel Denker
071e41d167 ensure reasonable limit for robotstxt files
currently same as html body (32mb)
2024-06-28 13:47:09 +02:00
Mikkel Denker
e1ccaf9251 crawler robustness
• delay between robots.txt requests
• retry robots.txt requests 3 times
• if /robots.txt request returns anything except 404 or 200 (times out etc.), don’t crawl site
• respect crawldelay (up to max limit)
• respect 429 Retry-After header (up to max limit)
• increase timeout in reqwest client
• only visit sites on port 80 and 443
2024-06-28 13:35:37 +02:00
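The robots.txt and delay rules listed in this commit can be sketched roughly as below. All names are hypothetical and this is not the crawler's actual code. Treating a 404 as "no restrictions" is an assumption; the commit only says that anything other than 200 or 404 means the site is not crawled.

```rust
use std::time::Duration;

/// What to do with a site after requesting /robots.txt.
#[derive(Debug, PartialEq)]
enum RobotsOutcome {
    /// 200: parse the body and obey its rules.
    ParseRules,
    /// 404: assumed here to mean the site publishes no restrictions.
    NoRestrictions,
    /// Anything else, including timeouts: do not crawl the site.
    SkipSite,
}

/// `status` is `None` when the request failed without a response (e.g. a timeout).
fn robots_outcome(status: Option<u16>) -> RobotsOutcome {
    match status {
        Some(200) => RobotsOutcome::ParseRules,
        Some(404) => RobotsOutcome::NoRestrictions,
        _ => RobotsOutcome::SkipSite,
    }
}

/// Honour a requested delay (crawl delay or Retry-After),
/// but never wait longer than `max`.
fn capped_delay(requested: Duration, max: Duration) -> Duration {
    requested.min(max)
}
```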
Mikkel Denker
303a2cf2da accept unicode-3.0 license 2024-06-27 17:12:28 +02:00
Mikkel Denker
126eedc0d0 way more robust robotstxt parser 2024-06-27 16:10:41 +02:00
Mikkel Denker
95f0703602 approximate centrality naming consistency with exact harmonic 2024-06-24 11:51:57 +02:00
Mikkel Denker
385e8375c6 deduplicate urls during indexing 2024-06-24 10:43:12 +02:00
Mikkel Denker
3844e29bd1 make bm25 constants configurable for each field 2024-06-21 15:16:07 +02:00
Mikkel Denker
e4ae26470e skip links to/from same domain when calculating harmonic centralities
makes it more expensive for linkfarms
2024-06-21 14:21:31 +02:00
Mikkel Denker
de5c946b58 move last worker folder into correct output location after indexing 2024-06-21 12:16:18 +02:00
Mikkel Denker
1bcf74ec11 remove bm25+ again
while it scales low term frequencies, it also adds a positive score for pages even though they don't match the query at all. that seems like a really bad idea...
2024-06-20 15:58:06 +02:00
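A minimal sketch of the effect described here, using the usual BM25+ per-term formula with a lower-bound constant `delta`. The function is illustrative only, and it assumes the term is scored even when its frequency in the document is zero.

```rust
/// BM25+ contribution of one query term: the BM25 term-frequency component
/// plus a constant `delta`, scaled by the term's IDF.
fn bm25_plus_term(
    idf: f64,
    tf: f64,
    doc_len: f64,
    avg_doc_len: f64,
    k1: f64,
    b: f64,
    delta: f64,
) -> f64 {
    // Length normalisation shared with plain BM25.
    let norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    idf * ((k1 + 1.0) * tf / (norm + tf) + delta)
}
```

With `tf = 0.0` the first summand vanishes and the result is `idf * delta`, so a page that does not contain the term still gets a positive score whenever `delta > 0`. With `delta = 0.0` the formula reduces to plain BM25 and the score is zero.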
Mikkel Denker
5beae3b9a9 simplified bm25f that uses same IDF weight across all fields
e.g. the term 'the' might not be very common in titles but should still be scaled as a less important term than other terms in the query. instead of duplicating all text in the index we approximate the bm25f IDF weight as the highest IDF across the fields
2024-06-20 15:01:41 +02:00
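Reading this message literally, the shared weight can be sketched as the maximum of the per-field IDFs. The IDF formula below is the common Lucene-style variant and is an assumption, as are both function names; the commit does not state the exact formula used.

```rust
/// Lucene-style BM25 IDF (assumed): ln(1 + (N - df + 0.5) / (df + 0.5)).
fn idf(num_docs: u64, doc_freq: u64) -> f64 {
    let n = num_docs as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// One IDF weight for a term across all fields: the highest per-field IDF.
/// `doc_freq_per_field` holds the term's document frequency in each field.
fn shared_idf(num_docs: u64, doc_freq_per_field: &[u64]) -> f64 {
    doc_freq_per_field
        .iter()
        .map(|&df| idf(num_docs, df))
        .fold(0.0, f64::max)
}
```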
Mikkel Denker
2fd6db3cfa bm25+ to not over penalize long documents
https://dl.acm.org/doi/10.1145/2063576.2063584
2024-06-20 10:07:08 +02:00
Mikkel Denker
ca9b249992 [crawler] don't re-crawl url when it redirects to another url 2024-06-18 15:05:16 +02:00
Mikkel Denker
5968185136 [crawler] normalize url before fetch 2024-06-18 14:46:50 +02:00
Mikkel Denker
ad617a8151 [crawler] another check to avoid re-crawling urls 2024-06-18 14:19:15 +02:00
Mikkel Denker
2376f33e19 [crawler] some sites seem to only host robots.txt files on the 'www.' subdomain without redirects 2024-06-18 14:06:56 +02:00
Mikkel Denker
07369e1005 [crawler] continue wandering while budget > 0 and we still know about uncrawled urls 2024-06-18 13:37:29 +02:00
Mikkel Denker
919441850b parse html mime metadata 2024-06-17 14:26:45 +02:00
Mikkel Denker
2bc0d69e00 support both 'linkto:' and 'linksto:' 2024-06-17 12:47:18 +02:00
Mikkel Denker
6aec31525a add a 'linksto' query operator 2024-06-17 12:33:10 +02:00
Mikkel Denker
a1c9721b6c remove some telemetry hosts from list of ad servers so pages aren't mistakenly tagged as containing ads (#209) 2024-06-12 17:29:21 +02:00
Mikkel Denker
feafa7507a fix webgraph merge bug of missing edges
the edges were compared by their sort_key, but some nodes don't have a centrality value (sort_key), so they were erroneously mistaken for other nodes and skipped. it's better to compare directly on the node id
2024-06-12 12:58:24 +02:00
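The class of bug described in this commit can be illustrated with a small deduplication sketch. The types and functions are hypothetical and much simpler than the real merge code: when identity is decided by an optional sort key, every node without a key looks like the same node.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node {
    id: u64,
    /// Centrality-derived key; missing for nodes without a centrality value.
    sort_key: Option<u64>,
}

/// Buggy variant: distinct nodes with `sort_key == None` collapse into one.
fn dedup_by_sort_key(mut nodes: Vec<Node>) -> Vec<Node> {
    nodes.sort_by_key(|n| n.sort_key);
    nodes.dedup_by_key(|n| n.sort_key);
    nodes
}

/// Fixed variant: identity is decided by the node id alone.
fn dedup_by_id(mut nodes: Vec<Node>) -> Vec<Node> {
    nodes.sort_by_key(|n| n.id);
    nodes.dedup_by_key(|n| n.id);
    nodes
}
```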
Mikkel Denker
a7b61559db optimize trivial segment merges 2024-06-11 15:57:47 +02:00
Mikkel Denker
a9eb8acd80 store 'rel' attribute for each edge in the webgraph
this allows us to skip links to tag pages etc. when calculating harmonic centrality which should greatly improve the centrality values for the page graph
2024-06-11 15:34:54 +02:00
Mikkel Denker
efa6f9fab0 ability to convert some of the wander budget to scheduled urls
we might know about more urls for each domain than the ones that got a page centrality > 0.0. these urls won't be scheduled unless we convert some of the wander budget back to scheduled urls
2024-06-10 14:11:36 +02:00
Mikkel Denker
817bda9738 optionally merge all webgraph segments into a single segment for improved read performance 2024-06-09 14:48:11 +02:00
Mikkel Denker
295444de4e store sort-key in webgraph edges to properly apply limits with multiple segments
this should also allow us to merge segments which should improve read performance
2024-06-07 11:13:17 +02:00
Mikkel Denker
07663c1687 no need to consider nodes where we have found a better path 2024-06-06 17:31:27 +02:00
Mikkel Denker
90980a1aa4 no need to consider nodes where we have found a better path 2024-06-06 17:05:32 +02:00
Mikkel Denker
6bc7a5caf0 simplify dijkstra to reduce allocations and btree searches 2024-06-06 15:20:10 +02:00
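The idea behind these Dijkstra commits, skipping queue entries for nodes that already have a better path, is commonly written as below. This is a generic textbook sketch over a plain adjacency list, not the project's implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Shortest distances from `source`; `adj[u]` lists `(v, weight)` edges.
/// Unreachable nodes stay `None`.
fn dijkstra(adj: &[Vec<(usize, u64)>], source: usize) -> Vec<Option<u64>> {
    let mut dist: Vec<Option<u64>> = vec![None; adj.len()];
    let mut heap = BinaryHeap::new();
    dist[source] = Some(0);
    // `Reverse` turns the max-heap into a min-heap on distance.
    heap.push(Reverse((0u64, source)));

    while let Some(Reverse((d, node))) = heap.pop() {
        // Stale entry: a shorter path to `node` was already found.
        if dist[node].map_or(false, |best| d > best) {
            continue;
        }
        for &(next, weight) in &adj[node] {
            let candidate = d + weight;
            // Only queue `next` if this path improves on the best known one.
            if dist[next].map_or(true, |best| candidate < best) {
                dist[next] = Some(candidate);
                heap.push(Reverse((candidate, next)));
            }
        }
    }
    dist
}
```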
Mikkel Denker
78c4676165 approx harmonic parallel graph executor 2024-06-06 14:11:46 +02:00
Mikkel Denker
4fbffe63b4 optionally store pages with 0 centrality from approx harmonic 2024-06-06 11:54:04 +02:00
Mikkel Denker
4b459b579b reduce indirections in dht (and hopefully memory usage) by removing some arcs
most dht use cases store very small keys/values but store a ton of them, so the arcs don't really help a lot
2024-06-06 11:23:31 +02:00
Mikkel Denker
f7a354572e reduce memory usage and serialization/deserialization in dht by storing keys and values as enums instead of vec<u8> 2024-06-06 10:39:00 +02:00
Mikkel Denker
3c4a0c480e use remote webgraph in crawl planner 2024-06-05 09:41:03 +02:00
Mikkel Denker
265b1b7871 Ranking diff tool (#207)
* ranking diff tool structure
* fix missing icon types
* add admin for queries and experiments
* minor cleanup
* show experiment progress
* upgrade node adapter for svelte
* hopefully fix ci
* display common queries between experiments
* display serp diffs with top signals for each result
* like experiments and show overview in queries
* settings to toggle experiment shuffle and show/hide signals
* keyboard shortcuts
* visualise improvements by query category
* document how to use tool
2024-06-03 15:00:16 +02:00