377 commits

Author SHA1 Message Date
Mikkel Denker
ecb495a66c fix flaky test 2024-12-11 10:40:43 +01:00
Mikkel Denker
0912e159fd ampc shortest path test case 2024-12-11 10:12:20 +01:00
Mikkel Denker
e24a389132 approximate centrality progress bar 2024-12-10 15:42:36 +01:00
Mikkel Denker
bd9f1cb435 [ampc] ability to send requests without timeouts (should be used sparingly) 2024-12-10 15:10:10 +01:00
Mikkel Denker
852a10c08e use shortest paths as subroutine in approx centrality 2024-12-10 14:24:12 +01:00
Mikkel Denker
38362b723c support WARC-Date in .warc files 2024-12-10 10:41:39 +01:00
Mikkel Denker
bd349549eb optional max distance in shortest paths 2024-12-09 15:52:52 +01:00
Mikkel Denker
774dbd87fb optimise shortest path to use exact changed nodes (stored in hashset) if there are very few updated nodes 2024-12-09 15:31:25 +01:00
Mikkel Denker
037ec2cc9d combine sketches across workers for more precise bloom filters 2024-12-09 14:27:53 +01:00
Mikkel Denker
49abc7419f bellman-ford inspired shortest path for distributed graph. works better than the approach in approx centrality when the graph is sharded. still need to implement some low-hanging optimisations 2024-12-09 14:16:12 +01:00
Mikkel Denker
7633b61ef3
Improve code docs (#246)
* document all entrypoints

* document ampc framework

* document ranking pipeline

* document the different searchers

* document generic search query flow

* document main crawler elements
2024-12-05 14:19:53 +01:00
Mikkel Denker
040d04413e
Web spell as dedicated module (#240)
* separate web-spell into a dedicated module

* web-spell readme
2024-11-29 15:15:18 +01:00
Mikkel Denker
05d3cf9de5
urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities (#239) 2024-11-29 12:17:01 +01:00
Mikkel Denker
3b2a5f7895 [webgraph] sort edges by centrality rank instead of raw centrality
this ensures consistent ordering of edges across segments/shards
2024-11-28 11:58:19 +01:00
Mikkel Denker
3e683b02f8 no need to collect all page nodes when calculating centrality. there will be a few duplicates when iterating but that's okay as their hyperloglogs will stay the same 2024-11-28 10:12:03 +01:00
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
83d1e7d7e0 don't lowercase path of url when normalising
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 15:07:15 +01:00
Mikkel Denker
3d4b272ed8 don't lowercase path of url when normalising
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 14:54:58 +01:00
Mikkel Denker
46016ab03d [tantivy] cached column to reduce disk reads 2024-11-27 14:36:22 +01:00
Mikkel Denker
761eba772e progress messages when starting workers 2024-11-27 14:00:02 +01:00
Mikkel Denker
9d507debce [webgraph] control whether or not to skip self links in more queries 2024-11-26 10:51:26 +01:00
TheIronBorn
bf70a4666f
remove extra hash in related_entities (#153) 2024-11-26 10:29:36 +01:00
TheIronBorn
7331dd6123
reduce allocations in metrics.rs (#237) 2024-11-26 10:29:08 +01:00
Mikkel Denker
f241c3238c [webgraph] control whether or not to skip self links in group_by queries 2024-11-25 14:43:55 +01:00
Mikkel Denker
13f06e41ba [webgraph] add from_centrality, to_centrality etc. fields to edges 2024-11-23 12:57:51 +01:00
Mikkel Denker
efbe1db918 [tantivy] support u128 in row order 2024-11-23 10:53:08 +01:00
Mikkel Denker
6deb599d78 [webgraph] optionally deduplicate edges 2024-11-20 11:38:12 +01:00
Mikkel Denker
0681588b2a ugc rel tag and filter during centrality calculation 2024-11-20 10:19:07 +01:00
Mikkel Denker
12e9502e80
Improve API documentation (#235)
* add docusaurus scalar api documentation structure

* bump openapi 3.0 to 3.1 so we can mark internal endpoints

* improve search api docs

* webgraph api docs

* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
6048d9f133 support filters in similar hosts 2024-11-15 14:05:27 +01:00
Mikkel Denker
0d0405caa6 [webgraph] rename *params -> *query 2024-11-15 10:23:26 +01:00
Mikkel Denker
5639813d89 [webgraph] speed up similar_hosts by estimating the best candidate nodes based on how many of the backlink nodes they have a link from 2024-11-15 10:08:44 +01:00
Mikkel Denker
f56012e770 [webgraph] refactor url tokenizer into smaller helper functions 2024-11-13 11:36:51 +01:00
Mikkel Denker
3e88c6e8a9 [webgraph] match non-url-like queries against any term in parsed url 2024-11-13 10:25:19 +01:00
Mikkel Denker
b4a68a7385 [webgraph] a not-filter inside an or-filter mistakenly filtered everything 2024-11-13 10:15:00 +01:00
Mikkel Denker
e94cd2de11 [webgraph] rename HostLinkQuery to LinksQuery 2024-11-12 14:31:57 +01:00
Mikkel Denker
369ab36fe0 [webgraph] apply filters in between query 2024-11-12 13:22:55 +01:00
Mikkel Denker
fb7c191c73 fix typo 2024-11-12 11:13:24 +01:00
Mikkel Denker
444ef9fdce [webgraph] apply column filters in HostGroupSketchQuery 2024-11-12 11:04:51 +01:00
Mikkel Denker
fb4aec5424 [webgraph] rel flags filter (nofollow etc) 2024-11-12 08:42:17 +01:00
Mikkel Denker
48eebbef43 [webgraph] filters in group_by queries 2024-11-11 14:13:11 +01:00
Mikkel Denker
e4a8a9f2fe [webgraph] query filters 2024-11-11 11:02:38 +01:00
Mikkel Denker
abef4f26ff treat live index as an extra shard in search index to re-use result merge logic etc 2024-11-08 14:32:54 +01:00
Mikkel Denker
3459405b45 use specific shard_ids for search servers (backbone and live) so we can potentially simply treat the live index as extra shards and re-use all merge logic etc. for searches 2024-11-07 14:34:54 +01:00
Mikkel Denker
14e6e11058 [search_server] implement GetSiteUrls as generic query 2024-11-07 12:10:36 +01:00
Mikkel Denker
846a853969 [search_server] implement get homepage as GenericQuery 2024-11-06 15:18:18 +01:00
Mikkel Denker
2f14ed4898 [search_server] implement get webpage as GenericQuery 2024-11-06 14:48:01 +01:00
Mikkel Denker
11e30b1391 [search_server] implement size query as GenericQuery 2024-11-06 12:13:18 +01:00
Mikkel Denker
94d319209b [tantivy] reduce memory usage when writing postings
each posting list is already sorted by the new document ids (even without index sorting). if new(a) < new(b) => old(a) < old(b) and vice versa. the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them
2024-11-05 17:26:52 +01:00
Mikkel Denker
aba861ea37 [search_server] implement top key phrases as generic query 2024-11-05 14:12:55 +01:00