Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
ecb495a66c fix flaky test 2024-12-11 10:40:43 +01:00
Mikkel Denker
0912e159fd ampc shortest path test case 2024-12-11 10:12:20 +01:00
Mikkel Denker
e24a389132 approximate centrality progress bar 2024-12-10 15:42:36 +01:00
Mikkel Denker
bd9f1cb435 [ampc] ability to send requests without timeouts (should be used sparingly) 2024-12-10 15:10:10 +01:00
Mikkel Denker
852a10c08e use shortest paths as subroutine in approx centrality 2024-12-10 14:24:12 +01:00
Mikkel Denker
38362b723c support WARC-Date in .warc files 2024-12-10 10:41:39 +01:00
Mikkel Denker
bd349549eb optional max distance in shortest paths 2024-12-09 15:52:52 +01:00
Mikkel Denker
774dbd87fb optimise shortest path to use exact changed nodes (stored in hashset) if there are very few updated nodes 2024-12-09 15:31:25 +01:00
Mikkel Denker
037ec2cc9d combine sketches across workers for more precise bloom filters 2024-12-09 14:27:53 +01:00
Mikkel Denker
49abc7419f bellman-ford inspired shortest path for distributed graph. works better than approach in approx centrality when graph is sharded. still need to implement some low-hanging fruits for optimisation 2024-12-09 14:16:12 +01:00
Mikkel Denker
7633b61ef3
Improve code docs (#246)
* document all entrypoints

* document ampc framework

* document ranking pipeline

* document the different searchers

* document generic search query flow

* document main crawler elements
2024-12-05 14:19:53 +01:00
Mikkel Denker
de7291daa1 just update 2024-12-03 15:00:08 +01:00
Mikkel Denker
01de7a107b
Improve zimba+web-spell docs and release the modules under MIT (#242)
* improve web-spell docs

* improve zimba docs

* release zimba + web-spell under MIT
2024-12-02 13:21:26 +01:00
Mikkel Denker
040d04413e
Web spell as dedicated module (#240)
* separate web-spell into a dedicated module

* web-spell readme
2024-11-29 15:15:18 +01:00
Mikkel Denker
05d3cf9de5
urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities (#239) 2024-11-29 12:17:01 +01:00
Mikkel Denker
3b2a5f7895 [webgraph] sort edges by centrality rank instead of raw centrality
this ensures consistent ordering of edges across segments/shards
2024-11-28 11:58:19 +01:00
Mikkel Denker
3e683b02f8 no need to collect all page nodes when calculating centrality. there will be a few duplicates when iterating but that's okay as their hyperloglogs will stay the same 2024-11-28 10:12:03 +01:00
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
83d1e7d7e0 don't lowercase path of url when normalising
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 15:07:15 +01:00
Mikkel Denker
3d4b272ed8 don't lowercase path of url when normalising
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 14:54:58 +01:00
Mikkel Denker
46016ab03d [tantivy] cached column to reduce disk reads 2024-11-27 14:36:22 +01:00
Mikkel Denker
761eba772e progress messages when starting workers 2024-11-27 14:00:02 +01:00
Mikkel Denker
9d507debce [webgraph] control whether or not to skip self links in more queries 2024-11-26 10:51:26 +01:00
TheIronBorn
bf70a4666f
remove extra hash in related_entities (#153) 2024-11-26 10:29:36 +01:00
TheIronBorn
7331dd6123
reduce allocations in metrics.rs (#237) 2024-11-26 10:29:08 +01:00
Mikkel Denker
f241c3238c [webgraph] control whether or not to skip self links in group_by queries 2024-11-25 14:43:55 +01:00
Mikkel Denker
13f06e41ba [webgraph] add from_centrality, to_centrality etc. fields to edges 2024-11-23 12:57:51 +01:00
Mikkel Denker
efbe1db918 [tantivy] support u128 in row order 2024-11-23 10:53:08 +01:00
Mikkel Denker
b81d84762c [tantivy] support u128 column values 2024-11-22 15:26:43 +01:00
Mikkel Denker
7b75a854b7 [tantivy] support u128 values in inverted index 2024-11-22 14:29:15 +01:00
Mikkel Denker
3a48c8c7b4 [tantivy] raw column values codec
there seems to be a bug somewhere in the decompression of one of the other codecs for enormous columns. the to_id and from_id columns in the webgraph seem to point to wrong ids in the webgraph (don't know exactly why). as the ids are random hashes anyway, we don't gain much from compression and can therefore simply store the columns as raw u64 values on disk
2024-11-21 16:03:11 +01:00
Mikkel Denker
6deb599d78 [webgraph] optionally deduplicate edges 2024-11-20 11:38:12 +01:00
Mikkel Denker
0681588b2a ugc rel tag and filter during centrality calculation 2024-11-20 10:19:07 +01:00
Mikkel Denker
d2fa5f4061
use nom in zimba parser (#236) 2024-11-20 10:12:52 +01:00
Mikkel Denker
12e9502e80
Improve API documentation (#235)
* add docusaurus scalar api documentation structure

* bump openapi 3.0 to 3.1 so we can mark internal endpoints

* improve search api docs

* webgraph api docs

* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
6048d9f133 support filters in similar hosts 2024-11-15 14:05:27 +01:00
Mikkel Denker
0d0405caa6 [webgraph] rename *params -> *query 2024-11-15 10:23:26 +01:00
Mikkel Denker
5639813d89 [webgraph] speed up similar_hosts by estimating the best candidate nodes based on how many of the backlink nodes they have a link from 2024-11-15 10:08:44 +01:00
Mikkel Denker
f56012e770 [webgraph] refactor url tokenizer into smaller helper functions 2024-11-13 11:36:51 +01:00
Mikkel Denker
3e88c6e8a9 [webgraph] match non-url-like queries against any term in parsed url 2024-11-13 10:25:19 +01:00
Mikkel Denker
b4a68a7385 [webgraph] a not-filter inside an or-filter mistakenly filtered everything 2024-11-13 10:15:00 +01:00
Mikkel Denker
e94cd2de11 [webgraph] rename HostLinkQuery to LinksQuery 2024-11-12 14:31:57 +01:00
Mikkel Denker
369ab36fe0 [webgraph] apply filters in between query 2024-11-12 13:22:55 +01:00
Mikkel Denker
fb7c191c73 fix typo 2024-11-12 11:13:24 +01:00
Mikkel Denker
444ef9fdce [webgraph] apply column filters in HostGroupSketchQuery 2024-11-12 11:04:51 +01:00
Mikkel Denker
fb4aec5424 [webgraph] rel flags filter (nofollow etc) 2024-11-12 08:42:17 +01:00
Mikkel Denker
48eebbef43 [webgraph] filters in group_by queries 2024-11-11 14:13:11 +01:00
Mikkel Denker
e4a8a9f2fe [webgraph] query filters 2024-11-11 11:02:38 +01:00
Mikkel Denker
abef4f26ff treat live index as an extra shard in search index to re-use result merge logic etc 2024-11-08 14:32:54 +01:00
Mikkel Denker
3459405b45 use specific shard_ids for search servers (backbone and live) so we can potentially simply treat the live index as extra shards and re-use all merge logic etc. for searches 2024-11-07 14:34:54 +01:00