Mikkel Denker
ecb495a66c
fix flaky test
2024-12-11 10:40:43 +01:00
Mikkel Denker
0912e159fd
ampc shortest path test case
2024-12-11 10:12:20 +01:00
Mikkel Denker
e24a389132
approximate centrality progress bar
2024-12-10 15:42:36 +01:00
Mikkel Denker
bd9f1cb435
[ampc] ability to send requests without timeouts (should be used sparingly)
2024-12-10 15:10:10 +01:00
Mikkel Denker
852a10c08e
use shortest paths as subroutine in approx centrality
2024-12-10 14:24:12 +01:00
Mikkel Denker
38362b723c
support WARC-Date in .warc files
2024-12-10 10:41:39 +01:00
Mikkel Denker
bd349549eb
optional max distance in shortest paths
2024-12-09 15:52:52 +01:00
Mikkel Denker
774dbd87fb
optimise shortest path to use exact changed nodes (stored in hashset) if there are very few updated nodes
2024-12-09 15:31:25 +01:00
Mikkel Denker
037ec2cc9d
combine sketches across workers for more precise bloom filters
2024-12-09 14:27:53 +01:00
Mikkel Denker
49abc7419f
bellman-ford inspired shortest path for distributed graph. works better than the approach in approx centrality when the graph is sharded. some low-hanging optimisations still need to be implemented
2024-12-09 14:16:12 +01:00
Mikkel Denker
7633b61ef3
Improve code docs (#246)
...
* document all entrypoints
* document ampc framework
* document ranking pipeline
* document the different searchers
* document generic search query flow
* document main crawler elements
2024-12-05 14:19:53 +01:00
Mikkel Denker
040d04413e
Web spell as dedicated module (#240)
...
* separate web-spell into a dedicated module
* web-spell readme
2024-11-29 15:15:18 +01:00
Mikkel Denker
05d3cf9de5
urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities (#239)
2024-11-29 12:17:01 +01:00
Mikkel Denker
3b2a5f7895
[webgraph] sort edges by centrality rank instead of raw centrality
...
this ensures consistent ordering of edges across segments/shards
2024-11-28 11:58:19 +01:00
Mikkel Denker
3e683b02f8
no need to collect all page nodes when calculating centrality. there will be a few duplicates when iterating but that's okay as their hyperloglogs will stay the same
2024-11-28 10:12:03 +01:00
Mikkel Denker
1d821ef4db
rustup update and fix clippy warnings
2024-11-27 17:15:24 +01:00
Mikkel Denker
83d1e7d7e0
don't lowercase path of url when normalising
...
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 15:07:15 +01:00
Mikkel Denker
3d4b272ed8
don't lowercase path of url when normalising
...
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 14:54:58 +01:00
Mikkel Denker
46016ab03d
[tantivy] cached column to reduce disk reads
2024-11-27 14:36:22 +01:00
Mikkel Denker
761eba772e
progress messages when starting workers
2024-11-27 14:00:02 +01:00
Mikkel Denker
9d507debce
[webgraph] control whether or not to skip self links in more queries
2024-11-26 10:51:26 +01:00
TheIronBorn
bf70a4666f
remove extra hash in related_entities (#153)
2024-11-26 10:29:36 +01:00
TheIronBorn
7331dd6123
reduce allocations in metrics.rs (#237)
2024-11-26 10:29:08 +01:00
Mikkel Denker
f241c3238c
[webgraph] control whether or not to skip self links in group_by queries
2024-11-25 14:43:55 +01:00
Mikkel Denker
13f06e41ba
[webgraph] add from_centrality, to_centrality etc. fields to edges
2024-11-23 12:57:51 +01:00
Mikkel Denker
efbe1db918
[tantivy] support u128 in row order
2024-11-23 10:53:08 +01:00
Mikkel Denker
6deb599d78
[webgraph] optionally deduplicate edges
2024-11-20 11:38:12 +01:00
Mikkel Denker
0681588b2a
ugc rel tag and filter during centrality calculation
2024-11-20 10:19:07 +01:00
Mikkel Denker
12e9502e80
Improve API documentation (#235)
...
* add docusaurus scalar api documentation structure
* bump openapi 3.0 to 3.1 so we can mark internal endpoints
* improve search api docs
* webgraph api docs
* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
6048d9f133
support filters in similar hosts
2024-11-15 14:05:27 +01:00
Mikkel Denker
0d0405caa6
[webgraph] rename *params -> *query
2024-11-15 10:23:26 +01:00
Mikkel Denker
5639813d89
[webgraph] speedup similar_hosts by estimating the best candidate nodes based on how many of the backlink nodes they have a link from
2024-11-15 10:08:44 +01:00
Mikkel Denker
f56012e770
[webgraph] refactor url tokenizer into smaller helper functions
2024-11-13 11:36:51 +01:00
Mikkel Denker
3e88c6e8a9
[webgraph] match non-url-like queries against any term in parsed url
2024-11-13 10:25:19 +01:00
Mikkel Denker
b4a68a7385
[webgraph] a not-filter inside an or-filter mistakenly filtered everything
2024-11-13 10:15:00 +01:00
Mikkel Denker
e94cd2de11
[webgraph] rename HostLinkQuery to LinksQuery
2024-11-12 14:31:57 +01:00
Mikkel Denker
369ab36fe0
[webgraph] apply filters in between query
2024-11-12 13:22:55 +01:00
Mikkel Denker
fb7c191c73
fix typo
2024-11-12 11:13:24 +01:00
Mikkel Denker
444ef9fdce
[webgraph] apply column filters in HostGroupSketchQuery
2024-11-12 11:04:51 +01:00
Mikkel Denker
fb4aec5424
[webgraph] rel flags filter (nofollow etc)
2024-11-12 08:42:17 +01:00
Mikkel Denker
48eebbef43
[webgraph] filters in group_by queries
2024-11-11 14:13:11 +01:00
Mikkel Denker
e4a8a9f2fe
[webgraph] query filters
2024-11-11 11:02:38 +01:00
Mikkel Denker
abef4f26ff
treat live index as an extra shard in search index to re-use result merge logic etc
2024-11-08 14:32:54 +01:00
Mikkel Denker
3459405b45
use specific shard_ids for search servers (backbone and live) so we can potentially simply treat the live index as extra shards and re-use all merge logic etc. for searches
2024-11-07 14:34:54 +01:00
Mikkel Denker
14e6e11058
[search_server] implement GetSiteUrls as generic query
2024-11-07 12:10:36 +01:00
Mikkel Denker
846a853969
[search_server] implement get homepage as GenericQuery
2024-11-06 15:18:18 +01:00
Mikkel Denker
2f14ed4898
[search_server] implement get webpage as GenericQuery
2024-11-06 14:48:01 +01:00
Mikkel Denker
11e30b1391
[search_server] implement size query as GenericQuery
2024-11-06 12:13:18 +01:00
Mikkel Denker
94d319209b
[tantivy] reduce memory usage when writing postings
...
each posting list is already sorted by the new document ids (even without index sorting). if new(a) < new(b) => old(a) < old(b) and vice versa. the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them
2024-11-05 17:26:52 +01:00
Mikkel Denker
aba861ea37
[search_server] implement top key phrases as generic query
2024-11-05 14:12:55 +01:00