Mikkel Denker
ecb495a66c
fix flaky test
2024-12-11 10:40:43 +01:00
Mikkel Denker
0912e159fd
ampc shortest path test case
2024-12-11 10:12:20 +01:00
Mikkel Denker
e24a389132
approximate centrality progress bar
2024-12-10 15:42:36 +01:00
Mikkel Denker
bd9f1cb435
[ampc] ability to send requests without timeouts (should be used sparingly)
2024-12-10 15:10:10 +01:00
Mikkel Denker
852a10c08e
use shortest paths as subroutine in approx centrality
2024-12-10 14:24:12 +01:00
Mikkel Denker
38362b723c
support WARC-Date in .warc files
2024-12-10 10:41:39 +01:00
Mikkel Denker
bd349549eb
optional max distance in shortest paths
2024-12-09 15:52:52 +01:00
Mikkel Denker
774dbd87fb
optimise shortest path to use exact changed nodes (stored in hashset) if there are very few updated nodes
2024-12-09 15:31:25 +01:00
Mikkel Denker
037ec2cc9d
combine sketches across workers for more precise bloom filters
2024-12-09 14:27:53 +01:00
Mikkel Denker
49abc7419f
bellman-ford inspired shortest path for distributed graph. works better than the approach in approx centrality when the graph is sharded. still need to implement some low-hanging optimisations
2024-12-09 14:16:12 +01:00
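
The shortest-path commits above (round-based relaxation, tracking exactly which nodes changed, and an optional maximum distance) can be illustrated on a single machine. This is only a minimal sketch under assumptions: the function name `shortest_paths`, the edge-list input, and the unweighted graph are hypothetical, and it always keeps the changed nodes in a hash set, whereas the commits describe a distributed version that uses the exact set only when few nodes were updated.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical single-machine sketch of a Bellman-Ford-style relaxation
/// over an unweighted, directed edge list. Each round only expands the
/// nodes whose distance changed in the previous round.
fn shortest_paths(
    edges: &[(u64, u64)],
    source: u64,
    max_distance: Option<u64>,
) -> HashMap<u64, u64> {
    // Build an adjacency list once up front.
    let mut adjacency: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(from, to) in edges {
        adjacency.entry(from).or_default().push(to);
    }

    let mut distances = HashMap::from([(source, 0u64)]);
    // Exact set of nodes updated in the previous round.
    let mut changed = HashSet::from([source]);

    while !changed.is_empty() {
        let mut next_changed = HashSet::new();
        for node in &changed {
            let next_distance = distances[node] + 1;
            // Stop expanding once the optional distance cap is exceeded.
            if max_distance.is_some_and(|max| next_distance > max) {
                continue;
            }
            if let Some(neighbours) = adjacency.get(node) {
                for &neighbour in neighbours {
                    let current = distances.get(&neighbour).copied().unwrap_or(u64::MAX);
                    if next_distance < current {
                        distances.insert(neighbour, next_distance);
                        next_changed.insert(neighbour);
                    }
                }
            }
        }
        changed = next_changed;
    }
    distances
}
```

The loop terminates as soon as a round updates no node, which is why tracking the changed set keeps later rounds cheap.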
Mikkel Denker
7633b61ef3
Improve code docs ( #246 )
...
* document all entrypoints
* document ampc framework
* document ranking pipeline
* document the different searchers
* document generic search query flow
* document main crawler elements
2024-12-05 14:19:53 +01:00
Mikkel Denker
daff4d06d6
document supported search operators ( #245 )
2024-12-04 10:45:03 +01:00
Mikkel Denker
d69fd5b8c3
update documentation links
2024-12-03 15:56:45 +01:00
Mikkel Denker
cd9a794cd5
just update
2024-12-03 15:05:07 +01:00
Mikkel Denker
de7291daa1
just update
2024-12-03 15:00:08 +01:00
Mikkel Denker
9e8dc92a41
Improve architecture documentation ( #243 )
...
* cleanup assets
* update crawler docs
* update search index docs
* update webgraph docs
2024-12-03 14:57:54 +01:00
Mikkel Denker
01de7a107b
Improve zimba+web-spell docs and release the modules under MIT ( #242 )
...
* improve web-spell docs
* improve zimba docs
* release zimba + web-spell under MIT
2024-12-02 13:21:26 +01:00
Mikkel Denker
040d04413e
Web spell as dedicated module ( #240 )
...
* separate web-spell into a dedicated module
* web-spell readme
2024-11-29 15:15:18 +01:00
Mikkel Denker
05d3cf9de5
urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities ( #239 )
2024-11-29 12:17:01 +01:00
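
As a rough illustration of the fix above, the sketch below percent-encodes a user query before substituting it into a bang target. The helper names `percent_encode` and `bang_redirect` and the `{query}` placeholder are hypothetical and not taken from the repository; the encoder escapes every byte outside the RFC 3986 unreserved set.

```rust
/// Hypothetical helper: percent-encode every byte that is not an
/// RFC 3986 unreserved character.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{byte:02X}")),
        }
    }
    out
}

/// Hypothetical helper: substitute the encoded query into a bang
/// template that uses an assumed `{query}` placeholder.
fn bang_redirect(template: &str, query: &str) -> String {
    template.replace("{query}", &percent_encode(query))
}
```

Because characters such as `/`, `:` and `&` are escaped, the user-supplied text can only ever end up as data inside the target URL and cannot change its host or add parameters.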
Mikkel Denker
3945651330
Check if either the request or response ip is an internal ip. Fail the request if this is the case ( #238 )
2024-11-29 11:27:30 +01:00
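
A check like the one described above could look roughly as follows, using only the standard library. The function names and the exact set of ranges treated as internal are assumptions for illustration, not the repository's actual rules.

```rust
use std::net::IpAddr;

/// Hypothetical check: treat private, loopback, link-local and
/// unspecified addresses as internal.
fn is_internal_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}

/// Fail the request if either side of the exchange is internal.
fn should_fail(request_ip: IpAddr, response_ip: IpAddr) -> bool {
    is_internal_ip(request_ip) || is_internal_ip(response_ip)
}
```

Checking both addresses matters because a hostname can resolve to a public address when validated and still be answered from an internal one.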
Mikkel Denker
3b2a5f7895
[webgraph] sort edges by centrality rank instead of raw centrality
...
this ensures consistent ordering of edges across segments/shards
2024-11-28 11:58:19 +01:00
Mikkel Denker
3e683b02f8
no need to collect all page nodes when calculating centrality. there will be a few duplicates when iterating but that's okay as their hyperloglogs will stay the same
2024-11-28 10:12:03 +01:00
Mikkel Denker
1d821ef4db
rustup update and fix clippy warnings
2024-11-27 17:15:24 +01:00
Mikkel Denker
83d1e7d7e0
don't lowercase path of url when normalising
...
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 15:07:15 +01:00
Mikkel Denker
3d4b272ed8
don't lowercase path of url when normalising
...
some sites use case-sensitive urls, so lowercasing the urls might result in a bunch of 404s
2024-11-27 14:54:58 +01:00
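
To make the normalisation change above concrete, here is a minimal sketch that lowercases only the scheme and authority and leaves the path, query and fragment untouched. The function name `normalise_url` and the hand-rolled splitting are assumptions; the sketch ignores edge cases such as userinfo in the authority.

```rust
/// Hypothetical sketch: lowercase the scheme and authority of a URL,
/// but keep the path, query and fragment exactly as given.
fn normalise_url(url: &str) -> String {
    // Without a scheme separator there is nothing we can safely lowercase.
    let Some((scheme, rest)) = url.split_once("://") else {
        return url.to_string();
    };
    // The authority ends at the first '/', '?' or '#'.
    let split = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(split);
    format!(
        "{}://{}{}",
        scheme.to_lowercase(),
        authority.to_lowercase(),
        tail
    )
}
```

Host names are case-insensitive, so lowercasing them is safe for deduplication, while the path is passed through because a server may treat `/Some/Path` and `/some/path` as different resources.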
Mikkel Denker
46016ab03d
[tantivy] cached column to reduce disk reads
2024-11-27 14:36:22 +01:00
Mikkel Denker
761eba772e
progress messages when starting workers
2024-11-27 14:00:02 +01:00
Mikkel Denker
9d507debce
[webgraph] control whether or not to skip self links in more queries
2024-11-26 10:51:26 +01:00
TheIronBorn
bf70a4666f
remove extra hash in related_entities ( #153 )
2024-11-26 10:29:36 +01:00
TheIronBorn
7331dd6123
reduce allocations in metrics.rs ( #237 )
2024-11-26 10:29:08 +01:00
Mikkel Denker
f241c3238c
[webgraph] control whether or not to skip self links in group_by queries
2024-11-25 14:43:55 +01:00
Mikkel Denker
13f06e41ba
[webgraph] add from_centrality, to_centrality etc. fields to edges
2024-11-23 12:57:51 +01:00
Mikkel Denker
efbe1db918
[tantivy] support u128 in row order
2024-11-23 10:53:08 +01:00
Mikkel Denker
b81d84762c
[tantivy] support u128 column values
2024-11-22 15:26:43 +01:00
Mikkel Denker
7b75a854b7
[tantivy] support u128 values in inverted index
2024-11-22 14:29:15 +01:00
Mikkel Denker
3a48c8c7b4
[tantivy] raw column values codec
...
there seems to be a bug somewhere in the decompression of one of the other codecs for enormous columns. the to_id and from_id columns in the webgraph seem to point to wrong ids (don't know exactly why). as the ids are random hashes anyway, we don't gain much from compression and can therefore simply store the columns as raw u64 values on disk
2024-11-21 16:03:11 +01:00
Mikkel Denker
6deb599d78
[webgraph] optionally deduplicate edges
2024-11-20 11:38:12 +01:00
Mikkel Denker
0681588b2a
ugc rel tag and filter during centrality calculation
2024-11-20 10:19:07 +01:00
Mikkel Denker
d2fa5f4061
use nom in zimba parser ( #236 )
2024-11-20 10:12:52 +01:00
Mikkel Denker
12e9502e80
Improve API documentation ( #235 )
...
* add docusaurus scalar api documentation structure
* bump openapi 3.0 to 3.1 so we can mark internal endpoints
* improve search api docs
* webgraph api docs
* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
6048d9f133
support filters in similar hosts
2024-11-15 14:05:27 +01:00
Mikkel Denker
0d0405caa6
[webgraph] rename *params -> *query
2024-11-15 10:23:26 +01:00
Mikkel Denker
5639813d89
[webgraph] speed up similar_hosts by estimating the best candidate nodes based on how many of the backlink nodes link to them
2024-11-15 10:08:44 +01:00
Mikkel Denker
f56012e770
[webgraph] refactor url tokenizer into smaller helper functions
2024-11-13 11:36:51 +01:00
Mikkel Denker
3e88c6e8a9
[webgraph] match non-url-like queries against any term in parsed url
2024-11-13 10:25:19 +01:00
Mikkel Denker
b4a68a7385
[webgraph] a not-filter inside an or-filter mistakenly filtered everything
2024-11-13 10:15:00 +01:00
Mikkel Denker
e94cd2de11
[webgraph] rename HostLinkQuery to LinksQuery
2024-11-12 14:31:57 +01:00
Mikkel Denker
369ab36fe0
[webgraph] apply filters in between query
2024-11-12 13:22:55 +01:00
Mikkel Denker
fb7c191c73
fix typo
2024-11-12 11:13:24 +01:00
Mikkel Denker
444ef9fdce
[webgraph] apply column filters in HostGroupSketchQuery
2024-11-12 11:04:51 +01:00