Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
14e6e11058 [search_server] implement GetSiteUrls as generic query 2024-11-07 12:10:36 +01:00
Mikkel Denker
846a853969 [search_server] implement get homepage as GenericQuery 2024-11-06 15:18:18 +01:00
Mikkel Denker
2f14ed4898 [search_server] implement get webpage as GenericQuery 2024-11-06 14:48:01 +01:00
Mikkel Denker
11e30b1391 [search_server] implement size query as GenericQuery 2024-11-06 12:13:18 +01:00
Mikkel Denker
54ac58f7e9 [tantivy] kmerge posting lists to avoid iterating over documents that are not present in the posting list during segment merges, while preserving low memory footprint 2024-11-06 11:10:44 +01:00
Mikkel Denker
94d319209b [tantivy] reduce memory usage when writing postings
each posting list is already sorted by the new document ids (even without index sorting). if new(a) < new(b) => old(a) < old(b) and vice versa. the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them
2024-11-05 17:26:52 +01:00
Mikkel Denker
aba861ea37 [search_server] implement top key phrases as generic query 2024-11-05 14:12:55 +01:00
Mikkel Denker
ebacff432e [search_server] move some function bounds into the traits for simplicity 2024-11-05 11:41:29 +01:00
Mikkel Denker
1a3a9cbbb5 [search_server] add a 'GenericQuery' trait so the search index can handle queries that are not typical search queries
this is inspired by the new webgraph query architecture and how it creates a nice separation of logic
2024-11-05 11:10:01 +01:00
Mikkel Denker
b93afff319 [search_server] use spawn_blocking for search requests to improve concurrency 2024-11-04 13:47:47 +01:00
Mikkel Denker
4e8426888b just update 2024-11-01 15:28:40 +01:00
Mikkel Denker
1994460d9c [webgraph] use specialized url tokenizer 2024-11-01 15:15:50 +01:00
Mikkel Denker
cd40e2757b [webgraph] speedup queries by using warmed f64 column for scorer 2024-10-31 12:14:32 +01:00
Mikkel Denker
8567b8ebf4 webgraph spawn_blocking to handle batch requests concurrently 2024-10-31 11:27:48 +01:00
Mikkel Denker
24e0257545 [webgraph] add FullLinksBetweenQuery to server 2024-10-30 16:06:34 +01:00
Mikkel Denker
2fd7f31bae [webgraph] query to find all edges between two nodes 2024-10-30 15:56:38 +01:00
Mikkel Denker
534ecd1dfa take extra edges to ensure the remote has enough for deduplication 2024-10-30 11:10:05 +01:00
Mikkel Denker
9083d600c7 take extra edges to ensure the remote has enough for deduplication 2024-10-30 10:56:39 +01:00
Mikkel Denker
69410fc74f [webgraph] use host ids in host queries 2024-10-29 16:06:29 +01:00
Mikkel Denker
8bc3d5423b [webgraph] exact group_by query 2024-10-29 15:50:32 +01:00
Mikkel Denker
16cf814f96 [webgraph] exact group_by query 2024-10-29 15:45:45 +01:00
Mikkel Denker
fcf2b99093 host group sketch query to compute the in/out degree somewhat accurately 2024-10-29 11:32:49 +01:00
Mikkel Denker
3763c51348 convert RelFlags from a bitset to a vec of enums for public api 2024-10-28 12:26:29 +01:00
Mikkel Denker
32c4a5f065 if the first edge in the postings was an edge from a host to itself, it would not get properly filtered 2024-10-28 11:14:07 +01:00
Mikkel Denker
8fecebbff4 set rel_flags from stored edges 2024-10-28 10:53:46 +01:00
Mikkel Denker
c5bd49df51 deduplicate host edges across segment+shard results 2024-10-28 10:36:33 +01:00
Mikkel Denker
ba01fc700c deduplicate host edges across entire query, not just adjacent documents 2024-10-25 14:47:01 +02:00
Mikkel Denker
91544e314d edges get sorted by their segment id for performance when retrieving. assigning sort_scores with .zip would therefore be incorrect 2024-10-25 14:30:51 +02:00
Mikkel Denker
866be554ed guard against terminated doc ids 2024-10-25 14:01:53 +02:00
Mikkel Denker
c9a1d6f1a4 re-use u64 column fields in webgraph queries as a significant fraction of the time was spent opening the columns 2024-10-25 11:52:24 +02:00
Mikkel Denker
a80e048580 update vsce dependency in optics-lsp 2024-10-25 09:45:40 +02:00
Mikkel Denker
31bfebf2c9 just update 2024-10-25 09:37:45 +02:00
Mikkel Denker
c154093166 skip webgraph during search if there are no liked/disliked sites in query (no ranking signals would currently need the backlinks in that case) 2024-10-24 16:29:50 +02:00
Mikkel Denker
ab6ed35bb8 re-use row field reader for document across signals 2024-10-24 15:51:28 +02:00
Mikkel Denker
13a06b3820 make sure webgraph merge does not exceed maximum number of allowed documents in tantivy 2024-10-24 11:53:21 +02:00
Mikkel Denker
3e732e0f29 when https version of robots.txt is 404 but http is unreachable, it should overall be treated as a 404 as the website has most likely just redirected all http traffic to https 2024-10-24 10:09:59 +02:00
Mikkel Denker
8df44a56ae make sure urls discovered in crawler are deduplicated (extra precaution) 2024-10-23 14:40:26 +02:00
Mikkel Denker
dda40bd4e5 centrality store in webgraph creation is now raw centrality instead of rank 2024-10-23 14:04:42 +02:00
Mikkel Denker
ba464a2d81 dedup edges by ids on insert 2024-10-23 12:06:24 +02:00
Mikkel Denker
658ac6f682
Webgraph inverted index (#232)
* overall structure for new webgraph store

* webgraph schema structure and HostLinksQuery

* deserialize edge

* forward/backlink queries

* full edge queries and iter smalledges

* [wip] use new store in webgraph

* remove id2node db

* shortcircuit link queries

* [wip] remote webgraph trait structure

* [wip] shard awareness

* finish remote webgraph trait structure

* optimize read

* merge webgraphs

* construct webgraph store

* make sure 'just configure' works and everything looks correct
2024-10-23 11:59:52 +02:00
Mikkel Denker
0b1f6eaff1 [live_crawler] shuffle sites between each tick to make site prioritisation more fair 2024-10-15 09:36:19 +02:00
Mikkel Denker
44f67be572 increase stack_size for indexers 2024-10-15 09:25:56 +02:00
Mikkel Denker
77a0e42821 tweak freshness importance 2024-10-14 11:53:24 +02:00
Mikkel Denker
f6cd2e0926 deduplicate cluster members to remove services that have been restarted 2024-10-14 11:30:24 +02:00
Mikkel Denker
016f7dbfac [sonic] propagate errors from each shard independently
errors from one shard should not cause the entire search to fail. in other cases (when indexing pages from live index crawler) we want to retry the request if any of the shards fails. propagating the errors independently allows the caller to decide what to do about the errors
2024-10-14 11:02:21 +02:00
Mikkel Denker
1188235612 reuse RobotClient across sitemap checkers 2024-10-14 09:28:22 +02:00
Mikkel Denker
eb41e3c62d wrap reqwest requestbuilder so .send() errors in tests 2024-10-13 17:46:52 +02:00
Mikkel Denker
c662fe2855 [live_crawler] make sure frontpage check is skipped if disallowed in robots.txt 2024-10-13 17:33:34 +02:00
Mikkel Denker
34230b5916 [live_crawler] make sure unreachable robots.txt files aren't re-fetched 2024-10-11 11:32:45 +02:00
Mikkel Denker
2519bef34c remove pages_by_host from webgraph as it isn't used anymore 2024-10-09 11:55:05 +02:00
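
Note on commit 54ac58f7e9: the message describes k-merging posting lists during segment merges while keeping memory use low. As a rough illustration of the general technique only (not the actual tantivy change), a k-way merge over already-sorted lists of document ids can be sketched with a min-heap. The function name `kmerge_postings` and the use of plain `Vec<u32>` lists are assumptions made for this example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several sorted lists of document ids into one sorted list.
/// Hypothetical helper for illustration; not taken from the repository.
/// The heap holds at most one entry per input list, so the extra state
/// beyond the inputs and the output is O(k) for k lists.
fn kmerge_postings(lists: &[Vec<u32>]) -> Vec<u32> {
    // Each heap entry is (doc id, index of the list, position in that list).
    // `Reverse` turns the max-heap into a min-heap on the doc id.
    let mut heap = BinaryHeap::new();
    for (list_idx, list) in lists.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, list_idx, 0usize)));
        }
    }

    let mut merged = Vec::new();
    while let Some(Reverse((doc, list_idx, pos))) = heap.pop() {
        merged.push(doc);
        // Advance only the list the smallest element came from.
        if let Some(&next) = lists[list_idx].get(pos + 1) {
            heap.push(Reverse((next, list_idx, pos + 1)));
        }
    }
    merged
}
```

Because only the doc ids that actually occur in the lists are ever visited, a merge of this shape never iterates over documents that are absent from every list. A real implementation would presumably write the merged stream to disk rather than collect it into a `Vec`.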