Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
365ed02813
Very simple WAL built on top of file-store primitives (#219)
Doesn't handle concurrent writes, and flushes after each write. This will cause a lot of fsyncs, which will impact performance, but as this will be used for the live index, where each item (a full webpage) is quite large, this will hopefully not be too detrimental.
2024-09-05 14:35:52 +02:00
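The flush-after-every-write behaviour described in this commit can be illustrated with a minimal sketch. It is not the project's implementation: it uses `std::fs` directly rather than the file-store primitives the commit refers to, and `SimpleWal`, `replay`, and the length-prefixed record layout are hypothetical names and choices made for illustration.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Hypothetical single-writer log: every record is appended and fsynced.
struct SimpleWal {
    file: File,
}

impl SimpleWal {
    fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file })
    }

    /// Appends one record as an 8-byte little-endian length followed by the payload,
    /// then forces it to disk before returning.
    fn write(&mut self, record: &[u8]) -> io::Result<()> {
        let len = record.len() as u64;
        self.file.write_all(&len.to_le_bytes())?;
        self.file.write_all(record)?;
        // One fsync per write: durable, but costly when records are small.
        self.file.sync_all()
    }
}

/// Reads back every complete record; a truncated trailing record is ignored.
fn replay(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    let mut records = Vec::new();
    let mut pos = 0usize;
    while bytes.len() - pos >= 8 {
        let mut len_buf = [0u8; 8];
        len_buf.copy_from_slice(&bytes[pos..pos + 8]);
        let len = u64::from_le_bytes(len_buf) as usize;
        pos += 8;
        if len > bytes.len() - pos {
            break; // incomplete final record, e.g. after a crash mid-write
        }
        records.push(bytes[pos..pos + len].to_vec());
        pos += len;
    }
    Ok(records)
}
```

Because each item is large, the fixed cost of one fsync per record is small relative to the bytes written, which is the trade-off the commit message points at.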
Mikkel Denker
27bcb083bc fix panic in bangs if url is malformed for some reason 2024-09-05 09:49:14 +02:00
Mikkel Denker
40a6dde924 custom captcha to reduce the number of bots scraping the search results 2024-09-04 15:44:57 +02:00
Mikkel Denker
6b8d921aca prevent infinite loop during robotstxt parsing... 2024-08-30 13:09:58 +02:00
Mikkel Denker
fb5a8cc6ab valid useragents are always ascii, so we should be able to do the lowercase string cmp without alloc 2024-08-30 12:11:12 +02:00
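As an illustration of the idea in this commit, Rust's standard library can compare two strings case-insensitively over ASCII without building a lowercased copy. The helper name `useragent_matches` is hypothetical and not taken from the repository.

```rust
/// Compares two user-agent tokens ignoring ASCII case.
/// `eq_ignore_ascii_case` works on the borrowed strings directly, so no `String` is allocated.
fn useragent_matches(candidate: &str, expected: &str) -> bool {
    candidate.eq_ignore_ascii_case(expected)
}
```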
Mikkel Denker
f5cea25a9a decrease robotstxt char limit to guard against adversarial input 2024-08-30 11:50:06 +02:00
Mikkel Denker
ac9fb2ebbd disable robotstxt retry on unreachable 2024-08-30 11:46:35 +02:00
Mikkel Denker
b20515997a if optic match location is domain but the pattern clearly tries to match against a site, change match location to site as this is probably what the user meant.
interpreting the rule literally would never match anything which is not really useful
2024-08-29 16:13:56 +02:00
Mikkel Denker
e8875fe19d wait for webgraph to come online when instantiating remote webgraph conn 2024-08-29 13:31:03 +02:00
Mikkel Denker
710b42fbc7 avoid lambdamart predict allocation in ranking pipeline 2024-08-29 09:42:53 +02:00
Mikkel Denker
7ea70c4df7 use description instead of dmoz_description in optic as dmoz_description isn't generally populated and not supported by PatternQuery 2024-08-29 09:39:40 +02:00
Mikkel Denker
fa5d2b662a 'fn compute(..)' in core signal should not return option anymore as all core signals should be computable in that part of the ranking 2024-08-28 17:00:01 +02:00
Mikkel Denker
9cba6c13fd split 'Signal' trait into 'CoreSignal' and 'Signal' to distinguish between the signals that are calculated initially and later during ranking 2024-08-28 16:33:56 +02:00
Mikkel Denker
e28de39843 remove lambdamart from search server config
only used on api server
2024-08-28 10:44:51 +02:00
Mikkel Denker
c6119e31d7 giant ranking pipeline refactor to separate ranking stages from sorting/offset logic
this should make it easier to implement additional ranking stages in the future
2024-08-27 20:16:17 +02:00
Mikkel Denker
5cdfaaac3f 'new()' wasn't used as the term distance scorers are unit structs 2024-08-23 10:47:24 +02:00
Mikkel Denker
9ee9c6338e incorporate term distance into ranking
calculate minimum slop that would be required for document to match the equivalent phrase-with-slop query on title and body
2024-08-23 10:41:36 +02:00
Mikkel Denker
d608aa217f Add term cover percentage as ranking signals
i.e. how many terms from query match title/body of the document
2024-08-22 14:17:41 +02:00
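A term cover signal of this kind could be computed roughly as below. This is a sketch under assumptions: the function name `term_cover`, the pre-tokenized inputs, and the choice to count each distinct query term once are illustrative and not confirmed by the commit.

```rust
use std::collections::HashSet;

/// Fraction of distinct query terms that occur in a field (title or body).
/// Returns 0.0 for an empty query.
fn term_cover(query_terms: &[&str], field_terms: &[&str]) -> f64 {
    if query_terms.is_empty() {
        return 0.0;
    }
    let field: HashSet<&str> = field_terms.iter().copied().collect();
    let unique: HashSet<&str> = query_terms.iter().copied().collect();
    let matched = unique.iter().filter(|t| field.contains(*t)).count();
    matched as f64 / unique.len() as f64
}
```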
Mikkel Denker
4467abb02a remove unused 'SignalComputer::empty()' 2024-08-22 12:13:46 +02:00
Mikkel Denker
acb65099ef number of terms in query as ranking signal
has coefficient 0 as it doesn't make sense to include in linear combination, but can still be used in more advanced pipelines
2024-08-22 12:05:58 +02:00
Mikkel Denker
8ec40ab9a1 specific likely_has_ads ranking signal in addition to the num_trackers signal 2024-08-22 10:36:47 +02:00
Mikkel Denker
d007647456 lower 'b' bm25 constant on backlink text
when the backlink text is longer it usually means the page has a higher in_degree which is also a useful ranking metric. long texts should therefore not be penalized so hard
2024-08-21 18:08:30 +02:00
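The effect of lowering `b` can be seen in the standard BM25 term score, where `b` controls how strongly field length is normalized. The sketch below uses the textbook formula; the function name and parameters are illustrative, and the repository's actual values and scorer are not shown in this log.

```rust
/// Textbook BM25 contribution of a single term.
/// `b = 0.0` disables length normalization; `b = 1.0` applies it fully.
fn bm25_term_score(tf: f64, idf: f64, field_len: f64, avg_field_len: f64, k1: f64, b: f64) -> f64 {
    let length_norm = 1.0 - b + b * (field_len / avg_field_len);
    idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
}
```

For a field twice as long as the average, a smaller `b` shrinks `length_norm` toward 1, so the long backlink text loses less score than it would under a larger `b`.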
Mikkel Denker
1acbafce4c have some initial wander steps during crawl
this makes sure we index the pages that are a short distance from the front page, as they are likely very relevant to the site. currently it's set to 4 steps, then scheduled urls and finally the remaining wander
2024-08-21 15:55:50 +02:00
Mikkel Denker
1010402cfc properly interpret 'Disallow: ' robotstxt files 2024-08-21 11:36:10 +02:00
Mikkel Denker
bd150bda85 Revert "update signal coefficients"
This reverts commit 24f0ab4cf1.
2024-08-19 20:55:40 +02:00
Mikkel Denker
24f0ab4cf1 update signal coefficients 2024-08-19 20:29:05 +02:00
Mikkel Denker
1f09a4247d ltr experiment
use differential evolution to optimize linear model live
2024-08-19 12:13:59 +02:00
Mikkel Denker
e68b7fba19 admin cli endpoint to get number of pages in index 2024-08-16 14:09:52 +02:00
Mikkel Denker
44cc5a8d0f <title> tag inside an <svg> tag would mistakenly be chosen as the title of the page 2024-08-16 13:19:23 +02:00
Mikkel Denker
9c2d27fe05 ltr experiment 2024-08-15 11:46:05 +02:00
Mikkel Denker
8b21c3c085 set webgraph on api searcher so like/dislike works 2024-08-14 14:38:21 +02:00
Mikkel Denker
0b56d7fa92 phrase queries shouldn't be augmented with adjacent ngrams as this results in unexpected matches on the phrase query 2024-08-14 12:00:58 +02:00
Mikkel Denker
21d7342ecd optionally construct only host graph or page graph 2024-08-13 15:25:39 +02:00
Mikkel Denker
308388262f store node ids as big endian in webgraph so sort is correct during merge 2024-08-13 13:42:35 +02:00
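Why big endian matters here: when keys are compared as raw bytes, only the big-endian encoding of an unsigned integer sorts in the same order as the number itself. The sketch below assumes 64-bit node ids and a hypothetical helper `encode_node_id`; the actual id width and storage layer are not stated in the commit.

```rust
/// Encodes a node id so that lexicographic byte order equals numeric order.
fn encode_node_id(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}
```

With little-endian bytes, 255 would sort after 256, because the least significant byte comes first in the comparison.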
Mikkel Denker
44b285c41a exacturl search operator 2024-08-12 11:52:41 +02:00
Mikkel Denker
2e95e42f7b bump optics 2024-08-12 11:21:18 +02:00
Mikkel Denker
95282202a9 remove some false positive 'has ads' tags 2024-08-12 11:20:05 +02:00
Mikkel Denker
f3315c4b42 leechy ranking annotation experiment 2024-08-12 11:17:55 +02:00
Mikkel Denker
3a3cbcb158 warmup search caches so the first human searches are faster 2024-08-10 10:31:23 +02:00
Mikkel Denker
0f8c074da8 warmup search caches so the first human searches are faster 2024-08-10 10:27:14 +02:00
Mikkel Denker
ebbfffdb8e simpler keyphrase extraction
turns out this works even better than the more complex approach when the dataset is large enough
2024-08-09 17:28:46 +02:00
Mikkel Denker
e7bc911fb3 make sure that all ranking signals are returned even if their score/coefficient is zero
makes it easier to train ranking models
2024-08-07 13:37:14 +02:00
Mikkel Denker
7e5d3aafd2 better keyphrases 2024-08-07 09:44:41 +02:00
Mikkel Denker
abfc90eaeb tld id as numerical field so we can potentially use it during ranking 2024-08-06 21:40:04 +02:00
Mikkel Denker
b79233302b api admin interface 2024-08-06 18:26:14 +02:00
Mikkel Denker
51e47107b2 limit should have been 0.0 in 'score_rank' as the scores are not reversed 2024-08-05 16:54:53 +02:00
Mikkel Denker
08dc07c575 update signal coefficients 2024-08-05 14:54:20 +02:00
Mikkel Denker
99d82b0a59 limit backlinks to 512 for better indexing performance 2024-07-27 21:12:29 +02:00
Mikkel Denker
88df429dad remove 'linksfrom' operator again
indexing got significantly slower as we had to look up the actual urls for all node ids, which became a bottleneck
2024-07-27 20:52:20 +02:00
Mikkel Denker
e71c9716c0 group backlinks by the host centrality rank of the linking site so we can take this into account during ranking.
links from pages with high host centrality rank are more trustworthy than those from pages with low rank
2024-07-27 18:16:17 +02:00