Mikkel Denker
365ed02813
Very simple WAL built on top of file-store primitives ( #219 )
Doesn't handle concurrent writes, and flushes after each write. This will cause a lot of fsyncs, which will impact performance, but as this will be used for the live index, where each item (a full webpage) is quite large, this will hopefully not be too detrimental.
2024-09-05 14:35:52 +02:00
Mikkel Denker
27bcb083bc
fix panic in bangs if url is malformed for some reason
2024-09-05 09:49:14 +02:00
Mikkel Denker
40a6dde924
custom captcha to reduce the number of bots scraping the search results
2024-09-04 15:44:57 +02:00
Mikkel Denker
6b8d921aca
prevent infinite loop during robotstxt parsing...
2024-08-30 13:09:58 +02:00
Mikkel Denker
fb5a8cc6ab
valid useragents are always ascii, so we should be able to do the lowercase string cmp without alloc
2024-08-30 12:11:12 +02:00
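A minimal sketch of the allocation-free comparison described in fb5a8cc6ab, assuming both user agents are plain `&str` values. The helper name `user_agent_matches` is hypothetical and not taken from the project; only `str::eq_ignore_ascii_case` is a real standard-library call.

```rust
/// Hypothetical helper: compares two user-agent tokens case-insensitively
/// without allocating a lowercased copy of either string.
fn user_agent_matches(rule_agent: &str, our_agent: &str) -> bool {
    // `eq_ignore_ascii_case` folds only ASCII letters and compares in place,
    // which is sufficient when valid user agents are always ASCII.
    rule_agent.eq_ignore_ascii_case(our_agent)
}
```

Compared with `a.to_lowercase() == b.to_lowercase()`, this avoids two temporary `String` allocations per comparison.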
Mikkel Denker
f5cea25a9a
decrease robotstxt char limit to guard against adversarial input
2024-08-30 11:50:06 +02:00
Mikkel Denker
ac9fb2ebbd
disable robotstxt retry on unreachable
2024-08-30 11:46:35 +02:00
Mikkel Denker
b20515997a
if optic match location is domain but the pattern clearly tries to match against a site, change match location to site as this is probably what the user meant.
interpreting the rule literally would never match anything, which is not really useful
2024-08-29 16:13:56 +02:00
Mikkel Denker
e8875fe19d
wait for webgraph to come online when instantiating remote webgraph conn
2024-08-29 13:31:03 +02:00
Mikkel Denker
710b42fbc7
avoid lambdamart predict allocation in ranking pipeline
2024-08-29 09:42:53 +02:00
Mikkel Denker
7ea70c4df7
use description instead of dmoz_description in optic as dmoz_description isn't generally populated and not supported by PatternQuery
2024-08-29 09:39:40 +02:00
Mikkel Denker
fa5d2b662a
'fn compute(..)' in core signal should not return option anymore as all core signals should be computable in that part of the ranking
2024-08-28 17:00:01 +02:00
Mikkel Denker
9cba6c13fd
split 'Signal' trait into 'CoreSignal' and 'Signal' to distinguish between the signals that are calculated initially and later during ranking
2024-08-28 16:33:56 +02:00
Mikkel Denker
e28de39843
remove lambdamart from search server config
only used on api server
2024-08-28 10:44:51 +02:00
Mikkel Denker
c6119e31d7
giant ranking pipeline refactor to separate ranking stages from sorting/offset logic
this should make it easier to implement additional ranking stages in the future
2024-08-27 20:16:17 +02:00
Mikkel Denker
5cdfaaac3f
'new()' wasn't used as the term distance scorers are unit structs
2024-08-23 10:47:24 +02:00
Mikkel Denker
9ee9c6338e
incorporate term distance into ranking
calculate minimum slop that would be required for document to match the equivalent phrase-with-slop query on title and body
2024-08-23 10:41:36 +02:00
Mikkel Denker
d608aa217f
Add term cover percentage as ranking signals
i.e. how many terms from query match title/body of the document
2024-08-22 14:17:41 +02:00
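A rough sketch of what a term cover signal like the one in d608aa217f could compute, assuming the query and the field (title or body) are already tokenized. The function name `term_cover` and the slice-based inputs are hypothetical, and the project's real signal may be defined differently.

```rust
/// Hypothetical signal: fraction of query terms that occur in a field.
/// Returns a value in [0.0, 1.0]; an empty query yields 0.0.
fn term_cover(query_terms: &[&str], field_terms: &[&str]) -> f64 {
    if query_terms.is_empty() {
        return 0.0;
    }
    // Count the query terms that appear at least once in the field.
    let matched = query_terms
        .iter()
        .filter(|&q| field_terms.contains(q))
        .count();
    matched as f64 / query_terms.len() as f64
}
```

The linear scan over `field_terms` keeps the sketch short; a real index would presumably answer the "does this term occur" question from its postings instead.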
Mikkel Denker
4467abb02a
remove unused 'SignalComputer::empty()'
2024-08-22 12:13:46 +02:00
Mikkel Denker
acb65099ef
number of terms in query as ranking signal
has coefficient 0 as it doesn't make sense to include in linear combination, but can still be used in more advanced pipelines
2024-08-22 12:05:58 +02:00
Mikkel Denker
8ec40ab9a1
specific likely_has_ads ranking signal in addition to the num_trackers signal
2024-08-22 10:36:47 +02:00
Mikkel Denker
d007647456
lower 'b' bm25 constant on backlink text
when the backlink text is longer it usually means the page has a higher in_degree which is also a useful ranking metric. long texts should therefore not be penalized so hard
2024-08-21 18:08:30 +02:00
Mikkel Denker
1acbafce4c
have some initial wander steps during crawl
this makes sure we index the pages that have a short distance from the frontpage, as they are likely very relevant to the site. currently it's set to 4 steps, then scheduled urls and finally the remaining wander
2024-08-21 15:55:50 +02:00
Mikkel Denker
1010402cfc
properly interpret 'Disallow: ' robotstxt files
2024-08-21 11:36:10 +02:00
Mikkel Denker
bd150bda85
Revert "update signal coefficients"
This reverts commit 24f0ab4cf1.
2024-08-19 20:55:40 +02:00
Mikkel Denker
24f0ab4cf1
update signal coefficients
2024-08-19 20:29:05 +02:00
Mikkel Denker
1f09a4247d
ltr experiment
use differential evolution to optimize linear model live
2024-08-19 12:13:59 +02:00
Mikkel Denker
e68b7fba19
admin cli endpoint to get number of pages in index
2024-08-16 14:09:52 +02:00
Mikkel Denker
44cc5a8d0f
<title> tag inside an <svg> tag would mistakenly be chosen as the title of the page
2024-08-16 13:19:23 +02:00
Mikkel Denker
9c2d27fe05
ltr experiment
2024-08-15 11:46:05 +02:00
Mikkel Denker
8b21c3c085
set webgraph on api searcher so like/dislike works
2024-08-14 14:38:21 +02:00
Mikkel Denker
0b56d7fa92
phrase queries shouldn't be augmented with adjacent ngrams as this results in unexpected matches on the phrase query
2024-08-14 12:00:58 +02:00
Mikkel Denker
21d7342ecd
optionally construct only host graph or page graph
2024-08-13 15:25:39 +02:00
Mikkel Denker
308388262f
store node ids as big endian in webgraph so sort is correct during merge
2024-08-13 13:42:35 +02:00
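A small illustration of why 308388262f matters, assuming node ids are `u64` values and that keys are compared byte by byte during merge. The helper names `node_key` and `node_key_le` are hypothetical and do not come from the webgraph code.

```rust
/// Hypothetical key encoding: a node id serialized as 8 big-endian bytes.
/// Byte-wise (lexicographic) order of these keys equals numeric order.
fn node_key(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}

/// The same id in little-endian order, shown only for contrast:
/// byte-wise order of these keys does NOT follow numeric order.
fn node_key_le(id: u64) -> [u8; 8] {
    id.to_le_bytes()
}
```

With big-endian keys the most significant byte is compared first, so `node_key(1) < node_key(256)` holds just as `1 < 256` does. With little-endian keys the comparison starts at the least significant byte and the order flips for this pair.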
Mikkel Denker
44b285c41a
exacturl search operator
2024-08-12 11:52:41 +02:00
Mikkel Denker
2e95e42f7b
bump optics
2024-08-12 11:21:18 +02:00
Mikkel Denker
95282202a9
remove some false positive 'has ads' tags
2024-08-12 11:20:05 +02:00
Mikkel Denker
f3315c4b42
leechy ranking annotation experiment
2024-08-12 11:17:55 +02:00
Mikkel Denker
3a3cbcb158
warmup search caches so the first human searches are faster
2024-08-10 10:31:23 +02:00
Mikkel Denker
0f8c074da8
warmup search caches so the first human searches are faster
2024-08-10 10:27:14 +02:00
Mikkel Denker
ebbfffdb8e
simpler keyphrase extraction
turns out this works even better than the more complex approach when the dataset is large enough
2024-08-09 17:28:46 +02:00
Mikkel Denker
e7bc911fb3
make sure that all ranking signals are returned even if their score/coefficient is zero
makes it easier to train ranking models
2024-08-07 13:37:14 +02:00
Mikkel Denker
7e5d3aafd2
better keyphrases
2024-08-07 09:44:41 +02:00
Mikkel Denker
abfc90eaeb
tld id as numerical field so we can potentially use it during ranking
2024-08-06 21:40:04 +02:00
Mikkel Denker
b79233302b
api admin interface
2024-08-06 18:26:14 +02:00
Mikkel Denker
51e47107b2
limit should have been 0.0 in 'score_rank' as the scores are not reversed
2024-08-05 16:54:53 +02:00
Mikkel Denker
08dc07c575
update signal coefficients
2024-08-05 14:54:20 +02:00
Mikkel Denker
99d82b0a59
limit backlinks to 512 for better indexing performance
2024-07-27 21:12:29 +02:00
Mikkel Denker
88df429dad
remove 'linksfrom' operator again
indexing got significantly slower as we had to look up the actual url for every node id, which became a bottleneck
2024-07-27 20:52:20 +02:00
Mikkel Denker
e71c9716c0
group backlinks by the host centrality rank of the linking site so we can take this into account during ranking.
links from pages with a high host centrality rank are more trustworthy than those from pages with a low one
2024-07-27 18:16:17 +02:00