Mikkel Denker
58da0b3567
add a 'linksfrom' query operator to match against backlinks
2024-07-26 14:56:55 +02:00
Mikkel Denker
119403b7e1
remove once_cell dep as it is now part of std
2024-07-26 10:08:43 +02:00
Mikkel Denker
8f97617904
make ml models optional during setup
2024-07-25 15:16:32 +02:00
Mikkel Denker
fa5282a800
move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters
2024-07-25 13:58:07 +02:00
Mikkel Denker
cef67d6aaf
automatically build autosuggest from search index keywords
2024-07-24 16:48:57 +02:00
Mikkel Denker
0f9c0bbd87
generalize 'SplitWhitespaceWithRangeIter' to 'SplitWithRangeIter' that accepts a predicate for where to create the splits
2024-07-23 17:26:04 +02:00
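A minimal sketch of what a predicate-driven splitter with byte ranges could look like. The type name `SplitWithRange`, its fields, and the choice to skip empty pieces are assumptions for illustration, not the code from this commit:

```rust
use std::ops::Range;

/// Hypothetical sketch: yields every non-empty piece of `text` together with
/// its byte range, splitting wherever `is_split` returns true for a char.
pub struct SplitWithRange<'a, P> {
    text: &'a str,
    pos: usize,
    is_split: P,
}

impl<'a, P: Fn(char) -> bool> SplitWithRange<'a, P> {
    pub fn new(text: &'a str, is_split: P) -> Self {
        SplitWithRange { text, pos: 0, is_split }
    }
}

impl<'a, P: Fn(char) -> bool> Iterator for SplitWithRange<'a, P> {
    type Item = (&'a str, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        let text: &'a str = self.text;
        let is_split = &self.is_split;

        // Skip any run of separator characters before the next piece.
        let rest = &text[self.pos..];
        let start = self.pos + rest.find(|c: char| !is_split(c))?;

        // The piece ends at the next separator, or at the end of the text.
        let tail = &text[start..];
        let end = start + tail.find(|c: char| is_split(c)).unwrap_or(tail.len());

        self.pos = end;
        Some((&text[start..end], start..end))
    }
}
```

Passing `|c: char| c.is_whitespace()` as the predicate reproduces the whitespace-only behaviour; because the ranges are byte offsets, multi-byte characters such as 'é' widen the range by more than one.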
Mikkel Denker
f1b72a897d
normalize diacritics/accents
...
we want 'cafe' to also return results that contain 'café' etc
2024-07-23 16:28:32 +02:00
Mikkel Denker
cb1c66f0e8
avoid string alloc in tokenizer if possible
2024-07-23 13:27:31 +02:00
Mikkel Denker
aabd774e34
add proptests to some potentially problematic offsets into strings
2024-07-23 11:49:11 +02:00
Mikkel Denker
a5e0bebee9
fix clippy warnings
...
don't know why these don't show up locally...
2024-07-22 22:20:33 +02:00
Mikkel Denker
c4192af997
re-write tokenizer to not use logos anymore
...
this should fix a reported stack overflow (might be related to https://github.com/maciejhirsz/logos/issues/384 ) and should also make it easier to add additional scripts besides latin in the future
2024-07-22 22:10:59 +02:00
Mikkel Denker
76d7323524
row order numerical fields used for ranking
2024-07-21 14:57:44 +02:00
Mikkel Denker
e44f34e261
f64 and bool types in numerical fields
2024-07-17 16:27:48 +02:00
Mikkel Denker
ffb2a2a0a0
random access row ordered fields for ints, floats and bools.
...
most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields by rows, we can fetch all the relevant fields with a minimum number of IO operations, whereas we would need at least one IO operation for each field if they were column ordered
2024-07-16 10:58:56 +02:00
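The IO argument above can be illustrated with a small sketch of a fixed-width row layout. The field names (`host_centrality`, `fetch_time_ms`, `is_homepage`), the little-endian encoding, and the in-memory `&[u8]` store are all assumptions for illustration; the actual on-disk format in the commit may differ:

```rust
use std::convert::TryInto;

/// Hypothetical fixed-width row: every field of a document is stored contiguously.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Row {
    pub host_centrality: f64,
    pub fetch_time_ms: u64,
    pub is_homepage: bool,
}

/// 8 bytes (f64) + 8 bytes (u64) + 1 byte (bool).
pub const ROW_SIZE: usize = 8 + 8 + 1;

impl Row {
    /// Appends the row to `out` as ROW_SIZE little-endian bytes.
    pub fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.host_centrality.to_le_bytes());
        out.extend_from_slice(&self.fetch_time_ms.to_le_bytes());
        out.push(self.is_homepage as u8);
    }

    /// Decodes a row from exactly ROW_SIZE bytes.
    pub fn decode(bytes: &[u8]) -> Row {
        Row {
            host_centrality: f64::from_le_bytes(bytes[0..8].try_into().unwrap()),
            fetch_time_ms: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
            is_homepage: bytes[16] != 0,
        }
    }
}

/// Random access: a single contiguous slice (one read) yields every field
/// of `doc_id`. A column-ordered layout would need one lookup per field.
pub fn read_row(store: &[u8], doc_id: usize) -> Option<Row> {
    let start = doc_id.checked_mul(ROW_SIZE)?;
    let bytes = store.get(start..start + ROW_SIZE)?;
    Some(Row::decode(bytes))
}
```

Because every row has the same width, the offset of a document is simply `doc_id * ROW_SIZE`, so no per-document index is needed.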
Mikkel Denker
5b8f03c890
rename fast fields to columnar fields
2024-07-06 16:52:01 +02:00
Mikkel Denker
85b7de7c89
remove delete functionality from tantivy for simplicity
2024-07-06 10:21:28 +02:00
Mikkel Denker
e2fe438912
remove unused
2024-07-04 14:48:55 +02:00
Mikkel Denker
15ae3b4087
remove unused
2024-07-04 14:37:21 +02:00
Mikkel Denker
ba14aaab68
remove optional and multivalued columns from tantivy as we only use full columnar indices
2024-07-04 14:12:14 +02:00
Mikkel Denker
8af2144898
mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stract's
2024-07-02 09:08:21 +02:00
Mikkel Denker
f46abd0511
move tantivy dependencies up to workspace for consistent versioning between crates
2024-07-01 20:40:15 +02:00
Mikkel Denker
297f79d46b
store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment
2024-07-01 15:11:52 +02:00
Mikkel Denker
28d2eff2c2
remove unused feature flags from tantivy
2024-07-01 14:32:23 +02:00
Mikkel Denker
3e5875839b
use workspace fst in tantivy
2024-07-01 14:22:34 +02:00
Mikkel Denker
b15261b003
remove some unused dependencies
2024-07-01 14:12:58 +02:00
Mikkel Denker
4fb6af6fed
remove aggregations for simplicity
2024-07-01 13:25:32 +02:00
Mikkel Denker
4306586763
fix clippy warnings
2024-07-01 13:09:38 +02:00
Mikkel Denker
a3292143d3
fix clippy warnings
2024-07-01 12:42:02 +02:00
Mikkel Denker
454774dfa7
fork tantivy
...
our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble that of the average user. forking tantivy allows us to customize it directly for our use case
2024-07-01 11:26:41 +02:00
Mikkel Denker
e5cc6f8442
add total num hosts and domains to crawlplan stats
2024-06-28 14:05:55 +02:00
Mikkel Denker
071e41d167
ensure reasonable limit for robotstxt files
...
currently same as html body (32mb)
2024-06-28 13:47:09 +02:00
Mikkel Denker
e1ccaf9251
crawler robustness
...
• delay between robots.txt requests
• retry robots.txt requests 3 times
• if /robots.txt request returns anything except 404 or 200 (times out etc.), don’t crawl site
• respect crawldelay (up to max limit)
• respect 429 Retry-After header (up to max limit)
• increase timeout in reqwest client
• only visit sites on port 80 and 443
2024-06-28 13:35:37 +02:00
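Some of the rules listed above can be sketched as small pure functions. The names, the `MAX_DELAY` value of 180 seconds, and the choice to handle only the delay-seconds form of Retry-After (the header may also carry an HTTP date) are assumptions for illustration, not the crawler's actual code:

```rust
use std::time::Duration;

/// Assumed upper bound for any politeness delay; the real limit is not stated in the commit.
pub const MAX_DELAY: Duration = Duration::from_secs(180);

/// Parses the delay-seconds form of a Retry-After header and caps it at MAX_DELAY.
/// Returns None for anything that is not a plain number of seconds.
pub fn retry_after_delay(header: &str) -> Option<Duration> {
    let secs: u64 = header.trim().parse().ok()?;
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}

/// Uses the Crawl-delay requested in robots.txt if present, otherwise `default`,
/// capped at MAX_DELAY.
pub fn crawl_delay(requested: Option<Duration>, default: Duration) -> Duration {
    requested.unwrap_or(default).min(MAX_DELAY)
}

/// Only a 200 or 404 response for /robots.txt allows crawling the site.
/// `None` stands for a request that never produced a status (timeout etc.).
pub fn may_crawl(robots_status: Option<u16>) -> bool {
    matches!(robots_status, Some(200) | Some(404))
}

/// Only the default HTTP and HTTPS ports are visited.
pub fn allowed_port(port: u16) -> bool {
    port == 80 || port == 443
}
```

Keeping these checks free of network code makes each rule easy to unit test on its own.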
Mikkel Denker
126eedc0d0
way more robust robotstxt parser
2024-06-27 16:10:41 +02:00
Mikkel Denker
95f0703602
approximate centrality naming consistency with exact harmonic
2024-06-24 11:51:57 +02:00
Mikkel Denker
385e8375c6
deduplicate urls during indexing
2024-06-24 10:43:12 +02:00
Mikkel Denker
3844e29bd1
make bm25 constants configurable for each field
2024-06-21 15:16:07 +02:00
Mikkel Denker
e4ae26470e
skip links to/from same domain when calculating harmonic centralities
...
makes it more expensive for linkfarms
2024-06-21 14:21:31 +02:00
Mikkel Denker
de5c946b58
move last worker folder into correct output location after indexing
2024-06-21 12:16:18 +02:00
Mikkel Denker
1bcf74ec11
remove bm25+ again
...
while it scales low term frequencies, it also adds a positive score for pages even though they don't match the query at all. that seems like a really bad idea...
2024-06-20 15:58:06 +02:00
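The problem described above shows up directly in the per-term formula. The following is a sketch of the textbook BM25 term score with the BM25+ lower-bound term `delta` added; the function name and parameter names are illustrative, and it assumes (as the commit message suggests) that the term score is evaluated for documents that do not contain the term:

```rust
/// Per-term score. With `delta == 0.0` this is plain BM25;
/// with `delta > 0.0` it is the BM25+ variant.
pub fn bm25_term(
    tf: f64,      // term frequency in the document
    doc_len: f64, // length of the document
    avg_len: f64, // average document length in the collection
    idf: f64,     // inverse document frequency of the term
    k1: f64,
    b: f64,
    delta: f64,
) -> f64 {
    // Length normalisation shared by BM25 and BM25+.
    let norm = k1 * (1.0 - b + b * doc_len / avg_len);
    idf * (tf * (k1 + 1.0) / (tf + norm) + delta)
}
```

For `tf = 0` the first summand vanishes, so plain BM25 scores the document 0 while BM25+ still contributes `idf * delta`, which is the unwanted positive score for non-matching pages.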
Mikkel Denker
5beae3b9a9
simplified bm25f that uses same IDF weight across all fields
...
e.g. the term 'the' might not be very common in titles but should still be scaled as a less important term than other terms in the query. instead of duplicating all text in the index we approximate the bm25f IDF weight as the highest IDF across the fields
2024-06-20 15:01:41 +02:00
Mikkel Denker
2fd6db3cfa
bm25+ to not over penalize long documents
...
https://dl.acm.org/doi/10.1145/2063576.2063584
2024-06-20 10:07:08 +02:00
Mikkel Denker
ca9b249992
[crawler] don't re-crawl url when it redirects to another url
2024-06-18 15:05:16 +02:00
Mikkel Denker
5968185136
[crawler] normalize url before fetch
2024-06-18 14:46:50 +02:00
Mikkel Denker
ad617a8151
[crawler] another check to avoid re-crawling urls
2024-06-18 14:19:15 +02:00
Mikkel Denker
2376f33e19
[crawler] some sites seem to only host robots.txt files on the 'www.' subdomain without redirects
2024-06-18 14:06:56 +02:00
Mikkel Denker
07369e1005
[crawler] continue wandering while budget > 0 and we still know about uncrawled urls
2024-06-18 13:37:29 +02:00
Mikkel Denker
919441850b
parse html mime metadata
2024-06-17 14:26:45 +02:00
Mikkel Denker
2bc0d69e00
support both 'linkto:' and 'linksto:'
2024-06-17 12:47:18 +02:00
Mikkel Denker
6aec31525a
add a 'linksto' query operator
2024-06-17 12:33:10 +02:00
Mikkel Denker
a1c9721b6c
remove some telemetry hosts from list of ad servers so pages aren't mistakenly tagged as containing ads ( #209 )
2024-06-12 17:29:21 +02:00