413 commits

Author SHA1 Message Date
Mikkel Denker
58da0b3567 add a 'linksfrom' query operator to match against backlinks 2024-07-26 14:56:55 +02:00
Mikkel Denker
119403b7e1 remove once_cell dep as it is now part of std 2024-07-26 10:08:43 +02:00
Mikkel Denker
8f97617904 make ml models optional during setup 2024-07-25 15:16:32 +02:00
Mikkel Denker
fa5282a800 move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters 2024-07-25 13:58:07 +02:00
Mikkel Denker
cef67d6aaf automatically build autosuggest from search index keywords 2024-07-24 16:48:57 +02:00
Mikkel Denker
0f9c0bbd87 generalize 'SplitWhitespaceWithRangeIter' to 'SplitWithRangeIter' that accepts a predicate for where to create the splits 2024-07-23 17:26:04 +02:00
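As a rough illustration of the idea in this commit, the sketch below splits a string wherever a caller-supplied predicate matches and yields each piece together with its byte range. The type name `SplitWithRangeSketch` and its fields are hypothetical and are not taken from the repository; the actual iterator may differ in naming and behaviour (for example in how it treats empty pieces).

```rust
use std::ops::Range;

/// Hypothetical sketch: yields each non-empty piece of `text` together with
/// its byte range, splitting wherever `pred` returns true.
struct SplitWithRangeSketch<'a, P> {
    text: &'a str,
    pos: usize,
    pred: P,
}

impl<'a, P: Fn(char) -> bool> SplitWithRangeSketch<'a, P> {
    fn new(text: &'a str, pred: P) -> Self {
        Self { text, pos: 0, pred }
    }
}

impl<'a, P: Fn(char) -> bool> Iterator for SplitWithRangeSketch<'a, P> {
    type Item = (&'a str, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        let text: &'a str = self.text;
        let pred = &self.pred;
        // Skip separator characters; stop when nothing but separators is left.
        let start = self.pos + text[self.pos..].find(|c: char| !pred(c))?;
        // The piece ends at the next separator or at the end of the input.
        let end = text[start..]
            .find(|c: char| pred(c))
            .map_or(text.len(), |i| start + i);
        self.pos = end;
        Some((&text[start..end], start..end))
    }
}
```

Passing `char::is_whitespace` as the predicate gives the whitespace-only behaviour that the earlier iterator name suggests. The ranges are byte offsets, so they can be used to slice the original string directly.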
Mikkel Denker
f1b72a897d normalize diacritics/accents
we want 'cafe' to also return results that contain 'café' etc
2024-07-23 16:28:32 +02:00
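A minimal sketch of what accent folding can look like, assuming already-lowercased text and a small hand-written table of precomposed Latin characters. This is an illustration only: the commit does not say how the normalization is implemented, and a real implementation would more likely rely on Unicode decomposition than on a fixed table. The function names are hypothetical.

```rust
/// Hypothetical helper: maps a few precomposed, lowercase Latin letters to
/// their unaccented base letter. Characters not in the table pass through.
fn fold_char(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' | 'å' => 'a',
        'ç' => 'c',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ñ' => 'n',
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        'ý' | 'ÿ' => 'y',
        other => other,
    }
}

/// Applies `fold_char` to every character, so that the indexed term and the
/// query term end up in the same accent-free form.
fn fold_diacritics(text: &str) -> String {
    text.chars().map(fold_char).collect()
}
```

If the same folding is applied at both index time and query time, 'cafe' and 'café' map to the same term.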
Mikkel Denker
cb1c66f0e8 avoid string alloc in tokenizer if possible 2024-07-23 13:27:31 +02:00
Mikkel Denker
aabd774e34 add proptests to some potentially problematic offsets into strings 2024-07-23 11:49:11 +02:00
Mikkel Denker
a5e0bebee9 fix clippy warnings
don't know why these don't show up locally...
2024-07-22 22:20:33 +02:00
Mikkel Denker
c4192af997 re-write tokenizer to not use logos anymore
this should fix a reported stack overflow (might be related to https://github.com/maciejhirsz/logos/issues/384) and should also make it easier to add additional scripts besides latin in the future
2024-07-22 22:10:59 +02:00
Mikkel Denker
76d7323524 row order numerical fields used for ranking 2024-07-21 14:57:44 +02:00
Mikkel Denker
e44f34e261 f64 and bool types in numerical fields 2024-07-17 16:27:48 +02:00
Mikkel Denker
ffb2a2a0a0 random access row ordered fields for ints, floats and bools.
most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields by rows, we can fetch all the relevant fields with a minimum number of IO operations, whereas we would need at least one IO operation for each field if they were column ordered
2024-07-16 10:58:56 +02:00
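The row-ordering argument can be illustrated with a small in-memory sketch, assuming every field is stored as a fixed-width 8-byte value. `RowStoreSketch` and its methods are hypothetical and do not reflect the actual on-disk format; the point is only that all fields of one document sit in a single contiguous byte range.

```rust
use std::ops::Range;

/// Hypothetical row-ordered store: all fields of a document are adjacent,
/// each encoded as a little-endian u64.
struct RowStoreSketch {
    num_fields: usize,
    bytes: Vec<u8>,
}

impl RowStoreSketch {
    const WIDTH: usize = 8;

    fn new(num_fields: usize) -> Self {
        Self { num_fields, bytes: Vec::new() }
    }

    /// Appends one document's fields as a single row.
    fn push_row(&mut self, row: &[u64]) {
        assert_eq!(row.len(), self.num_fields);
        for value in row {
            self.bytes.extend_from_slice(&value.to_le_bytes());
        }
    }

    /// Byte range holding every field of `doc`: one contiguous read.
    fn row_range(&self, doc: usize) -> Range<usize> {
        let row_len = self.num_fields * Self::WIDTH;
        doc * row_len..(doc + 1) * row_len
    }

    /// Decodes all fields of `doc` from that single range.
    fn get_row(&self, doc: usize) -> Vec<u64> {
        self.bytes[self.row_range(doc)]
            .chunks_exact(Self::WIDTH)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect()
    }
}
```

In a column-ordered layout the same lookup would touch one separate region per field, which is the extra IO the commit message describes.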
Mikkel Denker
5b8f03c890 rename fast fields to columnar fields 2024-07-06 16:52:01 +02:00
Mikkel Denker
85b7de7c89 remove delete functionality from tantivy for simplicity 2024-07-06 10:21:28 +02:00
Mikkel Denker
e2fe438912 remove unused 2024-07-04 14:48:55 +02:00
Mikkel Denker
15ae3b4087 remove unused 2024-07-04 14:37:21 +02:00
Mikkel Denker
ba14aaab68 remove optional and multivalued columns from tantivy as we only use full columnar indices 2024-07-04 14:12:14 +02:00
Mikkel Denker
8af2144898 mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stracts 2024-07-02 09:08:21 +02:00
Mikkel Denker
f46abd0511 move tantivy dependencies up to workspace for consistent versioning between crates 2024-07-01 20:40:15 +02:00
Mikkel Denker
297f79d46b store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment 2024-07-01 15:11:52 +02:00
Mikkel Denker
28d2eff2c2 remove unused feature flags from tantivy 2024-07-01 14:32:23 +02:00
Mikkel Denker
3e5875839b use workspace fst in tantivy 2024-07-01 14:22:34 +02:00
Mikkel Denker
b15261b003 remove some unused dependencies 2024-07-01 14:12:58 +02:00
Mikkel Denker
4fb6af6fed remove aggregations for simplicity 2024-07-01 13:25:32 +02:00
Mikkel Denker
4306586763 fix clippy warnings 2024-07-01 13:09:38 +02:00
Mikkel Denker
a3292143d3 fix clippy warnings 2024-07-01 12:42:02 +02:00
Mikkel Denker
454774dfa7 fork tantivy
our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user's. forking tantivy allows us to customize it directly for our use case
2024-07-01 11:26:41 +02:00
Mikkel Denker
e5cc6f8442 add total num hosts and domains to crawlplan stats 2024-06-28 14:05:55 +02:00
Mikkel Denker
071e41d167 ensure reasonable limit for robotstxt files
currently same as html body (32mb)
2024-06-28 13:47:09 +02:00
Mikkel Denker
e1ccaf9251 crawler robustness
• delay between robots.txt requests
• retry robots.txt requests 3 times
• if /robots.txt request returns anything except 404 or 200 (times out etc.), don’t crawl site
• respect crawldelay (up to max limit)
• respect 429 Retry-After header (up to max limit)
• increase timeout in reqwest client
• only visit sites on port 80 and 443
2024-06-28 13:35:37 +02:00
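The rules listed in this commit can be sketched as a few small decision functions. All names below are hypothetical, and several details are assumptions rather than facts from the commit: that a 404 means "no restrictions", that only failed requests (not unexpected status codes) are retried, and that "up to max limit" means clamping the delay. The pause between robots.txt requests is omitted for brevity.

```rust
use std::time::Duration;

/// Outcome of a robots.txt request (hypothetical type).
#[derive(Debug, PartialEq)]
enum RobotsFetch {
    Status(u16),
    Failed, // timeout, connection error, etc.
}

#[derive(Debug, PartialEq)]
enum RobotsDecision {
    UseRules, // 200: parse the file and obey it
    AllowAll, // 404: assumed here to mean no restrictions
    SkipSite, // anything else: don't crawl the site
}

fn decide(fetch: &RobotsFetch) -> RobotsDecision {
    match fetch {
        RobotsFetch::Status(200) => RobotsDecision::UseRules,
        RobotsFetch::Status(404) => RobotsDecision::AllowAll,
        _ => RobotsDecision::SkipSite,
    }
}

/// Tries the request up to `max_attempts` times; assumes only failures
/// are worth retrying.
fn fetch_with_retries<F>(mut fetch: F, max_attempts: usize) -> RobotsFetch
where
    F: FnMut() -> RobotsFetch,
{
    for _ in 0..max_attempts {
        if let RobotsFetch::Status(code) = fetch() {
            return RobotsFetch::Status(code);
        }
    }
    RobotsFetch::Failed
}

/// Honours a requested delay (crawl-delay or Retry-After), clamped to `max`.
fn effective_delay(requested: Option<Duration>, default: Duration, max: Duration) -> Duration {
    requested.unwrap_or(default).min(max)
}

/// Only the standard HTTP and HTTPS ports are visited.
fn is_allowed_port(port: u16) -> bool {
    matches!(port, 80 | 443)
}
```

Keeping these as pure functions, separate from the HTTP client, makes the policy easy to unit test without network access.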
Mikkel Denker
126eedc0d0 way more robust robotstxt parser 2024-06-27 16:10:41 +02:00
Mikkel Denker
95f0703602 approximate centrality naming consistency with exact harmonic 2024-06-24 11:51:57 +02:00
Mikkel Denker
385e8375c6 deduplicate urls during indexing 2024-06-24 10:43:12 +02:00
Mikkel Denker
3844e29bd1 make bm25 constants configurable for each field 2024-06-21 15:16:07 +02:00
Mikkel Denker
e4ae26470e skip links to/from same domain when calculating harmonic centralities
makes it more expensive for linkfarms
2024-06-21 14:21:31 +02:00
Mikkel Denker
de5c946b58 move last worker folder into correct output location after indexing 2024-06-21 12:16:18 +02:00
Mikkel Denker
1bcf74ec11 remove bm25+ again
while it scales low term frequencies, it also adds a positive score for pages even though they don't match the query at all. that seems like a really bad idea...
2024-06-20 15:58:06 +02:00
Mikkel Denker
5beae3b9a9 simplified bm25f that uses same IDF weight across all fields
e.g. the term 'the' might not be very common in titles but should still be scaled as a less important term than other terms in the query. instead of duplicating all text in the index we approximate the bm25f IDF weight as the highest IDF across the fields
2024-06-20 15:01:41 +02:00
Mikkel Denker
2fd6db3cfa bm25+ to not over penalize long documents
https://dl.acm.org/doi/10.1145/2063576.2063584
2024-06-20 10:07:08 +02:00
Mikkel Denker
ca9b249992 [crawler] don't re-crawl url when it redirects to another url 2024-06-18 15:05:16 +02:00
Mikkel Denker
5968185136 [crawler] normalize url before fetch 2024-06-18 14:46:50 +02:00
Mikkel Denker
ad617a8151 [crawler] another check to avoid re-crawling urls 2024-06-18 14:19:15 +02:00
Mikkel Denker
2376f33e19 [crawler] some sites seem to only host robots.txt files on the 'www.' subdomain without redirects 2024-06-18 14:06:56 +02:00
Mikkel Denker
07369e1005 [crawler] continue wandering while budget > 0 and we still know about uncrawled urls 2024-06-18 13:37:29 +02:00
Mikkel Denker
919441850b parse html mime metadata 2024-06-17 14:26:45 +02:00
Mikkel Denker
2bc0d69e00 support both 'linkto:' and 'linksto:' 2024-06-17 12:47:18 +02:00
Mikkel Denker
6aec31525a add a 'linksto' query operator 2024-06-17 12:33:10 +02:00
Mikkel Denker
a1c9721b6c remove some telemetry hosts from list of ad servers so pages aren't mistakenly tagged as containing ads (#209) 2024-06-12 17:29:21 +02:00