Commit graph

36 commits

Author SHA1 Message Date
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
46016ab03d [tantivy] cached column to reduce disk reads 2024-11-27 14:36:22 +01:00
Mikkel Denker
efbe1db918 [tantivy] support u128 in row order 2024-11-23 10:53:08 +01:00
Mikkel Denker
b81d84762c [tantivy] support u128 column values 2024-11-22 15:26:43 +01:00
Mikkel Denker
7b75a854b7 [tantivy] support u128 values in inverted index 2024-11-22 14:29:15 +01:00
Mikkel Denker
3a48c8c7b4 [tantivy] raw column values codec
there seems to be a bug somewhere in the decompression of one of the other codecs for enormous columns. the to_id and from_id columns in the webgraph seem to point to wrong ids (don't know exactly why). as the ids are random hashes anyway, we don't gain much from compression and can therefore simply store the columns as raw u64 values on disk
2024-11-21 16:03:11 +01:00
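A minimal sketch of what storing a column as raw u64 values could look like, assuming a little-endian, fixed 8-byte layout. The function names are hypothetical and are not taken from the fork; the point is only that random access becomes a fixed-offset read with no decompression step.

```rust
/// Write the column as plain little-endian u64 values, with no compression.
/// (Hypothetical layout: value `i` occupies bytes `8*i .. 8*i + 8`.)
pub fn write_raw_column(values: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 8);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Read the value at `idx` with a single fixed-offset lookup.
/// Returns `None` if `idx` is past the end of the column.
pub fn read_raw_value(data: &[u8], idx: usize) -> Option<u64> {
    let start = idx.checked_mul(8)?;
    let end = start.checked_add(8)?;
    let bytes = data.get(start..end)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}
```

Since the ids are described as random hashes, a layout like this trades little space for a much simpler read path.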
Mikkel Denker
54ac58f7e9 [tantivy] kmerge posting lists to avoid iterating over documents that are not present in the posting list during segment merges, while preserving low memory footprint 2024-11-06 11:10:44 +01:00
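The k-way merge idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: posting lists are modelled as in-memory `Vec<u32>` of ascending doc ids, and `kmerge_postings` and its `emit` callback are hypothetical names. The heap holds one cursor per list, so only documents that actually occur in some list are visited and memory stays proportional to the number of lists.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several ascending posting lists into one ascending stream,
/// calling `emit` once per distinct doc id.
pub fn kmerge_postings(lists: &[Vec<u32>], mut emit: impl FnMut(u32)) {
    // Heap entries are (doc id, list index, position in that list).
    // `Reverse` turns the max-heap into a min-heap on doc id.
    let mut heap = BinaryHeap::new();
    for (list_idx, list) in lists.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, list_idx, 0usize)));
        }
    }

    let mut last: Option<u32> = None;
    while let Some(Reverse((doc, list_idx, pos))) = heap.pop() {
        // Skip an id that was already emitted from another list.
        if last != Some(doc) {
            emit(doc);
            last = Some(doc);
        }
        // Advance the cursor of the list we just consumed from.
        if let Some(&next) = lists[list_idx].get(pos + 1) {
            heap.push(Reverse((next, list_idx, pos + 1)));
        }
    }
}
```

A caller would pass a closure that writes each id to the output segment, for example `kmerge_postings(&lists, |doc| out.push(doc))`.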
Mikkel Denker
94d319209b [tantivy] reduce memory usage when writing postings
each posting list is already sorted by the new document ids (even without index sorting). if new(a) < new(b) => old(a) < old(b) and vice versa. the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them
2024-11-05 17:26:52 +01:00
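The argument above can be sketched in code. Assuming the old-to-new doc id mapping preserves order, each remapped id can be written as soon as it is produced. The function name, the closure-based mapping and the 4-byte little-endian delta format are all assumptions made for this illustration; they do not describe the actual on-disk format.

```rust
use std::io::{self, Write};

/// Stream one posting list to `out` after remapping its doc ids.
/// Assumes `old_to_new` is strictly increasing over the ids in the list,
/// so the remapped ids are already sorted and nothing has to be buffered.
pub fn write_remapped_postings<W: Write>(
    old_docs: impl Iterator<Item = u32>,
    old_to_new: impl Fn(u32) -> u32,
    out: &mut W,
) -> io::Result<()> {
    let mut prev: Option<u32> = None;
    for old in old_docs {
        let new = old_to_new(old);
        // Order preservation is the invariant that makes streaming valid.
        debug_assert!(prev.map_or(true, |p| p < new), "mapping must preserve order");
        // Delta-encode against the previous id; the first id is written as-is.
        let delta = new - prev.unwrap_or(0);
        out.write_all(&delta.to_le_bytes())?;
        prev = Some(new);
    }
    Ok(())
}
```

Only the previous id is kept in memory, regardless of how long the posting list is.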
Mikkel Denker
ab6ed35bb8 re-use row field reader for document across signals 2024-10-24 15:51:28 +02:00
Mikkel Denker
13a06b3820 make sure webgraph merge does not exceed maximum number of allowed documents in tantivy 2024-10-24 11:53:21 +02:00
Mikkel Denker
658ac6f682
Webgraph inverted index (#232)
* overall structure for new webgraph store

* webgraph schema structure and HostLinksQuery

* deserialize edge

* forward/backlink queries

* full edge queries and iter smalledges

* [wip] use new store in webgraph

* remove id2node db

* shortcircuit link queries

* [wip] remote webgraph trait structure

* [wip] shard awareness

* finish remote webgraph trait structure

* optimize read

* merge webgraphs

* construct webgraph store

* make sure 'just configure' works and everything looks correct
2024-10-23 11:59:52 +02:00
Mikkel Denker
3a127af0a7 improve compaction performance in live index by performing the initial segment merge under a read lock and only switching to the new segments under a write lock. this ensures that search requests can still be served while the heavy part of the merge is executing 2024-10-07 11:59:36 +02:00
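A minimal sketch of the two-phase locking pattern described in this commit, using `std::sync::RwLock`. `LiveIndex`, `Segment` and the methods below are simplified stand-ins invented for this example (a segment is just a sorted list of doc ids), and the sketch assumes only one compaction runs at a time.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for an on-disk segment: a sorted list of doc ids.
type Segment = Arc<Vec<u32>>;

pub struct LiveIndex {
    segments: RwLock<Vec<Segment>>,
}

impl LiveIndex {
    pub fn new(segments: Vec<Vec<u32>>) -> Self {
        let segments = segments.into_iter().map(Arc::new).collect();
        Self { segments: RwLock::new(segments) }
    }

    /// Searches only ever take the read lock.
    pub fn contains(&self, doc: u32) -> bool {
        let segments = self.segments.read().unwrap();
        segments.iter().any(|s| s.binary_search(&doc).is_ok())
    }

    pub fn num_segments(&self) -> usize {
        self.segments.read().unwrap().len()
    }

    pub fn compact(&self) {
        // Phase 1 (read lock): do the expensive merge. Concurrent searches
        // also take read locks, so they are not blocked.
        let (inputs, merged) = {
            let segments = self.segments.read().unwrap();
            let inputs: Vec<Segment> = segments.to_vec();
            let mut docs: Vec<u32> =
                inputs.iter().flat_map(|s| s.iter().copied()).collect();
            docs.sort_unstable();
            (inputs, Arc::new(docs))
        };

        // Phase 2 (write lock): swap in the merged segment. Segments that
        // were added after the snapshot are not in `inputs` and are kept.
        let mut segments = self.segments.write().unwrap();
        segments.retain(|s| !inputs.iter().any(|old| Arc::ptr_eq(old, s)));
        segments.push(merged);
    }
}
```

The write lock is held only for the swap, so searches are blocked for a short, roughly constant time instead of for the whole merge.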
Mikkel Denker
950862be9c
Re-open live index after it has been downloaded from replica (#227)
* re-open index after it has been downloaded from replica

* remove writer directory lock

* update meta file with segment changes

* flatten live index directory structure a bit for better overview

* additional live index tests
2024-10-01 09:19:08 +02:00
Mikkel Denker
de3239716f explicitly mark as unreachable 2024-09-18 12:26:18 +02:00
Mikkel Denker
ee0fc39eaa rustup update and fix clippy 2024-09-18 11:18:55 +02:00
Mikkel Denker
21ba4cde65
Live index without replication (#221)
* [WIP] live index code structure with a ton of todos

* update meta file with segment changes

* add endpoint to index webpages into live index

* compact segments by date

* cleanup old segments

* fix clippy warnings

* fix clippy warnings
2024-09-10 11:03:52 +02:00
Mikkel Denker
08dc07c575 update signal coefficients 2024-08-05 14:54:20 +02:00
Mikkel Denker
119403b7e1 remove once_cell dep as it is now part of std 2024-07-26 10:08:43 +02:00
Mikkel Denker
fa5282a800 move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters 2024-07-25 13:58:07 +02:00
Mikkel Denker
76d7323524 row order numerical fields used for ranking 2024-07-21 14:57:44 +02:00
Mikkel Denker
ffb2a2a0a0 random access row ordered fields for ints, floats and bools.
most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields by rows, we can fetch all the relevant fields with a minimum number of IO operations, whereas we would need at least one IO operation for each field if they were column ordered
2024-07-16 10:58:56 +02:00
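To illustrate the row-ordered layout, here is a small sketch with a fixed-width row of one float, one int and one bool. The field names, the 17-byte row width and the helper functions are assumptions made for the example and do not reflect the actual schema. Because every row has the same width, all fields of a document sit in one contiguous byte range.

```rust
/// Hypothetical row with one float, one int and one bool field.
#[derive(Debug, PartialEq)]
pub struct Row {
    pub host_centrality: f64,
    pub num_links: u64,
    pub is_homepage: bool,
}

/// 8 bytes (f64) + 8 bytes (u64) + 1 byte (bool).
const ROW_WIDTH: usize = 17;

fn read_8(bytes: &[u8]) -> [u8; 8] {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    buf
}

fn decode_row(bytes: &[u8]) -> Row {
    Row {
        host_centrality: f64::from_le_bytes(read_8(&bytes[0..8])),
        num_links: u64::from_le_bytes(read_8(&bytes[8..16])),
        is_homepage: bytes[16] != 0,
    }
}

/// Serialize rows back to back, indexed by doc id.
pub fn write_rows(rows: &[Row]) -> Vec<u8> {
    let mut out = Vec::with_capacity(rows.len() * ROW_WIDTH);
    for row in rows {
        out.extend_from_slice(&row.host_centrality.to_le_bytes());
        out.extend_from_slice(&row.num_links.to_le_bytes());
        out.push(row.is_homepage as u8);
    }
    out
}

/// Fetch every field of one document from a single contiguous range.
/// With a column-ordered layout this would take one lookup per field.
pub fn read_row(data: &[u8], doc_id: usize) -> Option<Row> {
    let start = doc_id.checked_mul(ROW_WIDTH)?;
    let end = start.checked_add(ROW_WIDTH)?;
    data.get(start..end).map(decode_row)
}
```

On disk, the slice lookup in `read_row` would correspond to one read of `ROW_WIDTH` bytes at offset `doc_id * ROW_WIDTH`.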
Mikkel Denker
5b8f03c890 rename fast fields to columnar fields 2024-07-06 16:52:01 +02:00
Mikkel Denker
85b7de7c89 remove delete functionality from tantivy for simplicity 2024-07-06 10:21:28 +02:00
Mikkel Denker
e2fe438912 remove unused 2024-07-04 14:48:55 +02:00
Mikkel Denker
15ae3b4087 remove unused 2024-07-04 14:37:21 +02:00
Mikkel Denker
ba14aaab68 remove optional and multivalued columns from tantivy as we only use full columnar indices 2024-07-04 14:12:14 +02:00
Mikkel Denker
8af2144898 mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stract's 2024-07-02 09:08:21 +02:00
Mikkel Denker
f46abd0511 move tantivy dependencies up to workspace for consistent versioning between crates 2024-07-01 20:40:15 +02:00
Mikkel Denker
297f79d46b store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment 2024-07-01 15:11:52 +02:00
Mikkel Denker
28d2eff2c2 remove unused feature flags from tantivy 2024-07-01 14:32:23 +02:00
Mikkel Denker
3e5875839b use workspace fst in tantivy 2024-07-01 14:22:34 +02:00
Mikkel Denker
b15261b003 remove some unused dependencies 2024-07-01 14:12:58 +02:00
Mikkel Denker
4fb6af6fed remove aggregations for simplicity 2024-07-01 13:25:32 +02:00
Mikkel Denker
4306586763 fix clippy warnings 2024-07-01 13:09:38 +02:00
Mikkel Denker
a3292143d3 fix clippy warnings 2024-07-01 12:42:02 +02:00
Mikkel Denker
454774dfa7 fork tantivy
our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user's. forking tantivy allows us to customize it directly for our use case
2024-07-01 11:26:41 +02:00