0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	1d821ef4db	rustup update and fix clippy warnings	2024-11-27 17:15:24 +01:00
Mikkel Denker	46016ab03d	[tantivy] cached column to reduce disk reads	2024-11-27 14:36:22 +01:00
Mikkel Denker	efbe1db918	[tantivy] support u128 in row order	2024-11-23 10:53:08 +01:00
Mikkel Denker	b81d84762c	[tantivy] support u128 column values	2024-11-22 15:26:43 +01:00
Mikkel Denker	7b75a854b7	[tantivy] support u128 values in inverted index	2024-11-22 14:29:15 +01:00
Mikkel Denker	3a48c8c7b4	[tantivy] raw column values codec. there seems to be a bug somewhere in the decompression of one of the other codecs for enormous columns: the to_id and from_id columns in the webgraph seem to point to wrong ids (don't know exactly why). as the ids are random hashes anyway, we don't gain much from compression and can therefore simply store the columns as raw u64 values on disk	2024-11-21 16:03:11 +01:00
Mikkel Denker	54ac58f7e9	[tantivy] kmerge posting lists to avoid iterating over documents that are not present in the posting list during segment merges, while preserving low memory footprint	2024-11-06 11:10:44 +01:00
Mikkel Denker	94d319209b	[tantivy] reduce memory usage when writing postings. each posting list is already sorted by the new document ids (even without index sorting): if new(a) < new(b) => old(a) < old(b) and vice versa. the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them	2024-11-05 17:26:52 +01:00
Mikkel Denker	ab6ed35bb8	re-use row field reader for document across signals	2024-10-24 15:51:28 +02:00
Mikkel Denker	13a06b3820	make sure webgraph merge does not exceed maximum number of allowed documents in tantivy	2024-10-24 11:53:21 +02:00
Mikkel Denker	658ac6f682	Webgraph inverted index (#232 ) * overall structure for new webgraph store * webgraph schema structure and HostLinksQuery * deserialize edge * forward/backlink queries * full edge queries and iter smalledges * [wip] use new store in webgraph * remove id2node db * shortcircuit link queries * [wip] remote webgraph trait structure * [wip] shard awareness * finish remote webgraph trait structure * optimize read * merge webgraphs * construct webgraph store * make sure 'just configure' works and everything looks correct	2024-10-23 11:59:52 +02:00
Mikkel Denker	3a127af0a7	improve compaction performance in live index by performing the initial segment merge on a read lock and only switching to the new segments on a write lock. this ensures that search requests can still be performed while the heavy part of the merge is executing	2024-10-07 11:59:36 +02:00
Mikkel Denker	950862be9c	Re-open live index after it has been downloaded from replica (#227 ) * re-open index after it has been downloaded from replica * remove writer directory lock * update meta file with segment changes * flatten live index directory structure a bit for better overview * additional live index tests	2024-10-01 09:19:08 +02:00
Mikkel Denker	de3239716f	explicitly mark as unreachable	2024-09-18 12:26:18 +02:00
Mikkel Denker	ee0fc39eaa	rustup update and fix clippy	2024-09-18 11:18:55 +02:00
Mikkel Denker	21ba4cde65	Live index without replication (#221 ) * [WIP] live index code structure with a ton of todos * update meta file with segment changes * add endpoint to index webpages into live index * compact segments by date * cleanup old segments * fix clippy warnings * fix clippy warnings	2024-09-10 11:03:52 +02:00
Mikkel Denker	08dc07c575	update signal coefficients	2024-08-05 14:54:20 +02:00
Mikkel Denker	119403b7e1	remove once_cell dep as it is now part of std	2024-07-26 10:08:43 +02:00
Mikkel Denker	fa5282a800	move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters	2024-07-25 13:58:07 +02:00
Mikkel Denker	76d7323524	row order numerical fields used for ranking	2024-07-21 14:57:44 +02:00
Mikkel Denker	ffb2a2a0a0	random access row ordered fields for ints, floats and bools. most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields by rows, we can fetch all the relevant fields with a minimum number of IO operations, whereas we would need at least one IO operation for each field if they were column ordered	2024-07-16 10:58:56 +02:00
Mikkel Denker	5b8f03c890	rename fast fields to columnar fields	2024-07-06 16:52:01 +02:00
Mikkel Denker	85b7de7c89	remove delete functionality from tantivy for simplicity	2024-07-06 10:21:28 +02:00
Mikkel Denker	e2fe438912	remove unused	2024-07-04 14:48:55 +02:00
Mikkel Denker	15ae3b4087	remove unused	2024-07-04 14:37:21 +02:00
Mikkel Denker	ba14aaab68	remove optional and multivalued columns from tantivy as we only use full columnar indices	2024-07-04 14:12:14 +02:00
Mikkel Denker	8af2144898	mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stract's	2024-07-02 09:08:21 +02:00
Mikkel Denker	f46abd0511	move tantivy dependencies up to workspace for consistent versioning between crates	2024-07-01 20:40:15 +02:00
Mikkel Denker	297f79d46b	store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment	2024-07-01 15:11:52 +02:00
Mikkel Denker	28d2eff2c2	remove unused feature flags from tantivy	2024-07-01 14:32:23 +02:00
Mikkel Denker	3e5875839b	use workspace fst in tantivy	2024-07-01 14:22:34 +02:00
Mikkel Denker	b15261b003	remove some unused dependencies	2024-07-01 14:12:58 +02:00
Mikkel Denker	4fb6af6fed	remove aggregations for simplicity	2024-07-01 13:25:32 +02:00
Mikkel Denker	4306586763	fix clippy warnings	2024-07-01 13:09:38 +02:00
Mikkel Denker	a3292143d3	fix clippy warnings	2024-07-01 12:42:02 +02:00
Mikkel Denker	454774dfa7	fork tantivy. our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user. forking tantivy allows us to customize it directly for our use case	2024-07-01 11:26:41 +02:00
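
The reasoning behind commits 454774dfa7 and 297f79d46b (a u32 byte count cannot describe more than 4 GiB of positions) can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from tantivy or stract; the sketch only shows that a checked conversion to u32 fails once a byte count passes u32::MAX, while a u64 field stores it without loss.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: try to fit a byte count into a u32 field.
/// Returns None when the count exceeds u32::MAX (just under 4 GiB).
fn position_bytes_as_u32(num_bytes: u64) -> Option<u32> {
    u32::try_from(num_bytes).ok()
}

/// Hypothetical helper: encode the same byte count as 8 little-endian bytes,
/// which is what a u64 field on disk could look like.
fn position_bytes_as_u64_le(num_bytes: u64) -> [u8; 8] {
    num_bytes.to_le_bytes()
}
```

Under these assumptions, `position_bytes_as_u32(5_000_000_000)` yields `None`, whereas decoding `position_bytes_as_u64_le(5_000_000_000)` with `u64::from_le_bytes` gives back the original count.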

36 commits
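
Commit 54ac58f7e9 describes k-way merging posting lists during segment merges. The following is a minimal sketch of that general technique, not the code from the fork: it assumes each input list is already sorted by the new document id and that ids are disjoint across lists, and the function name and callback are hypothetical. Only one cursor per list is kept in the heap, so memory use is proportional to the number of lists rather than the number of postings.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch: merge sorted lists of doc ids and hand each id to
/// `emit` in ascending order, visiting only ids that are actually present.
fn kmerge_postings<F: FnMut(u32)>(lists: &[Vec<u32>], mut emit: F) {
    // Min-heap of (next doc id, list index, position within that list).
    let mut heap = BinaryHeap::new();
    for (list_idx, list) in lists.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, list_idx, 0usize)));
        }
    }

    while let Some(Reverse((doc, list_idx, pos))) = heap.pop() {
        // In a real writer this is where the posting would be streamed out.
        emit(doc);
        // Advance the cursor of the list the id came from.
        if let Some(&next) = lists[list_idx].get(pos + 1) {
            heap.push(Reverse((next, list_idx, pos + 1)));
        }
    }
}
```

A caller could collect the output with `kmerge_postings(&lists, |doc| out.push(doc))`; a disk-backed writer would replace the closure with a serializer call.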