0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	de7291daa1	just update	2024-12-03 15:00:08 +01:00
Mikkel Denker	040d04413e	Web spell as dedicated module (#240 ) * separate web-spell into a dedicated module * web-spell readme	2024-11-29 15:15:18 +01:00
Mikkel Denker	05d3cf9de5	urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities (#239 )	2024-11-29 12:17:01 +01:00
Mikkel Denker	1d821ef4db	rustup update and fix clippy warnings	2024-11-27 17:15:24 +01:00
Mikkel Denker	d2fa5f4061	use nom in zimba parser (#236 )	2024-11-20 10:12:52 +01:00
Mikkel Denker	12e9502e80	Improve API documentation (#235 ) * add docusaurus scalar api documentation structure * bump openapi 3.0 to 3.1 so we can mark internal endpoints * improve search api docs * webgraph api docs * point docs to prod	2024-11-19 13:43:42 +01:00
Mikkel Denker	4e8426888b	just update	2024-11-01 15:28:40 +01:00
Mikkel Denker	31bfebf2c9	just update	2024-10-25 09:37:45 +02:00
Mikkel Denker	658ac6f682	Webgraph inverted index (#232 ) * overall structure for new webgraph store * webgraph schema structure and HostLinksQuery * deserialize edge * forward/backlink queries * full edge queries and iter smalledges * [wip] use new store in webgraph * remove id2node db * shortcircuit link queries * [wip] remote webgraph trait structure * [wip] shard awareness * finish remote webgraph trait structure * optimize read * merge webgraphs * construct webgraph store * make sure 'just configure' works and everything looks correct	2024-10-23 11:59:52 +02:00
Mikkel Denker	5ebdb24a07	just update	2024-10-01 09:51:11 +02:00
Mikkel Denker	4e8c165a1c	cleanup temporary directories automatically in tests (#228 )	2024-10-01 09:42:14 +02:00
Mikkel Denker	21ba4cde65	Live index without replication (#221 ) * [WIP] live index code structure with a ton of todos * update meta file with segment changes * add endpoint to index webpages into live index * compact segments by date * cleanup old segments * fix clippy warnings * fix clippy warnings	2024-09-10 11:03:52 +02:00
Mikkel Denker	365ed02813	Very simple WAL built on top of file-store primitives (#219 ) Doesn't handle concurrent writes and flushes after each write. This will cause a lot of fsync's which will impact performance, but as this will be used for the live index where each item (a full webpage) is quite large, this will hopefully not be too detrimental.	2024-09-05 14:35:52 +02:00
Mikkel Denker	48789d75a2	cargo update	2024-09-04 16:18:11 +02:00
Mikkel Denker	308388262f	store node ids as big endian in webgraph so sort is correct during merge	2024-08-13 13:42:35 +02:00
Mikkel Denker	f3315c4b42	leechy ranking annotation experiment	2024-08-12 11:17:55 +02:00
Mikkel Denker	b79233302b	api admin interface	2024-08-06 18:26:14 +02:00
Mikkel Denker	119403b7e1	remove once_cell dep as it is now part of std	2024-07-26 10:08:43 +02:00
Mikkel Denker	fa5282a800	move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters	2024-07-25 13:58:07 +02:00
Mikkel Denker	8b2cbc98e0	cargo update	2024-07-24 17:45:17 +02:00
Mikkel Denker	f1b72a897d	normalize diacritics/accents we want 'cafe' to also return results that contains 'café' etc	2024-07-23 16:28:32 +02:00
Mikkel Denker	c4192af997	re-write tokenizer to not use logos anymore this should fix a reported stack overflow (might be related to https://github.com/maciejhirsz/logos/issues/384) and should also make it easier to add additional scripts besides latin in the future	2024-07-22 22:10:59 +02:00
Mikkel Denker	ffb2a2a0a0	random access row ordered fields for ints, floats and bools. most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields by rows, we can fetch all the relevant fields with a minimum number of IO operations, whereas we would need at least one IO operation for each field if they were column ordered	2024-07-16 10:58:56 +02:00
Mikkel Denker	f46abd0511	move tantivy dependencies up to workspace for consistent versioning between crates	2024-07-01 20:40:15 +02:00
Mikkel Denker	3e5875839b	use workspace fst in tantivy	2024-07-01 14:22:34 +02:00
Mikkel Denker	b15261b003	remove some unused dependencies	2024-07-01 14:12:58 +02:00
Mikkel Denker	454774dfa7	fork tantivy our segments are starting to grow too big, so the assumption that number of position bytes can be in a u32 is no longer the case. storing it in u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user. forking tantivy allows us to customize it directly for our usecase	2024-07-01 11:26:41 +02:00
Mikkel Denker	126eedc0d0	way more robust robotstxt parser	2024-06-27 16:10:41 +02:00
Mikkel Denker	a9eb8acd80	store 'rel' attribute for each edge in the webgraph this allows us to skip links to tag pages etc. when calculating harmonic centrality which should greatly improve the centrality values for the page graph	2024-06-11 15:34:54 +02:00
Mikkel Denker	e74978c5bc	update image dependency	2024-05-23 12:44:01 +02:00
Mikkel Denker	823efa6716	implement distributed version of approximated harmonic centrality the page graph still seems to be too big to calculate the exact centrality even when distributed across multiple workers (need more workers)	2024-05-21 14:53:12 +02:00
Mikkel Denker	d023552391	use binary heap for less cmp when merging spell correction dictionaries	2024-05-18 13:32:50 +02:00
Mikkel Denker	17fed5a75c	Show ranking signals (#201 )	2024-05-17 16:39:33 +02:00
Mikkel Denker	b302a8d5c7	coordinate changed nodes between workers in distributed harmonic to make sure a node that has been updated in worker A is also considered for updates on worker B	2024-05-10 15:09:44 +02:00
Mikkel Denker	e026fe5548	Sonic connection pool (#200 ) * allow connection reuse by not taking ownership in send methods * [sonic] continuously handle requests from each connection in the server as long as the connection is not closed * add connection pool to sonic based on deadpool * use connection pool in remote webgraph and distributed searcher * hopefully fix flaky test * hopefully fix flaky test	2024-05-09 15:24:43 +02:00
Mikkel Denker	76cd7e8f63	fixed bug that caused queries with special characters to crash ('c++' etc) 'c++' gets tokenized as ['c', '+', '+'] which we use in a phrase query to enforce that the result must have 'c++' in sequence instead of simply having 'c' somewhere on the page and '+' another place. however, some fields don't have the necessary position data stored which caused these queries to crash when trying to perform the phrase query on these fields	2024-05-07 12:51:32 +02:00
Mikkel Denker	3c94cb7f81	approximate number of hits by assuming that each term is independent this allows us to short-circuit the query by default which significantly improves performance as we therefore don't have to iterate the non-scored results simply to count them	2024-05-06 15:21:17 +02:00
Mikkel Denker	73e5445018	update fend to 1.4.8 (#198 )	2024-05-05 17:05:44 +02:00
Mikkel Denker	7e9da2e37c	chore: upgrade dependencies for kuchiki	2024-05-03 12:23:10 +02:00
Mikkel Denker	9c983e5f96	Top k webgraph edges (#197 ) * implement random access index in file_store where keys are u64 and values are serialised to a constant size * cleanup: move all webgraph store writes into store_writer * add a 'ConstIterableStore' that can store items on disk without needing to interleave headers in the case that all items can be serialized to a constant number of bytes known up front * change edges file format to make edges for a given node iterable. this allows us to only load a subset of the edges for a node in the future * compress webgraph labels in blocks of 128 * ability to limit number of edges returned by webgraph * sort edges in webgraph store by the host rank of the opposite node	2024-05-03 09:33:57 +02:00
Mikkel Denker	da4f930b03	cratify file-store	2024-04-22 21:30:59 +02:00
Oliver Bøving	18d9d279fb	Cratify `bloom` and `speedy-kv` (#193 ) * Move bloom into separate crate * Move speedy_kv into a separate crate * add licenses --------- Co-authored-by: Mikkel Denker <mikkel@stract.com>	2024-04-22 21:18:44 +02:00
Mikkel Denker	1447be7bfa	revert tantivy upgrade a bit due to deprecation warning	2024-04-18 17:06:39 +02:00
Mikkel Denker	4e12528aa2	upgrade tantivy to 0.22	2024-04-18 15:59:53 +02:00
Mikkel Denker	c43283cc5d	use redb in live-index downloaded db. to truncate the database, we would have to implement deletes and possibly also some kind of auto merging strategy in speedy_kv. to keep things simple, we use redb for this db instead.	2024-04-18 12:45:02 +02:00
Mikkel Denker	57c2affa50	new speedy-kv designed for very read heavy workloads without many small writes. this basically describes most of our workloads. as an example, in the webgraph we know that we only ever get inserts when constructing the graph, after which all the reads will happen. the key-value database consists of the following components: * an fst index (key -> blob_id) * a memory mapped blob index (blob_id -> blob_ptr) * a memory mapped blob store (blob_ptr -> blob) this allows us to move everything over from rocksdb to speedy_kv, and thereby removing the rocksdb dependency.	2024-04-17 14:14:39 +02:00
Mikkel Denker	3ab4f944e0	MapReduce -> AMPC (#189 ) * [WIP] structure for mapreduce -> ampc and introduce tables in dht * temporarily disable failing lints in ampc/mod.rs * establish dht connection in ampc * support batch get/set in dht * ampc implementation (not tested yet) * dht upsert * no more todo's in ampc harmonic centrality impl * return 'UpsertAction' instead of bool from upserts this makes it easier to see what action was taken from the callers perspective. a bool is not particularly descriptive * add ability to have multiple dht tables for each ampc algorithm gives better type-safety as each table can then have their own key-value type pair * some bundled bug/correctness fixes. * await currently scheduled jobs after there are no more jobs to schedule. * execute each mapper fully at a time before scheduling next mapper. * compute centrality scores from set cardinalities. * refactor into smaller functions * happy path ampc dht test and split ampc into multiple files * correct harmonic centrality calculation in ampc * run distributed harmonic centrality worker and coordinator from cli * stream key/values from dht using range queries in batches * benchmark distributed centrality calculation * faster hash in shard selection and drop table in background thread * Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost * dht store copy-on-write for keys and values to make table clone faster * fix flaky dht test and improve .set performance using entries * dynamic batch size based on number of shards in dht cluster	2024-04-15 10:29:33 +02:00
Mikkel Denker	be5bb09fcf	option to limit number of concurrent search requests gives better rate limit control to slow down requests instead of crashing if we exceed the limits of the servers	2024-03-25 15:20:06 +01:00
Mikkel Denker	2dadbf70d6	Schema fields as traits (#185 ) * refactor data that is re-used across fields for a particular page during indexing into an 'FnCache' * automatically generate ALL_FIELDS and ALL_SIGNALS arrays with strum macro. ensures the arrays are always fully up to date * split up schema fields into submodules * add textfield trait with enum-dispatch * add fastfield trait with enum-dispatch * move field names into trait * move some trivial functions from 'FastFieldEnum' and 'TextFieldEnum' into their respective traits * move methods from Field into TextField and FastField traits * extract html .as_tantivy into textfield trait * extract html .as_tantivy into fastfield trait * extract webpage .as_tantivy into field traits * fix indexer example cleanup	2024-03-20 21:36:44 +01:00
Mikkel Denker	e47b49a012	Nom query parser (#184 ) * model that inbody:... intitle:... etc can have either simple term or phrase query as subterm * re-write query parser using nom * all whitespace queries should return empty terms vec	2024-03-19 09:47:48 +01:00
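
Commit 308388262f stores webgraph node ids as big endian so that sorting is correct during merge. The reason is a general property of fixed-width integer encodings: bytewise (lexicographic) order of big-endian keys matches numeric order, while little-endian keys do not sort numerically. A minimal sketch of that property, assuming 64-bit node ids; the function names are hypothetical and are not taken from the stract codebase:

```rust
/// Encode a node id as a fixed-width big-endian key (hypothetical helper).
fn node_key_be(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}

/// Same id encoded little-endian, shown only for contrast.
fn node_key_le(id: u64) -> [u8; 8] {
    id.to_le_bytes()
}

/// Returns true when sorting the encoded keys bytewise produces the same
/// order as sorting the ids numerically.
fn bytewise_order_matches_numeric(ids: &[u64], encode: fn(u64) -> [u8; 8]) -> bool {
    // Reference order: sort numerically, then encode.
    let mut numeric = ids.to_vec();
    numeric.sort_unstable();
    let expected: Vec<[u8; 8]> = numeric.iter().map(|&id| encode(id)).collect();

    // Order a byte-oriented store would see: encode, then sort the raw bytes.
    // Arrays of u8 compare lexicographically.
    let mut keys: Vec<[u8; 8]> = ids.iter().map(|&id| encode(id)).collect();
    keys.sort_unstable();

    keys == expected
}
```

With ids such as 1, 2 and 256, the big-endian check holds, while the little-endian check fails because 256 encodes with a leading zero byte and sorts before 1.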


155 commits
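
Commit 3c94cb7f81 approximates the number of hits by assuming that each query term is independent. The log does not give the formula, so the following is only one plausible reading: for a conjunctive query, multiply the per-term document fractions and scale by the collection size. The function name, its parameters, and the use of document frequencies are all assumptions made for illustration, not stract's actual implementation:

```rust
/// Estimate how many documents match all terms of a query, assuming the
/// terms occur independently of each other (hypothetical sketch).
///
/// `total_docs` is the number of documents in the index and
/// `term_doc_freqs` holds, for each term, the number of documents containing it.
fn estimate_hits(total_docs: u64, term_doc_freqs: &[u64]) -> u64 {
    if total_docs == 0 || term_doc_freqs.is_empty() {
        return 0;
    }

    let n = total_docs as f64;

    // P(all terms present) = product of P(term present) under independence.
    let joint_probability: f64 = term_doc_freqs
        .iter()
        .map(|&df| df.min(total_docs) as f64 / n)
        .product();

    (n * joint_probability).round() as u64
}
```

Because the estimate needs only per-term statistics, a searcher can stop after collecting the top-scored results instead of iterating every match just to count it, which is the short-circuit the commit message describes.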