0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	1d821ef4db	rustup update and fix clippy warnings	2024-11-27 17:15:24 +01:00
Mikkel Denker	089f609e70	add svelte.dev to devdocs	2024-10-02 09:35:35 +02:00
Mikkel Denker	9c2d27fe05	ltr experiment	2024-08-15 11:46:05 +02:00
Mikkel Denker	2e95e42f7b	bump optics	2024-08-12 11:21:18 +02:00
Mikkel Denker	119403b7e1	remove once_cell dep as it is now part of std	2024-07-26 10:08:43 +02:00
Mikkel Denker	ea1c517da4	remove ranking signal from optics and add to api instead this will simplify optic merging and make it easier to allow more than 1 optic to be applied to a search	2024-05-14 11:46:40 +02:00
Mikkel Denker	d24dce8831	distinguish between itemtypes and regular keys in schema flattened json to ensure schema matchings in optics always start their match against an itemtype	2024-05-07 10:37:52 +02:00
Mikkel Denker	ec8b9a0786	Use remote webgraph instead of local (#196) * [WIP] remote webgraph client * [WIP] use remote webgraph for backlinks during indexing. still need to properly batch the requests * support batch requests in sonic * [WIP] use remote webgraph in explore and make sure ranking pipeline always sets updated score * use remote webgraph for inbound similarity * return correct type from explore api	2024-05-01 09:04:04 +02:00
Mikkel Denker	3ab4f944e0	MapReduce -> AMPC (#189) * [WIP] structure for mapreduce -> ampc and introduce tables in dht * temporarily disable failing lints in ampc/mod.rs * establish dht connection in ampc * support batch get/set in dht * ampc implementation (not tested yet) * dht upsert * no more todo's in ampc harmonic centrality impl * return 'UpsertAction' instead of bool from upserts this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive * add ability to have multiple dht tables for each ampc algorithm gives better type-safety as each table can then have their own key-value type pair * some bundled bug/correctness fixes. * await currently scheduled jobs after there are no more jobs to schedule. * execute each mapper fully at a time before scheduling next mapper. * compute centrality scores from set cardinalities. * refactor into smaller functions * happy path ampc dht test and split ampc into multiple files * correct harmonic centrality calculation in ampc * run distributed harmonic centrality worker and coordinator from cli * stream key/values from dht using range queries in batches * benchmark distributed centrality calculation * faster hash in shard selection and drop table in background thread * Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost * dht store copy-on-write for keys and values to make table clone faster * fix flaky dht test and improve .set performance using entries * dynamic batch size based on number of shards in dht cluster	2024-04-15 10:29:33 +02:00
Mikkel Denker	c03b249047	rustup update and fix clippy	2024-03-21 09:59:04 +01:00
Mikkel Denker	37b6c7d86c	Distributed in-memory key/value store for mapreduce (#181) * [WIP] raft consensus using openraft on sonic networking * handle rpc's on nodes * handle get/set application requests * dht get/set stubs that handle leader changes and retries also improve sonic error handling. there is no need for handle to return a sonic::Result, it's better that the specific message has a Result<...> as their response as this can then be properly handled on the caller side * join existing raft cluster * make sure node state is consistent in case of crash -> rejoin * ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to normal connection with sane defaults in .send method * add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries * change key/value in dht to be arbitrary bytes * dht chaos proptest * make dht tests more reliable in raft, writes are written to a majority quorum. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes. * sharded dht client	2024-03-17 16:04:07 +01:00
Mikkel Denker	32bbbc63ab	Semantic embeddings (#179) * change indexer to prepare webpages in batches * some clippy lints * split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints * use dual encoder to embed title and keywords of page during indexing * make sure we don't open harmonic centrality rocksdb in core/src during test... * add indexer example used for benchmark * add option to only compute embeddings of top ranking sites. this is not really ideal, but it turns out to be way too slow to compute the embeddings for all the sites in the index. this way, we at least get embeddings for the sites that are most likely to appear in the search results while it is still tractable to compute. * store embeddings in index as bytes * refactor ranking pipeline to statically ensure we score the different stages as expected * use similarity between title and query embeddings during ranking * use keyword embeddings during ranking * handle missing fastfields in index gracefully * remove unneeded Arc clone when constructing 'RecallRankingWebpage'	2024-03-11 14:28:17 +01:00
Mikkel Denker	7630ae4de6	chore: update optics test samples	2024-03-05 11:03:58 +01:00
Mikkel Denker	7583aa426b	Parse site block rules into 'HostRankings' instead. (#170) * parse site block rules into 'HostRankings' instead. they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics. * each rule can have more than 1 site to discard * fix non-deterministic 'num_slashes_and_digits' test	2024-02-29 10:28:00 +01:00
Mikkel Denker	99dc06ec11	chore: fix clippy	2024-02-28 19:58:38 +01:00
Oliver Bøving	0d8507a2a5	Run `cargo clippy --fix` and `cargo fmt --fix` (#157) * Run `cargo clippy --fix` plus some minor manual refactors * Run `cargo fmt` plus minor refactor	2024-02-17 16:43:09 +01:00
SekoiaTree	ffbec7ac80	Replace ToString with Display (#134)	2024-02-08 19:15:27 +01:00
SekoiaTree	2a62f9c28a	Make places that use rules use rule OR-ing (#133)	2024-02-08 18:51:12 +01:00
Mikkel Denker	a4ce85a905	bump optics samples submodule	2024-02-07 13:28:57 +01:00
SekoiaTree	359117bfbe	Initial rule or-ing (#127)	2024-02-06 19:34:58 +01:00
Mikkel Denker	f61f1f6b0f	fix bug where query suggestions couldn't be selected in safari	2024-02-05 14:57:26 +01:00
Mikkel Denker	1a9f381d15	GGML Rust bindings (#122) * move crates into a 'crates' folder * added cargo-about to check dependency licenses * create ggml-sys bindings and build as a static library. simple addition sanity test passes * update licenses * yeet alice * yeet qa model * yeet fact model * [wip] idiomatic rust bindings for ggml * [ggml] mul, add and sub ops implemented for tensors. i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?	2024-01-27 12:27:27 +01:00

22 commits