Commit graph

10 commits

Author SHA1 Message Date
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
f3315c4b42 leechy ranking annotation experiment 2024-08-12 11:17:55 +02:00
Mikkel Denker
08dc07c575 update signal coefficients 2024-08-05 14:54:20 +02:00
Mikkel Denker
7e9da2e37c chore: upgrade dependencies for kuchiki 2024-05-03 12:23:10 +02:00
Mikkel Denker
c03b249047 rustup update and fix clippy 2024-03-21 09:59:04 +01:00
Mikkel Denker
37b6c7d86c
Distributed in-memory key/value store for mapreduce (#181)
* [WIP] raft consensus using openraft on sonic networking

* handle rpc's on nodes

* handle get/set application requests

* dht get/set stubs that handle leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result. it's better that each specific message has a Result<...> as its response, as this can then be properly handled on the caller side

* join existing raft cluster

* make sure node state is consistent in case of crash -> rejoin

* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to the normal connection, with sane defaults in the .send method

* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries

* change key/value in dht to be arbitrary bytes

* dht chaos proptest

* make dht tests more reliable
in raft, writes are written to a majority quorum. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes.

* sharded dht client
2024-03-17 16:04:07 +01:00
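The quorum arithmetic behind the "make dht tests more reliable" note above can be sketched as follows. This is only an illustration of majority-quorum sizes; the function names `majority_quorum` and `possibly_stale` are hypothetical and do not come from the commit or from openraft.

```rust
/// Smallest number of nodes that forms a strict majority in a cluster
/// of `nodes` members. (Hypothetical helper, for illustration only.)
fn majority_quorum(nodes: usize) -> usize {
    nodes / 2 + 1
}

/// Upper bound on how many nodes may not yet have received a write
/// that has been acknowledged by a majority. (Hypothetical helper.)
fn possibly_stale(nodes: usize) -> usize {
    nodes.saturating_sub(majority_quorum(nodes))
}

fn main() {
    for nodes in [2usize, 3, 5] {
        println!(
            "{} nodes: quorum = {}, possibly stale = {}",
            nodes,
            majority_quorum(nodes),
            possibly_stale(nodes)
        );
    }
}
```

With 3 nodes the quorum is 2, so one node may lag behind a committed write. With 2 nodes the quorum is also 2, so every committed write has reached both nodes, which is why the smaller cluster makes the test deterministic.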
Mikkel Denker
32bbbc63ab
Semantic embeddings (#179)
* change indexer to prepare webpages in batches

* some clippy lints

* split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints

* use dual encoder to embed title and keywords of page during indexing

* make sure we don't open harmonic centrality rocksdb in core/src during test...

* add indexer example used for benchmark

* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get embeddings
for the sites that are most likely to appear in the search results while it is
still tractable to compute.

* store embeddings in index as bytes

* refactor ranking pipeline to statically ensure we score the different stages as expected

* use similarity between title and query embeddings during ranking

* use keyword embeddings during ranking

* handle missing fastfields in index gracefully

* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
2024-03-11 14:28:17 +01:00
Oliver Bøving
0d8507a2a5
Run cargo clippy --fix and cargo fmt --fix (#157)
* Run `cargo clippy --fix` plus some minor manual refactors

* Run `cargo fmt` plus minor refactor
2024-02-17 16:43:09 +01:00
SekoiaTree
ffbec7ac80
Replace ToString with Display (#134) 2024-02-08 19:15:27 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try to implement a bert model in order to figure out which ops we should include in the binding. for instance, are view and concat needed?
2024-01-27 12:27:27 +01:00