* [WIP] remote webgraph client
* [WIP] use remote webgraph for backlinks during indexing. still need to properly batch the requests
* support batch requests in sonic
* [WIP] use remote webgraph in explore and make sure ranking pipeline always sets updated score
* use remote webgraph for inbound similarity
* return correct type from explore api
* [WIP] structure for mapreduce -> ampc and introduce tables in dht
* temporarily disable failing lints in ampc/mod.rs
* establish dht connection in ampc
* support batch get/set in dht
* ampc implementation (not tested yet)
* dht upsert
* no more todos in ampc harmonic centrality impl
* return 'UpsertAction' instead of bool from upserts
this makes it easier to see, from the caller's perspective, what action was taken. a bool is not particularly descriptive
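a minimal sketch of the idea (the enum variants, merge policy, and `upsert` helper here are hypothetical, not the actual ones in the codebase):

```rust
use std::collections::HashMap;

// an enum is more descriptive than a bool when reporting
// what an upsert actually did
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpsertAction {
    Inserted, // key was not present before
    Merged,   // key existed and the values were merged
}

fn upsert(map: &mut HashMap<String, u64>, key: &str, val: u64) -> UpsertAction {
    match map.get_mut(key) {
        Some(existing) => {
            *existing += val; // example merge policy: sum the values
            UpsertAction::Merged
        }
        None => {
            map.insert(key.to_string(), val);
            UpsertAction::Inserted
        }
    }
}
```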
* add ability to have multiple dht tables for each ampc algorithm
gives better type safety, as each table can then have its own key-value type pair
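one way to picture this: the dht stores raw bytes, but each algorithm gets a typed handle so key/value types can't be mixed up across tables. this is a hypothetical sketch (the `AsBytes` trait and `Table` type are illustrative, not the real API):

```rust
use std::collections::BTreeMap;
use std::marker::PhantomData;

// illustrative serialization trait; a real implementation would use
// a proper serialization framework
trait AsBytes: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Self;
}

impl AsBytes for u64 {
    fn to_bytes(&self) -> Vec<u8> {
        self.to_be_bytes().to_vec()
    }
    fn from_bytes(bytes: &[u8]) -> Self {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(bytes);
        u64::from_be_bytes(buf)
    }
}

// typed view over an untyped byte table: the compiler now rejects
// writes with the wrong key or value type for this table
struct Table<K: AsBytes, V: AsBytes> {
    inner: BTreeMap<Vec<u8>, Vec<u8>>,
    _types: PhantomData<(K, V)>,
}

impl<K: AsBytes, V: AsBytes> Table<K, V> {
    fn new() -> Self {
        Self { inner: BTreeMap::new(), _types: PhantomData }
    }
    fn set(&mut self, key: K, value: V) {
        self.inner.insert(key.to_bytes(), value.to_bytes());
    }
    fn get(&self, key: &K) -> Option<V> {
        self.inner.get(&key.to_bytes()).map(|b| V::from_bytes(b))
    }
}
```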
* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully before scheduling the next mapper.
* compute centrality scores from set cardinalities.
* refactor into smaller functions
* happy path ampc dht test and split ampc into multiple files
* correct harmonic centrality calculation in ampc
* run distributed harmonic centrality worker and coordinator from cli
* stream key/values from dht using range queries in batches
* benchmark distributed centrality calculation
* faster hash in shard selection and drop table in background thread
* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost
* dht store copy-on-write for keys and values to make table clone faster
* fix flaky dht test and improve .set performance using entries
* dynamic batch size based on number of shards in dht cluster
* [WIP] raft consensus using openraft on sonic networking
* handle rpcs on nodes
* handle get/set application requests
* dht get/set stubs that handles leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result; it's better for each specific message to have a Result<...> as its response, as this can then be properly handled on the caller side
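the distinction can be sketched like this (the message and error types are hypothetical): transport-level failures stay in the rpc layer, while application-level failures travel inside the response payload, so the caller can match on them as ordinary data.

```rust
use std::collections::HashMap;

// hypothetical application-level error carried *inside* the response,
// rather than being conflated with a transport (sonic) error
#[derive(Debug, PartialEq)]
enum GetError {
    NotFound,
}

// the handler returns the message-specific Result; the rpc layer only
// wraps it in an envelope and never needs to invent its own error
fn handle_get(store: &HashMap<String, u64>, key: &str) -> Result<u64, GetError> {
    store.get(key).copied().ok_or(GetError::NotFound)
}
```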
* join existing raft cluster
* make sure node state is consistent in case of crash -> rejoin
* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to the normal connection, with sane defaults in the .send method
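the retry shape described above can be sketched roughly as follows (the function name and signature are illustrative, not the actual sonic API): the whole *request* is retried, not just the connection attempt.

```rust
// hypothetical sketch of retry-with-timeout: retry the request itself
// up to `attempts` times; a real implementation would sleep for
// `_backoff_ms` between tries and bound each try with a timeout
fn send_with_retry<T, E>(
    attempts: usize,
    _backoff_ms: u64,
    mut send: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for _ in 0..attempts {
        match send() {
            Ok(resp) => return Ok(resp),
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}
```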
* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries
* change key/value in dht to be arbitrary bytes
* dht chaos proptest
* make dht tests more reliable
in raft, writes are committed to a majority quorum. if we have a cluster of 3 nodes, this means we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that hasn't received the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes.
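the quorum arithmetic behind this: a majority quorum for n nodes is floor(n / 2) + 1, so a 3-node cluster only guarantees 2 replicas per write, while a 2-node cluster has quorum 2, meaning every node sees every write.

```rust
// majority quorum size for a raft cluster of n nodes
fn quorum(n: usize) -> usize {
    n / 2 + 1
}
```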
* sharded dht client
* change indexer to prepare webpages in batches
* some clippy lints
* split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints
* use dual encoder to embed title and keywords of page during indexing
* make sure we don't open harmonic centrality rocksdb in core/src during test...
* add indexer example used for benchmark
* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get
embeddings for the sites that are most likely to appear in the search
results while keeping the computation tractable.
* store embeddings in index as bytes
* refactor ranking pipeline to statically ensure we score the different stages as expected
* use similarity between title and query embeddings during ranking
* use keyword embeddings during ranking
* handle missing fastfields in index gracefully
* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
* parse site block rules into 'HostRankings' instead.
they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics.
* each rule can have more than 1 site to discard
* fix non-deterministic 'num_slashes_and_digits' test
* move crates into a 'crates' folder
* added cargo-about to check dependency licenses
* create ggml-sys bindings and build as a static library.
simple addition sanity test passes
* update licenses
* yeet alice
* yeet qa model
* yeet fact model
* [wip] idiomatic rust bindings for ggml
* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try to implement a bert model in order to figure out which ops we should include in the binding. for instance, are view and concat needed?
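the kind of idiomatic surface a binding could expose can be sketched with a toy 1-d tensor and rust's operator traits (this is purely illustrative; it does element-wise math directly rather than building a ggml compute graph):

```rust
use std::ops::{Add, Mul, Sub};

// toy 1-d tensor with element-wise ops via operator overloading,
// mirroring the mul/add/sub surface a ggml binding could expose
#[derive(Debug, Clone, PartialEq)]
struct Tensor(Vec<f32>);

impl Add for Tensor {
    type Output = Tensor;
    fn add(self, rhs: Tensor) -> Tensor {
        Tensor(self.0.iter().zip(rhs.0.iter()).map(|(a, b)| a + b).collect())
    }
}

impl Sub for Tensor {
    type Output = Tensor;
    fn sub(self, rhs: Tensor) -> Tensor {
        Tensor(self.0.iter().zip(rhs.0.iter()).map(|(a, b)| a - b).collect())
    }
}

impl Mul for Tensor {
    type Output = Tensor;
    fn mul(self, rhs: Tensor) -> Tensor {
        Tensor(self.0.iter().zip(rhs.0.iter()).map(|(a, b)| a * b).collect())
    }
}
```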