Commit graph

22 commits

Author SHA1 Message Date
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
089f609e70 add svelte.dev to devdocs 2024-10-02 09:35:35 +02:00
Mikkel Denker
9c2d27fe05 ltr experiment 2024-08-15 11:46:05 +02:00
Mikkel Denker
2e95e42f7b bump optics 2024-08-12 11:21:18 +02:00
Mikkel Denker
119403b7e1 remove once_cell dep as it is now part of std 2024-07-26 10:08:43 +02:00
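For context, a minimal sketch of what dropping once_cell can look like, assuming the lazily initialized values were moved to `std::sync::LazyLock` (stable since Rust 1.80). The static name and its contents are hypothetical and not taken from the repository.

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

// Hypothetical static for illustration; before the change this might have
// been a `once_cell::sync::Lazy`. The initializer runs once, on first access.
static STOPWORDS: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| ["a", "an", "the"].into_iter().collect());

/// Returns true if `word` is in the lazily built stopword set.
pub fn is_stopword(word: &str) -> bool {
    STOPWORDS.contains(word)
}
```

Calling `is_stopword("the")` triggers the one-time initialization; later calls reuse the same set.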
Mikkel Denker
ea1c517da4 remove ranking signal from optics and add to api instead
this will simplify optic merging and make it easier to allow more than 1 optic to be applied to a search
2024-05-14 11:46:40 +02:00
Mikkel Denker
d24dce8831 distinguish between itemtypes and regular keys in schema flattened json to ensure schema matchings in optics always start their match against an itemtype 2024-05-07 10:37:52 +02:00
Mikkel Denker
ec8b9a0786
Use remote webgraph instead of local (#196)
* [WIP] remote webgraph client

* [WIP] use remote webgraph for backlinks during indexing. still need to properly batch the requests

* support batch requests in sonic

* [WIP] use remote webgraph in explore and make sure ranking pipeline always sets updated score

* use remote webgraph for inbound similarity

* return correct type from explore api
2024-05-01 09:04:04 +02:00
Mikkel Denker
3ab4f944e0
MapReduce -> AMPC (#189)
* [WIP] structure for mapreduce -> ampc and introduce tables in dht

* temporarily disable failing lints in ampc/mod.rs

* establish dht connection in ampc

* support batch get/set in dht

* ampc implementation (not tested yet)

* dht upsert

* no more todo's in ampc harmonic centrality impl

* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive

* add ability to have multiple dht tables for each ampc algorithm
gives better type-safety as each table can then have its own key-value type pair

* some bundled bug/correctness fixes.
  * await currently scheduled jobs after there are no more jobs to schedule.
  * execute each mapper fully at a time before scheduling next mapper.
  * compute centrality scores from set cardinalities.

* refactor into smaller functions

* happy path ampc dht test and split ampc into multiple files

* correct harmonic centrality calculation in ampc

* run distributed harmonic centrality worker and coordinator from cli

* stream key/values from dht using range queries in batches

* benchmark distributed centrality calculation

* faster hash in shard selection and drop table in background thread

* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost

* dht store copy-on-write for keys and values to make table clone faster

* fix flaky dht test and improve .set performance using entries

* dynamic batch size based on number of shards in dht cluster
2024-04-15 10:29:33 +02:00
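The "return 'UpsertAction' instead of bool" item above can be sketched roughly as follows. This is an illustration only: the variant names, the `upsert_max` function and the keep-the-larger-value merge rule are assumptions, not the repository's actual API. It uses the standard map entry API so that each key is looked up once.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Hypothetical result type: says what an upsert did, unlike a bare `bool`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpsertAction {
    Inserted,
    Merged,
    NoChange,
}

/// Hypothetical upsert that keeps the larger of the old and new value.
pub fn upsert_max(table: &mut BTreeMap<String, u64>, key: &str, value: u64) -> UpsertAction {
    match table.entry(key.to_string()) {
        // Key was absent: store the value as-is.
        Entry::Vacant(slot) => {
            slot.insert(value);
            UpsertAction::Inserted
        }
        // Key was present: overwrite only if the new value is larger.
        Entry::Occupied(mut slot) => {
            if value > *slot.get() {
                slot.insert(value);
                UpsertAction::Merged
            } else {
                UpsertAction::NoChange
            }
        }
    }
}
```

A caller can then match on the three outcomes instead of guessing what `true` or `false` meant.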
Mikkel Denker
c03b249047 rustup update and fix clippy 2024-03-21 09:59:04 +01:00
Mikkel Denker
37b6c7d86c
Distributed in-memory key/value store for mapreduce (#181)
* [WIP] raft consensus using openraft on sonic networking

* handle rpc's on nodes

* handle get/set application requests

* dht get/set stubs that handle leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result; it's better that the specific message has a Result<...> as its response, as this can then be properly handled on the caller side

* join existing raft cluster

* make sure node state is consistent in case of crash -> rejoin

* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to normal connection with sane defaults in .send method

* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries

* change key/value in dht to be arbitrary bytes

* dht chaos proptest

* make dht tests more reliable
in raft, writes are written to a majority quorum. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes.

* sharded dht client
2024-03-17 16:04:07 +01:00
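The quorum reasoning in the "make dht tests more reliable" item above comes down to simple arithmetic. A small sketch, assuming a strict-majority quorum of floor(n/2) + 1 nodes; the function names are made up for illustration:

```rust
/// Smallest number of nodes that forms a strict majority of `cluster_size`.
pub fn majority(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// Upper bound on how many nodes may not yet hold a write right after a
/// majority has acknowledged it.
pub fn possibly_lagging(cluster_size: usize) -> usize {
    cluster_size.saturating_sub(majority(cluster_size))
}
```

With 3 nodes the majority is 2, so 1 node may lag behind; with 2 nodes the majority is also 2, so no node can lag, which is why the smaller cluster makes the test deterministic.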
Mikkel Denker
32bbbc63ab
Semantic embeddings (#179)
* change indexer to prepare webpages in batches

* some clippy lints

* split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints

* use dual encoder to embed title and keywords of page during indexing

* make sure we don't open harmonic centrality rocksdb in core/src during test...

* add indexer example used for benchmark

* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get embeddings
for the sites that are most likely to appear in the search results while it is
still tractable to compute.

* store embeddings in index as bytes

* refactor ranking pipeline to statically ensure we score the different stages as expected

* use similarity between title and query embeddings during ranking

* use keyword embeddings during ranking

* handle missing fastfields in index gracefully

* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
2024-03-11 14:28:17 +01:00
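A rough sketch of the two embedding-related ideas in the commit above: storing an embedding as bytes and comparing two embeddings. The commit does not say which byte layout or similarity measure is used, so little-endian `f32` and cosine similarity are assumptions here, and all function names are hypothetical.

```rust
/// Encode an embedding as little-endian bytes (4 bytes per f32).
pub fn to_bytes(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode bytes produced by `to_bytes`; trailing bytes that do not fill a
/// whole f32 are ignored.
pub fn from_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

In a ranking stage, a score like `cosine_similarity(&query_embedding, &from_bytes(&stored_title_bytes))` could then serve as one signal among others.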
Mikkel Denker
7630ae4de6 chore: update optics test samples 2024-03-05 11:03:58 +01:00
Mikkel Denker
7583aa426b
Parse site block rules into 'HostRankings' instead. (#170)
* parse site block rules into 'HostRankings' instead.
they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics.

* each rule can have more than 1 site to discard

* fix non-deterministic 'num_slashes_and_digits' test
2024-02-29 10:28:00 +01:00
Mikkel Denker
99dc06ec11 chore: fix clippy 2024-02-28 19:58:38 +01:00
Oliver Bøving
0d8507a2a5
Run cargo clippy --fix and cargo fmt --fix (#157)
* Run `cargo clippy --fix` plus some minor manual refactors

* Run `cargo fmt` plus minor refactor
2024-02-17 16:43:09 +01:00
SekoiaTree
ffbec7ac80
Replace ToString with Display (#134) 2024-02-08 19:15:27 +01:00
SekoiaTree
2a62f9c28a
Make places that use rules use rule OR-ing (#133) 2024-02-08 18:51:12 +01:00
Mikkel Denker
a4ce85a905 bump optics samples submodule 2024-02-07 13:28:57 +01:00
SekoiaTree
359117bfbe
Initial rule or-ing (#127) 2024-02-06 19:34:58 +01:00
Mikkel Denker
f61f1f6b0f fix bug where query suggestions couldn't be selected in safari 2024-02-05 14:57:26 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00