* change indexer to prepare webpages in batches
* some clippy lints
* split 'IndexingWorker::prepare_webpages' into more readable sub-functions and fix more clippy pedantic lints
* use dual encoder to embed title and keywords of page during indexing
* make sure we don't open harmonic centrality rocksdb in core/src during test...
* add indexer example used for benchmark
* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get embeddings
for the sites that are most likely to appear in the search results while keeping
the computation tractable.
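A minimal sketch of the idea (all names here are hypothetical stand-ins, not the actual indexer API): the encoder pass is skipped entirely for pages whose host ranks below a configurable cutoff.

```rust
// Hypothetical types standing in for the indexer's real ones.
struct Webpage {
    title: String,
    host_centrality: f64, // whichever signal decides if a site is "top ranking"
}

trait DualEncoder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

fn maybe_embed(page: &Webpage, cutoff: f64, encoder: &dyn DualEncoder) -> Option<Vec<f32>> {
    // Skip the expensive encoder pass for low-ranking sites so the
    // indexing run stays tractable.
    (page.host_centrality >= cutoff).then(|| encoder.embed(&page.title))
}
```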
* store embeddings in index as bytes
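One way to round-trip an f32 embedding through raw bytes; little-endian f32s are an assumption here, the index's actual layout isn't shown in this changelog:

```rust
fn to_bytes(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn from_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| f32::from_le_bytes(chunk.try_into().unwrap()))
        .collect()
}
```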
* refactor ranking pipeline to statically ensure we score the different stages as expected
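A sketch of the typestate idea with hypothetical types: each stage consumes the previous stage's output type, so skipping or reordering a stage becomes a compile error rather than a silent mis-scoring.

```rust
struct Recalled(Vec<u64>);      // doc ids surviving the recall stage
struct Scored(Vec<(u64, f64)>); // ids with precision-stage scores

fn recall_stage(candidates: Vec<u64>) -> Recalled {
    Recalled(candidates)
}

fn precision_stage(recalled: Recalled) -> Scored {
    // Can only be called with the recall stage's output.
    Scored(recalled.0.into_iter().map(|id| (id, 0.0)).collect())
}
```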
* use similarity between title and query embeddings during ranking
* use keyword embeddings during ranking
* handle missing fastfields in index gracefully
* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
* Disable checksum verification in rocksdb.
~80% of the time seems to be spent in the xxh3 hash. This seems to be primarily used to verify checksums when the block cache gets a cache miss and needs to read from disk. We don't gracefully handle corruptions either way, so let's just disable the verification and see how it impacts performance.
* also disable verification in 'RocksDbStore'
* allow rocksdb to prefetch blocks during iteration
* increase block cache for 'Id2NodeDb' and 'RocksDbStore'
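The relevant knobs in the rust rocksdb crate look roughly like this (exact signatures vary between crate versions, so treat this as a sketch rather than the actual code):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options, ReadOptions};

fn tuned_options() -> (Options, ReadOptions) {
    let mut block_opts = BlockBasedOptions::default();
    // Bigger block cache for read-heavy stores like 'Id2NodeDb'.
    block_opts.set_block_cache(&Cache::new_lru_cache(1024 * 1024 * 1024));

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);

    let mut read_opts = ReadOptions::default();
    // We don't recover from corruption anyway, so skip the xxh3 work.
    read_opts.set_verify_checksums(false);
    // Let rocksdb prefetch blocks during iteration.
    read_opts.set_readahead_size(4 * 1024 * 1024);

    (opts, read_opts)
}
```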
some websites, especially older ones, sometimes use an encoding scheme other than utf8 or latin1. before, we simply tried different encoding schemes until one decoded the bytes without errors, but this approach can fail unexpectedly since some byte sequences decode without errors under the wrong encoding.
we now use the encoding detection crate 'chardetng', which is also [used in firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).
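The detection itself is a small amount of code; this sketch uses chardetng's public API (how it is wired into the fetcher here is an assumption):

```rust
use chardetng::EncodingDetector;

fn decode_html(bytes: &[u8]) -> String {
    let mut detector = EncodingDetector::new();
    detector.feed(bytes, true); // `true`: this is the last chunk
    let encoding = detector.guess(None, true); // no TLD hint, allow utf8
    let (text, _, _) = encoding.decode(bytes);
    text.into_owned()
}
```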
a normal snippet should always be returned from the api. an application can then choose to show the rich snippet instead if one is present. this gives more flexibility when building applications on top of stract's api
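In other words, the response shape this implies looks something like the following (field names are hypothetical):

```rust
struct Snippet {
    text: String,              // always present
    rich: Option<RichSnippet>, // shown instead of `text` if the app wants to
}

struct RichSnippet {
    html: String, // e.g. structured recipe or Q&A markup
}
```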
the raw scores from the cross encoder model can be very similar even though site A is way better than site B, so they are difficult to scale correctly with a coefficient. the models are essentially only trained to optimize the ranks.
also fix a non-deterministic test
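A sketch of rank-based scoring (the exact transform used in ranking is not shown in this changelog): sort by raw cross-encoder score, then derive the signal from the position rather than the score itself.

```rust
fn rank_signal(mut scored: Vec<(u64, f64)>) -> Vec<(u64, f64)> {
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // best raw score first
    scored
        .into_iter()
        .enumerate()
        // Reciprocal rank: monotone in rank, insensitive to how close
        // the raw scores happen to be.
        .map(|(rank, (id, _))| (id, 1.0 / (1.0 + rank as f64)))
        .collect()
}
```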
* parse site block rules into 'HostRankings' instead.
they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics.
* each rule can have more than 1 site to discard
* fix non-deterministic 'num_slashes_and_digits' test
* Initial implementation of importing sites from an optic
* Removed unused import
* Updated button text
* Implemented client-side WASM to allow for parsing of imported .optic files
* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs
* CI updates
* Added vite-plugin-wasm-pack to ensure wasm modules get copied over
* CI fix >:(
* More CI attempts
* agony - CSP fix & further wasm-pack fixes
* CSP updates
* Package update to prevent an unnecessary build of wasm
* reduce bloat in ci build log from wasm
* fix another non-deterministically failing test
* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system
* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109
* run 'npm run format'
* propagate errors from wasm crate
* update nodejs to v20.10 in github action
* loosen restriction on npm version
the frontend should work as long as we have the correct nodejs version. don't think pinning the npm version is necessary
* change term scaling from idf-sum to correctly weight each term based on the number of documents that match that particular term
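The standard BM25-style idf captures this weighting (whether this exact formula is what's used here is an assumption): terms matching few documents get a larger weight than terms matching many.

```rust
fn idf(num_docs: f64, docs_matching_term: f64) -> f64 {
    ((num_docs - docs_matching_term + 0.5) / (docs_matching_term + 0.5) + 1.0).ln()
}
```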
* remove scoring from patternquery.
the results are always scored in 'Signal::compute' anyway so no need for the added complexity
* small bm25 test that makes sure terms are scaled
* use simpler idf sum instead of full bm25 for simple fields (site, domain, url, backlink text etc.)
fixes the performance regression of `Signal::compute`
* Add basic CI
* Add liburing installation step to CI workflow
* Run `npm install` as part of ci/check
* Add `@types/node` package
* Add `submodules: 'recursive'` to CI
* Skip test if test data is not available
* Install `cargo-about` in CI
Previously we were on a fixed rev due to unreleased pull-requests, but since these are now released we can pin to a crates.io version!
v0.11 introduced a breaking change in how `scylla::frame::value`'s are represented, most importantly around timestamps. Previously the timestamp was constructed from a duration, but now it takes an actual timestamp. In the old version, `Duration::seconds(0)` was used as the default, but now we have to provide a timestamp. I'm a little bit uncertain what is equivalent, but I believe `chrono::Utc::now` is what we intend to store.
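A sketch of the migration (my reading of the 0.11 API, where `CqlTimestamp` wraps milliseconds since the epoch):

```rust
use chrono::Utc;
use scylla::frame::value::CqlTimestamp;

fn default_timestamp() -> CqlTimestamp {
    // Old code stored Duration::seconds(0); with 0.11 we store an
    // actual point in time instead.
    CqlTimestamp(Utc::now().timestamp_millis())
}
```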
When adding an optic, we first make sure that we can actually fetch the optic. This check was performed client-side before, which would cause CORS errors if the request violated the CORS policy of the server hosting the optic. This commit introduces a simple endpoint on our frontend server, so the request to the optic server no longer comes directly from the client.
* Update the linked project to new location of `optics-lsp`
* Update optics-lsp license file path
It was previously a symlink to `../LICENSE.md`, but since moving the crate into `crates/` this should now be `../../LICENSE.md`.
* Bump lsp-types and serde-wasm-bindgen
* Sort crates in optics-lsp and add itertools
* Move optic-lsp token docs out to a separate file
rustfmt does not like long string literals, so these doc strings prevented most of the file from being formatted. Moving it out still messes with rustfmt, but now it is at least isolated :)
* Add token to hover docs
This blatantly steals a feature from rust-analyzer, where the hovered token is displayed as the first thing in the hover docs. It helps signal that the documentation corresponds to the token being hovered, and also just looks nice. :)
* Refactor error formatting in optics-lsp
Instead of pushing string literals, we build up an iterator of lines and join them at the end.
Other minor refactorings are performed as well.
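The shape of the refactor, with a hypothetical error type:

```rust
use itertools::Itertools;

fn format_error(message: &str, notes: &[String]) -> String {
    // Build an iterator of lines and join once at the end, instead of
    // pushing string literals onto a buffer piece by piece.
    std::iter::once(message.to_string())
        .chain(notes.iter().cloned())
        .join("\n")
}
```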
* change optics extension license to MIT
* release optics extension 0.0.12
* add steps for chrome & firefox
* add steps for mainstream browsers
* add images to steps
* fix typo in filename
* fix typo
* remove unneeded word for brevity