Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
13b8e7bdec
Disable webgraph checksum (#173)
* Disable checksum verification in rocksdb.
~80% of the time seems to be spent in the xxh3 hash. This seems to be primarily used to verify checksums when the block cache gets a cache miss and needs to read from disk. We don't gracefully handle corruption either way, so let's just disable the verification and see how it impacts performance.

* also disable verification in 'RocksDbStore'

* allow rocksdb to prefetch blocks during iteration

* increase block cache for 'Id2NodeDb' and 'RocksDbStore'
2024-03-05 13:42:03 +01:00
Mikkel Denker
7630ae4de6 chore: update optics test samples 2024-03-05 11:03:58 +01:00
Mikkel Denker
f8c58b3c03
use more sophisticated encoding detection when utf8 decoding fails. (#172)
some websites, especially older ones, use an encoding scheme other than utf8 or latin1. before, we simply tried different encoding schemes until one successfully decoded the bytes, but this approach can fail unexpectedly because bytes written in one encoding can be decoded as another encoding without any error being reported.
we now use the encoding detection crate 'chardetng' which is also [used in firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).
2024-03-05 10:55:05 +01:00
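The failure mode described in this commit can be reproduced with the standard library alone. The following is a minimal sketch, not Stract's code: `decode_naive` is a hypothetical helper that tries UTF-8 and then falls back to Latin-1. Because every byte is a valid Latin-1 character, the fallback never reports an error, so Windows-1251 bytes are silently turned into mojibake.

```rust
/// Hypothetical helper (not from the Stract codebase): try UTF-8 first,
/// then fall back to Latin-1.
/// Latin-1 maps every byte to the code point with the same value, so the
/// fallback can never report an error, even when the guess is wrong.
fn decode_naive(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}
```

Feeding it the Windows-1251 bytes for "Привет" returns accented Latin letters instead of an error, which is why a statistical detector is a safer choice than trial decoding.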
Mikkel Denker
c7e596ac5b refactor crawl planner
* make sure crawl plan has frontpage of all crawled sites
* order crawl plan to first crawl sites with high harmonic centrality
2024-03-04 16:26:07 +01:00
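The two planner rules above can be sketched as follows. This is an illustration under stated assumptions: `PlannedSite`, its fields, and `plan_crawl` are hypothetical names, and the front page is assumed to be `https://<host>/`.

```rust
/// Hypothetical, simplified view of a site in the crawl plan.
#[derive(Debug, Clone, PartialEq)]
struct PlannedSite {
    host: String,
    harmonic_centrality: f64,
    urls: Vec<String>,
}

/// Make sure each site includes its front page, then order the plan so that
/// sites with the highest harmonic centrality come first.
fn plan_crawl(mut sites: Vec<PlannedSite>) -> Vec<PlannedSite> {
    for site in &mut sites {
        // Assumption: the front page is the root URL served over https.
        let frontpage = format!("https://{}/", site.host);
        if !site.urls.contains(&frontpage) {
            site.urls.insert(0, frontpage);
        }
    }
    // `total_cmp` gives a total order on f64, so NaN values cannot panic the sort.
    sites.sort_by(|a, b| b.harmonic_centrality.total_cmp(&a.harmonic_centrality));
    sites
}
```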
Mikkel Denker
26eb164482 refactor snippet into a normal snippet and a rich snippet.
a normal snippet should always be returned from the api. an application can then choose to show the rich snippet instead if one is present. this gives more flexibility when building applications on top of stract's api
2024-03-01 14:28:49 +01:00
Mikkel Denker
a0e371b296
Rake keyword extraction (#171)
* extract keywords using rake algorithm

* store webpage keywords in index and use during ranking
2024-03-01 13:02:40 +01:00
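For readers unfamiliar with RAKE, here is a simplified sketch of the algorithm using only the standard library. It is not the implementation from this commit: the `rake` function, its signature, and the tokenization rules are assumptions. Candidate phrases are runs of words between stop words and punctuation, each word is scored as degree divided by frequency, and each phrase is scored as the sum of its word scores.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified RAKE (illustrative, not Stract's implementation).
/// Returns candidate phrases sorted by descending score.
fn rake(text: &str, stop_words: &HashSet<&str>) -> Vec<(String, f64)> {
    let lowered = text.to_lowercase();

    // 1. Candidate phrases: maximal runs of non-stop words within a fragment.
    let mut phrases: Vec<Vec<&str>> = Vec::new();
    for fragment in lowered.split(|c: char| !c.is_alphanumeric() && !c.is_whitespace()) {
        let mut current = Vec::new();
        for word in fragment.split_whitespace() {
            if stop_words.contains(word) {
                if !current.is_empty() {
                    phrases.push(std::mem::take(&mut current));
                }
            } else {
                current.push(word);
            }
        }
        if !current.is_empty() {
            phrases.push(current);
        }
    }

    // 2. Word statistics: frequency counts occurrences, degree counts
    //    co-occurrences within a phrase (including the word itself).
    let mut freq: HashMap<&str, f64> = HashMap::new();
    let mut degree: HashMap<&str, f64> = HashMap::new();
    for phrase in &phrases {
        for &word in phrase {
            *freq.entry(word).or_insert(0.0) += 1.0;
            *degree.entry(word).or_insert(0.0) += phrase.len() as f64;
        }
    }

    // 3. Phrase score: sum of degree / frequency over the words in the phrase.
    let mut scored: HashMap<String, f64> = HashMap::new();
    for phrase in &phrases {
        let score: f64 = phrase.iter().map(|&w| degree[w] / freq[w]).sum();
        scored.insert(phrase.join(" "), score);
    }

    // Highest score first; ties are broken alphabetically for stable output.
    let mut result: Vec<(String, f64)> = scored.into_iter().collect();
    result.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    result
}
```

Multi-word phrases tend to score higher than single words because each of their words has a higher degree, which is the property that makes RAKE useful for picking out topical keywords.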
Mikkel Denker
11ad03be77 change crossencoder scores to be calculated from their ranks.
the raw scores from the cross encoder model can be very similar even though site A is way better than site B, so it will be difficult to scale correctly with a coefficient. the models are essentially only trained to optimize the ranks.

also fix a non-deterministic test
2024-02-29 14:29:31 +01:00
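One way to derive scores from ranks is sketched below. The reciprocal-rank formula and the name `scores_from_ranks` are assumptions for illustration; the commit does not state which mapping is used.

```rust
/// Replace raw model scores with scores derived from their rank.
/// The best raw score gets 1.0, the second best 1/2, the third 1/3, and so on.
/// Assumption: reciprocal rank is used purely as an example mapping.
fn scores_from_ranks(raw: &[f64]) -> Vec<f64> {
    // Indices sorted by descending raw score.
    let mut order: Vec<usize> = (0..raw.len()).collect();
    order.sort_by(|&a, &b| raw[b].total_cmp(&raw[a]));

    let mut out = vec![0.0; raw.len()];
    for (rank, &idx) in order.iter().enumerate() {
        out[idx] = 1.0 / (rank as f64 + 1.0);
    }
    out
}
```

Raw scores of 0.91, 0.93 and 0.92 are nearly indistinguishable, but their rank-based scores are clearly separated, which makes them easier to scale with a single coefficient.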
Mikkel Denker
7d870b2702 build wasm during 'just setup' and make sure pkg has a package.json file.
see https://github.com/rustwasm/wasm-pack/issues/965
2024-02-29 14:05:02 +01:00
Mikkel Denker
7583aa426b
Parse site block rules into 'HostRankings' instead. (#170)
* parse site block rules into 'HostRankings' instead.
they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics.

* each rule can have more than 1 site to discard

* fix non-deterministic 'num_slashes_and_digits' test
2024-02-29 10:28:00 +01:00
Mikkel Denker
5c7278a271 better error handling in calculator widget 2024-02-28 20:22:10 +01:00
Mikkel Denker
d9b7328a67 like/dislike/block button tooltips 2024-02-28 20:12:17 +01:00
Mikkel Denker
99dc06ec11 chore: fix clippy 2024-02-28 19:58:38 +01:00
Wesley Appler
25c0344578
[WIP] Implement the importing of optics (#167)
* Initial implementation of importing sites from an optic

* Removed unused import

* Updated button text

* Implemented client-side WASM to allow for parsing of imported .optic files

* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs

* CI updates

* Added vite-plugin-wasm-pack to ensure wasm modules get copied over

* CI fix >:(

* More CI attempts

* agony - CSP fix & further wasm-pack fixes

* CSP updates

* Package update to prevent an unnecessary build of wasm

* reduce bloat in ci build log from wasm

* fix another non-deterministically failing test

* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system

* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109

* run 'npm run format'

* propagate errors from wasm crate
2024-02-28 17:01:32 +01:00
Mikkel Denker
f1f8394494 improve performance of approximated harmonic.
from ~2-3 days down to ~16 hours on production page graph with ~3.2 billion nodes
2024-02-28 10:42:48 +01:00
Mikkel Denker
5d4183c57a make sure 'DomainIfHomepageNoTokenizer' is (not) tokenized correctly 2024-02-28 10:41:00 +01:00
Mikkel Denker
9ea1546272 hopefully fix non-deterministic test failures of 'fetch_time_ranking' and 'liked_hosts' tests 2024-02-28 10:38:18 +01:00
Mikkel Denker
dc52379789
Fix bm25 scaling of terms based on their idf (#163)
* change term scaling from idf-sum to correctly weight each term based on the number of documents that match that particular term

* remove scoring from patternquery.
the results are always scored in 'Signal::compute' anyway so no need for the added complexity

* small bm25 test that makes sure terms are scaled

* use simpler idf sum instead of full bm25 for simple fields (site, domain, url, backlink text etc.)
fixes the performance regression of `Signal::compute`
2024-02-19 11:05:12 +01:00
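The idea of weighting each term by its own idf can be sketched as follows. This is not the code from the commit: the function names are hypothetical, the idf formula is the non-negative variant used by Lucene, and `K1 = 1.2` and `B = 0.75` are the usual default parameters, assumed here.

```rust
/// BM25 inverse document frequency for a term matched by `doc_freq` of
/// `total_docs` documents (non-negative Lucene-style variant).
fn idf(doc_freq: u64, total_docs: u64) -> f64 {
    let n = doc_freq as f64;
    let total = total_docs as f64;
    (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
}

/// BM25 contribution of a single term with term frequency `tf`.
fn bm25_term(tf: f64, doc_freq: u64, total_docs: u64, field_len: f64, avg_field_len: f64) -> f64 {
    const K1: f64 = 1.2; // assumed default
    const B: f64 = 0.75; // assumed default
    let norm = K1 * (1.0 - B + B * field_len / avg_field_len);
    idf(doc_freq, total_docs) * tf * (K1 + 1.0) / (tf + norm)
}

/// Score a document by summing per-term contributions, so a rare term
/// weighs more than a common one. Each tuple is (tf, doc_freq).
fn bm25(terms: &[(f64, u64)], total_docs: u64, field_len: f64, avg_field_len: f64) -> f64 {
    terms
        .iter()
        .map(|&(tf, df)| bm25_term(tf, df, total_docs, field_len, avg_field_len))
        .sum()
}
```

With per-term scaling, a term that matches 1 document out of 1000 contributes far more than a term that matches 900, even when both occur once in the field.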
Oliver Bøving
2d8973bcf7
Add basic CI (#156)
* Add basic CI

* Add liburing installation step to CI workflow

* Run `npm install` as part of ci/check

* Add `@types/node` package

* Add `submodules: 'recursive'` to CI

* Skip test if test data is not available

* Install `cargo-about` in CI
2024-02-17 20:09:58 +01:00
Oliver Bøving
0d8507a2a5
Run cargo clippy --fix and cargo fmt --fix (#157)
* Run `cargo clippy --fix` plus some minor manual refactors

* Run `cargo fmt` plus minor refactor
2024-02-17 16:43:09 +01:00
Oliver Bøving
b029d6b5f2
Bump scylla to v0.12.0 (#144)
Previously we were on a fixed rev due to unreleased pull-requests, but since these are now released we can pin to a crates.io version!

v0.11 introduced a breaking change in how `scylla::frame::value`'s are represented, most importantly around timestamps. Previously the timestamp was constructed from a duration, but now it takes an actual timestamp. In the old version, `Duration::seconds(0)` was used as the default, but now we have to provide a timestamp. I'm a little bit uncertain what is equivalent, but I believe `chrono::Utc::now` is what we intend to store.
2024-02-15 13:01:17 +01:00
SekoiaTree
29870796cb
Replace powf with powi (#152) 2024-02-15 12:57:46 +01:00
Mikkel Denker
0b69853fa9 chore: 'cargo update' and remove some unused trait method.
also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.
2024-02-12 13:49:20 +01:00
Mikkel Denker
f34daae6c3 fix 'Action(Discard)' combined with 'DiscardNonMatching' bug
if the optic had a 'DiscardNonMatching' it would also count all 'Action(Discard)' rules as matching, thereby not removing the matching result.
2024-02-12 10:12:07 +01:00
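The intended behaviour after this fix can be sketched with a small model. The `Action`, `Rule`, and `keep_result` names below are hypothetical simplifications (substring matching stands in for the real pattern queries), not the optics implementation.

```rust
/// Hypothetical, simplified optic rule action.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Boost,
    Discard,
}

/// Hypothetical rule: a substring pattern plus the action to take.
struct Rule {
    pattern: &'static str,
    action: Action,
}

/// Decide whether a result survives the optic.
/// A result matched by any `Discard` rule is always removed. With
/// `discard_non_matching` set, only matches of non-discard rules count
/// as "matching", so a discard match can no longer rescue a result.
fn keep_result(url: &str, rules: &[Rule], discard_non_matching: bool) -> bool {
    let matched = |action: Action| {
        rules
            .iter()
            .any(|r| r.action == action && url.contains(r.pattern))
    };
    if matched(Action::Discard) {
        return false;
    }
    !discard_non_matching || matched(Action::Boost)
}
```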
Mikkel Denker
a2bf160e26 small cleanup in tokenizer tests. no need for all the '.to_string()' 2024-02-12 10:10:32 +01:00
Mikkel Denker
aa4e59cb6e Fix CORS issues when adding an optic.
When adding an optic, we first make sure that we can actually fetch the optic. This check was performed client-side before, which would cause some CORS errors if it was against the CORS policy of the server hosting the optic. This commit introduces a simple endpoint on our frontend server, so the request to the optic server now doesn't come directly from the client.
2024-02-11 13:06:53 +01:00
Oliver Bøving
f5bb0d2ef8
Optics LSP Chores (#145)
* Update the linked project to new location of `optics-lsp`

* Update optics-lsp license file path

It was previously a symlink to `../LICENSE.md`, but since moving the crate into `crates/` this should now be `../../LICENSE.md`.

* Bump lsp-types and serde-wasm-bindgen

* Sort crates in optics-lsp and add itertools

* Move optic-lsp token docs out to a separate file

rustfmt does not like long string literals, so these doc strings prevented most of the file from being formatted. Moving it out still messes with rustfmt, but now it is at least isolated :)

* Add token to hover docs

This blatantly steals a feature from rust-analyzer, where the hovered token is displayed as the first thing in the hover docs. It helps signal that the documentation corresponds to the token being hovered, and also just looks nice. :)

* Refactor error formatting in optics-lsp

Instead of pushing string literals, we build up an iterator of lines and join them at the end.

Other minor refactorings are performed as well.

* change optics extension license to MIT

* release optics extension 0.0.12
2024-02-10 17:36:55 +01:00
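The error-formatting refactor in this commit (build an iterator of lines, join once at the end) follows a common Rust pattern. A minimal sketch, where `ParseError`, its fields, and the message wording are hypothetical and not taken from optics-lsp:

```rust
/// Hypothetical error description; the field names are illustrative.
struct ParseError {
    line: usize,
    column: usize,
    expected: Vec<String>,
}

/// Build the message as an iterator of lines and join once at the end,
/// instead of pushing string literals onto a growing buffer.
fn format_error(err: &ParseError) -> String {
    let header = format!("Unexpected token at {}:{}", err.line, err.column);
    std::iter::once(header)
        .chain(std::iter::once("Expected one of:".to_string()))
        .chain(err.expected.iter().map(|tok| format!("  - {}", tok)))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping each line as a separate iterator item avoids manual newline bookkeeping and makes it easy to add or drop sections conditionally.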
Andy Piper
16e1435bf9
trivial typo fixes (#143)
A couple of minor updates that stood out when reading this via the VS Marketplace listing.
2024-02-10 12:13:33 +01:00
SekoiaTree
ffbec7ac80
Replace ToString with Display (#134) 2024-02-08 19:15:27 +01:00
SekoiaTree
2a62f9c28a
Make places that use rules use rule OR-ing (#133) 2024-02-08 18:51:12 +01:00
Mikkel Denker
aed64be27e internet archive warc files do not seem to store the payload type. let's just assume it's html (records that can't be parsed are skipped anyway) 2024-02-08 16:14:24 +01:00
Mikkel Denker
91ffe15cfd support internet archive warc files.
the ordering seems to be a bit different from that of the commoncrawl files. these changes should hopefully make the parser more robust overall.
2024-02-07 21:15:08 +01:00
Mikkel Denker
a4ce85a905 bump optics samples submodule 2024-02-07 13:28:57 +01:00
Mikkel Denker
be7bbd02fc Forgot to add serde defaults to the new snippet config fields... 2024-02-06 19:38:45 +01:00
SekoiaTree
359117bfbe
Initial rule or-ing (#127) 2024-02-06 19:34:58 +01:00
Mikkel Denker
aa89813906 move some of the hardcoded snippet choices into the configuration file 2024-02-06 11:19:42 +01:00
Mikkel Denker
d9136d59fd force ttl directly in scylla table 2024-02-06 11:15:22 +01:00
Mikkel Denker
f61f1f6b0f fix bug where query suggestions couldn't be selected in safari 2024-02-05 14:57:26 +01:00
Mikkel Denker
e8489f4792 flatten fastfield reader from vec<vec<u64>> to a large vec<u64> 2024-02-04 15:14:49 +01:00
Mikkel Denker
625d6fc6b7 no need for enummap in fastfield reader as we know all fields statically 2024-02-04 15:05:14 +01:00
Mikkel Denker
eb7e96bf50 better cache locality for fast field reader.
instead of going from field -> doc -> value, we can go from doc -> field -> value and thereby reuse the doc -> field part for all fields in the document.
2024-02-04 14:51:50 +01:00
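The doc -> field -> value layout can be sketched with a single flat vector. This is an illustration only: `FlatFastFields` and its methods are hypothetical names, and every document is assumed to store the same number of `u64` fields.

```rust
/// Hypothetical flat fast-field store: values are laid out document by
/// document, so all fields of one document sit next to each other in memory.
struct FlatFastFields {
    num_fields: usize,
    values: Vec<u64>, // length = num_docs * num_fields
}

impl FlatFastFields {
    /// Flatten a `doc -> field -> value` nested vector into one allocation.
    fn from_nested(nested: Vec<Vec<u64>>) -> Self {
        let num_fields = nested.first().map_or(0, |doc| doc.len());
        // Assumption: every document has the same number of fields.
        assert!(nested.iter().all(|doc| doc.len() == num_fields));
        Self {
            num_fields,
            values: nested.into_iter().flatten().collect(),
        }
    }

    /// All field values of one document as a contiguous slice.
    fn doc(&self, doc: usize) -> &[u64] {
        let start = doc * self.num_fields;
        &self.values[start..start + self.num_fields]
    }

    /// A single value; the doc offset is computed once and shared by all fields.
    fn get(&self, doc: usize, field: usize) -> u64 {
        self.doc(doc)[field]
    }
}
```

Reading several fields of the same document then touches one contiguous region of memory instead of one location per field-specific vector.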
Mikkel Denker
b88e7fd013 search example reduce words considered for snippet generation 2024-02-04 13:30:28 +01:00
Mikkel Denker
5164789a32 move search bench into an example so we can more easily profile with perf 2024-02-04 13:28:09 +01:00
Mikkel Denker
099282cefa don't need to send all ranking signals to frontend for discussions widget. sending final score is enough 2024-02-03 16:48:43 +01:00
Mikkel Denker
7919df0863 'optimize_for_search' actually seemed to make the searches slower as too many threads would fight for io access at once 2024-02-03 14:36:19 +01:00
Mikkel Denker
10043e7db6 reduce fastfield indirections 2024-02-03 14:25:17 +01:00
Mikkel Denker
c46a85a97e load all fastfields into memory.
this is an experiment to see how it affects performance vs memory usage
2024-02-02 22:06:54 +01:00
Mikkel Denker
45d8245374 keep fastfield reader open across searches
52% of time seems to be spent on opening the fastfields
2024-02-02 15:10:41 +01:00
Mikkel Denker
9bcaf054c3 bench based on queries from autosuggest.
this will hopefully show us where the hot paths are when caches aren't hit
2024-02-02 14:50:54 +01:00
Mikkel Denker
e4e3044e47 finally ditch that pesky libtorch dependency! 2024-02-02 13:11:06 +01:00
Mikkel Denker
d7e564d91a move neural network models from torch to candle 2024-02-02 12:36:39 +01:00