Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
a46a24cde8 don't treat 'site.com' etc. as a sentence boundary even though it contains a '.' 2024-03-14 09:21:12 +01:00
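The rule in this commit can be illustrated with a minimal sketch. It assumes a '.' only ends a sentence when it is followed by whitespace or the end of the text; the function names are hypothetical and are not taken from the stract codebase.

```rust
/// Returns true if the '.' at byte index `i` looks like a sentence boundary,
/// i.e. it is followed by whitespace or by the end of the text.
/// (Assumed heuristic, for illustration only.)
pub fn is_sentence_boundary(text: &str, i: usize) -> bool {
    if text.as_bytes().get(i) != Some(&b'.') {
        return false;
    }
    match text[i + 1..].chars().next() {
        None => true,
        Some(c) => c.is_whitespace(),
    }
}

/// Splits `text` into trimmed sentences using the heuristic above.
pub fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if c == '.' && is_sentence_boundary(text, i) {
            let sentence = text[start..=i].trim();
            if !sentence.is_empty() {
                sentences.push(sentence);
            }
            start = i + 1;
        }
    }
    // Keep any trailing text that does not end with a '.'.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}
```

With this heuristic, `split_sentences("Visit site.com today. It is fast.")` yields two sentences, and the '.' inside `site.com` does not split the first one.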
Mikkel Denker
3ec66c134c fix CI.
remove unused imports
2024-03-13 09:55:21 +01:00
Mikkel Denker
5ce97abf46
Run frontend lint in CI (#180)
Adds `npm run lint` to CI and fixes all the previous lint errors.
2024-03-13 09:48:07 +01:00
Wesley Appler
7bca42f068
Created ResultLink to handle the opening of result links in new tabs (#178)
* Created ResultLink to handle the opening of result links in new tabs

* Removed period

* Removed ResultLink from thesaurus & removed noreferrer
2024-03-13 09:18:02 +01:00
Mikkel Denker
32bbbc63ab
Semantic embeddings (#179)
* change indexer to prepare webpages in batches

* some clippy lints

* split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints

* use dual encoder to embed title and keywords of page during indexing

* make sure we don't open harmonic centrality rocksdb in core/src during test...

* add indexer example used for benchmark

* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get embeddings
for the sites that are most likely to appear in the search results while it is
still tractable to compute.

* store embeddings in index as bytes

* refactor ranking pipeline to statically ensure we score the different stages as expected

* use similarity between title and query embeddings during ranking

* use keyword embeddings during ranking

* handle missing fastfields in index gracefully

* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
2024-03-11 14:28:17 +01:00
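A minimal sketch of two steps from this commit: storing an embedding as bytes and comparing a title embedding with a query embedding. The little-endian `f32` layout and the use of cosine similarity are assumptions made for illustration; the commit does not state the byte layout or the similarity measure, and the function names are hypothetical.

```rust
/// Serialises an embedding as little-endian f32 bytes (assumed layout).
pub fn embedding_to_bytes(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Reads an embedding back; returns None if the length is not a multiple of 4.
pub fn embedding_from_bytes(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Cosine similarity between two embeddings of equal length.
/// Returns None for mismatched lengths, empty input, or zero vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Returning `None` for a malformed or missing embedding lets a caller skip the signal instead of failing, which is in the spirit of the "handle missing fastfields in index gracefully" item above.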
Mikkel Denker
c8447c9ef0 make explore work without javascript and add browser features section to privacy statement 2024-03-10 17:24:04 +01:00
Mikkel Denker
519ecb52b7 chore: cargo update 2024-03-05 13:58:47 +01:00
Mikkel Denker
13b8e7bdec
Disable webgraph checksum (#173)
* Disable checksum verification in rocksdb.
~80% of the time seems to be spent in xxh3 hash. This seems to be primarily used to verify the checksums when the blockcache gets a cache-miss and needs to read from disk. We don't gracefully handle corruptions either way, so let's just disable the verification and see how it impacts performance.

* also disable verification in 'RocksDbStore'

* allow rocksdb to prefetch blocks during iteration

* increase block cache for 'Id2NodeDb' and 'RocksDbStore'
2024-03-05 13:42:03 +01:00
Mikkel Denker
7630ae4de6 chore: update optics test samples 2024-03-05 11:03:58 +01:00
Mikkel Denker
f8c58b3c03
use more sophisticated encoding detection when utf8 decoding fails. (#172)
some websites, especially older ones, sometimes use a different encoding scheme than utf8 or latin1. before, we simply tried different encoding schemes until one successfully decoded the bytes, but this approach can fail unexpectedly: bytes written in one encoding can be decoded by another encoding without any error being reported, which silently produces the wrong text.
we now use the encoding detection crate 'chardetng' which is also [used in firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).
2024-03-05 10:55:05 +01:00
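The failure mode described in this commit can be reproduced with the standard library alone. The sketch below is a hypothetical reconstruction of the old "try encodings until one works" approach, not the project's actual code, and it does not use `chardetng`. Latin-1 maps every byte to a character, so the fallback never reports an error, even for bytes that were written in a different encoding such as Windows-1251.

```rust
/// Decodes bytes as Latin-1 (ISO-8859-1). Every byte maps to the Unicode
/// code point with the same value, so this can never fail.
pub fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical trial decoding: try UTF-8 first, then fall back to Latin-1.
pub fn trial_decode(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_string(),
        Err(_) => decode_latin1(bytes),
    }
}
```

For example, the Windows-1251 bytes `[0xE4, 0xE0]` (the Russian word "да") are not valid UTF-8, so `trial_decode` falls back to Latin-1 and returns "äà" without reporting any error. A statistical detector is meant to catch cases like this by guessing the encoding from the byte patterns.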
Mikkel Denker
c7e596ac5b refactor crawl planner
* make sure crawl plan has frontpage of all crawled sites
* order crawl plan to first crawl sites with high harmonic centrality
2024-03-04 16:26:07 +01:00
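The two bullet points above could be sketched as follows. The `SitePlan` type, its fields, and the `https://<site>/` front page convention are assumptions made for illustration and do not come from the crawl planner itself.

```rust
/// Hypothetical per-site entry in a crawl plan.
pub struct SitePlan {
    pub site: String,
    pub centrality: f64,
    pub urls: Vec<String>,
}

/// Ensures each site's front page is planned and orders sites so that
/// those with the highest harmonic centrality are crawled first.
pub fn order_crawl_plan(mut plan: Vec<SitePlan>) -> Vec<SitePlan> {
    for entry in &mut plan {
        // Assumed front page URL format.
        let frontpage = format!("https://{}/", entry.site);
        if !entry.urls.contains(&frontpage) {
            entry.urls.insert(0, frontpage);
        }
    }
    // Descending order by centrality.
    plan.sort_by(|a, b| b.centrality.total_cmp(&a.centrality));
    plan
}
```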
Mikkel Denker
26eb164482 refactor snippet into a normal snippet and a rich snippet.
a normal snippet should always be returned from the api. an application can then choose to show the rich snippet instead if one is present. this gives more flexibility when building applications on top of stract's api
2024-03-01 14:28:49 +01:00
Mikkel Denker
c9f48ca3b3 gracefully handle summarization errors 2024-03-01 13:26:18 +01:00
Mikkel Denker
a0e371b296
Rake keyword extraction (#171)
* extract keywords using rake algorithm

* store webpage keywords in index and use during ranking
2024-03-01 13:02:40 +01:00
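For readers unfamiliar with RAKE (Rapid Automatic Keyword Extraction), here is a minimal sketch of the general algorithm. It is not the project's implementation: the tiny stopword list is a placeholder, and the function names are hypothetical. Candidate phrases are the runs of words between stopwords or punctuation, each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores.

```rust
use std::collections::HashMap;

/// Placeholder stopword list; a real one would be much longer.
const STOPWORDS: &[&str] = &[
    "a", "an", "and", "the", "of", "for", "in", "is", "on", "to", "with",
];

/// Splits text into candidate phrases at punctuation and stopwords.
pub fn candidate_phrases(text: &str) -> Vec<Vec<String>> {
    let mut phrases = Vec::new();
    // Punctuation always ends a phrase.
    for fragment in text.split(|c: char| !c.is_alphanumeric() && !c.is_whitespace()) {
        let mut current: Vec<String> = Vec::new();
        for word in fragment.split_whitespace() {
            let word = word.to_lowercase();
            if STOPWORDS.contains(&word.as_str()) {
                if !current.is_empty() {
                    phrases.push(std::mem::take(&mut current));
                }
            } else {
                current.push(word);
            }
        }
        if !current.is_empty() {
            phrases.push(current);
        }
    }
    phrases
}

/// Scores candidate phrases, highest score first.
pub fn rake(text: &str) -> Vec<(String, f64)> {
    let phrases = candidate_phrases(text);
    let mut freq: HashMap<&str, f64> = HashMap::new();
    let mut degree: HashMap<&str, f64> = HashMap::new();
    for phrase in &phrases {
        for word in phrase {
            // Frequency counts occurrences; degree adds the length of
            // every phrase the word appears in.
            *freq.entry(word.as_str()).or_insert(0.0) += 1.0;
            *degree.entry(word.as_str()).or_insert(0.0) += phrase.len() as f64;
        }
    }
    let mut scored: Vec<(String, f64)> = phrases
        .iter()
        .map(|phrase| {
            let score: f64 = phrase
                .iter()
                .map(|w| degree[w.as_str()] / freq[w.as_str()])
                .sum();
            (phrase.join(" "), score)
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Longer phrases made of words that rarely occur on their own score highest, which is why RAKE tends to surface multi-word keywords.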
Mikkel Denker
11ad03be77 change crossencoder scores to be calculated from their ranks.
the raw scores from the cross encoder model can be very similar even though site A is way better than site B, so it will be difficult to scale correctly with a coefficient. the models are essentially only trained to optimize the ranks.

also fix a non-deterministic test
2024-02-29 14:29:31 +01:00
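A minimal sketch of deriving scores from ranks rather than from raw model outputs. The reciprocal-rank formula `1 / (rank + 1)` is an assumption chosen for illustration; the commit does not say which function of the rank is used, and the function name is hypothetical.

```rust
/// Replaces raw cross-encoder scores with scores derived from their ranks.
/// The best raw score gets 1.0, the second 1/2, the third 1/3, and so on.
pub fn scores_from_ranks(raw_scores: &[f64]) -> Vec<f64> {
    // Indices sorted by raw score, best first.
    let mut order: Vec<usize> = (0..raw_scores.len()).collect();
    order.sort_by(|&a, &b| raw_scores[b].total_cmp(&raw_scores[a]));

    let mut out = vec![0.0; raw_scores.len()];
    for (rank, &idx) in order.iter().enumerate() {
        out[idx] = 1.0 / (rank as f64 + 1.0);
    }
    out
}
```

Raw scores such as 0.91, 0.93 and 0.90 differ by only a few hundredths, but their rank-based scores (1/2, 1 and 1/3) are spread far enough apart for a single coefficient to scale them meaningfully.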
Mikkel Denker
7d870b2702 build wasm during 'just setup' and make sure pkg has a package.json file.
see https://github.com/rustwasm/wasm-pack/issues/965
2024-02-29 14:05:02 +01:00
Mikkel Denker
9e45aa95fd make sure sites aren't removed when importing an optic 2024-02-29 12:35:31 +01:00
Mikkel Denker
7583aa426b
Parse site block rules into 'HostRankings' instead. (#170)
* parse site block rules into 'HostRankings' instead.
they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics.

* each rule can have more than 1 site to discard

* fix non-deterministic 'num_slashes_and_digits' test
2024-02-29 10:28:00 +01:00
Mikkel Denker
5c7278a271 better error handling in calculator widget 2024-02-28 20:22:10 +01:00
Mikkel Denker
d9b7328a67 like/dislike/block button tooltips 2024-02-28 20:12:17 +01:00
Mikkel Denker
99dc06ec11 chore: fix clippy 2024-02-28 19:58:38 +01:00
Wesley Appler
25c0344578
[WIP] Implement the importing of optics (#167)
* Initial implementation of importing sites from an optic

* Removed unused import

* Updated button text

* Implemented client-side WASM to allow for parsing of imported .optic files

* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs

* CI updates

* Added vite-plugin-wasm-pack to ensure wasm modules get copied over

* CI fix >:(

* More CI attempts

* agony - CSP fix & further wasm-pack fixes

* CSP updates

* Package update to prevent an unnecessary build of wasm

* reduce bloat in ci build log from wasm

* fix another non-deterministically failing test

* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system

* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109

* run 'npm run format'

* propagate errors from wasm crate
2024-02-28 17:01:32 +01:00
Mikkel Denker
6b9d514a5b temporarily disable frontend type check in CI 2024-02-28 11:34:19 +01:00
Mikkel Denker
f1f8394494 improve performance of approximated harmonic.
from ~2-3 days down to ~16 hours on production page graph with ~3.2 billion nodes
2024-02-28 10:42:48 +01:00
Mikkel Denker
5d4183c57a make sure 'DomainIfHomepageNoTokenizer' is (not) tokenized correctly 2024-02-28 10:41:00 +01:00
Mikkel Denker
9ea1546272 hopefully fix non-deterministic test failure of 'fetch_time_ranking' and 'liked_hosts' tests 2024-02-28 10:38:18 +01:00
Mikkel Denker
d30cb51e3d
Update nodejs to v20.10 in CI (#166)
* update nodejs to v20.10 in github action

* loosen restriction on npm version
frontend should work as long as we have correct nodejs version. don't think the npm version is necessary
2024-02-22 19:07:41 +01:00
Wesley Appler
1260ba0969
Add node versioning to package.json (#165)
* Updated package.json with node versions & added nvmrc

* Small edit to update git email

* Small edit to update git email

---------

Co-authored-by: Wes Appler <wes@lamemakes>
2024-02-22 18:40:26 +01:00
Mikkel Denker
fc4fe9eb32 Merge branch 'main' of github.com:StractOrg/stract 2024-02-21 13:31:12 +01:00
Mikkel Denker
df577020b1 torch is needed in scripts to export models during 'just configure' 2024-02-21 13:28:18 +01:00
Mikkel Denker
dc52379789
Fix bm25 scaling of terms based on their idf (#163)
* change term scaling from idf-sum to correctly weight each term based on the number of documents that match that particular term

* remove scoring from patternquery.
the results are always scored in 'Signal::compute' anyway so no need for the added complexity

* small bm25 test that makes sure terms are scaled

* use simpler idf sum instead of full bm25 for simple fields (site, domain, url, backlink text etc.)
fixes the performance regression of `Signal::compute`
2024-02-19 11:05:12 +01:00
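A minimal sketch of per-term idf weighting in BM25, as opposed to scaling every term by one shared idf sum. It uses the common textbook form of BM25 with `k1 = 1.2` and `b = 0.75`; those constants, the `TermStats` type, and the function names are assumptions and are not taken from the project.

```rust
/// Inverse document frequency: terms that match fewer documents weigh more.
pub fn idf(doc_count: u64, docs_with_term: u64) -> f64 {
    let n = docs_with_term as f64;
    let total = doc_count as f64;
    (1.0 + (total - n + 0.5) / (n + 0.5)).ln()
}

/// Hypothetical per-term statistics for one document.
pub struct TermStats {
    /// How often the term occurs in the document.
    pub tf: f64,
    /// How many documents in the index contain the term.
    pub docs_with_term: u64,
}

/// BM25 where each term is weighted by its own idf.
pub fn bm25(terms: &[TermStats], doc_count: u64, doc_len: f64, avg_doc_len: f64) -> f64 {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    // Length normalisation shared by all terms of the document.
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    terms
        .iter()
        .map(|t| idf(doc_count, t.docs_with_term) * t.tf * (K1 + 1.0) / (t.tf + norm))
        .sum()
}
```

With per-term weights, a match on a rare query term contributes more to the score than an equally frequent match on a common term.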
Oliver Bøving
f5384a4537
Run prettier and fix some frontend lint errors (#161)
* Run `npm run format`

* Fix some of the eslint errors
2024-02-19 10:05:09 +01:00
Abdurrahman Rajab
20211d8e25
fix: update scrolls to auto #159 (#160) 2024-02-19 09:29:28 +01:00
Oliver Bøving
2d8973bcf7
Add basic CI (#156)
* Add basic CI

* Add liburing installation step to CI workflow

* Run `npm install` as part of ci/check

* Add `@types/node` package

* Add `submodules: 'recursive'` to CI

* Skip test if test data is not available

* Install `cargo-about` in CI
2024-02-17 20:09:58 +01:00
Mikkel Denker
b89ea6389c pip install upgrade during setup 2024-02-17 19:50:21 +01:00
Oliver Bøving
0d8507a2a5
Run cargo clippy --fix and cargo fmt --fix (#157)
* Run `cargo clippy --fix` plus some minor manual refactors

* Run `cargo fmt` plus minor refactor
2024-02-17 16:43:09 +01:00
Mikkel Denker
b678e678a6 add links to '/webmasters' information for crawler 2024-02-17 13:41:33 +01:00
Oliver Bøving
b029d6b5f2
Bump scylla to v0.12.0 (#144)
Previously we were on a fixed rev due to unreleased pull-requests, but since these are now released we can pin to a crates.io version!

v0.11 introduced a breaking change in how `scylla::frame::value`'s are represented, most importantly around timestamps. Previously the timestamp was constructed from a duration, but now it takes an actual timestamp. In the old version, `Duration::seconds(0)` was used as the default, but now we have to provide a timestamp. I'm a little bit uncertain what is equivalent, but I believe `chrono::Utc::now` is what we intend to store.
2024-02-15 13:01:17 +01:00
SekoiaTree
29870796cb
Replace powf with powi (#152) 2024-02-15 12:57:46 +01:00
Mikkel Denker
4da1987fe5 update contributing guidelines 2024-02-15 12:55:32 +01:00
Mikkel Denker
8a92bc39ed add code of conduct 2024-02-15 10:12:30 +01:00
Mikkel Denker
3fa73e42e4 dynamically import highlight-js.
makes sure we don't send the somewhat big library to the frontend unless we actually need to render code.
2024-02-12 14:30:01 +01:00
Mikkel Denker
0b69853fa9 chore: 'cargo update' and remove some unused trait method.
also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.
2024-02-12 13:49:20 +01:00
Mikkel Denker
f34daae6c3 fix 'Action(Discard)' combined with 'DiscardNonMatching' bug
if the optic had a 'DiscardNonMatching' it would also count all 'Action(Discard)' rules as matching, thereby not removing the matching result.
2024-02-12 10:12:07 +01:00
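The intended semantics after this fix can be modelled with a small sketch. In the project the rules are executed as search queries, so this is only a hypothetical model of the decision, with made-up type and function names.

```rust
/// Simplified set of rule actions (hypothetical).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Action {
    Boost,
    Downrank,
    Discard,
}

/// Decides whether a result stays in the result set.
/// `matched_actions` holds the actions of every rule the result matched.
pub fn keep_result(matched_actions: &[Action], discard_non_matching: bool) -> bool {
    // A matching 'Action(Discard)' rule always removes the result. The bug
    // was that such a match also counted as "matching" and kept the result.
    if matched_actions.contains(&Action::Discard) {
        return false;
    }
    // With 'DiscardNonMatching', a result must match at least one other rule.
    !discard_non_matching || !matched_actions.is_empty()
}
```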
Mikkel Denker
a2bf160e26 small cleanup in tokenizer tests. no need for all the '.to_string()' 2024-02-12 10:10:32 +01:00
Mikkel Denker
aa4e59cb6e Fix CORS issues when adding an optic.
When adding an optic, we first make sure that we can actually fetch the optic. This check was performed client-side before, which would cause some CORS errors if it was against the CORS policy of the server hosting the optic. This commit introduces a simple endpoint on our frontend server, so the request to the optic server now doesn't come directly from the client.
2024-02-11 13:06:53 +01:00
Oliver Bøving
f5bb0d2ef8
Optics LSP Chores (#145)
* Update the linked project to new location of `optics-lsp`

* Update optics-lsp license file path

It was previously a symlink to `../LICENSE.md`, but since moving the crate into `crates/` this should now be `../../LICENSE.md`.

* Bump lsp-types and serde-wasm-bindgen

* Sort crates in optics-lsp and add itertools

* Move optic-lsp token docs out to a separate file

rustfmt does not like long string literals, so these doc strings prevented most of the file from being formatted. Moving it out still messes with rustfmt, but now it is at least isolated :)

* Add token to hover docs

This blatantly steals a feature from rust-analyzer, where the hovered token is displayed as the first thing in the hover docs. It helps signal that the documentation corresponds to the token being hovered, and also just looks nice. :)

* Refactor error formatting in optics-lsp

Instead of pushing string literals, we build up an iterator of lines and join them at the end.

Other minor refactorings are performed as well.

* change optics extension license to MIT

* release optics extension 0.0.12
2024-02-10 17:36:55 +01:00
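The error-formatting refactor in this commit (building an iterator of lines and joining them at the end) could look roughly like the sketch below. The message format and the function name are hypothetical and do not reflect what `optics-lsp` prints.

```rust
/// Formats an error as several lines by chaining line iterators and
/// joining once at the end, instead of pushing pieces onto a String.
pub fn format_error(message: &str, expected: &[&str]) -> String {
    std::iter::once(format!("error: {message}"))
        .chain(expected.iter().map(|e| format!("  expected: {e}")))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Each part of the message is one iterator stage, so optional sections can be added or dropped without having to track trailing newlines.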
Andy Piper
16e1435bf9
trivial typo fixes (#143)
A couple of minor updates that stood out when reading this via the VS Marketplace listing.
2024-02-10 12:13:33 +01:00
jmillerv
ccd16df514
Add Stract to Web Browser Search Documentation (#135)
* add steps for chrome & firefox

* add steps for mainstream browsers

* add images to steps

* fix typo in filename

* fix typo

* remove unneeded word for brevity
2024-02-10 12:12:38 +01:00
SekoiaTree
ffbec7ac80
Replace ToString with Display (#134) 2024-02-08 19:15:27 +01:00
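As background for this change: the Rust standard library provides a blanket `ToString` implementation for every type that implements `Display`, so implementing `Display` is usually preferred over implementing `ToString` by hand. The `Host` type below is a hypothetical example and is not taken from the commit.

```rust
use std::fmt;

/// Hypothetical wrapper type used only for illustration.
pub struct Host(pub String);

impl fmt::Display for Host {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "host:{}", self.0)
    }
}
```

Because of the blanket implementation, `Host("example.com".to_string()).to_string()` works without any extra code, and the type can also be used directly in `format!` and `println!`.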