0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	a46a24cde8	don't treat 'site.com' etc. as a sentence boundary even though it contains a '.'	2024-03-14 09:21:12 +01:00
Mikkel Denker	3ec66c134c	fix CI. remove unused imports	2024-03-13 09:55:21 +01:00
Mikkel Denker	5ce97abf46	Run frontend lint in CI (#180 ) Adds `npm run lint` to CI and fixes all the previous lint errors.	2024-03-13 09:48:07 +01:00
Wesley Appler	7bca42f068	Created ResultLink to handle the opening of result links in new tabs (#178 ) * Created ResultLink to handle the opening of result links in new tabs * Removed period * Removed ResultLink from thesaurus & removed noreferrer	2024-03-13 09:18:02 +01:00
Mikkel Denker	32bbbc63ab	Semantic embeddings (#179 ) * change indexer to prepare webpages in batches * some clippy lints * split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints * use dual encoder to embed title and keywords of page during indexing * make sure we don't open harmonic centrality rocksdb in core/src during test... * add indexer example used for benchmark * add option to only compute embeddings of top ranking sites. this is not really ideal, but it turns out to be way too slow to compute the embeddings for all the sites in the index. this way, we at least get embeddings for the sites that are most likely to appear in the search results while it is still tractable to compute. * store embeddings in index as bytes * refactor ranking pipeline to statically ensure we score the different stages as expected * use similarity between title and query embeddings during ranking * use keyword embeddings during ranking * handle missing fastfields in index gracefully * remove unneeded Arc clone when constructing 'RecallRankingWebpage'	2024-03-11 14:28:17 +01:00
Mikkel Denker	c8447c9ef0	make explore work without javascript and add browser features section to privacy statement	2024-03-10 17:24:04 +01:00
Mikkel Denker	519ecb52b7	chore: cargo update	2024-03-05 13:58:47 +01:00
Mikkel Denker	13b8e7bdec	Disable webgraph checksum (#173 ) * Disable checksum verification in rocksdb. ~80% of the time seems to be spent in xxh3 hash. This seems to be primarily used to verify the checksums when the blockcache gets a cache-miss and needs to read from disk. We don't gracefully handle corruptions either way, so let's just disable the verification and see how it impacts performance. * also disable verification in 'RocksDbStore' * allow rocksdb to prefetch blocks during iteration * increase block cache for 'Id2NodeDb' and 'RocksDbStore'	2024-03-05 13:42:03 +01:00
Mikkel Denker	7630ae4de6	chore: update optics test samples	2024-03-05 11:03:58 +01:00
Mikkel Denker	f8c58b3c03	use more sophisticated encoding detection when utf8 decoding fails. (#172 ) some websites, especially older ones, sometimes use a different encoding scheme than utf8 or latin1. before, we simply tried different encoding schemes until one successfully decoded the bytes but this approach can fail unexpectedly as some encodings can erroneously get decoded by other encodings without errors being reported. we now use the encoding detection crate 'chardetng' which is also [used in firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).	2024-03-05 10:55:05 +01:00
Mikkel Denker	c7e596ac5b	refactor crawl planner * make sure crawl plan has frontpage of all crawled sites * order crawl plan to first crawl sites with high harmonic centrality	2024-03-04 16:26:07 +01:00
Mikkel Denker	26eb164482	refactor snippet into a normal snippet and a rich snippet. a normal snippet should always be returned from the api. an application can then choose to show the rich snippet instead if one is present. this gives more flexibility when building applications on top of stracts api	2024-03-01 14:28:49 +01:00
Mikkel Denker	c9f48ca3b3	gracefully handle summarization errors	2024-03-01 13:26:18 +01:00
Mikkel Denker	a0e371b296	Rake keyword extraction (#171 ) * extract keywords using rake algorithm * store webpage keywords in index and use during ranking	2024-03-01 13:02:40 +01:00
Mikkel Denker	11ad03be77	change crossencoder scores to be calculated from their ranks. the raw scores from the cross encoder model can be very similar even though site A is way better than site B, so it will be difficult to scale correctly with a coefficient. the models are essentially only trained to optimize the ranks. also fix a non-deterministic test	2024-02-29 14:29:31 +01:00
Mikkel Denker	7d870b2702	build wasm during 'just setup' and make sure pkg has a package.json file. see https://github.com/rustwasm/wasm-pack/issues/965	2024-02-29 14:05:02 +01:00
Mikkel Denker	9e45aa95fd	make sure sites aren't removed when importing an optic	2024-02-29 12:35:31 +01:00
Mikkel Denker	7583aa426b	Parse site block rules into 'HostRankings' instead. (#170 ) * parse site block rules into 'HostRankings' instead. they are still executed as exactly the same tantivy queries, but this allows us to correctly import the sites from exported optics. * each rule can have more than 1 site to discard * fix non-deterministic 'num_slashes_and_digits' test	2024-02-29 10:28:00 +01:00
Mikkel Denker	5c7278a271	better error handling in calculator widget	2024-02-28 20:22:10 +01:00
Mikkel Denker	d9b7328a67	like/dislike/block button tooltips	2024-02-28 20:12:17 +01:00
Mikkel Denker	99dc06ec11	chore: fix clippy	2024-02-28 19:58:38 +01:00
Wesley Appler	25c0344578	[WIP] Implement the importing of optics (#167 ) * Initial implementation of importing sites from an optic * Removed unused import * Updated button text * Implemented client-side WASM to allow for parsing of imported .optic files * Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs * CI updates * Added vite-plugin-wasm-pack to ensure wasm modules get copied over * CI fix >:( * More CI attempts * agony - CSP fix & further wasm-pack fixes * CSP updates * Package update to prevent an unnecessary build of wasm * reduce bloat in ci build log from wasm * fix another non-deterministically failing test * only install wasm-pack as part of setup steps in CONTRIBUTING.md ./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system * add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend. adapted from https://github.com/StractOrg/stract/pull/109 * run 'npm run format' * propagate errors from wasm crate	2024-02-28 17:01:32 +01:00
Mikkel Denker	6b9d514a5b	temporarily disable frontend type check in CI	2024-02-28 11:34:19 +01:00
Mikkel Denker	f1f8394494	improve performance of approximated harmonic. from ~2-3 days down to ~16 hours on production page graph with ~3.2 billion nodes	2024-02-28 10:42:48 +01:00
Mikkel Denker	5d4183c57a	make sure 'DomainIfHomepageNoTokenizer' is (not) tokenized correctly	2024-02-28 10:41:00 +01:00
Mikkel Denker	9ea1546272	hopefully fix non-deterministic test failure of 'fetch_time_ranking' and 'liked_hosts' tests	2024-02-28 10:38:18 +01:00
Mikkel Denker	d30cb51e3d	Update nodejs to v20.10 in CI (#166 ) * update nodejs to v20.10 in github action * loosen restriction on npm version frontend should work as long as we have correct nodejs version. don't think the npm version is necessary	2024-02-22 19:07:41 +01:00
Wesley Appler	1260ba0969	Add node versioning to `package.json` (#165 ) * Updated package.json with node versions & added nvmrc * Small edit to update git email * Small edit to update git email --------- Co-authored-by: Wes Appler <wes@lamemakes>	2024-02-22 18:40:26 +01:00
Mikkel Denker	fc4fe9eb32	Merge branch 'main' of github.com:StractOrg/stract	2024-02-21 13:31:12 +01:00
Mikkel Denker	df577020b1	torch is needed in scripts to export models during 'just configure'	2024-02-21 13:28:18 +01:00
Mikkel Denker	dc52379789	Fix bm25 scaling of terms based on their idf (#163 ) * change term scaling from idf-sum to correctly weight each term based on the number of documents that match that particular term * remove scoring from patternquery. the results are always scored in 'Signal::compute' anyway so no need for the added complexity * small bm25 test that makes sure terms are scaled * use simpler idf sum instead of full bm25 for simple fields (site, domain, url, backlink text etc.) fixes the performance regression of `Signal::compute`	2024-02-19 11:05:12 +01:00
Oliver Bøving	f5384a4537	Run prettier and fix some frontend lint errors (#161 ) * Run `npm run format` * Fix some of the eslint errors	2024-02-19 10:05:09 +01:00
Abdurrahman Rajab	20211d8e25	fix: update scrolls to auto #159 (#160 )	2024-02-19 09:29:28 +01:00
Oliver Bøving	2d8973bcf7	Add basic CI (#156 ) * Add basic CI * Add liburing installation step to CI workflow * Run `npm install` as part of ci/check * Add `@types/node` package * Add `submodules: 'recursive'` to CI * Skip test if test data is not available * Install `cargo-about` in CI	2024-02-17 20:09:58 +01:00
Mikkel Denker	b89ea6389c	pip install upgrade during setup	2024-02-17 19:50:21 +01:00
Oliver Bøving	0d8507a2a5	Run `cargo clippy --fix` and `cargo fmt --fix` (#157 ) * Run `cargo clippy --fix` plus some minor manual refactors * Run `cargo fmt` plus minor refactor	2024-02-17 16:43:09 +01:00
Mikkel Denker	b678e678a6	add links to '/webmasters' information for crawler	2024-02-17 13:41:33 +01:00
Oliver Bøving	b029d6b5f2	Bump scylla to v0.12.0 (#144 ) Previously we were on a fixed rev due to unreleased pull-requests, but since these are now released we can pin to a crates.io version! v0.11 introduced a breaking change in how `scylla::frame::value`'s are represented, most importantly around timestamps. Previously the timestamp was constructed from a duration, but now it takes an actual timestamp. In the old version, `Duration::seconds(0)` was used as the default, but now we have to provide a timestamp. I'm a little bit uncertain what is equivalent, but I believe `chrono::Utc::now` is what we intend to store.	2024-02-15 13:01:17 +01:00
SekoiaTree	29870796cb	Replace powf with powi (#152 )	2024-02-15 12:57:46 +01:00
Mikkel Denker	4da1987fe5	update contributing guidelines	2024-02-15 12:55:32 +01:00
Mikkel Denker	8a92bc39ed	add code of conduct	2024-02-15 10:12:30 +01:00
Mikkel Denker	3fa73e42e4	dynamically import highlight-js. makes sure we don't send the somewhat big library to the frontend unless we actually need to render code.	2024-02-12 14:30:01 +01:00
Mikkel Denker	0b69853fa9	chore: 'cargo update' and remove some unused trait method. also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.	2024-02-12 13:49:20 +01:00
Mikkel Denker	f34daae6c3	fix 'Action(Discard)' combined with 'DiscardNonMatching' bug if the optic had a 'DiscardNonMatching' it would also count all 'Action(Discard)' rules as matching, thereby not removing the matching result.	2024-02-12 10:12:07 +01:00
Mikkel Denker	a2bf160e26	small cleanup in tokenizer tests. no need for all the '.to_string()'	2024-02-12 10:10:32 +01:00
Mikkel Denker	aa4e59cb6e	Fix CORS issues when adding an optic. When adding an optic, we first make sure that we can actually fetch the optic. This check was performed client-side before, which would cause some CORS errors if it was against the CORS policy of the server hosting the optic. This commit introduces a simple endpoint on our frontend server, so the request to the optic server now doesn't come directly from the client.	2024-02-11 13:06:53 +01:00
Oliver Bøving	f5bb0d2ef8	Optics LSP Chores (#145 ) * Update the linked project to new location of `optics-lsp` * Update optics-lsp license file path It was previously a symlink to `../LICENSE.md`, but since moving the crate into `crates/` this should now be `../../LICENSE.md`. * Bump lsp-types and serde-wasm-bindgen * Sort crates in optics-lsp and add itertools * Move optic-lsp token docs out to a separate file rustfmt does not like long string literals, so these doc strings prevented most of the file from being formatted. Moving it out still messes with rustfmt, but now it is at least isolated :) * Add token to hover docs This blatantly steals a feature from rust-analyzer, where the hovered token is displayed as the first thing in the hover docs. It helps signal that the documentation corresponds to the token being hovered, and also just looks nice. :) * Refactor error formatting in optics-lsp Instead of pushing string literals, we build up an iterator of lines and join them at the end. Other minor refactorings are performed as well. * change optics extension license to MIT * release optics extension 0.0.12	2024-02-10 17:36:55 +01:00
Andy Piper	16e1435bf9	trivial typo fixes (#143 ) A couple of minor updates that stood out when reading this via the VS Marketplace listing.	2024-02-10 12:13:33 +01:00
jmillerv	ccd16df514	Add Stract to Web Browser Search Documentation (#135 ) * add steps for chrome & firefox * add steps for mainstream browsers * add images to steps * fix typo in filename * fix typo * remove unneeded word for brevity	2024-02-10 12:12:38 +01:00
SekoiaTree	ffbec7ac80	Replace ToString with Display (#134 )	2024-02-08 19:15:27 +01:00

1308 commits