0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	12e9502e80	Improve API documentation (#235) * add docusaurus scalar api documentation structure * bump openapi 3.0 to 3.1 so we can mark internal endpoints * improve search api docs * webgraph api docs * point docs to prod	2024-11-19 13:43:42 +01:00
Mikkel Denker	31bfebf2c9	just update	2024-10-25 09:37:45 +02:00
Mikkel Denker	658ac6f682	Webgraph inverted index (#232) * overall structure for new webgraph store * webgraph schema structure and HostLinksQuery * deserialize edge * forward/backlink queries * full edge queries and iter smalledges * [wip] use new store in webgraph * remove id2node db * shortcircuit link queries * [wip] remote webgraph trait structure * [wip] shard awareness * finish remote webgraph trait structure * optimize read * merge webgraphs * construct webgraph store * make sure 'just configure' works and everything looks correct	2024-10-23 11:59:52 +02:00
Mikkel Denker	70e9e8181d	make api types public	2024-10-01 10:02:22 +02:00
Mikkel Denker	5ebdb24a07	just update	2024-10-01 09:51:11 +02:00
Mikkel Denker	8f97617904	make ml models optional during setup	2024-07-25 15:16:32 +02:00
Mikkel Denker	265b1b7871	Ranking diff tool (#207) * ranking diff tool structure * fix missing icon types * add admin for queries and experiments * minor cleanup * show experiment progress * upgrade node adapter for svelte * hopefully fix ci * display common queries between experiments * display serp diffs with top signals for each result * like experiments and show overview in queries * settings to toggle experiment shuffle and show/hide signals * keyboard shortcuts * visualise improvements by query category * document how to use tool	2024-06-03 15:00:16 +02:00
Mikkel Denker	a1381d667b	fixed bug that caused error model in spell correction to always be empty	2024-05-27 11:45:07 +02:00
Mikkel Denker	1a6d8ff6be	fixed bug in stupid_backoff model that caused last n_gram count to always be 0	2024-05-14 16:49:30 +02:00
Mikkel Denker	7d870b2702	build wasm during 'just setup' and make sure pkg has a package.json file. see https://github.com/rustwasm/wasm-pack/issues/965	2024-02-29 14:05:02 +01:00
Wesley Appler	25c0344578	[WIP] Implement the importing of optics (#167) * Initial implementation of importing sites from an optic * Removed unused import * Updated button text * Implemented client-side WASM to allow for parsing of imported .optic files * Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs * CI updates * Added vite-plugin-wasm-pack to ensure wasm modules get copied over * CI fix >:( * More CI attempts * agony - CSP fix & further wasm-pack fixes * CSP updates * Package update to prevent an unnecessary build of wasm * reduce bloat in ci build log from wasm * fix another non-deterministically failing test * only install wasm-pack as part of setup steps in CONTRIBUTING.md ./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system * add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend. adapted from https://github.com/StractOrg/stract/pull/109 * run 'npm run format' * propagate errors from wasm crate	2024-02-28 17:01:32 +01:00
Mikkel Denker	b89ea6389c	pip install upgrade during setup	2024-02-17 19:50:21 +01:00
Mikkel Denker	e4e3044e47	finally ditch that pesky libtorch dependency!	2024-02-02 13:11:06 +01:00
Mikkel Denker	1a9f381d15	GGML Rust bindings (#122) * move crates into a 'crates' folder * added cargo-about to check dependency licenses * create ggml-sys bindings and build as a static library. simple addition sanity test passes * update licenses * yeet alice * yeet qa model * yeet fact model * [wip] idiomatic rust bindings for ggml * [ggml] mul, add and sub ops implemented for tensors. i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, are view and concat needed?	2024-01-27 12:27:27 +01:00
Mikkel Denker	788f92c8f4	split webgraph server into host and page. allows us to host each graph on separate sets of servers.	2024-01-24 11:08:45 +01:00
Mikkel Denker	cc91935d0a	Move entity index out of normal search index and have dedicated search server for it	2024-01-23 14:53:33 +01:00
Mikkel Denker	fbc01ad865	summarization using mistral and 'chain-of-density' approach. the summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it. doing the improvement step just once seems to significantly improve the summary. we also now use an llm (mistral 7b) for the summarizations, as we can then use the same model for multiple tasks and serve it using gpus, thus significantly decreasing the latency.	2024-01-19 11:08:17 +01:00
Mikkel Denker	7ea3dbcca4	[ranking] add a host_centrality_rank and page_centrality_rank signal. it might be easier to score pages based on their rank in the sorted list of centralities. for instance the centralities for page A and page B might be very similar numerically, but if a lot of pages are between A and B when looking at the sorted list, the highest ranking page might in reality be a better result than the lower ranking one. the rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.	2024-01-05 12:20:24 +01:00
Mikkel Denker	276165da49	move libtorch behind feature flag	2023-10-14 14:17:54 +02:00
Mikkel Denker	8fd2b2a292	Re-write webgraph storage backend. The webgraph storage is now essentially a '(from, to) -> label' map stored in rocksdb databases. This heavily simplifies inserts and merges, since we can now insert new edges directly into the db without having to read the existing edges. Get operations now use a prefix iterator from rocksdb. This utilizes the fact that '[{from_bytes},0,0,0,...]' is a prefix of any '[{from_bytes},{to_bytes}]' that might have been inserted into the database. Assuming that we use a sufficiently large read-ahead size, I think there shouldn't be a noticeable increase in IO operations for get operations and thus no noticeable performance penalty. In fact, they might be a bit faster in practice due to not having to deserialize a hashmap and from the fact that rocksdb seems to be more tuned for small key-value sizes.	2023-09-14 12:54:37 +02:00
Oliver Bøving	2e2aff3da0	🥬 Svelte frontend (#91) * remove deno frontend * Add Svelte frontend * change frontend port to 8000 and autofocus searchbar on frontpage * Setup formatting of the new frontend with the new monorepo * Add "show more" button to explore * Add searchbar arrow key navigation * Update query based on navigation in search bar * Highlight matching prefix in search results * Add toggling of site rankings to search results * Fix crashing when having multiple semi-identical optics * Refactor searchbar visibility --------- Co-authored-by: Mikkel Denker <mikkel@trystract.com>	2023-09-10 16:32:03 +00:00
Mikkel Denker	d896e4ea94	control log level with environment variable	2023-09-05 20:27:02 +02:00
Oliver Bøving	369d5031df	Refactor `Justfile` and tracing with enabled debug tracing for stract (#87) * Refactor Justfile and tracing with enabled debug tracing for stract * Use `just dev` in `CONTRIBUTING.md`	2023-09-04 08:53:17 +00:00
Oliver Bøving	072a6323e9	🍋 Fresh frontend (#84) * Add fresh frontend This reimplements the existing frontend using Fresh. Primary highlights of this new frontend are: - Uses deno instead of node/npm for fewer dependencies. Deno for example includes a formatter and linter, and dependencies are downloaded automatically. - Everything is TypeScript. There is no more .astro or similar, which reduces complexity. - The frontend is built up of components entirely, which can either be server side rendered only, or rehydrated on the client for interactivity (islands). - Fresh server-side renders all requests, populated by using the API, which is typesafe and generated from the OpenAPI spec. - Combining the last two, it becomes much easier to add high levels of interactivity, which previously needed to be written in external JS files. Now these are Preact components and can use all the benefits that come from this. Future work includes: - [ ] Integrating Alice in the new UI - [ ] Direct answers UI - [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend - [ ] Routes supplying `queryUrlPart` to `Header` * Update fresh frontend to use "type" rather than "@type" * Add placeholder Tailwind config for VSCode intellisense * Add discussions UI * Clean up some left over template `{{...}}` * './icons' might not exist before generation * some UI/UX changes for consistency with old frontend * Remove unused ENABLE_CSP flag since it is always enabled now * Store icons used for the frontend in the repository * Don't generate icons when starting the frontend * Fix chat textarea sizing in Firefox * Add Chat UI to new frontend * Only allow one of liked, disliked, blocked at a time * Add `cursor-pointer` to safe search radio buttons * Add `leading-6` to articles to get more line spacing Almost equivalent to the old frontend * Prefix explore and site ranking links with https:// Perhaps we should determine the protocol in a more robust way? * Fix explore sites regressions from adding tailwind-forms * Refactor manage optics UI * Add API endpoint for exporting optic from site rankings `/beta/api/sites/export` is a JSON equivalent of the existing `/settings/sites/export` endpoint. * Add "Export as optic" and "Clear all and export as optic" buttons These new buttons use the new `/beta/api/sites/export` endpoint to download the generated optic * Store site rankings in URL and send it during searching * Use the tailwind config to extend the twind theme * Add `/beta/api/explore/export` API endpoint * Fix optics export button on explore * Reflect the currently searched optic in the optic selector * Add `noscript:hidden` class to hide fx search result adjust buttons * Re-search when changing ranking of a webpage * Refactor searchbar interaction and suggestion highlighting We now do the highlighting on the frontend * Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.). In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all. * Pass around `queryUrlPart` between pages * Do syntax highlighting server-side using HighlightJS * Remove `facebook.com` as default site in explore * Add webmasters page to new frontend * Remove old frontend * Remove dead code from old Rust frontend * Rename webstract to frontend * remove more stuff from old frontend --------- Co-authored-by: Mikkel Denker <mikkel@trystract.com>	2023-09-04 05:59:28 +00:00
Mikkel Denker	1af94339bb	increase max politeness, faster setup and some documentation	2023-07-13 17:53:46 +02:00
Mikkel Denker	af1e43206b	Normalize redirects in web graph	2023-07-06 16:39:52 +02:00
Mikkel Denker	5b645e5188	less indexes to hopefully increase insert performance in crawldb	2023-06-28 13:36:39 +02:00
Mikkel Denker	b08bf68cba	we also need to set the libtorch environment variables to be able to run the tests. We can now run 'just cargo ...' which makes sure cargo has all the necessary env variables set	2023-06-24 11:54:11 +02:00
Mikkel Denker	1f1dd5f588	added libtorch env stuff to justfile runs	2023-06-23 19:35:33 +02:00
Mikkel Denker	d4b3e6ba29	fixed justfile syntax	2023-06-23 17:29:34 +02:00
Mikkel Denker	c450036a1a	extracted part of 'just configure' command into a 'just setup'	2023-06-23 17:28:54 +02:00
Mikkel Denker	1e95f94207	download libtorch from python script since we need to download from pytorch website if compiling for linux	2023-06-23 17:11:26 +02:00
Mikkel Denker	415aa14f46	normalize python versions	2023-06-23 14:42:23 +02:00
Mikkel Denker	18f7ef1842	Alice; show claim confidence level	2023-06-07 15:43:33 +02:00
Mikkel Denker	b16a1b9629	alice	2023-06-01 15:43:27 +02:00
Mikkel Denker	cb64b49ad9	Fixed a bug where distance calculation in online-harmonic used the wrong node from the edge	2023-05-10 16:29:47 +02:00
Mikkel Denker	fe713a8737	Move from onnx to libtorch bindings for ML inference. Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and execute the onnx models on GPU since onnx is only compiled to older cuda versions. This commit removes our dependency to onnx and replaces it with direct bindings to libtorch which gives us more flexibility and still allows us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or if we may want to develop some kind of JIT that can fuse matrix operations to increase performance.	2023-05-08 11:11:49 +02:00
Mikkel Denker	5ab900eea5	fixed a bunch of problems with pattern_query implementation and wrote some tests to make sure it works correctly	2023-04-29 18:26:23 +02:00
Mikkel Denker	2a1fa6109a	abstractive summarization model with beam search	2023-02-07 15:11:23 +01:00
Mikkel Denker	8a6751cf24	Split centrality building into separate processes. This is a hotfix to reduce the memory for each step	2023-01-30 10:29:47 +01:00
Mikkel Denker	72fa54a945	Quantize crossencoder	2023-01-23 12:33:01 +01:00
Mikkel Denker	f1ad006799	Fixed bug where liked sites would show up in discardall optics, even though they matched none of the rules	2023-01-16 16:16:50 +01:00
Mikkel Denker	29fe3ad652	webgraph CLI merge segments	2023-01-04 12:43:32 +01:00
Mikkel Denker	a03a4957be	Ftr/optics language (#69) * Store all schema_org from webpages in a field * flatten json tokenizer * rename goggles -> optics * update optics syntax * cargo workspace * very simple lsp wasm connection * optics as separate package * hover stuff * optics vscode extension published * syntax errors on-save and begin schema-field * Use separate targets for LSP and rest (#68) By moving the different targets into separate workspaces, we avoid some of the issues where rust-analyzer might just stop working. By adding the two projects to .vscode/settings.json we keep the ability to get completions, goto definitions, rename, and such operations. This requires us to specify the dependency versions in the LSP crate, as we can no longer refer to them by the workspace version. The positive of this is that the WASM/LSP dependent crates are now moved to the LSP crate. * schema.org syntax in optic * optic can now perform schema searches * simplified schema_org flattening * wrote new quickstart.optic * update like-text Co-authored-by: Oliver Bøving <oliver@bvng.dk>	2022-12-01 14:59:49 +01:00
Mikkel Denker	bbca94c37e	Parse DMOZ data (#66) * Parse DMOZ data * index topics as facets * calculate topic centrality * fix serious bug in webgraph where some nodes disappeared (there is still a bug somewhere, but waaaay less nodes are missing now) * apply topic centrality during search	2022-11-03 14:19:21 +01:00
Mikkel Denker	2c127d5f39	Ftr/configure command (#65) * Add autosuggest scrape as a separate command * Save queries continuously * Save images as they get downloaded (way lower memory usage) * Created configure subcommand * Updated justfile and setup documentation	2022-10-26 14:58:26 +02:00
Mikkel Denker	3cc7c84a32	Ftr/distributed search (#59) * refactor network communication into separate module and made mapreduce async again * sonic module is simple enough as is * rename Searcher -> LocalSearcher * [WIP] distributed searcher structure outlined * split index search into initial and retrieval steps * distributed searcher searching shards * make bucket in collector generic * no more todo!s. Waiting for indexing to finish to test implementation * distributed searcher seems to work. Needs an enormous refactor - the code is really ugly * cleanup search-server on exit in justfile	2022-09-28 15:50:45 +02:00
Mikkel Denker	e649053260	update setup steps	2022-09-22 09:49:50 +02:00
Mikkel Denker	8c9ffede30	Ftr/page centrality (#55) * move signal from goggles into ranking module * refactor webpage test-constructor * add page_centrality field * use page centrality during ranking * small justfile refactoring * update index in lfs	2022-09-13 11:49:50 +02:00
Oliver Bøving	f972b163c5	Optimize frontend build time (#39) This moves building the astro frontend from build.rs into the justfile. This streamlines the build process for the frontend astro part, and the frontend application itself by letting cargo watch rebuild the astro and then the Rust binary, instead of building astro in build.rs. Non-conclusive results suggest that this improves build times from about 13s to 6s, while being more consistent :)	2022-09-10 12:23:33 +02:00

70 commits