0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	8beb2ac778	use 'chdir' instead of 'cd' as 'cd' is not a binary on most linux systems. credit to Quackdoc on matrix for fixing this	2024-07-25 17:26:32 +02:00
Mikkel Denker	8f97617904	make ml models optional during setup	2024-07-25 15:16:32 +02:00
Mikkel Denker	303a2cf2da	accept unicode-3.0 license	2024-06-27 17:12:28 +02:00
Mikkel Denker	385e8375c6	deduplicate urls during indexing	2024-06-24 10:43:12 +02:00
Mikkel Denker	817bda9738	optionally merge all webgraph segments into a single segment for improved read performance	2024-06-09 14:48:11 +02:00
Mikkel Denker	a1381d667b	fixed bug that caused error model in spell correction to always be empty	2024-05-27 11:45:07 +02:00
Mikkel Denker	af73d33b39	forgot to push new accepted licenses	2024-05-23 13:01:34 +02:00
Mikkel Denker	d9848b19b2	use snippet in nsfw script instead of body	2024-04-24 09:32:58 +02:00
Mikkel Denker	fe56664e4d	cargo check all features in ci	2024-04-23 20:28:02 +02:00
Oliver Bøving	18d9d279fb	Cratify `bloom` and `speedy-kv` (#193 ) * Move bloom into separate crate * Move speedy_kv into a separate crate * add licenses --------- Co-authored-by: Mikkel Denker <mikkel@stract.com>	2024-04-22 21:18:44 +02:00
Mikkel Denker	3ab4f944e0	MapReduce -> AMPC (#189 ) * [WIP] structure for mapreduce -> ampc and introduce tables in dht * temporarily disable failing lints in ampc/mod.rs * establish dht connection in ampc * support batch get/set in dht * ampc implementation (not tested yet) * dht upsert * no more todo's in ampc harmonic centrality impl * return 'UpsertAction' instead of bool from upserts. this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive * add ability to have multiple dht tables for each ampc algorithm. gives better type-safety as each table can then have its own key-value type pair * some bundled bug/correctness fixes. * await currently scheduled jobs after there are no more jobs to schedule. * execute each mapper fully at a time before scheduling next mapper. * compute centrality scores from set cardinalities. * refactor into smaller functions * happy path ampc dht test and split ampc into multiple files * correct harmonic centrality calculation in ampc * run distributed harmonic centrality worker and coordinator from cli * stream key/values from dht using range queries in batches * benchmark distributed centrality calculation * faster hash in shard selection and drop table in background thread * Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost * dht store copy-on-write for keys and values to make table clone faster * fix flaky dht test and improve .set performance using entries * dynamic batch size based on number of shards in dht cluster	2024-04-15 10:29:33 +02:00
Mikkel Denker	2258243bc2	run clippy in CI	2024-03-21 17:10:03 +01:00
Mikkel Denker	5ce97abf46	Run frontend lint in CI (#180 ) Adds `npm run lint` to CI and fixes all the previous lint errors.	2024-03-13 09:48:07 +01:00
Wesley Appler	25c0344578	[WIP] Implement the importing of optics (#167 ) * Initial implementation of importing sites from an optic * Removed unused import * Updated button text * Implemented client-side WASM to allow for parsing of imported .optic files * Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs * CI updates * Added vite-plugin-wasm-pack to ensure wasm modules get copied over * CI fix >:( * More CI attempts * agony - CSP fix & further wasm-pack fixes * CSP updates * Package update to prevent an unnecessary build of wasm * reduce bloat in ci build log from wasm * fix another non-deterministically failing test * only install wasm-pack as part of setup steps in CONTRIBUTING.md ./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system * add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend. adapted from https://github.com/StractOrg/stract/pull/109 * run 'npm run format' * propagate errors from wasm crate	2024-02-28 17:01:32 +01:00
Mikkel Denker	6b9d514a5b	temporarily disable frontend type check in CI	2024-02-28 11:34:19 +01:00
Mikkel Denker	df577020b1	torch is needed in scripts to export models during 'just configure'	2024-02-21 13:28:18 +01:00
Oliver Bøving	2d8973bcf7	Add basic CI (#156 ) * Add basic CI * Add liburing installation step to CI workflow * Run `npm install` as part of ci/check * Add `@types/node` package * Add `submodules: 'recursive'` to CI * Skip test if test data is not available * Install `cargo-about` in CI	2024-02-17 20:09:58 +01:00
Mikkel Denker	8a92bc39ed	add code of conduct	2024-02-15 10:12:30 +01:00
Mikkel Denker	0b69853fa9	chore: 'cargo update' and remove some unused trait method. also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.	2024-02-12 13:49:20 +01:00
Mikkel Denker	aa89813906	move some of the hardcoded snippet choices into the configuration file	2024-02-06 11:19:42 +01:00
Mikkel Denker	e4e3044e47	finally ditch that pesky libtorch dependency!	2024-02-02 13:11:06 +01:00
Mikkel Denker	d7e564d91a	move neural network models from torch to candle	2024-02-02 12:36:39 +01:00
Mikkel Denker	ea3b7a4099	implement some layers in ggml linear, embedding and multihead attention	2024-01-31 17:51:02 +01:00
Mikkel Denker	f4e7d1972c	actually skip disambiguation pages for entity index. turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use. the commit also includes a useful script to dump the html for a specific article from a zim file which is very useful when debugging this stuff	2024-01-29 09:59:39 +01:00
Mikkel Denker	b3bcda2dfe	simple script to dump article html from a zim file	2024-01-29 09:19:25 +01:00
Mikkel Denker	1a9f381d15	GGML Rust bindings (#122 ) * move crates into a 'crates' folder * added cargo-about to check dependency licenses * create ggml-sys bindings and build as a static library. simple addition sanity test passes * update licenses * yeet alice * yeet qa model * yeet fact model * [wip] idiomatic rust bindings for ggml * [ggml] mul, add and sub ops implemented for tensors. i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?	2024-01-27 12:27:27 +01:00
Mikkel Denker	cc91935d0a	Move entity index out of normal search index and have dedicated search server for it	2024-01-23 14:53:33 +01:00
Mikkel Denker	fbc01ad865	summarization using mistral and 'chain-of-density' approach. the summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it. doing the improvement step just once seems to significantly improve the summary. we also now use an llm (mistral 7b) for the summarizations, as we can then use the same model for multiple tasks and serve it using gpus, thus significantly decreasing the latency.	2024-01-19 11:08:17 +01:00
Mikkel Denker	7ea3dbcca4	[ranking] add a host_centrality_rank and page_centrality_rank signal. it might be easier to score pages based on their rank in the list sorted by centrality. for instance the centralities for page A and page B might be very similar numerically, but if a lot of pages are between A and B when looking at the sorted list, the highest ranking page might in reality be a better result than the lower ranking one. the rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.	2024-01-05 12:20:24 +01:00
Mikkel Denker	54fe19ddf6	trystract.com -> stract.com	2023-12-16 14:43:00 +01:00
Oliver Bøving	369d5031df	Refactor `Justfile` and tracing with enabled debug tracing for stract (#87 ) * Refactor Justfile and tracing with enabled debug tracing for stract * Use `just dev` in `CONTRIBUTING.md`	2023-09-04 08:53:17 +00:00
Oliver Bøving	c7e941f3c4	Rename Rust `frontend` to `api` (#86 )	2023-09-04 08:24:56 +00:00
Oliver Bøving	072a6323e9	🍋 Fresh frontend (#84 ) * Add fresh frontend This reimplements the existing frontend using Fresh. Primary highlights of this new frontend are: - Uses deno instead of node/npm for less dependencies. Deno for example includes a formatter and linter, and dependencies are downloaded automatically. - Everything is TypeScript. There is no more .astro or similar, which reduces complexity. - The frontend is built up of components entirely, which can either be server side rendered only, or rehydrated on the client for interactivity (islands). - Fresh server side renders all requests, populated by using the API, which is typesafe and generated from the OpenAPI spec. - Combining the last two, it becomes much easier to add high levels of interactivity, which previously needed to be written in external JS files. Now these are Preact components and can use all the benefits that come from this. Future work includes: - [ ] Integrating Alice in the new UI - [ ] Direct answers UI - [ ] Default Optics. Should they come from the API or the frontend? 
- [ ] Integrating the new fresh server with the existing backend - [ ] Routes supplying `queryUrlPart` to `Header` * Update fresh frontend to use "type" rather than "@type" * Add placeholder Tailwind config for VSCode intellisense * Add discussions UI * Clean up some left over template `{{...}}` * './icons' might not exist before generation * some UI/UX changes for consistency with old frontend * Remove unused ENABLE_CSP flag since it is always enabled now * Store icons used for the frontend in the repository * Don't generate icons when starting the frontend * Fix chat textarea sizing in Firefox * Add Chat UI to new frontend * Only allow one of liked, disliked, blocked at a time * Add `curosr-pointer` to safe search radio buttons * Add `leading-6` to articles to get more line spacing Almost equivalent to the old frontend * Prefix explore and site ranking links with https:// Perhaps we should determine the protocol in a more robust way? * Fix explore sites regressions from adding tailwind-forms * Refactor manage optics UI * Add API endpoint for exporting optic from site rankings `/beta/api/sites/export` is a JSON equivalent of the existing `/settings/sites/export` endpoint. * Add "Export as optic" and "Clear all and export as optic" buttons These new buttons use the new `/beta/api/sites/export` endpoint to download the generated optic * Store site rankings in URL and send it during searching * Use the tailwind config to extend the twind theme * Add `/beta/api/explore/export` API endpoint * Fix optics export button on explore * Reflect the currently searched optic in the optic selector * Add `noscript:hidden` class to hide fx search result adjust buttons * Re-search when changing ranking of a webpage * Refactor searchbar interaction and suggestion highlighting We now do the highlighting on the frontend * Change site blocking to be domain blocking when converting site rankings to optics. 
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.). In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all. * Pass around `queryUrlPart` between pages * Do syntax highlighting server-side using HighlightJS * Remove `facebook.com` as default site in explore * Add webmasters page to new frontend * Remove old frontend * Remove dead code from old Rust frontend * Rename webstract to frontend * remove more stuff from old frontend --------- Co-authored-by: Mikkel Denker <mikkel@trystract.com>	2023-09-04 05:59:28 +00:00
Mikkel Denker	24ffa32983	Safe search (#79 ) * This should fix the byte/char index mixups identified in issue 77 * script to generate dataset * naive bayes classification with tf-idf features * Add prediction confidence to naive bayes. we report the confidence as $log_probs[best] / sum(log_probs)$. I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability. Also, even though naive bayes is a pretty decent classifier some people on stackexchange report that it's a pretty bad probability estimator. Further tests will determine if this confidence score is actually useful. * naive bayes benchmark * store safe search classification in index * search preferences page where user can control safe search settings	2023-08-26 16:36:33 +00:00
Mikkel Denker	7dbd149075	Support indieweb only optics. Still need to update the quickstart and blogroll optic to include the new match location.	2023-08-18 16:13:45 +02:00
Mikkel Denker	1af94339bb	increase max politeness, faster setup and some documentation	2023-07-13 17:53:46 +02:00
Mikkel Denker	4f1a7079e4	formatting and raw http for s3 endpoint	2023-07-11 17:47:42 +02:00
Mikkel Denker	af1e43206b	Normalize redirects in web graph	2023-07-06 16:39:52 +02:00
Mikkel Denker	dda4529a2e	getting alice cuda acceleration to actually work (hopefully)	2023-06-24 17:28:10 +02:00
Mikkel Denker	26ce6b69f3	create libtorch symlinks	2023-06-24 15:59:04 +02:00
Mikkel Denker	1f1dd5f588	added libtorch env stuff to justfile runs	2023-06-23 19:35:33 +02:00
Mikkel Denker	1e95f94207	download libtorch from python script since we need to download from pytorch website if compiling for linux	2023-06-23 17:11:26 +02:00
Mikkel Denker	18f7ef1842	Alice; show claim confidence level	2023-06-07 15:43:33 +02:00
Mikkel Denker	b16a1b9629	alice	2023-06-01 15:43:27 +02:00
Mikkel Denker	cb64b49ad9	Fixed a bug where distance calculation in online-harmonic used the wrong node from the edge	2023-05-10 16:29:47 +02:00
Mikkel Denker	f0129d724f	find similar sites in webgraph	2023-05-09 11:44:25 +02:00
Mikkel Denker	72d1086672	Dual-encoder as passage scorer for extractive summarization	2023-05-08 15:43:22 +02:00
Mikkel Denker	fe713a8737	Move from onnx to libtorch bindings for ML inference. Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and execute the onnx models on GPU since onnx is only compiled to older cuda versions. This commit removes our dependency to onnx and replaces it with direct bindings to libtorch which gives us more flexibility and still allows us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or if we may want to develop some kind of JIT that can fuse matrix operations to increase performance.	2023-05-08 11:11:49 +02:00
Mikkel Denker	be78c1dab5	blogroll optic	2023-05-02 11:30:51 +02:00
Mikkel Denker	1a8f1ec095	10k short optic and optimizations to make large optics faster	2023-05-02 09:48:53 +02:00
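
The rank signal described in commit 7ea3dbcca4 can be illustrated with a minimal sketch. This is not the project's code: the function name `centrality_ranks`, the example hosts, and the scores are all hypothetical, and a plain in-memory sort stands in for the external sorting algorithm the commit message mentions.

```python
def centrality_ranks(centralities):
    """Map each node to its 1-based rank, highest centrality first.

    Hypothetical helper for illustration only. It sorts in memory,
    whereas the commit describes an external sort for node sets that
    do not fit in memory.
    """
    # Sort by centrality descending; break ties by node name so the
    # result is deterministic.
    ordered = sorted(centralities.items(), key=lambda kv: (-kv[1], kv[0]))
    return {node: rank for rank, (node, _) in enumerate(ordered, start=1)}


# Made-up scores: a.example and b.example differ by only 0.0011,
# yet two other hosts sit between them in the sorted list.
scores = {
    "a.example": 0.5012,
    "b.example": 0.5001,
    "c.example": 0.5010,
    "d.example": 0.5005,
    "e.example": 0.1000,
}
ranks = centrality_ranks(scores)
print(ranks["a.example"], ranks["b.example"])
```

In this toy data the two hosts are nearly identical numerically but three rank positions apart, which is the situation the commit message argues a rank-based signal captures better than the raw centrality value.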

65 commits