Commit graph

65 commits

Author SHA1 Message Date
Mikkel Denker
8beb2ac778 use 'chdir' instead of 'cd' as 'cd' is not a binary on most linux systems
credit to Quackdoc on matrix for fixing this
2024-07-25 17:26:32 +02:00
Mikkel Denker
8f97617904 make ml models optional during setup 2024-07-25 15:16:32 +02:00
Mikkel Denker
303a2cf2da accept unicode-3.0 license 2024-06-27 17:12:28 +02:00
Mikkel Denker
385e8375c6 deduplicate urls during indexing 2024-06-24 10:43:12 +02:00
Mikkel Denker
817bda9738 optionally merge all webgraph segments into a single segment for improved read performance 2024-06-09 14:48:11 +02:00
Mikkel Denker
a1381d667b fixed bug that caused error model in spell correction to always be empty 2024-05-27 11:45:07 +02:00
Mikkel Denker
af73d33b39 forgot to push new accepted licenses 2024-05-23 13:01:34 +02:00
Mikkel Denker
d9848b19b2 use snippet in nsfw script instead of body 2024-04-24 09:32:58 +02:00
Mikkel Denker
fe56664e4d cargo check all features in ci 2024-04-23 20:28:02 +02:00
Oliver Bøving
18d9d279fb
Cratify bloom and speedy-kv (#193)
* Move bloom into separate crate

* Move speedy_kv into a separate crate

* add licenses

---------

Co-authored-by: Mikkel Denker <mikkel@stract.com>
2024-04-22 21:18:44 +02:00
Mikkel Denker
3ab4f944e0
MapReduce -> AMPC (#189)
* [WIP] structure for mapreduce -> ampc and introduce tables in dht

* temporarily disable failing lints in ampc/mod.rs

* establish dht connection in ampc

* support batch get/set in dht

* ampc implementation (not tested yet)

* dht upsert

* no more todo's in ampc harmonic centrality impl

* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive

* add ability to have multiple dht tables for each ampc algorithm
gives better type-safety as each table can then have their own key-value type pair

* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully at a time before scheduling next mapper.
* compute centrality scores from set cardinalities.

* refactor into smaller functions

* happy path ampc dht test and split ampc into multiple files

* correct harmonic centrality calculation in ampc

* run distributed harmonic centrality worker and coordinator from cli

* stream key/values from dht using range queries in batches

* benchmark distributed centrality calculation

* faster hash in shard selection and drop table in background thread

* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost

* dht store copy-on-write for keys and values to make table clone faster

* fix flaky dht test and improve .set performance using entries

* dynamic batch size based on number of shards in dht cluster
2024-04-15 10:29:33 +02:00
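The `UpsertAction` bullet in #189 above argues that a descriptive return type beats a bare `bool`. A minimal sketch of that idea follows. The variant names, the `upsert_max` helper and the "keep the larger value" merge rule are assumptions made for illustration, not the project's actual DHT API.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// What an upsert did to the table. Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpsertAction {
    /// The key was not present and has been inserted.
    Inserted,
    /// The key was present and its value was updated.
    Merged,
    /// The key was present and the stored value was kept as-is.
    NoChange,
}

/// Hypothetical upsert that keeps the larger of the stored and the new value.
pub fn upsert_max(table: &mut HashMap<String, u64>, key: &str, value: u64) -> UpsertAction {
    match table.entry(key.to_string()) {
        Entry::Vacant(slot) => {
            slot.insert(value);
            UpsertAction::Inserted
        }
        Entry::Occupied(mut slot) => {
            if value > *slot.get() {
                slot.insert(value);
                UpsertAction::Merged
            } else {
                UpsertAction::NoChange
            }
        }
    }
}
```

A caller can then `match` on the result instead of guessing what `true` or `false` meant, for example to count how many keys actually changed in a round.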
Mikkel Denker
2258243bc2 run clippy in CI 2024-03-21 17:10:03 +01:00
Mikkel Denker
5ce97abf46
Run frontend lint in CI (#180)
Adds `npm run lint` to CI and fixes all the previous lint errors.
2024-03-13 09:48:07 +01:00
Wesley Appler
25c0344578
[WIP] Implement the importing of optics (#167)
* Initial implementation of importing sites from an optic

* Removed unused import

* Updated button text

* Implemented client-side WASM to allow for parsing of imported .optic files

* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs

* CI updates

* Added vite-plugin-wasm-pack to ensure wasm modules get copied over

* CI fix >:(

* More CI attempts

* agony - CSP fix & further wasm-pack fixes

* CSP updates

* Package update to prevent an unnecessary build of wasm

* reduce bloat in ci build log from wasm

* fix another non-deterministically failing test

* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system

* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109

* run 'npm run format'

* propagate errors from wasm crate
2024-02-28 17:01:32 +01:00
Mikkel Denker
6b9d514a5b temporarily disable frontend type check in CI 2024-02-28 11:34:19 +01:00
Mikkel Denker
df577020b1 torch is needed in scripts to export models during 'just configure' 2024-02-21 13:28:18 +01:00
Oliver Bøving
2d8973bcf7
Add basic CI (#156)
* Add basic CI

* Add liburing installation step to CI workflow

* Run `npm install` as part of ci/check

* Add `@types/node` package

* Add `submodules: 'recursive'` to CI

* Skip test if test data is not available

* Install `cargo-about` in CI
2024-02-17 20:09:58 +01:00
Mikkel Denker
8a92bc39ed add code of conduct 2024-02-15 10:12:30 +01:00
Mikkel Denker
0b69853fa9 chore: 'cargo update' and remove some unused trait method.
also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.
2024-02-12 13:49:20 +01:00
Mikkel Denker
aa89813906 move some of the hardcoded snippet choices into the configuration file 2024-02-06 11:19:42 +01:00
Mikkel Denker
e4e3044e47 finally ditch that pesky libtorch dependency! 2024-02-02 13:11:06 +01:00
Mikkel Denker
d7e564d91a move neural network models from torch to candle 2024-02-02 12:36:39 +01:00
Mikkel Denker
ea3b7a4099 implement some layers in ggml
linear, embedding and multihead attention
2024-01-31 17:51:02 +01:00
Mikkel Denker
f4e7d1972c actually skip disambiguation pages for entity index.
turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use.

the commit also includes a useful script to dump the html for a specific article from a zim file which is very useful when debugging this stuff
2024-01-29 09:59:39 +01:00
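The disambiguation fix above relies on a `<meta property='mw:PageProp/disambiguation'>` element in the zim dump. A rough sketch of such a check is shown below. It uses plain string scanning to stay dependency-free, which is an assumption made here; the function name is hypothetical and a real implementation would more likely query a parsed HTML document.

```rust
/// Returns true if the HTML contains a `<meta>` tag whose `property`
/// attribute is `mw:PageProp/disambiguation` (hypothetical helper).
pub fn is_disambiguation_page(html: &str) -> bool {
    // ASCII lowercasing keeps byte offsets intact and makes the match
    // case-insensitive.
    let lower = html.to_ascii_lowercase();
    let mut rest = lower.as_str();

    while let Some(start) = rest.find("<meta") {
        let tag = &rest[start..];
        // The tag ends at the next '>' (or at end of input if unterminated).
        let end = tag.find('>').unwrap_or(tag.len());
        let tag_body = &tag[..end];

        // Accept both quoting styles for the attribute value.
        if tag_body.contains("property=\"mw:pageprop/disambiguation\"")
            || tag_body.contains("property='mw:pageprop/disambiguation'")
        {
            return true;
        }
        rest = &tag[end..];
    }
    false
}
```

Only `<meta>` tags are inspected, so the marker text appearing in the article body does not trigger a match.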
Mikkel Denker
b3bcda2dfe simple script to dump article html from a zim file 2024-01-29 09:19:25 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00
Mikkel Denker
cc91935d0a Move entity index out of normal search index and have dedicated search server for it 2024-01-23 14:53:33 +01:00
Mikkel Denker
fbc01ad865 summarization using mistral and 'chain-of-density' approach.
the summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it.
doing the improvement step just once seems to significantly improve the summary.
we also now use an llm (mistral 7b) for the summarisations, as we can then use the same model for multiple tasks and serve it using gpus, thus significantly decreasing the latency.
2024-01-19 11:08:17 +01:00
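The summarization commit above describes a two-stage flow: generate a candidate summary, then ask the model to improve it, with one improvement step already helping noticeably. The control flow can be sketched as below. The `model` closure stands in for the LLM call, and the function name and prompt wording are placeholders invented for illustration, not the prompts used in the project.

```rust
/// Generate a candidate summary, then refine it `refinement_steps` times.
/// `model` is a stand-in for an LLM call that maps a prompt to a completion.
pub fn summarize_with_refinement<M>(model: M, text: &str, refinement_steps: usize) -> String
where
    M: Fn(&str) -> String,
{
    // Step 1: candidate summary (placeholder prompt).
    let mut summary = model(&format!("Summarize the following text:\n{text}"));

    // Step 2: feed the candidate back in and ask for an improved version.
    for _ in 0..refinement_steps {
        let prompt = format!(
            "Text:\n{text}\n\nCurrent summary:\n{summary}\n\n\
             Rewrite the summary so it is denser and more informative."
        );
        summary = model(&prompt);
    }
    summary
}
```

With `refinement_steps = 1` the model is called exactly twice, which matches the "improve just once" setting described in the commit message.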
Mikkel Denker
7ea3dbcca4 [ranking] add a host_centrality_rank and page_centrality_rank signal
it might be easier to score pages based on their rank in the sorted list of centralities. for instance the centralities for page A and page B might be very similar numerically, but if a lot of pages are between A and B when looking at the sorted list, the highest ranking page might in reality be a better result than the lower ranking one.

the rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.
2024-01-05 12:20:24 +01:00
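The rank signals described above replace a raw centrality value with the node's position in the sorted list. A small sketch of that mapping follows. It sorts in memory for brevity, whereas the commit uses an external sort because the full node set may not fit in memory; the function name and the `(node id, centrality)` input shape are assumptions.

```rust
use std::collections::HashMap;

/// Map each node id to its rank, where rank 0 is the highest centrality.
/// In-memory stand-in for the external sort mentioned in the commit.
pub fn centrality_ranks(centralities: &[(u64, f64)]) -> HashMap<u64, u64> {
    let mut sorted: Vec<(u64, f64)> = centralities.to_vec();

    // Descending by centrality; ties are broken by node id so the result
    // is deterministic.
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));

    sorted
        .into_iter()
        .enumerate()
        .map(|(rank, (node, _))| (node, rank as u64))
        .collect()
}
```

Two pages whose centralities differ only slightly can still end up far apart in rank if many other pages fall between them, which is the effect the signal is meant to capture.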
Mikkel Denker
54fe19ddf6 trystract.com -> stract.com 2023-12-16 14:43:00 +01:00
Oliver Bøving
369d5031df
Refactor Justfile and tracing with enabled debug tracing for stract (#87)
* Refactor Justfile and tracing with enabled debug tracing for stract

* Use `just dev` in `CONTRIBUTING.md`
2023-09-04 08:53:17 +00:00
Oliver Bøving
c7e941f3c4
Rename Rust frontend to api (#86) 2023-09-04 08:24:56 +00:00
Oliver Bøving
072a6323e9
🍋 Fresh frontend (#84)
* Add fresh frontend

This reimplements the existing frontend using Fresh. Primary highlights of
this new frontend are:

- Uses deno instead of node/npm for less dependencies. Deno for example
  includes a formatter and linter, and dependencies are downloaded
  automatically.
- Everything is TypeScript. There is no more .astro or similar, which
  reduces complexity.
- The frontend is built up of components entirely, which can either be
  server side rendered only, or rehydrated on the client for
  interactivity (islands).
- Fresh server side renders all requests, populated by using the API,
  which is typesafe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of
  interactivity, which previously needed to be written in external JS files.
  Now these are Preact components and can use all the benefits that come
  from this.

Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`

* Update fresh frontend to use "type" rather than "@type"

* Add placeholder Tailwind config for VSCode intellisense

* Add discussions UI

* Clean up some left over template `{{...}}`

* './icons' might not exist before generation

* some UI/UX changes for consistency with old frontend

* Remove unused ENABLE_CSP flag since it is always enabled now

* Store icons used for the frontend in the repository

* Don't generate icons when starting the frontend

* Fix chat textarea sizing in Firefox

* Add Chat UI to new frontend

* Only allow one of liked, disliked, blocked at a time

* Add `cursor-pointer` to safe search radio buttons

* Add `leading-6` to articles to get more line spacing

Almost equivalent to the old frontend

* Prefix explore and site ranking links with https://

Perhaps we should determine the protocol in a more robust way?

* Fix explore sites regressions from adding tailwind-forms

* Refactor manage optics UI

* Add API endpoint for exporting optic from site rankings

`/beta/api/sites/export` is a JSON equivalent of the existing
`/settings/sites/export` endpoint.

* Add "Export as optic" and "Clear all and export as optic" buttons

These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic

* Store site rankings in URL and send it during searching

* Use the tailwind config to extend the twind theme

* Add `/beta/api/explore/export` API endpoint

* Fix optics export button on explore

* Reflect the currently searched optic in the optic selector

* Add `noscript:hidden` class to hide fx search result adjust buttons

* Re-search when changing ranking of a webpage

* Refactor searchbar interaction and suggestion highlighting

We now do the highlighting on the frontend

* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.

* Pass around `queryUrlPart` between pages

* Do syntax highlighting server-side using HighlightJS

* Remove `facebook.com` as default site in explore

* Add webmasters page to new frontend

* Remove old frontend

* Remove dead code from old Rust frontend

* Rename webstract to frontend

* remove more stuff from old frontend

---------

Co-authored-by: Mikkel Denker <mikkel@trystract.com>
2023-09-04 05:59:28 +00:00
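The domain-blocking bullet in #84 above depends on how the public suffix list defines a domain: the matching public suffix plus one more label. The sketch below illustrates that rule with a tiny hard-coded suffix table. The table, the function name and the omission of wildcard and exception rules are all simplifications assumed here; the real list is far larger.

```rust
/// Tiny, hypothetical excerpt of public suffixes, for illustration only.
const SUFFIXES: &[&str] = &["com", "app", "netlify.app", "co.uk"];

/// Return the public suffix plus one label, or None if the host has no
/// known suffix or is itself a public suffix.
pub fn registrable_domain(host: &str) -> Option<String> {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();

    // Candidates go from the whole host down to the last label, so the
    // longest matching suffix is found first.
    for start in 0..labels.len() {
        let candidate = labels[start..].join(".");
        if SUFFIXES.iter().any(|s| *s == candidate.as_str()) {
            if start == 0 {
                return None; // the host is a bare public suffix
            }
            return Some(labels[start - 1..].join("."));
        }
    }
    None
}
```

Because `netlify.app` is treated as a suffix, `site.netlify.app` maps to itself and can be blocked without affecting other Netlify sites, while `www.example.com` collapses to `example.com`.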
Mikkel Denker
24ffa32983
Safe search (#79)
* This should fix the byte/char index mixups identified in issue 77

* script to generate dataset

* naive bayes classification with tf-idf features

* Add prediction confidence to naive bayes.
we report the confidence as $log_probs[best] / sum(log_probs)$.
I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability.

Also, even though naive bayes is a pretty decent classifier some people on stackexchange report that it's a pretty bad probability estimator. Further tests will determine if this confidence score is actually useful.

* naive bayes benchmark

* store safe search classification in index

* search preferences page where user can control safe search settings
2023-08-26 16:36:33 +00:00
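The safe search commit above reports confidence as `log_probs[best] / sum(log_probs)`. Taking that formula literally, a sketch could look like the following. The function name and the slice-of-`f64` input are assumptions, and, as the commit message itself cautions, the result should not be read as a calibrated probability.

```rust
/// Pick the class with the highest log-probability and report
/// `log_probs[best] / sum(log_probs)` as the confidence score.
/// Returns None for an empty input.
pub fn predict_with_confidence(log_probs: &[f64]) -> Option<(usize, f64)> {
    // Index and value of the largest log-probability.
    let (best, best_log_prob) = log_probs
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))?;

    let sum: f64 = log_probs.iter().sum();

    // Literal reading of the formula from the commit message.
    Some((best, best_log_prob / sum))
}
```

For example, class log-probabilities of `-1.0` and `-3.0` select class 0 with a score of `-1.0 / -4.0 = 0.25`.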
Mikkel Denker
7dbd149075 Support indieweb only optics.
Still need to update the quickstart and blogroll optic to include the new match location.
2023-08-18 16:13:45 +02:00
Mikkel Denker
1af94339bb increase max politeness, faster setup and some documentation 2023-07-13 17:53:46 +02:00
Mikkel Denker
4f1a7079e4 formatting and raw http for s3 endpoint 2023-07-11 17:47:42 +02:00
Mikkel Denker
af1e43206b Normalize redirects in web graph 2023-07-06 16:39:52 +02:00
Mikkel Denker
dda4529a2e getting alice cuda acceleration to actually work (hopefully) 2023-06-24 17:28:10 +02:00
Mikkel Denker
26ce6b69f3 create libtorch symlinks 2023-06-24 15:59:04 +02:00
Mikkel Denker
1f1dd5f588 added libtorch env stuff to justfile runs 2023-06-23 19:35:33 +02:00
Mikkel Denker
1e95f94207 download libtorch from python script since we need to download from pytorch website if compiling for linux 2023-06-23 17:11:26 +02:00
Mikkel Denker
18f7ef1842 Alice; show claim confidence level 2023-06-07 15:43:33 +02:00
Mikkel Denker
b16a1b9629 alice 2023-06-01 15:43:27 +02:00
Mikkel Denker
cb64b49ad9 Fixed a bug where distance calculation in online-harmonic used the wrong node from the edge 2023-05-10 16:29:47 +02:00
Mikkel Denker
f0129d724f find similar sites in webgraph 2023-05-09 11:44:25 +02:00
Mikkel Denker
72d1086672 Dual-encoder as passage scorer for extractive summarization 2023-05-08 15:43:22 +02:00
Mikkel Denker
fe713a8737 Move from onnx to libtorch bindings for ML inference.
Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and execute the onnx models on GPU since onnx is only compiled to older cuda versions. This commit removes our dependency to onnx and replaces it with direct bindings to libtorch which gives us more flexibility and still allows us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or if we may want to develop some kind of JIT that can fuse matrix operations to increase performance.
2023-05-08 11:11:49 +02:00
Mikkel Denker
be78c1dab5 blogroll optic 2023-05-02 11:30:51 +02:00
Mikkel Denker
1a8f1ec095 10k short optic and optimizations to make large optics faster 2023-05-02 09:48:53 +02:00