* [WIP] structure for mapreduce -> ampc and introduce tables in dht
* temporarily disable failing lints in ampc/mod.rs
* establish dht connection in ampc
* support batch get/set in dht
* ampc implementation (not tested yet)
* dht upsert
* no more TODOs in ampc harmonic centrality impl
* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive
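a minimal sketch of the idea; the variant names and the merge logic are illustrative, not taken from the actual commit:

```rust
use std::collections::{btree_map::Entry, BTreeMap};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpsertAction {
    /// the key was not present, so a new entry was inserted
    Inserted,
    /// the key was present and its value was merged with the new one
    Merged,
}

fn upsert(map: &mut BTreeMap<String, u64>, key: String, val: u64) -> UpsertAction {
    match map.entry(key) {
        Entry::Vacant(e) => {
            e.insert(val);
            UpsertAction::Inserted
        }
        Entry::Occupied(mut e) => {
            *e.get_mut() += val; // merge step; the real merge logic is table-specific
            UpsertAction::Merged
        }
    }
}
```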
* add ability to have multiple dht tables for each ampc algorithm
gives better type safety as each table can then have its own key-value type pair
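a sketch of what this type safety could look like; the trait and table names are hypothetical:

```rust
// each table is a type with an associated key/value pair, so a lookup such as
// `get::<Centrality>(&key)` is checked against the table's types at compile time
trait Table {
    type Key;
    type Value;
    fn name() -> &'static str;
}

struct Centrality;
impl Table for Centrality {
    type Key = u64;   // node id
    type Value = f64; // centrality score
    fn name() -> &'static str {
        "centrality"
    }
}
```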
* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully before scheduling the next mapper.
* compute centrality scores from set cardinalities.
* refactor into smaller functions
* happy path ampc dht test and split ampc into multiple files
* correct harmonic centrality calculation in ampc
* run distributed harmonic centrality worker and coordinator from cli
* stream key/values from dht using range queries in batches
* benchmark distributed centrality calculation
* faster hash in shard selection and drop table in background thread
* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost
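roughly what the rpc roundtrip looks like, assuming the bincode 2.x api (the commit may refer to a different bincode fork; the wire types here are made up):

```rust
use bincode::{Decode, Encode};

#[derive(Encode, Decode, Debug, PartialEq)]
struct Rpc {
    shard: u32,
    payload: Vec<u8>,
}

fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let config = bincode::config::standard();
    let msg = Rpc { shard: 3, payload: vec![1, 2, 3] };
    let bytes = bincode::encode_to_vec(&msg, config)?;
    let (decoded, _len): (Rpc, usize) = bincode::decode_from_slice(&bytes, config)?;
    assert_eq!(msg, decoded);
    Ok(())
}
```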
* dht store copy-on-write for keys and values to make table clone faster
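the gist of the copy-on-write layout (illustrative only):

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// keys and values live behind `Arc`, so cloning the table copies the tree
// structure but shares the underlying bytes instead of duplicating them
#[derive(Clone, Default)]
struct Table {
    inner: BTreeMap<Arc<Vec<u8>>, Arc<Vec<u8>>>,
}

impl Table {
    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.inner.insert(Arc::new(key), Arc::new(value));
    }
}
```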
* fix flaky dht test and improve .set performance using entries
* dynamic batch size based on number of shards in dht cluster
* Initial implementation of importing sites from an optic
* Removed unused import
* Updated button text
* Implemented client-side WASM to allow for parsing of imported .optic files
* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs
* CI updates
* Added vite-plugin-wasm-pack to ensure wasm modules get copied over
* CI fix >:(
* More CI attempts
* agony - CSP fix & further wasm-pack fixes
* CSP updates
* Package update to prevent an unnecessary build of wasm
* reduce bloat in ci build log from wasm
* fix another non-deterministically failing test
* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md, we can assume it has been installed on the system
* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109
* run 'npm run format'
* propagate errors from wasm crate
* Add basic CI
* Add liburing installation step to CI workflow
* Run `npm install` as part of ci/check
* Add `@types/node` package
* Add `submodules: 'recursive'` to CI
* Skip test if test data is not available
* Install `cargo-about` in CI
turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use.
the commit also includes a script to dump the html for a specific article from a zim file, which is very useful when debugging this stuff
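the detection boils down to something like this (a sketch assuming the `scraper` crate; the real code may parse the html differently):

```rust
use scraper::{Html, Selector};

fn is_disambiguation(html: &str) -> bool {
    let doc = Html::parse_document(html);
    let sel = Selector::parse(r#"meta[property="mw:PageProp/disambiguation"]"#)
        .expect("selector is valid");
    doc.select(&sel).next().is_some()
}
```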
* move crates into a 'crates' folder
* added cargo-about to check dependency licenses
* create ggml-sys bindings and build as a static library.
simple addition sanity test passes
* update licenses
* yeet alice
* yeet qa model
* yeet fact model
* [wip] idiomatic rust bindings for ggml
* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, are view and concat needed?
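the api shape being aimed for, sketched with a toy `Tensor` backed by `Vec<f32>`; the real binding would dispatch to ggml's graph ops through ggml-sys instead:

```rust
use std::ops::{Add, Mul, Sub};

#[derive(Clone, Debug)]
struct Tensor(Vec<f32>);

// element-wise ops via operator overloading, so callers can write `&a + &b`
macro_rules! elementwise {
    ($trait:ident, $method:ident, $op:tt) => {
        impl $trait for &Tensor {
            type Output = Tensor;
            fn $method(self, rhs: &Tensor) -> Tensor {
                Tensor(self.0.iter().zip(&rhs.0).map(|(a, b)| a $op b).collect())
            }
        }
    };
}

elementwise!(Add, add, +);
elementwise!(Sub, sub, -);
elementwise!(Mul, mul, *);
```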
the summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it.
doing the improvement step just once seems to significantly improve the summary.
we also now use an llm (mistral 7b) for the summarizations, as we can then use the same model for multiple tasks and serve it using gpus, thus significantly decreasing the latency.
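a sketch of the two-pass flow; `complete` stands in for a call to the llm and the prompts are made up for illustration:

```rust
fn summarize(text: &str, complete: impl Fn(&str) -> String) -> String {
    // first pass: generate a candidate summary
    let draft = complete(&format!("Summarize the following text:\n{text}"));
    // second pass: a single improvement step already helps significantly
    complete(&format!(
        "Improve the following summary of the text.\nText:\n{text}\nSummary:\n{draft}"
    ))
}
```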
it might be easier to score pages based on their rank in the sorted list of centralities. for instance, the centralities for page A and page B might be very similar numerically, but if a lot of pages are between A and B when looking at the sorted list, the highest ranking page might in reality be a better result than the lower ranking one.
the rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.
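an in-memory illustration of the rank assignment (the real implementation sorts externally since all nodes might not fit in memory at once):

```rust
fn ranks(mut centralities: Vec<(u64, f64)>) -> Vec<(u64, usize)> {
    // highest centrality first, so rank 0 is the most central node
    centralities.sort_by(|a, b| b.1.total_cmp(&a.1));
    centralities
        .into_iter()
        .enumerate()
        .map(|(rank, (node, _centrality))| (node, rank))
        .collect()
}
```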
* Add fresh frontend
This reimplements the existing frontend using Fresh. Primary highlights of
this new frontend are:
- Uses deno instead of node/npm for fewer dependencies. Deno for example
includes a formatter and linter, and dependencies are downloaded
automatically.
- Everything is TypeScript. There is no more .astro or similar, which
reduces complexity.
- The frontend is built up of components entirely, which can either be
server side rendered only, or rehydrated on the client for
interactivity (islands).
- Fresh server side renders all requests, populated using the API,
which is type-safe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of
interactivity, which previously needed to be written in external JS
files. Now these are Preact components and can use all the benefits
that come from this.
Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`
* Update fresh frontend to use "type" rather than "@type"
* Add placeholder Tailwind config for VSCode intellisense
* Add discussions UI
* Clean up some left over template `{{...}}`
* './icons' might not exist before generation
* some UI/UX changes for consistency with old frontend
* Remove unused ENABLE_CSP flag since it is always enabled now
* Store icons used for the frontend in the repository
* Don't generate icons when starting the frontend
* Fix chat textarea sizing in Firefox
* Add Chat UI to new frontend
* Only allow one of liked, disliked, blocked at a time
* Add `cursor-pointer` to safe search radio buttons
* Add `leading-6` to articles to get more line spacing
Almost equivalent to the old frontend
* Prefix explore and site ranking links with https://
Perhaps we should determine the protocol in a more robust way?
* Fix explore sites regressions from adding tailwind-forms
* Refactor manage optics UI
* Add API endpoint for exporting optic from site rankings
`/beta/api/sites/export` is a JSON equivalent of the existing
`/settings/sites/export` endpoint.
* Add "Export as optic" and "Clear all and export as optic" buttons
These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic
* Store site rankings in URL and send it during searching
* Use the tailwind config to extend the twind theme
* Add `/beta/api/explore/export` API endpoint
* Fix optics export button on explore
* Reflect the currently searched optic in the optic selector
* Add `noscript:hidden` class to hide e.g. search result adjust buttons
* Re-search when changing ranking of a webpage
* Refactor searchbar interaction and suggestion highlighting
We now do the highlighting on the frontend
* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.
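a sketch of the lookup, assuming the `psl` crate (an assumption; any public suffix list implementation behaves the same way):

```rust
// "site.netlify.app" stays "site.netlify.app" because "netlify.app" is itself
// a public suffix, while "www.example.com" collapses to "example.com"
fn blockable_domain(host: &str) -> Option<&str> {
    psl::domain_str(host)
}
```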
* Pass around `queryUrlPart` between pages
* Do syntax highlighting server-side using HighlightJS
* Remove `facebook.com` as default site in explore
* Add webmasters page to new frontend
* Remove old frontend
* Remove dead code from old Rust frontend
* Rename webstract to frontend
* remove more stuff from old frontend
---------
Co-authored-by: Mikkel Denker <mikkel@trystract.com>
* This should fix the byte/char index mixups identified in issue 77
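for reference, the class of bug being fixed (std-only demonstration):

```rust
fn demo() {
    let s = "héllo";
    assert_eq!(s.chars().count(), 5); // char count
    assert_eq!(s.len(), 6); // byte count: 'é' takes two bytes

    // char_indices yields *byte* offsets, which are the only safe slice indices
    let (byte_idx, _ch) = s.char_indices().nth(2).unwrap();
    assert_eq!(&s[byte_idx..], "llo");
}
```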
* script to generate dataset
* naive bayes classification with tf-idf features
* Add prediction confidence to naive bayes.
we report the confidence as `log_probs[best] / sum(log_probs)`.
I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but it should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability.
Also, even though naive bayes is a pretty decent classifier, some people on stackexchange report that it's a pretty bad probability estimator. Further tests will determine if this confidence score is actually useful.
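the calculation itself is straightforward (a sketch mirroring the formula above):

```rust
/// confidence as described above: the log prob of the best class divided by
/// the sum of all per-class log probs (all of which are negative)
fn confidence(log_probs: &[f64]) -> f64 {
    let best = log_probs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let sum: f64 = log_probs.iter().sum();
    best / sum
}
```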
* naive bayes benchmark
* store safe search classification in index
* search preferences page where user can control safe search settings
Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and execute the onnx models on GPU, since onnx is only compiled for older cuda versions. This commit removes our dependency on onnx and replaces it with direct bindings to libtorch, which gives us more flexibility and still allows us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or if we may want to develop some kind of JIT that can fuse matrix operations to increase performance.
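a sketch of what serving a traced model through libtorch bindings could look like, assuming the `tch` crate (the model path and tensor shapes are illustrative):

```rust
use tch::{CModule, Tensor};

fn classify(model_path: &str, features: &[f32]) -> Result<Tensor, tch::TchError> {
    // a model exported from python with `torch.jit.trace`
    let model = CModule::load(model_path)?;
    let input = Tensor::from_slice(features).unsqueeze(0); // add a batch dimension
    model.forward_ts(&[input])
}
```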