* add docusaurus scalar api documentation structure
* bump openapi 3.0 to 3.1 so we can mark internal endpoints
* improve search api docs
* webgraph api docs
* point docs to prod
* overall structure for new webgraph store
* webgraph schema structure and HostLinksQuery
* deserialize edge
* forward/backlink queries
* full edge queries and iter smalledges
* [wip] use new store in webgraph
* remove id2node db
* short-circuit link queries
* [wip] remote webgraph trait structure
* [wip] shard awareness
* finish remote webgraph trait structure
* optimize read
* merge webgraphs
* construct webgraph store
* make sure 'just configure' works and everything looks correct
* ranking diff tool structure
* fix missing icon types
* add admin for queries and experiments
* minor cleanup
* show experiment progress
* upgrade node adapter for svelte
* hopefully fix ci
* display common queries between experiments
* display serp diffs with top signals for each result
* like experiments and show overview in queries
* settings to toggle experiment shuffle and show/hide signals
* keyboard shortcuts
* visualise improvements by query category
* document how to use tool
* Initial implementation of importing sites from an optic
* Removed unused import
* Updated button text
* Implemented client-side WASM to allow for parsing of imported .optic files
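A minimal sketch of what the WASM entry point can look like, assuming wasm-bindgen; `optics::parse` is a hypothetical stand-in for the actual parser in the optics crate:
```rust
// Hypothetical sketch: expose optic parsing to the browser via wasm-bindgen.
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn parse_optic(contents: &str) -> Result<(), JsError> {
    // Parse the uploaded .optic file and surface any syntax error to the
    // JS caller as a thrown exception. `optics::parse` is illustrative.
    optics::parse(contents)
        .map(|_| ())
        .map_err(|e| JsError::new(&e.to_string()))
}
```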
* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs
* CI updates
* Added vite-plugin-wasm-pack to ensure wasm modules get copied over
* CI fix >:(
* More CI attempts
* agony - CSP fix & further wasm-pack fixes
* CSP updates
* Package update to prevent an unnecessary build of wasm
* reduce bloat in ci build log from wasm
* fix another non-deterministically failing test
* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). As it is already added as a step in CONTRIBUTING.md, we can assume it has been installed on the system.
* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109
* run 'npm run format'
* propagate errors from wasm crate
* move crates into a 'crates' folder
* added cargo-about to check dependency licenses
* create ggml-sys bindings and build as a static library.
simple addition sanity test passes
* update licenses
* yeet alice
* yeet qa model
* yeet fact model
* [wip] idiomatic rust bindings for ggml
* [ggml] mul, add and sub ops implemented for tensors.
I think it would be easier to try to implement a BERT model in order to figure out which ops we should include in the binding; for instance, are view and concat needed?
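For illustration, a minimal sketch of an idiomatic layer over the raw bindings for add; the `Context` and `Tensor` wrappers are hypothetical, while `ggml_sys::ggml_add` mirrors the ggml C API:
```rust
// Hypothetical wrappers over the raw ggml-sys bindings.
use std::ops::Add;

pub struct Context {
    ptr: *mut ggml_sys::ggml_context,
}

#[derive(Clone, Copy)]
pub struct Tensor<'a> {
    ptr: *mut ggml_sys::ggml_tensor,
    ctx: &'a Context,
}

impl<'a> Add for Tensor<'a> {
    type Output = Tensor<'a>;

    fn add(self, rhs: Tensor<'a>) -> Tensor<'a> {
        // Safety: both tensors must come from the same ggml context; the
        // lifetime on `Tensor` ties them to it.
        let ptr = unsafe { ggml_sys::ggml_add(self.ctx.ptr, self.ptr, rhs.ptr) };
        Tensor { ptr, ctx: self.ctx }
    }
}
```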
The summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it.
Doing the improvement step just once seems to significantly improve the summary.
We also now use an LLM (Mistral 7B) for the summarizations, as we can then use the same model for multiple tasks and serve it using GPUs, thus significantly decreasing the latency.
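A minimal sketch of the candidate-then-improve flow; `Llm` and `generate` are hypothetical stand-ins for the Mistral 7B serving layer, and the prompts are illustrative:
```rust
// `Llm` is a hypothetical stand-in for the model serving layer.
pub trait Llm {
    fn generate(&self, prompt: &str) -> String;
}

pub fn summarize(llm: &dyn Llm, text: &str) -> String {
    // First pass: generate a candidate summarization.
    let candidate = llm.generate(&format!("Summarize the following text:\n{text}"));

    // Second pass: ask the model to improve its own candidate. Doing this
    // just once already improves the summary significantly.
    llm.generate(&format!(
        "Text:\n{text}\n\nDraft summary:\n{candidate}\n\nImprove the draft summary."
    ))
}
```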
It might be easier to score pages based on their rank in the sorted list of centralities. For instance, the centralities of page A and page B might be numerically very similar, but if a lot of pages sit between A and B in the sorted list, the higher-ranking page might in reality be a much better result than the lower-ranking one.
The rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.
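A minimal sketch of the rank assignment, assuming u64 node ids; a plain in-memory sort stands in here for the external sort used in the real system:
```rust
// Turn raw centrality scores into ranks.
fn centrality_ranks(centralities: Vec<(u64, f64)>) -> Vec<(u64, u64)> {
    let mut pairs = centralities;
    // Sort descending by centrality score (done externally in the real system).
    pairs.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));

    // A node's rank is its position in the sorted order, so two nodes with
    // nearly identical centralities can still get very different ranks if
    // many nodes fall between them.
    pairs
        .into_iter()
        .enumerate()
        .map(|(rank, (node, _centrality))| (node, rank as u64))
        .collect()
}
```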
The webgraph storage is now essentially a '(from, to) -> label' map stored in rocksdb databases.
This heavily simplifies inserts and merges, since we can now insert new edges directly into the db without having to read the existing edges.
Get operations now use a prefix iterator from rocksdb. This exploits the fact that '[{from_bytes},0,0,0,...]' is a prefix of any '[{from_bytes},{to_bytes}]' that might have been inserted into the database.
Assuming we use a sufficiently large read-ahead size, I think there shouldn't be a noticeable increase in IO operations for get operations, and thus no noticeable performance penalty. In fact, they might be a bit faster in practice, since we no longer have to deserialize a hashmap and rocksdb seems to be more tuned for small key-value sizes.
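A minimal sketch of the prefix-scan lookup using the `rocksdb` crate, assuming node ids are serialized as 8 big-endian bytes with the edge label as the value; the real key/value encoding may differ:
```rust
use rocksdb::DB;

// Collect all outgoing edges of `from` by scanning keys that share its prefix.
fn outgoing_edges(db: &DB, from: u64) -> Vec<(u64, String)> {
    let prefix = from.to_be_bytes();

    db.prefix_iterator(prefix)
        .filter_map(Result::ok)
        // Every key '[{from_bytes},{to_bytes}]' for this node shares the
        // '{from_bytes}' prefix; stop once we leave it.
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, label)| {
            // The trailing 8 bytes of the key hold `to`.
            let mut to = [0u8; 8];
            to.copy_from_slice(&key[8..16]);
            (u64::from_be_bytes(to), String::from_utf8_lossy(&label).into_owned())
        })
        .collect()
}
```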
* remove deno frontend
* Add Svelte frontend
* change frontend port to 8000 and autofocus searchbar on frontpage
* Setup formatting of the new frontend with the new monorepo
* Add "show more" button to explore
* Add searchbar arrow key navigation
* Update query based on navigation in search bar
* Highlight matching prefix in search results
* Add toggling of site rankings to search results
* Fix crashing when having multiple semi-identical optics
* Refactor searchbar visibility
---------
Co-authored-by: Mikkel Denker <mikkel@trystract.com>
* Add fresh frontend
This reimplements the existing frontend using Fresh. Primary highlights of
this new frontend are:
- Uses deno instead of node/npm for fewer dependencies. Deno for example
includes a formatter and linter, and dependencies are downloaded
automatically.
- Everything is TypeScript. There is no more .astro or similar, which
reduces complexity.
- The frontend is built up entirely of components, which can either be
server side rendered only, or rehydrated on the client for
interactivity (islands).
- Fresh server side renders all requests, populated by using the API,
which is typesafe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of
interactivity, which previously needed to be written in external JS files.
Now these are Preact components and can use all the benefits that come
from this.
Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`
* Update fresh frontend to use "type" rather than "@type"
* Add placeholder Tailwind config for VSCode intellisense
* Add discussions UI
* Clean up some left over template `{{...}}`
* './icons' might not exist before generation
* some UI/UX changes for consistency with old frontend
* Remove unused ENABLE_CSP flag since it is always enabled now
* Store icons used for the frontend in the repository
* Don't generate icons when starting the frontend
* Fix chat textarea sizing in Firefox
* Add Chat UI to new frontend
* Only allow one of liked, disliked, blocked at a time
* Add `cursor-pointer` to safe search radio buttons
* Add `leading-6` to articles to get more line spacing
Almost equivalent to the old frontend
* Prefix explore and site ranking links with https://
Perhaps we should determine the protocol in a more robust way?
* Fix explore sites regressions from adding tailwind-forms
* Refactor manage optics UI
* Add API endpoint for exporting optic from site rankings
`/beta/api/sites/export` is a JSON equivalent of the existing
`/settings/sites/export` endpoint.
* Add "Export as optic" and "Clear all and export as optic" buttons
These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic
* Store site rankings in URL and send it during searching
* Use the tailwind config to extend the twind theme
* Add `/beta/api/explore/export` API endpoint
* Fix optics export button on explore
* Reflect the currently searched optic in the optic selector
* Add `noscript:hidden` class to hide e.g. the search result adjust buttons
* Re-search when changing ranking of a webpage
* Refactor searchbar interaction and suggestion highlighting
We now do the highlighting on the frontend
* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.
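A minimal sketch of that lookup, here using the `psl` crate (the crate actually used may differ):
```rust
// Resolve the blockable domain for a host via the public suffix list.
fn blocked_domain(host: &str) -> Option<&str> {
    // 'netlify.app' is itself on the public suffix list, so for
    // 'site.netlify.app' this returns 'site.netlify.app' rather than
    // 'netlify.app': blocking one netlify site doesn't block them all.
    psl::domain_str(host)
}
```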
* Pass around `queryUrlPart` between pages
* Do syntax highlighting server-side using HighlightJS
* Remove `facebook.com` as default site in explore
* Add webmasters page to new frontend
* Remove old frontend
* Remove dead code from old Rust frontend
* Rename webstract to frontend
* remove more stuff from old frontend
---------
Co-authored-by: Mikkel Denker <mikkel@trystract.com>
Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and to execute the onnx models on GPU, since onnx is only compiled against older CUDA versions. This commit removes our dependency on onnx and replaces it with direct bindings to libtorch, which gives us more flexibility while still allowing us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or whether we may want to develop some kind of JIT that can fuse matrix operations to increase performance.
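As a rough illustration of the new approach, a minimal sketch using the `tch` crate (libtorch bindings) that loads a traced model and runs it on GPU when available; the model path and input shape are illustrative:
```rust
use tch::{CModule, Device, Kind, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Prefer GPU when libtorch was built with CUDA support.
    let device = Device::cuda_if_available();

    // Load a model exported with torch.jit.trace (path is illustrative).
    let model = CModule::load_on_device("model.traced.pt", device)?;

    // Run a dummy input through the traced model.
    let input = Tensor::rand(&[1, 128], (Kind::Float, device));
    let output = model.forward_ts(&[input])?;
    output.print();

    Ok(())
}
```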
* Store all schema_org from webpages in a field
* flatten json tokenizer
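A minimal sketch of the flattening idea behind the tokenizer, assuming the schema_org data is JSON handled with `serde_json`; the exact token format here is illustrative:
```rust
use serde_json::Value;

// Flatten a nested JSON object into 'path.to.key=value' strings that can
// be indexed as plain terms.
fn flatten(prefix: &str, value: &Value, out: &mut Vec<String>) {
    match value {
        Value::Object(map) => {
            for (key, v) in map {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}.{key}")
                };
                flatten(&path, v, out);
            }
        }
        // Each array element is flattened under the same path.
        Value::Array(items) => {
            for v in items {
                flatten(prefix, v, out);
            }
        }
        other => out.push(format!("{prefix}={other}")),
    }
}
```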
* rename goggles -> optics
* update optics syntax
* cargo workspace
* very simple lsp wasm connection
* optics as separate package
* hover stuff
* optics vscode extension published
* syntax errors on-save and begin schema-field
* Use separate targets for LSP and rest (#68)
By moving the different targets into separate workspaces, we avoid some
of the issues where rust-analyzer might just stop working.
By adding the two projects to .vscode/settings.json we keep the ability
to get completions, goto definitions, rename, and such operations.
This requires us to specify the dependency versions in the LSP crate, as
we can no longer refer to them by the workspace version. The upside is
that the WASM/LSP-dependent crates are now moved to the LSP crate.
* schema.org syntax in optic
* optic can now perform schema searches
* simplified schema_org flattening
* wrote new quickstart.optic
* update like-text
Co-authored-by: Oliver Bøving <oliver@bvng.dk>
* Parse DMOZ data
* index topics as facets
* calculate topic centrality
* fix serious bug in webgraph where some nodes disappeared (there is still a bug somewhere, but far fewer nodes are missing now)
* apply topic centrality during search
* Add autosuggest scrape as a separate command
* Save queries continuously
* Save images as they get downloaded (way lower memory usage)
* Created configure subcommand
* Updated justfile and setup documentation
* refactor network communication into a separate module and make mapreduce async again
* sonic module is simple enough as is
* rename Searcher -> LocalSearcher
* [WIP] distributed searcher structure outlined
* split index search into initial and retrieval steps
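Roughly, the split works like the following sketch (all types here are illustrative): shards first answer with lightweight scored hits, and full documents are retrieved only for the merged top-k.
```rust
// Illustrative types for the two-step search flow.
struct InitialHit {
    shard: usize,
    doc: u64,
    score: f64,
}

struct Webpage;

struct Shard;

impl Shard {
    fn initial_search(&self, _query: &str, _k: usize) -> Vec<InitialHit> {
        Vec::new()
    }

    fn retrieve(&self, _doc: u64) -> Webpage {
        Webpage
    }
}

fn search(shards: &[Shard], query: &str, k: usize) -> Vec<Webpage> {
    // Initial step: cheap, score-only search on every shard.
    let mut hits: Vec<InitialHit> = shards
        .iter()
        .flat_map(|s| s.initial_search(query, k))
        .collect();

    hits.sort_unstable_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(k);

    // Retrieval step: fetch full documents for the merged top-k only.
    hits.iter().map(|h| shards[h.shard].retrieve(h.doc)).collect()
}
```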
* distributed searcher searching shards
* make bucket in collector generic
* no more todo!s. Waiting for indexing to finish to test implementation
* distributed searcher seems to work. Needs an enormous refactor - the code is really ugly
* cleanup search-server on exit in justfile
* move signal from goggles into ranking module
* refactor webpage test-constructor
* add page_centrality field
* use page centrality during ranking
* small justfile refactoring
* update index in lfs
This moves building the Astro frontend from build.rs into the justfile.
This streamlines the build process for the Astro part and the frontend
application itself, by letting cargo watch rebuild the Astro assets and
then the Rust binary, instead of building Astro in build.rs.
Non-conclusive results say that this improves build times from about
13s to 6s, while being more consistent :)