Commit graph

70 commits

Author SHA1 Message Date
Mikkel Denker
12e9502e80
Improve API documentation (#235)
* add docusaurus scalar api documentation structure

* bump openapi 3.0 to 3.1 so we can mark internal endpoints

* improve search api docs

* webgraph api docs

* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
31bfebf2c9 just update 2024-10-25 09:37:45 +02:00
Mikkel Denker
658ac6f682
Webgraph inverted index (#232)
* overall structure for new webgraph store

* webgraph schema structure and HostLinksQuery

* deserialize edge

* forward/backlink queries

* full edge queries and iter smalledges

* [wip] use new store in webgraph

* remove id2node db

* shortcircuit link queries

* [wip] remote webgraph trait structure

* [wip] shard awareness

* finish remote webgraph trait structure

* optimize read

* merge webgraphs

* construct webgraph store

* make sure 'just configure' works and everything looks correct
2024-10-23 11:59:52 +02:00
Mikkel Denker
70e9e8181d make api types public 2024-10-01 10:02:22 +02:00
Mikkel Denker
5ebdb24a07 just update 2024-10-01 09:51:11 +02:00
Mikkel Denker
8f97617904 make ml models optional during setup 2024-07-25 15:16:32 +02:00
Mikkel Denker
265b1b7871
Ranking diff tool (#207)
* ranking diff tool structure

* fix missing icon types

* add admin for queries and experiments

* minor cleanup

* show experiment progress

* upgrade node adapter for svelte

* hopefully fix ci

* display common queries between experiments

* display serp diffs with top signals for each result

* like experiments and show overview in queries

* settings to toggle experiment shuffle and show/hide signals

* keyboard shortcuts

* visualise improvements by query category

* document how to use tool
2024-06-03 15:00:16 +02:00
Mikkel Denker
a1381d667b fixed bug that caused error model in spell correction to always be empty 2024-05-27 11:45:07 +02:00
Mikkel Denker
1a6d8ff6be fixed bug in stupid_backoff model that caused last n_gram count to always be 0 2024-05-14 16:49:30 +02:00
Mikkel Denker
7d870b2702 build wasm during 'just setup' and make sure pkg has a package.json file.
see https://github.com/rustwasm/wasm-pack/issues/965
2024-02-29 14:05:02 +01:00
Wesley Appler
25c0344578
[WIP] Implement the importing of optics (#167)
* Initial implementation of importing sites from an optic

* Removed unused import

* Updated button text

* Implemented client-side WASM to allow for parsing of imported .optic files

* Removed unneeded deps & updated `CONTRIBUTING.md` to reflect wasm-pack needs

* CI updates

* Added vite-plugin-wasm-pack to ensure wasm modules get copied over

* CI fix >:(

* More CI attempts

* agony - CSP fix & further wasm-pack fixes

* CSP updates

* Package update to prevent an unnecessary build of wasm

* reduce bloat in ci build log from wasm

* fix another non-deterministically failing test

* only install wasm-pack as part of setup steps in CONTRIBUTING.md
./scripts/ci/check seems to fail if it tries to install wasm-pack while it is already installed (at least on my machine). as it is already added as a step in CONTRIBUTING.md we can assume it has been installed on the system

* add vite plugin to ensure changes to 'crates/client-wasm' get reflected in the frontend.
adapted from https://github.com/StractOrg/stract/pull/109

* run 'npm run format'

* propagate errors from wasm crate
2024-02-28 17:01:32 +01:00
Mikkel Denker
b89ea6389c pip install upgrade during setup 2024-02-17 19:50:21 +01:00
Mikkel Denker
e4e3044e47 finally ditch that pesky libtorch dependency! 2024-02-02 13:11:06 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00
Mikkel Denker
788f92c8f4 split webgraph server into host and page.
allows us to host each graph on separate sets of servers.
2024-01-24 11:08:45 +01:00
Mikkel Denker
cc91935d0a Move entity index out of normal search index and have dedicated search server for it 2024-01-23 14:53:33 +01:00
Mikkel Denker
fbc01ad865 summarization using mistral and 'chain-of-density' approach.
the summarization becomes much better if we allow the model to first generate a candidate summarization and then improve on it.
doing the improvement step just once seems to significantly improve the summary.
we also now use an llm (mistral 7b) for the summarisations, as we can then use the same model for multiple tasks and serve it using gpus, thus significantly decreasing the latency.
2024-01-19 11:08:17 +01:00
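The draft-then-improve loop described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `summarize` function, the `generate` callback standing in for the language model, and both prompt strings are assumptions made for the example.

```rust
/// Drafts a summary, then asks the model to improve it `refinements` times.
/// `generate` is a hypothetical stand-in for a call to the language model.
fn summarize(generate: impl Fn(&str) -> String, text: &str, refinements: usize) -> String {
    // Step 1: let the model produce a candidate summary.
    let mut summary = generate(&format!("Summarize the following text:\n{text}"));

    // Step 2: feed the candidate back together with the source text and ask
    // for a better version. The prompts here are illustrative only.
    for _ in 0..refinements {
        summary = generate(&format!(
            "Text:\n{text}\n\nCurrent summary:\n{summary}\n\n\
             Rewrite the summary to cover more key details at the same length."
        ));
    }
    summary
}
```

Calling `summarize(model, text, 1)` corresponds to doing the improvement step just once, which costs two model calls per document.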
Mikkel Denker
7ea3dbcca4 [ranking] add a host_centrality_rank and page_centrality_rank signal
it might be easier to score pages based on the rank of their centrality in the sorted list of centralities. for instance the centralities for page A and page B might be very similar numerically, but if a lot of pages are between A and B when looking at the sorted list, the higher ranking page might in reality be a better result than the lower ranking one.

the rankings are calculated using an external sorting algorithm to account for the fact that we might need to sort more nodes than we can feasibly keep in memory at once.
2024-01-05 12:20:24 +01:00
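A rough sketch of how a rank signal like the one in the commit above could be derived with an external-sort style algorithm: sort bounded-size runs independently, then k-way merge them and use the position in the merged order as the rank. The `Entry` type, the function names and the tie-breaking rule are assumptions for illustration. The runs stay in memory here, whereas a real external sort would spill each run to disk and stream it back during the merge.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A node and its centrality score (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
struct Entry {
    centrality: f64,
    node: u64,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher centrality is "greater"; ties go to the smaller node id.
        self.centrality
            .total_cmp(&other.centrality)
            .then_with(|| other.node.cmp(&self.node))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

/// Phase 1: split the input into runs of at most `run_len` entries and sort
/// each run on its own, best entry first.
fn sorted_runs(entries: &[Entry], run_len: usize) -> Vec<Vec<Entry>> {
    entries
        .chunks(run_len.max(1))
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_by(|a, b| b.cmp(a));
            run
        })
        .collect()
}

/// Phase 2: k-way merge of the runs. Only one entry per run sits in the heap
/// at a time. Returns `(node, rank)` pairs, where rank 0 is the most central.
fn centrality_ranks(runs: &[Vec<Entry>]) -> Vec<(u64, u64)> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push((first, run_idx, 0usize));
        }
    }

    let mut ranks = Vec::new();
    while let Some((entry, run_idx, pos)) = heap.pop() {
        let rank = ranks.len() as u64;
        ranks.push((entry.node, rank));
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push((next, run_idx, pos + 1));
        }
    }
    ranks
}
```

With this, two pages whose centralities are 0.51 and 0.50 still end up far apart in rank if many other pages fall between them, which is the effect the commit message describes.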
Mikkel Denker
276165da49 move libtorch behind feature flag 2023-10-14 14:17:54 +02:00
Mikkel Denker
8fd2b2a292 Re-write webgraph storage backend.
The webgraph storage is now essentially a '(from, to) -> label' map stored in rocksdb databases.
This heavily simplifies inserts and merges, since we can now insert new edges directly into the db without having to read the existing edges.

Get operations now use a prefix iterator from rocksdb. This utilizes the fact that '[{from_bytes},0,0,0,...]' is a prefix of any '[{from_bytes},{to_bytes}]' that might have been inserted into the database.
Assuming that we use a sufficiently large read-ahead size, I think there shouldn't be a noticeable increase in IO operations for get operations and thus no noticeable performance penalty. In fact, they might be a bit faster in practice due to not having to deserialize a hashmap and from the fact that rocksdb seems to be more tuned for small key-value sizes.
2023-09-14 12:54:37 +02:00
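The key layout from the commit above can be illustrated with a small sketch. It does not use rocksdb: an ordered in-memory `BTreeMap` stands in for the database, and the `EdgeStore` type, the 8-byte big-endian node ids and the string labels are all assumptions made for the example. The point is only that concatenating `from` and `to` into one key turns inserts into blind writes and makes "all edges from a node" a seek plus a prefix scan.

```rust
use std::collections::BTreeMap;

/// Fixed-width node id (assumption). Big-endian bytes keep byte order and
/// numeric order aligned, so all edges of one node are adjacent.
type NodeId = u64;

/// Builds the 16-byte key `[from_bytes, to_bytes]`.
fn edge_key(from: NodeId, to: NodeId) -> [u8; 16] {
    let mut key = [0u8; 16];
    key[..8].copy_from_slice(&from.to_be_bytes());
    key[8..].copy_from_slice(&to.to_be_bytes());
    key
}

/// In-memory stand-in for the key-value database: an ordered byte-key map.
#[derive(Default)]
struct EdgeStore {
    edges: BTreeMap<[u8; 16], String>,
}

impl EdgeStore {
    /// Inserting is a plain write; existing edges are never read first.
    fn insert(&mut self, from: NodeId, to: NodeId, label: &str) {
        self.edges.insert(edge_key(from, to), label.to_string());
    }

    /// Returns all `(to, label)` pairs for `from` by seeking to
    /// `[from_bytes, 0, 0, ...]` and scanning while the prefix still matches.
    fn outgoing(&self, from: NodeId) -> Vec<(NodeId, String)> {
        let prefix = from.to_be_bytes();
        self.edges
            .range(edge_key(from, 0)..)
            .take_while(|(key, _)| key[..8] == prefix)
            .map(|(key, label)| {
                let mut to = [0u8; 8];
                to.copy_from_slice(&key[8..]);
                (NodeId::from_be_bytes(to), label.clone())
            })
            .collect()
    }
}
```

Merging two such stores is then just replaying the key-value pairs of one into the other, which fits the commit's claim that inserts and merges become simpler.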
Oliver Bøving
2e2aff3da0
🥬 Svelte frontend (#91)
* remove deno frontend

* Add Svelte frontend

* change frontend port to 8000 and autofocus searchbar on frontpage

* Setup formatting of the new frontend with the new monorepo

* Add "show more" button to explore

* Add searchbar arrow key navigation

* Update query based on navigation in search bar

* Highlight matching prefix in search results

* Add toggling of site rankings to search results

* Fix crashing when having multiple semi-identical optics

* Refactor searchbar visibility

---------

Co-authored-by: Mikkel Denker <mikkel@trystract.com>
2023-09-10 16:32:03 +00:00
Mikkel Denker
d896e4ea94 control log level with environment variable 2023-09-05 20:27:02 +02:00
Oliver Bøving
369d5031df
Refactor Justfile and tracing with enabled debug tracing for stract (#87)
* Refactor Justfile and tracing with enabled debug tracing for stract

* Use `just dev` in `CONTRIBUTING.md`
2023-09-04 08:53:17 +00:00
Oliver Bøving
072a6323e9
🍋 Fresh frontend (#84)
* Add fresh frontend

This reimplements the existing frontend using Fresh. Primary highlights of
this new frontend are:

- Uses deno instead of node/npm for fewer dependencies. Deno for example
  includes a formatter and linter, and dependencies are downloaded
  automatically.
- Everything is TypeScript. There is no more .astro or similar, which
  reduces complexity.
- The frontend is built up of components entirely, which can either be
  server side rendered only, or rehydrated on the client for
  interactivity (islands).
- Fresh server side renders all requests, populated by using the API,
  which is typesafe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of
  interactivity, which needed to be written in external JS files. Now
  these are Preact components and can use all the benefits that come
  from this.

Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`

* Update fresh frontend to use "type" rather than "@type"

* Add placeholder Tailwind config for VSCode intellisense

* Add discussions UI

* Clean up some left over template `{{...}}`

* './icons' might not exist before generation

* some UI/UX changes for consistency with old frontend

* Remove unused ENABLE_CSP flag since it is always enabled now

* Store icons used for the frontend in the repository

* Don't generate icons when starting the frontend

* Fix chat textarea sizing in Firefox

* Add Chat UI to new frontend

* Only allow one of liked, disliked, blocked at a time

* Add `cursor-pointer` to safe search radio buttons

* Add `leading-6` to articles to get more line spacing

Almost equivalent to the old frontend

* Prefix explore and site ranking links with https://

Perhaps we should determine the protocol in a more robust way?

* Fix explore sites regressions from adding tailwind-forms

* Refactor manage optics UI

* Add API endpoint for exporting optic from site rankings

`/beta/api/sites/export` is a JSON equivalent of the existing
`/settings/sites/export` endpoint.

* Add "Export as optic" and "Clear all and export as optic" buttons

These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic

* Store site rankings in URL and send it during searching

* Use the tailwind config to extend the twind theme

* Add `/beta/api/explore/export` API endpoint

* Fix optics export button on explore

* Reflect the currently searched optic in the optic selector

* Add `noscript:hidden` class to hide fx search result adjust buttons

* Re-search when changing ranking of a webpage

* Refactor searchbar interaction and suggestion highlighting

We now do the highlighting on the frontend

* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.

* Pass around `queryUrlPart` between pages

* Do syntax highlighting server-side using HighlightJS

* Remove `facebook.com` as default site in explore

* Add webmasters page to new frontend

* Remove old frontend

* Remove dead code from old Rust frontend

* Rename webstract to frontend

* remove more stuff from old frontend

---------

Co-authored-by: Mikkel Denker <mikkel@trystract.com>
2023-09-04 05:59:28 +00:00
Mikkel Denker
1af94339bb increase max politeness, faster setup and some documentation 2023-07-13 17:53:46 +02:00
Mikkel Denker
af1e43206b Normalize redirects in web graph 2023-07-06 16:39:52 +02:00
Mikkel Denker
5b645e5188 fewer indexes to hopefully increase insert performance in crawldb 2023-06-28 13:36:39 +02:00
Mikkel Denker
b08bf68cba we also need to set the libtorch environment variables to be able to run the tests. We can now run 'just cargo ...' which makes sure cargo has all the necessary env variables set 2023-06-24 11:54:11 +02:00
Mikkel Denker
1f1dd5f588 added libtorch env stuff to justfile runs 2023-06-23 19:35:33 +02:00
Mikkel Denker
d4b3e6ba29 fixed justfile syntax 2023-06-23 17:29:34 +02:00
Mikkel Denker
c450036a1a extracted part of 'just configure' command into a 'just setup' 2023-06-23 17:28:54 +02:00
Mikkel Denker
1e95f94207 download libtorch from python script since we need to download from pytorch website if compiling for linux 2023-06-23 17:11:26 +02:00
Mikkel Denker
415aa14f46 normalize python versions 2023-06-23 14:42:23 +02:00
Mikkel Denker
18f7ef1842 Alice; show claim confidence level 2023-06-07 15:43:33 +02:00
Mikkel Denker
b16a1b9629 alice 2023-06-01 15:43:27 +02:00
Mikkel Denker
cb64b49ad9 Fixed a bug where distance calculation in online-harmonic used the wrong node from the edge 2023-05-10 16:29:47 +02:00
Mikkel Denker
fe713a8737 Move from onnx to libtorch bindings for ML inference.
Fuck onnx. It was an enormous hassle to get onnx to play ball with more advanced models and execute the onnx models on GPU since onnx is only compiled to older cuda versions. This commit removes our dependency to onnx and replaces it with direct bindings to libtorch which gives us more flexibility and still allows us to easily deploy simple models with tracing. Time will tell if this is sufficiently performant or if we may want to develop some kind of JIT that can fuse matrix operations to increase performance.
2023-05-08 11:11:49 +02:00
Mikkel Denker
5ab900eea5 fixed a bunch of problems with pattern_query implementation and wrote some tests to make sure it works correctly 2023-04-29 18:26:23 +02:00
Mikkel Denker
2a1fa6109a abstractive summarization model with beam search 2023-02-07 15:11:23 +01:00
Mikkel Denker
8a6751cf24 Split centrality building into separate processes. This is a hotfix to reduce the memory for each step 2023-01-30 10:29:47 +01:00
Mikkel Denker
72fa54a945 Quantize crossencoder 2023-01-23 12:33:01 +01:00
Mikkel Denker
f1ad006799 Fixed bug where liked sites would show up in discardall optics, even though they matched none of the rules 2023-01-16 16:16:50 +01:00
Mikkel Denker
29fe3ad652 webgraph CLI merge segments 2023-01-04 12:43:32 +01:00
Mikkel Denker
a03a4957be
Ftr/optics language (#69)
* Store all schema_org from webpages in a field

* flatten json tokenizer

* rename goggles -> optics

* update optics syntax

* cargo workspace

* very simple lsp wasm connection

* optics as separate package

* hover stuff

* optics vscode extension published

* syntax errors on-save and begin schema-field

* Use separate targets for LSP and rest (#68)

By moving the different targets into separate workspaces, we avoid some
of the issues where rust-analyzer might just stop working.

By adding the two projects to .vscode/settings.json we keep the ability
to get completions, goto definitions, rename, and such operations.

This requires us to specify the dependency versions in the LSP crate, as
we can no longer refer to them by the workspace version. The positive of
this is that the WASM/LSP dependent crates are now moved to the LSP crate.

* schema.org syntax in optic

* optic can now perform schema searches

* simplified schema_org flattening

* wrote new quickstart.optic

* update like-text

Co-authored-by: Oliver Bøving <oliver@bvng.dk>
2022-12-01 14:59:49 +01:00
Mikkel Denker
bbca94c37e
Parse DMOZ data (#66)
* Parse DMOZ data

* index topics as facets

* calculate topic centrality

* fix serious bug in webgraph where some nodes disappeared (there is still a bug somewhere, but waaaay fewer nodes are missing now)

* apply topic centrality during search
2022-11-03 14:19:21 +01:00
Mikkel Denker
2c127d5f39
Ftr/configure command (#65)
* Add autosuggest scrape as a separate command

* Save queries continuously

* Save images as they get downloaded (way lower memory usage)

* Created configure subcommand

* Updated justfile and setup documentation
2022-10-26 14:58:26 +02:00
Mikkel Denker
3cc7c84a32
Ftr/distributed search (#59)
* refactor network communication into separate module and made mapreduce async again

* sonic module is simple enough as is

* rename Searcher -> LocalSearcher

* [WIP] distributed searcher structure outlined

* split index search into initial and retrieval steps

* distributed searcher searching shards

* make bucket in collector generic

* no more todo!s. Waiting for indexing to finish to test implementation

* distributed searcher seems to work. Needs an enormous refactor - the code is really ugly

* cleanup search-server on exit in justfile
2022-09-28 15:50:45 +02:00
Mikkel Denker
e649053260 update setup steps 2022-09-22 09:49:50 +02:00
Mikkel Denker
8c9ffede30
Ftr/page centrality (#55)
* move signal from goggles into ranking module

* refactor webpage test-constructor

* add page_centrality field

* use page centrality during ranking

* small justfile refactoring

* update index in lfs
2022-09-13 11:49:50 +02:00
Oliver Bøving
f972b163c5
Optimize frontend build time (#39)
This moves building the astro frontend from build.rs into the justfile.

This streamlines the build process for the frontend astro part, and the
frontend application itself by letting cargo watch rebuild the astro and
then the Rust binary, instead of building astro in build.rs.

Non-conclusive results say that this improves build times from about
13s to 6s, while being more consistent :)
2022-09-10 12:23:33 +02:00