It turns out they cannot be called from a global context when running
`deno run -A main.ts`, since twind is not set up by the time the files
are loaded.
Fortunately, it turns out that `injectGlobal` does not duplicate the
CSS for every component it is rendered in!
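As a rough illustration of this constraint, here is a minimal sketch (the import path and the component are assumptions, not actual project code): calling `injectGlobal` inside a component body defers it until twind has been set up, and the CSS is only injected once regardless of how many instances render.

```tsx
// Hypothetical example; the import path depends on the twind setup in use.
import { injectGlobal } from "@twind/core";

export default function Prose(props: { html: string }) {
  // Safe here: runs at render time, after twind is initialised.
  // twind deduplicates this CSS even if many <Prose> instances render.
  injectGlobal`
    .prose a { text-decoration: underline; }
  `;
  return <div class="prose" dangerouslySetInnerHTML={{ __html: props.html }} />;
}
```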
* Add fresh frontend
This reimplements the existing frontend using Fresh. Primary highlights of this new frontend are:
- Uses deno instead of node/npm for fewer dependencies. Deno, for example, includes a formatter and a linter, and dependencies are downloaded automatically.
- Everything is TypeScript. There is no more .astro or similar, which
reduces complexity.
- The frontend is built entirely from components, which can either be server-side rendered only, or rehydrated on the client for interactivity (islands; see the sketch after this list).
- Fresh server-side renders all requests, populated using the API, which is type-safe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of interactivity, which previously had to be written in external JS files. Now these are Preact components and get all the benefits that come from this.
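For illustration, a minimal island in the Fresh convention (the component and its contents are made up for this example): files under `islands/` are shipped to the client and rehydrated, while everything else stays server-rendered only.

```tsx
// islands/Counter.tsx: a hypothetical island, rehydrated on the client.
import { useState } from "preact/hooks";

export default function Counter(props: { start: number }) {
  const [count, setCount] = useState(props.start);
  return (
    <button type="button" onClick={() => setCount(count + 1)}>
      Count: {count}
    </button>
  );
}
```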
Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`
* Update fresh frontend to use "type" rather than "@type"
* Add placeholder Tailwind config for VSCode intellisense
* Add discussions UI
* Clean up some left over template `{{...}}`
* './icons' might not exist before generation
* some UI/UX changes for consistency with old frontend
* Remove unused ENABLE_CSP flag since it is always enabled now
* Store icons used for the frontend in the repository
* Don't generate icons when starting the frontend
* Fix chat textarea sizing in Firefox
* Add Chat UI to new frontend
* Only allow one of liked, disliked, blocked at a time
* Add `cursor-pointer` to safe search radio buttons
* Add `leading-6` to articles to get more line spacing
Almost equivalent to the old frontend
* Prefix explore and site ranking links with https://
Perhaps we should determine the protocol in a more robust way?
* Fix explore sites regressions from adding tailwind-forms
* Refactor manage optics UI
* Add API endpoint for exporting optic from site rankings
`/beta/api/sites/export` is a JSON equivalent of the existing `/settings/sites/export` endpoint.
* Add "Export as optic" and "Clear all and export as optic" buttons
These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic
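A sketch of how such a button could call the endpoint and trigger a download; the request and response shapes here are assumptions for illustration, not the actual API contract.

```ts
// Hypothetical client-side helper; field names and formats are illustrative.
async function exportAsOptic(siteRankings: unknown): Promise<void> {
  const res = await fetch("/beta/api/sites/export", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(siteRankings),
  });
  const optic = await res.text(); // assumed: the optic source as plain text
  const url = URL.createObjectURL(new Blob([optic], { type: "text/plain" }));
  const a = document.createElement("a");
  a.href = url;
  a.download = "exported.optic";
  a.click();
  URL.revokeObjectURL(url);
}
```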
* Store site rankings in URL and send it during searching
* Use the tailwind config to extend the twind theme
* Add `/beta/api/explore/export` API endpoint
* Fix optics export button on explore
* Reflect the currently searched optic in the optic selector
* Add `noscript:hidden` class to hide e.g. the search result adjust buttons
* Re-search when changing ranking of a webpage
* Refactor searchbar interaction and suggestion highlighting
We now do the highlighting on the frontend
* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list, which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.
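To illustrate the public-suffix behavior (using the `tldts` npm package here purely as an example; the backend does this on the Rust side): with private suffixes enabled, `netlify.app` counts as a suffix, so the registrable domain of a Netlify site keeps the site part.

```ts
// Illustrative only; tldts is not necessarily what the project uses.
import { getDomain } from "tldts";

// "netlify.app" is in the private section of the public suffix list, so the
// registrable domain keeps the site part and blocking stays per-site.
getDomain("site.netlify.app", { allowPrivateDomains: true }); // "site.netlify.app"
getDomain("blog.example.com"); // "example.com"
```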
* Pass around `queryUrlPart` between pages
* Do syntax highlighting server-side using HighlightJS
* Remove `facebook.com` as default site in explore
* Add webmasters page to new frontend
* Remove old frontend
* Remove dead code from old Rust frontend
* Rename webstract to frontend
* remove more stuff from old frontend
---------
Co-authored-by: Mikkel Denker <mikkel@trystract.com>
* Add more endpoints to the OpenAPI specification
This adds:
- Autosuggest
- Summarize
- Fact check
- Alice
... and all of the necessary types to support them.
Additionally it adds permissive CORS to these routes. This might not be
appropriate, or perhaps it should be configurable.
* rename alice::Params to AliceParams for consistency
* merge main
* Enable CORS selectively
CORS is useful for development, but should not be enabled for production
---------
Co-authored-by: Mikkel Denker <mikkel@trystract.com>
This allows us to match compound words in the bigram and trigram fields while still keeping ranking signals intact. Previously, we had modified the bigram and trigram indexers to also output monograms at the start and end, but this turned out to introduce too much noise during ranking.
* This should fix the byte/char index mixups identified in issue 77
* script to generate dataset
* naive bayes classification with tf-idf features
* Add prediction confidence to naive bayes.
We report the confidence as `log_probs[best] / sum(log_probs)`.
I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but it should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability.
Also, even though naive bayes is a pretty decent classifier, some people on Stack Exchange report that it's a pretty bad probability estimator. Further tests will determine whether this confidence score is actually useful.
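Written out, the reported measure (restating the formula above; as noted, it is not necessarily a calibrated probability) is:

$$\mathrm{confidence} = \frac{\max_c \, \ell_c}{\sum_c \ell_c}, \qquad \ell_c = \log P(c \mid \text{document})$$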
* naive bayes benchmark
* store safe search classification in index
* search preferences page where user can control safe search settings
the query "wishlist" now also matches search results that has the terms "wish list". This is done using the bigram- and trigram fields.
Support for "wish list" to match "wishlist" results is not included in this commit as this would require each term in the query to be aware of the succeeding terms and it is not immediatly clear how best to approach this.
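A rough sketch of the mechanism (illustrative TypeScript; the actual tokenizer and indexer live in the Rust backend): adjacent terms are concatenated into the bigram field, so a document containing "wish list" indexes the token "wishlist", which the single-term query then matches.

```ts
// Illustrative bigram generation; names are made up for this sketch.
function bigrams(terms: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i + 1 < terms.length; i++) {
    out.push(terms[i] + terms[i + 1]); // "wish" + "list" -> "wishlist"
  }
  return out;
}

bigrams(["add", "to", "my", "wish", "list"]);
// -> ["addto", "tomy", "mywish", "wishlist"]
// The query "wishlist" now matches this document via the bigram field.
```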
This allows us to insert new urls much faster, as we then don't have to read all url states for a given domain in order to insert a single new url.
To sample domains, we prefix each url key with the domain and perform a prefix search in the database. This means we cannot use rkyv, as we then get alignment errors when trying to deserialize the keys (it's probably possible, but I don't know how to get it to work). We therefore now use bincode for the url stuff.
Sampling is probably a bit slower, as the prefix query likely uses more IOPS compared to simply finding all urls for a domain. Time will tell if this is still fast enough.
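A sketch of the key layout this describes (the separator and encoding are assumptions; the real keys are bincode-encoded on the Rust side): prefixing with the domain keeps a domain's urls adjacent, so inserting one url is a single write and sampling a domain is a prefix scan.

```ts
// Hypothetical key construction for the url database.
function urlKey(domain: string, url: string): string {
  return `${domain}\u0000${url}`; // the "\u0000" separator is an assumption
}

// Insert: write urlKey("example.com", "https://example.com/a") directly.
// Sample: prefix-scan all keys starting with "example.com" + "\u0000".
```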
Some domains had a very (!) large number of urls. A lot of time was spent reading and writing the url states to/from disk for these domains.
We now shard the urls and choose a random shard when sampling. If there are no valid urls in the shard, the job will simply contain 0 urls; the worker will quickly finish the job and request a new domain to crawl.
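Continuing the same hypothetical key layout, shard assignment and sampling might look like this (the shard count and hash are made up for illustration):

```ts
const NUM_SHARDS = 128; // assumed shard count

// A url is assigned to a fixed shard within its domain.
function shardOf(url: string): number {
  let h = 0;
  for (const c of url) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % NUM_SHARDS;
}

// Sampling scans one random shard; if it holds no valid urls, the job is
// simply empty and the worker quickly asks for a new domain.
function samplePrefix(domain: string): string {
  const shard = Math.floor(Math.random() * NUM_SHARDS);
  return `${domain}\u0000${shard}\u0000`;
}
```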
Rkyv uses i32 by default to represent byte offsets. This means we could not serialize/deserialize structs that are larger than approximately 2 GB. This commit also enables the 64-bit feature so we can deal with larger structs.
* Add `serde(tag = "type", content = "value")` to OpenAPI exposed types
Makes them more ergonomic to work with in TypeScript in some scenarios.
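For context, adjacent tagging (`tag = "type", content = "value"`) serializes an enum variant as `{ "type": ..., "value": ... }`, which maps directly onto a TypeScript discriminated union. The variant names below are made up for illustration.

```ts
// Hypothetical union mirroring an adjacently tagged Rust enum.
type ApiResult =
  | { type: "text"; value: string }
  | { type: "score"; value: number };

function render(r: ApiResult): string {
  switch (r.type) {
    case "text":
      return r.value.toUpperCase(); // narrowed to string
    case "score":
      return r.value.toFixed(2); // narrowed to number
  }
}
```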
* Add `#[serde(rename_all="camelCase")]` to all types deriving `ToSchema`
Currently two types are exempt: `Region` and `Expr`
* Update schema names to camelCase in external files
* This should fix the byte/char index mixups identified in issue 77
* save allocation when removing trailing '/'
* increase readability and prevent potential future clippy warnings
If siteA has a canonical url pointing to siteB, then siteB might show up for queries where it shouldn't actually match but where siteA matches. A bad actor might abuse this by setting weird canonical urls to sites that they don't like. This could of course be fixed by only respecting canonical urls within the same domain, but even in that case I don't see how the canonical site will benefit the user. If siteA and siteB actually have the same content, then one of them will already be downranked due to duplication detection.
Therefore it doesn't make much sense to blindly follow canonical url hints (or actually follow them at all).
We will still leave the functions to extract the canonical urls, since we might want to skip indexing for sites that have a canonical url defined that's different from their own url.