Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
f8480ada94 control api_base with env variables for frontend 2023-09-04 13:03:28 +02:00
Oliver Bøving
9358db9933
Move all injectGlobal into components (#88)
Turns out they cannot be called from a global context when running
`deno run -A main.ts` since twind is not setup by the time the files are
loaded.

Fortunately, it turns out that the `injectGlobal` does not duplicate the
CSS for every component it is rendered in!
2023-09-04 09:00:26 +00:00
Oliver Bøving
369d5031df
Refactor Justfile and tracing with enabled debug tracing for stract (#87)
* Refactor Justfile and tracing with enabled debug tracing for stract

* Use `just dev` in `CONTRIBUTING.md`
2023-09-04 08:53:17 +00:00
Mikkel Denker
53bf823041 Merge branch 'main' of github.com:StractOrg/stract 2023-09-04 10:32:45 +02:00
Mikkel Denker
84b32e56e3 fix clippy warnings 2023-09-04 10:32:42 +02:00
Oliver Bøving
c7e941f3c4
Rename Rust frontend to api (#86) 2023-09-04 08:24:56 +00:00
Mikkel Denker
4f4f97eb8c don't show alice when disabled 2023-09-04 08:59:29 +02:00
Mikkel Denker
1abf645918 homepages should have empty url query part 2023-09-04 08:05:14 +02:00
Oliver Bøving
072a6323e9
🍋 Fresh frontend (#84)
* Add fresh frontend

This reimplements the existing frontend using Fresh. Primary highlights of
this new frontend are:

- Uses deno instead of node/npm for fewer dependencies. Deno, for example,
  includes a formatter and linter, and dependencies are downloaded
  automatically.
- Everything is TypeScript. There is no more .astro or similar, which
  reduces complexity.
- The frontend is built up of components entirely, which can either be
  server side rendered only, or rehydrated on the client for
  interactivity (islands).
- Fresh server side renders all requests, populated by using the API,
  which is typesafe and generated from the OpenAPI spec.
- Combining the last two, it becomes much easier to add high levels of
  interactivity, which previously needed to be written in external JS
  files. Now these are Preact components and can use all the benefits
  that come from this.

Future work includes:
- [ ] Integrating Alice in the new UI
- [ ] Direct answers UI
- [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend
- [ ] Routes supplying `queryUrlPart` to `Header`

* Update fresh frontend to use "type" rather than "@type"

* Add placeholder Tailwind config for VSCode intellisense

* Add discussions UI

* Clean up some left over template `{{...}}`

* './icons' might not exist before generation

* some UI/UX changes for consistency with old frontend

* Remove unused ENABLE_CSP flag since it is always enabled now

* Store icons used for the frontend in the repository

* Don't generate icons when starting the frontend

* Fix chat textarea sizing in Firefox

* Add Chat UI to new frontend

* Only allow one of liked, disliked, blocked at a time

* Add `cursor-pointer` to safe search radio buttons

* Add `leading-6` to articles to get more line spacing

Almost equivalent to the old frontend

* Prefix explore and site ranking links with https://

Perhaps we should determine the protocol in a more robust way?

* Fix explore sites regressions from adding tailwind-forms

* Refactor manage optics UI

* Add API endpoint for exporting optic from site rankings

`/beta/api/sites/export` is a JSON equivalent of the existing
`/settings/sites/export` endpoint.

* Add "Export as optic" and "Clear all and export as optic" buttons

These new buttons use the new `/beta/api/sites/export` endpoint to
download the generated optic

* Store site rankings in URL and send it during searching

* Use the tailwind config to extend the twind theme

* Add `/beta/api/explore/export` API endpoint

* Fix optics export button on explore

* Reflect the currently searched optic in the optic selector

* Add `noscript:hidden` class to hide e.g. search result adjust buttons

* Re-search when changing ranking of a webpage

* Refactor searchbar interaction and suggestion highlighting

We now do the highlighting on the frontend

* Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.).
In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all.

* Pass around `queryUrlPart` between pages

* Do syntax highlighting server-side using HighlightJS

* Remove `facebook.com` as default site in explore

* Add webmasters page to new frontend

* Remove old frontend

* Remove dead code from old Rust frontend

* Rename webstract to frontend

* remove more stuff from old frontend

---------

Co-authored-by: Mikkel Denker <mikkel@trystract.com>
2023-09-04 05:59:28 +00:00
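
The domain-blocking note in the commit above ('site.netlify.app' is its own domain) can be illustrated with a minimal sketch. It assumes a tiny hard-coded stand-in for the public suffix list (`PUBLIC_SUFFIXES`) and a hypothetical helper `registrable_domain`; the real list is far larger, and this is not taken from the stract code base.

```python
# Tiny, hypothetical stand-in for the public suffix list (assumption for illustration).
PUBLIC_SUFFIXES = {"com", "app", "netlify.app", "co.uk"}


def registrable_domain(host: str) -> str:
    """Return the longest matching public suffix plus one more label to its left."""
    labels = host.lower().split(".")
    # Candidates run from longest to shortest, so "netlify.app" wins over "app".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            # Keep one extra label in front of the suffix, if there is one.
            return ".".join(labels[max(i - 1, 0):])
    # No known suffix: fall back to the host itself.
    return host.lower()
```

Under these assumptions, blocking `site.netlify.app` leaves other `netlify.app` sites untouched, while `www.example.com` collapses to `example.com`.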
Mikkel Denker
08db882fec improve some types of navigational searches 2023-09-04 07:23:35 +02:00
Mikkel Denker
bedc51ca3d Remove alice from readme.
Since alice is not good enough yet and therefore not deployed in production, it only causes confusion to have it as a feature in the readme for now.
2023-09-01 10:26:08 +02:00
Mikkel Denker
c71727dedc added 'DomainNameNoTokenizer' field to better rank navigational searches 2023-08-31 16:53:53 +02:00
Oliver Bøving
f0f0c420bb
Add more endpoints to the OpenAPI specification (#80)
* Add more endpoints to the OpenAPI specification

This adds:
- Autosuggest
- Summarize
- Fact check
- Alice

... and all of the necessary types to support them.

Additionally it adds permissive CORS to these routes. This might not be
appropriate, or perhaps it should be configurable.

* rename alice::Params to AliceParams for consistency

* merge main

* Enable CORS selectively

CORS is useful for development, but should not be enabled for production

---------

Co-authored-by: Mikkel Denker <mikkel@trystract.com>
2023-08-31 09:50:10 +00:00
Mikkel Denker
23b1fe7324 Use all nodes to compute the inbound vector (not just top harmonic centrality ones).
This makes the similarity between sites more accurate which improves explore and site_rankings
2023-08-31 09:27:05 +02:00
Mikkel Denker
445043b1d2 Deduplicate webgraph edges during lookup.
Some edges may be duplicate if they are present in multiple segments.
2023-08-31 08:46:41 +02:00
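
A minimal sketch of deduplicating edges at lookup time, as described in the commit above. The `Edge` shape and the `dedup_edges` helper are assumptions for illustration; the actual webgraph types are not shown in this log.

```python
from typing import Iterable, List, NamedTuple, Set


class Edge(NamedTuple):
    # Hypothetical edge shape; the real webgraph edge may carry other fields.
    source: str
    target: str
    label: str


def dedup_edges(segments: Iterable[Iterable[Edge]]) -> List[Edge]:
    """Merge edges from all segments, keeping the first occurrence of each edge."""
    seen: Set[Edge] = set()
    merged: List[Edge] = []
    for segment in segments:
        for edge in segment:
            if edge not in seen:
                seen.add(edge)
                merged.append(edge)
    return merged
```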
Mikkel Denker
e534cd0d83 Ability to have different tokenizer for field during query and indexing.
This allows us to match compound words in the bigram and trigram field while still keeping ranking signals intact. Before we had modified the bigram and trigram indexers to also output monograms at the start and end, but this turned out to introduce too much noise during ranking.
2023-08-30 11:57:19 +02:00
Mikkel Denker
3fc3a316ad update ranking signal defaults 2023-08-29 21:47:47 +02:00
Mikkel Denker
5c6e552dcc forgot to update js files when renaming type in api 2023-08-28 13:30:09 +02:00
Mikkel Denker
030a0a50ce Fix phrase search 2023-08-28 11:02:21 +02:00
Mikkel Denker
45f213a1d2 rename '@type' to 'type' in api since some openapi generators get confused 2023-08-28 09:22:53 +02:00
Mikkel Denker
5512c6d76c
url encode query when sent to matching banghit (#83) 2023-08-27 17:46:56 +00:00
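
A minimal sketch of why the query needs encoding before it is substituted into a bang's target URL. The `{query}` placeholder and the `bang_redirect_url` helper are hypothetical; only `urllib.parse.quote` is a real standard-library call.

```python
from urllib.parse import quote


def bang_redirect_url(template: str, query: str) -> str:
    """Insert the percent-encoded query into a URL template with a {query} placeholder."""
    # safe="" also encodes "/", so the query cannot alter the URL structure.
    return template.replace("{query}", quote(query, safe=""))
```

Without the encoding step, characters such as `&` or `+` in the query would be interpreted by the target site as URL syntax rather than as part of the search terms.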
Mikkel Denker
24ffa32983
Safe search (#79)
* This should fix the byte/char index mixups identified in issue 77

* script to generate dataset

* naive bayes classification with tf-idf features

* Add prediction confidence to naive bayes.
we report the confidence as $log_probs[best] / sum(log_probs)$.
I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability.

Also, even though naive bayes is a pretty decent classifier some people on stackexchange report that it's a pretty bad probability estimator. Further tests will determine if this confidence score is actually useful.

* naive bayes benchmark

* store safe search classification in index

* search preferences page where user can control safe search settings
2023-08-26 16:36:33 +00:00
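
The confidence formula from the safe search commit above, taken literally, can be sketched as follows. The function name and the choice to pick the class with the highest log-probability are assumptions; this is not the classifier code itself.

```python
from typing import Sequence, Tuple


def predict_with_confidence(log_probs: Sequence[float]) -> Tuple[int, float]:
    """Return (best class index, log_probs[best] / sum(log_probs))."""
    if not log_probs:
        raise ValueError("need at least one class")
    total = sum(log_probs)
    if total == 0:
        raise ValueError("sum of log-probabilities is zero; ratio is undefined")
    # The predicted label is the class with the highest log-probability.
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return best, log_probs[best] / total
```

When every log-probability is negative, this ratio is positive and gets smaller as the best class pulls further ahead of the others, which fits the commit's caution that it should not be read as a probability.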
Mikkel Denker
68ba133045 Ability to limit number of words considered in snippet generation.
Some search results have a very large amount of text, which results in the vast majority of time being spent in snippet generation compared to search.
2023-08-25 11:24:08 +02:00
Mikkel Denker
7706dcdfa2 Partial support for compounded words.
the query "wishlist" now also matches search results that have the terms "wish list". This is done using the bigram- and trigram fields.

Support for "wish list" to match "wishlist" results is not included in this commit as this would require each term in the query to be aware of the succeeding terms and it is not immediately clear how best to approach this.
2023-08-25 09:23:53 +02:00
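
One way to picture the compound-word matching from the commit above is a bigram field that stores adjacent tokens joined without a space. That storage format is an assumption made for this sketch, as are the helper names; the commit only states that the bigram and trigram fields are used.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Whitespace tokenizer, lowercased (a stand-in for the real tokenizer)."""
    return text.lower().split()


def bigram_terms(tokens: List[str]) -> List[str]:
    """Join each pair of adjacent tokens, e.g. ["wish", "list"] -> ["wishlist"]."""
    return [a + b for a, b in zip(tokens, tokens[1:])]


def matches(query_term: str, document: str) -> bool:
    """True if the query term occurs as a plain token or as a joined bigram."""
    tokens = tokenize(document)
    return query_term in tokens or query_term in bigram_terms(tokens)
```

This covers only the direction the commit describes: a single query term matching two adjacent document terms. The reverse direction would need the query side to combine its own terms.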
Mikkel Denker
cf113de899 fix weird quote symbols from ios in query parsing 2023-08-24 13:49:00 +02:00
Mikkel Denker
53cad3fb5f Faster crawldb by storing url states directly as values.
This allows us to insert new urls much faster, as we then don't have to read all url states for a given domain in order to insert a single new url.
To sample domains, we prefix each url key with the domain and perform a prefix search in the database. This means we cannot use rkyv as we then get alignment errors when trying to deserialize the keys (it's probably possible but I don't know how to get it to work). We therefore now use bincode for the url stuff.

Sampling is probably a bit slower as the prefix query likely uses more iops compared to simply finding all urls for a domain. Time will tell if this is still fast enough.
2023-08-24 13:31:39 +02:00
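
The key layout described in the commit above (each url key prefixed with its domain, sampled via a prefix search) can be sketched with a sorted in-memory key list standing in for the database. The separator, class name, and state values are hypothetical.

```python
import bisect
from typing import Dict, List, Tuple

SEP = "\x00"  # hypothetical separator between domain and url in a key


class UrlStateDb:
    """In-memory stand-in for an ordered key-value store."""

    def __init__(self) -> None:
        self._keys: List[str] = []  # kept sorted so prefix scans are contiguous
        self._values: Dict[str, str] = {}

    def insert(self, domain: str, url: str, state: str) -> None:
        # A single write; no need to load every url state for the domain first.
        key = f"{domain}{SEP}{url}"
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = state

    def urls_for_domain(self, domain: str) -> List[Tuple[str, str]]:
        """Prefix scan: all (url, state) pairs whose key starts with the domain."""
        prefix = domain + SEP
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            found.append((key[len(prefix):], self._values[key]))
        return found
```

The separator keeps `a.com` from also matching keys for `a.com.example`, since the prefix includes the separator byte.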
Mikkel Denker
dc0390baac sharded url state db to deal with domains that have a lot of urls 2023-08-24 10:57:10 +02:00
Mikkel Denker
1f2308eea7 huggingface seems to have stealth updated the tokenizer crate 2023-08-23 20:27:13 +02:00
Mikkel Denker
87b6699e7a Shard url state database.
Some domains had a very (!) large number of urls. A lot of time was spent reading and writing the urlstates to/from disk for these domains.
We now shard the urls and choose a random shard when sampling. If there are no valid urls in the shard, then the job will simply contain 0 urls, the worker will quickly finish the job and request a new domain to crawl.

Rkyv uses i32 by default to represent byte offsets. This means we could not serialize/deserialize structs that are larger than approx 2gb. This commit also enables the 64 bit feature so we can deal with larger structs.
2023-08-23 20:09:22 +02:00
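
A minimal sketch of the sharding and sampling behaviour described in the commit above. The shard count, the hash-based shard assignment, and the helper names are all assumptions; the commit only says that urls are sharded and that a random shard is chosen when sampling.

```python
import hashlib
import random
from typing import Callable, List

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Assign a url to a shard using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def sample_job(
    shards: List[List[str]],
    is_valid: Callable[[str], bool],
    rng: random.Random,
) -> List[str]:
    """Pick one random shard and return its valid urls; the job may be empty."""
    shard = rng.choice(shards)
    return [url for url in shard if is_valid(url)]
```

An empty result is not an error here: as the commit notes, the worker finishes such a job quickly and requests a new domain.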
Mikkel Denker
dffc149b10 longer timeout when fetching robots.txt 2023-08-23 16:16:42 +02:00
Mikkel Denker
21f228c471 if body is empty, generate snippet from description 2023-08-23 11:54:55 +02:00
Mikkel Denker
00bc05cfba if there is no cleantext on site, then we probably can't create a good snippet anyway 2023-08-23 11:51:30 +02:00
Mikkel Denker
47c84a0cca reduce thresholds and minimum_clean_words to make it easier to trigger sidebars during development 2023-08-23 11:17:51 +02:00
Oliver Bøving
50304467c4
Add #[serde(tag = "type", content = "value")] to OpenAPI exposed types (#76)
* Add `serde(tag = "type", content = "value")` to OpenAPI exposed types

Makes them more ergonomic to work with in TypeScript in some scenarios.

* Add `#[serde(rename_all="camelCase")]` to all types deriving `ToSchema`

Currently two types are exempt: `Region` and `Expr`

* Update schema names to camelCase in external files
2023-08-21 20:02:56 +00:00
Mikkel Denker
d5624ee1b4
Fix the byte/char index mixups identified in issue 77 (#78)
* This should fix the byte/char index mixups identified in issue 77

* save allocation when removing trailing '/'

* increase readability and prevent potential future clippy warnings
2023-08-21 17:50:14 +00:00
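
The general class of bug behind a byte/char index mixup can be sketched as follows; the details of issue 77 are not given in this log, so this only illustrates the distinction. In UTF-8, a character may span several bytes, so an offset counted in characters is not a valid offset into the encoded bytes, and vice versa. The helper names are hypothetical.

```python
def char_to_byte_index(text: str, char_index: int) -> int:
    """Convert a character offset into the matching UTF-8 byte offset."""
    return len(text[:char_index].encode("utf-8"))


def byte_to_char_index(text: str, byte_index: int) -> int:
    """Convert a UTF-8 byte offset into a character offset.

    Raises UnicodeDecodeError if the offset falls inside a multi-byte character.
    """
    data = text.encode("utf-8")
    return len(data[:byte_index].decode("utf-8"))
```

For a string such as `"bøving"`, character offset 2 corresponds to byte offset 3, because `ø` takes two bytes; treating one as the other either picks the wrong position or lands in the middle of a character.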
Mikkel Denker
0ac07ab3b4 no need to create 'warc_files' folder when downloading anymore 2023-08-21 12:09:09 +02:00
Mikkel Denker
403a740aaf Remove complete trust with canonical urls.
If siteA has a canonical url for siteB, then siteB might show up for queries where it shouldn't actually match but where siteA matches. A bad actor might use this to have weird canonical urls to some sites that they don't like. This could of course be fixed by only respecting canonical urls within the same domain, but even in that case I don't see how the canonical site will benefit the user. If siteA and siteB actually have the same content, then one of them will already be downranked due to duplication detection.

Therefore it doesn't make much sense to blindly follow canonical url hints (or actually follow them at all).

Will still leave the functions to extract the canonical urls, since we might want to skip indexing for sites that have a canonical url defined that's different from their own url.
2023-08-21 11:31:34 +02:00
Mikkel Denker
68e6b66b83 optimizations to make discussion optic faster 2023-08-21 11:15:18 +02:00
Mikkel Denker
7f28937209 indieweb tag extraction bug: should not match substring classes 2023-08-21 11:15:02 +02:00
Mikkel Denker
d84c973779 optimize optic matching when the pattern consists of a single normal term and nothing else 2023-08-21 09:35:19 +02:00
Mikkel Denker
11d18a3077 Remove 'post' requirement from lemmy urls.
It was too slow for some reason. Will need to investigate why, but let's disable it for the time being
2023-08-21 09:20:16 +02:00
Mikkel Denker
9bdc55d258 Faster discussions optic 2023-08-21 09:11:41 +02:00
Mikkel Denker
5155767f0b deploy indieweb optic 2023-08-21 09:03:24 +02:00
Mikkel Denker
7dbd149075 Support indieweb only optics.
Still need to update the quickstart and blogroll optic to include the new match location.
2023-08-18 16:13:45 +02:00
Mikkel Denker
f1403fa7aa fix broken link highlights in settings 2023-08-18 13:51:20 +02:00
Mikkel Denker
912dcc5a8e cannot use alpine when having strict CSP headers.
User security > developer convenience
2023-08-18 13:37:19 +02:00
Mikkel Denker
d2dc28215e forgot to move explore script to separate file 2023-08-17 18:13:55 +02:00
Mikkel Denker
13ad5d7834 Moved more inline javascripts into files 2023-08-17 18:08:39 +02:00
Mikkel Denker
8819e6e6db Moved all inline scripts into separate files.
This allows us to set CSP headers that only allow js files from self, which reduces the XSS attack surface quite substantially.
2023-08-17 15:22:43 +02:00
Mikkel Denker
7eb5387c90 ability to easily export site rankings as an optic 2023-08-17 13:40:12 +02:00