0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	f8480ada94	control api_base with env variables for frontend	2023-09-04 13:03:28 +02:00
Oliver Bøving	9358db9933	Move all `injectGlobal` into components (#88) Turns out they cannot be called from a global context when running `deno run -A main.ts` since twind is not set up by the time the files are loaded. Fortunately, it turns out that `injectGlobal` does not duplicate the CSS for every component it is rendered in!	2023-09-04 09:00:26 +00:00
Oliver Bøving	369d5031df	Refactor `Justfile` and tracing with enabled debug tracing for stract (#87) * Refactor Justfile and tracing with enabled debug tracing for stract * Use `just dev` in `CONTRIBUTING.md`	2023-09-04 08:53:17 +00:00
Mikkel Denker	53bf823041	Merge branch 'main' of github.com:StractOrg/stract	2023-09-04 10:32:45 +02:00
Mikkel Denker	84b32e56e3	fix clippy warnings	2023-09-04 10:32:42 +02:00
Oliver Bøving	c7e941f3c4	Rename Rust `frontend` to `api` (#86)	2023-09-04 08:24:56 +00:00
Mikkel Denker	4f4f97eb8c	don't show alice when disabled	2023-09-04 08:59:29 +02:00
Mikkel Denker	1abf645918	homepages should have empty url query part	2023-09-04 08:05:14 +02:00
Oliver Bøving	072a6323e9	🍋 Fresh frontend (#84) * Add fresh frontend This reimplements the existing frontend using Fresh. Primary highlights of this new frontend are: - Uses deno instead of node/npm for fewer dependencies. Deno for example includes a formatter and linter, and dependencies are downloaded automatically. - Everything is TypeScript. There is no more .astro or similar, which reduces complexity. - The frontend is built up of components entirely, which can either be server side rendered only, or rehydrated on the client for interactivity (islands). - Fresh server side renders all requests, populated by using the API, which is typesafe and generated from the OpenAPI spec. - Combining the last two, it becomes much easier to add high levels of interactivity, which needed to be written in external JS files. Now these are Preact components and can use all the benefits that come from this. Future work includes: - [ ] Integrating Alice in the new UI - [ ] Direct answers UI - [ ] Default Optics. Should they come from the API or the frontend?
- [ ] Integrating the new fresh server with the existing backend - [ ] Routes supplying `queryUrlPart` to `Header` * Update fresh frontend to use "type" rather than "@type" * Add placeholder Tailwind config for VSCode intellisense * Add discussions UI * Clean up some left over template `{{...}}` * './icons' might not exist before generation * some UI/UX changes for consistency with old frontend * Remove unused ENABLE_CSP flag since it is always enabled now * Store icons used for the frontend in the repository * Don't generate icons when starting the frontend * Fix chat textarea sizing in Firefox * Add Chat UI to new frontend * Only allow one of liked, disliked, blocked at a time * Add `cursor-pointer` to safe search radio buttons * Add `leading-6` to articles to get more line spacing Almost equivalent to the old frontend * Prefix explore and site ranking links with https:// Perhaps we should determine the protocol in a more robust way? * Fix explore sites regressions from adding tailwind-forms * Refactor manage optics UI * Add API endpoint for exporting optic from site rankings `/beta/api/sites/export` is a JSON equivalent of the existing `/settings/sites/export` endpoint. * Add "Export as optic" and "Clear all and export as optic" buttons These new buttons use the new `/beta/api/sites/export` endpoint to download the generated optic * Store site rankings in URL and send it during searching * Use the tailwind config to extend the twind theme * Add `/beta/api/explore/export` API endpoint * Fix optics export button on explore * Reflect the currently searched optic in the optic selector * Add `noscript:hidden` class to hide e.g. search result adjust buttons * Re-search when changing ranking of a webpage * Refactor searchbar interaction and suggestion highlighting We now do the highlighting on the frontend * Change site blocking to be domain blocking when converting site rankings to optics.
The domain field uses the public suffix list which already handles suffixes that can be shared by multiple users (netlify.app etc.). In other words, the domain of 'site.netlify.app' is 'site.netlify.app', so users of stract can still block specific netlify sites without blocking them all. * Pass around `queryUrlPart` between pages * Do syntax highlighting server-side using HighlightJS * Remove `facebook.com` as default site in explore * Add webmasters page to new frontend * Remove old frontend * Remove dead code from old Rust frontend * Rename webstract to frontend * remove more stuff from old frontend --------- Co-authored-by: Mikkel Denker <mikkel@trystract.com>	2023-09-04 05:59:28 +00:00
Mikkel Denker	08db882fec	improve some types of navigational searches	2023-09-04 07:23:35 +02:00
Mikkel Denker	bedc51ca3d	Remove alice from readme. Since alice is not good enough yet and therefore not deployed in production, it only causes confusion to have it as a feature in the readme for now.	2023-09-01 10:26:08 +02:00
Mikkel Denker	c71727dedc	added 'DomainNameNoTokenizer' field to better rank navigational searches	2023-08-31 16:53:53 +02:00
Oliver Bøving	f0f0c420bb	Add more endpoints to the OpenAPI specification (#80) * Add more endpoints to the OpenAPI specification This adds: - Autosuggest - Summarize - Fact check - Alice ... and all of the necessary types to support them. Additionally it adds permissive CORS to these routes. This might not be appropriate, or perhaps it should be configurable. * rename alice::Params to AliceParams for consistency * merge main * Enable CORS selectively CORS is useful for development, but should not be enabled for production --------- Co-authored-by: Mikkel Denker <mikkel@trystract.com>	2023-08-31 09:50:10 +00:00
Mikkel Denker	23b1fe7324	Use all nodes to compute the inbound vector (not just top harmonic centrality ones). This makes the similarity between sites more accurate which improves explore and site_rankings	2023-08-31 09:27:05 +02:00
Mikkel Denker	445043b1d2	Deduplicate webgraph edges during lookup. Some edges may be duplicate if they are present in multiple segments.	2023-08-31 08:46:41 +02:00
Mikkel Denker	e534cd0d83	Ability to have different tokenizer for field during query and indexing. This allows us to match compound words in the bigram and trigram field while still keeping ranking signals intact. Before we had modified the bigram and trigram indexers to also output monograms at the start and end, but this turned out to introduce too much noise during ranking.	2023-08-30 11:57:19 +02:00
Mikkel Denker	3fc3a316ad	update ranking signal defaults	2023-08-29 21:47:47 +02:00
Mikkel Denker	5c6e552dcc	forgot to update js files when renaming type in api	2023-08-28 13:30:09 +02:00
Mikkel Denker	030a0a50ce	Fix phrase search	2023-08-28 11:02:21 +02:00
Mikkel Denker	45f213a1d2	rename '@type' to 'type' in api since some openapi generators get confused	2023-08-28 09:22:53 +02:00
Mikkel Denker	5512c6d76c	url encode query when sent to matching banghit (#83)	2023-08-27 17:46:56 +00:00
Mikkel Denker	24ffa32983	Safe search (#79) * This should fix the byte/char index mixups identified in issue 77 * script to generate dataset * naive bayes classification with tf-idf features * Add prediction confidence to naive bayes. We report the confidence as $log_probs[best] / sum(log_probs)$. I'm not really sure this confidence calculation can be seen as a probability that the model has predicted the correct label, but it should still give a picture of the confidence of the prediction. It's therefore named confidence and not probability. Also, even though naive bayes is a pretty decent classifier, some people on stackexchange report that it's a pretty bad probability estimator. Further tests will determine if this confidence score is actually useful. * naive bayes benchmark * store safe search classification in index * search preferences page where user can control safe search settings	2023-08-26 16:36:33 +00:00
Mikkel Denker	68ba133045	Ability to limit number of words considered in snippet generation. Some search results have a very large amount of text, which results in the vast majority of time being spent in snippet generation compared to search.	2023-08-25 11:24:08 +02:00
Mikkel Denker	7706dcdfa2	Partial support for compounded words. The query "wishlist" now also matches search results that have the terms "wish list". This is done using the bigram- and trigram fields. Support for "wish list" to match "wishlist" results is not included in this commit as this would require each term in the query to be aware of the succeeding terms and it is not immediately clear how best to approach this.	2023-08-25 09:23:53 +02:00
Mikkel Denker	cf113de899	fix weird quote symbols from ios in query parsing	2023-08-24 13:49:00 +02:00
Mikkel Denker	53cad3fb5f	Faster crawldb by storing url states directly as values. This allows us to insert new urls much faster, as we then don't have to read all url states for a given domain in order to insert a single new url. To sample domains, we prefix each url key with the domain and perform a prefix search in the database. This means we cannot use rkyv as we then get alignment errors when trying to deserialize the keys (it's probably possible but I don't know how to get it to work). We therefore now use bincode for the url stuff. Sampling is probably a bit slower as the prefix query likely uses more iops compared to simply finding all urls for a domain. Time will tell if this is still fast enough.	2023-08-24 13:31:39 +02:00
Mikkel Denker	dc0390baac	sharded url state db to deal with domains that have a lot of urls	2023-08-24 10:57:10 +02:00
Mikkel Denker	1f2308eea7	huggingface seems to have stealth updated the tokenizer crate	2023-08-23 20:27:13 +02:00
Mikkel Denker	87b6699e7a	Shard url state database. Some domains had a very (!) large number of urls. A lot of time was spent reading and writing the urlstates to/from disk for these domains. We now shard the urls and choose a random shard when sampling. If there are no valid urls in the shard, then the job will simply contain 0 urls, the worker will quickly finish the job and request a new domain to crawl. Rkyv uses i32 by default to represent byte offsets. This means we could not serialize/deserialize structs that are larger than approx. 2 GB. This commit also enables the 64 bit feature so we can deal with larger structs.	2023-08-23 20:09:22 +02:00
Mikkel Denker	dffc149b10	longer timeout when fetching robots.txt	2023-08-23 16:16:42 +02:00
Mikkel Denker	21f228c471	if body is empty, generate snippet from description	2023-08-23 11:54:55 +02:00
Mikkel Denker	00bc05cfba	if there is no cleantext on site, then we probably can't create a good snippet anyway	2023-08-23 11:51:30 +02:00
Mikkel Denker	47c84a0cca	reduce thresholds and minimum_clean_words to make it easier to trigger sidebars during development	2023-08-23 11:17:51 +02:00
Oliver Bøving	50304467c4	Add `#[serde(tag = "type", content = "value")]` to OpenAPI exposed types (#76) * Add `serde(tag = "type", content = "value")` to OpenAPI exposed types Makes them more ergonomic to work with in TypeScript in some scenarios. * Add `#[serde(rename_all="camelCase")]` to all types deriving `ToSchema` Currently two types are exempt: `Region` and `Expr` * Update schema names to camelCase in external files	2023-08-21 20:02:56 +00:00
Mikkel Denker	d5624ee1b4	Fix the byte/char index mixups identified in issue 77 (#78) * This should fix the byte/char index mixups identified in issue 77 * save allocation when removing trailing '/' * increase readability and prevent potential future clippy warnings	2023-08-21 17:50:14 +00:00
Mikkel Denker	0ac07ab3b4	no need to create 'warc_files' folder when downloading anymore	2023-08-21 12:09:09 +02:00
Mikkel Denker	403a740aaf	Remove complete trust with canonical urls. If siteA has a canonical url for siteB, then siteB might show up for queries where it shouldn't actually match but where siteA matches. A bad actor might use this to have weird canonical urls to some sites that they don't like. This could of course be fixed by only respecting canonical urls within the same domain, but even in that case I don't see how the canonical site will benefit the user. If siteA and siteB actually have the same content, then one of them will already be downranked due to duplication detection. Therefore it doesn't make much sense to blindly follow canonical url hints (or actually follow them at all). Will still leave the functions to extract the canonical urls, since we might want to skip indexing for sites that have a canonical url defined that's different from their own url.	2023-08-21 11:31:34 +02:00
Mikkel Denker	68e6b66b83	optimizations to make discussion optic faster	2023-08-21 11:15:18 +02:00
Mikkel Denker	7f28937209	indieweb tag extraction bug: should not match substring classes	2023-08-21 11:15:02 +02:00
Mikkel Denker	d84c973779	optimize optic matching when the pattern consists of a single normal term and nothing else	2023-08-21 09:35:19 +02:00
Mikkel Denker	11d18a3077	Remove 'post' requirement from lemmy urls. It was too slow for some reason. Will need to investigate why, but let's disable it for the time being	2023-08-21 09:20:16 +02:00
Mikkel Denker	9bdc55d258	Faster discussions optic	2023-08-21 09:11:41 +02:00
Mikkel Denker	5155767f0b	deploy indieweb optic	2023-08-21 09:03:24 +02:00
Mikkel Denker	7dbd149075	Support indieweb only optics. Still need to update the quickstart and blogroll optic to include the new match location.	2023-08-18 16:13:45 +02:00
Mikkel Denker	f1403fa7aa	fix broken link highlights in settings	2023-08-18 13:51:20 +02:00
Mikkel Denker	912dcc5a8e	cannot use alpine when having strict CSP headers. User security > developer convenience	2023-08-18 13:37:19 +02:00
Mikkel Denker	d2dc28215e	forgot to move explore script to separate file	2023-08-17 18:13:55 +02:00
Mikkel Denker	13ad5d7834	Moved more inline javascripts into files	2023-08-17 18:08:39 +02:00
Mikkel Denker	8819e6e6db	Moved all inline scripts into separate files. This allows us to set CSP headers that only allow js files from self, which reduces the XSS attack surface quite substantially.	2023-08-17 15:22:43 +02:00
Mikkel Denker	7eb5387c90	ability to easily export site rankings as an optic	2023-08-17 13:40:12 +02:00


1308 commits
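
Commit 53cad3fb5f describes storing url states directly as values, with each url key prefixed by its domain so that a domain's urls can be found with a prefix search. The sketch below illustrates that key layout only; it is not the project's implementation. It assumes an in-memory `BTreeMap` in place of the real database, and the names `UrlDb`, `UrlState`, `make_key`, and the `'\0'` separator are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical url state; the real states are not shown in the log.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UrlState {
    Pending,
    Crawled,
}

/// Assumed separator between domain and url. It sorts before every
/// character that can appear in a domain, so "a.com" and "a.com.evil"
/// never share a prefix once the separator is appended.
const SEP: char = '\0';

/// Build the domain-prefixed key for a url.
fn make_key(domain: &str, url: &str) -> String {
    format!("{}{}{}", domain, SEP, url)
}

/// In-memory stand-in for an ordered key-value store.
pub struct UrlDb {
    states: BTreeMap<String, UrlState>,
}

impl UrlDb {
    pub fn new() -> Self {
        UrlDb { states: BTreeMap::new() }
    }

    /// Inserting one url touches one key; no other url of the
    /// domain has to be read first.
    pub fn insert(&mut self, domain: &str, url: &str, state: UrlState) {
        self.states.insert(make_key(domain, url), state);
    }

    /// Prefix search: start at the first key >= "domain\0" and stop
    /// at the first key that no longer carries that prefix.
    pub fn urls_for_domain(&self, domain: &str) -> Vec<(&str, UrlState)> {
        let prefix = format!("{}{}", domain, SEP);
        let skip = prefix.len();
        self.states
            .range(prefix.clone()..)
            .take_while(|(key, _)| key.starts_with(&prefix))
            .map(|(key, state)| (&key[skip..], *state))
            .collect()
    }
}
```

The trade-off noted in the commit shows up here as well: `insert` is a single-key write, while `urls_for_domain` has to walk every key under the prefix.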