Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
fe5c366949
Ftr/custom goggles (#44)
* approx 1.5x search performance boost

* [WIP] choose goggles from frontend

* font-light by default

* delete icon in goggles

* ability to delete custom goggles

* custom user goggles
2022-09-10 17:27:48 +02:00
Oliver Bøving
4b2a34feb4
Make Montserrat a variable font (#43)
What was missing was adding it to the tailwind font list
2022-09-10 15:36:12 +02:00
Oliver Bøving
adcaa5ddfb
Refactor askama templating to use function rather than components (#41) 2022-09-10 15:35:19 +02:00
Oliver Bøving
77edf2a1d5
Add brand color to tailwind colors (#42) 2022-09-10 14:39:08 +02:00
Oliver Bøving
f972b163c5
Optimize frontend build time (#39)
This moves building the astro frontend from build.rs into the justfile.

This streamlines the build process for the frontend astro part, and the
frontend application itself by letting cargo watch rebuild the astro and
then the Rust binary, instead of building astro in build.rs.

Inconclusive results suggest that this improves build times from about
13s to 6s, while being more consistent :)
2022-09-10 12:23:33 +02:00
Oliver Bøving
059d52c54c
Use fontsource for loading Montserrat (#40)
This commit removes the Montserrat font files stored in public and replaces
them with the fonts installed with @fontsource/montserrat.

This also streamlines the font import process, and ensures that correct
typeface formats are loaded.

One thing to consider is using variable fonts instead. I failed to get
them loading with this setup, but it seems to be supported.
2022-09-10 12:19:30 +02:00
Mikkel Denker
8c08a88259 update index in lfs 2022-09-09 13:30:49 +02:00
Mikkel Denker
c2b851bb44
Ftr/customized ranking (#38)
* goggles parser

* support weird urls in goggle

* quite significant speedboost during indexing (approx x1.7)

* merge index explicitly in indexer

* query benchmark

* search performance improvements primarily by not using hash during search

* turn goggle into tantivy query

* goggle benchmark

* Goggles are working!!

* fixed bug where goggle would enforce that sitename must be A and B and C etc. instead of A or B or C

* document some more goggle syntax

* less bold search highlights
2022-09-09 12:28:36 +02:00
Oliver Bøving
e1a6e3322d
Build frontend using Astro (#37)
* Build frontend using Astro

This commit replaces the prior pure Askama templates with templates
statically generated by Astro.

The purpose of this is to enable features such as MD/MDX,
minimization, better Tailwind integration, and JSX'ish component syntax.

In the future a frontend framework, like React, could also be added, while
still compiling the frontend templates statically.

* Delete README.md

* Make the searchbar more round

* Run `npm install` in build.rs

* Fix some image paths

* Make search bar suggestion visibility CSS only

The suggestions are only shown when the input has focus and is not empty

* Add prettier astro and tailwind plugin

* Remove meta astro tag

* Move privacy.astro to privacy-and-happy-lawyers.astro

* Convert privacy and about page to MDX

* Remove old about page

Co-authored-by: Mikkel Denker <Mikkeldenker@gmail.com>
2022-09-09 12:15:53 +02:00
Mikkel Denker
02559946c6 json api 2022-09-01 17:07:37 +02:00
Mikkel Denker
2ea1594852 More search pages 2022-09-01 15:51:23 +02:00
Mikkel Denker
d064895d0b
Ftr/customized ranking (#30)
* [WIP] refactoring ranking signal coefficients into a trait

* refactoring ranking signal coefficients into a trait

* parser for custom signal aggregator

* ability to customize signal aggregation

* update readme
2022-09-01 14:04:21 +02:00
Mikkel Denker
49d71ee0db Optional download images during indexing 2022-08-31 20:02:57 +02:00
Mikkel Denker
267e573334 control indexing destination when running locally 2022-08-31 15:42:32 +02:00
Mikkel Denker
fc583c5a29 bump rocksdb version 2022-08-31 15:35:06 +02:00
Mikkel Denker
80feaf6fe8 Tantivy main branch uses some beta features 2022-08-31 15:21:10 +02:00
Mikkel Denker
c508eca9f7 update readme 2022-08-31 15:04:49 +02:00
Mikkel Denker
9445548210 A bunch of frontend updates 2022-08-31 15:00:36 +02:00
Mikkel Denker
07b505a06d update index in lfs 2022-08-30 10:08:47 +02:00
Mikkel Denker
d8bcd409c4 only show images if the query matches the title or description 2022-08-28 11:18:23 +02:00
Mikkel Denker
04c33127a4 added bangs 2022-08-27 16:45:22 +02:00
Mikkel Denker
70d3aa19e4 more reliable region detection 2022-08-26 16:29:21 +02:00
Mikkel Denker
9e79724e44 make webgraph send + sync and retrieve backlinks from webgraph during indexing 2022-08-26 14:11:25 +02:00
Mikkel Denker
efa1c229fb speedup webgraph by not parsing texts 2022-08-26 13:24:01 +02:00
Mikkel Denker
1bdafe9bd4 Download warc files in async batches 2022-08-26 11:50:56 +02:00
Mikkel Denker
62a053c2b1 if an error happens while parsing a warc file, don't panic 2022-08-25 16:46:02 +02:00
Mikkel Denker
2871d6bc80 Close each webgraph after creation.
This ensures that we won't have too many open files.
2022-08-25 14:54:26 +02:00
Mikkel Denker
1a99076eae count 'www' and no-subdomain as the same in webgraph. Also other small bugfixes in webgraph 2022-08-25 10:37:51 +02:00
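The 'www' handling described in the commit above can be illustrated with a minimal sketch. This is not the code from the commit: `normalize_host` is a hypothetical helper name, and it assumes the host has already been extracted from the URL as a plain string.

```rust
/// Hypothetical helper (not from the repository): strips one leading "www."
/// label so that "www.example.com" and "example.com" map to the same key.
/// Other subdomains are left untouched.
fn normalize_host(host: &str) -> &str {
    host.strip_prefix("www.").unwrap_or(host)
}
```

With this kind of normalization, links to either spelling of a host would be counted against the same node, while a host such as `blog.example.com` stays distinct.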
Mikkel Denker
467ed085de clippy shenanigans 2022-08-24 20:00:58 +02:00
Mikkel Denker
af61109248 pre-calculate and cache webpage texts 2022-08-24 19:09:33 +02:00
Mikkel Denker
cfeb68ac1c Remove custom html lexer
Kuchiki (the new parser) builds on top of html5ever and should be waaaay more robust.
2022-08-24 15:31:45 +02:00
Mikkel Denker
5c35f5d10c dependency cleanup 2022-08-24 11:08:39 +02:00
Mikkel Denker
e00fc23c7c Fix autocomplete click bug.
When clicking on autocomplete elements in the searchbar, the <b> tags were included in the new search query.
2022-08-24 09:23:29 +02:00
Mikkel Denker
847cf07e84 use backlink text during ranking 2022-08-24 09:00:52 +02:00
Mikkel Denker
e4a13faf17
Ftr/region search (#26)
* show 'found X results in Y seconds'

* choose region from SERP

* take region into account during ranking
2022-08-23 21:21:13 +02:00
Mikkel Denker
c1f7c0cc89
Ftr/derank trackers (#25)
* get potential trackers from html

* take number of trackers into account during ranking

* remove google fonts
2022-08-23 10:08:48 +02:00
Mikkel Denker
d902e68d72 parallel local indexing 2022-08-22 10:43:02 +02:00
Mikkel Denker
9a3292fd34 make processed term a phrase query when possible 2022-08-22 09:21:13 +02:00
Mikkel Denker
6814329e12
Ftr/advanced query syntax (#22)
* minor ranking tweaks

* Remove term proximity ranking.
Term proximity has been saved to a separate branch for the future, but will not be used during ranking.
Experimenting with 10 warc files seemed to indicate that the term proximity ranking actually worsened the
search results quite substantially. We might want to re-introduce something like it in the future, but will
probably have to devote more time to parameter tuning. Maybe doing something naive like searching for bi-grams and
tri-grams with bm25 ranking is sufficient?

* support 'not' search query

* site query

* title query

* body query

* url query
2022-08-22 08:34:47 +02:00
Mikkel Denker
3196c7b8ce Merge branch 'main' of github.com:Cuely/Cuely 2022-08-20 16:35:27 +02:00
Peulicke
2f4554a2b1
Bm25 requires at least one term #7 (#21)
* Bm25 requires at least one term #7

* move empty query handling into the searcher and handle error gracefully

Co-authored-by: Mikkel Denker <mikkel@cuely.io>
2022-08-20 16:35:14 +02:00
Mikkel Denker
965f0b8ba0 move empty query handling into the searcher and handle error gracefully 2022-08-20 16:27:43 +02:00
Oscar
7f9ec4ab4b Bm25 requires at least one term #7 2022-08-19 16:14:47 +02:00
Mikkel Denker
22468e434a
Remove term proximity ranking (#20)
* minor ranking tweaks

* Remove term proximity ranking.
Term proximity has been saved to a separate branch for the future, but will not be used during ranking.
Experimenting with 10 warc files seemed to indicate that the term proximity ranking actually worsened the
search results quite substantially. We might want to re-introduce something like it in the future, but will
probably have to devote more time to parameter tuning. Maybe doing something naive like searching for bi-grams and tri-grams with bm25 ranking is sufficient?
2022-08-19 12:56:23 +02:00
Mikkel Denker
990dba7f86
Ftr/lalrpop query parser (#16)
* urlencode all the strings

* don't remove punctuations during tokenization

* re-write query parser to lalrpop

* update index in lfs
2022-08-18 20:45:29 +02:00
Mikkel Denker
e313b1c6cc New strategy for git-lfs structure.
RocksDb changes its index each time it is opened, which leads to many uploads to lfs. Also, for some reason, the index in lfs seems to be broken.
2022-08-18 09:28:49 +02:00
Mikkel Denker
e40b8f020a Fix small spell correction bug.
If the query contained uppercase terms, the word splitter would report false positives.
The commit also uploads a new index to lfs.
2022-08-18 09:02:01 +02:00
Mikkel Denker
c5ec4f46f1
Ftr/spellcheck (#14)
* simple spell corrections

* Only store top n terms in spell dictionary

* remove some weird characters from terms

* split compound words

* small cleanup
2022-08-17 17:37:12 +02:00
Peulicke
2b9cf6a269
Deduplicate terms before search #6 (#12) 2022-08-17 14:45:18 +02:00
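As an illustration of what deduplicating terms before search could look like, here is a minimal sketch. It is not the code from the commit: `dedup_terms` is a hypothetical name, and it assumes the query has already been split into string terms and that the order of first occurrences should be preserved.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not from the repository): removes repeated query
/// terms while keeping the first occurrence of each term in its original order.
fn dedup_terms(terms: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for term in terms {
        // `insert` returns false when the term was already present.
        if seen.insert(*term) {
            unique.push((*term).to_string());
        }
    }
    unique
}
```

The comparison here is case-sensitive; whether terms are lowercased before this step is left open, since the commit message does not say.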
Mikkel Denker
cdb140e240 update .gitignore 2022-08-17 09:53:53 +02:00