* gui
* forgot to implement bang in search_server
* added opensearch.xml file
* reduce indexer memory by committing more often
* fix query parser bug with special characters
* change autosuggest browser url
* remove '#' from url during indexing
* pre-hashed domain field
* we will also have a subscription model without ads
* site query should not match partial domains from tokenization
* ttl cache
* send site rankings from js to backend during search
* apply site rankings during search
* refactor network communication into a separate module and make mapreduce async again
* sonic module is simple enough as is
* rename Searcher -> LocalSearcher
* [WIP] distributed searcher structure outlined
* split index search into initial and retrieval steps
* distributed searcher searching shards
* make bucket in collector generic
* no more todo!s; waiting for indexing to finish to test the implementation
* distributed searcher seems to work. Needs an enormous refactor - the code is really ugly
* cleanup search-server on exit in justfile
* sort postings by a pre-computed field
* Limit the number of webpages considered for ranking during search.
At the moment we rank the top 10 million pages. This is possible because we have sorted the posting lists by
all pre-computable signals. The chance of "the best" webpage ranking beyond 10 million is extremely low, and this will let us speed up search significantly.
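The cutoff described above can be sketched roughly as follows. This is an illustrative toy, not the actual code; `Candidate`, `RANKING_BUDGET`, and `candidates_for_ranking` are invented names, and the real posting lists live in the index rather than in a slice.

```rust
/// Upper bound on how many webpages the expensive ranking stage considers.
const RANKING_BUDGET: usize = 10_000_000;

#[derive(Clone, Copy)]
struct Candidate {
    doc_id: u64,
    precomputed_score: f64,
}

/// Posting lists are assumed to be pre-sorted (descending) by the
/// pre-computed signal, so truncating to a fixed budget keeps the best
/// candidates and skips the long tail before full ranking runs.
fn candidates_for_ranking(sorted_postings: &[Candidate]) -> &[Candidate] {
    let n = sorted_postings.len().min(RANKING_BUDGET);
    &sorted_postings[..n]
}
```

The point of the sorting is exactly this: a hard prefix cut is only safe because everything beyond the budget is already known to score worse on the pre-computable signals.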
The data folder was starting to get too large, which led to some difficulties with LFS timing out (I think) when uploading the tar file. We will need a better strategy for this in the future, but for the time being it is disabled.
* pre-hash site and penalize many results from same site
* re-adjust deduplication penalty
* fix performance by limiting number of buckets
* fixed page offsets
* move signal from goggles into ranking module
* refactor webpage test-constructor
* add page_centrality field
* use page centrality during ranking
* small justfile refactoring
* update index in lfs
Before, we merged all segments into a single segment. This has the benefit of reducing disk IO during search, but has the (quite huge) problem that multi-threaded search becomes non-trivial.
This PR implements a way for us to control how many segments the indexer produces in the end. Ideally we want roughly as many segments as there are threads on the search server, so that each thread gets one segment.
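The merge-planning arithmetic behind this is simple. A hypothetical sketch (the function name and shape are illustrative, not the indexer's real API): given the current segment count and the target, ceiling division gives how many segments each merge should fold together so the output lands at or below the target.

```rust
/// How many existing segments each merge should combine so that the index
/// ends up with at most `target_segments` segments (ideally one per search
/// thread). Plain ceiling division.
fn segments_per_merge(num_segments: usize, target_segments: usize) -> usize {
    (num_segments + target_segments - 1) / target_segments
}
```

For example, folding 32 fresh segments down to 8 (one per thread on an 8-core search server) means merging 4 at a time.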
The issue came from the fact that we didn't grab the term frequencies from the correct field, so it looked like the term "torvalds" had 0 documents, since all the documents we were looking at had the term stemmed (probably to "torvald" or something).
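A toy illustration of that bug, assuming nothing about the real index structures (the `HashMap` stands in for a field's term dictionary): the stemmed field only knows the stemmed form, so looking up the raw query term there reports zero documents.

```rust
use std::collections::HashMap;

/// Document frequency of `term` in a field's term dictionary,
/// 0 if the term is absent.
fn doc_freq(field: &HashMap<&str, u32>, term: &str) -> u32 {
    *field.get(term).unwrap_or(&0)
}
```

The fix is to read frequencies from the field whose tokenization matches the query term (or stem the query term the same way before the lookup).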
* Use faker data in askama templates during dev
By adding, for example, a `$ {{lorem.limes}}` to askama `...` expressions,
fake data is inserted during development.
Additionally, all askama-specific text (such as `{% for x in xs %}`) is
not produced during development, for cleaner pages.
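The mechanism might look roughly like the following sketch. This is an assumption about the approach, not the actual build step; `preprocess` is an invented helper, and the real version would presumably handle arbitrary faker expressions rather than one fixed marker.

```rust
/// In dev builds, swap the placeholder expression for fake data before the
/// template is rendered; otherwise strip it so it never reaches production
/// output. (Toy version handling a single fixed marker.)
fn preprocess(template: &str, dev: bool) -> String {
    let marker = "$ {{lorem.limes}}";
    if dev {
        template.replace(marker, "Lorem ipsum dolor sit amet")
    } else {
        template.replace(marker, "")
    }
}
```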
* Add astro-icons/heroicons and cleanup some styling (#45)
* Add astro-icons/heroicons and cleanup some styling
* Make the header a component