Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
46d385e696 empty region file should not crash region count 2022-10-06 10:20:07 +02:00
Mikkel Denker
002eb7f79c index warc files as they become available 2022-10-05 17:21:59 +02:00
Mikkel Denker
8d1fdcc6fb Use title hash, site hash and url hash in collector 2022-10-05 15:23:29 +02:00
Mikkel Denker
b8d6afc753 merge subdomain counts 2022-10-05 12:48:48 +02:00
Mikkel Denker
7a65ebbe00 fastfield cache 2022-10-05 12:02:20 +02:00
Mikkel Denker
473c2325e0 host centrality threshold during indexing 2022-10-04 15:32:54 +02:00
Mikkel Denker
a6ea2e7ba0 count number of subdomains during indexing 2022-10-04 15:03:57 +02:00
Mikkel Denker
677f3a40b1
Ftr/site ranking adjustment (#61)
* gui

* forgot to implement bang in search_server

* added opensearch.xml file

* reduce indexer memory by committing more often

* fix query parser bug with special characters

* change autosuggest browser url

* remove '#' from url during indexing

* pre-hashed domain field

* we will also have a subscription model without ads

* site query should not match partial domains from tokenization

* ttl cache

* send site rankings from js to backend during search

* apply site rankings during search
2022-10-04 14:16:02 +02:00
Mikkel Denker
2064cc4ad5 refactor searcher to have separate paths for api and html 2022-09-30 15:31:08 +02:00
Mikkel Denker
a4ecf5c307 merge 2022-09-28 16:05:10 +02:00
Mikkel Denker
3cc7c84a32
Ftr/distributed search (#59)
* refactor network communication into separate module and made mapreduce async again

* sonic module is simple enough as is

* rename Searcher -> LocalSearcher

* [WIP] distributed searcher structure outlined

* split index search into initial and retrieval steps

* distributed searcher searching shards

* make bucket in collector generic

* no more todo!s. Waiting for indexing to finish to test implementation

* distributed searcher seems to work. Needs an enormous refactor - the code is really ugly

* cleanup search-server on exit in justfile
2022-09-28 15:50:45 +02:00
dependabot[bot]
0c9cda6fab
Bump axum-core from 0.2.6 to 0.2.8 (#58)
Bumps [axum-core](https://github.com/tokio-rs/axum) from 0.2.6 to 0.2.8.
- [Release notes](https://github.com/tokio-rs/axum/releases)
- [Changelog](https://github.com/tokio-rs/axum/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/axum/compare/axum-core-v0.2.6...axum-core-v0.2.8)

---
updated-dependencies:
- dependency-name: axum-core
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-28 15:50:18 +02:00
Mikkel Denker
e22d4a037c tweak ranking values in test case 2022-09-27 10:23:21 +02:00
Mikkel Denker
739608bed8 ranking tweaks and fixed dockerfile 2022-09-27 10:21:12 +02:00
Mikkel Denker
f70af22c35 if updated time is in the future, return None 2022-09-27 10:18:52 +02:00
Mikkel Denker
f4446dcd11 add additional fields to the schema that might be used in the future to reduce 'duplicates feeling' in search results 2022-09-27 10:04:22 +02:00
Mikkel Denker
963bd3aef7 forgot to implement is_computable_before_search function 2022-09-22 15:06:16 +02:00
Mikkel Denker
b902a3b7aa
Ftr/sort postings by score (#57)
* sort postings by a pre-computed field

* Limit the number of webpages considered for ranking during search.
At the moment we rank top 10 million pages. This can be done since we have sorted the postinglists based on
all pre-computable signals. The chance of "the best" webpage being at > 10 mil is extremely low, and this will allow us to speed up search significantly.
2022-09-22 13:11:54 +02:00
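
To illustrate the posting cap described in b902a3b7aa (#57), here is a minimal sketch, not the code from that commit. It assumes the posting list is already sorted by the pre-computed signal (best first) and that each posting carries a final score. The names `Scored` and `top_k_capped` are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A posting as assumed for this sketch: a final score plus a document id.
/// Ordering compares `score` first, then `doc_id` (derive uses field order).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Scored {
    pub score: u64,
    pub doc_id: u32,
}

/// Returns the `k` best documents while looking only at the first
/// `max_considered` postings. Because `postings` is assumed to be sorted by
/// the pre-computed signal, the skipped tail is unlikely to hold a winner.
pub fn top_k_capped<I>(postings: I, max_considered: usize, k: usize) -> Vec<Scored>
where
    I: IntoIterator<Item = Scored>,
{
    // `Reverse` turns the max-heap into a min-heap, so the current worst
    // of the kept candidates sits on top and is cheap to evict.
    let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::new();

    for posting in postings.into_iter().take(max_considered) {
        heap.push(Reverse(posting));
        if heap.len() > k {
            heap.pop(); // drop the current worst candidate
        }
    }

    let mut best: Vec<Scored> = heap.into_iter().map(|Reverse(s)| s).collect();
    best.sort_by(|a, b| b.cmp(a)); // best first
    best
}
```

With `max_considered` set to 10 million, the work per query is bounded no matter how long the posting list is. The cost is that a document beyond the cap can never be returned.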
Mikkel Denker
5b4a7cc366 added 'try on {different engine}' buttons 2022-09-22 10:57:33 +02:00
Mikkel Denker
e649053260 update setup steps 2022-09-22 09:49:50 +02:00
Mikkel Denker
0cbca3363c Remove git lfs for the time being.
The data folder was starting to get too large, which led to some difficulties with LFS timing out (I think) when uploading the tar file. We will need a better strategy for this in the future, but for the time being it is disabled.
2022-09-21 10:05:01 +02:00
Mikkel Denker
9f12cca77e tweak ranking 2022-09-21 09:29:53 +02:00
Mikkel Denker
78b238d2bb Score goggles as a const (while taking boost into account).
Also added a goggles scale to prioritize goggles more
2022-09-20 19:28:22 +02:00
Mikkel Denker
02c799c51b don't spellcheck non-alphabetic words 2022-09-20 18:16:05 +02:00
Mikkel Denker
2f40f7fc9a reduce indexer memory by closing indexes 2022-09-20 15:32:39 +02:00
Mikkel Denker
164673761e added url_without_query_hash field so we can reduce similar url search results even further if we want 2022-09-20 14:27:20 +02:00
Mikkel Denker
0da016f678 added domain_name_if_hompage field for better navigation queries 2022-09-20 14:14:19 +02:00
Mikkel Denker
574321f3d8 fixed very nasty bug where collector would not update the segment_id, leading to completely wrong search results 2022-09-20 13:15:45 +02:00
Mikkel Denker
bea5e8e60e re-enable multi threaded search 2022-09-18 21:13:37 +02:00
Mikkel Denker
15fe347dc5 goggle performance speedup 2022-09-18 20:21:19 +02:00
Mikkel Denker
fd3e99ca50 embed all images directly in search page using base64 2022-09-16 13:28:38 +02:00
Mikkel Denker
b9e1b1352a Dockerfile 2022-09-15 13:34:47 +02:00
Mikkel Denker
7176744102 tighten num buckets bound 2022-09-14 17:14:51 +02:00
Mikkel Denker
fb8cfe1f30
Ftr/de similar results (#56)
* pre-hash site and penalize many results from same site

* re-adjust deduplication penalty

* fix performance by limiting number of buckets

* fixed page offsets
2022-09-14 16:16:13 +02:00
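
The "penalize many results from same site" idea in fb8cfe1f30 (#56) can be sketched roughly as below. This is an assumption-laden illustration: the commit works with buckets inside the collector, whereas this sketch re-ranks a finished result list. `Hit`, `penalize_same_site` and the multiplicative `penalty` factor are all hypothetical.

```rust
use std::collections::HashMap;

/// A search hit as assumed for this sketch: a pre-hashed site and a score.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub site_hash: u64,
    pub score: f64,
}

/// Re-ranks `hits` so that every additional result from an already seen site
/// is scaled by `penalty` (expected in 0..=1) once per earlier hit from that
/// site.
pub fn penalize_same_site(mut hits: Vec<Hit>, penalty: f64) -> Vec<Hit> {
    // Visit hits best first so the strongest page per site stays unpenalized.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));

    let mut seen: HashMap<u64, i32> = HashMap::new();
    for hit in hits.iter_mut() {
        let count = seen.entry(hit.site_hash).or_insert(0);
        hit.score *= penalty.powi(*count);
        *count += 1;
    }

    // Scores changed, so sort again to get the final order.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits
}
```

Comparing pre-hashed `u64` site values avoids string comparisons in the hot path, which is presumably why the site was pre-hashed at indexing time.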
Mikkel Denker
8c9ffede30
Ftr/page centrality (#55)
* move signal from goggles into ranking module

* refactor webpage test-constructor

* add page_centrality field

* use page centrality during ranking

* small justfile refactoring

* update index in lfs
2022-09-13 11:49:50 +02:00
Mikkel Denker
65cf8f9f53
Control the number of segments after merge. (#54)
Before, we merged all segments into a single segment. This has the benefit of reducing the disk IO during search, but has the (quite huge) problem that multi-threaded search now becomes non-trivial.
This PR implements a way for us to control how many segments the indexer should produce in the end. We would ideally like to target roughly the same number of segments as there are threads on the search server, such that each thread gets one segment.
2022-09-13 08:50:22 +02:00
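
One way to picture the segment-count control from 65cf8f9f53 (#54) is a merge plan that groups existing segments into a target number of roughly equal groups, one per search thread. The sketch below uses a simple greedy assignment (largest segment first, into the currently smallest group). It is an assumption about how such a plan could look, not the strategy from the PR, and `plan_merges` is a hypothetical name.

```rust
use std::cmp::Reverse;

/// Groups segments into at most `target` merge groups of roughly equal size.
/// `sizes[i]` is the document count of segment `i`; the returned groups hold
/// segment indices, and each group would be merged into one final segment.
pub fn plan_merges(sizes: &[u64], target: usize) -> Vec<Vec<usize>> {
    if target == 0 || sizes.is_empty() {
        return Vec::new();
    }

    // Never plan more groups than there are segments.
    let num_groups = target.min(sizes.len());
    let mut groups: Vec<(u64, Vec<usize>)> = vec![(0, Vec::new()); num_groups];

    // Largest segments first, each placed into the currently smallest group.
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    order.sort_by_key(|&i| Reverse(sizes[i]));

    for i in order {
        let smallest = groups
            .iter_mut()
            .min_by_key(|g| g.0)
            .expect("num_groups is at least 1");
        smallest.0 += sizes[i];
        smallest.1.push(i);
    }

    groups.into_iter().map(|(_, members)| members).collect()
}
```

Balancing by document count matters here because a search is only as fast as the thread with the largest segment.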
Mikkel Denker
d81fe0cc58
The query 'linus torvalds' now produces the correct snippet. (#52)
The issue came from the fact that we didn't grab the term-frequencies from the correct field, thus it looked like the term "torvalds" had 0 documents since all documents we were looking at had stemmed the term (probably to "torvald" or something).
2022-09-12 18:22:32 +02:00
Mikkel Denker
06e5a73348 simplify term proximity by not using ngrams, only slop 2022-09-12 14:36:21 +02:00
Oliver Bøving
0b0d453e6f Refactor search page further 2022-09-11 17:08:41 +02:00
Mikkel Denker
8446c38c6d
Ftr/term proximity ranking (#49)
* term proximity ranking seems to work

* refactor term-proximity queries into separate function
2022-09-11 16:13:54 +02:00
Oliver Bøving
4204802b28 Make the main header icon smaller, and improve more responsiveness 2022-09-11 16:01:27 +02:00
Oliver Bøving
30af194bf0 Make search page more responsive 2022-09-11 15:00:20 +02:00
Oliver Bøving
5176802207 Move around some prettier options 2022-09-11 14:20:07 +02:00
Mikkel Denker
7bf0dc844f
Bug/retry image download (#48)
* retry image download with exponential backoff on failure

* fixed warc-file code header
2022-09-11 13:53:13 +02:00
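
For readers unfamiliar with the retry pattern named in 7bf0dc844f (#48), a minimal standard-library sketch follows. It is not the code from that commit: `retry_with_backoff`, its parameters, and the blocking `thread::sleep` are assumptions made for illustration, and a real downloader would likely be async and add jitter.

```rust
use std::thread;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` attempts have been made,
/// doubling the wait between attempts. `op` always runs at least once.
/// Returns the last error if every attempt fails.
pub fn retry_with_backoff<T, E, F>(
    mut op: F,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                // Exponential backoff: base, 2x base, 4x base, ...
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the download in a closure, for example `retry_with_backoff(|| download(url), 5, Duration::from_millis(100))`, where `download` stands in for whatever function fetches the image.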
Mikkel Denker
51834be843 center entity again 2022-09-11 13:27:20 +02:00
Mikkel Denker
1c57e94687 merge main into branch 2022-09-11 13:27:02 +02:00
Oliver Bøving
0009b79ffd
Use faker data in askama templates during dev (#46)
* Use faker data in askama templates during dev

By adding a `$ {{lorem.limes}}` for example to askama`...` expressions,
fake data is inserted during development.

Additionally all askama specific text (such as {% for x in xs %}) is
not produced during development for cleaner pages.

* Add astro-icons/heroicons and cleanup some styling (#45)

* Add astro-icons/heroicons and cleanup some styling

* Make the header a component
2022-09-11 13:17:11 +02:00
Oliver Bøving
9012236ce6 Add mobile menu and more responsive tweaks 2022-09-11 13:07:03 +02:00
Oliver Bøving
f988616b8f Use faker data in askama templates during dev
By adding a `$ {{lorem.limes}}` for example to askama`...` expressions,
fake data is inserted during development.

Additionally all askama specific text (such as {% for x in xs %}) is
not produced during development for cleaner pages.
2022-09-11 13:07:03 +02:00
Oliver Bøving
5223c57021
Add astro-icons/heroicons and cleanup some styling (#45)
* Add astro-icons/heroicons and cleanup some styling

* Make the header a component
2022-09-11 13:00:53 +02:00