Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
f494a11a1a
Accessibility overhaul (#231)
* high contrast theme

* improve '/explore' error messages when site is invalid

* change aria-expanded when search suggestions are displayed

* wrap search suggestions in <ul> and <li> items to ensure screen reader knows how many suggestions there are

* move focus to modal when it opens. trap focus until modal is closed again

* add aria-expanded to each result that is true iff. the modal is expanded for that result

* entire navigation bar inside <nav> element

* add skip link to navbar to jump to main content of page

* improve focus indicators for selected /explore sites

* more descriptive titles for explore page interactive elements

* group settings in fieldsets and use title+description as legend

* add language to setting input fields to ensure required fields error is read in correct language on screenreaders

* add headings to serp

* add title to hamburger menu on mobile

* fix firefox accessibility errors

* only show button outline on tab focus
2024-10-09 11:12:09 +02:00
Mikkel Denker
8f6b23734a [live index] decrease compaction interval 2024-10-07 12:03:25 +02:00
Mikkel Denker
3a127af0a7 improve compaction performance in live index by performing the initial segment merge on a read lock and only switching to the new segments on a write lock. this ensures that search requests can still be served while the heavy part of the merge is executing 2024-10-07 11:59:36 +02:00
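The two-phase locking idea in this commit can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `Segment`, `LiveIndex` and all method names are hypothetical, segments are modelled as plain lists of document ids, and the sketch assumes that segments are only ever appended and that a single compaction runs at a time.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for an index segment: a list of document ids.
#[derive(Clone, Debug, PartialEq)]
pub struct Segment {
    pub docs: Vec<u64>,
}

/// Hypothetical live index that keeps its segments behind a reader-writer lock.
pub struct LiveIndex {
    segments: RwLock<Vec<Segment>>,
}

impl LiveIndex {
    pub fn new(segments: Vec<Segment>) -> Self {
        Self { segments: RwLock::new(segments) }
    }

    /// Searches only need a read lock, so they can run during phase 1 of `compact`.
    pub fn search(&self, doc: u64) -> bool {
        self.segments.read().unwrap().iter().any(|s| s.docs.contains(&doc))
    }

    pub fn num_segments(&self) -> usize {
        self.segments.read().unwrap().len()
    }

    /// Assumes segments are append-only and that only one compaction runs at a time.
    pub fn compact(&self) {
        // Phase 1: build the merged segment while holding only a read lock.
        let (merged, merged_count) = {
            let segments = self.segments.read().unwrap();
            let mut docs: Vec<u64> = segments
                .iter()
                .flat_map(|s| s.docs.iter().copied())
                .collect();
            docs.sort_unstable();
            docs.dedup();
            (Segment { docs }, segments.len())
        };

        // Phase 2: take the write lock only for the cheap swap.
        let mut segments = self.segments.write().unwrap();
        // Segments appended between the two phases are kept as they are.
        let newer = segments.split_off(merged_count);
        *segments = std::iter::once(merged).chain(newer).collect();
    }
}
```

In this sketch the write lock is held only while the segment list is swapped, so a slow merge blocks writers of the segment list briefly and never blocks readers for the duration of the merge.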
Mikkel Denker
847c14251a bump edge limit for ingoing nodes to 128 in SimilarHostsFinder
now that nofollow links are filtered, we need to fetch more edges from the webgraph to have good enough accuracy on /explore page
2024-10-05 15:12:07 +02:00
Mikkel Denker
db2c6c3eb9 increase compaction interval
merging the segments right now takes a write lock for the entirety of the operation. this causes all searches to time out whenever the live index compacts its segments. we should actually be able to split up the merge operation to create the merged segment on a read lock and only take a write lock when switching and cleaning the old segments for the new one. increasing the compaction interval is only a temporary fix
2024-10-04 13:15:20 +02:00
Mikkel Denker
87fbd3d709 index compaction shouldn’t update creation date as this would cause all segments to eventually have same creation date 2024-10-04 11:25:12 +02:00
Mikkel Denker
2492ebdd41 make sure live index node doesn't try to replicate to itself 2024-10-03 17:07:15 +02:00
Mikkel Denker
e16e353249 increase coefficient for update_timestamp to show more recent articles from live index 2024-10-03 14:04:26 +02:00
Mikkel Denker
c7ea2fd8ff add smoothing factor to timestamp to give it a half life of 3 days 2024-10-03 13:09:31 +02:00
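A half life of 3 days can be illustrated with a plain exponential decay. The commit does not show the formula it uses, so the sketch below is an assumption: `recency_score` and `HALF_LIFE_SECS` are hypothetical names, and the score simply halves for every 3 days of age.

```rust
/// Half life of 3 days, expressed in seconds (value taken from the commit message).
pub const HALF_LIFE_SECS: f64 = 3.0 * 24.0 * 60.0 * 60.0;

/// Hypothetical recency signal: 1.0 for a brand-new page, halving every 3 days.
/// Negative ages (timestamps in the future) are treated as age zero.
pub fn recency_score(age_secs: f64) -> f64 {
    0.5_f64.powf(age_secs.max(0.0) / HALF_LIFE_SECS)
}
```

Under this assumption a page updated 3 days ago scores 0.5 and one updated 6 days ago scores 0.25, so recent live index articles are favoured without older pages dropping to zero abruptly.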
Mikkel Denker
bc45409887 re-open live index after compaction to reflect segment changes 2024-10-03 11:40:39 +02:00
Mikkel Denker
5a1a5a3225 remove minimum budget of 1 per day in live index 2024-10-03 10:19:44 +02:00
Mikkel Denker
3795dbb64b sonic propagate err 2024-10-02 16:45:55 +02:00
Mikkel Denker
4915160449 harmonic centrality nearest neighbor calculation that uses the harmonic centrality of the highest neighbors node as a seed node proxy for the centrality of that node (with a discount factor) 2024-10-02 15:57:30 +02:00
Mikkel Denker
c3a1a82b66 prevent division by zero 2024-10-02 15:14:36 +02:00
Mikkel Denker
240009db93 optionally skip db init 2024-10-02 15:11:49 +02:00
Mikkel Denker
9fc02932d9 refactor GetSiteUrls into a response stream for reusability 2024-10-02 12:59:44 +02:00
Mikkel Denker
69ad95e4e8 add first h1, all h2 and all h3 tags as fields in index
might be useful for ranking later
2024-10-02 10:31:48 +02:00
Mikkel Denker
36fab02c9d implement '<base>' tags for relative urls 2024-10-02 09:56:55 +02:00
Mikkel Denker
089f609e70 add svelte.dev to devdocs 2024-10-02 09:35:35 +02:00
Mikkel Denker
f6bca073db enforce edgelimit to only be applied in webgraph mod.rs so offset is only applied once 2024-10-02 09:22:09 +02:00
Mikkel Denker
9414c4277a implement EdgeLimit::LimitAndOffset in webgraph.
to make sure the edges are returned in sorted order, we implement a new FlatSortedBy struct that takes a vec of iters and outputs the sorted order of the iter items. it assumes that each iter is already sorted so it doesn't have to traverse the entire result set
2024-10-01 15:28:27 +02:00
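Merging several already-sorted iterators lazily is commonly done with a min-heap that holds one head item per iterator. The sketch below shows that technique; it is not the `FlatSortedBy` implementation from the commit. `MergeSorted` is a hypothetical name, and for brevity it relies on the items' own `Ord` ordering instead of a custom sort key.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical k-way merge over iterators that are each already sorted ascending.
pub struct MergeSorted<I: Iterator> {
    iters: Vec<I>,
    // Min-heap of (head item, index of the iterator it came from).
    heap: BinaryHeap<Reverse<(I::Item, usize)>>,
}

impl<I> MergeSorted<I>
where
    I: Iterator,
    I::Item: Ord,
{
    pub fn new(mut iters: Vec<I>) -> Self {
        let mut heap = BinaryHeap::with_capacity(iters.len());
        // Seed the heap with the first item of every non-empty iterator.
        for (idx, iter) in iters.iter_mut().enumerate() {
            if let Some(item) = iter.next() {
                heap.push(Reverse((item, idx)));
            }
        }
        Self { iters, heap }
    }
}

impl<I> Iterator for MergeSorted<I>
where
    I: Iterator,
    I::Item: Ord,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Pop the smallest head, then refill from the iterator it came from.
        let Reverse((item, idx)) = self.heap.pop()?;
        if let Some(next) = self.iters[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(item)
    }
}
```

Because the merge is lazy, a limit and offset can be applied with `.skip(offset).take(limit)` and only `offset + limit` items (plus one buffered head per iterator) are ever pulled from the inputs.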
Mikkel Denker
d53313388e make sure no-follow links aren't used in /explore and to score host similarity 2024-10-01 13:53:27 +02:00
Mikkel Denker
b160436297 upper bounds for in degree and out degree in webgraph 2024-10-01 13:27:02 +02:00
Mikkel Denker
c4ea75ad2a only consider simhash duplicates if we don't have enough non-duplicates 2024-10-01 12:04:27 +02:00
Mikkel Denker
c5a465b7d6 group urls by domain centrality in planner and process groups by highest centrality first 2024-10-01 10:37:23 +02:00
Mikkel Denker
945c358989 increase minimum delay between requests to 10 sec 2024-10-01 10:10:19 +02:00
Mikkel Denker
70e9e8181d make api types public 2024-10-01 10:02:22 +02:00
Mikkel Denker
4e8c165a1c
cleanup temporary directories automatically in tests (#228) 2024-10-01 09:42:14 +02:00
Mikkel Denker
950862be9c
Re-open live index after it has been downloaded from replica (#227)
* re-open index after it has been downloaded from replica

* remove writer directory lock

* update meta file with segment changes

* flatten live index directory structure a bit for better overview

* additional live index tests
2024-10-01 09:19:08 +02:00
Mikkel Denker
1b1677dd15 clear wal after inserts 2024-09-27 17:45:34 +02:00
Mikkel Denker
45f6cae5e7 cannot download index from self replica
if a node restarted quickly after it shut down, it might see its old ghost in the cluster and try to download the index from itself, which would of course fail
2024-09-27 17:45:11 +02:00
Mikkel Denker
0114e8d9eb normalize urls before they are crawled for better duplicates detection 2024-09-26 15:33:07 +02:00
Mikkel Denker
d94b46892e include live index results when returning search count 2024-09-26 15:14:26 +02:00
Mikkel Denker
d61986c255 normalize urls 2024-09-26 15:14:03 +02:00
Mikkel Denker
8ebdd37653 derive default for some live index config fields 2024-09-26 15:13:40 +02:00
Mikkel Denker
7814bf7f1a skip awaiting webgraph nodes in dev environments as they may not exist 2024-09-26 09:02:56 +02:00
Mikkel Denker
f6f39b77ca explicitly wait for graph nodes to come online 2024-09-25 16:43:13 +02:00
Mikkel Denker
5cd174e415
Live index crawler (#226)
* find links to feeds from pages and add them to site statistics (statistics is probably an incorrect name now)

* robust url parse

* shard download_db by the first two characters of the md5 hash of the host

* count number of occurrences for each feed in site stats so we can filter '/comment' etc

* filter feeds that only have a count of 1 from site stats

* webgraph granularity as trait to ensure each struct gets a connection with the granularity they expect

* remove cluster_id as it wasn't used for anything in practice

* assign budgets to each site based on their category, host centrality and total daily budgets

* entrypoint for live index crawler

* implement feeds checker

* sitemap checker

* frontpage checker

* remove unused imports + fix clippy warnings

* remove redundant closure

* fix deadlock when starting crawl of a site. the drip thread already had the lock so creating the guard would halt indefinitely

* also seed crawled_db with urls from live index
2024-09-25 15:07:16 +02:00
Mikkel Denker
de3239716f explicitly mark as unreachable 2024-09-18 12:26:18 +02:00
Mikkel Denker
55b39555aa npm update 2024-09-18 12:19:14 +02:00
Mikkel Denker
ee0fc39eaa rustup update and fix clippy 2024-09-18 11:18:55 +02:00
Mikkel Denker
bcc0819871
Live index replication (#224)
* [WIP] join live-index node in setup mode before searches can be performed

the node needs to check if there are other replicas already online and download the index from one of them. inserts should also still be inserted into setup node's wal

* once a server receives an indexing request with a consistency threshold, replicate that request to the other replicas (servers with same shard as the server that received the request). only if the consistency threshold is met should the original request succeed

* write to temp wal during setup

* remote copy of files/directories to download the index from an existing replica

* download index from existing replicas during setup

* clippy fix

* cleanup large test files in remote copy tests

* [wip] live index tests

* tests
2024-09-17 14:31:20 +02:00
Mikkel Denker
21d28ea86d remove 'package-lock.json' from gitignore 2024-09-12 12:54:15 +02:00
Mikkel Denker
7f5a4e846d boost and exclude domains in crawl plan 2024-09-10 13:41:06 +02:00
Mikkel Denker
bd8f69707b disable http redirect following in reqwest and handle it directly in crawler instead.
this ensures we have the proper delay between requests
2024-09-10 11:36:32 +02:00
Mikkel Denker
21ba4cde65
Live index without replication (#221)
* [WIP] live index code structure with a ton of todos

* update meta file with segment changes

* add endpoint to index webpages into live index

* compact segments by date

* cleanup old segments

* fix clippy warnings

* fix clippy warnings
2024-09-10 11:03:52 +02:00
Mikkel Denker
38f33c80b5 use words tokenizer at query time for all *NoTokenizer fields to ensure that e.g. the query 'example query site.com' matches site.com against the field instead of the entire query 2024-09-09 10:42:56 +02:00
Mikkel Denker
b3aa2a80f6 weekends ranking experiments to improve top-300 recall.
improves recall from ~71% -> ~83%. not ready for production yet as the top ranked urls are way off
2024-09-09 09:54:01 +02:00
Mikkel Denker
ace0a81fef
Compute some site level statistics (number of pages, number of news articles and number of blog posts). This will be used to determine which sites to crawl in the live index (#220) 2024-09-07 15:10:15 +02:00
Mikkel Denker
8d0ad573a7 start politeness factor at 2, decrease iff. we don't receive any 429 responses. also increase max delay to 180 seconds 2024-09-06 13:41:22 +02:00