From 722328efa52271c2ca158dfc51f49e5da2975bbb Mon Sep 17 00:00:00 2001
From: Daoud Clarke
Date: Sat, 25 Dec 2021 23:17:38 +0000
Subject: [PATCH] Write readme

---
 README.md | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 107 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 30bcdc4..c00a84d 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,109 @@

Mwmbl: No ads, no tracking, no cruft, no profit
===============================================

Mwmbl is a non-profit, ad-free, free-libre and free-lunch search
engine with a focus on usability and speed. At the moment it is
little more than an idea together with a [proof-of-concept
implementation](https://tinysearchengine-oq4r5q2hsq-ue.a.run.app/) of
the web front-end and search technology on a very small index. A
crawler is still to be implemented.

Our vision is a community working to provide top-quality search,
particularly for hackers, funded purely by donations.

Why a non-profit search engine?
===============================

The motives of an ad-funded search engine are at odds with providing an
optimal user experience. These sites are optimised for ad revenue,
with user experience taking second place. This means that pages are
loaded with ads which are often not clearly distinguished from search
results. As eitland [comments](https://news.ycombinator.com/item?id=29427442)
on Hacker News:

> Thinking about it it seems logical that for a search engine that
> practically speaking has monopoly both on users and as mattgb points
> out - [to some] degree also on indexing - serving the correct answer
> first is just dumb: if they can keep me going between their search
> results and tech blogs with their ads embedded one, two or five
> times extra that means one, two or five times more ad impressions.

But what about...?
==================

The space of alternative search engines has expanded rapidly in recent
years. Here's a very incomplete list of some that have interested me:

 - [YaCy](https://yacy.net/) - an open source distributed search engine
 - [search.marginalia.nu](https://search.marginalia.nu/) - a search
   engine favouring text-heavy websites
 - [Gigablast](https://gigablast.com/) - a privacy-focused search
   engine whose owner makes money by selling the technology to third
   parties
 - [Brave](https://search.brave.com/)
 - [DuckDuckGo](https://duckduckgo.com/)

Of these, YaCy is the closest in spirit to the idea of a non-profit
search engine. The index is distributed across a peer-to-peer
network. Unfortunately, this design decision makes search very slow.

Marginalia Search is fantastic, but it is more of a personal project
than an open source community.

All other search engines that I've come across are for-profit. Please
let me know if I've missed one!

Designing for non-profit
========================

To be a good search engine, we need to store many items, but the cost
of running the engine is at least proportional to the number of items
stored. Our main consideration is thus to reduce the cost per item
stored.

The design is founded on the observation that most items rank for only
a small set of terms. In the extreme version of this, where each item
ranks for a single term, the usual inverted index design is grossly
inefficient, since we have to store each term at least twice: once in
the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a
fixed number N of pages. Each page is of a fixed size (currently 4096
bytes, to match a page of memory) and consists of a compressed list of
items. Given a term for which we want an item to rank, we compute a
hash of the term: a value between 0 and N - 1. The item is then stored
in the corresponding page.

To retrieve items, we simply compute the hash of each term in the user
query, load the corresponding pages, filter the items to those
containing the term, and rank them. Since each page is small, this can
be done very quickly.

Because we compress the list of items, we can rank for more than a
single term and still maintain an index smaller than the inverted index
design. Well, that's the theory. This idea has yet to be tested out on
a large scale.
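As a rough illustration of the idea, here is a minimal Python sketch of
such a hash-map index. It is not the actual implementation: the class,
the `title` field on items and constants such as `NUM_PAGES` are
assumptions made for the example.

```python
# Minimal sketch of the "giant hash map" index described above.
# Names, item fields and constants are illustrative assumptions,
# not the real Mwmbl code.
import json
import zlib
from hashlib import md5

NUM_PAGES = 1024   # N: fixed number of pages in the store
PAGE_SIZE = 4096   # each compressed page must fit in this many bytes


def page_for(term: str) -> int:
    """Hash a term to a page number between 0 and NUM_PAGES - 1."""
    return int(md5(term.encode("utf8")).hexdigest(), 16) % NUM_PAGES


class HashMapIndex:
    def __init__(self):
        # In practice this would be a single file of NUM_PAGES fixed-size,
        # compressed pages; an in-memory list of lists stands in here.
        self.pages = [[] for _ in range(NUM_PAGES)]

    def store(self, term: str, item: dict) -> bool:
        """Store an item in the page for a term it should rank for.

        Returns False if the compressed page would exceed PAGE_SIZE.
        """
        page = self.pages[page_for(term)]
        candidate = page + [item]
        if len(zlib.compress(json.dumps(candidate).encode("utf8"))) > PAGE_SIZE:
            return False
        page.append(item)
        return True

    def retrieve(self, query: str) -> list:
        """Load the page for each query term and keep the items containing it."""
        results = []
        for term in query.lower().split():
            page = self.pages[page_for(term)]
            results += [item for item in page if term in item["title"].lower()]
        return results  # ranking of the filtered items would happen here
```

A real version would also need an eviction policy for pages whose
compressed contents no longer fit (for example, dropping the
lowest-scoring items), and the ranking of the filtered items is left
out of the sketch.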
Crawling
========

Our current index is a small sample of the excellent Common Crawl,
restricted to English content and to domains which score highly on
average in Hacker News submissions. For a variety of reasons we will
likely want to go beyond Common Crawl data at some point, so building
a crawler is inevitable. We plan to start work on a distributed
crawler, probably implemented as a browser extension that can be
installed by volunteers.

How to contribute
=================

There are lots of ways to help:
 - Volunteer to test out the distributed crawler when it's ready
 - Help out with development of the engine itself
 - Donate some money towards hosting costs and/or founding an official
   non-profit organisation

If you would like to help in any of these or other ways, thank you!
Please email the main author (the email address is in the git commit
history).