An open source, non-profit search engine implemented in Python

Daoud Clarke f00eacf8aa Merge pull request #119 from mwmbl/fix-nginx-config Fix nginx config		2023-10-17 13:18:45 +01:00
.github/workflows	Fix some paths, use prod settings in Dockerfile	2023-10-10 20:18:43 +01:00
.vscode	add launch.json for vscode debugging	2023-02-10 20:59:09 -08:00
analyse	Rename django app to mwmbl	2023-10-10 13:51:06 +01:00
devdata	Keep track of curated couments	2023-04-30 18:25:48 +01:00
docs/assets/images	docs: added branding to readme and required assets files	2022-02-04 20:50:43 +01:00
front-end	Serve front end	2023-10-12 17:38:19 +01:00
mwmbl	Allow mwmbl.org	2023-10-15 16:47:23 +01:00
test	Fix some paths, use prod settings in Dockerfile	2023-10-10 20:18:43 +01:00
.dockerignore	Get Dockerfile working	2021-12-23 21:30:51 +00:00
.gcloudignore	Add .gcloudignore file to fix gcloud run deploy	2021-12-30 21:17:18 +00:00
.gitignore	Update .gitignore: fix ignoroing data folder in root of repository	2021-12-29 09:21:57 +01:00
app.json	There may not be files there, so use -rf	2023-10-12 22:12:58 +01:00
CODE_OF_CONDUCT.md	Create CODE_OF_CONDUCT.md	2023-02-11 15:13:08 +00:00
CONTRIBUTING.md	Create CONTRIBUTING.md	2023-02-11 15:17:35 +00:00
Dockerfile	Add the app.json in Dockerfile	2023-10-12 22:02:27 +01:00
LICENSE	GPLv3 -> AGPLv3	2021-12-26 22:05:15 +00:00
manage.py	Serve front end	2023-10-12 17:38:19 +01:00
nginx.conf.sigil	Revert nginx.conf	2023-10-17 13:17:57 +01:00
poetry.lock	Run poetry lock	2023-10-10 20:21:37 +01:00
pyproject.toml	Store stats in redis	2023-09-28 17:48:29 +01:00
README.md	Update README.md	2023-02-11 15:10:30 +00:00

README.md

Mwmbl - No ads, no tracking, no cruft, no profit

Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on usability and speed. At the moment it is little more than an idea together with a proof-of-concept implementation of the web front-end and search technology on a small index.

Our vision is a community working to provide top quality search particularly for hackers, funded purely by donations.

Crawling

Update 2022-02-05: We now have a distributed crawler that runs on our volunteers' machines! If you have Firefox you can help out by installing our extension. This will crawl the web in the background, retrieving one page a second. It does not use or access any of your personal data. Instead it crawls the web at random, using the top scoring sites on Hacker News as seed pages. After extracting a summary of each page, it batches these up and sends the data to a central server to be stored and indexed.
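The crawl, summarise, batch and send loop described above can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the names BatchSender, crawl, fetch_summary and send, and the batch size, are all hypothetical. Only the one-page-per-second pacing comes from the description above.

```python
import time
from typing import Callable, Dict, Iterable, List

Summary = Dict[str, str]


class BatchSender:
    """Collects page summaries and hands them over in batches (hypothetical helper)."""

    def __init__(self, send: Callable[[List[Summary]], None], batch_size: int = 100):
        self.send = send              # assumed callback that uploads one batch
        self.batch_size = batch_size  # assumed value, not taken from the project
        self.pending: List[Summary] = []

    def add(self, summary: Summary) -> None:
        self.pending.append(summary)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is pending, then start a fresh list.
        if self.pending:
            self.send(self.pending)
            self.pending = []


def crawl(
    urls: Iterable[str],
    fetch_summary: Callable[[str], Summary],
    sender: BatchSender,
    delay_seconds: float = 1.0,  # "one page a second"
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Fetch one page at a time, pausing between pages, and batch the summaries."""
    for url in urls:
        sender.add(fetch_summary(url))
        sleep(delay_seconds)
    sender.flush()  # send the final partial batch


# Usage with stand-ins for the network: summaries are faked and batches are kept in a list.
sent_batches: List[List[Summary]] = []
sender = BatchSender(sent_batches.append, batch_size=2)
crawl(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch_summary=lambda url: {"url": url, "summary": "placeholder"},
    sender=sender,
    sleep=lambda seconds: None,  # skip the real delay in this demo
)
```

Passing the sleep function in as a parameter keeps the pacing visible while letting the demo run instantly.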

Why a non-profit search engine?

The motives of ad-funded search engines are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads which are often not clearly distinguished from search results. Also, eitland on Hacker News comments:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

YaCy - an open source distributed search engine
search.marginalia.nu - a search engine favouring text-heavy websites
Gigablast - a privacy-focused search engine whose owner makes money by selling the technology to third parties
Brave
DuckDuckGo

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision makes search very slow.

Marginalia Search is fantastic, but it is more of a personal project than an open source community.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To answer a query, we compute the hash of each term in the user query, load the corresponding pages, filter the items to those containing the term, and rank the remaining items. Since each page is small, this can be done very quickly.
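A minimal sketch of this design, under several assumptions: the README does not name the hash function, the compression scheme, the item format or the ranking function, so SHA-256, zlib, JSON dictionaries and a simple "number of matching query terms" score are used here as stand-ins. The class and function names are hypothetical. Only the fixed page count N and the 4096-byte page size come from the description above.

```python
import hashlib
import json
import zlib

NUM_PAGES = 1024  # N: assumed value for the sketch
PAGE_SIZE = 4096  # bytes per page, as stated in the README


def page_number(term, num_pages=NUM_PAGES):
    """Map a term to a page number between 0 and num_pages - 1."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pages


class PagedIndex:
    """In-memory stand-in for the single store of fixed-size pages."""

    def __init__(self, num_pages=NUM_PAGES, page_size=PAGE_SIZE):
        self.num_pages = num_pages
        self.page_size = page_size
        self.pages = [b""] * num_pages  # each entry is a compressed item list

    def load_page(self, number):
        blob = self.pages[number]
        return json.loads(zlib.decompress(blob)) if blob else []

    def add(self, term, item):
        """Store item in the page for term; return False if the page would overflow."""
        number = page_number(term, self.num_pages)
        items = self.load_page(number)
        if item not in items:
            items.append(item)
        blob = zlib.compress(json.dumps(items).encode("utf-8"))
        if len(blob) > self.page_size:
            return False  # a real index needs an eviction policy here
        self.pages[number] = blob
        return True

    def search(self, query):
        """Load one page per query term, filter by term, rank by matched terms."""
        scores = {}
        for term in query.lower().split():
            for item in self.load_page(page_number(term, self.num_pages)):
                # Pages are shared by every term with the same hash, so filter.
                if term in item["title"].lower().split():
                    scores[item["url"]] = scores.get(item["url"], 0) + 1
        return sorted(scores, key=lambda url: (-scores[url], url))


index = PagedIndex()
doc = {"url": "https://example.com/python", "title": "Python search engine"}
for term in ("python", "search"):
    index.add(term, doc)
index.add("search", {"url": "https://example.com/other", "title": "Another search page"})
results = index.search("python search")
```

The filter step matters because unrelated terms can hash to the same page. A query touches at most one page per term, which is what keeps lookups cheap.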

Because we compress the list of items, we can rank for more than a single term and maintain an index smaller than the inverted index design. Well, that's the theory. This idea has yet to be tested out on a large scale.
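To get a feel for how much compression buys, the following sketch counts how many items fit in one 4096-byte page with and without compression. It assumes JSON-encoded items and zlib, neither of which is specified above. The synthetic items are far more repetitive than real web data, so treat the gap as an upper-end illustration rather than a measurement.

```python
import json
import zlib

PAGE_SIZE = 4096  # bytes per page, as stated in the README


def encode_raw(items):
    return json.dumps(items).encode("utf-8")


def encode_compressed(items):
    return zlib.compress(encode_raw(items))


def items_that_fit(items, encode, page_size=PAGE_SIZE):
    """Return the largest prefix length whose encoding still fits in one page."""
    fitted = 0
    for count in range(1, len(items) + 1):
        if len(encode(items[:count])) > page_size:
            break
        fitted = count
    return fitted


# Synthetic, highly repetitive items (an assumption made for the demo).
items = [
    {"url": f"https://example.com/page/{i}", "title": f"Example page number {i}"}
    for i in range(500)
]

raw_count = items_that_fit(items, encode_raw)
compressed_count = items_that_fit(items, encode_compressed)
print(f"uncompressed: {raw_count} items, compressed: {compressed_count} items")
```

Every extra item that fits in a page is one more term an item can rank for at the same storage cost.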

How to contribute

There are lots of ways to help:

Help us crawl the web
Donate some money towards hosting costs and supporting our volunteers
Give feedback/suggestions
Help out with development of the engine itself

If you would like to help in any of these or other ways, thank you! Please join our Matrix chat server or email the main author (email address is in the git commit history).

Development

Local Testing

This will run against a local test database without running background tasks to update batches etc.

The steps below are the simplest way to configure postgres, but you can set it up how you like as long as the DATABASE_URL you give is correct for your configuration.

Install postgres and create a user for your current username
Install poetry
Run poetry install to install dependencies
Run poetry shell in the root directory to enter the virtual environment
Run DATABASE_URL="postgres://username@" python -m mwmbl.main, replacing "username" with your username.
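If you are unsure what your DATABASE_URL contains, the standard library can show how it breaks down. This sketch only parses the URL and does not reflect how mwmbl itself reads the setting. The function name describe_database_url and the user "alice" are made up for the example.

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Split a database URL into its parts; missing parts come back as None."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,  # None when the host is omitted
        "port": parts.port,      # None when the port is omitted
        "database": parts.path.lstrip("/") or None,
    }


info = describe_database_url("postgres://alice@")
print(info)
```

In the form used above, only the user is set. With libpq-based clients an omitted host typically means a local socket connection, and an omitted database typically defaults to one named after the user, which is why the steps create a postgres user matching your current username.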

Using Dokku

Note: this method is not recommended as it is more involved, and your index will not have any data in it unless you set up a crawler to crawl to your server. You will need to set up your own Backblaze or S3 equivalent storage, or have access to the production keys, which we probably won't give you.

Follow the deployment instructions

Frequently Asked Question

How do you pronounce "mwmbl"?

Like "mumble". I live in Mumbles, which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"