Update readme for recent changes
parent 51f2dd2690
commit c4e86ce313
1 changed file with 22 additions and 16 deletions
38
README.md
@@ -14,6 +14,19 @@ crawler is still to be implemented.
 Our vision is a community working to provide top quality search
 particularly for hackers, funded purely by donations.
 
+Crawling
+========
+
+**Update 2022-02-05:** We now have a distributed crawler that runs on
+our volunteers' machines! If you have Firefox you can help out by
+[installing our
+extension](https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/). This
+will crawl the web in the background, retrieving one page a second. It
+does not use or access any of your personal data. Instead it crawls
+the web at random, using the top scoring sites on Hacker News as seed
+pages. After extracting a summary of each page, it batches these up
+and sends the data to a central server to be stored and indexed.
+
 Why a non-profit search engine?
 ===============================
 
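Note on the crawling paragraph added in the hunk above: it describes a random crawl from seed pages that extracts a summary per page and sends summaries in batches. The following is only a rough sketch of that loop, not the extension's actual code. The names `crawl`, `summarise`, `BATCH_SIZE` and the in-memory `PAGES` table are all hypothetical, and the one-page-a-second delay and the upload to the central server are left out so the example runs offline.

```python
import random
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 3  # hypothetical; the README does not state a batch size


def summarise(url: str, html: str) -> Dict[str, str]:
    """Stand-in summary: keep the URL and the first 80 characters of the page."""
    return {"url": url, "extract": html[:80]}


def crawl(
    seeds: List[str],
    fetch: Callable[[str], str],
    links: Callable[[str], List[str]],
    max_pages: int,
    rng: random.Random,
) -> Iterator[List[Dict[str, str]]]:
    """Walk outwards from the seed pages at random, yielding batches of summaries."""
    frontier = list(seeds)
    seen = set()
    batch = []
    while frontier and len(seen) < max_pages:
        # Pick the next page at random rather than in queue order.
        url = frontier.pop(rng.randrange(len(frontier)))
        if url in seen:
            continue
        seen.add(url)
        batch.append(summarise(url, fetch(url)))
        frontier.extend(link for link in links(url) if link not in seen)
        if len(batch) == BATCH_SIZE:
            yield batch  # a real crawler would send this batch to the server
            batch = []
    if batch:
        yield batch  # flush the final partial batch


# Tiny in-memory "web" (page text, outgoing links) so the sketch needs no network.
PAGES = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["c", "d"]),
    "c": ("page c", ["a"]),
    "d": ("page d", []),
}

batches = list(
    crawl(
        seeds=["a"],
        fetch=lambda url: PAGES[url][0],
        links=lambda url: PAGES[url][1],
        max_pages=10,
        rng=random.Random(0),
    )
)
```

Passing `fetch` and `links` as callables keeps the walk itself independent of how pages are retrieved, so the same loop could be driven by a browser extension or by a test fixture like the one here.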
@@ -87,26 +100,16 @@ single term and maintain an index smaller than the inverted index
 design. Well, that's the theory. This idea has yet to be tested out on
 a large scale.
 
-Crawling
-========
-
-Our current index is a small sample of the excellent Common Crawl,
-restricted to English content and domains which score highly on
-average in Hacker News submissions. It is likely for a variety of
-reasons that we will want to go beyond Common Crawl data at some
-point, so building a crawler becomes inevitable. We plan to start work
-on a distributed crawler, probably implemented as a browser extension
-that can be installed by volunteers.
-
 How to contribute
 =================
 
 There are lots of ways to help:
+- [Help us crawl the
+web](https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/)
+- [Donate some money](https://opencollective.com/mwmbl) towards
+hosting costs and supporting our volunteers
 - Give feedback/suggestions
-- Volunteer to test out the distributed crawler when it's ready
 - Help out with development of the engine itself
-- Donate some money towards hosting costs and/or founding an official
-non-profit organisation
 
 If you would like to help in any of these or other ways, thank you!
 Please join our [Matrix chat
@@ -123,7 +126,7 @@ Development
 4. Run `$ docker run -p 8080:8080 mwmbl`
 
 ### Local Testing
-1. Create and activate a python (3.9) environment using any tool you like e.g. poetry,venv, conda etc.
+1. Create and activate a python (3.10) environment using any tool you like e.g. poetry,venv, conda etc.
 2. Run `$ pip install .`
 3. Run `$ mwmbl-tinysearchengine --config config/tinysearchengine.yaml`
 
@@ -132,4 +135,7 @@ Frequently Asked Question
 
 ### How do you pronounce "mwmbl"?
 
-Like "mumble". I live in [Mumbles](https://en.wikipedia.org/wiki/Mumbles), which is spelt "Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in "don't search, just mwmbl!"
+Like "mumble". I live in
+[Mumbles](https://en.wikipedia.org/wiki/Mumbles), which is spelt
+"Mwmbwls" in Welsh. But the intended meaning is "to mumble", as in
+"don't search, just mwmbl!"