Merge branch 'master' into update-urls-queue-quickly
commit bc6be8b6d5
11 changed files with 1326 additions and 982 deletions
57
.github/workflows/ci.yml
vendored
Normal file
@@ -0,0 +1,57 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      #----------------------------------------------
      # check-out repo and set-up python
      #----------------------------------------------
      - name: Check out repository
        uses: actions/checkout@v3
      - name: Set up python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      #----------------------------------------------
      # ----- install & configure poetry -----
      #----------------------------------------------
      - name: Install Poetry
        uses: snok/install-poetry@v1.3.3
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      #----------------------------------------------
      # load cached venv if cache exists
      #----------------------------------------------
      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
      #----------------------------------------------
      # install dependencies if cache does not exist
      #----------------------------------------------
      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --no-root
      #----------------------------------------------
      # install your root project, if required
      #----------------------------------------------
      - name: Install project
        run: poetry install --no-interaction
      #----------------------------------------------
      # run test suite
      #----------------------------------------------
      - name: Run tests
        run: |
          poetry run pytest
15
.vscode/launch.json
vendored
Normal file
@@ -0,0 +1,15 @@
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "mwmbl",
            "type": "python",
            "request": "launch",
            "module": "mwmbl.main",
            "python": "${workspaceFolder}/.venv/bin/python",
            "stopOnEntry": false,
            "console": "integratedTerminal",
            "justMyCode": true
        }
    ]
}
128
CODE_OF_CONDUCT.md
Normal file
@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
https://matrix.to/#/#mwmbl:matrix.org.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
5
CONTRIBUTING.md
Normal file
@@ -0,0 +1,5 @@
Contributions are very welcome!

Please join the discussion at https://matrix.to/#/#mwmbl:matrix.org and let us know what you're planning to do.

See https://book.mwmbl.org/page/developers/ for a guide to development.
@@ -13,6 +13,8 @@ the web front-end and search technology on a small index.
 Our vision is a community working to provide top quality search
 particularly for hackers, funded purely by donations.
 
+![mwmbl](https://user-images.githubusercontent.com/1283077/218265959-be4220b4-dcf0-47ab-acd3-f06df0883b52.gif)
+
 Crawling
 ========
 
@@ -29,6 +29,7 @@ def setup_args():
     parser = argparse.ArgumentParser(description="Mwmbl API server and background task processor")
     parser.add_argument("--num-pages", type=int, help="Number of pages of memory (4096 bytes) to use for the index", default=2560)
     parser.add_argument("--data", help="Path to the data folder for storing index and cached batches", default="./devdata")
+    parser.add_argument("--port", type=int, help="Port for the server to listen at", default=5000)
     parser.add_argument("--background", help="Enable running the background tasks to process batches",
                         action='store_true')
     args = parser.parse_args()
@@ -74,7 +75,7 @@ def run():
     app.include_router(crawler_router)
 
     # Initialize uvicorn server using global app instance and server config params
-    uvicorn.run(app, host="0.0.0.0", port=5000)
+    uvicorn.run(app, host="0.0.0.0", port=args.port)
 
 
 if __name__ == "__main__":
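The two hunks above add a `--port` option and pass `args.port` to `uvicorn.run` in place of the hard-coded `5000`. A minimal sketch of the parsing side only, assuming a hypothetical `build_parser` helper that is not part of the commit; it suggests that the default keeps the previous behaviour while an explicit flag overrides it:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors only the --port option from the hunk above.
    parser = argparse.ArgumentParser(description="Mwmbl API server and background task processor")
    parser.add_argument("--port", type=int, help="Port for the server to listen at", default=5000)
    return parser


# Without the flag, the previously hard-coded port is still used.
default_port = build_parser().parse_args([]).port

# With the flag, the string from the command line is converted to an int.
custom_port = build_parser().parse_args(["--port", "8080"]).port
```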
@@ -10,13 +10,16 @@ TERMS_PATH = Path(__file__).parent.parent / 'resources' / 'mwmbl-crawl-terms.csv
 class Completer:
     def __init__(self, num_matches: int = 3):
         # Load term data
-        terms = pd.read_csv(TERMS_PATH)
+        terms = self.get_terms()
 
         terms_dict = terms.sort_values('term').set_index('term')['count'].to_dict()
         self.terms = list(terms_dict.keys())
         self.counts = list(terms_dict.values())
         self.num_matches = num_matches
         print("Terms", self.terms[:100], self.counts[:100])
 
+    def get_terms(self):
+        return pd.read_csv(TERMS_PATH)
+
     def complete(self, term) -> list[str]:
         term_length = len(term)
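Moving the CSV read into `get_terms` gives tests a method to patch, so the constructor no longer has to touch the file on disk. A minimal sketch of that seam, assuming a hypothetical `ToyCompleter` class and the standard-library `unittest.mock` rather than the `pytest-mock` fixture the commit's tests use:

```python
from unittest.mock import patch


class ToyCompleter:
    """Hypothetical stand-in for Completer: data loading lives in its own method."""

    def __init__(self):
        rows = self.get_terms()
        self.terms = sorted(term for term, _count in rows)

    def get_terms(self):
        # The real class reads a CSV here; a unit test should not depend on that file.
        raise RuntimeError("would read from disk")


# Patching the loader on the class replaces it for instances created inside the block.
with patch.object(ToyCompleter, "get_terms", return_value=[("builder", 3), ("build", 4)]):
    completer = ToyCompleter()
```

Outside the `with` block the original `get_terms` is restored, so the patch cannot leak into other tests.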
@@ -122,7 +122,7 @@ class TinyIndex(Generic[T]):
     def __enter__(self):
         self.index_file = open(self.index_path, 'r+b')
         prot = PROT_READ if self.mode == 'r' else PROT_READ | PROT_WRITE
-        self.mmap = mmap(self.index_file.fileno(), 0, offset=METADATA_SIZE, prot=prot)
+        self.mmap = mmap(self.index_file.fileno(), 0, prot=prot)
         return self
 
     def __exit__(self, exc_type, exc_val, exc_tb):
@@ -146,7 +146,7 @@ class TinyIndex(Generic[T]):
         return [self.item_factory(*item) for item in results]
 
     def _get_page_tuples(self, i):
-        page_data = self.mmap[i * self.page_size:(i + 1) * self.page_size]
+        page_data = self.mmap[i * self.page_size + METADATA_SIZE:(i + 1) * self.page_size + METADATA_SIZE]
         try:
             decompressed_data = self.decompressor.decompress(page_data)
         except ZstdError:
@@ -186,7 +186,7 @@ class TinyIndex(Generic[T]):
 
         page_data = _get_page_data(self.compressor, self.page_size, data)
         logger.debug(f"Got page data of length {len(page_data)}")
-        self.mmap[i * self.page_size:(i+1) * self.page_size] = page_data
+        self.mmap[i * self.page_size + METADATA_SIZE:(i+1) * self.page_size + METADATA_SIZE] = page_data
 
     @staticmethod
     def create(item_factory: Callable[..., T], index_path: str, num_pages: int, page_size: int):
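The `TinyIndex` hunks above drop `offset=METADATA_SIZE` from the `mmap` call and add `METADATA_SIZE` to the page slice bounds instead, so the mapping covers the whole file and every page access has to skip the header itself. A minimal sketch of that arithmetic, assuming a hypothetical `page_bounds` helper, an illustrative `METADATA_SIZE` of 4096 bytes, and a `bytearray` standing in for the memory map:

```python
METADATA_SIZE = 4096  # assumed value, for illustration only
PAGE_SIZE = 4096      # page size quoted in the --num-pages help text


def page_bounds(i: int, page_size: int = PAGE_SIZE, metadata_size: int = METADATA_SIZE) -> slice:
    """Byte range of page i in a buffer that starts with a metadata header."""
    start = metadata_size + i * page_size
    return slice(start, start + page_size)


# A bytearray stands in for the mmap of the whole index file (header + 3 pages).
buffer = bytearray(METADATA_SIZE + 3 * PAGE_SIZE)
page = bytes([7]) * PAGE_SIZE

# Reads and writes go through the same bounds, so the slice length always
# equals the page size and the header bytes are never touched.
buffer[page_bounds(1)] = page
read_back = bytes(buffer[page_bounds(1)])
```

Keeping the start and the end shifted by the same amount matters for a real `mmap`, which rejects a slice assignment whose length differs from the data being written.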
2007
poetry.lock
generated
File diff suppressed because it is too large
@@ -18,6 +18,8 @@ boto3 = "^1.20.37"
 requests = "^2.27.1"
 psycopg2-binary = "^2.9.3"
 spacy = "==3.2.1"
+pytest = "^7.2.1"
+pytest-mock = "^3.10.0"
 
 # Optional dependencies do not get installed by default. Look under tool.poetry.extras section
 # to see which extras to use.
78
test/test_completer.py
Normal file
@@ -0,0 +1,78 @@
import mwmbl.tinysearchengine.completer
import pytest
import pandas as pd

def mockCompleterData(mocker, data):
    testDataFrame = pd.DataFrame(data, columns=['','term','count'])
    mocker.patch('mwmbl.tinysearchengine.completer.Completer.get_terms',
                 return_value = testDataFrame)

def test_correctCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)

    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('build')
    assert ['build', 'builder', 'buildings'] == completion

def test_correctSortOrder(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 1],
        [2, 'announce', 2],
        [3, 'buildings', 3]]
    mockCompleterData(mocker, testdata)

    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('build')
    assert ['build', 'buildings', 'builder'] == completion

def test_noCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)

    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('test')
    assert [] == completion

def test_singleCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)

    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('announce')
    assert ['announce'] == completion

def test_idempotencyWithSameScoreCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 1],
        [1, 'builder', 1],
        [2, 'announce', 1],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)

    completer = mwmbl.tinysearchengine.completer.Completer()
    for i in range(3):
        print(f"iteration: {i}")
        completion = completer.complete('build')
        # Results expected in reverse order
        expected = ['buildings','builder','build']
        assert expected == completion