
Merge branch 'master' into update-urls-queue-quickly

Daoud Clarke 2 years ago
parent
commit
bc6be8b6d5

+ 57 - 0
.github/workflows/ci.yml

@@ -0,0 +1,57 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      #----------------------------------------------
+      #       check-out repo and set-up python
+      #----------------------------------------------
+      - name: Check out repository
+        uses: actions/checkout@v3
+      - name: Set up python
+        id: setup-python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+      #----------------------------------------------
+      #  -----  install & configure poetry  -----
+      #----------------------------------------------
+      - name: Install Poetry
+        uses: snok/install-poetry@v1.3.3
+        with:
+          virtualenvs-create: true
+          virtualenvs-in-project: true
+          installer-parallel: true
+
+      #----------------------------------------------
+      #       load cached venv if cache exists
+      #----------------------------------------------
+      - name: Load cached venv
+        id: cached-poetry-dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
+      #----------------------------------------------
+      # install dependencies if cache does not exist
+      #----------------------------------------------
+      - name: Install dependencies
+        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
+        run: poetry install --no-interaction --no-root
+      #----------------------------------------------
+      # install your root project, if required
+      #----------------------------------------------
+      - name: Install project
+        run: poetry install --no-interaction
+      #----------------------------------------------
+      #              run test suite
+      #----------------------------------------------
+      - name: Run tests
+        run: |
+          poetry run pytest

+ 15 - 0
.vscode/launch.json

@@ -0,0 +1,15 @@
+{
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "mwmbl",
+            "type": "python",
+            "request": "launch",
+            "module": "mwmbl.main",
+            "python": "${workspaceFolder}/.venv/bin/python",
+            "stopOnEntry": false,
+            "console": "integratedTerminal",
+            "justMyCode": true
+        }
+    ]
+}

+ 128 - 0
CODE_OF_CONDUCT.md

@@ -0,0 +1,128 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, religion, or sexual identity
+and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the
+  overall community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or
+  advances of any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email
+  address, without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official e-mail address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement at
+https://matrix.to/#/#mwmbl:matrix.org.
+All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series
+of actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or
+permanent ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior,  harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within
+the community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.0, available at
+https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
+
+Community Impact Guidelines were inspired by [Mozilla's code of conduct
+enforcement ladder](https://github.com/mozilla/diversity).
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see the FAQ at
+https://www.contributor-covenant.org/faq. Translations are available at
+https://www.contributor-covenant.org/translations.

+ 5 - 0
CONTRIBUTING.md

@@ -0,0 +1,5 @@
+Contributions are very welcome!
+
+Please join the discussion at https://matrix.to/#/#mwmbl:matrix.org and let us know what you're planning to do.
+
+See https://book.mwmbl.org/page/developers/ for a guide to development.

+ 2 - 0
README.md

@@ -13,6 +13,8 @@ the web front-end and search technology on a small index.
 Our vision is a community working to provide top quality search
 particularly for hackers, funded purely by donations.
 
+![mwmbl](https://user-images.githubusercontent.com/1283077/218265959-be4220b4-dcf0-47ab-acd3-f06df0883b52.gif)
+
 Crawling
 ========
 

+ 2 - 1
mwmbl/main.py

@@ -29,6 +29,7 @@ def setup_args():
     parser = argparse.ArgumentParser(description="Mwmbl API server and background task processor")
     parser.add_argument("--num-pages", type=int, help="Number of pages of memory (4096 bytes) to use for the index", default=2560)
     parser.add_argument("--data", help="Path to the data folder for storing index and cached batches", default="./devdata")
+    parser.add_argument("--port", type=int, help="Port for the server to listen on", default=5000)
     parser.add_argument("--background", help="Enable running the background tasks to process batches",
                         action='store_true')
     args = parser.parse_args()
@@ -74,7 +75,7 @@ def run():
         app.include_router(crawler_router)
 
         # Initialize uvicorn server using global app instance and server config params
-        uvicorn.run(app, host="0.0.0.0", port=5000)
+        uvicorn.run(app, host="0.0.0.0", port=args.port)
 
 
 if __name__ == "__main__":
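
Note on the `--port` change: the hunk above replaces the hard-coded port with a parsed argument that defaults to 5000. A minimal sketch of that pattern in isolation follows; `parse_port` is a hypothetical helper written for illustration and is not part of `mwmbl/main.py`.

```python
import argparse


def parse_port(argv: list[str]) -> int:
    """Parse an optional --port flag, falling back to 5000 when it is absent."""
    # Hypothetical stand-alone parser; the real setup_args() defines more options.
    parser = argparse.ArgumentParser(description="Port parsing sketch")
    parser.add_argument("--port", type=int, default=5000,
                        help="Port for the server to listen on")
    return parser.parse_args(argv).port


if __name__ == "__main__":
    print(parse_port([]))                  # no flag given: default is used
    print(parse_port(["--port", "8080"]))  # explicit value is converted to int
```

Because `type=int` is set, a non-numeric value makes argparse exit with a usage error instead of reaching `uvicorn.run`.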

+ 4 - 1
mwmbl/tinysearchengine/completer.py

@@ -10,13 +10,16 @@ TERMS_PATH = Path(__file__).parent.parent / 'resources' / 'mwmbl-crawl-terms.csv
 class Completer:
     def __init__(self, num_matches: int = 3):
         # Load term data
-        terms = pd.read_csv(TERMS_PATH)
+        terms = self.get_terms()
 
         terms_dict = terms.sort_values('term').set_index('term')['count'].to_dict()
         self.terms = list(terms_dict.keys())
         self.counts = list(terms_dict.values())
         self.num_matches = num_matches
         print("Terms", self.terms[:100], self.counts[:100])
+        
+    def get_terms(self):
+        return pd.read_csv(TERMS_PATH)
 
     def complete(self, term) -> list[str]:
         term_length = len(term)
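
Note on the `get_terms` extraction: moving the CSV read into its own method gives tests a seam to patch, so the constructor can run without the real terms file. The sketch below shows the same idea with the standard library's `unittest.mock` instead of `pytest-mock`; `TermSource` and `build_with_fake_terms` are hypothetical stand-ins for illustration, not the project's `Completer`.

```python
from unittest import mock


class TermSource:
    """Hypothetical stand-in for a class that loads its data in __init__."""

    def __init__(self):
        # All loading goes through get_terms(), so a test can replace it.
        terms = self.get_terms()
        self.terms = sorted(terms)

    def get_terms(self) -> dict[str, int]:
        # In this sketch the "real" loader always fails, to show that the
        # patched version is the one actually called.
        raise FileNotFoundError("real data not available in this sketch")


def build_with_fake_terms(fake: dict[str, int]) -> TermSource:
    """Construct a TermSource while get_terms is patched to return fake data."""
    with mock.patch.object(TermSource, "get_terms", return_value=fake):
        return TermSource()


if __name__ == "__main__":
    source = build_with_fake_terms({"builder": 3, "build": 4})
    print(source.terms)
```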

+ 3 - 3
mwmbl/tinysearchengine/indexer.py

@@ -122,7 +122,7 @@ class TinyIndex(Generic[T]):
     def __enter__(self):
         self.index_file = open(self.index_path, 'r+b')
         prot = PROT_READ if self.mode == 'r' else PROT_READ | PROT_WRITE
-        self.mmap = mmap(self.index_file.fileno(), 0, offset=METADATA_SIZE, prot=prot)
+        self.mmap = mmap(self.index_file.fileno(), 0, prot=prot)
         return self
 
     def __exit__(self, exc_type, exc_val, exc_tb):
@@ -146,7 +146,7 @@ class TinyIndex(Generic[T]):
         return [self.item_factory(*item) for item in results]
 
     def _get_page_tuples(self, i):
-        page_data = self.mmap[i * self.page_size:(i + 1) * self.page_size]
+        page_data = self.mmap[i * self.page_size + METADATA_SIZE:(i + 1) * self.page_size + METADATA_SIZE]
         try:
             decompressed_data = self.decompressor.decompress(page_data)
         except ZstdError:
@@ -186,7 +186,7 @@ class TinyIndex(Generic[T]):
 
         page_data = _get_page_data(self.compressor, self.page_size, data)
         logger.debug(f"Got page data of length {len(page_data)}")
-        self.mmap[i * self.page_size:(i+1) * self.page_size] = page_data
+        self.mmap[i * self.page_size + METADATA_SIZE:(i+1) * self.page_size + METADATA_SIZE] = page_data
 
     @staticmethod
     def create(item_factory: Callable[..., T], index_path: str, num_pages: int, page_size: int):
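
Note on the offset arithmetic: these hunks stop passing `offset=METADATA_SIZE` to `mmap` and instead map the file from byte 0, so every page slice has to skip the metadata header itself. A slice assignment into an mmap must also match the slice length exactly, so the start and end of a page slice need the same shift. The sketch below illustrates this under assumed sizes; the value of `METADATA_SIZE` and the helper names `page_bounds` and `demo` are assumptions for illustration, not taken from `indexer.py`.

```python
import mmap
import tempfile

METADATA_SIZE = 4096  # assumed header size, for illustration only
PAGE_SIZE = 4096      # page size in bytes, as in the --num-pages help text


def page_bounds(i: int, page_size: int = PAGE_SIZE,
                metadata_size: int = METADATA_SIZE) -> tuple[int, int]:
    """Return (start, end) byte offsets of page i in a file mapped from offset 0."""
    start = metadata_size + i * page_size
    return start, start + page_size


def demo() -> tuple[bytes, bytes]:
    """Write one page into a temporary index-like file and read it back."""
    num_pages = 3
    with tempfile.TemporaryFile() as f:
        # Header followed by num_pages zero-filled pages.
        f.write(b"\x00" * (METADATA_SIZE + num_pages * PAGE_SIZE))
        f.flush()
        with mmap.mmap(f.fileno(), 0) as mm:
            start, end = page_bounds(1)
            # The payload length must equal end - start, or mmap raises an error.
            mm[start:end] = b"x" * PAGE_SIZE
            return mm[start:end], mm[:METADATA_SIZE]


if __name__ == "__main__":
    page, header = demo()
    print(len(page), header == b"\x00" * METADATA_SIZE)
```

Keeping the bounds calculation in one helper means the read path and the write path cannot drift apart.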

File diff suppressed because it is too large
+ 265 - 895
poetry.lock


+ 2 - 0
pyproject.toml

@@ -18,6 +18,8 @@ boto3 = "^1.20.37"
 requests = "^2.27.1"
 psycopg2-binary = "^2.9.3"
 spacy = "==3.2.1"
+pytest = "^7.2.1"
+pytest-mock = "^3.10.0"
 
 # Optional dependencies do not get installed by default. Look under tool.poetry.extras section
 # to see which extras to use.

+ 78 - 0
test/test_completer.py

@@ -0,0 +1,78 @@
+import mwmbl.tinysearchengine.completer
+import pytest
+import pandas as pd
+
+def mockCompleterData(mocker, data):
+    testDataFrame = pd.DataFrame(data, columns=['','term','count'])
+    mocker.patch('mwmbl.tinysearchengine.completer.Completer.get_terms', 
+                 return_value = testDataFrame)
+
+def test_correctCompletions(mocker):
+    # Mock completer with custom data
+    testdata = [
+        [0, 'build', 4],
+        [1, 'builder', 3],
+        [2, 'announce', 2],
+        [3, 'buildings', 1]]
+    mockCompleterData(mocker, testdata)
+    
+    completer = mwmbl.tinysearchengine.completer.Completer()
+    completion = completer.complete('build')
+    assert ['build', 'builder', 'buildings'] == completion
+
+def test_correctSortOrder(mocker):
+    # Mock completer with custom data
+    testdata = [
+        [0, 'build', 4],
+        [1, 'builder', 1],
+        [2, 'announce', 2],
+        [3, 'buildings', 3]]
+    mockCompleterData(mocker, testdata)
+    
+    completer = mwmbl.tinysearchengine.completer.Completer()
+    completion = completer.complete('build')
+    assert ['build', 'buildings', 'builder'] == completion
+    
+def test_noCompletions(mocker):
+    # Mock completer with custom data
+    testdata = [
+        [0, 'build', 4],
+        [1, 'builder', 3],
+        [2, 'announce', 2],
+        [3, 'buildings', 1]]
+    mockCompleterData(mocker, testdata)
+    
+    completer = mwmbl.tinysearchengine.completer.Completer()
+    completion = completer.complete('test')
+    assert [] == completion
+    
+def test_singleCompletions(mocker):
+    # Mock completer with custom data
+    testdata = [
+        [0, 'build', 4],
+        [1, 'builder', 3],
+        [2, 'announce', 2],
+        [3, 'buildings', 1]]
+    mockCompleterData(mocker, testdata)
+    
+    completer = mwmbl.tinysearchengine.completer.Completer()
+    completion = completer.complete('announce')
+    assert ['announce'] == completion
+    
+def test_idempotencyWithSameScoreCompletions(mocker):
+    # Mock completer with custom data
+    testdata = [
+        [0, 'build', 1],
+        [1, 'builder', 1],
+        [2, 'announce', 1],
+        [3, 'buildings', 1]]
+    mockCompleterData(mocker, testdata)
+    
+    completer = mwmbl.tinysearchengine.completer.Completer()
+    for i in range(3):
+        print(f"iteration: {i}")
+        completion = completer.complete('build')
+        # Results expected in reverse order
+        expected = ['buildings','builder','build']
+        assert expected == completion
+    

Some files were not shown because too many files changed in this diff