Commit graph

  • b2e01d33e8 docs: better title display on readme #41 ColinEspinas 2022-02-04 20:52:27 +0100
  • 95c9bcfe3b Merge branch 'mwmbl:master' into add-branding Colin Espinas 2022-02-04 20:51:08 +0100
  • cd57372a84 docs: added branding to readme and required assets files ColinEspinas 2022-02-04 20:44:20 +0100
  • 6e5e56f99a New index; more pages Daoud Clarke 2022-02-04 18:08:23 +0000
  • bdf0fd1797 Merge pull request #39 from mwmbl/analyse-links Daoud Clarke 2022-02-03 19:33:52 +0000
  • 2fc999b402 Count unique domains instead of links #39 Daoud Clarke 2022-02-02 20:09:59 +0000
  • 26e90c6e57 Merge branch 'master' into analyse-links Daoud Clarke 2022-02-02 19:48:47 +0000
  • 07d4b36052 Merge pull request #38 from mwmbl/stop-indexing-partial-words Daoud Clarke 2022-02-02 19:48:31 +0000
  • d77b72d7df Analyse links to find most popular ones Daoud Clarke 2022-02-02 19:47:38 +0000
  • fe6ace93e6 Improve handling of incomplete words: - Correctly generate regex for incomplete vs complete words - Return more than one top word from completer - Correctly handle no terms #38 Daoud Clarke 2022-01-31 21:20:59 +0000
  • 7d829bc319 Use python 3.10; complete terms Daoud Clarke 2022-01-30 23:24:00 +0000
  • 3c75dd1a74 WIP: implement term completer Daoud Clarke 2022-01-30 22:20:28 +0000
  • 01a21337a9 Don't index partial words Daoud Clarke 2022-01-30 14:30:02 +0000
  • 2ef8304919 Remove some debug print statements Daoud Clarke 2022-01-30 13:16:24 +0000
  • 66696ad76b Merge pull request #37 from mwmbl/index-mwmbl-crawl Daoud Clarke 2022-01-30 13:12:06 +0000
  • 5b89bbf05d Index Mwmbl crawled data #37 Daoud Clarke 2022-01-29 08:26:42 +0000
  • ef36513f64 Analyse the pages that are crawled most often Daoud Clarke 2022-01-29 07:06:53 +0000
  • 70254ae160 Analyse crawled URLs and domains Daoud Clarke 2022-01-26 18:51:58 +0000
  • 171fa645d2 Add script to export top domains Daoud Clarke 2022-01-23 22:04:30 +0000
  • 908a9cf0b6 Merge pull request #36 from ColinEspinas/remove-old-frontend Daoud Clarke 2022-01-20 18:06:54 +0000
  • 3481ad372b Removed old front-end files and routes #36 ColinEspinas 2022-01-19 23:30:34 +0100
  • 3e930ea3f0 * added Pipeline functionality * added Pipeline config validation code * added Op: NoneOp * added Op: commoncrawl_download_cc_index * added Connection: none_conn * added Connection: s3fs_conn * added Connection: pyarrow_s3fs_conn * added main entrypoint for pipeline, "python -m mwmbl.pipeline.main --config ..." * added mwmbl-pipeline which calls main entrypoint * added pipeline dependencies to extras to use as "pip install .[pipeline]" * updated pyarrow, s3fs, boto3 and botocore dependency versions * update poetry.lock with the new dependencies * added config/pipeline/pipeline_none.yaml * added config/pipeline/pipeline_download_cc_index.yaml * added mwmbl/pipeline/README.md #34 nitred 2022-01-05 13:39:14 +0100
  • a41088ca9a Add CORS; revert back to previous index as it timed out deploying Daoud Clarke 2022-01-03 18:31:03 +0000
  • 25918e42ef Export URLs to sqlite for evaluation purposes Daoud Clarke 2022-01-02 20:06:13 +0000
  • ae7312c32a Merge pull request #31 from nitred/fix-python-m-run Daoud Clarke 2021-12-31 22:11:15 +0000
  • fbdb93c86a Using the app object to start uvicorn, instead of using a reference like "mwmbl.tinysearchengine.app:app" - fixes the issue when running the server using python -m mwmbl.tinysearchengine.app #31 nitred 2021-12-31 02:06:36 +0100
  • e6655101ef Add a component of the HN domain score when ranking Daoud Clarke 2021-12-30 22:20:10 +0000
  • f347fe29ac Add .gcloudignore file to fix gcloud run deploy Daoud Clarke 2021-12-30 21:17:18 +0000
  • 3f74229ae9 Explain pronunciation Daoud Clarke 2021-12-30 20:35:11 +0000
  • 02bcef640c Merge pull request #25 from ColinEspinas/search-debounce Daoud Clarke 2021-12-29 20:59:29 +0000
  • 3d7e655ebc Merge pull request #24 from nitred/config-and-entrypoint Daoud Clarke 2021-12-29 20:54:23 +0000
  • c636be9089 Added debounce on search input (#8) #25 ColinEspinas 2021-12-29 21:03:47 +0100
  • a72a08a7d9 added config and binary/entrypoint for mwmbl.tinysearchengine - using pydantic to validate the config - added a default bootstrap config at config/tinysearchengine.yaml - refactored app.py to include parsing CLI argument using argparse - refactored app.py to use fewer global variables - added "mwmbl-tinysearchengine" binary/entrypoint in pyproject.toml - updated Dockerfile to work with these changes and added comments to it #24 nitred 2021-12-29 15:18:02 +0100
  • 3ccf1eb7d7 added config and binary/entrypoint for mwmbl.tinysearchengine - using pydantic to validate the config - added a default bootstrap config at config/tinysearchengine.yaml - refactored app.py to include parsing CLI argument using argparse - refactored app.py to use fewer global variables - added "mwmbl-tinysearchengine" binary/entrypoint in pyproject.toml - updated Dockerfile to work with these changes and added comments to it #23 nitred 2021-12-29 15:18:02 +0100
  • da8797f5ef Merge pull request #18 from nitred/mwmbl-package Daoud Clarke 2021-12-29 09:34:05 +0000
  • 0b7bc90a05 Merge pull request #21 from ArcoMul/add-dev-instructions-to-readme Daoud Clarke 2021-12-29 09:04:53 +0000
  • b6c1630953 Update .gitignore: fix ignoring data folder in root of repository #21 Arco Mul 2021-12-29 09:21:57 +0100
  • d5a612aa47 Update README: add development instructions Arco Mul 2021-12-29 09:21:26 +0100
  • be40a15b27 Merge branch 'master' into mwmbl-package #18 nitred 2021-12-29 00:25:37 +0100
  • 03ca368b2a Merge pull request #17 from nitred/python-gitignore Daoud Clarke 2021-12-28 21:29:00 +0000
  • 0baed3780d Merge pull request #13 from nitred/indexer-dependencies-as-extra Daoud Clarke 2021-12-28 21:27:56 +0000
  • 04d7cbdfe3 Merge pull request #11 from ArcoMul/fix-mobile-layout Daoud Clarke 2021-12-28 21:26:18 +0000
  • 11eedcde84 renamed package to mwmbl - renamed package to mwmbl in pyproject.toml - tinysearchengine and indexer modules have been moved into mwmbl package folder - analyse module has been left as is in the root of the repo - import statements in tinysearchengine now use mwmbl.tinysearchengine - import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths - import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine - final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app - fixed a couple of import statement errors in tinysearchengine/indexer.py nitred 2021-12-28 12:02:48 +0100
  • 91b357b6e2 added standard .gitignore template for python from the github/gitignore repo #17 nitred 2021-12-28 11:36:01 +0100
  • c02c052281 Fixes #12, Added dependencies for indexer as extra or extra_requires - dependencies for indexer can be installed using "pip install .[indexer]" or "poetry install -E indexer" #13 nitred 2021-12-27 15:46:24 +0100
  • e773ff68e5 Decrease font-size of url so that the title stands out more #11 Arco Mul 2021-12-27 12:36:10 +0100
  • 4e41f68a46 Make page responsive for mobile devices Arco Mul 2021-12-27 12:29:09 +0100
  • acb2d19470 Merge pull request #6 from ndren/master Daoud Clarke 2021-12-27 08:57:07 +0000
  • 389d0abcc1 Do not send Referer #6 Andrei E 2021-12-26 22:48:03 +0000
  • 61e5dba20d Merge pull request #5 from ndren/master Daoud Clarke 2021-12-26 22:14:06 +0000
  • a09340891a GPLv3 -> AGPLv3 #5 Andrei E 2021-12-26 22:05:15 +0000
  • 0cd2bd5346 Add Matrix button Andrei E 2021-12-26 21:58:27 +0000
  • 0ea7a0c031 Merge branch 'master' of github.com:mwmbl/mwmbl Daoud Clarke 2021-12-26 21:26:50 +0000
  • b47de434dd Add link to Matrix chat Daoud Clarke 2021-12-26 21:26:39 +0000
  • 8fcc75f037 Create LICENSE Daoud Clarke 2021-12-26 09:08:31 +0000
  • baede32298 Move indexer code to a separate package Daoud Clarke 2021-12-26 08:55:09 +0000
  • 8cfb8b7a44 Remove debug print code Daoud Clarke 2021-12-26 08:47:33 +0000
  • 794af00bfb Update domain name Daoud Clarke 2021-12-25 23:19:01 +0000
  • 722328efa5 Write readme Daoud Clarke 2021-12-25 23:17:38 +0000
  • 6ab961d070 Look for onchange events to get it working on mobile Daoud Clarke 2021-12-25 09:50:58 +0000
  • 0721ec0f81 Specify correct host to make the app available Daoud Clarke 2021-12-24 21:24:35 +0000
  • 7e520fb32f Get Dockerfile working Daoud Clarke 2021-12-23 21:30:51 +0000
  • 9c65bf3c8f WIP: implement docker image. TODO: copy index and set the correct index path using env var Daoud Clarke 2021-12-22 23:21:23 +0000
  • f754b38f71 Prevent default for up and down keys Daoud Clarke 2021-12-20 23:21:40 +0000
  • 8f8fc43c9f Improve focus on reload, back, etc Daoud Clarke 2021-12-20 23:15:42 +0000
  • 202ef35d7a Make Enter key work when pressing Enter Daoud Clarke 2021-12-20 23:05:22 +0000
  • 30a00425ae Follow selected item on enter Daoud Clarke 2021-12-20 23:02:12 +0000
  • 5e7c5a905e Select item with arrow keys Daoud Clarke 2021-12-20 21:28:01 +0000
  • 2d7bb0efd7 Add background colour and hover highlighting Daoud Clarke 2021-12-20 20:55:46 +0000
  • c22f522c07 Improve styling Daoud Clarke 2021-12-19 22:48:53 +0000
  • 7c745ef87b Show the URL Daoud Clarke 2021-12-19 22:34:44 +0000
  • 585f4bd00c Format extract differently Daoud Clarke 2021-12-19 22:16:01 +0000
  • 734798e4de Prefer items that find the result early on Daoud Clarke 2021-12-19 21:38:17 +0000
  • 9ee6f37a60 Analysis to confirm that 'leek and potato soup' page was really missing Daoud Clarke 2021-12-19 21:09:00 +0000
  • 4cbed29c08 Show the extract Daoud Clarke 2021-12-19 20:48:28 +0000
  • 16121d2b19 Index extracts Daoud Clarke 2021-12-18 22:56:39 +0000
  • 4fa1c4a39a Filter results with low scores Daoud Clarke 2021-12-18 22:35:59 +0000
  • 6b72a056b2 Improve results ordering Daoud Clarke 2021-12-18 12:42:04 +0000
  • cc290bfc07 Bold search terms in results Daoud Clarke 2021-12-17 21:31:26 +0000
  • e4d2a45d6c Add css Daoud Clarke 2021-12-16 22:26:50 +0000
  • 1d8b37add1 Set cursor at the end of the input Daoud Clarke 2021-12-16 21:55:04 +0000
  • af29b4c039 Update results as you type Daoud Clarke 2021-12-16 21:36:01 +0000
  • 23eb341832 Add search page Daoud Clarke 2021-12-14 22:01:59 +0000
  • 869127c6ec Add an error state Daoud Clarke 2021-12-14 19:59:31 +0000
  • 2844c1df75 Index common crawl data Daoud Clarke 2021-12-13 11:23:01 +0000
  • 65b366d30d Add spacy Daoud Clarke 2021-12-12 20:58:44 +0000
  • 16a8356a23 Run multiple processes in parallel Daoud Clarke 2021-12-12 09:09:44 +0000
  • 34dc50a6ed Output processed items to an output queue Daoud Clarke 2021-12-11 17:18:00 +0000
  • c46257c6d1 Use our own filesystem-based queue Daoud Clarke 2021-12-11 16:57:17 +0000
  • a76fd2d8f9 Use multiprocessing Daoud Clarke 2021-12-07 22:56:46 +0000
  • 2d554b14e7 Save results to gzip file Daoud Clarke 2021-12-07 22:10:16 +0000
  • 2562a5257a Extract locally Daoud Clarke 2021-12-05 22:25:37 +0000
  • c151fe3777 Extract archive info Daoud Clarke 2021-12-05 21:42:23 +0000
  • a173db319b Add EMR deploy scripts Daoud Clarke 2021-12-05 21:02:17 +0000
  • 14817d7657 Optimise imports Daoud Clarke 2021-12-05 20:38:05 +0000
  • 312f32bf61 Add common crawl extract script and dependency management with poetry Daoud Clarke 2021-12-05 20:31:49 +0000
  • 896f782379 Improve typing of indexer Daoud Clarke 2021-06-13 21:41:19 +0100
  • 0578f41a73 Limit number of chars used in query Daoud Clarke 2021-06-11 21:43:12 +0100
  • c81fc83900 Abstract index to allow storing anything Daoud Clarke 2021-06-05 22:22:31 +0100
  • fb5b6ffd45 Count terms Daoud Clarke 2021-05-30 21:30:34 +0100