Commit graph

216 commits

Author SHA1 Message Date
Daoud Clarke
eda7870788 Restrict to https and strip the prefix and / on the end 2022-08-11 22:23:14 +01:00
Daoud Clarke
23e47e963b Simplify completions 2022-08-11 17:34:52 +01:00
Daoud Clarke
3bcb7f42c1 Use heuristic ranker 2022-08-09 22:56:12 +01:00
Daoud Clarke
c1b9e70743 Add new LTR model 2022-08-09 22:47:59 +01:00
Daoud Clarke
57476ed2c8 Tweak features 2022-08-09 22:23:36 +01:00
Daoud Clarke
c99e813398 Get best-performing configuration 2022-08-09 20:56:15 +01:00
Daoud Clarke
8b50643303 Add in match score feature (although it hurts the results) 2022-08-09 00:08:55 +01:00
Daoud Clarke
c60b73a403 Create a get_features function and make it work like the heuristic approach 2022-08-08 23:42:34 +01:00
Daoud Clarke
c1d361c0a0 New LTR model trained on more data 2022-08-08 22:52:37 +01:00
Daoud Clarke
b99d9d1c6a Search for the term itself as well as its completion 2022-08-08 22:51:09 +01:00
Daoud Clarke
f40d82c449 Allow running with no background script 2022-08-01 23:33:02 +01:00
Daoud Clarke
ae658906dd Store the best items, not the worst ones 2022-07-31 22:55:15 +01:00
Daoud Clarke
fc1742e24f Reinstate correct num_pages 2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f Use an in-memory queue 2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e Use a randomised timeout for getting a new batch 2022-07-30 23:10:37 +01:00
Daoud Clarke
2942d83673 Get URL scores in batches 2022-07-30 14:35:21 +01:00
Daoud Clarke
3709cb236f Use correct index path; retrieve historical batches 2022-07-30 11:08:15 +01:00
Daoud Clarke
063ebb4504 args.index no longer exists 2022-07-30 10:57:15 +01:00
Daoud Clarke
ea32c0ba00 Double index size 2022-07-30 10:37:07 +01:00
Daoud Clarke
2d5235f6f6 More threads for retrieving batches 2022-07-30 10:10:11 +01:00
Daoud Clarke
218d873654 Delete unused SQL 2022-07-30 10:10:03 +01:00
Daoud Clarke
6209382d76 Index batches in memory 2022-07-24 15:44:01 +01:00
Daoud Clarke
1bceeae3df Implement new indexing approach 2022-07-23 23:19:36 +01:00
Daoud Clarke
a8a6c67239 Use URL path to store locally so that we can easily get a local path from a URL 2022-07-20 22:21:35 +01:00
Daoud Clarke
0d1e7d841c Implement a batch cache to store files locally before preprocessing 2022-07-19 21:18:43 +01:00
Daoud Clarke
5ce333cc9a Log at info level 2022-07-18 23:46:01 +01:00
Daoud Clarke
a097ec9fbe Allow more tries so that popular terms can be indexed 2022-07-18 23:42:09 +01:00
Daoud Clarke
cfca015efe Enough preprocessing 2022-07-18 22:36:37 +01:00
Daoud Clarke
003cd217f4 Run preprocessing 2022-07-18 22:21:20 +01:00
Daoud Clarke
bcd31326b8 Just index a single page for now 2022-07-18 22:17:15 +01:00
Daoud Clarke
a471bc2437 Use a more specific exception in case we're discarding ones we shouldn't 2022-07-18 22:05:24 +01:00
Daoud Clarke
ce9f52267a Run update 2022-07-18 21:55:27 +01:00
Daoud Clarke
09a9390c92 Catch corrupt data 2022-07-18 21:40:38 +01:00
Daoud Clarke
93307ad1ec Add util script to send batch; add logging 2022-07-18 21:37:19 +01:00
Daoud Clarke
3c97fdb3a0
Merge pull request #66 from mwmbl/fix-unicode-encode-error
Fix unicode encode error; bigger index
2022-07-16 10:59:14 +01:00
Daoud Clarke
680fe1ca0c Fix unicode encoding error 2022-07-16 10:54:25 +01:00
Daoud Clarke
fee5cbb400 10x index size 2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e Fix issue #60 2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f Don't include web.archive.org as a curated domain 2022-07-04 15:44:28 +01:00
Daoud Clarke
43815c7322 Add a URL length penalty 2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f Score domain and path, weight components 2022-07-03 21:55:20 +01:00
Daoud Clarke
9482ae5028 Delete documents that have been preprocessed from the database to save space 2022-07-03 09:44:51 +01:00
Daoud Clarke
f9fefa0b62 Record new batches as being local 2022-07-02 13:25:31 +01:00
Daoud Clarke
e578d55789 Allow crawling links from unknown domains 2022-07-01 21:35:34 +01:00
Daoud Clarke
db1aa1a928 Don't require a slash for the search URL 2022-07-01 20:43:38 +01:00
Daoud Clarke
24f82a3c2f Actually used the passed in timestamp 2022-06-30 20:57:01 +01:00
Daoud Clarke
d47457b834 CONFIRMED no longer exists 2022-06-30 20:45:26 +01:00
Daoud Clarke
b6f29548db Fix log message 2022-06-30 20:42:37 +01:00
Daoud Clarke
e9835edc45 Wrap background tasks in try/except 2022-06-30 20:00:38 +01:00
Daoud Clarke
6ea3a95684 Allow batches to fail silently 2022-06-30 19:52:58 +01:00
Daoud Clarke
ddc8664c11 Queue the right type of batch 2022-06-29 22:52:12 +01:00
Daoud Clarke
2b52b50569 Queue new batches for indexing 2022-06-29 22:49:24 +01:00
Daoud Clarke
b8c495bda8 Correctly insert new URLs 2022-06-29 22:39:21 +01:00
Daoud Clarke
955d650cf4 Prevent deadlock when inserting URLs 2022-06-28 22:34:46 +01:00
Daoud Clarke
1457cba2c2 Cache batches; start a background process 2022-06-27 23:44:25 +01:00
Daoud Clarke
ff2312a5ca Use different scores for same domain links 2022-06-27 22:46:06 +01:00
Daoud Clarke
36b168a8f6 Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now 2022-06-26 21:23:57 +01:00
Daoud Clarke
5e1ec9ccd5 Temporarily disable startup background processes; add root domains; check for empty batches. 2022-06-26 21:15:52 +01:00
Daoud Clarke
e27d749e18 Investigate duplication of URLs in batches 2022-06-26 21:11:51 +01:00
Daoud Clarke
1d9b5cb3ca Make more robust 2022-06-21 08:44:46 +01:00
Daoud Clarke
30e1e19072 Update queued pages in the index 2022-06-20 23:35:44 +01:00
Daoud Clarke
4330551e0f Tokenize documents and store pages to be added to the index 2022-06-20 22:54:35 +01:00
Daoud Clarke
9594915de1 WIP: index continuously. Retrieve batches and store in Postgres 2022-06-19 23:23:57 +01:00
Daoud Clarke
b8b605daed Factor out connection code 2022-06-19 16:52:25 +01:00
Daoud Clarke
c31cea710f CORS is handled by nginx 2022-06-19 13:13:36 +01:00
Daoud Clarke
96da534ca5 Don't add CORS on the python side 2022-06-19 11:34:54 +01:00
Daoud Clarke
866c17f2dc Use the dokku app storage 2022-06-19 09:53:19 +01:00
Daoud Clarke
16c2692099 Start processing historical data on startup 2022-06-19 08:56:55 +01:00
Daoud Clarke
d400950689 Add script to process historical data 2022-06-18 15:31:35 +01:00
Daoud Clarke
d7c6dcb5c2 Use the correct port for dokku 2022-06-17 23:54:22 +01:00
Daoud Clarke
77088a8a1b Use a database URL env var 2022-06-17 23:39:24 +01:00
Daoud Clarke
476481c5f8 Put the resources in the package 2022-06-17 23:32:43 +01:00
Daoud Clarke
505e7521d4 Copy the resources 2022-06-17 23:29:04 +01:00
Daoud Clarke
5ea9efcfa2 Fix relative path 2022-06-17 23:19:30 +01:00
Daoud Clarke
1c7420e5fb Don't depend on existing data 2022-06-17 23:12:22 +01:00
Daoud Clarke
e2eb405083 Combine crawler and search servers 2022-06-16 22:49:41 +01:00
Daoud Clarke
770b4b945b Refactor feature extraction 2022-05-07 22:52:36 +01:00
Daoud Clarke
87d8b40cad Make order_results public 2022-05-06 23:15:50 +01:00
Daoud Clarke
229819e57e Refactor to allow LTR ranker 2022-03-27 22:32:44 +01:00
Daoud Clarke
94287cec01 Get features for each string separately 2022-03-21 21:49:10 +00:00
Daoud Clarke
4740d89c6a Add domain score feature 2022-03-21 21:13:20 +00:00
Daoud Clarke
af6a28fac3 Implement learning to rank feature extraction and thresholding 2022-03-20 22:01:45 +00:00
Daoud Clarke
2d334074af Make get_results() public for learning to rank 2022-03-20 17:25:54 +00:00
Daoud Clarke
ee5ca6bcf6 Experiment with score variations (best is simple weighted domain score) 2022-02-27 21:24:16 +00:00
Daoud Clarke
6fb310c363 Use addition instead of multiplication 2022-02-25 22:19:26 +00:00
Daoud Clarke
4e6516ccf1 Scale by 0.99 2022-02-25 22:14:49 +00:00
Daoud Clarke
f5afbed2e5 Handle empty list 2022-02-25 22:11:09 +00:00
Daoud Clarke
efafec5214 Rank using item score as well as match score 2022-02-25 22:08:37 +00:00
Daoud Clarke
e1e9e404a3 Dedupe before indexing 2022-02-24 22:01:42 +00:00
Daoud Clarke
f5b20d0128 Index link counts 2022-02-24 20:47:36 +00:00
Daoud Clarke
b5b2005323 Store computed link counts 2022-02-23 22:13:38 +00:00
Daoud Clarke
00d18c3474 Remove unused code 2022-02-23 21:59:24 +00:00
Daoud Clarke
04a33a134b Fixes to mwmbl API for changes to the index 2022-02-22 22:27:02 +00:00
Daoud Clarke
ae3b334a7f Fixes for API changes 2022-02-22 22:12:39 +00:00
Daoud Clarke
326f7e3d7f Use JSON instead of struct to store metadata 2022-02-18 22:22:47 +00:00
Daoud Clarke
e6273c7f76 WIP: include metadata in index - using struct approach 2022-02-18 22:12:22 +00:00
Daoud Clarke
e03e379ccf Refactor to enable easier evaluation 2022-02-09 22:43:47 +00:00
Daoud Clarke
6e5e56f99a New index; more pages 2022-02-04 18:08:23 +00:00
Daoud Clarke
fe6ace93e6 Improve handling of incomplete words:
- Correctly generate regex for incomplete vs complete words
- Return more than one top word from completer
 - Correctly handle no terms
2022-01-31 21:20:59 +00:00
Daoud Clarke
7d829bc319 Use python 3.10; complete terms 2022-01-30 23:24:00 +00:00
Daoud Clarke
3c75dd1a74 WIP: implement term completer 2022-01-30 22:20:28 +00:00
Daoud Clarke
01a21337a9 Don't index partial words 2022-01-30 14:30:02 +00:00
Daoud Clarke
2ef8304919 Remove some debug print statements 2022-01-30 13:16:24 +00:00
Daoud Clarke
5b89bbf05d Index Mwmbl crawled data 2022-01-29 08:26:42 +00:00
Daoud Clarke
70254ae160 Analyse crawled URLs and domains 2022-01-26 18:51:58 +00:00
Daoud Clarke
171fa645d2 Add script to export top domains 2022-01-23 22:04:30 +00:00
ColinEspinas
3481ad372b Removed old front-end files and routes 2022-01-19 23:33:37 +01:00
Daoud Clarke
a41088ca9a Add CORS; revert back to previous index as it timed out deploying 2022-01-03 18:31:03 +00:00
Daoud Clarke
25918e42ef Export URLs to sqlite for evaluation purposes 2022-01-02 20:06:13 +00:00
nitred
fbdb93c86a Using the app object to start uvicorn, instead of using a reference like "mwmbl.tinysearchengine.app:app"
- fixes the issue when running the server using python -m mwmbl.tinysearchengine.app

When running the server using python -m, uvicorn seems to spawn a new process or interpreter session.
At least it appears that way since already initialized & imported modules and variables appear to be uninitialized.
2021-12-31 02:15:16 +01:00
Daoud Clarke
e6655101ef Add a component of the HN domain score when ranking 2021-12-30 22:20:10 +00:00
Daoud Clarke
02bcef640c
Merge pull request #25 from ColinEspinas/search-debounce
Added debounce on search input
2021-12-29 20:59:29 +00:00
ColinEspinas
c636be9089 Added debounce on search input (#8) 2021-12-29 21:03:47 +01:00
nitred
a72a08a7d9 added config and binary/entrypoint for mwmbl.tinysearchengine
- using pydantic to validate the config
- added a default bootstrap config at config/tinysearchengine.yaml
- refactored app.py to include parsing CLI argument using argparse
- refactored app.py to use fewer global variables
- added "mwmbl-tinysearchengine" binary/entrypoint in pyproject.toml
- updated Dockerfile to work with these changes and added comments to it
2021-12-29 15:26:33 +01:00
nitred
be40a15b27 Merge branch 'master' into mwmbl-package 2021-12-29 00:25:37 +01:00
nitred
11eedcde84 renamed package to mwmbl
- renamed package to mwmbl in pyproject.toml
- tinysearchengine and indexer modules have been moved into mwmbl package folder
- analyse module has been left as is in the root of the repo
- import statements in tinysearchengine now use mwmbl.tinysearchengine
- import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths
- import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine
- final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app
- fixed a couple of import statement errors in tinysearchengine/indexer.py
2021-12-28 12:35:46 +01:00