Daoud Clarke
eda7870788
Restrict to https and strip the prefix and / on the end
2022-08-11 22:23:14 +01:00
Daoud Clarke
23e47e963b
Simplify completions
2022-08-11 17:34:52 +01:00
Daoud Clarke
3bcb7f42c1
Use heuristic ranker
2022-08-09 22:56:12 +01:00
Daoud Clarke
c1b9e70743
Add new LTR model
2022-08-09 22:47:59 +01:00
Daoud Clarke
57476ed2c8
Tweak features
2022-08-09 22:23:36 +01:00
Daoud Clarke
c99e813398
Get best-performing configuration
2022-08-09 20:56:15 +01:00
Daoud Clarke
8b50643303
Add in match score feature (although it hurts the results)
2022-08-09 00:08:55 +01:00
Daoud Clarke
c60b73a403
Create a get_features function and make it work like the heuristic approach
2022-08-08 23:42:34 +01:00
Daoud Clarke
c1d361c0a0
New LTR model trained on more data
2022-08-08 22:52:37 +01:00
Daoud Clarke
b99d9d1c6a
Search for the term itself as well as its completion
2022-08-08 22:51:09 +01:00
Daoud Clarke
f40d82c449
Allow running with no background script
2022-08-01 23:33:02 +01:00
Daoud Clarke
ae658906dd
Store the best items, not the worst ones
2022-07-31 22:55:15 +01:00
Daoud Clarke
fc1742e24f
Reinstate correct num_pages
2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f
Use an in-memory queue
2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e
Use a randomised timeout for getting a new batch
2022-07-30 23:10:37 +01:00
Daoud Clarke
2942d83673
Get URL scores in batches
2022-07-30 14:35:21 +01:00
Daoud Clarke
3709cb236f
Use correct index path; retrieve historical batches
2022-07-30 11:08:15 +01:00
Daoud Clarke
063ebb4504
args.index no longer exists
2022-07-30 10:57:15 +01:00
Daoud Clarke
ea32c0ba00
Double index size
2022-07-30 10:37:07 +01:00
Daoud Clarke
2d5235f6f6
More threads for retrieving batches
2022-07-30 10:10:11 +01:00
Daoud Clarke
218d873654
Delete unused SQL
2022-07-30 10:10:03 +01:00
Daoud Clarke
6209382d76
Index batches in memory
2022-07-24 15:44:01 +01:00
Daoud Clarke
1bceeae3df
Implement new indexing approach
2022-07-23 23:19:36 +01:00
Daoud Clarke
a8a6c67239
Use URL path to store locally so that we can easily get a local path from a URL
2022-07-20 22:21:35 +01:00
Daoud Clarke
0d1e7d841c
Implement a batch cache to store files locally before preprocessing
2022-07-19 21:18:43 +01:00
Daoud Clarke
5ce333cc9a
Log at info level
2022-07-18 23:46:01 +01:00
Daoud Clarke
a097ec9fbe
Allow more tries so that popular terms can be indexed
2022-07-18 23:42:09 +01:00
Daoud Clarke
cfca015efe
Enough preprocessing
2022-07-18 22:36:37 +01:00
Daoud Clarke
003cd217f4
Run preprocessing
2022-07-18 22:21:20 +01:00
Daoud Clarke
bcd31326b8
Just index a single page for now
2022-07-18 22:17:15 +01:00
Daoud Clarke
a471bc2437
Use a more specific exception in case we're discarding ones we shouldn't
2022-07-18 22:05:24 +01:00
Daoud Clarke
ce9f52267a
Run update
2022-07-18 21:55:27 +01:00
Daoud Clarke
09a9390c92
Catch corrupt data
2022-07-18 21:40:38 +01:00
Daoud Clarke
93307ad1ec
Add util script to send batch; add logging
2022-07-18 21:37:19 +01:00
Daoud Clarke
3c97fdb3a0
Merge pull request #66 from mwmbl/fix-unicode-encode-error
Fix unicode encode error; bigger index
2022-07-16 10:59:14 +01:00
Daoud Clarke
680fe1ca0c
Fix unicode encoding error
2022-07-16 10:54:25 +01:00
Daoud Clarke
fee5cbb400
10x index size
2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e
Fix issue #60
2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f
Don't include web.archive.org as a curated domain
2022-07-04 15:44:28 +01:00
Daoud Clarke
43815c7322
Add a URL length penalty
2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f
Score domain and path, weight components
2022-07-03 21:55:20 +01:00
Daoud Clarke
9482ae5028
Delete documents that have been preprocessed from the database to save space
2022-07-03 09:44:51 +01:00
Daoud Clarke
f9fefa0b62
Record new batches as being local
2022-07-02 13:25:31 +01:00
Daoud Clarke
e578d55789
Allow crawling links from unknown domains
2022-07-01 21:35:34 +01:00
Daoud Clarke
db1aa1a928
Don't require a slash for the search URL
2022-07-01 20:43:38 +01:00
Daoud Clarke
24f82a3c2f
Actually use the passed-in timestamp
2022-06-30 20:57:01 +01:00
Daoud Clarke
d47457b834
CONFIRMED no longer exists
2022-06-30 20:45:26 +01:00
Daoud Clarke
b6f29548db
Fix log message
2022-06-30 20:42:37 +01:00
Daoud Clarke
e9835edc45
Wrap background tasks in try/except
2022-06-30 20:00:38 +01:00
Daoud Clarke
6ea3a95684
Allow batches to fail silently
2022-06-30 19:52:58 +01:00
Daoud Clarke
ddc8664c11
Queue the right type of batch
2022-06-29 22:52:12 +01:00
Daoud Clarke
2b52b50569
Queue new batches for indexing
2022-06-29 22:49:24 +01:00
Daoud Clarke
b8c495bda8
Correctly insert new URLs
2022-06-29 22:39:21 +01:00
Daoud Clarke
955d650cf4
Prevent deadlock when inserting URLs
2022-06-28 22:34:46 +01:00
Daoud Clarke
1457cba2c2
Cache batches; start a background process
2022-06-27 23:44:25 +01:00
Daoud Clarke
ff2312a5ca
Use different scores for same domain links
2022-06-27 22:46:06 +01:00
Daoud Clarke
36b168a8f6
Fix the found-URL SQL logic and allow crawling URLs crawled by one user for now
2022-06-26 21:23:57 +01:00
Daoud Clarke
5e1ec9ccd5
Temporarily disable startup background processes; add root domains; check for empty batches.
2022-06-26 21:15:52 +01:00
Daoud Clarke
e27d749e18
Investigate duplication of URLs in batches
2022-06-26 21:11:51 +01:00
Daoud Clarke
1d9b5cb3ca
Make more robust
2022-06-21 08:44:46 +01:00
Daoud Clarke
30e1e19072
Update queued pages in the index
2022-06-20 23:35:44 +01:00
Daoud Clarke
4330551e0f
Tokenize documents and store pages to be added to the index
2022-06-20 22:54:35 +01:00
Daoud Clarke
9594915de1
WIP: index continuously. Retrieve batches and store in Postgres
2022-06-19 23:23:57 +01:00
Daoud Clarke
b8b605daed
Factor out connection code
2022-06-19 16:52:25 +01:00
Daoud Clarke
c31cea710f
CORS is handled by nginx
2022-06-19 13:13:36 +01:00
Daoud Clarke
96da534ca5
Don't add CORS on the python side
2022-06-19 11:34:54 +01:00
Daoud Clarke
866c17f2dc
Use the dokku app storage
2022-06-19 09:53:19 +01:00
Daoud Clarke
16c2692099
Start processing historical data on startup
2022-06-19 08:56:55 +01:00
Daoud Clarke
d400950689
Add script to process historical data
2022-06-18 15:31:35 +01:00
Daoud Clarke
d7c6dcb5c2
Use the correct port for dokku
2022-06-17 23:54:22 +01:00
Daoud Clarke
77088a8a1b
Use a database URL env var
2022-06-17 23:39:24 +01:00
Daoud Clarke
476481c5f8
Put the resources in the package
2022-06-17 23:32:43 +01:00
Daoud Clarke
505e7521d4
Copy the resources
2022-06-17 23:29:04 +01:00
Daoud Clarke
5ea9efcfa2
Fix relative path
2022-06-17 23:19:30 +01:00
Daoud Clarke
1c7420e5fb
Don't depend on existing data
2022-06-17 23:12:22 +01:00
Daoud Clarke
e2eb405083
Combine crawler and search servers
2022-06-16 22:49:41 +01:00
Daoud Clarke
770b4b945b
Refactor feature extraction
2022-05-07 22:52:36 +01:00
Daoud Clarke
87d8b40cad
Make order_results public
2022-05-06 23:15:50 +01:00
Daoud Clarke
229819e57e
Refactor to allow LTR ranker
2022-03-27 22:32:44 +01:00
Daoud Clarke
94287cec01
Get features for each string separately
2022-03-21 21:49:10 +00:00
Daoud Clarke
4740d89c6a
Add domain score feature
2022-03-21 21:13:20 +00:00
Daoud Clarke
af6a28fac3
Implement learning to rank feature extraction and thresholding
2022-03-20 22:01:45 +00:00
Daoud Clarke
2d334074af
Make get_results() public for learning to rank
2022-03-20 17:25:54 +00:00
Daoud Clarke
ee5ca6bcf6
Experiment with score variations (best is simple weighted domain score)
2022-02-27 21:24:16 +00:00
Daoud Clarke
6fb310c363
Use addition instead of multiplication
2022-02-25 22:19:26 +00:00
Daoud Clarke
4e6516ccf1
Scale by 0.99
2022-02-25 22:14:49 +00:00
Daoud Clarke
f5afbed2e5
Handle empty list
2022-02-25 22:11:09 +00:00
Daoud Clarke
efafec5214
Rank using item score as well as match score
2022-02-25 22:08:37 +00:00
Daoud Clarke
e1e9e404a3
Dedupe before indexing
2022-02-24 22:01:42 +00:00
Daoud Clarke
f5b20d0128
Index link counts
2022-02-24 20:47:36 +00:00
Daoud Clarke
b5b2005323
Store computed link counts
2022-02-23 22:13:38 +00:00
Daoud Clarke
00d18c3474
Remove unused code
2022-02-23 21:59:24 +00:00
Daoud Clarke
04a33a134b
Fixes to mwmbl API for changes to the index
2022-02-22 22:27:02 +00:00
Daoud Clarke
ae3b334a7f
Fixes for API changes
2022-02-22 22:12:39 +00:00
Daoud Clarke
326f7e3d7f
Use JSON instead of struct to store metadata
2022-02-18 22:22:47 +00:00
Daoud Clarke
e6273c7f76
WIP: include metadata in index - using struct approach
2022-02-18 22:12:22 +00:00
Daoud Clarke
e03e379ccf
Refactor to enable easier evaluation
2022-02-09 22:43:47 +00:00
Daoud Clarke
6e5e56f99a
New index; more pages
2022-02-04 18:08:23 +00:00
Daoud Clarke
fe6ace93e6
Improve handling of incomplete words:
- Correctly generate regex for incomplete vs complete words
- Return more than one top word from completer
- Correctly handle no terms
2022-01-31 21:20:59 +00:00
Daoud Clarke
7d829bc319
Use python 3.10; complete terms
2022-01-30 23:24:00 +00:00
Daoud Clarke
3c75dd1a74
WIP: implement term completer
2022-01-30 22:20:28 +00:00
Daoud Clarke
01a21337a9
Don't index partial words
2022-01-30 14:30:02 +00:00
Daoud Clarke
2ef8304919
Remove some debug print statements
2022-01-30 13:16:24 +00:00
Daoud Clarke
5b89bbf05d
Index Mwmbl crawled data
2022-01-29 08:26:42 +00:00
Daoud Clarke
70254ae160
Analyse crawled URLs and domains
2022-01-26 18:51:58 +00:00
Daoud Clarke
171fa645d2
Add script to export top domains
2022-01-23 22:04:30 +00:00
ColinEspinas
3481ad372b
Removed old front-end files and routes
2022-01-19 23:33:37 +01:00
Daoud Clarke
a41088ca9a
Add CORS; revert back to previous index as it timed out deploying
2022-01-03 18:31:03 +00:00
Daoud Clarke
25918e42ef
Export URLs to sqlite for evaluation purposes
2022-01-02 20:06:13 +00:00
nitred
fbdb93c86a
Using the app object to start uvicorn, instead of using a reference like "mwmbl.tinysearchengine.app:app"
- fixes the issue when running the server using python -m mwmbl.tinysearchengine.app
When running the server using python -m, uvicorn seems to spawn a new process or interpreter session.
At least it appears that way since already initialized & imported modules and variables appear to be uninitialized.
2021-12-31 02:15:16 +01:00
Daoud Clarke
e6655101ef
Add a component of the HN domain score when ranking
2021-12-30 22:20:10 +00:00
Daoud Clarke
02bcef640c
Merge pull request #25 from ColinEspinas/search-debounce
Added debounce on search input
2021-12-29 20:59:29 +00:00
ColinEspinas
c636be9089
Added debounce on search input (#8)
2021-12-29 21:03:47 +01:00
nitred
a72a08a7d9
added config and binary/entrypoint for mwmbl.tinysearchengine
- using pydantic to validate the config
- added a default bootstrap config at config/tinysearchengine.yaml
- refactored app.py to include parsing CLI argument using argparse
- refactored app.py to use fewer global variables
- added "mwmbl-tinysearchengine" binary/entrypoint in pyproject.toml
- updated Dockerfile to work with these changes and added comments to it
2021-12-29 15:26:33 +01:00
nitred
be40a15b27
Merge branch 'master' into mwmbl-package
2021-12-29 00:25:37 +01:00
nitred
11eedcde84
renamed package to mwmbl
- renamed package to mwmbl in pyproject.toml
- tinysearchengine and indexer modules have been moved into mwmbl package folder
- analyse module has been left as is in the root of the repo
- import statements in tinysearchengine now use mwmbl.tinysearchengine
- import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths
- import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine
- final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app
- fixed a couple of import statement errors in tinysearchengine/indexer.py
2021-12-28 12:35:46 +01:00