Daoud Clarke
77e39b4a89
Optimise URL update
2023-01-22 20:28:18 +00:00
Daoud Clarke
66700f8a3e
Speed up domain parsing
2023-01-20 20:53:50 +00:00
Daoud Clarke
4779371cf3
Use a custom tokenizer
2022-08-23 21:57:38 +01:00
Daoud Clarke
b1eea2457f
Script to index local batch for evaluation
2022-08-22 22:47:42 +01:00
Daoud Clarke
00705703f3
Require matching at least half the terms
2022-08-11 23:27:30 +01:00
Daoud Clarke
74107667b4
Improve printing of search results in script
2022-08-10 21:43:13 +01:00
Daoud Clarke
c1d361c0a0
New LTR model trained on more data
2022-08-08 22:52:37 +01:00
Daoud Clarke
ae658906dd
Store the best items, not the worst ones
2022-07-31 22:55:15 +01:00
Daoud Clarke
93307ad1ec
Add util script to send batch; add logging
2022-07-18 21:37:19 +01:00
Daoud Clarke
ff2312a5ca
Use different scores for same domain links
2022-06-27 22:46:06 +01:00
Daoud Clarke
e27d749e18
Investigate duplication of URLs in batches
2022-06-26 21:11:51 +01:00
Daoud Clarke
eb571fc5fe
Add a script to count urls in the index
2022-06-21 21:55:38 +01:00
Daoud Clarke
e2eb405083
Combine crawler and search servers
2022-06-16 22:49:41 +01:00
Daoud Clarke
14107acc75
Use new server
2022-06-09 22:24:54 +01:00
Daoud Clarke
aaca8b2b6e
Record historical batches via the API
2022-06-05 09:15:04 +01:00
Daoud Clarke
f5b20d0128
Index link counts
2022-02-24 20:47:36 +00:00
Daoud Clarke
b5b2005323
Store computed link counts
2022-02-23 22:13:38 +00:00
Daoud Clarke
00d18c3474
Remove unused code
2022-02-23 21:59:24 +00:00
Daoud Clarke
e03e379ccf
Refactor to enable easier evaluation
2022-02-09 22:43:47 +00:00
Daoud Clarke
2fc999b402
Count unique domains instead of links
2022-02-02 20:09:59 +00:00
Daoud Clarke
d77b72d7df
Analyse links to find most popular ones
2022-02-02 19:47:38 +00:00
Daoud Clarke
ef36513f64
Analyse the pages that are crawled most often
2022-01-29 07:06:53 +00:00
Daoud Clarke
70254ae160
Analyse crawled URLs and domains
2022-01-26 18:51:58 +00:00
Daoud Clarke
171fa645d2
Add script to export top domains
2022-01-23 22:04:30 +00:00
Daoud Clarke
25918e42ef
Export URLs to sqlite for evaluation purposes
2022-01-02 20:06:13 +00:00
nitred
11eedcde84
renamed package to mwmbl
- renamed package to mwmbl in pyproject.toml
- tinysearchengine and indexer modules have been moved into mwmbl package folder
- analyse module has been left as is in the root of the repo
- import statements in tinysearchengine now use mwmbl.tinysearchengine
- import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths
- import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine
- final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app
- fixed a couple of import statement errors in tinysearchengine/indexer.py
2021-12-28 12:35:46 +01:00
Daoud Clarke
baede32298
Move indexer code to a separate package
2021-12-26 08:55:09 +00:00
Daoud Clarke
9c65bf3c8f
WIP: implement docker image. TODO: copy index and set the correct index path using env var
2021-12-22 23:21:23 +00:00
Daoud Clarke
9ee6f37a60
Analysis to confirm that 'leek and potato soup' page was really missing
2021-12-19 21:09:00 +00:00
Daoud Clarke
4cbed29c08
Show the extract
2021-12-19 20:48:28 +00:00