Daoud Clarke
bb8a36a612
Number of pages is an int
2022-12-27 10:40:53 +00:00
Daoud Clarke
c01129cdb9
Merge branch 'master' of github.com:mwmbl/mwmbl
2022-12-27 10:25:41 +00:00
Daoud Clarke
26351a1072
Use the correct storage location in prod
2022-12-27 10:24:48 +00:00
Daoud Clarke
f3f3831a97
Merge pull request #83 from omasanori/spacy-deps-rework
...
Rework installation of spaCy models for clarity
2022-12-27 10:20:52 +00:00
Masanori Ogino
71187a3938
Rework installation of spaCy models for clarity
...
- Install the wheel package for compatibility with future pip
- Use `spacy download` for installing model(s)
- Use `spacy validate` for checking model compatibility explicitly
Signed-off-by: Masanori Ogino <167209+omasanori@users.noreply.github.com>
2022-12-27 11:33:52 +09:00
Daoud Clarke
d85067ec09
Remove apt command
2022-12-24 20:20:53 +00:00
Daoud Clarke
1ef60e8d5d
Put install in correct place
2022-12-24 20:18:02 +00:00
Daoud Clarke
8e613dd368
Install psql client
2022-12-24 20:13:53 +00:00
Daoud Clarke
80282cfc7a
Exclude a domain
2022-12-24 19:59:56 +00:00
Daoud Clarke
57295846cb
Update README.md
2022-12-21 21:49:56 +00:00
Daoud Clarke
efc8e8e383
Merge pull request #78 from mwmbl/make-dev-easier
...
Make it easier to run mwmbl locally
2022-12-19 21:50:54 +00:00
Daoud Clarke
f8ab6092b0
Suggest using dokku instead of docker directly
2022-12-08 22:33:58 +00:00
Daoud Clarke
a50bc28436
Make it easier to run mwmbl locally
2022-12-07 20:01:31 +00:00
Daoud Clarke
c0f89ba6c3
Update matrix badge
2022-12-05 18:47:26 +00:00
Daoud Clarke
dd4dd8a752
Exclude an annoying web site
2022-12-02 21:29:06 +00:00
Daoud Clarke
40f9eade9a
Update index name
2022-08-27 09:38:39 +01:00
Daoud Clarke
b6183e00ea
Merge pull request #74 from mwmbl/evaluate-indexing
...
Evaluate indexing
2022-08-27 09:37:22 +01:00
Daoud Clarke
cf253ae524
Split out URL updating from indexing
2022-08-26 22:20:35 +01:00
Daoud Clarke
f4fb9f831a
Use terms and bigrams from the beginning of the string only
2022-08-26 17:20:11 +01:00
Daoud Clarke
619b6c3a93
Don't remove stopwords
2022-08-24 21:08:33 +01:00
Daoud Clarke
578b705609
Don't replace full stops and commas
2022-08-23 22:06:43 +01:00
Daoud Clarke
4779371cf3
Use a custom tokenizer
2022-08-23 21:57:38 +01:00
Daoud Clarke
b1eea2457f
Script to index local batch for evaluation
2022-08-22 22:47:42 +01:00
Daoud Clarke
480be85cfd
Fix bug in completions with duplicated terms
2022-08-14 22:03:50 +01:00
Daoud Clarke
f7660bcd27
Merge pull request #73 from mwmbl/completion
...
Completion
2022-08-13 23:55:22 +01:00
Daoud Clarke
627f82d19f
Suggest searching Google if there are no search results
2022-08-13 23:54:57 +01:00
Daoud Clarke
f1c77d1389
Search Google if there are no results
2022-08-13 23:47:48 +01:00
Daoud Clarke
fe5eff7b64
Exclude web.archive.org as we're only crawling that right now
2022-08-13 10:52:31 +01:00
Daoud Clarke
00705703f3
Require matching at least half the terms
2022-08-11 23:27:30 +01:00
Daoud Clarke
eda7870788
Restrict to https and strip the prefix and / on the end
2022-08-11 22:23:14 +01:00
Daoud Clarke
23e47e963b
Simplify completions
2022-08-11 17:34:52 +01:00
Daoud Clarke
c6773b46c4
Merge pull request #72 from mwmbl/improve-ranking-with-multi-term-search
...
Improve ranking with multi term search
2022-08-10 21:43:51 +01:00
Daoud Clarke
74107667b4
Improve printing of search results in script
2022-08-10 21:43:13 +01:00
Daoud Clarke
3bcb7f42c1
Use heuristic ranker
2022-08-09 22:56:12 +01:00
Daoud Clarke
c1b9e70743
Add new LTR model
2022-08-09 22:47:59 +01:00
Daoud Clarke
57476ed2c8
Tweak features
2022-08-09 22:23:36 +01:00
Daoud Clarke
c99e813398
Get best-performing configuration
2022-08-09 20:56:15 +01:00
Daoud Clarke
8b50643303
Add in match score feature (although it hurts the results)
2022-08-09 00:08:55 +01:00
Daoud Clarke
c60b73a403
Create a get_features function and make it work like the heuristic approach
2022-08-08 23:42:34 +01:00
Daoud Clarke
c1d361c0a0
New LTR model trained on more data
2022-08-08 22:52:37 +01:00
Daoud Clarke
b99d9d1c6a
Search for the term itself as well as its completion
2022-08-08 22:51:09 +01:00
Daoud Clarke
f40d82c449
Allow running with no background script
2022-08-01 23:33:02 +01:00
Daoud Clarke
046f86f7e3
Merge pull request #71 from mwmbl/fix-missing-scores
...
Store the best items, not the worst ones
2022-08-01 23:32:24 +01:00
Daoud Clarke
ae658906dd
Store the best items, not the worst ones
2022-07-31 22:55:15 +01:00
Daoud Clarke
aa5878fd2f
Merge pull request #70 from mwmbl/reduce-new-batch-contention
...
Reduce new batch contention
2022-07-31 21:02:05 +01:00
Daoud Clarke
fc1742e24f
Reinstate correct num_pages
2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f
Use an in-memory queue
2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e
Use a randomised timeout for getting a new batch
2022-07-30 23:10:37 +01:00
Daoud Clarke
a54e093cf1
Merge pull request #69 from mwmbl/reduce-contention-for-client-queries
...
Reduce contention for client queries
2022-07-30 17:11:34 +01:00
Daoud Clarke
2942d83673
Get URL scores in batches
2022-07-30 14:35:21 +01:00