Daoud Clarke
6ff62fb119
Ensure URLs in queue are unique
2023-02-25 10:34:09 +00:00
Daoud Clarke
c36e1dffcb
Remove picolisp as a top domain since there are duplicate URLs
2023-02-25 09:56:26 +00:00
Daoud Clarke
362f9bfa9e
Write page to the correct location (metadata size offset bug fix)
2023-02-24 21:46:18 +00:00
Daoud Clarke
5616626fc1
Merge pull request #89 from mwmbl/update-urls-queue-quickly
Update urls queue quickly
2023-02-24 21:39:40 +00:00
Daoud Clarke
bc6be8b6d5
Merge branch 'master' into update-urls-queue-quickly
2023-02-24 21:37:54 +00:00
Daoud Clarke
a03b76e5cc
Fix broken test
2023-02-24 21:37:32 +00:00
Daoud Clarke
c97d946fcf
Go back to processing 10,000 batches at a time
2023-02-24 21:29:42 +00:00
Rishabh Singh Ahluwalia
38a5dbbf3c
Merge pull request #94 from mwmbl/rishabh-port-configuration
Allow configuration of port
2023-02-23 07:31:07 -08:00
Rishabh Singh Ahluwalia
2aa61a5121
Merge pull request #95 from mwmbl/rishabh-unit-testing-with-ci
Add PyUnit dependency + Unit Tests for completer.py + Github Actions CI for running unit tests
2023-02-23 07:30:48 -08:00
Rishabh Singh Ahluwalia
30aff3b920
Add pytest, unit tests for completer, gh actions ci
2023-02-22 21:37:10 -08:00
Rishabh Singh Ahluwalia
842aec19e2
Add port to args
2023-02-22 19:59:42 -08:00
Daoud Clarke
50a059410b
Merge pull request #93 from mwmbl/add-code-of-conduct-1
Create CODE_OF_CONDUCT.md
2023-02-15 20:36:31 +00:00
Rishabh Singh Ahluwalia
084a870f65
Merge pull request #92 from mwmbl/rishabh-add-launch-json
Add launch.json for vscode run/debugging
2023-02-12 07:17:47 -08:00
Daoud Clarke
68ecdee145
Create CONTRIBUTING.md
2023-02-11 15:17:35 +00:00
Daoud Clarke
3a07fb54b5
Create CODE_OF_CONDUCT.md
2023-02-11 15:13:08 +00:00
Daoud Clarke
d8dbe54f9c
Update README.md
2023-02-11 15:10:30 +00:00
Daoud Clarke
2daf902ca3
Merge pull request #90 from mwmbl/m1-mmap-issue-fix-2
Offset by metadata size manually to increase compatibility
2023-02-11 08:30:46 +00:00
Rishabh Singh Ahluwalia
7fdc8480bd
add launch.json for vscode debugging
2023-02-10 20:59:09 -08:00
Daoud Clarke
e890e56661
Offset by metadata size manually to increase compatibility
2023-02-05 15:49:09 +00:00
Daoud Clarke
5783cee6b7
Fix bugs
2023-01-24 22:52:58 +00:00
Daoud Clarke
77e39b4a89
Optimise URL update
2023-01-22 20:28:18 +00:00
Daoud Clarke
66700f8a3e
Speed up domain parsing
2023-01-20 20:53:50 +00:00
Daoud Clarke
2b36f2ccc1
Try and balance URLs before adding to queue
2023-01-19 21:56:40 +00:00
Daoud Clarke
603fcd4eb2
Create a custom URL queue
2023-01-14 21:59:31 +00:00
Daoud Clarke
01f08fd88d
Return updated URLs
2023-01-14 19:17:16 +00:00
Daoud Clarke
bd0cc3863e
Don't try and update an empty list of URLs
2023-01-09 21:02:40 +00:00
Daoud Clarke
d347a17d63
Update URL queue separately from the other background process to speed it up
2023-01-09 20:50:28 +00:00
Daoud Clarke
7bd12c1ead
Fix some bugs in URL fetching query
2023-01-02 20:51:23 +00:00
Daoud Clarke
a50f1d8ae3
Fix postgres install
2023-01-02 12:19:10 +00:00
Daoud Clarke
1ab16b1fb4
Install postgres client
2023-01-02 12:18:03 +00:00
Daoud Clarke
dda5a25ad0
Add core domains
2023-01-02 12:05:22 +00:00
Daoud Clarke
ab37bbe0a5
Exclude google plus
2023-01-01 22:18:47 +00:00
Daoud Clarke
2336ed7f7d
Allow posting extra links with lower score weighting
2023-01-01 20:37:41 +00:00
Daoud Clarke
6edf48693b
Check the domain is correct, potential bug in psql
2023-01-01 01:30:44 +00:00
Daoud Clarke
b7984684c9
Tidy, improve logging
2023-01-01 01:14:05 +00:00
Daoud Clarke
7c14cd99f8
Update the URL queue earlier
2022-12-31 23:37:59 +00:00
Daoud Clarke
0d33b4f68f
Merge pull request #86 from mwmbl/improve-crawling
Improve crawling
2022-12-31 22:56:21 +00:00
Daoud Clarke
a86e172bf3
Reinstate background tasks
2022-12-31 22:52:17 +00:00
Daoud Clarke
d9cd3c585b
Get results from other domains
2022-12-31 22:51:00 +00:00
Daoud Clarke
77f08d8f0a
Update URL status
2022-12-31 22:25:05 +00:00
Daoud Clarke
36af579f7c
Sample domains
2022-12-31 17:04:38 +00:00
Daoud Clarke
ea16e7b5cd
WIP: improve method of getting URLs for crawling
2022-12-31 13:37:40 +00:00
Daoud Clarke
7dae39b780
WIP: improve method of getting URLs for crawling
2022-12-31 13:32:15 +00:00
Daoud Clarke
c69108cfcc
Don't delete an index if the sizes don't match
2022-12-27 10:52:46 +00:00
Daoud Clarke
bb8a36a612
Number of pages is an int
2022-12-27 10:40:53 +00:00
Daoud Clarke
c01129cdb9
Merge branch 'master' of github.com:mwmbl/mwmbl
2022-12-27 10:25:41 +00:00
Daoud Clarke
26351a1072
Use the correct storage location in prod
2022-12-27 10:24:48 +00:00
Daoud Clarke
f3f3831a97
Merge pull request #83 from omasanori/spacy-deps-rework
Rework installation of spaCy models for clarity
2022-12-27 10:20:52 +00:00
Masanori Ogino
71187a3938
Rework installation of spaCy models for clarity
- Install the wheel package for compatibility with future pip
- Use `spacy download` for installing model(s)
- Use `spacy validate` for checking model compatibility explicitly
Signed-off-by: Masanori Ogino <167209+omasanori@users.noreply.github.com>
2022-12-27 11:33:52 +09:00
Daoud Clarke
d85067ec09
Remove apt command
2022-12-24 20:20:53 +00:00