* show basic (and ugly) entity information during search
* refactor image downloads into an image downloader
* download entity images
* show entity image and link to wiki article
* show related entities
* show links in entity text and infobox
* don't match stopwords in entity titles during search
* remove stopwords from entity query
* fixed bug where some Wikipedia images weren't downloaded
* prettify the primary image a bit
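The stopword handling mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `STOPWORDS` list and the `remove_stopwords` name are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical stopword list; a real list would be much longer.
const STOPWORDS: &[&str] = &["a", "an", "and", "in", "of", "the", "to"];

/// Lowercases the query, splits it on whitespace and drops stopwords.
fn remove_stopwords(query: &str) -> Vec<String> {
    let stopwords: HashSet<&str> = STOPWORDS.iter().copied().collect();
    query
        .split_whitespace()
        .map(|term| term.to_lowercase())
        .filter(|term| !stopwords.contains(term.as_str()))
        .collect()
}
```

With this sketch, a query such as "The history of Rome" is reduced to the terms "history" and "rome" before it is matched against entity titles.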
Extra information such as favicons and other images is now separated from the inverted index, and everything is stored in the index. This also makes it easier to keep a list of all pending image download jobs and execute them asynchronously (which we now do).
Since we didn't check whether the positions[i] vector had any elements, some searches caused the weights and positions vectors to have a different number of elements, which crashed on .zip_eq.
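One way to keep the two vectors aligned is to fill them in the same loop and skip terms without positions in both. The sketch below assumes one weight and one position list per term; the function name and types are hypothetical and only illustrate the invariant that .zip_eq (from itertools, which panics on unequal lengths) relies on.

```rust
/// Collects one weight and the first position for every term that has at
/// least one position. Terms with an empty position list are skipped in
/// both outputs, so the two vectors always have the same length.
fn collect_matches(term_weights: &[f32], term_positions: &[Vec<u32>]) -> (Vec<f32>, Vec<u32>) {
    let mut weights = Vec::new();
    let mut first_positions = Vec::new();
    for (weight, positions) in term_weights.iter().zip(term_positions) {
        // The guard that was missing: only push when positions is non-empty.
        if let Some(&first) = positions.first() {
            weights.push(*weight);
            first_positions.push(first);
        }
    }
    (weights, first_positions)
}
```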
Preparations for expanded spans ranking, which allows us to rank documents based on term proximity. Overall, this change gives more control over the search and ranking procedure.
GraphStore is now a struct backed by something that implements a new Kv trait.
This makes it far easier to implement another backing key-value store, as all the caching
etc. from the graph database is abstracted away.
This change is introduced because we want to try RocksDB instead of Sled, as Sled caused huge memory consumption
when indexing 10_000 WARC files. It brought a server with 256 GB of RAM to its knees.
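The shape of this abstraction can be sketched as below. The method names on `Kv`, the `MemoryKv` backend and the edge encoding are assumptions for illustration only; the real trait and GraphStore are likely richer (caching, batching, and so on).

```rust
use std::collections::HashMap;

/// Hypothetical minimal key-value interface.
trait Kv {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>);
}

/// In-memory backend, used here only to keep the example self-contained.
#[derive(Default)]
struct MemoryKv {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl Kv for MemoryKv {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.map.insert(key, value);
    }
}

/// Graph store that only talks to its backend through the `Kv` trait.
struct GraphStore<S: Kv> {
    store: S,
}

impl<S: Kv> GraphStore<S> {
    fn new(store: S) -> Self {
        Self { store }
    }

    /// Appends `to` to the outgoing edges of `from`.
    /// Edges are stored as newline-separated node names (an assumption).
    fn add_edge(&mut self, from: &str, to: &str) {
        let mut edges = self.store.get(from.as_bytes()).unwrap_or_default();
        if !edges.is_empty() {
            edges.push(b'\n');
        }
        edges.extend_from_slice(to.as_bytes());
        self.store.insert(from.as_bytes().to_vec(), edges);
    }

    /// Returns the outgoing edges of `from`, or an empty list if unknown.
    fn outgoing(&self, from: &str) -> Vec<String> {
        let bytes = self.store.get(from.as_bytes()).unwrap_or_default();
        let text = String::from_utf8_lossy(&bytes);
        text.lines().map(String::from).collect()
    }
}
```

Swapping Sled for RocksDB then amounts to adding another `impl Kv for ...` block, while the GraphStore code stays unchanged.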
Generate better snippets by guessing which parts of the website contain viable content.
This also improves search accuracy as the BM25 score becomes way better (it is given cleaner data).
Performance could probably be immensely improved, but it seems to work.
There is no need to parse the entire HTML file into a tree, since a stream of tags is sufficient for extracting links+text.
This should be faster than parsing. It will also make it easier to implement the 'just_text' algorithm.
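As a rough illustration of the idea, the sketch below walks the input once, tag by tag, and collects text and link targets without building a tree. It is deliberately naive and not the tokenizer used in the commit: it ignores comments, scripts, entities and '>' inside attribute values, only handles double-quoted href attributes, and all function names are hypothetical.

```rust
/// Scans `html` once and returns (visible text, href values of <a> tags).
fn extract_links_and_text(html: &str) -> (String, Vec<String>) {
    let mut text = String::new();
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(open) = rest.find('<') {
        // Everything before the next '<' is text.
        push_text(&mut text, &rest[..open]);
        let after_open = &rest[open + 1..];
        match after_open.find('>') {
            Some(close) => {
                if let Some(href) = anchor_href(&after_open[..close]) {
                    links.push(href.to_string());
                }
                rest = &after_open[close + 1..];
            }
            // Unterminated tag: drop the remainder.
            None => rest = "",
        }
    }
    push_text(&mut text, rest);
    (text, links)
}

/// Appends the words of `chunk` to `out`, separated by single spaces.
fn push_text(out: &mut String, chunk: &str) {
    for word in chunk.split_whitespace() {
        if !out.is_empty() {
            out.push(' ');
        }
        out.push_str(word);
    }
}

/// Returns the href of a tag body such as `a href="x"`, if the tag is an
/// anchor with a double-quoted href attribute.
fn anchor_href(tag: &str) -> Option<&str> {
    let mut parts = tag.splitn(2, char::is_whitespace);
    if !parts.next()?.eq_ignore_ascii_case("a") {
        return None;
    }
    let attrs = parts.next()?;
    let start = attrs.find("href=\"")? + "href=\"".len();
    let end = attrs[start..].find('"')?;
    Some(&attrs[start..start + end])
}
```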
There seems to be a problem with large fast-fields, both in terms of crashes and performance.
We therefore now check whether a query-document match is navigational by indexing the host string
and boosting the field during search. In the future, we may want to add a boolean fast-field to
indicate if the given webpage is the homepage and only do navigational searches on those.
The commit also increases indexing resources in the hope that this reduces the number of merges
during indexing.
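The host-based navigational check described above can be approximated outside the index as follows. This is only a sketch of the scoring idea under simplifying assumptions: `HOST_BOOST`, `host_of` and `boosted_score` are hypothetical names, ports and userinfo in URLs are ignored, and the actual change boosts an indexed host field during search rather than post-processing scores.

```rust
/// Hypothetical boost added for every query term that matches a host label.
const HOST_BOOST: f64 = 3.0;

/// Extracts the host from a URL string,
/// e.g. "https://www.example.com/a" -> "www.example.com".
fn host_of(url: &str) -> &str {
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    let end = without_scheme
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(without_scheme.len());
    &without_scheme[..end]
}

/// Adds `HOST_BOOST` to `body_score` for each query term equal to a
/// dot-separated label of the host (case-insensitive).
fn boosted_score(query: &str, url: &str, body_score: f64) -> f64 {
    let host = host_of(url).to_lowercase();
    let matches = query
        .split_whitespace()
        .filter(|term| host.split('.').any(|label| label == term.to_lowercase()))
        .count();
    body_score + HOST_BOOST * matches as f64
}
```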