
Reverting to a4433be

Zepher Ashe 3 years ago
parent
commit
bec211ef5c
1 changed files with 146 additions and 196 deletions

+ 146 - 196
README.md

@@ -1,313 +1,263 @@
-[MonoBlockchain](https://github.com/safesploit/MonoBlockchain)
-
-# MonoBlockchain
-
-MonoBlockchain is a decentralised proof of work cryptocurrency written in Python3. 
-
-Individually referred to as an _MB Coin_ (Mono Block Coin).
-
+# Doogle
+Doogle is a search engine and web crawler. It crawls websites, indexes their pages and images along with keywords, and makes them searchable later.
 
+Written primarily in object-oriented PHP, with the aim of better understanding OOP and how web crawlers work.
 
 <p align="center">
-    <img alt="MonoBlockchain-Preview" src="#">
+  <img width="527" alt="DoogleHomepage-Preview" src="https://user-images.githubusercontent.com/10171446/165316199-b0fe279c-cb11-4a36-84b8-53a514ac488a.png">
 </p>
 
 # Features
 
-- Immutable ledger
-- Distributed P2P network
-- Proof of Work
-  - SHA256
-- Consensus protocol
-- API
-- Easy to use transaction system
+- Search sites
+  - Displays title, URL and description
+- Search images
+  - Hover over images to preview description (alt tag)
+  - Masonry layout for searched images
+  - Image preview using Fancybox
+  - Image search page responds dynamically
+- Clean homepage
+- Filters broken image results
+- Organises search results by clicks/visits
+- Pagination system at the bottom of the search page
+- Shows 'results found' for search term
+
+# Table of Contents 
 
-# Table of Contents
 - [Setup and Usage](#setup-and-usage)
-  - [Python Version](#python-version)
-  - [Dependencies](#dependencies)
-  - [Initialising Servers](#initialising-servers)
-- [Blockchain Concepts](#blockchain-concepts)
-  - [Proof of Work](#proof-of-work)
-  - [Hashing Algorithm](#hashing-algorithm)
-  - [Immutable Ledger](#immutable-ledger)
-  - [Distributed P2P](#distributed-p2p)
-  - [Mining](#mining)
-  - [Consensus Protocol](#consensus-protocol)
+  - [Server Setup](#server-setup)
+  - [Connecting PHP to MySQL Server](#connecting-php-to-mysql-server)
+  - [Crawling Websites to Populate Images and Sites tables](#crawling-websites-to-populate-images-and-sites-tables)
 - [Programming Logic](#programming-logic)
-  - [How Mining Works (Technical)](#how-mining-works-technical)
-  - [User Interface](#user-interface)
+  - [Pagination](#pagination)
+  - [Image Search](#image-search)
+  - [Site Search - Trimming Results](#site-search---trimming-results)
+  - [Telemetry](#telemetry)
+  - [User-Agent](#user-agent)
 - [Preview Images](#preview-images)
+  - [Doogle Homepage](#doogle-homepage)
+  - [Doogle Search - Sites](#doogle-search---sites)
+  - [Doogle Search - Images](#doogle-search---images)
+  - [Pagination System](#pagination-system)
+  - [doogleBot Crawl Form](#dooglebot-crawl-form)
 - [Preview Video](#preview-video)
 
-[ToC Markdown Generator](https://toc.git.safesploit.com)
-
 # Setup and Usage
 
-## Python Version
+## Server Setup
 
-    $ python3.10
-    Python 3.10.5
+Please refer to [XAMPP](https://www.apachefriends.org/index.html) for configuring the web server, PHP and the MySQL server.
+XAMPP is the simplest option, as Doogle requires several servers.
 
+[MySQL Setup on XAMPP](https://www.rose-hulman.edu/class/se/csse290-WebProgramming/201520/SupportCode/SQL-setup.html) walks through setting up the database using phpMyAdmin as a GUI.
 
-## Dependencies
+Once logged in to phpMyAdmin, open the **phpMyAdmin > SQL** tab and paste the contents of 'doogle-tables-no-data.sql' into the query field:
 
-    pip3.10 install Crypto \
-                    Flask \
+<img width="960" alt="Image1-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165310962-7ec771d2-50a0-4117-87f8-60373f694e55.png">
 
+### SQL User Creation
 
+Replace _PASSWORD_HERE_ with a strong [random password](https://passwordsgenerator.net/).
 
-## Initialising Servers
+    mysql> CREATE USER IF NOT EXISTS 'doogle'@'localhost' IDENTIFIED BY 'PASSWORD_HERE';
 
-Initialisation can be done in two-ways. One is via PyCharm the other is running multiple Python instances via the command-line.
+### SQL User Permissions
 
-### PyCharm
+The SQL user 'doogle' must have SELECT, INSERT and UPDATE privileges:
 
-#### Node1 (Server)
+    mysql> GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'localhost';
+    
+  - INSERT is required for crawling
+  - SELECT is required for the search engine to return results
+  - UPDATE is required to amend the clicks and broken results (see ./ajax/)
 
-Within PyCharm under _Run/Debug Configurations > Add New Configuration > Python_ the following Configuration should be chosen:
+## Connecting PHP to MySQL Server
 
-  - Name: Node1
-  - Script path: ~/PycharmProjects/MonoBlockchain/blockchain/blockchain.py
-  - Parameters: _Leave this blank
-  - Python interpreter: Python 3.10
+In config.php, enter the following values to match your database configuration; `$dbpass` must be the password set under SQL User Creation:
 
-  <p align="center">
-  <img width="1348" alt="PyCharm Configuration Node1" src="https://user-images.githubusercontent.com/10171446/174770301-9f283c31-851c-4e78-bf2a-b0a04b3527a9.png">
-  </br>
-  <b>PyCharm Configuration Node1</b>
-</p>
+    $dbname = "doogle";
+    $dbhost = "localhost";
+    $dbuser = "doogle";
+    $dbpass = "";
 
-#### Alice (Client)
+The script 'doogle-tables-no-data.sql' creates the database under the name 'doogle', which is why `$dbname` is set to "doogle".
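+
+As a minimal sketch of how these values might be used, the following opens a PDO connection to MySQL. The `$con` variable name, the `charset` option and the error handling are assumptions for illustration, not necessarily what config.php does:
+
+```php
+<?php
+// Values as entered in config.php above.
+$dbname = "doogle";
+$dbhost = "localhost";
+$dbuser = "doogle";
+$dbpass = "PASSWORD_HERE"; // the password chosen under "SQL User Creation"
+
+try {
+    // $con is a hypothetical name for the shared connection object.
+    $con = new PDO("mysql:dbname=$dbname;host=$dbhost;charset=utf8mb4", $dbuser, $dbpass);
+    // Raise exceptions on SQL errors instead of failing silently.
+    $con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
+} catch (PDOException $e) {
+    echo "Connection failed: " . $e->getMessage();
+}
+```
+
+If the connection fails, check that the MySQL server is running and that the user and password match those created earlier.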
 
-  - Name: Alice
-  - Script path: ~/PycharmProjects/MonoBlockchain/blockchain_client/blockchain_client.py
-  - Parameters: -p 8081
-  - Python interpreter: Python 3.10
+## Crawling Websites to Populate Images and Sites tables
 
-<p align="center">
-  <img width="1196" alt="PyCharm Configuration Alice" src="https://user-images.githubusercontent.com/10171446/174770596-5742d253-496b-4357-a4a0-698de77659f8.png">
-  </br>
-  <b>PyCharm Configuration Alice</b>
-</p>
+### Form-based crawl
 
-#### Bob (Client)
+In your browser, go to where the file is hosted, for example http://localhost/crawl-formSubmit.php
 
-  - Name: Bob
-  - Script path: ~/PycharmProjects/MonoBlockchain/blockchain_client/blockchain_client.py
-  - Parameters: -p 8082
-  - Python interpreter: Python 3.10
+Paste the URL of the website to crawl into the input field and press the Crawl button.
 
-<p align="center">
-  <img width="1196" alt="PyCharm Configuration Bob" src="https://user-images.githubusercontent.com/10171446/174770973-8e59cb41-9e6b-4e30-a6f0-40350dd86935.png">
-  </br>
-  <b>PyCharm Configuration Bob</b>
-</p>
+### Manual crawl
 
-#### Running the Configuration
-
-<p align="center">
-  <img width="1326" alt="PyCharm Configurations" src="https://user-images.githubusercontent.com/10171446/174771586-566e19c6-fb7c-4ddb-8dd8-31a17f7ac28e.png">
-
-  </br>
-  <b></b>
-</p>
-
-PyCharm Configurations are outlined in greater detail [here](https://www.jetbrains.com/help/pycharm/configuring-python-interpreter.html#packages).
+At the bottom of crawl-manual.php, set the variable $startUrl to the URL of the website to be crawled:
 
+    $startUrl = "https://thehackernews.com/";
+  
+Then, in your browser, go to where the file is hosted, for example http://localhost/crawl-manual.php
 
-### Command-Line
+### Explanation
 
-CMD.exe / Terminal / Shell
+Crawling takes some time; how long depends entirely on the size of the website being crawled.
+The page will continue to load (without output) until the `crawl.php` script finishes.
 
-#### Node1 (server)
+Check the tables `images` and `sites` in the database to ensure they are being populated.
 
-    $ python3 ~/MonoBlockchain/blockchain/blockchain.py
+<img width="960" alt="Image2-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165312292-c2830b80-365d-4a39-b176-8226bd0d7f65.png">
 
 
-#### Alice (client)
+Once the tables are populated, visit the Doogle homepage and search!
+See [Preview Images](#preview-images).
 
-    $ python3 ~/MonoBlockchain/blockchain-client/blockchain-client.py -p 8081
+# Programming Logic
 
-#### Bob (client)
+## Pagination
 
-    $ python3 ~/MonoBlockchain/blockchain-client/blockchain-client.py -p 8082
+### Logic of pagination system
+Pagination is implemented inside search.php.
 
+<img width="261" alt="image demonstrating pagination" src="https://user-images.githubusercontent.com/10171446/165146284-cf5362c0-bfe1-4489-b68e-5f7363d243dd.png">
 
-# Blockchain Concepts
+In the example above, currentPage=11. 
+The number of pages to show is always 10.
 
-## Proof of Work
+### Results Per Page
 
-MonoBlockchain is based on Proof of Work (PoW), _explanation_.
+Site search will return 20 results per page and image search will return 30 results per page.
 
-## Hashing Algorithm
+The results per page can be changed inside search.php on lines 83 and 88 respectively, as indicated by the $pageSize variables:
 
-PoW relies on SHA256 due to requiring:
-  - One-way function
-    - [The avalanche effect](https://www.cryptovision.com/en/glossary/avalanche-effect/#:~:text=The%20Avalanche%20Effect%20refers%20to,show%20a%20strong%20avalanche%20effect.)
-    - [Deterministic](https://www.sqlite.org/deterministic.html#:~:text=A%20deterministic%20function%20always%20gives,input%20X%20is%20the%20same.)
-  - Fast computation
-  - Must withstand collisions (SHA-256 has 2<sup>256</sup> combinations)
+<img width="455" alt="Search-resultsPerPage" src="https://user-images.githubusercontent.com/10171446/165478400-f11c1be4-2c83-4559-8ccb-cba4550a64bd.png">
 
-## Immutable Ledger
 
-The idea of an immutable ledger is to ensure the previous hash is linked cryptographically to the last block. Which then can be traversed back to the genesis (initial) block.
+### Handling an edge case
 
-<p align="center">
-  <img width="640" alt="Immutable Ledger Example" src="https://user-images.githubusercontent.com/10171446/172579310-c11ca268-f185-4560-8b89-5388aa17dabb.png">
-  </br>
-  <b>Immputable Ledger</b>
-</p>
+An edge case occurs when the pagination would extend past the last available page.
 
-If _block 2_ were to be maliciously altered, the previous hash on _block 3_ would reflect this alteration. As the hash of _block 2_ would not be equal to the previous hash of _block 3_.
+For example, 331 results at 20 per page gives **17 pages**. Without handling this edge case, the pagination UI would allow scrolling through pages which don't exist, which would return an empty result.
 
-## Distributed P2P
+To handle this edge case, the following logic is implemented before and in the condition of the while-loop:
 
-### Explanation of Distributed P2P
+    if($currentPage + $pagesLeft > $numPages + 1)
+        $currentPage = $numPages + 1 - $pagesLeft;
 
-Distributed peer-to-peer (P2P) network ensures the network hosting the blockchain ledger is not centralised located. 
+    while($pagesLeft != 0 && $currentPage <= $numPages) 
+    { ... }
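+
+To make the clamp above concrete, here is a self-contained sketch of the whole window calculation. Only the `if` clamp and the loop condition come from the snippet above; the function name `visiblePages()`, its parameters and the centring rule are assumptions made for illustration:
+
+```php
+<?php
+// Hypothetical helper: returns the page numbers the pagination bar would show.
+function visiblePages(int $numResults, int $pageSize, int $currentPage, int $pagesToDisplay = 10): array
+{
+    $numPages = (int) ceil($numResults / $pageSize); // 331 results at 20 per page = 17 pages
+    $pagesLeft = min($pagesToDisplay, $numPages);
+
+    // Assumed centring rule: start the window a few pages before the current one.
+    $currentPage = $currentPage - (int) floor($pagesToDisplay / 2);
+    if ($currentPage < 1) {
+        $currentPage = 1;
+    }
+
+    // Edge case from the snippet above: never run past the last page.
+    if ($currentPage + $pagesLeft > $numPages + 1) {
+        $currentPage = $numPages + 1 - $pagesLeft;
+    }
+
+    $pages = [];
+    while ($pagesLeft != 0 && $currentPage <= $numPages) {
+        $pages[] = $currentPage;
+        $currentPage++;
+        $pagesLeft--;
+    }
+    return $pages;
+}
+
+print_r(visiblePages(331, 20, 16));
+```
+
+With 331 results, 20 per page and page 16 selected, the sketch returns pages 8 through 17, so the bar never offers a page beyond the last one.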
+    
+    
+## Image Search
 
-<p align="center">
-  <img width="575" alt="image" src="https://user-images.githubusercontent.com/10171446/172582471-6d101052-4e95-4482-b3f8-6c6bd120bf1e.png">
-  </br>
-  <b>Distributed P2P Network: Showing Computers (Servers) as Nodes</b>
-</p>
+### Image Captions
 
-Having a decentralised network provides several benefits:
-  - More nodes in the network
-  - Potentially faster as not relying on a single node
-  - More secure as the ledger has no single point of failure
+To make image searches more informative, the image's 'alt' text is matched against the search term, as shown in ./classes/ImageResultsProvider.php line 34:
 
+<img width="419" alt="ImageResultsProvider-query" src="https://user-images.githubusercontent.com/10171446/165472615-fd149596-3a39-4e48-8308-bd4f1ed16968.png">
 
-<p align="center">
-  <img width="681" alt="Distributed P2P Network Showing Blocks" src="https://user-images.githubusercontent.com/10171446/172584273-9f9cdf41-b5d2-4727-b232-eebe8802473c.png">
-  </br>
-  <b>Distributed P2P Network Showing Blocks</b>
-</p>
 
-### Attacking Distributed P2P Network
+### Loading Images with JavaScript
+In the 'images' table, there is a column 'broken' which tracks images which return an error.
 
-Because of the immutable ledger, the attack must modify earlier blocks to reflect the hash change. Which requires a great deal of processing power as the SHA-256 algorithm is computationally demanding.
+With a purely server-side solution the images would already be rendered before a broken one could be detected, so AJAX is used to load images dynamically, as shown in ./assets/js/script.js:
 
-The attack vector of computing forged blocks is demonstrated below:
 
-<p align="center">
-  <img width="682" alt="Distributed P2P Network Being Attacked" src="https://user-images.githubusercontent.com/10171446/172584901-121923b0-2890-41ad-8f0f-d78d8d447461.png">
-  </br>
-  <b>Distributed P2P Network Being Attacked</b>
-</p>
+<img width="319" alt="script js-loadImage-broken" src="https://user-images.githubusercontent.com/10171446/165471191-6119b5cf-dc77-49a4-b84d-12276232813a.png">
 
-However, for the example above, seven nodes maintain an independent version of the ledger. Moreover, while the attacker has successfully modify blocks and forged the ledger cryptographically to reflect an action which did not take place. However, the attack did this for a single node. Hence, the attacker only makes up 14% of the distributed P2P network. For the attacker to successfully perform their attack, they must control 51% or more of the network's nodes. 
 
-See, [51% attack](https://www.investopedia.com/terms/1/51-attack.asp#:~:text=A%2051%25%20attack%20is%20an,other%20miners%20from%20completing%20blocks.).
 
-## Mining
 
-Mining is the competitive process that verifies and adds new transactions to the blockchain for a cryptocurrency that uses the proof of work (PoW) method. The miner that wins the competition is rewarded with some amount of the currency and/or transaction fees - [source](https://www.pcmag.com/encyclopedia/term/crypto-mining#:~:text=(CRYPTOcurrency%20mining)%20The%20competitive%20process,currency%20and%2For%20transaction%20fees.).
+### Masonry
+Image search results are laid out using [Masonry - Cascading grid layout library](https://masonry.desandro.com/).
 
-Consider further reading [PoW - Wikipedia](https://en.wikipedia.org/wiki/Proof_of_work).
+Masonry arranges images in a grid layout, which is responsive thanks to jQuery.
+The image below shows an example layout:
 
-The main point with mining is _hard to solve, easy to verify_.
+<img width="428" alt="Masonry-item-layout" src="https://user-images.githubusercontent.com/10171446/165469864-97c2bec4-2af7-4987-917f-02885d407ba9.png">
 
 
-### Explanation of How Mining Works (Abstract)
-[Bitcoin Mining in 4 Minutes - Computerphile](https://www.youtube.com/watch?v=wTC31ZI6QM4) will give a ver clear outline of Bitcoin mining. MonoBlockchain is based on the same concept and the PoW hashing algorithm is also SHA-256. So, there is little difference from a mining perspective between Bitcoin and MonoBlockchain.
 
-[Nonce](https://en.wikipedia.org/wiki/Cryptographic_nonce) (number once) is an arbitrary number that can be used just once in a cryptographic communication. The nonce is used to ensure old communications cannot be reused.
+## Site Search - Trimming Results
 
-<p align="center">
-  <img width="464" alt="Abstract Overview of a Block" src="https://user-images.githubusercontent.com/10171446/172814804-7b06b2ad-6641-44d7-9034-cdc395bd8867.png">
-  </br>
-  <b>Abstract Overview of a Block</b>
-</p>
+As shown in the preview images, a site search returns the title, URL and description for each result.
 
+However, to make some results easier to read, a trimming process is performed. Inside ./classes/SiteResultsProvider.php the function trimField() is called:
 
+<img width="380" alt="SiteResultsProvider-trim1" src="https://user-images.githubusercontent.com/10171446/165468731-9176be82-c3ed-4bf4-bcbb-bf5dd838398b.png">
 
-### Why is Mining Necessary?
+<img width="374" alt="SiteResultsProvider-trim2" src="https://user-images.githubusercontent.com/10171446/165468845-5e382320-71ce-4b6a-988b-8d4ddf3f341a.png">
 
-The short answer is to prevent abuse of the network. Requiring computational power to _prove work_ is a logical method for mitigating DoS attacks while still keeping the usage of the network feasible.
+Titles are trimmed at 55 characters and descriptions at 230 characters.
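+
+A trimming function along these lines could look like the sketch below. This is a hypothetical stand-in rather than the code from SiteResultsProvider.php; the parameter names and the trailing dots are assumptions, while the 55 and 230 character limits are the ones stated above:
+
+```php
+<?php
+// Hypothetical stand-in for trimField(): cut a string at a character limit.
+function trimField(string $string, int $characterLimit): string
+{
+    // Append dots only when the text is actually shortened.
+    $dots = strlen($string) > $characterLimit ? "..." : "";
+    return substr($string, 0, $characterLimit) . $dots;
+}
+
+$title = trimField(str_repeat("a", 60), 55);        // shortened: 55 characters plus dots
+$description = trimField("Short description", 230); // unchanged: already under the limit
+
+echo $title . "\n" . $description . "\n";
+```
+
+Note that `strlen()` and `substr()` count bytes; for multi-byte text, `mb_strlen()` and `mb_substr()` from the mbstring extension would be the safer choice.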
 
 
-<p align="center">
-  
-  </br>
-  <b></b>
-</p>
+## Telemetry
 
+Both the 'images' and 'sites' tables in the database have a 'clicks' column, holding a count for each row.
 
-<p align="center">
-  
-  </br>
-  <b>Mining</b>
-</p>
+The 'clicks' field is incremented each time a site is visited or an image is previewed.
 
-## Consensus Protocol
+Search results are returned in descending order of clicks.
+This behaviour is set by the $query inside the getResultsHtml() function of ./classes/SiteResultsProvider.php (see line 43):
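+
+The idea can be sketched end to end as follows. This is illustrative only: an in-memory SQLite table (which needs the pdo_sqlite extension) stands in for the MySQL 'sites' table, and the column names other than 'clicks', the sample rows and the `LIKE` matching are assumptions rather than the actual query in SiteResultsProvider.php:
+
+```php
+<?php
+// In-memory stand-in for the 'sites' table, so the sketch runs without MySQL.
+$con = new PDO("sqlite::memory:");
+$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
+$con->exec("CREATE TABLE sites (id INTEGER PRIMARY KEY, url TEXT, title TEXT, clicks INTEGER DEFAULT 0)");
+$con->exec("INSERT INTO sites (url, title, clicks) VALUES
+    ('https://example.com/a', 'Security news A', 2),
+    ('https://example.com/b', 'Security news B', 9)");
+
+// Record a visit: increase the click count of the visited result.
+$update = $con->prepare("UPDATE sites SET clicks = clicks + 1 WHERE id = :id");
+$update->execute([":id" => 1]);
+
+// Return matching results, most-clicked first.
+$query = $con->prepare("SELECT title, url, clicks FROM sites WHERE title LIKE :term ORDER BY clicks DESC");
+$query->execute([":term" => "%security%"]);
+
+foreach ($query->fetchAll(PDO::FETCH_ASSOC) as $row) {
+    echo $row["title"] . " (" . $row["url"] . "), clicks: " . $row["clicks"] . "\n";
+}
+```
+
+Ordering by clicks means frequently visited results rise to the top over time, which is also why the 'doogle' SQL user needs the UPDATE privilege.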
 
-The purpose of a consensus protocol is to achieve consensus between participants as to what a blockchain should contain at a given time (including new blocks).
+<img width="443" alt="SiteResultsProvider-getResultsHtml" src="https://user-images.githubusercontent.com/10171446/165467418-37de4f8c-1901-4911-a7c9-33b42806f0bb.png">
 
-### First Challenege
 
-Referring back to _Attacking Distributed P2P Network_, we saw the attacker achieved a tampered version of the blockchain, but only contributed 14% of the entire distributed network. The other 86% naturally outnumbered the malicious blockchain node. 
+## User-Agent
 
+The user-agent sent while crawling is defined inside ./classes/DomDocumentParser.php, on line 9:
 
-### Second Challenege
+<img width="481" alt="DomDocumentParser-bot" src="https://user-images.githubusercontent.com/10171446/165465964-2bba0582-2846-44f1-abd1-b51ac316b186.png">
 
-Because each node in the distributed P2P network will mine the next block independently, a problem arises; overlapping nodes during synchronisation.
 
-<p align="center">
-  <img width="661" alt="Second Challenege to Overcome" src="https://user-images.githubusercontent.com/10171446/173308045-ce98925e-ed15-4c2d-a586-6266bbd8b0fb.png">
-  </br>
-  <b>Consensus Protocol - Second Challenege: Overlapping Nodes</b>
-</p>
+# Preview Images
+## Doogle Homepage
 
-The consensus to avoid overlapping nodes is to wait for a new block to be mined before synchronising with other nodes in the network. Once a node has mined a new block, the other nodes will be asked to add it to their blockchain.
+<img width="701" alt="Image3-DoogleHomepage-Edge" src="https://user-images.githubusercontent.com/10171446/165313393-fcfdb9fc-1b19-4c8f-ac08-b96ff393ab63.png">
 
-Essentially, _the longest chain wins_ is an consensus between nodes.
+## Doogle Search - Sites
 
-#### Orphan Block
+<img width="701" alt="Image4-DoogleSearch-PoC" src="https://user-images.githubusercontent.com/10171446/165313470-02c30d0a-e7e6-4fcf-8c09-6be9e633fc0f.png">
 
-Another issue to be considered as a user of a blockchain PoW network are orphan blocks.
+## Doogle Search - Images
 
-An [orphan block](https://www.investopedia.com/terms/o/orphan-block-cryptocurrency.asp#:~:text=An%20orphan%20block%20is%20a,the%20shorter%20chain%20are%20orphaned.) is a block that has been solved within the blockchain network but was not accepted by the network.
+<img width="882" alt="Image5-DoogleSearch-PoC-images" src="https://user-images.githubusercontent.com/10171446/165313548-686a79e3-5b1d-4e9e-a3d7-ab7775a9b171.png">
 
-An orphan block can occur because, _there can be two miners who solve valid blocks simultaneously. The network uses both blocks until one chain has more verified blocks than the other. Then, the blocks in the shorter chain are orphaned._
+### Image Preview
 
-Ideally, to avoid falling victim to an orphan block, users would be adviced to wait 4-6 blocks after their transaction was verified, before considering the transaction as 'full' verified.
+Image preview is done using Fancybox.
 
+The title, image URL and site URL are shown in the bottom-left corner.
 
-<p align="center">
-  
-  </br>
-  <b></b>
-</p>
+<img width="883" alt="Image9-DoogleSearch-imagePreview" src="https://user-images.githubusercontent.com/10171446/165315386-8bc4a25e-0a9f-4622-82b8-d733bc343a3b.png">
 
-# Programming Logic
 
-## How Mining Works (Technical)
 
-## User Interface
+## Pagination System
 
-To reduce development time with the user interface (UI), I have opted for using HTML with frameworks.
+Naturally, certain search terms, such as 'bbc', may return many results.
 
-### HTML
+Doogle displays only **20 sites** per page.
+At the bottom of the page, links to the next 10 pages are available.
 
+### Results Shown
 
-### JavaScript
+<img width="883" alt="Image6-DoogleSearch-pagination-ResultsShown" src="https://user-images.githubusercontent.com/10171446/165314211-5daf2903-5ecc-44ad-942a-2270a361dec5.png">
 
-### Frameworks
+### Bottom of Page
 
-The following frameworks are used:
+<img width="883" alt="Image7-DoogleSearch-pagination-Bottom" src="https://user-images.githubusercontent.com/10171446/165314516-d00bf38a-6fef-467c-9182-88d0d6ce07d2.png">
 
-  - Bootstrap v4.0.0 
-  - DataTables 1.10.16
-  - Font Awesome 4.7.0 
-    - Font Awesome webfont
-  - jQuery JavaScript Library v3.3.1
+### Bottom of Page 13
 
-### CDNs and SRI Hashes
+<img width="883" alt="Image8-DoogleSearch-pagination-scrollingThrough" src="https://user-images.githubusercontent.com/10171446/165314716-08834b0c-4ba0-4e90-b466-58a57e91bf69.png">
 
+## doogleBot Crawl Form
 
+An HTML form to submit a URL for crawling.
 
-# Preview Images
+<img width="581" alt="Image10-doogleBot-Crawler-formpng" src="https://user-images.githubusercontent.com/10171446/165463270-d36f7b78-379c-46da-b859-f5dde9304668.png">
 
 # Preview Video
+
+[Doogle Search demo - YouTube](https://youtu.be/clDt4Sg7ako)