Compare commits: main..v1.0.0-beta.1

No commits in common. "main" and "v1.0.0-beta.1" have entirely different histories.

8 changed files with 86 additions and 342 deletions

2
.gitignore vendored

@@ -1,2 +0,0 @@
config.php

View file

@@ -1,5 +1,5 @@
# Doogle
Doogle is a search engine and web crawler which can search indexed websites and images, and then use keywords to be searched later.
Doogle is a search engine and web crawler which can search indexed websites and image, and then using keywords be searched later.
Written primarily in OOP style PHP with the intent of better understanding OOP and how web crawlers work.
@@ -13,7 +13,7 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
* Displays title, URL and description
- Search images
* Hover over images to preview description (alt tag)
* Masonry layout for searched images
* Masonary layout for searched images
* Image preview using Fancybox
* Image search page responds dynamically
- Clean homepage
@@ -21,14 +21,11 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
- Organises search results by clicks/visits
- Pagination system at the bottom of the search page
- Shows 'results found' for search term
- Supports non-latin characters (UTF-8)
# Table of Contents
- [Setup and Usage](#setup-and-usage)
- [Docker](#docker)
- [Server Setup](#server-setup)
- [PHP Dependencies](#php-dependencies)
- [Connecting PHP to MySQL Server](#connecting-php-to-mysql-server)
- [Crawling Websites to Populate Images and Sites tables](#crawling-websites-to-populate-images-and-sites-tables)
- [Programming Logic](#programming-logic)
@@ -47,34 +44,8 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
# Setup and Usage
Two methods of setup are discussed.
- Docker (Easiest)
- Server Setup
## Docker
Docker configuration files are available at [doogle-docker](https://github.com/safesploit/doogle-docker).
Presuming you already have [Docker](https://www.docker.com/) v3.9 (or greater) installed and configured.
git clone https://github.com/safesploit/doogle-docker.git
cd doogle-docker
sh build.sh
<p align="center">
<img width="857" alt="Screenshot 2023-02-22 at 21 11 33" src="https://user-images.githubusercontent.com/10171446/220760089-71baee5a-19ce-43e6-9cd5-35ce9e143400.png">
<img width="857" alt="image" src="https://user-images.githubusercontent.com/10171446/220760298-65e0b64e-3724-4e8e-b9ec-a86ba20d58c8.png">
Doogle is now accessible via [localhost:8000](http://localhost:8000).
For debugging phpMyAdmin has also been included on [localhost:8001](http://localhost:8001).
</p>
## Server Setup
v1.0.0-beta.1 is supported and tested in PHP 7.4, 8.0 and 8.1.
Please refer to [XAMPP](https://www.apachefriends.org/index.html) for the web server, PHP server and MySQL server configuration.
XAMPP is the simplest method as several servers are required to use Doogle.
@@ -84,44 +55,23 @@ Once logged into the database via PHPMyAdmin under the **PHPMyAdmin > SQL** tab,
<img width="960" alt="Image1-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165310962-7ec771d2-50a0-4117-87f8-60373f694e55.png">
## PHP Dependencies
mysql
pdo_mysql
### SQL User Creation
Amend the password _PASSWORD_HERE_ using a strong [random password](https://passwordsgenerator.net/).
mysql> CREATE USER IF NOT EXISTS 'doogle'@'localhost' IDENTIFIED BY 'PASSWORD_HERE';
### SQL User Permissions
The SQL user 'doogle' must have SELECT, INSERT and UPDATE privileges:
mysql> GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'localhost';
- INSERT is used for crawling
- SELECT is required for the search engine to return queries
- UPDATE is required to amend the clicks and broken results (see ./ajax/)
## Connecting PHP to MySQL Server
In the file config.php the following must be entered correctly for your database configuration:
$dbname = "doogle";
$dbhost = "localhost";
$dbuser = "doogle";
$dbhost = "127.0.0.1";
$dbuser = "root";
$dbpass = "";
In the file 'doogle-tables-no-data.sql' the database will be created as 'doogle'.
In the file 'doogle-tables-no-data.sql' the database will be created as 'doogle', but the remaining parameters must still be filled.
## Crawling Websites to Populate Images and Sites tables
### Form-based crawl
In your browser go to where the file is hosted http://localhost/crawl.php
In your browser go to where the file is hosted http://127.0.0.1/crawl-formSubmit.php
Paste the URL into the input field and press the Crawl button.
@@ -131,14 +81,14 @@ At the bottom of crawl-manual.php the variable $startUrl is where to paste the U
$startUrl = "https://thehackernews.com/";
Then in your browser go to where the file is hosted http://localhost/crawl-manual.php
Then in your browser go to where the file is hosted http://127.0.0.1/crawl-manual.php
### Explanation
### Explination
The crawling process will take some time, it will completely depend on the size of the website being crawled.
The page will continue to load (without output) until the `crawl.php` script finishes.
The page will continue to load (without output) until the crawl.php script finishes.
Check the tables `images` and `sites` in the database to ensure they are being populated.
Check the tables 'images' and 'sites' in the database to ensure they are being populated.
<img width="960" alt="Image2-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165312292-c2830b80-365d-4a39-b176-8226bd0d7f65.png">
@@ -192,7 +142,7 @@ To make image searches more informative, the 'alt' tag is part of the search ter
### Loading Images with JavaScript
In the 'images' table, there is a row 'broken' which tracks images which return an error.
In the 'images' table there is a row 'broken' which tracks images which return an error.
Because images are already loaded with a pure server-side solution, AJAX must be leveraged, loading images dynamically. Which is shown in ./assets/js/script.js
@@ -231,7 +181,7 @@ Both the 'images' and 'sites' tables in the database have a row containing 'clic
The 'clicks' field is increased each time a site is visited or image is previewed.
When performing a search, results returned are organised in descending order of clicks.
When performing a search, results returned are organised in decending order of clicks.
This behaviour is shown by the $query inside ./classes/SiteResultsProvider.php function getResultsHtml(). See line 43.
<img width="443" alt="SiteResultsProvider-getResultsHtml" src="https://user-images.githubusercontent.com/10171446/165467418-37de4f8c-1901-4911-a7c9-33b42806f0bb.png">
@@ -270,7 +220,7 @@ The title, image URL and site URL are available on the bottom left corner.
## Pagination System
Naturally, certain search terms may return many results like 'bbc'.
Naturally certain search terms may return many results like 'bbc'.
To which Doogle only displays **20 sites** per page.
At the bottom of the page, we can view the next 10 pages.
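The page size and page-link count described above reduce to simple arithmetic. The following is a minimal sketch only: `pageOffset()` and `pageWindow()` are hypothetical helper names that do not come from Doogle's source, and centring the link window on the current page is an assumption.

```php
<?php
// Hypothetical helpers; names and window placement are illustrative assumptions.

const PAGE_SIZE = 20;      // results per page, as stated in the README
const PAGES_TO_SHOW = 10;  // page links shown at the bottom, as stated in the README

// Row offset for a 1-based page number, e.g. for a "LIMIT offset, size" clause.
function pageOffset(int $page, int $pageSize = PAGE_SIZE): int
{
    return (max(1, $page) - 1) * $pageSize;
}

// Page numbers to render in the pagination bar for a given result count.
function pageWindow(int $currentPage, int $totalResults, int $pageSize = PAGE_SIZE, int $pagesToShow = PAGES_TO_SHOW): array
{
    $totalPages = max(1, (int) ceil($totalResults / $pageSize));
    $currentPage = min(max(1, $currentPage), $totalPages);
    // Assumption: keep the current page roughly in the middle of the window.
    $first = max(1, $currentPage - intdiv($pagesToShow, 2));
    $last = min($totalPages, $first + $pagesToShow - 1);
    return range($first, $last);
}
```

With these defaults, `pageOffset(3)` evaluates to 40, and a search returning 45 results would yield three page links.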

49
SHA256SUMS Normal file

@@ -0,0 +1,49 @@
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
59492c770b524a1c583598969f410864b42cbe50dba1567401d42925a21fdc3b ./ajax/setBroken.php
42b0a956fbb7e7c4d258b76f77a40bcc3f424e92b8df4e48029b6ac36814ce13 ./ajax/updateImageCount.php
93c09e73205ae4cb9298787e456e7ba40a3a59913d4c7f7fa8231f3c0d57fb7e ./ajax/updateLinkCount.php
f6f3d53dd2240261f157695adf386a5c08014298c19f62ccf63cd162996892d0 ./assets/css/fancybox/3.3.5/jquery.fancybox.min.css
d5acffdfb41dc1e41a2132bff2e67b0e1632f10cba48321abc5c33e09fa0d076 ./assets/css/style.css
f72f0630f8cbe6b5a24094960e099eaf1f7be18af93c2f74f516f13b1e6212a0 ./assets/images/doogleLogo.png
11e84e77b4b2ca65fdc45407601edd77acef60994d8dfb670fe6db6a6672dbb9 ./assets/images/favicon/android-chrome-192x192.png
81506b473fe8709ea34e9974ede53b7bed210bbbd5b8f3541e9cc9fac2cd1fd8 ./assets/images/favicon/android-chrome-512x512.png
30b13061a191ddd4eb62107fa20bc714db91c5b39d39e64e843ce5a118a13bf1 ./assets/images/favicon/apple-touch-icon.png
a734688685e1d3eeb7c5b15267f31e7961aff394f3f68fc389a256b45e42970e ./assets/images/favicon/favicon-16x16.png
018136439b52bc1db7c84311f20435cfce95b5e191f5d8f36b27ce7eb5bc6064 ./assets/images/favicon/favicon-32x32.png
09925a497bbd72a7850434f205f31d9ab8cfa0e4f727718731595314ac89d482 ./assets/images/favicon/favicon.ico
bcb764f2e87fefd1f9c39cf0d3517ad4cbea2008cd925380aee23d9832e1fc2c ./assets/images/icons/search.png
88761e31eae97360d4dbdeedff92c4d151ec33492e9f1cdb34eb802762a9c125 ./assets/images/page.png
4ea49f1436476f370da62494ae780cfc99d4cbdd5cfab48082be4ce2274ecc07 ./assets/images/pageEnd.png
e771b2c0a69e5695ad7ff1a8bd7071fc6d46e6f9e3024acad6f743634d9c2e6d ./assets/images/pageSelected.png
61a1b5647ecafd1ae2fdd513262a394e16222b590bb82cc2a442f128cc6d4e52 ./assets/images/pageStart.png
4dbe2075e08dfc008a9a1290dc149f6ee360215610cc1944bdb625c0aee3b83c ./assets/js/fancybox/3.3.5/jquery.fancybox.min.js
160a426ff2894252cd7cebbdd6d6b7da8fcd319c65b70468f10b6690c45d02ef ./assets/js/jquery-3.3.1.min.js
367d6afdfc741fb48d2d9310e47c3924b693459a74c882c0fc545ec5ed7d55d2 ./assets/js/masonry/4.2.2/masonry.pkgd.min.js
19ada944019a8ef415a633317cc3d0924d5a0f2d91fe47ba1546e8844d7b308f ./assets/js/script.js
c798d95082d993b0de54f32e728515255f91e5c130476ff0b77089138aea1b5f ./classes/DomDocumentParser.php
1bd5e96382d6a3eddeec946080c96629e9aa56c2774a017fad24606f0c9f4244 ./classes/ImageResultsProvider.php
748e777d13df22396e186ddadb825cf472c92d9bda7aa04aa19c2b1cf96de3ec ./classes/SiteResultsProvider.php
b0ca5b7eb0af35124f5caab1c334356fd3cfe5c2cea625f313336622e475df76 ./config.php
371c7775b3cddddd12ff95ece2b8782e984ba869d4bce45cdca8e6c14ffa07a5 ./crawl-formSubmit.php
62ac95f4e51efd41db713a049218ad9255b35350a2e070a6e1facc62d842550b ./crawl-manual.php
bec40c943cf20745f47210a73226f69c19c63a6c90a2cc6c23d223aa777f9ba9 ./doogle-tables-no-data.sql
54346c28a4b984e342192c34c95cd849d641437f22a7b03303e729e286a2afa4 ./index.php
2ec5c79665b679dbc290b3d753a3710fc732ac64e3e7e525324988e6507103f6 ./search.php
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEHj7kwug4oP7u+vqSNt6aWc2GnuIFAmJo/qQACgkQNt6aWc2G
nuJKbA/+LDRw7bwjj9mz/60E7BMMrMSJKOhnVTxNIfuiK+HVwFh8NAUoVvR3S3qZ
VJfS7igvmj/Ne6YDTpd45lVXhAPr1POx/RzwS0VGr270lacl1cyMq60dHrZu0wLc
/rKyEbejFCqH16l6f+qfy7e4rhJFi/IM+tS+gJp7T0EQMSuNzrH9KgVc7H0LlfWZ
tQy9Sll64y9TI/W80kzq2169ULgRoH3AQpWLBznaIPZo/EmKH/r57DV+WYHm10Z8
Bu9tdGPDn3eh6IvvHfeFm+dSgskbnh8FTsa2VUaY0GZ1hnLAvENjAAV9CDuUBj5z
KWZPhWIz+iM8RsMax13eA3TgNT+p7JLhHaeLtteyhXwWobgvTsgh/UichwmFqcPV
nswJgTngJhRMkf8O0A3fQO3zZrKAU2rR5bJrMSMylhIvtpg38+/yp0a5xIkJhKwn
scf9GV8brqT+q9y9wrwKfrgcWMTbUH07Iv9R7KwcNz3sz0lA2TgIAximg0ZbmeEp
FjCL/sl6mhtK/LlR2blclMxQnEXOg/Y17LLDqxRuh/SzGDgoEBce7I36j8ZyQSfT
z7wwufDMnsqoC5LoDONyfjhrnydmYRRJ9mVnSipz48ON/pPAo6jGyk78mmO3A0/P
oVepSwM8n1HUHwnQmsXtoHyz1lL7n/H+X/pJcwO7lMVYEuCaxDQ=
=M5Y1
-----END PGP SIGNATURE-----
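
A manifest in this format can be checked against a working copy by recomputing each digest. The sketch below is an illustration under stated assumptions: `verifyChecksums()` is a hypothetical helper, and it only compares SHA-256 digests. It does not validate the PGP signature, which would still need a tool such as `gpg --verify SHA256SUMS`.

```php
<?php
// Hypothetical helper: returns the listed paths whose SHA-256 digest does not match.
function verifyChecksums(string $sumsText, string $baseDir): array
{
    $failures = [];
    foreach (preg_split('/\R/', $sumsText) as $line) {
        // Only "HASH path" lines match; PGP armor and signature lines are skipped.
        if (preg_match('/^([0-9a-f]{64})\s+\*?(.+)$/', trim($line), $m) !== 1) {
            continue;
        }
        $path = $baseDir . DIRECTORY_SEPARATOR . $m[2];
        // A missing file counts as a failure.
        $actual = is_file($path) ? hash_file("sha256", $path) : null;
        if ($actual !== $m[1]) {
            $failures[] = $m[2];
        }
    }
    return $failures;
}
```

Calling `verifyChecksums(file_get_contents("SHA256SUMS"), ".")` from the project root would return an empty array when every listed file matches.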

View file

@@ -1,208 +0,0 @@
<?php
class Crawler
{
private $con;
public function __construct($con)
{
$this->con = $con;
}
function linkExists($url)
{
global $con;
$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
$query->bindParam(":url", $url);
$query->execute();
return $query->rowCount() != 0;
}
function imageExists($src)
{
global $con;
$query = $con->prepare("SELECT * FROM images WHERE imageUrl = :src");
$query->bindParam(":src", $src);
$query->execute();
return $query->rowCount() != 0;
}
function insertLink($url, $title, $description, $keywords)
{
global $con;
$query = $con->prepare("INSERT INTO sites(url, title, description, keywords)
VALUES(:url, :title, :description, :keywords)");
$query->bindParam(":url", $url);
$query->bindParam(":title", $title);
$query->bindParam(":description", $description);
$query->bindParam(":keywords", $keywords);
return $query->execute();
}
function insertImage($url, $src, $alt, $title)
{
global $con;
$query = $con->prepare("INSERT INTO images(siteUrl, imageUrl, alt, title)
VALUES(:siteUrl, :imageUrl, :alt, :title)");
$query->bindParam(":siteUrl", $url);
$query->bindParam(":imageUrl", $src);
$query->bindParam(":alt", $alt);
$query->bindParam(":title", $title);
return $query->execute();
}
/* Converts relative link to absolute link */
function createLink($src, $url)
{
$scheme = parse_url($url)["scheme"]; // http
$host = parse_url($url)["host"]; // www.safesploit.com
if(substr($src, 0, 2) == "//")
$src = $scheme . ":" . $src;
else if(substr($src, 0, 1) == "/")
$src = $scheme . "://" . $host . $src;
else if(substr($src, 0, 2) == "./")
$src = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src, 1);
else if(substr($src, 0, 3) == "../")
$src = $scheme . "://" . $host . "/" . $src;
else if(substr($src, 0, 5) != "https" && substr($src, 0, 4) != "http")
$src = $scheme . "://" . $host . "/" . $src;
return $src;
}
function getDetails($url)
{
global $alreadyFoundImages;
$parser = new DomDocumentParser($url);
$titleArray = $parser->getTitleTags();
if(sizeof($titleArray) == 0 || $titleArray->item(0) == NULL)
return;
//Replace linebreak
$title = $titleArray->item(0)->nodeValue;
$title = str_replace("\n", "", $title);
//Return if no <title>
if($title == "")
return;
$description = "";
$keywords = "";
$metasArray = $parser->getMetatags();
foreach($metasArray as $meta)
{
if($meta->getAttribute("name") == "description")
$description = $meta->getAttribute("content");
if($meta->getAttribute("name") == "keywords")
$keywords = $meta->getAttribute("content");
}
$description = str_replace("\n", "", $description);
$keywords = str_replace("\n", "", $keywords);
//Non-ASCII char encoding
// $title = json_encode($title);
// $description = json_encode($description);
// $keywords = json_encode($keywords);
if(linkExists($url))
echo "$url already exists<br>";
else if(insertLink($url, $title, $description, $keywords))
echo "SUCCESS: $url<br>";
else
echo "ERROR: Failed to insert $url<br>";
$imageArray = $parser->getImages();
foreach($imageArray as $image)
{
$src = $image->getAttribute("src");
$alt = $image->getAttribute("alt");
$title = $image->getAttribute("title");
if(!$title && !$alt)
continue;
$src = createLink($src, $url);
if(!in_array($src, $alreadyFoundImages))
{
$alreadyFoundImages[] = $src;
if(imageExists($src))
echo "$src already exists<br>";
else if(insertImage($url, $src, $alt, $title))
echo "SUCCESS: $src<br>";
else
echo "ERROR: Failed to insert $src<br>";
}
}
echo "<b>URL:</b> $url, <b>Title:</b> $title, <b>Description:</b> $description, <b>keywords:</b> $keywords<br>"; //DEBUGGING sites
echo "<b>src:</b> <a href=$src>$src</a>, <b>alt:</b> $alt, <b>title:</b> $title, <b>url:</b> $url<br>"; //DEBUGGING images
}
function followLinks($url)
{
global $alreadyCrawled;
global $crawling;
$parser = new DomDocumentParser($url);
$linkList = $parser->getLinks();
foreach($linkList as $link)
{
$href = $link->getAttribute("href");
// Filter hrefs
if(strpos($href, "#") !== false)
continue;
else if(substr($href, 0, 11) == "javascript:")
continue;
$href = createLink($href, $url);
if(!in_array($href, $alreadyCrawled))
{
$alreadyCrawled[] = $href;
$crawling[] = $href;
getDetails($href);
}
//else return; //DEBUGGING
echo ($href . "<br>"); //DEBUGGING
}
array_shift($crawling);
foreach($crawling as $site)
followLinks($site);
}
}
?>
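
The relative-to-absolute conversion in `createLink()` above can be exercised in isolation. This is a stand-alone sketch, not the project's code: `resolveLink()` is a hypothetical name, the example URLs are made up, and, as in the method above, `../` segments are not collapsed.

```php
<?php
// Sketch only: resolveLink() is a hypothetical stand-alone helper.
function resolveLink(string $src, string $pageUrl): string
{
    $parts  = parse_url($pageUrl);
    $scheme = $parts["scheme"] ?? "http";
    $origin = $scheme . "://" . ($parts["host"] ?? "");

    if (strncmp($src, "//", 2) === 0) {
        return $scheme . ":" . $src;               // protocol-relative
    }
    if (strncmp($src, "/", 1) === 0) {
        return $origin . $src;                     // root-relative
    }
    if (preg_match('#^https?://#i', $src) === 1) {
        return $src;                               // already absolute
    }
    if (strncmp($src, "./", 2) === 0) {
        // Relative to the directory of the current page.
        $dir = rtrim(dirname($parts["path"] ?? "/"), "/\\");
        return $origin . $dir . "/" . substr($src, 2);
    }
    return $origin . "/" . $src;                   // bare path, treated as root-relative
}

// Example inputs (hypothetical URLs).
$page = "https://www.example.com/blog/post";
$examples = [
    resolveLink("//cdn.example.com/a.png", $page),
    resolveLink("/about", $page),
    resolveLink("./img/a.png", $page),
];
```

Checking the scheme with an anchored pattern, rather than comparing the first four or five characters, avoids treating a relative path such as `httpdocs/a.png` as already absolute.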

View file

@@ -5,16 +5,13 @@ class DomDocumentParser
public function __construct($url)
{
$html = '<?xml encoding="UTF-8">';
$options = array(
'http'=>array('method'=>"GET", 'header'=>"User-Agent: doogleBot/0.1\n")
);
$context = stream_context_create($options);
$getConstants = file_get_contents($url, false, $context);
$this->doc = new DomDocument('1.0', 'utf-8');
@$this->doc->loadHTML($html . $getConstants);
$this->doc = new DomDocument();
@$this->doc->loadHTML(file_get_contents($url, false, $context));
//@ Error supression is unnecessary, PHP>7.0 supports HTML5
}
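
The `<?xml encoding="UTF-8">` prefix seen in this hunk is a common way to make `DOMDocument::loadHTML()` read markup as UTF-8. The sketch below illustrates the effect on an inline string instead of a fetched page; `extractTitle()` is a hypothetical helper and the sample markup is invented.

```php
<?php
// Sketch only: extractTitle() is a hypothetical helper, not part of DomDocumentParser.
function extractTitle(string $html): string
{
    $doc = new DOMDocument();
    // The XML declaration tells libxml to treat the markup as UTF-8.
    // "@" hides the warnings libxml raises for markup it does not recognise.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $titles = $doc->getElementsByTagName("title");
    return $titles->length > 0 ? $titles->item(0)->nodeValue : "";
}

// Sample page with accented and CJK characters in the title.
$page = "<html><head><title>Caf\u{00E9} \u{691C}\u{7D22}</title></head><body></body></html>";
$title = extractTitle($page);
```

Without the prefix, and without a charset declared in the markup itself, libxml may fall back to a single-byte encoding and return garbled titles.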

View file

@@ -2,9 +2,9 @@
ob_start();
$dbname = "doogle";
$dbhost = "mysql_db";
$dbuser = "doogle";
$dbpass = "PASSWORD_HERE";
$dbhost = "192.168.5.240";
$dbuser = "root";
$dbpass = "";
try
{
@@ -15,4 +15,4 @@ catch(PDOExeption $e)
{
echo "Connection failed: " . $e->getMessage();
}
?>
?>
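
Note that the hunk header above shows `catch(PDOExeption $e)`. PHP's exception class is spelled `PDOException`, so a catch clause naming a class that does not exist never matches, and a failed connection would surface as an uncaught exception. A minimal connection sketch follows, assuming hypothetical helpers `buildDsn()` and `connect()`; logging instead of echoing the error is also an assumption, not the project's behaviour.

```php
<?php
// Sketch only: buildDsn() and connect() are hypothetical helpers, not Doogle functions.

function buildDsn(string $host, string $dbname): string
{
    // utf8mb4 is the character set that appears in the SQL dump.
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

function connect(string $host, string $dbname, string $user, string $pass): ?PDO
{
    try {
        $con = new PDO(buildDsn($host, $dbname), $user, $pass);
        // Raise exceptions on SQL errors instead of failing silently.
        $con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        return $con;
    } catch (PDOException $e) {
        // Assumption: log the detail rather than echoing it to the visitor.
        error_log("Connection failed: " . $e->getMessage());
        return null;
    }
}
```

A caller would pass the values from config.php, for example `connect($dbhost, $dbname, $dbuser, $dbpass)`, and treat a `null` result as a failed connection.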

View file

@@ -1,14 +1,7 @@
<?php
include("config.php");
include("classes/Crawler.php");
include("classes/DomDocumentParser.php");
if(isset($_SESSION['loggedin']))
{
exit("You must be logged in!");
header("location: login.php");
}
$alreadyCrawled = array();
$crawling = array();
$alreadyFoundImages = array();
@@ -213,7 +206,6 @@ function followLinks($url)
<link rel="apple-touch-icon" href="assets/images/favicon/apple-touch-icon.png">
<link rel="android-chrome-icon" type="image/png" href="assets/images/favicon/android-chrome-512x512.png">
<meta charset="utf-8">
<meta name="description" content="Search the web for sites and images.">
<meta name="keywords" content="Search engine, doogle, websites">
<meta name="author" content="Zepher Ashe">
@@ -222,18 +214,11 @@ function followLinks($url)
<link rel="stylesheet" type="text/css" href="assets/css/style.css">
</head>
<body>
<div class="headerContent">
<div class="logoContainer">
<a href="index.php">
Homepage
</a>
</div>
<div id="crawl-wrapper">
<form action="crawl.php" method="post" accept-charset="utf-8">
URL: <input type="text" name="url" required="required" id="crawl-input" value="">
<button type="submit">Crawl</button>
</form>
</div>
<div id="crawl-wrapper">
<form action="crawl-formSubmit.php" method="post" >
URL: <input type="text" name="url" required="required" id="crawl-input" value="">
<button type="submit">Crawl</button>
</form>
</div>
</body>
</html>
@@ -241,9 +226,7 @@ function followLinks($url)
<?php
if (isset($_POST['url']))
{
$crawlerObj = new Crawler($con);
$startUrl = $_POST['url'];
// $crawlerObj->followLinks($startUrl);
followLinks($startUrl);
}
?>
?>

View file

@@ -1,4 +1,11 @@
-- phpMyAdmin SQL Dump - No Data
-- version 5.1.1
-- https://www.phpmyadmin.net/
--
-- Host: 192.168.5.240
-- Generation Time: Apr 24, 2022 at 09:25 AM
-- Server version: 8.0.28-0ubuntu0.20.04.3
-- PHP Version: 7.4.24
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
@@ -11,16 +18,10 @@ SET time_zone = "+00:00";
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
--
-- User Creation: `doogle`
--
CREATE USER IF NOT EXISTS 'doogle'@'%' IDENTIFIED BY 'PASSWORD_HERE';
GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'%';
--
-- Database: `doogle`
--
CREATE DATABASE IF NOT EXISTS `doogle` DEFAULT CHARACTER SET utf8mb4;
CREATE DATABASE IF NOT EXISTS `doogle` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
USE `doogle`;
-- --------------------------------------------------------
@@ -29,7 +30,7 @@ USE `doogle`;
-- Table structure for table `images`
--
CREATE TABLE IF NOT EXISTS `images` (
CREATE TABLE `images` (
`id` int(11) NOT NULL,
`siteUrl` varchar(512) NOT NULL,
`imageUrl` varchar(512) NOT NULL,
@@ -37,7 +38,7 @@ CREATE TABLE IF NOT EXISTS `images` (
`title` varchar(512) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0',
`broken` tinyint(4) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- --------------------------------------------------------
@@ -45,28 +46,14 @@ CREATE TABLE IF NOT EXISTS `images` (
-- Table structure for table `sites`
--
CREATE TABLE IF NOT EXISTS `sites` (
CREATE TABLE `sites` (
`id` int(11) NOT NULL,
`url` varchar(512) NOT NULL,
`title` varchar(512) NOT NULL,
`description` varchar(512) NOT NULL,
`keywords` varchar(512) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- --------------------------------------------------------
--
-- Table structure for table `users`
--
CREATE TABLE IF NOT EXISTS `users` (
`id` int(11) NOT NULL,
`username` varchar(100) NOT NULL,
`email` varchar(150) NOT NULL,
`password` varchar(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- Indexes for dumped tables
@@ -84,12 +71,6 @@ ALTER TABLE `images`
ALTER TABLE `sites`
ADD PRIMARY KEY (`id`);
--
-- Indexes for table `users`
--
ALTER TABLE `users`
ADD PRIMARY KEY (`id`);
--
-- AUTO_INCREMENT for dumped tables
--
@@ -105,12 +86,6 @@ ALTER TABLE `images`
--
ALTER TABLE `sites`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=5297;
--
-- AUTO_INCREMENT for table `users`
--
ALTER TABLE `users`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1382;
COMMIT;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;