Compare commits
39 commits
v1.0.0-bet
...
main
| Author | SHA1 | Date |
|---|---|---|
| | 50304b956c | |
| | 6720029da2 | |
| | 46a05506a7 | |
| | 2a879d9781 | |
| | 9608bc5731 | |
| | 38bf3037ff | |
| | 4373731716 | |
| | 72fc0a79eb | |
| | 97a995c6d0 | |
| | bb93dc745f | |
| | 7b2fb538ac | |
| | a30063d001 | |
| | 4874704be8 | |
| | c6b1e6b339 | |
| | 5f1a2d04e0 | |
| | 2ae8269fb3 | |
| | dd201e4ba4 | |
| | 3bd5530c36 | |
| | c9a23da943 | |
| | ec281ecf55 | |
| | 9603f87428 | |
| | 56f624f2f8 | |
| | 7eb39af2b8 | |
| | 9de6a83693 | |
| | 9f1c926400 | |
| | e03c08cd74 | |
| | dc0e4a607b | |
| | e7463c39d4 | |
| | f05d8855c5 | |
| | 59c837e418 | |
| | bec211ef5c | |
| | a7047e25f2 | |
| | a4433bee21 | |
| | 8a75f1fc1d | |
| | b46a09af5f | |
| | 93e88e266e | |
| | a936b78234 | |
| | 1ff8e1b4be | |
| | b5ab7bf17e | |
8 changed files with 342 additions and 86 deletions
2
.gitignore
vendored
Normal file
|
@@ -0,0 +1,2 @@
|
|||
|
||||
config.php
|
76
README.md
|
@@ -1,5 +1,5 @@
|
|||
# Doogle
|
||||
Doogle is a search engine and web crawler which can search indexed websites and image, and then using keywords be searched later.
|
||||
Doogle is a search engine and web crawler that indexes websites and images so they can later be searched by keyword.
|
||||
|
||||
Written primarily in OOP style PHP with the intent of better understanding OOP and how web crawlers work.
|
||||
|
||||
|
@@ -13,7 +13,7 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
|
|||
* Displays title, URL and description
|
||||
- Search images
|
||||
* Hover over images to preview description (alt tag)
|
||||
* Masonary layout for searched images
|
||||
* Masonry layout for searched images
|
||||
* Image preview using Fancybox
|
||||
* Image search page responds dynamically
|
||||
- Clean homepage
|
||||
|
@@ -21,11 +21,14 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
|
|||
- Organises search results by clicks/visits
|
||||
- Pagination system at the bottom of the search page
|
||||
- Shows 'results found' for search term
|
||||
- Supports non-latin characters (UTF-8)
|
||||
|
||||
# Table of Contents
|
||||
|
||||
- [Setup and Usage](#setup-and-usage)
|
||||
- [Docker](#docker)
|
||||
- [Server Setup](#server-setup)
|
||||
- [PHP Dependencies](#php-dependencies)
|
||||
- [Connecting PHP to MySQL Server](#connecting-php-to-mysql-server)
|
||||
- [Crawling Websites to Populate Images and Sites tables](#crawling-websites-to-populate-images-and-sites-tables)
|
||||
- [Programming Logic](#programming-logic)
|
||||
|
@@ -44,8 +47,34 @@ Written primarily in OOP style PHP with the intent of better understanding OOP a
|
|||
|
||||
# Setup and Usage
|
||||
|
||||
Two methods of setup are discussed.
|
||||
- Docker (Easiest)
|
||||
- Server Setup
|
||||
|
||||
## Docker
|
||||
|
||||
Docker configuration files are available at [doogle-docker](https://github.com/safesploit/doogle-docker).
|
||||
|
||||
These steps presume that [Docker](https://www.docker.com/) v3.9 (or greater) is already installed and configured.
|
||||
|
||||
git clone https://github.com/safesploit/doogle-docker.git
|
||||
cd doogle-docker
|
||||
sh build.sh
|
||||
|
||||
<p align="center">
|
||||
<img width="857" alt="Screenshot 2023-02-22 at 21 11 33" src="https://user-images.githubusercontent.com/10171446/220760089-71baee5a-19ce-43e6-9cd5-35ce9e143400.png">
|
||||
<img width="857" alt="image" src="https://user-images.githubusercontent.com/10171446/220760298-65e0b64e-3724-4e8e-b9ec-a86ba20d58c8.png">
|
||||
|
||||
Doogle is now accessible via [localhost:8000](http://localhost:8000).
|
||||
|
||||
For debugging, phpMyAdmin is also included at [localhost:8001](http://localhost:8001).
|
||||
|
||||
</p>
|
||||
|
||||
## Server Setup
|
||||
|
||||
v1.0.0-beta.1 is supported and tested in PHP 7.4, 8.0 and 8.1.
|
||||
|
||||
Please refer to [XAMPP](https://www.apachefriends.org/index.html) for the web server, PHP server and MySQL server configuration.
|
||||
XAMPP is the simplest method as several servers are required to use Doogle.
|
||||
|
||||
|
@@ -55,23 +84,44 @@ Once logged into the database via PHPMyAdmin under the **PHPMyAdmin > SQL** tab,
|
|||
|
||||
<img width="960" alt="Image1-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165310962-7ec771d2-50a0-4117-87f8-60373f694e55.png">
|
||||
|
||||
## PHP Dependencies
|
||||
|
||||
mysql
|
||||
pdo_mysql
|
||||
|
||||
|
||||
### SQL User Creation
|
||||
|
||||
Replace the placeholder _PASSWORD_HERE_ with a strong [random password](https://passwordsgenerator.net/).
|
||||
|
||||
mysql> CREATE USER IF NOT EXISTS 'doogle'@'localhost' IDENTIFIED BY 'PASSWORD_HERE';
|
||||
|
||||
### SQL User Permissions
|
||||
|
||||
The SQL user 'doogle' must have SELECT, INSERT and UPDATE privileges:
|
||||
|
||||
mysql> GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'localhost';
|
||||
|
||||
- INSERT is used for crawling
|
||||
- SELECT is required for the search engine to return results
|
||||
- UPDATE is required to amend the clicks and broken results (see ./ajax/)
|
||||
|
||||
## Connecting PHP to MySQL Server
|
||||
|
||||
In the file config.php the following must be entered correctly for your database configuration:
|
||||
|
||||
$dbname = "doogle";
|
||||
$dbhost = "127.0.0.1";
|
||||
$dbuser = "root";
|
||||
$dbhost = "localhost";
|
||||
$dbuser = "doogle";
|
||||
$dbpass = "";
|
||||
|
||||
In the file 'doogle-tables-no-data.sql' the database will be created as 'doogle', but the remaining parameters must still be filled.
|
||||
In the file 'doogle-tables-no-data.sql' the database will be created as 'doogle'.
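As a rough illustration of how the config.php values above might feed a PDO connection, here is a minimal sketch. The helper name `buildDsn` and the `charset=utf8mb4` option are assumptions for this example, not code taken from the repository; the actual connection call is left commented out because it needs the `pdo_mysql` extension and a running MySQL server.

```php
<?php
// Hypothetical helper (not part of Doogle): builds a PDO DSN string
// from the host and database name configured in config.php.
function buildDsn(string $dbhost, string $dbname): string
{
    // charset=utf8mb4 is an assumption, chosen to match the utf8mb4 schema.
    return "mysql:host=$dbhost;dbname=$dbname;charset=utf8mb4";
}

$dbname = "doogle";
$dbhost = "localhost";
$dbuser = "doogle";
$dbpass = "PASSWORD_HERE";

$dsn = buildDsn($dbhost, $dbname);
echo $dsn, PHP_EOL;

// With a reachable server, the connection could then be opened like this:
// $con = new PDO($dsn, $dbuser, $dbpass);
// $con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
```

Keeping the DSN construction separate makes it easy to check the values before attempting a connection.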
|
||||
|
||||
## Crawling Websites to Populate Images and Sites tables
|
||||
|
||||
### Form-based crawl
|
||||
|
||||
In your browser go to where the file is hosted http://127.0.0.1/crawl-formSubmit.php
|
||||
In your browser go to where the file is hosted http://localhost/crawl.php
|
||||
|
||||
Paste the URL into the input field and press the Crawl button.
|
||||
|
||||
|
@@ -81,14 +131,14 @@ At the bottom of crawl-manual.php the variable $startUrl is where to paste the U
|
|||
|
||||
$startUrl = "https://thehackernews.com/";
|
||||
|
||||
Then in your browser go to where the file is hosted http://127.0.0.1/crawl-manual.php
|
||||
Then in your browser go to where the file is hosted http://localhost/crawl-manual.php
|
||||
|
||||
### Explination
|
||||
### Explanation
|
||||
|
||||
The crawling process will take some time; how long depends entirely on the size of the website being crawled.
|
||||
The page will continue to load (without output) until the crawl.php script finishes.
|
||||
The page will continue to load (without output) until the `crawl.php` script finishes.
|
||||
|
||||
Check the tables 'images' and 'sites' in the database to ensure they are being populated.
|
||||
Check the tables `images` and `sites` in the database to ensure they are being populated.
|
||||
|
||||
<img width="960" alt="Image2-PHPMyAdmin" src="https://user-images.githubusercontent.com/10171446/165312292-c2830b80-365d-4a39-b176-8226bd0d7f65.png">
|
||||
|
||||
|
@@ -142,7 +192,7 @@ To make image searches more informative, the 'alt' tag is part of the search ter
|
|||
|
||||
|
||||
### Loading Images with JavaScript
|
||||
In the 'images' table there is a row 'broken' which tracks images which return an error.
|
||||
In the 'images' table, there is a column 'broken' which tracks images that return an error.
|
||||
|
||||
Because images are already loaded with a pure server-side solution, AJAX must be leveraged to load images dynamically, as shown in ./assets/js/script.js
|
||||
|
||||
|
@@ -181,7 +231,7 @@ Both the 'images' and 'sites' tables in the database have a row containing 'clic
|
|||
|
||||
The 'clicks' field is increased each time a site is visited or image is previewed.
|
||||
|
||||
When performing a search, results returned are organised in decending order of clicks.
|
||||
When performing a search, results returned are organised in descending order of clicks.
|
||||
This behaviour is shown by the $query inside ./classes/SiteResultsProvider.php function getResultsHtml(). See line 43.
|
||||
|
||||
<img width="443" alt="SiteResultsProvider-getResultsHtml" src="https://user-images.githubusercontent.com/10171446/165467418-37de4f8c-1901-4911-a7c9-33b42806f0bb.png">
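The ordering described above can be sketched in plain PHP. This is only an in-memory stand-in for a SQL `ORDER BY clicks DESC` clause: the sample rows, the example URLs and the function name `sortByClicksDesc` are assumptions for illustration, not the query used in SiteResultsProvider.php.

```php
<?php
// Sample rows standing in for records from the `sites` table (made-up data).
$results = [
    ["url" => "https://a.example/", "clicks" => 3],
    ["url" => "https://b.example/", "clicks" => 12],
    ["url" => "https://c.example/", "clicks" => 7],
];

// Hypothetical helper: returns the rows sorted by clicks, highest first.
function sortByClicksDesc(array $rows): array
{
    // Swapping $y and $x in the spaceship comparison yields descending order.
    usort($rows, fn(array $x, array $y): int => $y["clicks"] <=> $x["clicks"]);
    return $rows;
}

foreach (sortByClicksDesc($results) as $row) {
    echo $row["clicks"], " ", $row["url"], PHP_EOL;
}
```

In the real application the database would normally do this sorting, which avoids loading every row into PHP first.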
|
||||
|
@@ -220,7 +270,7 @@ The title, image URL and site URL are available on the bottom left corner.
|
|||
|
||||
## Pagination System
|
||||
|
||||
Naturally certain search terms may return many results like 'bbc'.
|
||||
Naturally, certain search terms may return many results like 'bbc'.
|
||||
|
||||
Doogle therefore displays only **20 sites** per page.
|
||||
At the bottom of the page, we can view the next 10 pages.
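The arithmetic behind a 20-per-page layout can be sketched as follows. The function names `pageOffset` and `totalPages` are hypothetical, and the sketch assumes 1-based page numbers; it is not the code Doogle itself uses.

```php
<?php
// Hypothetical helper: number of rows to skip before the given page.
// Page numbers are assumed to be 1-based; values below 1 are treated as 1.
function pageOffset(int $page, int $pageSize = 20): int
{
    return (max(1, $page) - 1) * $pageSize;
}

// Hypothetical helper: how many pages are needed to show all results.
function totalPages(int $numResults, int $pageSize = 20): int
{
    return (int) ceil($numResults / $pageSize);
}

echo pageOffset(3), PHP_EOL;   // rows skipped before page 3
echo totalPages(95), PHP_EOL;  // pages needed for 95 results
```

An offset computed this way would typically be passed to a SQL `LIMIT` clause together with the page size.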
|
||||
|
|
49
SHA256SUMS
|
@@ -1,49 +0,0 @@
|
|||
-----BEGIN PGP SIGNED MESSAGE-----
|
||||
Hash: SHA512
|
||||
|
||||
59492c770b524a1c583598969f410864b42cbe50dba1567401d42925a21fdc3b ./ajax/setBroken.php
|
||||
42b0a956fbb7e7c4d258b76f77a40bcc3f424e92b8df4e48029b6ac36814ce13 ./ajax/updateImageCount.php
|
||||
93c09e73205ae4cb9298787e456e7ba40a3a59913d4c7f7fa8231f3c0d57fb7e ./ajax/updateLinkCount.php
|
||||
f6f3d53dd2240261f157695adf386a5c08014298c19f62ccf63cd162996892d0 ./assets/css/fancybox/3.3.5/jquery.fancybox.min.css
|
||||
d5acffdfb41dc1e41a2132bff2e67b0e1632f10cba48321abc5c33e09fa0d076 ./assets/css/style.css
|
||||
f72f0630f8cbe6b5a24094960e099eaf1f7be18af93c2f74f516f13b1e6212a0 ./assets/images/doogleLogo.png
|
||||
11e84e77b4b2ca65fdc45407601edd77acef60994d8dfb670fe6db6a6672dbb9 ./assets/images/favicon/android-chrome-192x192.png
|
||||
81506b473fe8709ea34e9974ede53b7bed210bbbd5b8f3541e9cc9fac2cd1fd8 ./assets/images/favicon/android-chrome-512x512.png
|
||||
30b13061a191ddd4eb62107fa20bc714db91c5b39d39e64e843ce5a118a13bf1 ./assets/images/favicon/apple-touch-icon.png
|
||||
a734688685e1d3eeb7c5b15267f31e7961aff394f3f68fc389a256b45e42970e ./assets/images/favicon/favicon-16x16.png
|
||||
018136439b52bc1db7c84311f20435cfce95b5e191f5d8f36b27ce7eb5bc6064 ./assets/images/favicon/favicon-32x32.png
|
||||
09925a497bbd72a7850434f205f31d9ab8cfa0e4f727718731595314ac89d482 ./assets/images/favicon/favicon.ico
|
||||
bcb764f2e87fefd1f9c39cf0d3517ad4cbea2008cd925380aee23d9832e1fc2c ./assets/images/icons/search.png
|
||||
88761e31eae97360d4dbdeedff92c4d151ec33492e9f1cdb34eb802762a9c125 ./assets/images/page.png
|
||||
4ea49f1436476f370da62494ae780cfc99d4cbdd5cfab48082be4ce2274ecc07 ./assets/images/pageEnd.png
|
||||
e771b2c0a69e5695ad7ff1a8bd7071fc6d46e6f9e3024acad6f743634d9c2e6d ./assets/images/pageSelected.png
|
||||
61a1b5647ecafd1ae2fdd513262a394e16222b590bb82cc2a442f128cc6d4e52 ./assets/images/pageStart.png
|
||||
4dbe2075e08dfc008a9a1290dc149f6ee360215610cc1944bdb625c0aee3b83c ./assets/js/fancybox/3.3.5/jquery.fancybox.min.js
|
||||
160a426ff2894252cd7cebbdd6d6b7da8fcd319c65b70468f10b6690c45d02ef ./assets/js/jquery-3.3.1.min.js
|
||||
367d6afdfc741fb48d2d9310e47c3924b693459a74c882c0fc545ec5ed7d55d2 ./assets/js/masonry/4.2.2/masonry.pkgd.min.js
|
||||
19ada944019a8ef415a633317cc3d0924d5a0f2d91fe47ba1546e8844d7b308f ./assets/js/script.js
|
||||
c798d95082d993b0de54f32e728515255f91e5c130476ff0b77089138aea1b5f ./classes/DomDocumentParser.php
|
||||
1bd5e96382d6a3eddeec946080c96629e9aa56c2774a017fad24606f0c9f4244 ./classes/ImageResultsProvider.php
|
||||
748e777d13df22396e186ddadb825cf472c92d9bda7aa04aa19c2b1cf96de3ec ./classes/SiteResultsProvider.php
|
||||
b0ca5b7eb0af35124f5caab1c334356fd3cfe5c2cea625f313336622e475df76 ./config.php
|
||||
371c7775b3cddddd12ff95ece2b8782e984ba869d4bce45cdca8e6c14ffa07a5 ./crawl-formSubmit.php
|
||||
62ac95f4e51efd41db713a049218ad9255b35350a2e070a6e1facc62d842550b ./crawl-manual.php
|
||||
bec40c943cf20745f47210a73226f69c19c63a6c90a2cc6c23d223aa777f9ba9 ./doogle-tables-no-data.sql
|
||||
54346c28a4b984e342192c34c95cd849d641437f22a7b03303e729e286a2afa4 ./index.php
|
||||
2ec5c79665b679dbc290b3d753a3710fc732ac64e3e7e525324988e6507103f6 ./search.php
|
||||
-----BEGIN PGP SIGNATURE-----
|
||||
|
||||
iQIzBAEBCgAdFiEEHj7kwug4oP7u+vqSNt6aWc2GnuIFAmJo/qQACgkQNt6aWc2G
|
||||
nuJKbA/+LDRw7bwjj9mz/60E7BMMrMSJKOhnVTxNIfuiK+HVwFh8NAUoVvR3S3qZ
|
||||
VJfS7igvmj/Ne6YDTpd45lVXhAPr1POx/RzwS0VGr270lacl1cyMq60dHrZu0wLc
|
||||
/rKyEbejFCqH16l6f+qfy7e4rhJFi/IM+tS+gJp7T0EQMSuNzrH9KgVc7H0LlfWZ
|
||||
tQy9Sll64y9TI/W80kzq2169ULgRoH3AQpWLBznaIPZo/EmKH/r57DV+WYHm10Z8
|
||||
Bu9tdGPDn3eh6IvvHfeFm+dSgskbnh8FTsa2VUaY0GZ1hnLAvENjAAV9CDuUBj5z
|
||||
KWZPhWIz+iM8RsMax13eA3TgNT+p7JLhHaeLtteyhXwWobgvTsgh/UichwmFqcPV
|
||||
nswJgTngJhRMkf8O0A3fQO3zZrKAU2rR5bJrMSMylhIvtpg38+/yp0a5xIkJhKwn
|
||||
scf9GV8brqT+q9y9wrwKfrgcWMTbUH07Iv9R7KwcNz3sz0lA2TgIAximg0ZbmeEp
|
||||
FjCL/sl6mhtK/LlR2blclMxQnEXOg/Y17LLDqxRuh/SzGDgoEBce7I36j8ZyQSfT
|
||||
z7wwufDMnsqoC5LoDONyfjhrnydmYRRJ9mVnSipz48ON/pPAo6jGyk78mmO3A0/P
|
||||
oVepSwM8n1HUHwnQmsXtoHyz1lL7n/H+X/pJcwO7lMVYEuCaxDQ=
|
||||
=M5Y1
|
||||
-----END PGP SIGNATURE-----
|
208
classes/Crawler.php
Normal file
|
@@ -0,0 +1,208 @@
|
|||
<?php
|
||||
class Crawler
|
||||
{
|
||||
private $con;
|
||||
|
||||
public function __construct($con)
|
||||
{
|
||||
$this->con = $con;
|
||||
}
|
||||
|
||||
|
||||
|
||||
function linkExists($url)
|
||||
{
|
||||
global $con;
|
||||
|
||||
$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
|
||||
|
||||
$query->bindParam(":url", $url);
|
||||
$query->execute();
|
||||
|
||||
return $query->rowCount() != 0;
|
||||
}
|
||||
|
||||
function imageExists($src)
|
||||
{
|
||||
global $con;
|
||||
|
||||
$query = $con->prepare("SELECT * FROM images WHERE imageUrl = :src");
|
||||
|
||||
$query->bindParam(":src", $src);
|
||||
$query->execute();
|
||||
|
||||
return $query->rowCount() != 0;
|
||||
}
|
||||
|
||||
|
||||
function insertLink($url, $title, $description, $keywords)
|
||||
{
|
||||
global $con;
|
||||
|
||||
$query = $con->prepare("INSERT INTO sites(url, title, description, keywords)
|
||||
VALUES(:url, :title, :description, :keywords)");
|
||||
|
||||
$query->bindParam(":url", $url);
|
||||
$query->bindParam(":title", $title);
|
||||
$query->bindParam(":description", $description);
|
||||
$query->bindParam(":keywords", $keywords);
|
||||
|
||||
return $query->execute();
|
||||
}
|
||||
|
||||
function insertImage($url, $src, $alt, $title)
|
||||
{
|
||||
global $con;
|
||||
|
||||
$query = $con->prepare("INSERT INTO images(siteUrl, imageUrl, alt, title)
|
||||
VALUES(:siteUrl, :imageUrl, :alt, :title)");
|
||||
|
||||
$query->bindParam(":siteUrl", $url);
|
||||
$query->bindParam(":imageUrl", $src);
|
||||
$query->bindParam(":alt", $alt);
|
||||
$query->bindParam(":title", $title);
|
||||
|
||||
return $query->execute();
|
||||
}
|
||||
|
||||
/* Converts relative link to absolute link */
|
||||
function createLink($src, $url)
|
||||
{
|
||||
$scheme = parse_url($url)["scheme"]; // http
|
||||
$host = parse_url($url)["host"]; // www.safesploit.com
|
||||
|
||||
if(substr($src, 0, 2) == "//")
|
||||
$src = $scheme . ":" . $src;
|
||||
else if(substr($src, 0, 1) == "/")
|
||||
$src = $scheme . "://" . $host . $src;
|
||||
else if(substr($src, 0, 2) == "./")
|
||||
$src = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src, 1);
|
||||
else if(substr($src, 0, 3) == "../")
|
||||
$src = $scheme . "://" . $host . "/" . $src;
|
||||
else if(substr($src, 0, 5) != "https" && substr($src, 0, 4) != "http")
|
||||
$src = $scheme . "://" . $host . "/" . $src;
|
||||
|
||||
return $src;
|
||||
}
|
||||
|
||||
function getDetails($url)
|
||||
{
|
||||
global $alreadyFoundImages;
|
||||
|
||||
$parser = new DomDocumentParser($url);
|
||||
|
||||
$titleArray = $parser->getTitleTags();
|
||||
|
||||
if(sizeof($titleArray) == 0 || $titleArray->item(0) == NULL)
|
||||
return;
|
||||
|
||||
//Replace linebreak
|
||||
$title = $titleArray->item(0)->nodeValue;
|
||||
$title = str_replace("\n", "", $title);
|
||||
|
||||
//Return if no <title>
|
||||
if($title == "")
|
||||
return;
|
||||
|
||||
$description = "";
|
||||
$keywords = "";
|
||||
|
||||
$metasArray = $parser->getMetatags();
|
||||
|
||||
foreach($metasArray as $meta)
|
||||
{
|
||||
if($meta->getAttribute("name") == "description")
|
||||
$description = $meta->getAttribute("content");
|
||||
|
||||
if($meta->getAttribute("name") == "keywords")
|
||||
$keywords = $meta->getAttribute("content");
|
||||
}
|
||||
|
||||
$description = str_replace("\n", "", $description);
|
||||
$keywords = str_replace("\n", "", $keywords);
|
||||
|
||||
//Non-ASCII char encoding
|
||||
// $title = json_encode($title);
|
||||
// $description = json_encode($description);
|
||||
// $keywords = json_encode($keywords);
|
||||
|
||||
if(linkExists($url))
|
||||
echo "$url already exists<br>";
|
||||
else if(insertLink($url, $title, $description, $keywords))
|
||||
echo "SUCCESS: $url<br>";
|
||||
else
|
||||
echo "ERROR: Failed to insert $url<br>";
|
||||
|
||||
$imageArray = $parser->getImages();
|
||||
foreach($imageArray as $image)
|
||||
{
|
||||
$src = $image->getAttribute("src");
|
||||
$alt = $image->getAttribute("alt");
|
||||
$title = $image->getAttribute("title");
|
||||
|
||||
if(!$title && !$alt)
|
||||
continue;
|
||||
|
||||
$src = createLink($src, $url);
|
||||
|
||||
if(!in_array($src, $alreadyFoundImages))
|
||||
{
|
||||
$alreadyFoundImages[] = $src;
|
||||
|
||||
if(imageExists($src))
|
||||
echo "$src already exists<br>";
|
||||
else if(insertImage($url, $src, $alt, $title))
|
||||
echo "SUCCESS: $src<br>";
|
||||
else
|
||||
echo "ERROR: Failed to insert $src<br>";
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
echo "<b>URL:</b> $url, <b>Title:</b> $title, <b>Description:</b> $description, <b>keywords:</b> $keywords<br>"; //DEBUGGING sites
|
||||
echo "<b>src:</b> <a href=$src>$src</a>, <b>alt:</b> $alt, <b>title:</b> $title, <b>url:</b> $url<br>"; //DEBUGGING images
|
||||
}
|
||||
|
||||
function followLinks($url)
|
||||
{
|
||||
global $alreadyCrawled;
|
||||
global $crawling;
|
||||
|
||||
$parser = new DomDocumentParser($url);
|
||||
|
||||
$linkList = $parser->getLinks();
|
||||
|
||||
|
||||
foreach($linkList as $link)
|
||||
{
|
||||
$href = $link->getAttribute("href");
|
||||
|
||||
// Filter hrefs
|
||||
if(strpos($href, "#") !== false)
|
||||
continue;
|
||||
else if(substr($href, 0, 11) == "javascript:")
|
||||
continue;
|
||||
|
||||
$href = createLink($href, $url);
|
||||
|
||||
if(!in_array($href, $alreadyCrawled))
|
||||
{
|
||||
$alreadyCrawled[] = $href;
|
||||
$crawling[] = $href;
|
||||
|
||||
getDetails($href);
|
||||
}
|
||||
//else return; //DEBUGGING
|
||||
|
||||
echo ($href . "<br>"); //DEBUGGING
|
||||
}
|
||||
|
||||
array_shift($crawling);
|
||||
|
||||
foreach($crawling as $site)
|
||||
followLinks($site);
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
?>
|
|
@@ -5,13 +5,16 @@ class DomDocumentParser
|
|||
|
||||
public function __construct($url)
|
||||
{
|
||||
$html = '<?xml encoding="UTF-8">';
|
||||
|
||||
$options = array(
|
||||
'http'=>array('method'=>"GET", 'header'=>"User-Agent: doogleBot/0.1\n")
|
||||
);
|
||||
$context = stream_context_create($options);
|
||||
$getConstants = file_get_contents($url, false, $context);
|
||||
|
||||
$this->doc = new DomDocument();
|
||||
@$this->doc->loadHTML(file_get_contents($url, false, $context));
|
||||
$this->doc = new DomDocument('1.0', 'utf-8');
|
||||
@$this->doc->loadHTML($html . $getConstants);
|
||||
//@ Error suppression is unnecessary, PHP>7.0 supports HTML5
|
||||
}
|
||||
|
||||
|
|
|
@@ -2,9 +2,9 @@
|
|||
ob_start();
|
||||
|
||||
$dbname = "doogle";
|
||||
$dbhost = "192.168.5.240";
|
||||
$dbuser = "root";
|
||||
$dbpass = "";
|
||||
$dbhost = "mysql_db";
|
||||
$dbuser = "doogle";
|
||||
$dbpass = "PASSWORD_HERE";
|
||||
|
||||
try
|
||||
{
|
||||
|
@@ -15,4 +15,4 @@ catch(PDOException $e)
|
|||
{
|
||||
echo "Connection failed: " . $e->getMessage();
|
||||
}
|
||||
?>
|
||||
?>
|
||||
|
|
|
@@ -1,7 +1,14 @@
|
|||
<?php
|
||||
include("config.php");
|
||||
include("classes/Crawler.php");
|
||||
include("classes/DomDocumentParser.php");
|
||||
|
||||
if(!isset($_SESSION['loggedin'])) // block visitors who are NOT logged in
|
||||
{
|
||||
exit("You must be logged in!");
|
||||
header("location: login.php");
|
||||
}
|
||||
|
||||
$alreadyCrawled = array();
|
||||
$crawling = array();
|
||||
$alreadyFoundImages = array();
|
||||
|
@@ -206,6 +213,7 @@ function followLinks($url)
|
|||
<link rel="apple-touch-icon" href="assets/images/favicon/apple-touch-icon.png">
|
||||
<link rel="android-chrome-icon" type="image/png" href="assets/images/favicon/android-chrome-512x512.png">
|
||||
|
||||
<meta charset="utf-8">
|
||||
<meta name="description" content="Search the web for sites and images.">
|
||||
<meta name="keywords" content="Search engine, doogle, websites">
|
||||
<meta name="author" content="Zepher Ashe">
|
||||
|
@@ -214,11 +222,18 @@ function followLinks($url)
|
|||
<link rel="stylesheet" type="text/css" href="assets/css/style.css">
|
||||
</head>
|
||||
<body>
|
||||
<div id="crawl-wrapper">
|
||||
<form action="crawl-formSubmit.php" method="post" >
|
||||
URL: <input type="text" name="url" required="required" id="crawl-input" value="">
|
||||
<button type="submit">Crawl</button>
|
||||
</form>
|
||||
<div class="headerContent">
|
||||
<div class="logoContainer">
|
||||
<a href="index.php">
|
||||
Homepage
|
||||
</a>
|
||||
</div>
|
||||
<div id="crawl-wrapper">
|
||||
<form action="crawl.php" method="post" accept-charset="utf-8">
|
||||
URL: <input type="text" name="url" required="required" id="crawl-input" value="">
|
||||
<button type="submit">Crawl</button>
|
||||
</form>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
@@ -226,7 +241,9 @@ function followLinks($url)
|
|||
<?php
|
||||
if (isset($_POST['url']))
|
||||
{
|
||||
$crawlerObj = new Crawler($con);
|
||||
$startUrl = $_POST['url'];
|
||||
// $crawlerObj->followLinks($startUrl);
|
||||
followLinks($startUrl);
|
||||
}
|
||||
?>
|
||||
?>
|
|
@@ -1,11 +1,4 @@
|
|||
-- phpMyAdmin SQL Dump - No Data
|
||||
-- version 5.1.1
|
||||
-- https://www.phpmyadmin.net/
|
||||
--
|
||||
-- Host: 192.168.5.240
|
||||
-- Generation Time: Apr 24, 2022 at 09:25 AM
|
||||
-- Server version: 8.0.28-0ubuntu0.20.04.3
|
||||
-- PHP Version: 7.4.24
|
||||
|
||||
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
|
||||
SET AUTOCOMMIT = 0;
|
||||
|
@@ -18,10 +11,16 @@ SET time_zone = "+00:00";
|
|||
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
|
||||
/*!40101 SET NAMES utf8mb4 */;
|
||||
|
||||
--
|
||||
-- User Creation: `doogle`
|
||||
--
|
||||
CREATE USER IF NOT EXISTS 'doogle'@'%' IDENTIFIED BY 'PASSWORD_HERE';
|
||||
GRANT SELECT, INSERT, UPDATE ON `doogle`.* TO 'doogle'@'%';
|
||||
|
||||
--
|
||||
-- Database: `doogle`
|
||||
--
|
||||
CREATE DATABASE IF NOT EXISTS `doogle` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
|
||||
CREATE DATABASE IF NOT EXISTS `doogle` DEFAULT CHARACTER SET utf8mb4;
|
||||
USE `doogle`;
|
||||
|
||||
-- --------------------------------------------------------
|
||||
|
@@ -30,7 +29,7 @@ USE `doogle`;
|
|||
-- Table structure for table `images`
|
||||
--
|
||||
|
||||
CREATE TABLE `images` (
|
||||
CREATE TABLE IF NOT EXISTS `images` (
|
||||
`id` int(11) NOT NULL,
|
||||
`siteUrl` varchar(512) NOT NULL,
|
||||
`imageUrl` varchar(512) NOT NULL,
|
||||
|
@@ -38,7 +37,7 @@ CREATE TABLE `images` (
|
|||
`title` varchar(512) NOT NULL,
|
||||
`clicks` int(11) NOT NULL DEFAULT '0',
|
||||
`broken` tinyint(4) NOT NULL DEFAULT '0'
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
|
||||
|
||||
-- --------------------------------------------------------
|
||||
|
||||
|
@@ -46,14 +45,28 @@ CREATE TABLE `images` (
|
|||
-- Table structure for table `sites`
|
||||
--
|
||||
|
||||
CREATE TABLE `sites` (
|
||||
CREATE TABLE IF NOT EXISTS `sites` (
|
||||
`id` int(11) NOT NULL,
|
||||
`url` varchar(512) NOT NULL,
|
||||
`title` varchar(512) NOT NULL,
|
||||
`description` varchar(512) NOT NULL,
|
||||
`keywords` varchar(512) NOT NULL,
|
||||
`clicks` int(11) NOT NULL DEFAULT '0'
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
|
||||
|
||||
-- --------------------------------------------------------
|
||||
|
||||
--
|
||||
-- Table structure for table `users`
|
||||
--
|
||||
|
||||
CREATE TABLE IF NOT EXISTS `users` (
|
||||
`id` int(11) NOT NULL,
|
||||
`username` varchar(100) NOT NULL,
|
||||
`email` varchar(150) NOT NULL,
|
||||
`password` varchar(255) NOT NULL
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
|
||||
|
||||
|
||||
--
|
||||
-- Indexes for dumped tables
|
||||
|
@@ -71,6 +84,12 @@ ALTER TABLE `images`
|
|||
ALTER TABLE `sites`
|
||||
ADD PRIMARY KEY (`id`);
|
||||
|
||||
--
|
||||
-- Indexes for table `users`
|
||||
--
|
||||
ALTER TABLE `users`
|
||||
ADD PRIMARY KEY (`id`);
|
||||
|
||||
--
|
||||
-- AUTO_INCREMENT for dumped tables
|
||||
--
|
||||
|
@@ -86,6 +105,12 @@ ALTER TABLE `images`
|
|||
--
|
||||
ALTER TABLE `sites`
|
||||
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=5297;
|
||||
|
||||
--
|
||||
-- AUTO_INCREMENT for table `users`
|
||||
--
|
||||
ALTER TABLE `users`
|
||||
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1382;
|
||||
COMMIT;
|
||||
|
||||
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
|
||||
|
|