Regex-based simple search engine in PHP
Some notes about this code:
- The directory scanning makes some assumptions about file structure and naming that are specific to my setup (all content is in a single directory, and each 'item' ends in '.txt'). For a general search engine, the directory scanning would have to be refined.
- The code was mainly written so that I had something to demonstrate very basic keyword search techniques in my essay 'Three approaches to searching'.
- It is more than obvious that this search engine can be improved in many ways, but that was never the intention behind this code. I will be writing other engines that demonstrate different approaches.
- The code below does more than regex-based searching; it is also a simple example of using the Amazon XML API.
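The basic keyword search idea mentioned in the notes can be sketched in isolation. This is only a minimal illustration, not the listing itself: the function name `score_document` and the sample text are hypothetical, and it assumes that every keyword found at least once adds an equal share, 1/(number of keywords), to a document's score.

```php
<?php
// Hypothetical helper (not part of the listing below): score one document
// against a list of keywords. Each keyword that occurs at least once adds
// 1/count($keywords) to the score, so a document containing every keyword
// scores 1.0.
function score_document($text, array $keywords) {
    if (count($keywords) === 0) {
        return array(0.0, array());
    }
    $weight = 1 / count($keywords);
    $score  = 0.0;
    $counts = array();
    foreach ($keywords as $k) {
        // preg_quote() keeps regex metacharacters in a keyword from being interpreted
        $n = preg_match_all('/\b' . preg_quote($k, '/') . '\b/i', $text, $m);
        if ($n) {
            $score += $weight;
            $counts[$k] = $n; // number of occurrences of this keyword
        }
    }
    return array($score, $counts);
}

// One of the two keywords matches (twice), so the relevance is 50 percent.
list($score, $counts) = score_document('PHP regex search: a regex demo', array('regex', 'perl'));
echo round($score * 100), "\n"; // prints 50
```

Ranking documents is then a matter of sorting them by this score in descending order, which the listing does with `arsort()`.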
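<?php
// Fall back to the request parameters when the including page has not
// set these variables (register_globals no longer exists).
if (!isset($q)) $q = isset($_GET['q']) ? $_GET['q'] : '';
if (!isset($searchtype)) $searchtype = isset($_GET['searchtype']) ? $_GET['searchtype'] : '';
if (!isset($where)) $where = isset($_GET['where']) ? $_GET['where'] : '';
if (!empty($q)) {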
if (is_array($q)) {
$q = join(" ",$q);
}
// prepare for urls in query
$q = preg_replace("/\/|(http(s)*:|www|\.de|\.net|\.com|\.org)/",' ',$q);
textise($q);
$query = preg_split("/\s+/", $q, -1, PREG_SPLIT_NO_EMPTY); // drop empty tokens so they don't skew the per-keyword weight
$q = join(" ", $query); // a clean version
} else {
$q = '';
}
// prepare the searchform
ob_start();
?>
<form>
<input type="text" name="q" value="<?php echo $q; ?>"><br>
<input type="submit" value="brute force search">
</form>
<!-- http://www.feedster.com/search.php?hl=en&ie=UTF-8&q=test&btnG=Search&sort=date
-->
<form method="get" action="http://www.feedster.com/search.php" target="_blank">
<input type="hidden" name ="sort" value="relevance">
<input type="text" name="q" value="<?php echo $q; ?>"><br>
<input type="submit" value="feedster">
</form>
<form method="get" action="http://www.google.com/search" target="_blank">
<small>
<input type="text" name="q" size="16" maxlength="255" value="<?php echo $q; ?>"><br />
<input type="submit" name="sa" value="Google">
</small>
</form>
<form>
<input type="hidden" name="searchtype" value="amazon.de">
<input type="text" name="q" value="<?php echo $q; ?>"><br>
<label><input type="radio" name="where" value="books-de" <?php
if (@$where=="books-de") echo 'checked';
?>>German</label>
<label><input type="radio" name="where" value="books-de-intl-us" <?php
if (@$where=="books-de-intl-us" || @$where=='') echo 'checked';
?>>English</label><br>
<input type="submit" value="Amazon.de">
</form>
<?php
$sform = ob_get_contents();
ob_end_clean();
if (preg_match("/\w+/", $q)) {
if ($searchtype=="amazon.de") {
$bfout = $sform;
if (empty($where)) $where="books-de-intl-us"; // default to the English-language catalogue
$amznurl = "http://xml-eu.amazon.com/onca/xml3";
$amznurl .= "?t=traumwind-21&dev-t=D32B4SO2Q54BGU&locale=de&PowerSearch=";
$amznurl .= "keywords:+";
$amznurl .= urlencode($q);
$amznurl .= "&mode=";
$amznurl .= urlencode($where); // request parameter, so encode it
$amznurl .= "&type=lite&page=1&sort=+daterank&f=http://traumwind.de/books/traumwind.xsl"; //
$f = fopen($amznurl, "r");
if (!$f) die("<hr>can't open <a href='$amznurl'>$amznurl</a>");
$amzn = ''; // initialise the buffer before appending to it
while (!feof($f)) {
$amzn .= fread($f, 4096);
}
fclose($f);
$bfout .= "
<style>
#amzn_list #detail {
margin: 1em;
padding: 2em 1em 1em 1em;
height: 240px;
border-top: 1px solid black;
border-left: 1px solid black;
border-right: 3px solid black;
border-bottom: 3px solid black;
}
</style>";
if (strlen($amzn)<=100) {
$bfout .= "<div id='amzn_list'><div id='detail'><h2>Amazon returned <b>no results</b></h2>
<p>If you were using more than one keyword, try putting 'or' between them.</p></div></div>\n";
} else {
$bfout .= $amzn;
}
$bfout .= "<br /><a href='$amznurl'>debug url</a><br />";
$sbeltoutput = $bfout;
} else {
// Brute Force PHP searching
// take a list of keywords and simply scan each and every document for matches
// (try to scan younger docs first)
// martin@traumwind.de May 2003
// where do we find the docs?
$postingspath = "./archive";
// how do we identify a doc from other files in dir?
$postingsregex = '\.txt$'; // escape the dot so it only matches a literal '.'
$percperhit = 1/count($query);
if($q && $q!='') {
$bfout = $sform;
$posts = findposts($postingspath, $postingsregex);
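$total = $words = $titles = array(); // per-document score, keyword counts, and titles
foreach ($posts as $pname=>$ppath) {
$s = file_get_contents($ppath);
// fall back to the file name when a post has no 'title' line
$title = preg_match("/title ([^\n]+)/", $s, $m) ? $m[1] : $pname;
textise($s);
foreach($query as $k) {
if (!preg_match('/\w+/', $k)) continue;
// quote the keyword and match on word boundaries, so hits at the very
// start or end of the text are counted too
if ($res = preg_match_all('!\b'.preg_quote($k, '!').'\b!i', $s, $m)) {
if (!isset($total[$pname])) $total[$pname] = 0;
$total[$pname] += $percperhit;
$words[$pname][$k] = $res;
$titles[$pname] = strip_tags($title);
}
}
}
if (!empty($total)) {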
arsort($total);
$bfout .= "<ol>\n";
foreach($total as $pname=>$rel) {
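// round() rather than floor(): summed float weights can land just below a whole percent
$bfout .= "<li>(".round($rel*100).") <a href='index.php?detail=".urlencode($pname)."'>";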
$bfout .= $titles[$pname];
$bfout .= "</a><br>";
arsort($words[$pname]);
foreach($words[$pname] as $word=>$count) {
$bfout .= "<b>$word</b>: $count \n";
}
$bfout .= "</li>\n";
}
$bfout .= "</ol>\n";
} else {
$bfout .= "<p>no results for <b>'$q'</b></p>\n";
}
// let's view this in the content area
$sbeltoutput = $bfout;
}
}
}
// functions
function textise(&$s) {
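$search = array (
"'title ([^\n]*)'", // Drop the 'title ' label; the replacement doubles the title text so title words count twice
"'author \w+'", // Strip out the 'author <name>' header
"'<script[^>]*?>.*?</script>'si", // Strip out javascript
"'<[\/\!]*?[^<>]*?>'si", // Strip out html tags
"'\W'", // Replace every non-word character (punctuation) with a space
);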
$replace = array (
"$1 $1",
"",
"",
"",
" ",
);
$s = preg_replace ($search, $replace, $s);
}
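function findposts($path, $regex) {
$handle = opendir($path) or die ("$path was not found");
$dir_arr = array();
// compare against false so an entry named "0" does not end the loop
while (($entry = readdir($handle)) !== false) {
// eregi() was removed in PHP 7; use the PCRE equivalents
if (preg_match("/$regex/i", $entry)) {
$entry_name = preg_replace("/$regex/i", "", $entry);
$dir_arr[$entry_name] = "$path/".$entry;
}
}
closedir($handle);
// sort in reverse alphabetical order, so later-named posts are scanned first
arsort($dir_arr);
return ($dir_arr);
}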
?>
All image, text, and audio material is © Martin Spernau; use and reproduction require the author's permission.