January 30, 2004

Refactoring Erowid 3 HTML Lib

Things move slowly in mail archive land.

Initially, against my better judgement, I was convinced to refactor a bunch of my codebase to make it easier for my HTML/CSS person to edit the output pages. After dragging my feet, I now find it very obvious that this is a worthwhile endeavor.

I don't know if there is a term for the type of design that I am moving to, where heavily CGI-driven pages have the display rendering (view) code explicitly in the target .. urm.. source code page?

Perhaps the style where the page rendering info is found at the file location the URL points to is called "sane" style?

What I have had before is stuff no one else ever really has to edit directly... so a page in PHP might look like this (in my old style, now renamed "erowid 2.crazy" style):

http://www.erowid.org/references/refs.php

(click on that to see what a generic erowid table listing view looks like)

The code in the source file at that location might look a lot like:

<? // ... several include calls ...
require_once("library1.php"); require_once("...");
$HTMLLib = new HTMLLib_c();
$HTMLLib->Init();

$ViewObject = new ThisDataTypeViewClass_c();
PrintHeader();
if (! $HTMLLib->IsRunLive()) {
    // not running live: serve the cached copy of the page
    $HTMLLib->ShowCache();
} else {
    // running live: render the page from the data
    $ViewObject->RenderPage();
}
PrintFooter();
Quit();
?>

Obviously that's not really what an HTML person wants to see when they open up a file to change something about a page's display.

The new "sane" style looks more like (I am still working on this refactoring):

<? require_once("refs_view_controller.php"); $View = new $DataTypeView(); ?>
<html>
<? virtual("refs_header_file.php"); ?>

<div class="thisorthat">whatever html stuff here...</div>

<? virtual("refs_search_interface.php"); ?>

Table Name
<table>
<tr><th>Field1 Header</th><th>Field2 Header</th></tr>
<?
$View->PrintRows();
function PrintRow($Data) {
?>
<tr class="whatever"><td class="blah2"><? print $Data['Field1'];?></td><td><? print $Data['Field2'];?></td></tr>
<? } /* end PrintRow function definition */ ?>
</table>


<? virtual("refs_list_paging_interface.php"); ?>

<? virtual("refs_footer_file.php"); ?>

</html>
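
A minimal sketch of what the controller side of this might look like, to show how PrintRows() can hand each row back to a PrintRow() function that lives in the page file. The class name RefsView, its constructor, and the sample rows are assumptions made for illustration; the real refs_view_controller.php is not shown here.

```php
<?php
// Hypothetical sketch of a view controller such as refs_view_controller.php.
// RefsView, its constructor, and the sample data are illustrative assumptions.

class RefsView
{
    /** @var array<int, array<string, string>> rows to display */
    private array $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Calls the page-defined PrintRow() once per row, so the row markup
    // stays in the page file where the HTML person can edit it.
    public function PrintRows(): void
    {
        foreach ($this->rows as $row) {
            PrintRow($row);
        }
    }
}

// In the "sane" style this function is defined in the page itself, with its
// body written as plain HTML between the PHP close and open tags.
function PrintRow(array $Data): void
{
    echo '<tr><td>' . htmlspecialchars($Data['Field1']) . '</td><td>'
        . htmlspecialchars($Data['Field2']) . "</td></tr>\n";
}

$View = new RefsView([
    ['Field1' => 'Author A', 'Field2' => '1999'],
    ['Field1' => 'Author B', 'Field2' => '2003'],
]);
$View->PrintRows();
```

The point of the split is that the loop and data access stay in the controller, while the only thing in the page is the markup for one row.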

Posted by Earth at 12:13 PM | Comments (1)

Turing Sniff

As I was browsing the assortment of spam that got through the filter tonight,
I read something that was clearly generated by a robot.

My mind immediately started up a little "Turing Test" thought routine and it
was clear that a piece of email or other "artifact" cannot respond to any
real Turing Test...

The phrase "Turing Sniff" came to mind.

Definition: A loosely defined spin-off of the Turing Test, used to describe an artifact or text which purports to be made by a human but seems to have been generated by a machine. Something which fails the Turing Sniff 'smells like a robot'.

Usage: The machine-generated spam failed a Turing Sniff.

Anyone like that?

Posted by Earth at 09:23 AM | Comments (0)

January 19, 2004

MailArchive Features ToDo

To Add: Threading
- pre-process the archive and look for connections between messages based on
headers: References:, In-Reply-To:, and a combo of Date and Subject line.
CDent suggested looking at Zawinski's threading algorithm: http://www.jwz.org/doc/threading.html
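
A rough sketch of the header-based part of this, assuming each message has already been parsed into an array. This is a simplified parent lookup, not Zawinski's full algorithm, and it does not cover the Date/Subject fallback. The array keys ('id', 'in_reply_to', 'references') and the 'ROOT' bucket are made-up names for illustration.

```php
<?php
// Simplified threading sketch: NOT Zawinski's full algorithm.
// Each message is an array with hypothetical keys 'id', 'in_reply_to',
// and 'references' (whitespace-separated Message-IDs, oldest first).

function find_parent_id(array $msg): ?string
{
    // Prefer the last entry in References:, which is normally the direct parent.
    $refs = preg_split('/\s+/', trim($msg['references'] ?? ''), -1, PREG_SPLIT_NO_EMPTY);
    if ($refs) {
        return end($refs);
    }
    // Otherwise fall back to In-Reply-To:.
    $irt = trim($msg['in_reply_to'] ?? '');
    return $irt !== '' ? $irt : null;
}

function build_children_map(array $messages): array
{
    $known = array_column($messages, 'id');
    $children = [];
    foreach ($messages as $msg) {
        $parent = find_parent_id($msg);
        // Messages whose parent is not in the archive are treated as thread roots.
        $key = ($parent !== null && in_array($parent, $known, true)) ? $parent : 'ROOT';
        $children[$key][] = $msg['id'];
    }
    return $children;
}
```

The resulting map (parent Message-ID to list of child Message-IDs) could be computed once during pre-processing and stored, so the "show messages in thread" link only needs a lookup.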

When viewing a message, provide links to: show messages by this author (by email address), show messages in thread, show messages near this date (+/- 24 hours?).

Add advanced search stuff to the main listing view

Add a display choice for "remove email addresses" (earth@xxx)
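
One possible shape for that display option, as a sketch only. The function name is made up, and the pattern is deliberately loose rather than a complete address parser.

```php
<?php
// Sketch of the "remove email addresses" display option.
// mask_email_addresses() is a hypothetical name; the regex is an assumption
// and is intentionally loose, not a full RFC 5322 address parser.

function mask_email_addresses(string $text): string
{
    // Keep the local part and replace the domain with "xxx".
    $masked = preg_replace(
        '/\b([A-Za-z0-9._%+-]+)@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/',
        '$1@xxx',
        $text
    );
    // preg_replace() returns null on error; fall back to the original text.
    return $masked ?? $text;
}

echo mask_email_addresses("From: earth@example.org (Earth)"), "\n";
// prints "From: earth@xxx (Earth)"
```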

Add more code to the blog.

Add some kind of Attachment type thing to the main table

Fix the ugly display look

Add spam filter hook

Add "highlight + button" search on text.

Posted by Earth at 07:38 AM | Comments (0)

January 05, 2004

Pear MIME Decode Bug

I think I discovered and killed a bug in PHP's PEAR MIME decode code (Mail_mimeDecode).

Basically the main problem is in the way it grabs the boundary strings out of the header. It needs a couple of trim()s in order to work properly if there are spaces around the boundary entries.

In _parseHeaderValue, it needs to look like this:

for ($i = 0; $i < count($parameters); $i++) {
    $param_name = substr($parameters[$i], 0, $pos = strpos($parameters[$i], '='));
    // erowid: trim whitespace around the parameter name
    $param_name = trim($param_name);
    $param_value = substr($parameters[$i], $pos + 1);
    // erowid: trim whitespace around the parameter value
    $param_value = trim($param_value);

    if ($param_value[0] == '"') {
        $param_value = substr($param_value, 1, -1);
    }
    // ... rest of the loop unchanged ...
}
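
To see why the trim() calls matter, here is a small standalone sketch. It is a simplified stand-in written for illustration, not PEAR's actual code: parse_header_params() is a hypothetical function, and the naive split on ';' would break on quoted values that contain semicolons.

```php
<?php
// Simplified, hypothetical stand-in for the parameter loop in _parseHeaderValue.
// It is not PEAR's code; it only shows why the trim() calls matter.

function parse_header_params(string $header): array
{
    $parts = explode(';', $header);
    $result = ['value' => trim(array_shift($parts)), 'params' => []];
    foreach ($parts as $part) {
        $pos = strpos($part, '=');
        if ($pos === false) {
            continue; // not a name=value pair
        }
        // Without these trims, "boundary " and ' "frontier"' keep their spaces.
        $name = trim(substr($part, 0, $pos));
        $value = trim(substr($part, $pos + 1));
        // Strip surrounding double quotes, as the loop above does.
        if ($value !== '' && $value[0] === '"') {
            $value = substr($value, 1, -1);
        }
        $result['params'][strtolower($name)] = $value;
    }
    return $result;
}

$parsed = parse_header_params('multipart/mixed; boundary = "frontier"');
echo $parsed['params']['boundary'], "\n";
// prints "frontier"
```

If the value is not trimmed first, its first character is a space rather than a quote, so the quote-stripping branch never runs and the boundary never matches the one used in the message body.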

Posted by Earth at 07:55 AM | Comments (0)

MySQL 4.0.13 FullText Sucks

So, after two attempts at using MySQL's built-in "search" feature, called FullText indexing, I have decided to abandon it as useless for any purpose I have.

MySQL can't really make a normal index on LongText fields, so doing searches on them is fairly slow if you have a lot of rows or really large texts. However, the concept with FullText indexes is that they are more like search engine indexes.

The context: 700MB / 77K row email archive, translated into MySQL as two tables. One has all the header info broken out into individual fields (From, To, Subject, etc) and one is just the texts with four columns: MailID, MailText, MailTextMD5, and MessageBodyMD5.

The Bad:

1) Creating the FullText index is extremely slow. It is incredibly impractical to use because it takes so goddamned long to create. This is clearly not a production tool because it's a joke. It takes longer for this thing to index data already in its own format, in its own database, than other tools I've used would take to read that much data off a webserver (htdig is what I'm thinking of here).

For the data in question, it took 7.5 hours to run "alter table MailArchiveTexts ADD FullText(MailText)" on a machine doing nothing else: 1.5 GHz Athlon XP, 500MB of RAM. It looked, from the outside, like there was just some serious problem with the code. After an initial spurt of activity, MySQL dropped down to only 1% of CPU usage and barely touched the disk. About once a minute there was a hit to the disk.

Pathetic. So, I let it run all night just to see what would happen and it did eventually finish.

2) Disk usage is absurd. The FullText index ended up being larger than the original dataset (120% of its size).

3) Different syntax. It uses specialized syntax, so I had to write some extra code to use it: MATCH(FieldName) AGAINST (SearchTerm). Not a big deal, but still, it seemed weak that it didn't also speed up normal LIKE and =, etc.
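
The kind of extra code this means could look something like the sketch below. build_text_search() is a hypothetical helper, and it assumes the query is run through an API that supports "?" placeholders; the table and column names are the ones used in the queries in this post.

```php
<?php
// Hypothetical helper: builds either a LIKE query or a MATCH ... AGAINST
// query for the MailArchiveTexts table. Returns the SQL with a "?"
// placeholder plus the value to bind, on the assumption that a
// prepared-statement API executes it.

function build_text_search(string $term, bool $useFullText): array
{
    if ($useFullText) {
        return [
            'SELECT MailArchiveID FROM MailArchiveTexts WHERE MATCH(MailText) AGAINST (?)',
            [$term],
        ];
    }
    // Escape LIKE wildcards so the term is matched literally.
    $escaped = addcslashes($term, '%_\\');
    return [
        'SELECT MailArchiveID FROM MailArchiveTexts WHERE MailText LIKE ?',
        ['%' . $escaped . '%'],
    ];
}

[$sql, $params] = build_text_search('sapphire', true);
echo $sql, "\n";
```

Keeping both forms behind one function makes it easy to switch between them when comparing timings.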

4) Not that fast. Even after having an extra 800MB of disk space used and 7.5 hours of processing time just to make the FullText thing, the times for doing the matches were fairly abominable. The one bright spot was that mysqld seemed to save recent searches so that if I did a match-against on the same term more than once in a session, it remembered and they were nearly instantaneous... sort of. I can't explain it, but for some reason it only cached results for /some/ search terms. On the negative side, the ones it didn't cache seemed to be the ones that took forever to search for.

Running a normal select-like on the table looks like this:
SELECT MailArchiveID FROM MailArchiveTexts WHERE MailText LIKE "%sapphire%"

Those take between 20 and 40 seconds per query, depending on what else is going on on the machine and how large the result set is.

Running the match-against on the table looks like this:

SELECT MailArchiveID FROM MailArchiveTexts WHERE MATCH(MailText) AGAINST ("sapphire")

That took 30-40 seconds the first time and then 0.05 seconds each time after that during the same session.

However, when I tried some other searches, I got horrifying results.

SELECT MailArchiveID FROM MailArchiveTexts WHERE MATCH(MailText) AGAINST ("mail")

took around 2 minutes per query with the machine otherwise idle. And it didn't change, so the second and third and fourth time I ran it, it took the same amount of time. I suspect it's because mysqld was caching the smaller result sets and not caching the result for large numbers of rows(??).

5) Bizarre and annoying behaviour. I realize that I clearly just don't want this alpha-quality module, but it also exhibited some annoying weirdnesses, like not being able to search for 2 or 3 letter combinations. I believe I have to recompile mysqld to get it to accept 3 letter words in the FullText query; however, I can't even imagine how large the index file would be. It also excludes (by default) words that occur in too many of the entries, although I couldn't really figure out why this seemed not to be true all of the time. By the time I was playing with this, I was so irritated and tired of it that I didn't really give it a fair shot.
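
Since short words silently return nothing, a search front end could at least warn about them. A small sketch, assuming a minimum indexed word length of 4 characters (MySQL's documented default) and a simple word-splitting rule that is only an approximation of how the index tokenizes text; the function name is made up.

```php
<?php
// Sketch: list the words in a search string that a FullText index would
// skip for being too short. The default minimum of 4 characters matches
// MySQL's documented default; the word-splitting rule is an assumption.

function words_below_min_length(string $query, int $minLen = 4): array
{
    $words = preg_split('/[^A-Za-z0-9_\']+/', $query, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter($words, fn($w) => strlen($w) < $minLen));
}

print_r(words_below_min_length('php mail bug sapphire'));
// lists "php" and "bug"
```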

Overall, I never want to try this thing again. If I want real search functionality, I will probably work on implementing something with Lucene.

I like MySQL, although all my programming friends make fun of me for using it. So I am not saying that I dislike MySQL, just that I have now tried twice to make the FullText feature work for me and failed, and I don't want to waste my time on it again, so I wrote this to remind myself.

Posted by Earth at 02:11 AM | Comments (0)