So, after two attempts at using the MySQL built in "search" feature called FullText indexing, I have decided to abandon it as useless for any purpose I have.
MySQL can't really make a normal index on LongText fields, so doing searches on them is fairly slow if you have a lot of rows or really large texts. However, The concept with FullText indexes is that they are more like search engine indexes.
The context: 700MB / 77K row email archive, translated into MySQL as two tables. One has all the header info broken out into individual fields (From, To, Subject, etc) and one is just the texts with four columns: MailID, MailText, MailTextMD5, and MessageBodyMD5.
1) Creating the FullText index is extremely slow. It is incredibly impractical to use because it takes so goddamned long to create. This is clearly not a production tool because its a joke. It takes longer for this thing to index data already in its own format, in its own database, than other tools I've used would take to read that much data off a webserver (htdig is what I'm thinking of here).
For the data in question, it took 7.5 hours to run "alter table MailArchiveTexts ADD FullText(MailText)" on a machine doing nothing else: 1.5gigahertz athlon xp, 500MB of ram. It looked, from the outside, like there was just some serious problem with the code. After an initial spurt of activity, MySQL dropped down to only 1% of CPU usage and barely touched the disk. Every minute there was a hit to the disk.
Pathetic. So, I let it run all night just to see what would happen and it did eventually finish.
2) Disk usage absurd. The FullText index ended up being larger than the original dataset (120%).
3) Different syntax. It uses specialized syntax, so I had to write some extra code to use it. MATCH(FieldName) AGAINST (SearchTerm). Not a big deal, but still, it seemed weak that it didnt also act to speed up normal LIKE and =, etc.
4) Not that fast. Even after having an extra 800MB of disk space used and 7.5 hours of processing time just to make the FullText thing, the times for doing the matches were fairly abominable. The one bright spot was that the mysqld seemed to save recent searches so that if I did a match-against the same term more than once in a session, it remembered and they were nearly instantaneous... sort of. I can't explain it, but for some reason it only cached results for /some/ search terms. On the negative side, the ones it didnt cache seemed to be the ones that took forever to search for.
Running a normal select-like on the table looks like this:
SELECT MailArchiveID FROM MailArchiveTexts WHERE MailText LIKE "%sapphire%"
Those take between 20 and 40 seconds per query, depending on what else is going on on the machine and how large the result set is.
Running the match-against on the table looks like this:
SELECT MailArchiveID FROM MailArchiveTexts WHERE MATCH(MailText) AGAINST ("sapphire")
That took 30-40 seconds the first time and then .05 seconds each time after that during the same session.
However, when I tried some other searches, I got horrifying results.
SELECT MailArchiveID FROM MailArchiveTexts WHERE MATCH(MailText) AGAINST ("mail")
took around 2 minutes per query with the machine otherwise idle. And it didnt change, so the second and third and fourth time I ran it, it took the same amount of time. I suspect its because mysqld was caching the smaller result sets and not caching the result for large numbers of rows(??).
5) Bizarre and annoying behaviour. I realize that I clearly am just not wanting this alpha-quality module, but it also exhibited some annoying weirdnesses. Like not being able to search for 2 or 3 letter combinations. I believe I have to recompile the mysqld to get it to accept 3 letter things in the FullText query, however I can't even imagine how large the index file would be. It also excludes (by default) words that occur in Too Many of the entries. Although I couldn't really figure out why this seemed not to be true all of the time. By the time I was playing with this, I was so irritated and tired of it, I didn't really give it a fair shot.
Overall, I never want to try this thing again. If I want real search functionality, I will probably work on implementing something with Lucene.
I like MySQL, although all my programming friends make fun of me for using it. So I am not saying that I dislike MySQL, just that I have now tried twice to make FullText feature work for me and failed and I don't want to waste my time on it again, so I wrote this to remind myself.Posted by Earth at January 5, 2004 02:11 AM