Torsten Curdt’s weblog

Full-text search with Cocoa

In Java land, the answer to searching is usually Lucene. When building a Mac OS X or iPhone application, unfortunately, the answer is not that simple.

Recently I needed to build a search index of some data for an iPhone project and was a little surprised by the lack of options. Again my first thought was Lucene, more specifically the C port of it. But unfortunately that port was abandoned somewhere along the way, and a new attempt has not even reached the alpha phase. So what to do? Port the Lucene Java code to Objective-C? That sounded a bit out of scope for the iPhone project. In the end I found two options.

LuceneKit

Luckily, someone else has already ported Lucene 2.x to Objective-C, although for GNUstep. With only a little work I got it working on both Mac OS X and the iPhone. I’ve forked the official svn repository via git-svn, applied my changes and added some examples for Mac OS X and iPhone. It’s available on GitHub. Here is how you use it.


LCFSDirectory *rd = [[LCFSDirectory alloc] initWithPath: @"/path/to/index" create: YES];

LCSimpleAnalyzer *analyzer = [[LCSimpleAnalyzer alloc] init];
LCIndexWriter *writer = [[LCIndexWriter alloc] initWithDirectory: rd
                                                       analyzer: analyzer
                                                         create: YES];

while(..) {
    NSString *content = @"...";

    LCDocument *doc = [[LCDocument alloc] init];

    LCField *f1 = [[LCField alloc] initWithName: FIELD_CONTENT
                                         string: content
                                          store: LCStore_NO
                                          index: LCIndex_Tokenized];                                         

    LCField *f2 = [[LCField alloc] initWithName: FIELD_ID
                                         string: @"the id"
                                          store: LCStore_YES
                                          index: LCIndex_NO];
    [doc addField: f1];
    [f1 release];

    [doc addField: f2];
    [f2 release];

    [writer addDocument: doc];
    [doc release];
}

[writer close];
[writer release];
[analyzer release];
[rd release];

The source code above is an example of how to create the index. Of course, that’s something you should not be doing at runtime (if possible). Then all you need to do in your application is open the index in read-only mode.


LCFSDirectory *rd = [[LCFSDirectory alloc] initWithPath: @"/path/to/index" create: NO];

With a searcher on top of that directory (the searcher variable below) you are ready to do some searching.


// build a term query against the indexed content field
LCTerm *t = [[LCTerm alloc] initWithField: FIELD_CONTENT text: searchText];
LCTermQuery *tq = [[LCTermQuery alloc] initWithTerm: t];
LCHits *hits = [searcher search: tq];

LCHitIterator *iterator = [hits iterator];
while([iterator hasNext]) {
    LCHit *hit = [iterator next];

    // FIELD_ID was stored in the index, so it can be read back from the hit
    NSString *docId = [hit stringForField: FIELD_ID];
    NSLog(@"%@ -> %@", hit, docId);
}
int results = [hits count];

[tq release];
[t release];

Unfortunately the Objective-C port is still quite alpha. I ran into problems when indexing bigger chunks of data. It doesn’t look like a really big thing to fix, but I didn’t have the time to look into it.

sqlite

So what about using sqlite? While it does provide full-text search, the version on the iPhone does not have the feature compiled in. Bummer! But no problem: you can just use your own version of sqlite and you are good to go. I found the easiest way to do this is to download the amalgamation source and add it directly to the Xcode project. It’s really just one large .c file. To enable full-text search, all you need to do is add a define at the top of the file.


#define SQLITE_ENABLE_FTS3

While you are now ready to go on the iPhone, you still need to build the db itself. For that you also need a sqlite on Mac OS X that supports the virtual table syntax. Again, just use the amalgamation source and build it with

CFLAGS="-DSQLITE_ENABLE_FTS3=1" ./configure
make install

In order to run the new sqlite you need to set DYLD_LIBRARY_PATH to point to the folder that contains the shared library (the libsqlite3.dylib file).

export DYLD_LIBRARY_PATH=/path/to/dylib:$DYLD_LIBRARY_PATH

Now you write the SQL that creates and fills the database for you. Sqlite has a special table syntax for full-text search indexes.

CREATE VIRTUAL TABLE content_search using FTS3(id,content);
INSERT INTO content_search VALUES ('someid', 'content without stopwords');
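When the INSERT statements are generated from real content, single quotes inside the text have to be doubled so the SQL stays valid. The following is a minimal sketch of such a generator in C. The helper names sql_quote and write_insert and the fixed buffer sizes are assumptions made for illustration; they are not part of sqlite or of the original project.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes value as a single-quoted SQL string literal into out, doubling
   every embedded single quote. Returns the number of characters written,
   or 0 if out is too small (out is then left in an unspecified state). */
size_t sql_quote(const char *value, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size < 3) return 0;               /* room for '' plus NUL */
    out[n++] = '\'';
    for (; *value; value++) {
        size_t need = (*value == '\'') ? 2 : 1;
        /* keep room for the closing quote and the terminating NUL */
        if (n + need + 2 > out_size) return 0;
        if (*value == '\'') out[n++] = '\'';
        out[n++] = *value;
    }
    out[n++] = '\'';
    out[n] = '\0';
    return n;
}

/* Prints one INSERT for the content_search table used in the article.
   Returns 0 on success, -1 if a value does not fit or writing fails. */
int write_insert(FILE *fp, const char *id, const char *content)
{
    char qid[256];        /* hypothetical size limits for this sketch */
    char qcontent[4096];
    if (!sql_quote(id, qid, sizeof qid)) return -1;
    if (!sql_quote(content, qcontent, sizeof qcontent)) return -1;
    if (fprintf(fp, "INSERT INTO content_search VALUES (%s, %s);\n",
                qid, qcontent) < 0) return -1;
    return 0;
}
```

Calling write_insert(stdout, "someid", "it's content") prints INSERT INTO content_search VALUES ('someid', 'it''s content'); and the output of many such calls can be collected in content.sql.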

For better performance and efficiency you should remove stopwords from the content before inserting it.

sqlite3 content.db < content.sql
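The stopword removal mentioned above could be done while generating content.sql. Here is a minimal sketch in C, assuming a tiny hard-coded English stopword list and whitespace-separated words; both the list and the function names are hypothetical, and a word with punctuation attached (such as "the,") would not be recognized.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, deliberately tiny stopword list; a real one would be larger. */
static const char *const STOPWORDS[] = {
    "a", "an", "and", "is", "of", "the", "to"
};

/* Returns 1 if the word of length len is a stopword, ignoring ASCII case. */
static int is_stopword(const char *word, size_t len)
{
    size_t count = sizeof STOPWORDS / sizeof STOPWORDS[0];
    for (size_t i = 0; i < count; i++) {
        const char *sw = STOPWORDS[i];
        if (strlen(sw) != len) continue;
        size_t j = 0;
        while (j < len && tolower((unsigned char)word[j]) == sw[j]) j++;
        if (j == len) return 1;
    }
    return 0;
}

/* Copies in to out without stopwords, separating the remaining words
   with single spaces. out must have room for strlen(in) + 1 bytes. */
void strip_stopwords(const char *in, char *out)
{
    char *dst = out;
    while (*in) {
        while (*in && isspace((unsigned char)*in)) in++;    /* skip whitespace */
        const char *start = in;
        while (*in && !isspace((unsigned char)*in)) in++;   /* scan one word */
        size_t len = (size_t)(in - start);
        if (len == 0 || is_stopword(start, len)) continue;
        if (dst != out) *dst++ = ' ';
        memcpy(dst, start, len);
        dst += len;
    }
    *dst = '\0';
}
```

The output is never longer than the input, so a buffer of the same size as the original text is enough.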

Once you have the database file, make sure it's included in your project's bundle. That's where you will open the database from.


NSString *filePath = [[NSBundle mainBundle] pathForResource:@"content" ofType:@"db"];

Then on application launch you prepare the statement:


sqlite3 *database;
sqlite3_stmt *statement;

if (sqlite3_open([filePath UTF8String], &database) == SQLITE_OK) {

    // snippet() takes the FTS table name as its first argument
    const char *sql = "select id, snippet(content_search, '[', ']', '...' ) as extract from content_search where content match ?";

    if (sqlite3_prepare_v2(database, sql, -1, &statement, NULL) != SQLITE_OK) {
        NSLog(@"failed to prepare statement: %s", sqlite3_errmsg(database));
    }
} else {
    NSLog(@"failed to open database");
}

You can then use the prepared statement to search inside the content and step through the result set.


NSString *searchText = @"...";

sqlite3_bind_text(statement, 1, [searchText UTF8String], -1, SQLITE_TRANSIENT);

// step through every matching row
int found = 0;
while (sqlite3_step(statement) == SQLITE_ROW) {
    const char *str = (const char *)sqlite3_column_text(statement, 0);
    NSLog(@"found id '%s'", str);
    found++;
}

if (found == 0) {
    NSLog(@"not found");
}

// reset so the statement can be reused for the next search
sqlite3_reset(statement);

So as a final word: I was really impressed by sqlite. But the full-text search engine is quite limited. If you need some more flexibility (like a different stemmer or search ranking), LuceneKit might be the way to go. I bet the fixes are not that hard. And it would be great to see the code maybe find its way "back" to the Lucene project. At least it is already released under the Apache License 2.0.

  • Just came across another open source option yesterday.
    Apparently Aron's mini-persistence layer does now also support full text search.

    http://github.com/hillegass/BN...

  • To solve this very problem I have been working with Locayta to port their full text search engine to iOS. We now have a beta version available and are looking for more beta testers to try it out and provide feedback.

    http://www.locayta.com/pages/n...

    Here is a quick blurb about the search engine library:

    Locayta Search for iOS is a port of Locayta's full text search engine library for the iOS platform. The core library is pure C (with a bit of C++) and we have wrapped a higher level Objective-C API around it and produced a static library in a Framework bundle so that iPhone & iPad apps can provide fast local full text search. The search engine provides "enterprise level" full text search using a probabilistic model of document terms, along with clever features to improve search success such as automatic spell correction (based on trigram analysis of terms) and word stemming.

    Hope you don't mind me posting this info here, but we are happy to be able to offer a commercial option to this problem.
  • fedmest
    There seems to be an issue with custom SQLite builds and iOS 3.2 - the problem only presents itself when building against iOS 4 and deploying to iOS 3.2 on the iPad. As my application has separate targets for iPhone and iPad, I just changed the Base SDK of the iPad target to iOS 3.2 and re-built. It works a charm…

    Federico
    http://blog.federicomestrone.c.../
  • U'suf
    Hi, I am trying to use fts3 but I am stuck on this problem.

    My tables are all created as needed for fts3.

    I want to query "SELECT * FROM tableName WHERE tableName MATCH ?"

    This works for me on the command line, but in my iPhone application the return value from sqlite3_prepare_v2 is not SQLITE_OK.

    Can any one help me out with this
  • I use a plain sqlite for text search.

    My statement looks like:

    SELECT * from table WHERE haystack like '%needle%';

    like seems not to be case sensitive so it fits my needs. I know that there is no stemming and so on, but it was a very easy approach.
  • I presume you want to know the array index for every hit (not the query). You store the array index as field in the index and you can then access it as described in the example.
  • Ashar
    Fair enough ;)

    I have an array of items which contains the text i want to search so what i do is parse through this collection and build up my indexer , and when i perform the search i am getting an appropriate number of hits as well. The problem is how do i get hold of the Item object which contains my query?


    --
    Thanks
    Ashar
  • @ Ashar: No, I haven't had the time to look into that any further. But the source is there :)
  • Ashar
    Hey

    Were you able to fix the bug you encountered when searching with the data amounting to a couple of MB?

    --
    Thanks
    Ashar
  • @James: Sorry for the late response. Was away.

    The limitation used to be in the AppDelegate. If you are talking about the limitation that Stefan was hitting. I removed that based on his comment. That was just in the example.

    For the actual problems that's a little more complicated. You will have to index a couple of MBs of data to see LuceneKit fall over.

    And thanks for the pointer to CLucene. I had not come across this one yet. Not sure how much fun it is to use on the iPhone though. Please let me know if you get it working. Certainly I would prefer LuceneKit to just get a little more stable.
  • Where exactly in the code do you set the limitation? I would like to look into this as well, but I can't seem to find it.

    Also, have you looked at CLucene (http://sourceforge.net/project.../)? It is a C++ port of Lucene and seems to be active. I have made some initial attempts at compiling it for the iPhone, but I have not been able to make it work yet.

    Thanks,

    James
  • Stefan Shurman
    I'll keep an eye on it and try to get it work with larger docs, when I have more time. If I'm able to fix it, I'll send you the code.

    Regards,

    Stefan
  • Nice text for the testing ;) Stefan, yes check the sources. It does indeed only read in the first n lines. I've encountered some problems with bigger chunks of data (see the article). But I should probably get rid of that in the example code no matter what. If you are looking into indexing a couple of MB of data you will probably have to fix the bug I encountered.
  • Stefan Shurman
    Thanks for your quick reply.

    I've deactivated auto correction. Now it works on some words.
    I've used Faust (ASCII) for testing: http://www.gutenberg.org/dirs/...

    I get hits on the words (for example): gutenberg, johann. But not on the words FAUST, Part, ueberhaupt.

    Is there a limitation on the characters that are read in?

    Overall I have to say: nice work! Cool.

    Regards,

    Stefan
  • Hey Stefan, be aware that by default the search is case-sensitive! So auto-correction sometimes comes in the way. But if I put in "test" exactly I do get the hit from the data.txt file. Let me know if that works for you.
  • Stefan Shurman
    Hello Torsten,

    thanks for the article.
    I've tested your GitHub package on the iPhone.
    It compiles correctly and logs: "searching:"
    but I always get 0 hits. I've put some text
    in data.txt - but it hasn't led to anything :-(

    Regards,

    Stefan