Overview
I thought I would quickly write down some of the work in progress on the Drupal 6 upgrade of the Nutch module, which I have been redeveloping over the past month. Past implementations of the Nutch module used the Open Search Client module to display crawl results. While this was a perfectly reasonable solution, I felt that continuing down this line was not right, for a number of reasons:
* The Open Search Client module lacks a Drupal 6 version
* The Nutch crawler now supports pushing its results into Apache Solr
* The Drupal Apache Solr module is under very active development
* The integration options between Apache Solr and Views 3 were too exciting to pass up
Nutch Module Development
This part was quite quick, and I have lots of ideas in this area, including allowing you to set the crawl seed from Drupal, manage the crawl, and see reporting about its success. Unfortunately I have put this on hold because I hit a blocker early on. I wanted to have both a native Drupal Solr index (nodes etc.) and a bunch of crawled pages, but Nutch's Solr implementation has a fixed data structure and a few hacks (in my opinion) that clash with the Drupal index. As a result, I have extended Nutch to allow you to specify the field structure, have submitted a patch, and am waiting for it to be included.
I am also working on a Nutch plugin that allows you to pull specific information from a web page, either by regex or via XSLT, which will help the cause.
That's about it for now. If anyone has other ideas for things they want or need, please post a comment or send me a mail or a tweet.
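To give a rough idea of the regex side of the extraction plugin mentioned above, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the real plugin would be a Java Nutch plugin, and the field names, patterns, and the `extract_fields` helper below are hypothetical, not part of Nutch or the Drupal module.

```python
import re

# Hypothetical rules: target index field name -> regex with one capture group.
FIELD_RULES = {
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "author": re.compile(r'<meta\s+name="author"\s+content="([^"]*)"', re.IGNORECASE),
}


def extract_fields(html, rules=FIELD_RULES):
    """Return a dict of field name -> first captured match.

    Rules whose pattern does not match the page are skipped.
    """
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(html)
        if match:
            fields[name] = match.group(1).strip()
    return fields
```

Calling `extract_fields` on a fetched page gives a small dictionary that could then be mapped onto whatever field structure the index expects. Regexes are fragile against arbitrary HTML, which is one reason an XSLT route is also worth having.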