I recently finished building an industry-specific search engine. Its primary use case is to drive international and domestic business traffic to Canadian websites doing business in the oceans technology and innovation sectors.
From a technology architecture perspective, we built a search engine for the Canadian oceans supercluster initiative in which all components run on Canadian assets hosted in Canada. We seeded the search engine with the URLs of all the organizations identified as participants in this economic sector. The indexing process analysed each URL and followed all links up to two hops deep. All the identified URLs were then scored using a web graph, and the top-scoring pages were indexed.
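To make the crawl-and-score idea concrete, here is a minimal sketch of a two-hop breadth-first walk followed by a PageRank-style score over the discovered pages. It is an illustration only, not how Nutch implements its web graph scoring: the toy `links` dictionary, the `.example` URLs, and the function names are all assumptions made up for this example.

```python
from collections import deque


def crawl_within_hops(seeds, links, max_hops=2):
    """Breadth-first walk from the seeds, following links at most max_hops deep."""
    depth = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if depth[url] == max_hops:
            continue  # do not follow links past the hop limit
        for target in links.get(url, []):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth


def score_pages(pages, links, damping=0.85, iterations=20):
    """PageRank-style scoring restricted to the discovered pages."""
    pages = list(pages)
    page_set = set(pages)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / len(pages) for p in pages}
        for p in pages:
            out = [t for t in links.get(p, []) if t in page_set]
            if not out:
                # Dangling page: spread its score evenly over all pages.
                for q in pages:
                    nxt[q] += damping * score[p] / len(pages)
            else:
                for t in out:
                    nxt[t] += damping * score[p] / len(out)
        score = nxt
    return score


def top_pages(score, n):
    """Return the n highest-scoring pages (ties broken alphabetically)."""
    return sorted(score, key=lambda p: (-score[p], p))[:n]


# Hypothetical link graph: page -> pages it links to.
links = {
    "a.example": ["a.example/about", "b.example"],
    "a.example/about": ["a.example/team"],
    "a.example/team": ["a.example/deep"],
    "b.example": ["a.example/about"],
}
depth = crawl_within_hops(["a.example"], links, max_hops=2)
score = score_pages(depth, links)
print(top_pages(score, 2))
```

In this toy graph, `a.example/deep` sits three hops from the seed, so it is never discovered and never scored.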
The architecture decisions
The NELK stack became our back-end infrastructure.
A number of important architecture decisions were made early on as the project was detailed. Most decisions were made in favour of technologies the small team was already familiar with. Where the team wasn't familiar, we chose technologies that had the most industry support and local resources in our personal networks, so we could get help if we needed it. We ended up with Nutch feeding the ELK stack and Wordpress providing the UX. Within the project this became known as the NELK stack.
- Nutch - for web crawling and the first round of web page extraction and cleansing
- ElasticSearch (ES) - as the search engine / data manager
- Logstash - for data transformation and loading
- Kibana - as the administration / developer console
We used Nutch to crawl the internet for ocean-sector-specific web pages. We also needed to integrate with ElasticPress so the broader ecosystem search included the contents of our website's Wordpress database. To do all this we settled on Nutch 1.15 because it integrated best across our technology stack. We followed the Nutch recommended approach of seeding, ingesting, fetching, and removing duplicates as we prepared the data for export to ElasticSearch. Due to versioning issues we exported the Nutch database to CSV before importing the data. For the first load of data, our use of Nutch produced the following page loading metrics:
- seeded with 2612 domain names
- removed 709 duplicate or erroneous domain names
- identified 86872 candidate webpages
- fetched the 29323 most relevant web pages (based upon web graph algorithms)
- indexed 29270 pages into ElasticSearch
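The numbers above form a funnel, and a few lines of arithmetic make the ratios easier to see. This is only a sketch over the figures listed; it assumes the 709 removed domain names all came out of the 2612 seeds, which the list implies but does not state outright.

```python
# Figures taken from the first data load described above.
seeded = 2612
removed = 709
candidates = 86872
fetched = 29323
indexed = 29270

# Assumption: every removed domain name was one of the original seeds.
usable_seeds = seeded - removed
fetch_rate = fetched / candidates   # share of candidate pages that were fetched
index_rate = indexed / fetched      # share of fetched pages that were indexed
not_indexed = fetched - indexed     # fetched pages that did not reach the index

print(f"usable seed domains: {usable_seeds}")
print(f"fetched / candidates: {fetch_rate:.1%}")
print(f"indexed / fetched: {index_rate:.1%}")
print(f"fetched but not indexed: {not_indexed}")
```

Roughly a third of the candidate pages were fetched, and nearly all of the fetched pages made it into the index.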
We used Logstash to bring the Nutch exported CSV data into ElasticSearch. Coding up the Logstash job was fairly easy; the most important aspect was choosing the correct Logstash filter. It was better to use the dissect filter rather than the csv filter. More on this in a later post. In the end, I was amazed at how quickly Logstash loaded, and ElasticSearch indexed, all the data.
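For readers who have not seen what a transform-and-load step like this amounts to, here is a rough Python sketch that turns CSV rows into the newline-delimited JSON body that the ElasticSearch bulk API accepts. It is not the Logstash job we ran. The column names (`url`, `title`, `content`), the index name `pages`, and the function name are hypothetical, and using the URL as the document id is a choice made for this example.

```python
import csv
import io
import json


def rows_to_bulk(csv_text, index_name):
    """Turn CSV rows into an ElasticSearch _bulk request body (NDJSON)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Action line: index this document, using the URL as its id so that
        # reloading the same page overwrites it instead of duplicating it.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": row["url"]}}))
        # Source line: the document itself, one JSON object per line.
        lines.append(json.dumps(row))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample of an exported CSV with a header row.
sample = "url,title,content\nhttps://example.org/,Example,Ocean sensors\n"
body = rows_to_bulk(sample, "pages")
print(body, end="")
```

Each document becomes two lines: an action line saying where it goes, and the document source.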
Once all the data was loaded into ElasticSearch, I used Kibana to confirm that it had been loaded correctly into the data repository. Kibana has a very intuitive interface, and creating filters and running queries to confirm the successful loading of data was straightforward. I look forward to using Kibana to manage the repository and create meaningful dashboards.
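The kind of check described here can also be written down as query bodies. The sketch below builds two ElasticSearch query DSL bodies that could be sent to an index's `_count` endpoint, for example from the Kibana Dev Tools console. The field names `title` and `host` are assumptions for illustration; they are not necessarily the fields in our index.

```python
import json


def missing_field_query(field):
    """Query DSL body matching documents that lack the given field."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def domain_filter_query(domain_field, domain):
    """Query DSL body matching documents whose keyword field equals `domain`."""
    return {"query": {"term": {domain_field: domain}}}


# Hypothetical checks: pages loaded without a title, and pages from one host.
print(json.dumps(missing_field_query("title"), indent=2))
print(json.dumps(domain_filter_query("host", "example.org"), indent=2))
```

A count of zero for the first query would suggest every loaded page has a title; the second gives a quick per-domain spot check against what the crawl reported.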
Integrating with Wordpress
Two components connected ElasticSearch to Wordpress:
- The ElasticSearch (ES) PHP library, which provides a mature (and easy to use) set of features for building your own interface into ES using PHP.
- ElasticPress, which allows automated ElasticSearch integration with a Wordpress database.
In conclusion, using Nutch with the ELK stack provides a very strong search engine that integrates easily with Wordpress on the front-end. The learning curve for this approach was not overwhelming, and whenever challenges presented themselves the online groups helped us resolve issues within days.
Special thanks to the team put together by Triware Technologies. Without all the other technical people, analysts, business people, data entry, project managers, ACOA, OSC, ElasticSearch support, Azure support, and those clearing the way... none of this would have been possible. Thank you!