digitalpebble
 
  Customisation of StormCrawler. Text classification with Apache Spark.
     
  Design and implementation of a WARC exporter for Nutch. Crawl manager for the Feb 2016 dataset.
     
  Strategy review and design for low-latency, scalable fetching of web pages using StormCrawler and Nutch.
     
  Low-latency web scraper for job adverts based on StormCrawler.
     
  Design and implementation of crawlers for shopping sites based on StormCrawler.
     
  Customisation and recommendations on best practices for Apache Nutch.
     
  Customisation of StormCrawler for low-latency, scalable fetching of web pages.
     
  Customisation of Apache Nutch and StormCrawler for crawling images using Amazon Web Services.
     
  Customisation of Apache Nutch for a flexible and scalable vertical crawl.
     
  Development of custom GATE plugins and resources for Named Entity Extraction on contracts.
     
  Customisation and hosting of Nutch for a vertical crawl.
     
  Auditing, redesign and optimisation of a Solr setup for a real estate search system using geo-location.
     
  Whole web crawling using Nutch on Amazon EC2. Development of custom Nutch plugins and resources.
     
  Development of custom Nutch plugins and resources. Monitoring of crawls. Deployment and tuning of Solr instances.
     
 
  Consulting on GATE for Named Entity Recognition; improvement of the accuracy of the ANNIE application.
     
  Port of the RASP application to Apache UIMA.
     
  Design of an advanced architecture for a search solution based on Nutch / Lucene. Work on a performance benchmark and optimisation of the results.
Work on Term Extraction and Clustering, Text Classification, Ontology Learning and custom Information Extraction.
     
  Strategy review and integration design for mobile content based search engines.
     
 

Implementation of search functionality based on Lucene and compliant with the OpenSearch standard.

Design and development of a text classification web service. It identifies junk posts in a collection of forum pages indexed with Lucene. Filtering them out improves the relevance of the search engine results, because junk posts tend to rank high due to the repetition of keywords (e.g. product names). The message format used by the service is based on Solr.
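
As an illustration of the idea only, the following is a minimal sketch of a junk-post classifier, assuming a multinomial Naive Bayes model with add-one smoothing. The project description does not say which algorithm the service used; the class name, the labels "junk" and "ok", and the training sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real service would reuse the index analyzer.
    return text.lower().split()


class NaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: junk posts repeat product keywords.
model = NaiveBayes()
model.train("cheap phone cheap phone buy phone phone deals", "junk")
model.train("phone phone phone best price phone", "junk")
model.train("my phone battery drains quickly after the update", "ok")
model.train("how do I reset the settings on this model", "ok")

junk_label = model.classify("phone phone cheap phone")
ok_label = model.classify("battery drains after update")
```

Posts labelled as junk could then be excluded or down-weighted at query time, so that keyword repetition alone no longer pushes them to the top of the results.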