Unique challenges need bespoke solutions

Our unique expertise covers all aspects of a document’s life cycle, from web-wide crawling and collection, through content analysis, filtering and categorisation, to indexing.

DigitalPebble can help your organisation by advising on best practice, identifying suitable resources, and designing and implementing scalable solutions. We can help you deploy and monitor your project on your premises or in the cloud.

  • Open source leader
  • Range of expertise
  • Proven track record
  • Web Crawling

    We are the authors and maintainers of StormCrawler, one of the leading open-source solutions for web crawling. Used by numerous companies all over the world, it is both scalable and highly configurable.

    We can help you customise StormCrawler and run it on your premises or in the cloud, or, alternatively, DigitalPebble can run it on your behalf.

  • Big Data

    Processing data on a large scale, in either streaming or batch mode, can be done with platforms such as Apache Flink or Apache Storm.

    In fact, we have built some of our open source solutions on these platforms and have extensive experience of using them for our clients.

    Combined with our know-how and expertise in cloud computing, we are confident we can help you deliver your project, no matter how much data you have.

Julien Nioche - Director

Having studied Russian language and culture in Paris and taught French in a school in Kyiv, Ukraine, Julien went on to graduate in Text Engineering and Natural Language Processing. He moved to the UK to work as a researcher at the University of Sheffield in 2005 and founded DigitalPebble in 2008.

Julien has been involved in several open source projects, mainly at the Apache Software Foundation, and was the PMC chair for Apache Nutch. He is an Emeritus member of the Apache Software Foundation.

Julien runs workshops on web crawling, speaks at conferences and reviews technical books. He has over 20 years of experience in the Java programming language.

References

  • Gage

    Advice on best practices for web crawling and customisation of StormCrawler.

  • CameraForensics

    Customisation of StormCrawler for crawling images with Elasticsearch.

  • PoleCat

    Customisation of StormCrawler. Text classification with Apache Spark.

  • Career Builder

    Low-latency web scraper for job adverts based on StormCrawler.

  • Common Crawl

    Design and implementation of a WARC exporter for Nutch; StormCrawler pipeline for crawling the news dataset.

  • Navia

    Auditing, redesign and optimisation of a Solr setup for a real estate search system using geo-location.