To add entries you need write permission to the wiki, which you can get by subscribing to the mailing list and asking for write access for the wiki username you registered. If you are using Apache Hadoop in production, you ought to consider getting involved in the development process anyway by filing bugs, testing beta releases, reviewing the code, and turning your notes into shared documentation. Your participation in this process will ensure your needs get met.


  • Amazon*
    • We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
    • We process millions of sessions daily for analytics, using both the Java and streaming APIs.
    • Our clusters vary from 1 to 100 nodes
  • Accela Communications
    • We use an Apache Hadoop cluster to roll up registration and view data each night.
    • Our cluster has ten 1U servers, each with 4 cores, 4 GB RAM, and 3 drives.
    • Each night, we run 112 Hadoop jobs
    • It is roughly 4X faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, then import back into the databases than to perform the same rollups in the database.
  • Adobe
    • We use Apache Hadoop and Apache HBase in several areas from social services to structured data storage and processing for internal use.
    • We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, in both production and development. We plan a deployment on an 80-node cluster.
    • We constantly write data to Apache HBase and run MapReduce jobs to process it, then store it back to Apache HBase or external systems.
    • Our production cluster has been running since Oct 2008.
  • adyard
    • We use Apache Flume, Apache Hadoop and Apache Pig for log storage and report generation as well as ad targeting.
    • We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
    • 50% of our recommender system is pure Pig because of its ease of use.
    • Some of our more deeply integrated tasks use the streaming API and Ruby, as well as the excellent Wukong library.
  • Able Grape - Vertical search engine for trustworthy wine information
    • We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node)
    • Hadoop and Apache Nutch used to analyze and index textual information
  • Adknowledge - Ad network
    • Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
    • We handle 500MM clickstream events per day
    • Our clusters vary from 50 to 200 nodes, mostly on EC2.
    • Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
  • Aguja - E-Commerce Data analysis
    • We use Hadoop, Pig and HBase to analyze search logs, product view data, and all of our logs.
    • 3-node cluster with 48 cores in total, and 4 GB RAM and 1 TB storage per node.
  • Alibaba
    • A 15-node cluster dedicated to processing various sorts of business data dumped out of databases and joining them together. The data is then fed into iSearch, our vertical search engine.
    • Each node has 8 cores, 16 GB RAM and 1.4 TB storage.
  • AOL
    • We use Apache Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
    • The cluster that we use mainly for behavioral analysis and targeting has 150 machines, each with dual-processor, dual-core Intel Xeons, 16 GB RAM and an 800 GB hard disk.
  • ARA.COM.TR - Ara Com Tr - Turkey's first and only search engine
    • We build our search engine using Python tools.
    • We use Apache Hadoop for analytics.
    • We handle about 400TB per month
    • Our clusters vary from 10 to 100 nodes
    • HDFS, Apache Accumulo, Scala
    • Currently 3 nodes (16 GB RAM, 6 TB storage)
  • Atbrox
    • We use Hadoop for information extraction & search, and data analysis consulting
    • Cluster: we primarily use Amazon's Elastic MapReduce
  • ATXcursions
    • Two applications that are side projects of a local tour company: (1) sentiment analysis of review websites and social media data, targeting the tourism industry; (2) a marketing tool that identifies the most valuable and useful reviewers on sites like Tripadvisor and Yelp as well as on social media, letting marketers and business owners find the community members most relevant to their businesses.
    • Using Apache Hadoop, HDFS, Hive, and HBase.
    • 3-node cluster, 4 cores, 4 GB RAM.