Skip to content

BigData Distributed Programming NoSQL New SQL Analytics

sujit edited this page Jan 27, 2015 · 1 revision

Distributed Programming

  • AddThis Hydra: distributed data processing and storage system originally developed at AddThis
  • Akela: Mozilla’s utility library for Hadoop, HBase, Pig, etc.
  • Amazon Lambda: a compute service that runs your code in response to events and automatically manages the compute resources for you
  • AMPLab SIMR: run Spark on Hadoop MapReduce v1
  • AMPLab Succinct: Enabling Queries on Compressed Data
  • Apache Crunch: a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
  • Apache DataFu: collection of user-defined functions for Hadoop and Pig developed by LinkedIn
  • Apache Flink: high-performance runtime, and automatic program optimization
  • Apache Gora: framework for in-memory data model and persistence
  • Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
  • Apache MapReduce: programming model for processing large data sets with a parallel, distributed algorithm on a cluster
  • Apache Pig: high level language to express data analysis programs for Hadoop
  • Apache S4: framework for stream processing, implementation of S4
  • Apache Spark: framework for in-memory cluster computing
  • Apache Spark Streaming: framework for stream processing, part of Spark
  • Apache Storm: framework for stream processing by Twitter also on YARN
  • Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
  • Apache Twill: abstraction over YARN that reduces the complexity of developing distributed applications
  • Cascalog: data processing and querying library
  • Cheetah: High Performance, Custom Data Warehouse on Top of MapReduce
  • Concurrent Cascading: framework for data management/analytics on Hadoop
  • Damballa Parkour: MapReduce library for Clojure
  • Datasalt Pangool: alternative MapReduce paradigm
  • DataTorrent StrAM: real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance
  • DistributedR: scalable high-performance platform for the R language
  • Drools: a Business Rules Management System (BRMS) solution
  • eBay Oink: REST based interface for PIG execution
  • Facebook Corona: Hadoop enhancement which removes single point of failure
  • Facebook Peregrine: Map Reduce framework
  • Facebook Scuba: distributed in-memory datastore
  • Geotrellis: geographic data processing engine for high performance applications
  • GIS Tools for Hadoop: Big Data Spatial Analytics for the Hadoop Framework
  • Google Dataflow: create data pipelines to help themæingest, transform and analyze data
  • Google MapReduce: map reduce framework
  • Google MillWheel: fault tolerant stream processing framework
  • Hazelcast: In-Memory Data Grid
  • HParser: data parsing transformation environment optimized for Hadoop
  • IBM Streams: advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources
  • JAQL: declarative programming language for working with structured, semi-structured and unstructured data
  • Kite: is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem
  • Kryo: Java serialization and cloning: fast, efficient, automatic
  • LinkedIn Cubert: a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop
  • Lipstick: Pig workflow visualization tool
  • Metamarkers Druid: framework for real-time analysis of large datasets
  • Netflix Aegisthus: Bulk Data Pipeline out of Cassandra. implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family
  • Netflix Lipstick: Pig Visualization framework
  • Netflix Mantis: Event Stream Processing System
  • Netflix PigPen: map-reduce for Clojure whiche compiles to Apache Pig
  • Netflix STAASH: language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
  • Netflix Zeno: Netflix’s In-Memory Data Propagation Framework
  • Nextflow: Dataflow oriented toolkit for parallel and distributed computational pipelines
  • Nokia Disco: MapReduce framework developed by Nokia
  • PigPen: PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it
  • Pinterest Pinlater: asynchronous job execution system
  • Pubnub: Data stream network
  • Pydoop: Python MapReduce and HDFS API for Hadoop
  • ScaleOut hServer: fast, scalable in-memory data grid for Hadoop
  • SeqPig: Simple and scalable scripting for large sequencing data set(ex: bioinfomation) in Hadoop
  • SigmoidAnalytics Spork: Pig on Apache Spark
  • SpatialHadoop: SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
  • Spring for Apache Hadoop: unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive
  • SQLStream Blaze: stream processing platform
  • Stratio Streaming: the union of a real-time messaging bus with a complex event processing engine using Spark Streaming
  • Stratosphere: general purpose cluster computing framework
  • Streamdrill: usefull for counting activities of event streams over different time windows and finding the most active one
  • Sumo Logic: cloud based analyzer for machine-generated data.
  • Teradata QueryGrid: data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop
  • TIBCO ActiveSpaces: in-memory data grid
  • Tigon: a distributed framework built on Apache HadoopTM and Apache HBaseTM for real-time, high-throughput, low-latency data processing and analytics applications
  • Torch: Scientific computing for LuaJIT
  • Twitter Scalding: Scala library for Map Reduce jobs, built on Cascading
  • Twitter Summingbird: Streaming MapReduce with Scalding and Storm, by Twitter
  • Twitter TSAR: TimeSeries AggregatoR by Twitter

Distributed Filesystem

Key-Map Data Model

  • Actian Vector: column-oriented analytic database
  • Apache Accumulo: distribuited key/value store, built on Hadoop
  • Apache Cassandra: column-oriented distribuited datastore, inspired by BigTable
  • Apache HBase: column-oriented distribuited datastore, inspired by BigTable
  • Facebook HydraBase: evolution of HBase made by Facebook
  • Google BigTable: column-oriented distributed datastore
  • Google Cloud Datastore: is a fully managed, schemaless database for storing non-relational data over BigTable
  • Hypertable: column-oriented distribuited datastore, inspired by BigTable
  • InfiniDB: is accessed through a MySQL interface and use massive parallel processing to parallelize queries
  • MapR-DB: fast, scalable, and enterprise-ready in-Hadoop database architected to manage big data
  • Netflix Priam: Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra
  • OhmData C5: improved version of HBase
  • Sqrrl: NoSQL databases on top of Apache Accumulo
  • Tephra: Transactions for HBase
  • Twitter Manhattan: real-time, multi-tenant distributed database for Twitter scale

Document Data Model

  • Actian Versant: commercial object-oriented database management systems
  • Amazon SimpleDB: a highly available and flexible non-relational data store that offloads the work of database administration
  • Clusterpoint: a database software for high-speed storage and large-scale processing of XML and JSON data on clusters of commodity hardware
  • Crate Data: is an open source massively scalable data store. It requires zero administration
  • Facebook Apollo: Facebook’s Paxos-like NoSQL database
  • jumboDB: document oriented datastore over Hadoop
  • LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store
  • MarkLogic: Schema-agnostic Enterprise NoSQL database technology
  • Microsoft DocumentDB: fully-managed, highly-scalable, NoSQL document database service
  • MongoDB: Document-oriented database system
  • RavenDB: A transactional, open-source Document Database
  • RethinkDB: document database that supports queries like table joins and group by
  • Terrastore: a modern document store which provides advanced scalability and elasticity features without sacrificing consistency
  • TokuMX: High-Performance MongoDB Distribution

Key-value Data Model

  • Aerospike: NoSQL flash-optimized, in-memory. Open source and “Server code in ‘C’ (not Java or Erlang) precisely tuned to avoid context switching and memory copies.
  • Amazon DynamoDB: distributed key/value store, implementation of Dynamo paper
  • Couchbase ForestDB: Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie
  • Edis: is a protocol-compatible Server replacement for Redis
  • ElephantDB: Distributed database specialized in exporting data from Hadoop
  • EventStore: distributed time series database
  • HyperDex: next generation key-value store
  • KAI: a distributed key-value datastore
  • LinkedIn Krati: is a simple persistent data store with very low latency and high throughput
  • Linkedin Voldemort: distributed key/value storage system
  • MemcacheDB: a distributed key-value storage system designed for persistent
  • Netflix Dynomite: thin Dynamo-based replication for cached data
  • Oracle NoSQL Database: distributed key-value database by Oracle Corporation
  • RAMCloud: storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers
  • Redis: in memory key value datastore
  • Redis Cluster: distributed implementation of Redis
  • Redis Sentinel: system designed to help managing Redis instances
  • Riak: a decentralized datastore
  • Scalaris: a distributed transactional key-value store
  • Storehaus: library to work with asynchronous key value stores, by Twitter
  • Tarantool: an efficient NoSQL database and a Lua application server
  • TreodeDB: key-value store that’s replicated and sharded and provides atomic multirow writes
  • Yahoo Sherpa: hosted, distributed and geographically replicated key-valueÊcloud storage platform

Graph Data Model

  • Apache Giraph: implementation of Pregel, based on Hadoop
  • Apache Spark Bagel: implementation of Pregel, part of Spark
  • ArangoDB: multi model distribuited database
  • Facebook TAO: TAO is the distributed data store that is widely used at facebook to store and serve the social graph
  • Faunus: Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster
  • Google Cayley: open-source graph database
  • Google Pregel: graph processing framework
  • GraphLab PowerGraph: a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API
  • GraphX: resilient Distributed Graph System on Spark
  • Gremlin: graph traversal Language
  • HyperGraphDB: general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs
  • InfiniteGraph: distributed graph database
  • Infovore: RDF-centric Map/Reduce framework
  • Intel GraphBuilder: tools to construct large-scale graphs on top of Hadoop
  • MapGraph: Massively Parallel Graph processing on GPUs
  • Neo4j: graph database writting entirely in Java
  • OrientDB: document and graph database
  • Phoebus: framework for large scale graph processing
  • Sparksee: scalable high-performance graph database
  • Stardog: graph database: search, query, reasoning, and constraints in a lightweight, pure Java system
  • Titan: distributed graph database, built over Cassandra
  • Twitter FlockDB: distribuited graph database

NewSQL Databases

  • Actian Ingres: commercially supported, open-source SQL relational database management system
  • BayesDB: statistic oriented SQL database
  • Cockroach: Scalable, Geo-Replicated, Transactional Datastore
  • Datomic: distributed database designed to enable scalable, flexible and intelligent applications
  • FoundationDB: distributed database, inspired by F1
  • Google F1: distributed SQL database built on Spanner
  • Google Spanner: globally distributed semi-relational database
  • H-Store: is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications
  • HandlerSocket: NoSQL plugin for MySQL/MariaDB
  • IBM DB2: object-relational database management system
  • InfiniSQL: infinity scalable RDBMS
  • MemSQL: in memory SQL database witho optimized columnar storage on flash
  • NuoDB: SQL/ACID compliant distributed database
  • Oracle Database: object-relational database management system
  • Oracle TimesTen in-Memory Database: in-memory, relational database management system with persistence and recoverability
  • Pivotal GemFire XD: Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS
  • SAP HANA: is an in-memory, column-oriented, relational database management system
  • SenseiDB: distributed, realtime, semi-structured database
  • Sky: database used for flexible, high performance analysis of behavioral data
  • SymmetricDS: open source software for both file and database synchronization
  • Teradata Database: complete relational database management system
  • VoltDB: in-memory NewSQL database

Columnar Databases

  • Amazon RedShift: data warehouse service, based on PostgreSQL
  • C-Store: column oriented DBMS
  • Google BigQuery: framework for interactive analysis, implementation of Dremel
  • Google Dremel: framework for interactive analysis, implementation of Dremel
  • MonetDB: column store database
  • Parquet: columnar storage format for Hadoop
  • Pivotal Greenplum: purpose-built, dedicated analytic data warehouse
  • Vertica: is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses

Time-Series Databases

  • Cube: uses MongoDB to store time series data
  • Etsy StatsD: simple daemon for easy stats aggregation
  • InfluxDB: distributed time series database
  • Kairosdb: similar to OpenTSDB but allows for Cassandra
  • OpenTSDB: distributed time series database on top of HBase
  • Square Cube: system for collecting timestamped events and deriving metrics
  • TempoIQ: Cloud-based sensor analytics

SQL-like processing

  • Actian SQL for Hadoop: high performance interactive SQL access to all Hadoop data
  • AMPLAB Shark: data warehouse system for Spark
  • Apache Drill: framework for interactive analysis, inspired by Dremel
  • Apache HCatalog: table and storage management layer for Hadoop
  • Apache Hive: SQL-like data warehouse system for Hadoop
  • Apache Optiq: framework that allows efficient translation of queries involving heterogeneous and federated data
  • Apache Phoenix: SQL skin over HBase
  • BlinkDB: massively parallel, approximate query engine
  • Cloudera Impala: framework for interactive analysis, Inspired by Dremel
  • Concurrent Lingual: SQL-like query language for Cascading
  • Datasalt Splout SQL: full SQL query engine for big datasets
  • eBay Kylin: Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
  • Facebook PrestoDB: distributed SQL query engine
  • Hadapt: a native implementation of SQL for the Apache Hadoop open-source project
  • JethroData: index-based SQL engine for Hadoop
  • Metanautix Quest: data compute engine
  • Pivotal HAWQ: SQL-like data warehouse system for Hadoop
  • RainstorDB: database for storing petabyte-scale volumes of structured and semi-structured data
  • Spark Catalyst: is a Query Optimization Framework for Spark and Shark
  • SparkSQL: Manipulating Structured Data Using Spark
  • Splice Machine: a full-featured SQL-on-Hadoop RDBMS with ACID transactions
  • Stinger: interactive query for Hive
  • Tajo: distributed data warehouse system on Hadoop
  • Trafodion: enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads

Integrated Development Environments

Data Ingestion

  • Amazon Kinesis: real-time processing of streaming data at massive scale
  • Apache BookKeeper: a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig
  • Apache Chukwa: data collection system
  • Apache Flume: service to manage large amount of log data
  • Apache Samza: stream processing framework, based on Kafla and YARN
  • Apache Sqoop: tool to transfer data between Hadoop and a structured datastore
  • Apache UIMA: Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user
  • Cloudera Morphlines: framework that help ETL to Solr, HBase and HDFS
  • Facebook Scribe: streamed log data aggregator
  • Fluentd: tool to collect events and logs
  • Google Photon: geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency
  • Heka: open source stream processing software system
  • HIHO: framework for connecting disparate data sources with Hadoop
  • LinkedIn Camus: Kafka to HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka
  • LinkedIn Databus: stream of change capture events for a database
  • LinkedIn Gobblin: a framework for Solving Big Data Ingestion Problem
  • LinkedIn Kamikaze: utility package for compressing sorted integer arrays
  • Linkedin Lumos: bridge from OLTP to OLAP for use it on Hadoop
  • LinkedIn White Elephant: log aggregator and dashboard
  • Logstash: a tool for managing events and logs
  • Netflix Suro: data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data based on Chukwa
  • Pinterest Secor: is a service implementing Kafka log persistance
  • Record Breaker: Automatic structure for your text-formatted data
  • TIBCO Enterprise Message Service: standards-based messaging middleware
  • Twitter Zipkin: distributed tracing system that helps us gather timing data for all the disparate services at Twitter
  • Vibe Data Stream: streaming data collection for real-time Big Data analytics

Message-oriented middleware

  • ActiveMQ: open source messaging and Integration Patterns server
  • Amazon Simple Queue Service: fast, reliable, scalable, fully managed queue service
  • Apache Kafka: distributed publish-subscribe messaging system
  • Apache Qpid: messaging tools that speak AMQP and support many languages and platforms
  • Apollo: ActiveMQ’s next generation of messaging
  • Beanstalkd: simple, fast work queue
  • Bit.ly NSQ: realtime distributed message processing at scale
  • Celery: Distributed Task Queue
  • Crossroads I/O: library for building scalable and high performance distributed applications
  • Darner: simple, lightweight message queue
  • Facebook Iris: a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier
  • Gearman: Job Server
  • HornetQ: open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system
  • IronMQ: easy-to-use highly available message queuing service
  • Kestrel: distributed message queue system
  • Marconi: queuing and notification service made by and for OpenStack, but not only for it
  • RabbitMQ: Robust messaging for applications
  • RestMQ: message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources
  • RQ: simple Python library for queueing jobs and processing them in the background with workers
  • Sidekiq: Simple, efficient background processing for Ruby
  • ZeroMQ: The Intelligent Transport Layer

Service Programming

  • Akka Toolkit: runtime for distributed, and fault tolerant event-driven applications on the JVM
  • Apache Avro: data serialization system
  • Apache Curator: Java libaries for Apache ZooKeeper
  • Apache Karaf: OSGi runtime that runs on top of any OSGi framework
  • Apache Thrift: framework to build binary protocols
  • Apache Zookeeper: centralized service for process management
  • Google Chubby: a lock service for loosely-coupled distributed systems
  • Linkedin Norbert: cluster manager
  • MPICH: high performance and widely portable implementation of the Message Passing Interface (MPI) standard
  • OpenMPI: message passing framework
  • Serf: decentralized solution for service discovery and orchestration
  • Spotify Luigi: a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more
  • Spring XD: distributed and extensible system for data ingestion, real time analytics, batch processing, and data export
  • Twitter Elephant Bird: libraries for working with LZOP-compressed data
  • Twitter Finagle: asynchronous network stack for the JVM

Scheduling

Machine Learning

  • Apache Mahout: machine learning library for Hadoop
  • Ayasdi Core: tool for topological data analysis
  • brain: Neural networks in JavaScript
  • Cloudera Oryx: real-time large-scale machine learning
  • Concurrent Pattern: machine learning library for Cascading
  • convnetjs: Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser
  • cuDNN: GPU-accelerated library of primitives for deep neural networks
  • Decider: Flexible and Extensible Machine Learning in Ruby
  • etcML: text classification with machine learning
  • Etsy Conjecture: scalable Machine Learning in Scalding
  • Google Sibyl: System for Large Scale Machine Learning at Google
  • H2O: statistical, machine learning and math runtime for Hadoop
  • IBM Watson: cognitive computing system
  • LinkedIn ml-ease: ADMM based large scale logistic regression
  • MLbase: distributed machine learning libraries for the BDAS stack
  • MLPNeuralNet: Fast multilayer perceptron neural network library for iOS and Mac OS X
  • nupic: Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms
  • PredictionIO: machine learning server buit on Hadoop, Mahout and Cascading
  • scikit-learn: scikit-learn: machine learning in Python
  • Spark MLlib: a Spark implementation of some common machine learning (ML) functionality
  • Sparkling Water: combine H2OÕs Machine Learning capabilities with the power of the Spark platform
  • Theano: Python package for deep learning that can utilize NVIDIA’s CUDA toolkit to run on the GPU
  • Thunder: Large-scale analysis of neural data
  • Vahara: Machine learning and natural language processing with Apache Pig
  • Viv: global platform that enables developers to plug into and create an intelligent, conversational interface to anything
  • Vowpal Wabbit: learning system sponsored by Microsoft and Yahoo!
  • WEKA: suite of machine learning software
  • Wit: Natural Language for the Internet of Things
  • Wolfram Alpha: computational knowledge engine
  • YHat ScienceOps: platform for deploying, managing, and scaling predictive models in production applications

Benchmarking

Security

System Deployment

  • Ankush: A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache Ambari: operational framework for Hadoop mangement
  • Apache Bigtop: system deployment framework for the Hadoop ecosystem
  • Apache Helix: cluster management framework
  • Apache Mesos: cluster manager
  • Apache Slider: is a YARN application to deploy existing distributed applications on YARN
  • Apache Whirr: set of libraries for running cloud services
  • Apache YARN: Cluster manager
  • Brooklyn: library that simplifies application deployment and management
  • Buildoop: Similar to Apache BigTop based on Groovy language
  • Cloudera Director: a comprehensive data management platform with the flexibility and power to evolve with your business
  • Cloudera HUE: web application for interacting with Hadoop
  • Deimos: Mesos containerizer hooks for Docker
  • Develoop: tool for provisioning, managing and monitoring Apache Hadoop
  • Facebook Autoscale: the load balancer will concentrate workload to a server until it has at least a medium-level workload
  • Facebook Prism: multi datacenters replication system
  • Ganglia Monitoring System: scalable distributed monitoring system for high-performance computing systems such as clusters and Grids
  • Genie: Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Google Borg: job scheduling and monitoring system
  • Google Omega: job scheduling and monitoring system
  • Hannibal: Hannibal is tool to help monitor and maintain HBase-Clusters that are configured for manual splitting.
  • Hortonworks HOYA: application that can deploy HBase cluster on YARN
  • Jumbune: Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.
  • Marathon: Mesos framework for long-running services
  • Myriad: a mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies.
  • Neflix SimianArmy: a suite of tools for keeping your cloud operating in top form

Container Manager

  • Amazon EC2 Container Service: a highly scalable, high performance container management service that supports Docker containers
  • Docker: an open platform for developers and sysadmins to build, ship, and run distributed applications
  • Fig: fast, isolated development environments using Docker
  • Google Container Engine: Run Docker containers on Google Cloud Platform, powered by Kubernetes
  • Kubernetes: open source implementation of container cluster management
  • Rocket: an alternative to the Docker runtime, designed for server environments with the most rigorous security and production requirements

Applications

  • Adobe Spindle: Next-generation web analytics processing with Scala, Spark, and Parquet
  • Apache Kiji: framework to collect and analyze data in real-time, based on HBase
  • Apache Nutch: open source web crawler
  • Apache OODT: capturing, processing and sharing of data for NASA’s scientific archives
  • Apache Tika: content analysis toolkit
  • Domino: Run, scale, share, and deploy models Ñ without any infrastructure.
  • Eclipse BIRT: Eclipse-based reporting system
  • Eventhub: open source event analytics platform
  • HIPI Library: API for performing image processing tasks on Hadoop’s MapReduce
  • Hunk: Splunk analytics for Hadoop
  • MADlib: data-processing library of an RDBMS to analyze data
  • PivotalR: R on Pivotal HD / HAWQ and PostgreSQL
  • Qubole: auto-scaling Hadoop cluster, built-in data connectors
  • Sense: Cloud Platform for Data Science and Big Data Analytics
  • Snowplow: enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres
  • SparkR: R frontend for Spark
  • Splunk: analyzer for machine-generated date
  • Talend: unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig

Search engine and framework

  • Apache Blur: a search engine capable of querying massive amounts of structured data at incredible speeds
  • Apache Lucene: Search engine library
  • Apache Solr: Search platform for Apache Lucene
  • ElasticSearch: Search and analytics engine based on Apache Lucene
  • Elasticsearch Hadoop: Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • Enigma.io: Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web
  • Facebook Unicorn: social graph search platform
  • Google Caffeine: continuous indexing system
  • Google Percolator: continuous indexing system
  • TeraGoogle: large search index
  • Haeinsa: linearly scalable multi-row, multi-table transaction library for HBase based on Percolator
  • HBase Coprocessor: implementation of Percolator, part of HBase
  • hIndex: Secondary Index for HBase
  • SF1R Search Engine: distributed search engine written in c++
  • Lily HBase Indexer: quickly and easily search for any content stored in HBase
  • LinkedIn Bobo: is a Faceted Search implementation written purely in Java, an extension to Apache Lucene
  • LinkedIn Cleo: is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search
  • LinkedIn Galene: search architecture at LinkedIn
  • LinkedIn Zoie: is a realtime search/indexing system written in Java
  • Sphnix Search Server: fulltext search engine

MySQL forks and evolutions

  • Amazon Aurora: a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases
  • Amazon RDS: MySQL databases in Amazon’s cloud
  • Drizzle: evolution of MySQL 6.0
  • Google Cloud SQL: MySQL databases in Google’s cloud
  • MariaDB: enhanced, drop-in replacement for MySQL
  • MariaDB Galera: a synchronous multi-master cluster for MariaDB
  • MySQL Cluster: MySQL implementation using NDB Cluster storage engine providing shared-nothing clustering and auto-sharding
  • Percona Server: enhanced, drop-in replacement for MySQL
  • ProxySQL: High Performance Proxy for MySQL
  • TokuDB: TokuDB is a storage engine for MySQL and MariaDB
  • WebScaleSQL: is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale
  • Youtube Vitess: provides servers and tools which facilitate scaling of MySQL databases for large scale web services

PostgreSQL forks and evolutions

  • HadoopDB: hybrid of MapReduce and DBMS
  • IBM Netezza: high-performance data warehouse appliances
  • Postgres-XL: Scalable Open Source PostgreSQL-based Database Cluster
  • RecDB: Open Source Recommendation Engine Built Entirely Inside PostgreSQL
  • Stado: open source MPP database system solely targeted at data warehousing and data mart applications
  • Yahoo Everest: multi-peta-byte database / MPP derived by PostgreSQL

Memcached forks and evolutions

Embedded Databases

  • Actian PSQL: ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications
  • BerkeleyDB: a software library that provides a high-performance embedded database for key/value data
  • FairCom c-treeACE: a cross-platform database engine
  • Google Firebase: a powerful API to store and sync data in realtime
  • HamsterDB: transactional key-value database
  • HanoiDB: Erlang LSM BTree Storage
  • LevelDB: a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values
  • LMDB: ultra-fast, ultra-compact key-value embedded data store developed by Symas
  • RocksDB: embeddable persistent key-value store for fast storage based on LevelDB
  • TokioCabinet: a library of routines for managing a database

Business Intelligence

  • ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing
  • Adatao: business intelligence and data science platform
  • Apama analytics: platform for streaming analytics and intelligent automated action
  • Atigeo xPatterns: data analytics platform
  • BIME Analytics: business intelligence platform in the cloud
  • Chartio: lean business intelligence platform to visualize and explore your data
  • Datapine: self-service business intelligence tool in the cloud
  • Jaspersoft: powerful business intelligence suite
  • Jedox Palo: customisable Business Intelligence platform
  • Lavastorm Analytics: used for audit analytics, revenue assurance, fraud management, and customer experience management
  • LinkedIn GoSpeed: provides RUM data processing, visualization, monitoring, and analyses data daily, hourly, or on a near real-time basis
  • Map-D: GPU in-memory database, big data analysis and visualization platform
  • Microsoft: business intelligence software and platform
  • Microstrategy: software platforms for business intelligence, mobile intelligence, and network applications
  • Pentaho: business intelligence platform
  • Qlik: business intelligence and analytics platform
  • SpagoBI: open source business intelligence platform
  • Spotfire: business intelligence platform
  • Tableau: business intelligence platform
  • Teradata Aster: Big Data Analytics
  • Tessera: Environment for Deep Analysis of Large Complex Data
  • Zeppelin: open source data analysis environment on top of Hadoop.
  • Zoomdata: Big Data Analytics

Data Analysis

  • LinkedIn Pinot: a distributed system that supports columnar indexes with the ability to add new types of indexes
  • Myria: scalable Analytics-as-a-Service platform based on relational algebra
  • Pinalytics: Pinterestâ��s data analytics engine
  • Zillabyte: an API for distributed data computation. Scale with your data.

Data Warehouse

Data Visualization

  • Arbor: graph visualization library using web workers and jQuery
  • C3: D3-based reusable chart library
  • CartoDB: open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API
  • Chart.js: open source HTML5 Charts visualizations
  • Chartist.js: another open source HTML5 Charts visualization
  • Crossfilter: avaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js
  • Cubism: JavaScript library for time series visualization
  • Cytoscape: JavaScript library for visualizing complex networks
  • D3: javaScript library for manipulating documents
  • DC.js: Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3
  • Envisionjs: dynamic HTML5 visualization
  • FnordMetric ChartSQL: allows you to write SQL queries that return charts instead of tables. The charts are rendered as SVG vector graphics.
  • Freeboard: open source real-time dashboard builder for IOT and other web mashups
  • Gephi: An award-winning open-source platform for visualizing and manipulating large graphs and network connections
  • Google Charts: simple charting API
  • Grafana: open source, feature rich metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB
  • Graphite: scalable Realtime Graphing
  • Highcharts: simple and flexible charting API
  • IPython: provides a rich architecture for interactive computing
  • Keylines: toolkit for visualizing the networks in your data
  • Kibana: visualize logs and time-stamped data
  • Matplotlib: plotting with Python
  • NVD3: chart components for d3.js
  • Peity: Progressive SVG bar, line and pie charts
  • Plot.ly: Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly’s online spreadsheet. Fork others’ plots.
  • Recline: simple but powerful library for building data applications in pure Javascript and HTML
  • Redash: open-source platform to query and visualize data
  • Sigma.js: JavaScript library dedicated to graph drawing
  • Square Cubism.js: aÊD3Êplugin for visualizing time series. Use Cubism to construct better realtime dashboards, pulling data fromÊGraphite,ÊCubeÊand other sources
  • Vega: a visualization grammar

Internet of Things

  • 2lemetry: Platform for Internet of things
  • Evrything: Making products smart
  • ThingWorx: Rapid development and connection of intelligent systems
Clone this wiki locally