Big Data Storage, Indexing, Streaming, Analytics & Graph Databases
Apache Cassandra™ is a proven, high-performance, open source NoSQL database. It is fault-tolerant: data is automatically replicated to multiple nodes, with no single point of failure. Cassandra™ is highly durable and elastic, with read and write throughput increasing linearly as new hardware nodes are added. Because of its high write speeds, it is particularly well suited to sensor data storage. DarkMatter™ is fully integrated with Cassandra as a standard storage environment for the sensor-agnostic data streaming service.
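To make the replication claim concrete, here is a minimal sketch of how a partition key could be assigned to several nodes on a token ring. It is plain Python rather than Cassandra code: the node names, the `Ring` class and the use of MD5 as the hash are all assumptions made for illustration, and real clusters use their own partitioner and replication strategies.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns one token; replicas are the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the nodes that would hold copies of this partition key."""
        size = len(self.entries)
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % size
        return [self.entries[(start + i) % size][1]
                for i in range(self.replication_factor)]


# Hypothetical four-node cluster storing readings keyed by sensor id.
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")
print(owners)
```

With a replication factor of 3, losing any single node in this sketch still leaves two nodes holding each key, which is the intuition behind "no single point of failure".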
Apache Solr™ is a highly scalable and fault-tolerant open source enterprise search server built on Apache Lucene™. It is optimised for high-volume traffic and has standards-based open interfaces for JSON, XML, PHP, Python and REST APIs, amongst others. One of its key features is its geospatial search capability, which includes support for multiple points per document and for polygons. It pairs very well with Cassandra™ and has the same linear scalability. It is a search engine of choice for major Hadoop distributions as well as for commercial and open source content management systems. We have also found that Solr has excellent capabilities as a standalone NoSQL data store in analytical projects.
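As an illustration of the geospatial search mentioned above, the sketch below builds a Solr select URL that uses the `geofilt` distance filter and asks for a JSON response. The host, the collection name `sensors`, the field name `location` and the helper `geofilt_url` are hypothetical; only the URL is constructed, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def geofilt_url(base_url, collection, field, lat, lon, distance_km, query="*:*"):
    """Build a select URL that keeps documents within distance_km of (lat, lon)."""
    params = {
        "q": query,
        # geofilt takes the spatial field (sfield), a centre point (pt)
        # and a distance in kilometres (d).
        "fq": f"{{!geofilt sfield={field} pt={lat},{lon} d={distance_km}}}",
        "wt": "json",
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection.
url = geofilt_url("http://localhost:8983", "sensors", "location", -25.75, 28.19, 5)
print(url)

# Decode the query string again to inspect the filter that would be sent.
decoded = parse_qs(urlparse(url).query)
print(decoded["fq"][0])
```

Building the query string with `urlencode` keeps the braces and spaces in the filter correctly escaped, which is easy to get wrong when concatenating strings by hand.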
Apache Spark™ is a Big Data processing framework focussing on sophisticated analytics, speed and ease of use. Spark differs from Hadoop in that it is a data-processing tool that operates on top of distributed data collections rather than providing its own distributed storage. It has four main libraries:
Spark SQL - exposes Spark datasets over a JDBC API to allow SQL-like queries
Spark Streaming - for performing real-time streaming analytics
Spark Machine Learning Library (MLlib) - a distributed machine learning framework that includes summary statistics, hypothesis testing, regression, cluster analysis, principal component analysis (PCA) and others
Spark GraphX - an API for graphs and graph-parallel processing
Spark supports a host of languages, such as Java, Python, Scala and R. It runs on a cluster manager (such as Mesos or YARN) and interfaces seamlessly with Cassandra for distributed data storage.
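To illustrate what processing a distributed data collection looks like in principle, the following sketch splits a small list of sensor readings into partitions, counts readings per sensor within each partition, and then merges the partial results. It is plain Python, not the Spark API: the function names and the sample readings are invented for illustration, and in a real cluster each partition would be processed on a different machine.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split records round-robin into lists (a stand-in for a distributed collection)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_by_sensor(part):
    """Per-partition work: count the readings seen for each sensor id."""
    return Counter(sensor_id for sensor_id, _value in part)


# Hypothetical (sensor id, value) readings.
readings = [
    ("s1", 20.5), ("s2", 19.0), ("s1", 21.0),
    ("s3", 18.2), ("s2", 19.4), ("s1", 20.9),
]

# "Map" step: each partition is processed independently of the others.
partials = [count_by_sensor(p) for p in partition(readings, 3)]

# "Reduce" step: merge the per-partition counters into one result.
totals = reduce(lambda left, right: left + right, partials, Counter())
print(dict(totals))
```

Because each partition is handled without reference to the others, the per-partition step can run in parallel, and only the small partial results need to be combined at the end.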
Titan is a distributed graph database that interfaces very well with Cassandra™, Solr™ and Spark™, making it the ideal platform for graph-based analytics in our "Thing Stack" framework. It builds on top of Cassandra and, like Cassandra, is linearly scalable (distribution and replication), allowing thousands of concurrent users to perform complex graph queries. It differs from Spark GraphX in the way it stores the nodes and edges (vertices and edges) of network visualisations (graphs). It is distributed under the Apache 2 licence. Titan forms a key element in our Activity Based Intelligence solutions.
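A small sketch can show what storing vertices and edges and running a graph query involves. The example below assumes an adjacency-list layout, in which each vertex is kept together with its own outgoing edges; this is an assumption made for illustration rather than a description of Titan's internals, and the class, edge labels and names are hypothetical. It is plain Python and does not use Titan's query language.

```python
from collections import defaultdict


class AdjacencyGraph:
    """Toy graph store: each vertex maps to a list of (edge label, target vertex)."""

    def __init__(self):
        self.out_edges = defaultdict(list)

    def add_edge(self, source, edge_label, target):
        self.out_edges[source].append((edge_label, target))
        # Make sure the target exists as a vertex even if it has no outgoing edges.
        self.out_edges.setdefault(target, [])

    def neighbours(self, vertex, edge_label=None):
        """Vertices reachable in one step, optionally restricted to one edge label."""
        return [target for label, target in self.out_edges.get(vertex, [])
                if edge_label is None or label == edge_label]

    def two_hop(self, vertex, edge_label=None):
        """Vertices reachable in exactly two steps, excluding the start vertex."""
        found = set()
        for middle in self.neighbours(vertex, edge_label):
            for far in self.neighbours(middle, edge_label):
                if far != vertex:
                    found.add(far)
        return found


# Hypothetical activity graph: who met whom, and who visited which site.
graph = AdjacencyGraph()
graph.add_edge("alice", "met", "bob")
graph.add_edge("bob", "met", "carol")
graph.add_edge("bob", "met", "alice")
graph.add_edge("bob", "visited", "site-7")

contacts_of_contacts = graph.two_hop("alice", "met")
print(contacts_of_contacts)
```

Keeping a vertex's edges next to the vertex means a traversal such as `two_hop` only has to read the adjacency lists of the vertices it actually passes through, rather than scanning every edge in the graph.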