|Data operation||Data at rest||Data in motion|
|Parallel computation||Task parallel||Data parallel|
|Deploying the application||Using Scala, Java, Python language||Using Java API|
For streaming of data flow, three components are used
- Bolt :- Bolts represent the processing logic unit in Storm. One can utilize bolts to do any kind of processing such as filtering, aggregating, joining, interacting with data stores, talking to external systems etc. Bolts can also emit tuples (data messages) for the subsequent bolts to process. Additionally, bolts are responsible to acknowledge the processing of tuples after they are done processing.
- Spout :- Spouts represent the source of data in Storm. You can write spouts to read data from data sources such as database, distributed file systems, messaging frameworks etc. Spouts can broadly be classified into following –
Reliable – These spouts have the capability to replay the tuples (a unit of data in data stream). This helps applications achieve ‘at least once message processing’ semantic as in case of failures, tuples can be replayed and processed again. Spouts for fetching the data from messaging frameworks are generally reliable as these frameworks provide the mechanism to replay the messages.
Unreliable – These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spouts follow ‘at most once message processing’ semantic.
- Tuple :- The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type. Tuples are dynamically typed — the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result. Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type
3. What are the key benefits of using Storm for Real Time Processing?
- Easy to operate : Operating storm is quiet easy.
- Real fast : It can process 100 messages per second per node.
- Fault Tolerant : It detects the fault automatically and re-starts the functional attributes.
- Reliable : It guarantees that each unit of data will be executed at least once or exactly once.
- Scalable : It runs across a cluster of machine
Yes, It acts as proxy also by using the mod_proxy module. This module implements a proxy, gateway or cache for Apache. It implements proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL),HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.
5. What is the use of Zookeeper in Storm?
- It’s critical that you run Zookeeper under supervision, since Zookeeper is fail-fast and will exit the process if it encounters any error case. See here for more details.
- It’s critical that you set up a cron to compact Zookeeper’s data and transaction logs. The Zookeeper daemon does not do this on its own, and if you don’t set up a cron, Zookeeper will quickly run out of disk space. See here for more details.