Programming Experience
• In JStorm, a spout's nextTuple and ack/fail run in different threads, so users are encouraged to perform blocking operations in nextTuple. In native Storm, nextTuple and ack/fail run in the same thread, so none of them may block; otherwise tuples time out. The drawback is that when there is no data, the whole spout spins in an empty loop and wastes a great deal of CPU. JStorm therefore changed Storm's spout design and encourages blocking operations (for example, taking messages from a queue) to save CPU.
• Architecturally, we recommend a three-part design: message middleware + JStorm + external storage.
o JStorm reads data from the message middleware, computes the results, and writes them to external storage.
o RocketMQ or Kafka is generally recommended as the message middleware.
o HBase or MySQL is recommended as the external storage.
o This architecture makes it easy to restart a JStorm program (for example, to upgrade it when new business logic is added).
o Responsibilities are clearly separated and interaction with external systems is reduced: once JStorm has written its results to external storage, user queries go to the external storage and never need to reach the service processes inside JStorm.
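The blocking nextTuple pattern from the first point above can be sketched as follows. This is a minimal illustration only: `BlockingSpoutSketch`, its `offer` method, and the `String` message type are hypothetical names, and the class deliberately does not implement any Storm or JStorm interface, so it runs with the JDK alone. It shows the core idea of parking on `BlockingQueue.take()` instead of spinning when no data is available.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for a spout; not a Storm/JStorm class. */
public class BlockingSpoutSketch {
    // Filled by another thread, e.g. a message-middleware listener.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    /** Called by the producing thread to hand a message to the spout. */
    public void offer(String message) {
        pending.add(message);
    }

    /**
     * Plays the role of nextTuple: blocks until a message is available,
     * so the calling thread sleeps instead of burning CPU in an empty loop.
     */
    public String nextTuple() {
        try {
            return pending.take();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and emit nothing (e.g. on shutdown).
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingSpoutSketch spout = new BlockingSpoutSketch();

        // Simulated producer that delivers one message after a short delay.
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            spout.offer("hello");
        });
        producer.start();

        // Blocks for roughly 100 ms without spinning, then prints "hello".
        System.out.println(spout.nextTuple());
        producer.join();
    }
}
```

In a real spout the value returned here would be emitted through the collector; blocking like this is only safe because JStorm runs ack/fail on a separate thread.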
• In practice, data often needs to be corrected after the fact, so a re-run capability should be planned for when the project is designed.
• Data in meta should preferably carry a timestamp.
• If results are written to Hadoop or a database, the results should preferably carry a timestamp as well.
• When using an asynchronous Kafka/meta client (listener mode), the topology must be restarted whenever meta is added or restarted.
• When using transactions, brokerIds must stay in order when Kafka/meta is added: each newly added machine's brokerId must be larger than all existing ones, so that the brokerIds consumed by the spouts do not shift out of place when they are reassigned.
• In non-transactional settings, use IBasicBolt wherever possible.
• When computing parallelism:
o For a spout, plan on 500 QPS per task.
o For a task that works entirely in memory, plan on 2000 QPS per task.
o For a task that writes results to an external system, size the parallelism by what the external system can sustain.
• For MetaQ and Kafka:
o Do not pull too frequently; with an interval below 100 ms, MetaQ/Kafka tends to perform many empty pulls.
o A block size of 1 MB or 2 MB per fetch is recommended; larger blocks put heavy pressure on memory GC, and smaller blocks are inefficient.
• Running 2 tasks per worker is recommended.
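As a rough worked example of the sizing rules above (500 QPS per spout task, 2000 QPS per in-memory task, 2 tasks per worker), the arithmetic might look like the sketch below. The class and method names are hypothetical, and the 10,000 QPS input load is an assumed figure chosen only for illustration; real sizing should be checked against measurements.

```java
/** Hypothetical helper illustrating the sizing rules of thumb. */
public class ParallelismEstimate {
    // Rules of thumb taken from the guidance above.
    public static final int SPOUT_QPS_PER_TASK = 500;
    public static final int IN_MEMORY_QPS_PER_TASK = 2000;
    public static final int TASKS_PER_WORKER = 2;

    /** Smallest task count whose combined capacity covers the expected QPS. */
    public static int tasksFor(long expectedQps, int qpsPerTask) {
        if (expectedQps <= 0) {
            return 1; // always keep at least one task
        }
        // Ceiling division: round up so capacity is never below the load.
        return (int) ((expectedQps + qpsPerTask - 1) / qpsPerTask);
    }

    /** Workers needed when each worker runs TASKS_PER_WORKER tasks. */
    public static int workersFor(int totalTasks) {
        return (totalTasks + TASKS_PER_WORKER - 1) / TASKS_PER_WORKER;
    }

    public static void main(String[] args) {
        long expectedQps = 10_000; // assumed load, for illustration only

        int spoutTasks = tasksFor(expectedQps, SPOUT_QPS_PER_TASK);    // 20
        int boltTasks = tasksFor(expectedQps, IN_MEMORY_QPS_PER_TASK); // 5
        int workers = workersFor(spoutTasks + boltTasks);              // 13

        System.out.println("spout tasks: " + spoutTasks);
        System.out.println("in-memory bolt tasks: " + boltTasks);
        System.out.println("workers: " + workers);
    }
}
```

A task that writes to an external system is left out on purpose: its parallelism depends on what that system can sustain, which has to be measured rather than derived from a fixed per-task figure.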
• Where conditions allow, have the program raise alerts itself when something abnormal happens, for example when data has still not been synced to Hadoop by 2:00 a.m.
• Starting with JStorm 0.9.5.1, the underlying Netty layer supports both a synchronous and an asynchronous mode:
o Asynchronous mode performs better but makes spout failures more likely; it suits topologies that run without ackers. Set storm.messaging.netty.sync.mode: false.
o Synchronous mode sends a message only after the receiving side has received the previous one; it suits topologies that run with ackers. Set storm.messaging.netty.sync.mode: true.

Common experience
• When using ZooKeeper, we recommend Curator, but avoid Curator versions that are too new.
• The data hot-spot problem.

If you have experience you would like to share, please email us at [email protected].