This repository was archived by the owner on Sep 29, 2025. It is now read-only.

Hadoop File Format

Jörn Franke edited this page Jun 27, 2016 · 40 revisions

This Hadoop file format reads transactions and blocks from files in HDFS containing crypto-ledger data, so that any MapReduce/Tez/Spark application can process them. Currently the following crypto ledgers are supported:

  • Bitcoin. This module provides three formats:
    • BitcoinBlockInputformat: Deserializes blocks containing transactions into Java object(s). Each record is an object of the class BitcoinBlock containing transactions (class BitcoinTransaction). Best suited if you want flexible analytics. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes).
    • BitcoinRawBlockInputformat: Each record is a byte array containing the raw Bitcoin block data. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes). This is most suitable if you are only interested in a small part of the data and do not want to waste time on deserialization.
    • BitcoinTransactionInputFormat: Deserializes Bitcoin transactions into Java object(s). Each record is an object of class BitcoinTransaction. Transactions are identifiable by their double-hash value (32 bytes) as specified by Bitcoin, which makes it easy to link the inputs of other transactions back to the originating transaction. Records do not contain block header data. This makes sense if you want to analyse each transaction independently anyway (e.g. if you want to do some analytics on the scripts within a transaction and combine the results later on).
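
The 32-byte transaction identifier mentioned above is Bitcoin's double hash: SHA-256 applied twice to the raw transaction bytes. A minimal illustration in plain Java (using only the JDK; the class name and input bytes are illustrative, not part of this library):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TxId {
    // Bitcoin identifies a transaction by SHA-256 applied twice to its raw bytes.
    public static byte[] doubleSha256(byte[] raw) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            return sha.digest(sha.digest(raw));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The result is always a 32-byte identifier, regardless of input size.
        byte[] txid = doubleSha256(new byte[]{0x01, 0x02, 0x03});
        System.out.println(txid.length);
    }
}
```

Because the identifier is deterministic, two occurrences of the same raw transaction bytes always map to the same key, which is what allows inputs to be joined against the transactions they spend.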

Build

Execute:

git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger

You can build the application by changing to the directory hadoopcryptoledger/inputformat and using the following command:

gradle clean build publishToMavenLocal
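
After publishToMavenLocal, the artifact can be referenced from another Gradle project. A sketch of the dependency declaration — the exact group, artifact name, and version are assumptions here; check the coordinates produced by your local build:

```groovy
repositories {
    mavenLocal()
}

dependencies {
    // Coordinates are illustrative; verify against your local Maven repository.
    compile group: 'com.github.zuinnote', name: 'hadoopcryptoledger-fileformat', version: '1.0.0'
}
```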

Use

Configure

The following configuration options exist:

  • "io.file.buffer.size": Size of the I/O buffer. Defaults to 64K.
  • "hadoopcryptoledger.bitcoinblockinputformat.maxblocksize": Maximum size of a Bitcoin block. Currently defaults to 1M. If you see exceptions related to this in the log (e.g. due to changes in the Bitcoin blockchain), increase this value.
  • "hadoopcryptoledger.bitcoinblockinputformat.filter.magic": A comma-separated list of valid magics to identify Bitcoin blocks in the blockchain data. Defaults to "F9BEB4D9" (Bitcoin main network). Other possibilities (see https://en.bitcoin.it/wiki/Protocol_documentation) are F9BEB4D9 (Bitcoin main network), FABFB5DA (testnet), 0B110907 (testnet3), F9BEB4FE (namecoin).
  • "hadoopcryptoledeger.bitcoinblockinputformat.usedirectbuffer": If true, a DirectByteBuffer is used instead of a HeapByteBuffer. This option is experimental and defaults to "false".
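
The magic filter above works by matching a 4-byte marker that precedes each block in the raw stream. A minimal sketch of such a match in plain Java (the class and method names are illustrative, not the library's actual internals):

```java
public class MagicCheck {
    // Bitcoin main-network magic F9BEB4D9 as it appears in the raw block stream.
    static final byte[] MAIN_MAGIC = {(byte) 0xF9, (byte) 0xBE, (byte) 0xB4, (byte) 0xD9};

    // Returns true if buf at the given offset starts with the given magic bytes.
    public static boolean matchesMagic(byte[] buf, int offset, byte[] magic) {
        if (offset < 0 || offset + magic.length > buf.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (buf[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] stream = {(byte) 0xF9, (byte) 0xBE, (byte) 0xB4, (byte) 0xD9, 0x00};
        System.out.println(matchesMagic(stream, 0, MAIN_MAGIC));
    }
}
```

Supplying several comma-separated magics simply means each candidate position is checked against each configured marker, which is why data from different networks (e.g. testnet3) can be read by adjusting this option alone.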

More Information

Understanding the structure of Bitcoin data:

Blocks: https://en.bitcoin.it/wiki/Block

Transactions: https://en.bitcoin.it/wiki/Transactions
