This repository was archived by the owner on Sep 29, 2025. It is now read-only.
Hadoop File Format
Jörn Franke edited this page Jun 27, 2016
This Hadoop file format reads transactions and blocks from files in HDFS containing crypto-ledger data, so that any MapReduce/Tez/Spark application can process them. Currently the following crypto ledgers are supported:
- Bitcoin. This module provides three input formats:
- BitcoinBlockInputformat: Deserializes blocks containing transactions into Java objects. Each record is an object of the class BitcoinBlock containing transactions (class BitcoinTransaction). This format is best suited if you want flexible analytics. The key (i.e. unique identifier) of a block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes).
- BitcoinRawBlockInputformat: Each record is a byte array containing the raw Bitcoin block data. The key (i.e. unique identifier) of a block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes). This format is most suitable if you are only interested in a small part of the data and do not want to spend time on deserialization.
- BitcoinTransactionInputFormat: Deserializes Bitcoin transactions into Java objects. Each record is an object of class BitcoinTransaction. Transactions are identifiable by their double hash value (32 bytes) as specified in the Bitcoin specification, which makes it easy to link the inputs of other transactions to the originating transaction. Records do not contain block header data. This format makes sense if you want to analyse each transaction independently anyway (e.g. if you want to do some analytics on the scripts within a transaction and combine the results later on).
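The 64-byte block key described above can be illustrated with a small sketch. This is not library code: the method name and the concatenation order (hashMerkleRoot first, then prevHashBlock) are assumptions used only to show how two 32-byte hashes combine into one 64-byte key.

```java
import java.util.Arrays;

// Illustrative sketch (not hadoopcryptoledger library code): composing a
// 64-byte block key from the 32-byte hashMerkleRoot and the 32-byte
// prevHashBlock. Method name and field order are hypothetical.
public class BlockKeySketch {
    static byte[] composeBlockKey(byte[] hashMerkleRoot, byte[] prevHashBlock) {
        if (hashMerkleRoot.length != 32 || prevHashBlock.length != 32) {
            throw new IllegalArgumentException("expected 32-byte hashes");
        }
        byte[] key = new byte[64];
        System.arraycopy(hashMerkleRoot, 0, key, 0, 32);
        System.arraycopy(prevHashBlock, 0, key, 32, 32);
        return key;
    }

    public static void main(String[] args) {
        byte[] merkle = new byte[32];
        byte[] prev = new byte[32];
        Arrays.fill(merkle, (byte) 0x01);
        Arrays.fill(prev, (byte) 0x02);
        byte[] key = composeBlockKey(merkle, prev);
        System.out.println(key.length);              // total key length
        System.out.println(key[0] + " " + key[32]);  // first byte of each half
    }
}
```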
Execute:
git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger
You can build the application by changing to the directory hadoopcryptoledger/inputformat and using the following command:
gradle clean build publishToMavenLocal
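After publishing to the local Maven repository, the library can be referenced from your own Gradle build. The coordinates below are an assumption for illustration; check the project's build script for the exact group, artifact name, and version.

```groovy
// Hypothetical dependency declaration; verify coordinates against the
// project's build.gradle before use.
repositories {
    mavenLocal()
}
dependencies {
    compile 'com.github.zuinnote:hadoopcryptoledger-fileformat:1.0.0'
}
```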
The following examples are available:
- Count the number of transactions from files containing Bitcoin Blockchain data
- Count the total number of inputs of all transactions from files containing Bitcoin Blockchain data
- Use Spark to count the number of transactions from files containing Bitcoin Blockchain data
The following configuration options exist:
- "io.file.buffer.size": Size of the I/O buffer. Defaults to 64 KB.
- "hadoopcryptoledger.bitcoinblockinputformat.maxblocksize": Maximum size of a Bitcoin block. Currently defaults to 1 MB. If you see exceptions related to this in the log (e.g. due to changes in the Bitcoin blockchain), increase this value.
- "hadoopcryptoledger.bitcoinblockinputformat.filter.magic": A comma-separated list of valid magics to identify Bitcoin blocks in the blockchain data. Defaults to "F9BEB4D9" (Bitcoin main network). Other possibilities are (see https://en.bitcoin.it/wiki/Protocol_documentation): F9BEB4D9 (Bitcoin main network), FABFB5DA (testnet), 0B110907 (testnet3), F9BEB4FE (namecoin).
- "hadoopcryptoledeger.bitcoinblockinputformat.usedirectbuffer": If true, a DirectByteBuffer is used instead of a HeapByteBuffer. This option is experimental and defaults to "false".
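The magic-filter option above can be illustrated with a self-contained sketch. This is not the library's implementation: the class and method names are hypothetical, and it only shows the idea of matching the first four bytes of a buffer against a list of configured magic values.

```java
import java.util.List;

// Illustrative sketch (not hadoopcryptoledger library code): checking
// whether a buffer starts with one of the known Bitcoin network magics
// listed in the "filter.magic" option. Names are hypothetical.
public class MagicFilterSketch {
    // Magics from https://en.bitcoin.it/wiki/Protocol_documentation
    static final List<String> MAGICS =
        List.of("F9BEB4D9", "FABFB5DA", "0B110907", "F9BEB4FE");

    static boolean startsWithKnownMagic(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            hex.append(String.format("%02X", data[i]));
        }
        return MAGICS.contains(hex.toString());
    }

    public static void main(String[] args) {
        // First four bytes are the Bitcoin main-network magic F9BEB4D9.
        byte[] mainnet = {(byte) 0xF9, (byte) 0xBE, (byte) 0xB4, (byte) 0xD9, 0x00};
        byte[] garbage = {0x00, 0x11, 0x22, 0x33};
        System.out.println(startsWithKnownMagic(mainnet));
        System.out.println(startsWithKnownMagic(garbage));
    }
}
```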
Understanding the structure of Bitcoin data:
Blocks: https://en.bitcoin.it/wiki/Block
Transactions: https://en.bitcoin.it/wiki/Transactions