Hadoop File Format
This Hadoop file format reads blocks and transactions from files in HDFS containing crypto ledger data. They can be processed by any MapReduce/Tez/Spark application. Currently the following crypto ledgers are supported:
- Bitcoin. This module provides three input formats (a usage sketch follows this list):
- BitcoinBlockInputFormat: Deserializes blocks containing transactions into Java objects. Each record is an object of the class BitcoinBlock containing transactions (class BitcoinTransaction). Most suitable if you want flexible analytics. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes).
- BitcoinRawBlockInputFormat: Each record is a byte array containing the raw Bitcoin block data. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes). Most suitable if you are only interested in a small part of the data and do not want to waste time on deserialization.
- BitcoinTransactionInputFormat: Deserializes Bitcoin transactions into Java objects. Each record is an object of class BitcoinTransaction. Transactions are identifiable by their double hash value (32 bytes) as specified in the Bitcoin specification, which makes it easy to link the inputs of a transaction to the originating transaction. Records do not contain block header data. This makes sense if you want to analyse each transaction independently anyway (e.g. if you want to do some analytics on the scripts within a transaction and combine the results later on).
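For illustration, here is a minimal MapReduce driver that counts transactions using the block format. The package and class names (org.zuinnote.hadoop.bitcoin.format...) and the BitcoinBlock.getTransactions() accessor are assumptions based on the description above; verify the exact names against the project's javadoc.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Assumed package/class names -- verify against the hadoopcryptoledger javadoc
import org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlock;
import org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockFileInputFormat;

public class BitcoinTransactionCount {

  // For each deserialized block, emit the number of transactions it contains
  public static class BlockMapper
      extends Mapper<BytesWritable, BitcoinBlock, Text, IntWritable> {
    private static final Text KEY = new Text("transactions");

    @Override
    protected void map(BytesWritable key, BitcoinBlock block, Context context)
        throws IOException, InterruptedException {
      // getTransactions() returning a list is an assumption
      context.write(KEY, new IntWritable(block.getTransactions().size()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bitcoin-transaction-count");
    job.setJarByClass(BitcoinTransactionCount.class);
    job.setMapperClass(BlockMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Plug in the hadoopcryptoledger block input format
    job.setInputFormatClass(BitcoinBlockFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```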
Note that the Hadoop File Format is available on Maven Central, so you no longer need to build it and publish it to a local Maven repository in order to use it.
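The dependency can then be declared directly in your build. A minimal Gradle sketch, assuming the coordinates com.github.zuinnote:hadoopcryptoledger-fileformat (verify group, artifact, and current version on Maven Central):

```
dependencies {
    // Coordinates and version are assumptions -- check Maven Central
    compile group: 'com.github.zuinnote', name: 'hadoopcryptoledger-fileformat', version: '1.0.1'
}
```

Alternatively, to build from source and publish to your local Maven repository: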
Execute:
git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger
You can build the library by changing to the directory hadoopcryptoledger/inputformat and running the following command:
../gradlew clean build publishToMavenLocal
- Count the number of transactions from files containing Bitcoin Blockchain data
- Count the total number of inputs of all transactions from files containing Bitcoin Blockchain data
- Use Spark to count the number of transactions from files containing Bitcoin Blockchain data (a minimal sketch follows this list)
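The Spark variant reduces to a few lines via newAPIHadoopFile, Spark's standard entry point for Hadoop input formats. A minimal sketch in Java, under the same assumptions about class and package names as the MapReduce example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Assumed package/class names -- verify against the hadoopcryptoledger javadoc
import org.zuinnote.hadoop.bitcoin.format.common.BitcoinBlock;
import org.zuinnote.hadoop.bitcoin.format.mapreduce.BitcoinBlockFileInputFormat;

public class SparkTransactionCount {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("bitcoin-transaction-count");
    try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
      // Read blocks via the Hadoop input format; keys are the 64-byte block identifiers
      JavaPairRDD<BytesWritable, BitcoinBlock> blocks = sc.newAPIHadoopFile(
          args[0], BitcoinBlockFileInputFormat.class,
          BytesWritable.class, BitcoinBlock.class, new Configuration());
      // Sum the per-block transaction counts (getTransactions() is an assumption)
      long total = blocks.values()
          .map(block -> (long) block.getTransactions().size())
          .reduce(Long::sum);
      System.out.println("Total transactions: " + total);
    }
  }
}
```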
The following configuration options exist (a configuration sketch follows this list):
- "io.file.buffer.size": Size of io Buffer. Defaults to 64K
- "hadoopcryptoledger.bitcoinblockinputformat.maxblocksize": Maximum size of a Bitcoin block. Defaults (since version 1.0.1) to: 2M. If you see exceptions related to this in the log (e.g. due to changes in the Bitcoin blockchain) then increase this.
- "hadoopcryptoledger.bitcoinblockinputformat.filter.magic": A comma-separated list of valid magics to identify Bitcoin blocks in the blockchain data. Defaults to "F9BEB4D9" (Bitcoin main network). Other Possibilities are are (https://en.bitcoin.it/wiki/Protocol_documentation) F9BEB4D9 (Bitcoin main network), FABFB5DA (testnet) ,0B110907 (testnet3), F9BEB4FE (namecoin), FBC0B6DB (Litecoin), FCC1B7DC (Litecoin Testnet)
- "hadoopcryptoledeger.bitcoinblockinputformat.usedirectbuffer": If true then DirectByteBuffer instead of HeapByteBuffer will be used. This option is experimental and defaults to "false".
- "hadoopcryptoledeger.bitcoinblockinputformat.issplitable" (since version 1.0.1): if true then we use the default Hadoop FileInputFormat mechanism to split files (if possible). This implies using a heuristic to find the start of a BitcoinBlock using the magic number. While this should work normally in all of the cases, it cannot be excluded that it uniquely marks the start of a Bitcoin block (e.g. in case it is part of a hash). Defaults to "false". In case of "false" it is recommended to create multiple files of at least the size of one or multiple HDFS blocks containing Bitcoin Blockchain data.
Understanding the structure of Bitcoin data:
Blocks: https://en.bitcoin.it/wiki/Block
Transactions: https://en.bitcoin.it/wiki/Transactions