Hadoop File Format
This Hadoop file format reads transactions and blocks from files in HDFS containing cryptocurrency ledger data. It can be used by any MapReduce/Tez/Spark application to process them. Currently the following crypto ledgers are supported:
- Bitcoin. This module provides three formats:
- BitcoinBlockInputFormat: Deserializes blocks containing transactions into Java object(s). Each record is an object of the class BitcoinBlock containing transactions (class BitcoinTransaction) including SegWit information. Best suited if you want flexible analytics. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes).
- BitcoinRawBlockInputFormat: Each record is a byte array containing the raw Bitcoin block data. The key (i.e. unique identifier) of the block is currently a byte array containing hashMerkleRoot and prevHashBlock (64 bytes). This is most suitable if you are only interested in a small part of the data and do not want to spend time on deserialization.
- BitcoinTransactionInputFormat: Deserializes Bitcoin transactions into Java object(s). Each record is an object of the class BitcoinTransaction including SegWit information. Transactions are identifiable by their double-hash value (32 bytes) as specified in the Bitcoin specification, which makes it easy to link the inputs of other transactions to the originating transaction. The records do not contain block header data. This makes sense if you want to analyse each transaction independently anyway (e.g. if you want to run some analytics on the scripts within a transaction and combine the results later on).
- Ethereum. This module provides two formats:
- EthereumBlockInputFormat: Deserializes blocks containing transactions into Java object(s). Each record is an object of the class EthereumBlock containing transactions (class EthereumTransaction) and uncle headers (uncleHeaders). Best suited if you want flexible analytics. The key (i.e. unique identifier) of the block is currently a byte array containing the parentHash.
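As a small illustration of the Bitcoin block key described above, the following sketch builds the 64-byte key by concatenating a 32-byte hashMerkleRoot with a 32-byte prevHashBlock. Note that `blockKey` is a hypothetical helper written for this example, not part of the library's API:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BlockKeySketch {

    // Hypothetical helper: lays out the 64-byte record key described in the docs,
    // i.e. the 32-byte hashMerkleRoot followed by the 32-byte prevHashBlock.
    public static byte[] blockKey(byte[] hashMerkleRoot, byte[] prevHashBlock) {
        if (hashMerkleRoot.length != 32 || prevHashBlock.length != 32) {
            throw new IllegalArgumentException("both hashes must be 32 bytes");
        }
        return ByteBuffer.allocate(64)
                .put(hashMerkleRoot)
                .put(prevHashBlock)
                .array();
    }

    public static void main(String[] args) {
        byte[] merkle = new byte[32];
        Arrays.fill(merkle, (byte) 0x01); // dummy hashMerkleRoot
        byte[] prev = new byte[32];
        Arrays.fill(prev, (byte) 0x02);   // dummy prevHashBlock
        byte[] key = blockKey(merkle, prev);
        System.out.println(key.length); // 64
    }
}
```

Since the key is just these two hashes back to back, two blocks with the same Merkle root and previous block hash would collide, which is why the documentation calls it the "current" key rather than a guaranteed unique identifier.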
Note that the Hadoop File Format is available on Maven Central, so you no longer need to build it and publish it to a local Maven repository in order to use it.
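For example, a Maven dependency declaration might look like the following. The coordinates and version below are an assumption for illustration; check Maven Central for the exact artifact name and the latest release:

```xml
<!-- assumed coordinates; verify on Maven Central before use -->
<dependency>
  <groupId>com.github.zuinnote</groupId>
  <artifactId>hadoopcryptoledger-fileformat</artifactId>
  <version>1.1.0</version>
</dependency>
```

Alternatively, you can build the library from source as described below.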
Execute:
git clone https://github.com/ZuInnoTe/hadoopcryptoledger.git hadoopcryptoledger
You can build the application by changing to the directory hadoopcryptoledger/inputformat and using the following command:
../gradlew clean build publishToMavenLocal
See the main wiki page for further information.
The following configuration options exist:
- "io.file.buffer.size": Size of the I/O buffer. Defaults to 64K.
- Bitcoin and Altcoins
- "hadoopcryptoledger.bitcoinblockinputformat.maxblocksize": Maximum size of a Bitcoin block. Defaults (since version 1.0.7) to: 8M. If you see exceptions related to this in the log (e.g. due to changes in the Bitcoin blockchain), then increase this value.
- "hadoopcryptoledger.bitcoinblockinputformat.filter.magic": A comma-separated list of valid magics to identify Bitcoin blocks in the blockchain data. Defaults to "F9BEB4D9" (Bitcoin main network, also used by other Altcoins, such as MultiChain). Other possibilities are (https://en.bitcoin.it/wiki/Protocol_documentation): F9BEB4D9 (Bitcoin main network), 0B110907 (Bitcoin testnet3), FABFB5DA (Bitcoin testnet), F9BEB4FE (Namecoin), FBC0B6DB (Litecoin), FCC1B7DC (Litecoin testnet), 24E92764 (Zcash main network), FA1AF9BF (Zcash testnet), E6E8E9E5 (Emercoin main network), CBF2C0EF (Emercoin test network), E6E8E9E5 (Peercoin), CBF2C0EF (Peercoin testnet), 6E8B92A5 (Slimcoin), 4D2AE1AB (Slimcoin testnet)
- "hadoopcryptoledeger.bitcoinblockinputformat.usedirectbuffer": If true then DirectByteBuffer instead of HeapByteBuffer will be used. This option is experimental and defaults to "false".
- "hadoopcryptoledeger.bitcoinblockinputformat.issplitable" (since version 1.0.1): if true then we use the default Hadoop FileInputFormat mechanism to split files (if possible). This implies using a heuristic to find the start of a BitcoinBlock using the magic number. While this should normally work in all cases, it cannot be guaranteed that the magic number uniquely marks the start of a Bitcoin block (e.g. the same byte sequence may occur as part of a hash). Defaults to "false". If set to "false", it is recommended to create multiple files, each at least the size of one or multiple HDFS blocks, containing Bitcoin blockchain data. This is, for example, done automatically by Bitcoin Core.
- "hadoopcryptoledeger.bitcoinblockinputformat.readauxpow" (since version 1.0.8): if true then Altcoins using Merged Mining/AuxPOW, such as Namecoin, can be parsed properly and additional information about merged mining is available. Default: false
- Ethereum and Altcoins (since 1.1.0)
- "hadoopcryptoledger.ethereumblockinputformat.maxblocksize": Maximum size of an Ethereum block. Defaults to: 1M. If you see exceptions related to this in the log (e.g. due to changes in the Ethereum blockchain), then increase this value.
- "hadoopcryptoledeger.ethereumblockinputformat.usedirectbuffer": If true then DirectByteBuffer instead of HeapByteBuffer will be used. This option is experimental and defaults to "false".
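These options can be set like any other Hadoop configuration property, e.g. in the job configuration or in an XML configuration file. The values below are illustrative only:

```xml
<!-- illustrative values; adjust to your data -->
<property>
  <name>hadoopcryptoledger.bitcoinblockinputformat.maxblocksize</name>
  <!-- e.g. raise to 16M after seeing maxblocksize-related exceptions -->
  <value>16777216</value>
</property>
<property>
  <name>hadoopcryptoledger.bitcoinblockinputformat.filter.magic</name>
  <!-- accept Bitcoin main network and testnet blocks -->
  <value>F9BEB4D9,FABFB5DA</value>
</property>
```

Note that some property names really do contain the spelling "hadoopcryptoledeger" (see the list above); they must be used exactly as listed or they will be silently ignored.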
Understanding the structure of Bitcoin data:
Blocks: https://en.bitcoin.it/wiki/Block
Transactions: https://en.bitcoin.it/wiki/Transactions
Understanding Segwit information: https://github.com/bitcoin/bips/blob/master/bip-0141.mediawiki
Understanding Merged Mining/AuxPOW: https://en.bitcoin.it/wiki/Merged_mining_specification
Understanding the structure of Ethereum data: http://yellowpaper.io/