diff --git a/content/en/docs/Contribution Guidelines/contributing.md b/content/en/docs/Contribution Guidelines/contributing.md
index 58b3bf55..b9655567 100644
--- a/content/en/docs/Contribution Guidelines/contributing.md
+++ b/content/en/docs/Contribution Guidelines/contributing.md
@@ -23,7 +23,7 @@ If you’d like to report a bug but don’t have time to fix it, you can still p
 Committers
 ----------
 
-Merging a pull request requires being a comitter on the project.
+Merging a pull request requires being a committer on the project.
 
 How to merge a Pull request (have an apache and github-apache remote setup):
 
diff --git a/content/en/docs/Contribution Guidelines/releasing.md b/content/en/docs/Contribution Guidelines/releasing.md
index 7ea25cec..2a8c3055 100644
--- a/content/en/docs/Contribution Guidelines/releasing.md
+++ b/content/en/docs/Contribution Guidelines/releasing.md
@@ -19,7 +19,7 @@ If you have problems, read the [publishing Maven artifacts documentation](https:
 
 ### Release process
 
-Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the offical source tarball is built from the tag.
+Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the official source tarball is built from the tag.
 
 Before you start the release process:
 
@@ -153,7 +153,7 @@ Then add and commit the release artifacts:
 
 #### 4\. Update parquet.apache.org
 
-Update the downloads page on parquet.apache.org. Instructions for updating the site are on the [contribution page](http://parquet.apache.org/docs/contribution-guidelines/contributing/).
+Update the downloads page on parquet.apache.org. Instructions for updating the site are on the [contribution page](https://parquet.apache.org/docs/contribution-guidelines/contributing/).
 
 #### 5\. Send an ANNOUNCE e-mail to [announce@apache.org](mailto:announce@apache.org) and the dev list
 
diff --git a/content/en/docs/File Format/Data Pages/compression.md b/content/en/docs/File Format/Data Pages/compression.md
index 32176121..73922925 100644
--- a/content/en/docs/File Format/Data Pages/compression.md
+++ b/content/en/docs/File Format/Data Pages/compression.md
@@ -47,7 +47,7 @@ that writers refrain from creating such pages by default for better interoperabi
 ### LZO
 
 A codec based on or interoperable with the
-[LZO compression library](http://www.oberhumer.com/opensource/lzo/).
+[LZO compression library](https://www.oberhumer.com/opensource/lzo/).
 
 ### BROTLI
 
@@ -73,11 +73,11 @@ switch to the newer, interoperable `LZ4_RAW` codec.
 A codec based on the Zstandard format defined by
 [RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises
 when implementing this format, the implementation provided by the
-[ZStandard compression library](https://facebook.github.io/zstd/)
+[Zstandard compression library](https://facebook.github.io/zstd/)
 is authoritative.
 
 ### LZ4_RAW
 
 A codec based on the [LZ4 block format](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
 If any ambiguity arises when implementing this format, the implementation
-provided by the [LZ4 compression library](http://www.lz4.org/) is authoritative.
+provided by the [LZ4 compression library](https://www.lz4.org/) is authoritative.
diff --git a/content/en/docs/File Format/Data Pages/encodings.md b/content/en/docs/File Format/Data Pages/encodings.md
index 3ff8d059..ea27d46a 100644
--- a/content/en/docs/File Format/Data Pages/encodings.md
+++ b/content/en/docs/File Format/Data Pages/encodings.md
@@ -158,7 +158,7 @@ repetition and definition levels.
 Supported Types: INT32, INT64
 
 This encoding is adapted from the Binary packing described in
-["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
+["Decoding billions of integers per second through vectorization"](https://arxiv.org/pdf/1209.2137v5.pdf)
 by D. Lemire and L. Boytsov.
 
 In delta encoding we make use of variable length integers for storing various
@@ -189,7 +189,7 @@ Each block contains
  positive integers for bit packing)
 * the bitwidth of each block is stored as a byte
 * each miniblock is a list of bit packed ints according to the bit width
- stored at the begining of the block
+ stored at the beginning of the block
 
 To encode a block, we will:
 
diff --git a/content/en/docs/File Format/Data Pages/encryption.md b/content/en/docs/File Format/Data Pages/encryption.md
index 1f736c50..62f803ef 100644
--- a/content/en/docs/File Format/Data Pages/encryption.md
+++ b/content/en/docs/File Format/Data Pages/encryption.md
@@ -189,7 +189,7 @@ data set (table). This string is optionally passed by a writer upon file creatio
 the AAD prefix is stored in an `aad_prefix` field in the file, and is made available to the readers.
 This field is not encrypted. If a user is concerned about keeping the file identity inside the file,
 the writer code can explicitly request Parquet not to store the AAD prefix. Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and supplied explicitly to Parquet
 readers for each file.
 
 The protection against swapping full files is optional. It is not enabled by default because
diff --git a/content/en/docs/File Format/bloomfilter.md b/content/en/docs/File Format/bloomfilter.md
index e4203b46..6fe0aaf0 100644
--- a/content/en/docs/File Format/bloomfilter.md
+++ b/content/en/docs/File Format/bloomfilter.md
@@ -154,7 +154,7 @@ unsigned int32 i = (h_top_bits * z_as_64_bit) >> 32;
 ```
 
 The first line extracts the most significant 32 bits from `h` and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
 simpler: it just sets an unsigned 64-bit value to the same value as
 the 32-bit unsigned value `z`. The purpose of having both `h_top_bits`
 and `z_as_64_bit` be 64-bit values is so that their product is a
@@ -205,7 +205,7 @@ boolean filter_check(SBBF filter, unsigned int64 x) {
 
 The use of blocks is from Putze et al.'s [Cache-, Hash- and
 Space-Efficient Bloom
-filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
+filters](https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf)
 
 To use an SBBF for values of arbitrary Parquet types, we apply a hash
 function to that value - at the time of writing,
@@ -217,14 +217,14 @@ with a seed of 0 and [following the specification version
 
 The `check` operation in SBBFs can return `true` for an argument that
 was never inserted into the SBBF. These are called "false
-positives". The "false positive probabilty" is the probability that
+positives". The "false positive probability" is the probability that
 any given hash value that was never `insert`ed into the SBBF will
 cause `check` to return `true` (a false positive). There is not a
 simple closed-form calculation of this probability, but here is an
 example:
 
 A filter that uses 1024 blocks and has had 26,214 hash values
-`insert`ed will have a false positive probabilty of around 1.26%. Each
+`insert`ed will have a false positive probability of around 1.26%. Each
 of those 1024 blocks occupies 256 bits of space, so the total space
 usage is 262,144. That means that the ratio of bits of space to hash
 values is 10-to-1. Adding more hash values increases the denominator