ASF Proposal
Hudi is a big-data storage library that provides atomic upserts and incremental data consumption.
Hudi manages data stored in Apache Hadoop and other API compatible distributed file systems/cloud stores.
Hudi provides the ability to atomically upsert datasets with new values in near-real time, making data available quickly to existing query engines like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a sequence of changes to a dataset from a given point-in-time to enable incremental data pipelines that yield greater efficiency & lower latency than their typical batch counterparts. By carefully managing the number & size of files, Hudi greatly aids both query engines (e.g: always providing well-sized files) and underlying storage (e.g: HDFS NameNode memory consumption).
Hudi is largely implemented as an Apache Spark library that reads/writes data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets are supported via specialized Apache Hadoop input formats that understand Hudi’s storage layout. Currently, Hudi manages datasets using a combination of Apache Parquet & Apache Avro file/serialization formats.
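To make the write path concrete, below is a minimal sketch of an upsert through Hudi's Spark datasource. The option names follow the org.apache.hudi namespace of later releases (the original Uber releases used the com.uber.hoodie namespace), and the table name, key fields, and paths are hypothetical:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: new & updated rows keyed by `uuid`, partitioned by `date`.
    val updates = spark.read.parquet("/tmp/staged_updates")

    // Upsert into a Hudi dataset: rows with matching record keys are updated
    // in place, new keys are inserted, and file counts/sizes are managed by Hudi.
    updates.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Append)
      .save("/tmp/hudi/trips")

    spark.stop()
  }
}
```

A dataset written this way can then be synced to a Hive-compatible metastore and queried through the Hudi input formats mentioned above.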
The Apache Hadoop Distributed File System (HDFS) & other compatible cloud storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as longer-term analytical storage for thousands of organizations. Typical analytical datasets are built by reading data from a source (e.g: upstream databases, messaging buses, or other datasets), transforming the data, writing results back to storage, & making it available for analytical queries--all of this typically accomplished in batch jobs that operate in bulk on partitions of datasets. Such a style of processing typically incurs large delays in making data available to queries, as well as a lot of complexity in carefully partitioning datasets to guarantee latency SLAs.
The need for fresher/faster analytics has increased enormously in the past few years, as evidenced by the popularity of stream processing systems like Apache Spark and Apache Flink, and messaging systems like Apache Kafka. Such systems take a different approach to building analytical datasets: they use an updatable state store to incrementally compute & instantly reflect new results to queries, and a “tailable” messaging bus to publish those results to other downstream jobs. Even though this approach yields low latency, the amount of data managed in such real-time data-marts is typically limited in comparison to the aforementioned longer-term storage options. As a result, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and a strain on usability.
Hudi takes a hybrid approach. Instead of moving vast amounts of batch data to streaming systems, we simply add the streaming primitives (upserts & incremental consumption) onto existing batch processing technologies. We believe that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at a reduced cost and with increased efficiency, greatly simplifying the overall architecture in the process.
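As an illustration of the incremental consumption primitive, the sketch below pulls only the records committed to a dataset after a given instant, instead of rescanning whole partitions the way a batch job would. The query-type and begin-instant option names are assumptions based on later Hudi releases (older releases spelled them differently), and the path and instant value are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HudiIncrementalPullSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-incremental-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read only the records committed after this (hypothetical) instant time.
    val changes = spark.read
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20190101000000")
      .load("/tmp/hudi/trips")

    // A downstream ETL job sees a stream-like feed of changes and can
    // incrementally update its own derived dataset.
    changes.createOrReplaceTempView("trip_changes")
    spark.sql("SELECT date, COUNT(*) FROM trip_changes GROUP BY date").show()

    spark.stop()
  }
}
```

Chaining such incremental pulls with upserts into downstream datasets is what turns conventional batch pipelines into the near-real-time pipelines described above.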
Hudi was originally developed at Uber (under its original name, “Hoodie”) to address broad inefficiencies in ingest, ETL, & ML pipelines across Uber’s data ecosystem that required the upsert & incremental consumption primitives Hudi supports.
We truly believe the capabilities supported by Hudi will be increasingly useful for big-data ecosystems, as data volumes & the need for faster data continue to increase. A detailed description of target use-cases can be found at https://uber.github.io/hudi/use_cases.html.
Given our reliance on so many great Apache projects, we believe that the Apache way of open source community driven development will enable us to evolve Hudi in collaboration with a diverse set of contributors who can bring new ideas into the project.
Initial Goals
- Move the existing codebase, website, documentation, and mailing lists to an Apache-hosted infrastructure.
- Integrate with the Apache development process.
- Ensure all dependencies are compliant with Apache License version 2.0.
- Incrementally develop and release per Apache guidelines.
Hudi is a stable project, used in production at Uber since 2016, and was open sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi manages 4000+ tables holding several petabytes, and over the past two years it has brought our Hadoop warehouse from several hours of data delay to under 30 minutes. The source code is currently hosted at github.com (https://github.com/uber/hudi), which will seed the Apache git repository.
- Meritocracy:
- Community:
- Core Developers:
- Alignment:
- Orphaned products:
- Inexperience with Open Source:
- Length of Incubation:
- Homogenous Developers:
- Reliance on Salaried Developers:
- Relationships with Other Apache Products:
- An Excessive Fascination with the Apache Brand:
Documentation
References to further reading material.
Examples (Heraldry):
[1] Information on Yadis can be found at: http://yadis.org
[2] Information on OpenID can be found at: http://www.openid.net and http://www.openidenabled.com
The mailing list for both OpenID and Yadis is located at: http://lists.danga.com/mailman/listinfo/yadis ...
Initial Source
Describes the origin of the proposed code base. If the initial code arrives from more than one source, this is the right place to outline the different histories.
If there is no initial source, note that here.
Example (Heraldry):
OpenID has been in development since the summer of 2005. It currently has an active community (over 15 million enabled accounts) and libraries in a variety of languages. Additionally, it is supported by LiveJournal.com and is continuing to gain traction in the Open Source community.
Yadis has been in development since late 2005 and the specification has not changed since early 2006. Like OpenID, it has libraries in various languages and there is a large overlap between the two communities. The specification is...
Source and Intellectual Property Submission Plan
Complex proposals (typically involving multiple code bases) may find it useful to draw up an initial plan for the submission of the code here. Demonstrate that the proposal is practical.
Example (Heraldry):
* The OpenID specification and content on openid.net from Brad Fitzpatrick of Six Apart, Ltd. and David Recordon of VeriSign, Inc.
* The domains openid.net and yadis.org from Brad Fitzpatrick of Six Apart, Ltd. and Johannes Ernst of NetMesh, Inc.
* OpenID libraries in Python, Ruby, Perl, PHP, and C# from JanRain, Inc. ...
* Yadis conformance test suite from NetMesh and VeriSign, Inc.
We will also be soliciting contributions of further plugins and patches to various pieces of Open Source software.
External Dependencies:
External dependencies for the initial source are important. Only some external dependencies are allowed by Apache policy. These restrictions are (to some extent) initially relaxed for projects under incubation.
If the initial source has dependencies which would prevent graduation then this is the right place to indicate how these issues will be resolved.
Example (CeltiXfire): The dependencies all have Apache-compatible licenses. These include BSD, CDDL, CPL, MPL, and MIT licensed dependencies.
Cryptography
If the proposal involves cryptographic code either directly or indirectly, Apache needs to know so that the relevant paperwork can be obtained.
Required Resources:
* '''Mailing lists:'''
The minimum required lists are private@{podling}.incubator.apache.org (for confidential PPMC discussions) and dev@{podling}.incubator.apache.org. Note that projects historically misnamed the private list pmc. To avoid confusion over appropriate usage, it was resolved that all such lists be renamed.
If this project is new to open source, then starting with these minimum lists is the best approach. The initial focus needs to be on recruiting new developers. Early adopters are potential developers. As momentum is gained, the community may decide to create commit and user lists as they become necessary.
Existing open source projects moving to Apache will probably want to adopt the same mailing list set up here as they have already. However, there is no necessity that all mailing lists be created during bootstrapping. New mailing lists can be added by a VOTE on the Podling list.
By default, commits for {podling} will be emailed to commits@{podling}.incubator.apache.org. It is therefore recommended that this naming convention is adopted.
Mailing list options are described at greater length elsewhere.
Example (Beehive):
* [email protected] (with moderated subscriptions)
* [email protected]
* [email protected]
* '''Subversion Directory:'''
It is conventional to use all lower case, dash-separated (-) directory names. The directory should be within the incubator directory space (http://svn.apache.org/repos/asf/incubator).
Example (OpenJPA):
https://svn.apache.org/repos/asf/incubator/openjpa
* '''Git Repositories:''' It is conventional to use all lower case, dash-separated (-) repository names. The repository should be prefixed with incubator and later renamed assuming the project is promoted to a TLP.
Example (Blur):
https://git-wip-us.apache.org/repos/asf/incubator-blur.git
* '''Issue Tracking:'''
Apache runs JIRA and Bugzilla. Choose one. Indicate the name by which the project should be known in the issue tracking system.
Example (OpenJPA): JIRA Open-JPA (OPEN-JPA)
* '''Other Resources:'''
Describe here any other special infrastructure requirements necessary for the proposal. Note that the infrastructure team usually requires a compelling argument before new services are allowed on core hardware. Most proposals should not require this section.
Most standard resources not covered above (such as continuous integration) should be added after bootstrapping. The infrastructure documentation explains the process.
Initial Committers
List of committers (stating name and an email address) used to bootstrap the community. Mark each which has submitted a contributor license agreement (CLA). Existing committers should use their apache.org email address (since they require only appropriate karma). Others should use the email address that is (or will be) on the CLA. That makes it easy to match CLAs with proposed committers to the project.
It is a good idea to submit CLAs at the same time as the proposal. Nothing is lost by having a CLA on file at Apache but processing may take some time.
Consider creating a separate section where interested developers can express an interest (and possibly leave a brief introduction), or ask them to post to the general list.
Example (OpenJPA):
Abe White (awhite at bea dot com)
Marc Prud'hommeaux (mprudhom at bea dot com)
Patrick Linskey (plinskey at bea dot com)
...
Geir Magnusson Jr (geirm at apache dot org) *
Craig Russell (clr at apache dot org) *
Sponsors
This is a little bit of a controversial subject. Committers at Apache are individuals and work here on their own behalf. They are judged on their merits, not their affiliations. However, in the spirit of full disclosure, it is useful for any current affiliations which may affect the perceived independence of the initial committers to be listed openly at the start.
For example, those in salaried positions whose job is to work on the project should list their affiliation. Having this list helps to judge how much diversity exists in the initial list and so how much work there is to do.
This is best done in a separate section away from the committers list.
Only the affiliations of committers on the initial bootstrap list are relevant. These committers have not been added by the usual meritocratic process. It is strongly recommended that once a project is bootstrapped, developers are judged by their contributions and not by their background. This list should not be maintained after the bootstrap has been completed.
* '''Champion:''' The Champion is a person already associated with Apache who leads the proposal process. It is common - but not necessary - for the Champion to also be proposed as a Mentor.
A Champion should be found while the proposal is still being formulated. Their role is to help formulate the proposal and work with you to resolve comments and questions put forth by the IPMC while reviewing the proposal.
* '''Nominated Mentors:'''
Lists eligible (and willing) individuals nominated as Mentors for the candidate.
Three Mentors gives a quorum and allows a Podling more autonomy from the Incubator PMC, so the current consensus is that three Mentors is a good number. Any experienced Apache community member can provide informal mentorship anyway; what's important is to make sure the podling has enough regularly available mentors to progress smoothly. There is no restriction on the number of mentors, formal or informal, a Podling may have.
* '''Sponsoring Entity''': The Sponsor is the organizational unit within Apache taking responsibility for this proposal. The sponsoring entity can be:
- the Apache Board
- the Incubator
- another Apache project
The PMC for the appropriate project will decide whether to sponsor (by a vote). Unless there are strong links to an existing Apache project, it is recommended that the proposal ask the Incubator for sponsorship.
Note that the final destination within the Apache organizational structure will be decided upon graduation.