Skip to content

Commit

Permalink
Merge pull request #1 from dropbox/master
Browse files Browse the repository at this point in the history
Merge upstream
  • Loading branch information
ptallada authored Aug 28, 2024
2 parents 4754b49 + ac09074 commit 223444a
Show file tree
Hide file tree
Showing 20 changed files with 804 additions and 153 deletions.
87 changes: 63 additions & 24 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,31 +1,31 @@
================================
Project is currently unsupported
================================
========================================================
PyHive project has been donated to Apache Kyuubi
========================================================

You can follow it's development and report any issues you are experiencing here: https://github.com/apache/kyuubi/tree/master/python/pyhive



.. image:: https://travis-ci.org/dropbox/PyHive.svg?branch=master
:target: https://travis-ci.org/dropbox/PyHive
.. image:: https://img.shields.io/codecov/c/github/dropbox/PyHive.svg
Legacy notes / instructions
===========================

======
PyHive
======
**********


PyHive is a collection of Python `DB-API <http://www.python.org/dev/peps/pep-0249/>`_ and
`SQLAlchemy <http://www.sqlalchemy.org/>`_ interfaces for `Presto <http://prestodb.io/>`_ and
`Hive <http://hive.apache.org/>`_.
`SQLAlchemy <http://www.sqlalchemy.org/>`_ interfaces for `Presto <http://prestodb.io/>`_ ,
`Hive <http://hive.apache.org/>`_ and `Trino <https://trino.io/>`_.

Usage
=====
**********

DB-API
------
.. code-block:: python
from pyhive import presto # or import hive or import trino
cursor = presto.connect('localhost').cursor()
cursor = presto.connect('localhost').cursor() # or use hive.connect or use trino.connect
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print cursor.fetchone()
print cursor.fetchall()
Expand Down Expand Up @@ -61,7 +61,7 @@ In Python 3.7 `async` became a keyword; you can use `async_` instead:
SQLAlchemy
----------
First install this package to register it with SQLAlchemy (see ``setup.py``).
First install this package to register it with SQLAlchemy, see ``entry_points`` in ``setup.py``.

.. code-block:: python
Expand All @@ -71,9 +71,11 @@ First install this package to register it with SQLAlchemy (see ``setup.py``).
# Presto
engine = create_engine('presto://localhost:8080/hive/default')
# Trino
engine = create_engine('trino://localhost:8080/hive/default')
engine = create_engine('trino+pyhive://localhost:8080/hive/default')
# Hive
engine = create_engine('hive://localhost:10000/default')
# SQLAlchemy < 2.0
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print select([func.count('*')], from_obj=logs).scalar()
Expand All @@ -82,6 +84,20 @@ First install this package to register it with SQLAlchemy (see ``setup.py``).
logs = Table('my_awesome_data', MetaData(bind=engine), autoload=True)
print select([func.count('*')], from_obj=logs).scalar()
# SQLAlchemy >= 2.0
metadata_obj = MetaData()
books = Table("books", metadata_obj, Column("id", Integer), Column("title", String), Column("primary_author", String))
metadata_obj.create_all(engine)
inspector = inspect(engine)
inspector.get_columns('books')
with engine.connect() as con:
data = [{ "id": 1, "title": "The Hobbit", "primary_author": "Tolkien" },
{ "id": 2, "title": "The Silmarillion", "primary_author": "Tolkien" }]
con.execute(books.insert(), data[0])
result = con.execute(text("select * from books"))
print(result.fetchall())
Note: query generation functionality is not exhaustive or fully tested, but there should be no
problem with raw SQL.

Expand All @@ -101,7 +117,7 @@ Passing session configuration
'session_props': {'query_max_run_time': '1234m'}}
)
create_engine(
'trino://user@host:443/hive',
'trino+pyhive://user@host:443/hive',
connect_args={'protocol': 'https',
'session_props': {'query_max_run_time': '1234m'}}
)
Expand All @@ -116,27 +132,30 @@ Passing session configuration
)
Requirements
============
************

Install using

- ``pip install 'pyhive[hive]'`` for the Hive interface and
- ``pip install 'pyhive[presto]'`` for the Presto interface.
- ``pip install 'pyhive[hive]'`` or ``pip install 'pyhive[hive_pure_sasl]'`` for the Hive interface
- ``pip install 'pyhive[presto]'`` for the Presto interface
- ``pip install 'pyhive[trino]'`` for the Trino interface

Note: ``'pyhive[hive]'`` extras uses `sasl <https://pypi.org/project/sasl/>`_ that doesn't support Python 3.11, See `github issue <https://github.com/cloudera/python-sasl/issues/30>`_.
Hence PyHive also supports `pure-sasl <https://pypi.org/project/pure-sasl/>`_ via additional extras ``'pyhive[hive_pure_sasl]'`` which support Python 3.11.

PyHive works with

- Python 2.7 / Python 3
- For Presto: Presto install
- For Trino: Trino install
- For Presto: `Presto installation <https://prestodb.io/docs/current/installation.html>`_
- For Trino: `Trino installation <https://trino.io/docs/current/installation.html>`_
- For Hive: `HiveServer2 <https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2>`_ daemon

Changelog
=========
*********
See https://github.com/dropbox/PyHive/releases.

Contributing
============
************
- Please fill out the Dropbox Contributor License Agreement at https://opensource.dropbox.com/cla/ and note this in your pull request.
- Changes must come with tests, with the exception of trivial things like fixing comments. See .travis.yml for the test environment setup.
- Notes on project scope:
Expand All @@ -146,8 +165,28 @@ Contributing
- We prefer having a small number of generic features over a large number of specialized, inflexible features.
For example, the Presto code takes an arbitrary ``requests_session`` argument for customizing HTTP calls, as opposed to having a separate parameter/branch for each ``requests`` option.

Tips for test environment setup
****************************************
You can setup test environment by following ``.travis.yaml`` in this repository. It uses `Cloudera's CDH 5 <https://docs.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_download_510.html>`_ which requires username and password for download.
It may not be feasible for everyone to get those credentials. Hence below are alternative instructions to setup test environment.

You can clone `this repository <https://github.com/big-data-europe/docker-hive/blob/master/docker-compose.yml>`_ which has Docker Compose setup for Presto and Hive.
You can add below lines to its docker-compose.yaml to start Trino in same environment::
trino:
image: trinodb/trino:351
ports:
- "18080:18080"
volumes:
- ./trino:/etc/trino

Note: ``./trino`` for docker volume defined above is `trino config from PyHive repository <https://github.com/dropbox/PyHive/tree/master/scripts/travis-conf/trino>`_

Then run::
docker-compose up -d

Testing
=======
*******
.. image:: https://travis-ci.org/dropbox/PyHive.svg
:target: https://travis-ci.org/dropbox/PyHive
.. image:: http://codecov.io/github/dropbox/PyHive/coverage.svg?branch=master
Expand All @@ -166,7 +205,7 @@ WARNING: This drops/creates tables named ``one_row``, ``one_row_complex``, and `
database called ``pyhive_test_database``.

Updating TCLIService
====================
********************

The TCLIService module is autogenerated using a ``TCLIService.thrift`` file. To update it, the
``generate.py`` file can be used: ``python generate.py <TCLIServiceURL>``. When left blank, the
Expand Down
2 changes: 2 additions & 0 deletions dev_requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@ pytest-timeout==1.2.0
requests>=1.0.0
requests_kerberos>=0.12.0
sasl>=0.2.1
pure-sasl>=0.6.2
kerberos>=1.3.0
thrift>=0.10.0
#thrift_sasl>=0.1.0
git+https://github.com/cloudera/thrift_sasl # Using master branch in order to get Python 3 SASL patches
2 changes: 1 addition & 1 deletion pyhive/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
from __future__ import absolute_import
from __future__ import unicode_literals
__version__ = '0.6.3'
__version__ = '0.7.0'
7 changes: 6 additions & 1 deletion pyhive/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,11 @@
from future.utils import with_metaclass
from itertools import islice

try:
from collections.abc import Iterable
except ImportError:
from collections import Iterable


class DBAPICursor(with_metaclass(abc.ABCMeta, object)):
"""Base class for some common DB-API logic"""
Expand Down Expand Up @@ -245,7 +250,7 @@ def escape_item(self, item):
return self.escape_number(item)
elif isinstance(item, basestring):
return self.escape_string(item)
elif isinstance(item, collections.Iterable):
elif isinstance(item, Iterable):
return self.escape_sequence(item)
elif isinstance(item, datetime.datetime):
return self.escape_datetime(item, self._DATETIME_FORMAT)
Expand Down
56 changes: 41 additions & 15 deletions pyhive/hive.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,45 @@
}


def get_sasl_client(host, sasl_auth, service=None, username=None, password=None):
import sasl
sasl_client = sasl.Client()
sasl_client.setAttr('host', host)

if sasl_auth == 'GSSAPI':
sasl_client.setAttr('service', service)
elif sasl_auth == 'PLAIN':
sasl_client.setAttr('username', username)
sasl_client.setAttr('password', password)
else:
raise ValueError("sasl_auth only supports GSSAPI and PLAIN")

sasl_client.init()
return sasl_client


def get_pure_sasl_client(host, sasl_auth, service=None, username=None, password=None):
from pyhive.sasl_compat import PureSASLClient

if sasl_auth == 'GSSAPI':
sasl_kwargs = {'service': service}
elif sasl_auth == 'PLAIN':
sasl_kwargs = {'username': username, 'password': password}
else:
raise ValueError("sasl_auth only supports GSSAPI and PLAIN")

return PureSASLClient(host=host, **sasl_kwargs)


def get_installed_sasl(host, sasl_auth, service=None, username=None, password=None):
try:
return get_sasl_client(host=host, sasl_auth=sasl_auth, service=service, username=username, password=password)
# The sasl library is available
except ImportError:
# Fallback to pure-sasl library
return get_pure_sasl_client(host=host, sasl_auth=sasl_auth, service=service, username=username, password=password)


def _parse_timestamp(value):
if value:
match = _TIMESTAMP_PATTERN.match(value)
Expand Down Expand Up @@ -232,7 +271,6 @@ def __init__(
self._transport = thrift.transport.TTransport.TBufferedTransport(socket)
elif auth in ('LDAP', 'KERBEROS', 'NONE', 'CUSTOM'):
# Defer import so package dependency is optional
import sasl
import thrift_sasl

if auth == 'KERBEROS':
Expand All @@ -243,20 +281,8 @@ def __init__(
if password is None:
# Password doesn't matter in NONE mode, just needs to be nonempty.
password = 'x'

def sasl_factory():
sasl_client = sasl.Client()
sasl_client.setAttr('host', host)
if sasl_auth == 'GSSAPI':
sasl_client.setAttr('service', kerberos_service_name)
elif sasl_auth == 'PLAIN':
sasl_client.setAttr('username', username)
sasl_client.setAttr('password', password)
else:
raise AssertionError
sasl_client.init()
return sasl_client
self._transport = thrift_sasl.TSaslClientTransport(sasl_factory, sasl_auth, socket)

self._transport = thrift_sasl.TSaslClientTransport(lambda: get_installed_sasl(host=host, sasl_auth=sasl_auth, service=kerberos_service_name, username=username, password=password), sasl_auth, socket)
else:
# All HS2 config options:
# https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Configuration
Expand Down
18 changes: 12 additions & 6 deletions pyhive/presto.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
from __future__ import unicode_literals

from builtins import object
from decimal import Decimal

from pyhive import common
from pyhive.common import DBAPITypeObject
# Make all exceptions visible in this module per DB-API
Expand All @@ -34,6 +36,11 @@

_logger = logging.getLogger(__name__)

TYPES_CONVERTER = {
"decimal": Decimal,
# As of Presto 0.69, binary data is returned as the varbinary type in base64 format
"varbinary": base64.b64decode
}

class PrestoParamEscaper(common.ParamEscaper):
def escape_datetime(self, item, format):
Expand Down Expand Up @@ -307,14 +314,13 @@ def _fetch_more(self):
"""Fetch the next URI and update state"""
self._process_response(self._requests_session.get(self._nextUri, **self._requests_kwargs))

def _decode_binary(self, rows):
# As of Presto 0.69, binary data is returned as the varbinary type in base64 format
# This function decodes base64 data in place
def _process_data(self, rows):
for i, col in enumerate(self.description):
if col[1] == 'varbinary':
col_type = col[1].split("(")[0].lower()
if col_type in TYPES_CONVERTER:
for row in rows:
if row[i] is not None:
row[i] = base64.b64decode(row[i])
row[i] = TYPES_CONVERTER[col_type](row[i])

def _process_response(self, response):
"""Given the JSON response from Presto's REST API, update the internal state with the next
Expand All @@ -341,7 +347,7 @@ def _process_response(self, response):
if 'data' in response_json:
assert self._columns
new_data = response_json['data']
self._decode_binary(new_data)
self._process_data(new_data)
self._data += map(tuple, new_data)
if 'nextUri' not in response_json:
self._state = self._STATE_FINISHED
Expand Down
56 changes: 56 additions & 0 deletions pyhive/sasl_compat.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Original source of this file is https://github.com/cloudera/impyla/blob/master/impala/sasl_compat.py
# which uses Apache-2.0 license as of 21 May 2023.
# This code was added to Impyla in 2016 as a compatibility layer to allow use of either python-sasl or pure-sasl
# via PR https://github.com/cloudera/impyla/pull/179
# Even though thrift_sasl lists pure-sasl as dependency here https://github.com/cloudera/thrift_sasl/blob/master/setup.py#L34
# but it still calls functions native to python-sasl in this file https://github.com/cloudera/thrift_sasl/blob/master/thrift_sasl/__init__.py#L82
# Hence this code is required for the fallback to work.


from puresasl.client import SASLClient, SASLError
from contextlib import contextmanager

@contextmanager
def error_catcher(self, Exc = Exception):
try:
self.error = None
yield
except Exc as e:
self.error = str(e)


class PureSASLClient(SASLClient):
def __init__(self, *args, **kwargs):
self.error = None
super(PureSASLClient, self).__init__(*args, **kwargs)

def start(self, mechanism):
with error_catcher(self, SASLError):
if isinstance(mechanism, list):
self.choose_mechanism(mechanism)
else:
self.choose_mechanism([mechanism])
return True, self.mechanism, self.process()
# else
return False, mechanism, None

def encode(self, incoming):
with error_catcher(self):
return True, self.unwrap(incoming)
# else
return False, None

def decode(self, outgoing):
with error_catcher(self):
return True, self.wrap(outgoing)
# else
return False, None

def step(self, challenge=None):
with error_catcher(self):
return True, self.process(challenge)
# else
return False, None

def getError(self):
return self.error
Loading

0 comments on commit 223444a

Please sign in to comment.