
[BUG]: #3531

Open · 1 task done
CaryMoore-DB opened this issue Jan 16, 2025 · 1 comment
Labels
bug (Something isn't working) · needs-triage

Comments

@CaryMoore-DB

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We are seeing the assessment misclassify tables in the external Hive metastore as managed when they are actually external Hive tables stored as Parquet. You can take the table's location, read it with spark.read, register the table directly as Parquet, and select from it. The code does not process any of the tables and produces this output: WARN [d.l.u.hive_metastore.table_migrate][migrate_tables_0] failed-to-migrate: SYNC command failed to migrate table hive_metastore.<table name redacted> to <catalog redacted>.<schema redacted>.<tablename redacted>. Status code: NOT_EXTERNAL. Description: [UPGRADE_NOT_SUPPORTED.NOT_EXTERNAL] Table is not eligible for upgrade from Hive Metastore to Unity Catalog. Reason: Not an external table. SQLSTATE: 0AKUC
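
The following is a minimal sketch of that manual check, with a hypothetical location and table name standing in for the redacted values; it is not UCX code, just an illustration that the data reads and registers as external Parquet.

    # Hypothetical location and names; substitute the redacted values.
    location = "s3://my-bucket/warehouse/my_table"

    # The Parquet files at the table location are directly readable.
    spark.read.parquet(location).limit(5).show()

    # The same location can be registered as an external Parquet table and
    # queried, which is what makes the MANAGED classification surprising.
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS hive_metastore.my_schema.my_table_ext
        USING PARQUET
        LOCATION '{location}'
    """)
    spark.sql("SELECT * FROM hive_metastore.my_schema.my_table_ext LIMIT 5").show()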

Expected Behavior

Maybe this is the intent of the experimental in-place Hive SerDe migration, but I would expect that instead of using the SYNC command we would register the table as Parquet in its existing location; or maybe I don't fully understand what SYNC can do. Also, since the table lives in an external Hive metastore, I'm not sure how it is being classified as managed.

Steps To Reproduce

No response

Cloud

AWS

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

WARN [d.l.u.hive_metastore.table_migrate][migrate_tables_0] failed-to-migrate: SYNC command failed to migrate table hive_metastore.<table name redacted> to <catalog redacted>.<schema redacted>.<tablename redacted>. Status code: NOT_EXTERNAL. Description: [UPGRADE_NOT_SUPPORTED.NOT_EXTERNAL] Table is not eligible for upgrade from Hive Metastore to Unity Catalog. Reason: Not an external table. SQLSTATE: 0AKUC

21:47:08  INFO [d.l.u.hive_metastore.table_migrate][convert_tables_3] Changing HMS managed table <table_name> to External Table type.
21:47:09  WARN [d.l.u.hive_metastore.table_migrate][convert_tables_0] Error converting HMS table <table_name> to external: An error occurred while calling None.org.apache.spark.sql.catalyst.catalog.CatalogTable. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.catalyst.catalog.CatalogTable([class org.apache.spark.sql.catalyst.TableIdentifier, class org.apache.spark.sql.catalyst.catalog.CatalogTableType, class org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat, class org.apache.spark.sql.types.StructType, class scala.Some, class scala.collection.immutable.Nil$, class scala.None$, class java.lang.String, class java.lang.Long, class java.lang.Integer, class java.lang.String, class scala.collection.immutable.Map$EmptyMap$, class scala.None$, class scala.None$, class scala.None$, class scala.collection.mutable.ArrayBuffer, class java.lang.Boolean, class java.lang.Boolean, class scala.collection.immutable.Map$EmptyMap$, class scala.None$]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:203)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:220)
	at py4j.Gateway.invoke(Gateway.java:255)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.base/java.lang.Thread.run(Thread.java:840)

: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/databricks/labs/ucx/hive_metastore/table_migrate.py", line 302, in _convert_hms_table_to_external
    new_table = self._catalog_table(
                ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1620, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 263, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 330, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.catalyst.catalog.CatalogTable. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.catalyst.catalog.CatalogTable([class org.apache.spark.sql.catalyst.TableIdentifier, class org.apache.spark.sql.catalyst.catalog.CatalogTableType, class org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat, class org.apache.spark.sql.types.StructType, class scala.Some, class scala.collection.immutable.Nil$, class scala.None$, class java.lang.String, class java.lang.Long, class java.lang.Integer, class java.lang.String, class scala.collection.immutable.Map$EmptyMap$, class scala.None$, class scala.None$, class scala.None$, class scala.collection.mutable.ArrayBuffer, class java.lang.Boolean, class java.lang.Boolean, class scala.collection.immutable.Map$EmptyMap$, class scala.None$]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:203)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:220)
	at py4j.Gateway.invoke(Gateway.java:255)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.base/java.lang.Thread.run(Thread.java:840)

Slack thread in da-dbx-support (Ben Tibbetts, Scott Parker, and 2 others):


Ben Tibbetts
  Tuesday at 5:13 PM
Hello! I am working on the issue where our mapping file is too big to process within the migrate_tables workflow. I split up our mapping file, unpacked and modified the Python package so that it could load multiple files, and pointed the convert_managed_table task at that new package.
Everything was working fine up until this error appearing for every table (putting the stack trace in the thread).
Does that look familiar at all, and who could help me look at this?
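
A hypothetical sketch of the multi-file load Ben describes, assuming the split parts are CSVs that share one header row; the real UCX mapping loader and file layout may differ.

    # Hypothetical sketch: merge split mapping CSVs (mapping_*.csv) into one
    # row list, assuming every part shares the same header. Paths and names
    # are illustrative only.
    import csv
    from pathlib import Path

    def load_mapping_parts(folder: str) -> list[dict]:
        rows: list[dict] = []
        for part in sorted(Path(folder).glob("mapping_*.csv")):
            with part.open(newline="") as f:
                rows.extend(csv.DictReader(f))
        return rows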


Ben Tibbetts
  Tuesday at 5:15 PM
(Posted the same INFO/WARN lines and Py4J stack trace as in the Relevant log output section above.)


Scott Parker
  Yesterday at 8:59 AM
@sheila.stewart @cary.moore :point_up:



Cary Moore
  Yesterday at 9:39 AM
Hi @ben.tibbetts, what is the code calling to read the Parquet?


Cary Moore
  Yesterday at 9:43 AM
Based on the context provided, here are some potential causes and solutions:

  • Table type issue: Managed tables on DBFS mounts can fail migration when they are not correctly identified as external tables. This can be resolved by updating the table type in the Hive metastore to external before attempting the migration, i.e. setting the tableType to CatalogTableType.EXTERNAL in your code.
  • Configuration settings: Ensure that the necessary configuration settings are enabled. For instance, setting spark.databricks.sync.command.enableManagedTable=true can enable migrating managed tables with the SYNC command.
  • Unsupported operations: There are certain restrictions on UC-enabled clusters, and directly converting managed tables to external tables might not be supported. In such cases, the recommendation is to deep clone HMS Parquet and Delta tables to copy the data and then migrate the tables from HMS to UC (see the sketch after this message).
  • Compatibility issues: Ensure that you are using compatible versions of libraries and the Databricks Runtime; version mismatches can lead to errors like NoSuchMethodError.

To troubleshoot further, you can:

  • Verify that all required elements are present in the CatalogTablePartition class.
  • Check for any null values being accessed in the CatalogTable class.
  • Ensure that you are using compatible library versions, especially for third-party libraries like Iceberg, with your Databricks Runtime.

If the issue persists, providing more details about the specific context, such as the exact code being used and the Databricks Runtime version, can help diagnose the problem more accurately.
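
A minimal sketch of the deep-clone route from the list above, with hypothetical catalog, schema, and table names; DEEP CLONE copies the data into a new UC table rather than converting the HMS table in place.

    # Hypothetical names; substitute your own. DEEP CLONE copies the HMS
    # table's data into a new UC table instead of altering the source table.
    spark.sql("""
        CREATE OR REPLACE TABLE my_catalog.my_schema.my_table
        DEEP CLONE hive_metastore.my_schema.my_table
    """)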


Cary Moore
  Yesterday at 9:44 AM
From our internal docs


Ben Tibbetts
  Yesterday at 9:44 AM
Here is the snippet that is failing:
    @cached_property
    def _catalog_table(self):
        # JVM class handle for Spark's CatalogTable, obtained through Py4J.
        return self._spark._jvm.org.apache.spark.sql.catalyst.catalog.CatalogTable  # pylint: disable=protected-access

    def _convert_hms_table_to_external(self, src_table: Table):
        logger.info(f"Changing HMS managed table {src_table.name} to External Table type.")
        inventory_table = self._tables_crawler.full_name
        try:
            database = self._spark._jvm.scala.Some(src_table.database)  # pylint: disable=protected-access
            table_identifier = self._table_identifier(src_table.name, database)
            old_table = self._catalog.getTableMetadata(table_identifier)
            # Rebuild the metadata with EXTERNAL type; this positional call
            # must match the CatalogTable constructor of the running Spark
            # version, and a mismatch is what raises the Py4JException.
            new_table = self._catalog_table(
                old_table.identifier(),
                self._catalog_type('EXTERNAL'),
                old_table.storage(),
                old_table.schema(),
                old_table.provider(),
                old_table.partitionColumnNames(),
                old_table.bucketSpec(),
                old_table.owner(),
                old_table.createTime(),
                old_table.lastAccessTime(),
                old_table.createVersion(),
                old_table.properties(),
                old_table.stats(),
                old_table.viewText(),
                old_table.comment(),
                old_table.unsupportedFeatures(),
                old_table.tracksPartitionsInCatalog(),
                old_table.schemaPreservesCase(),
                old_table.ignoredProperties(),
                old_table.viewOriginalText(),
            )
            self._catalog.alterTable(new_table)
            self._update_table_status(src_table, inventory_table)
            logger.info(f"Converted {src_table.name} to External Table type.")
        except Exception as e:  # pylint: disable=broad-exception-caught
            logger.warning(f"Error converting HMS table {src_table.name} to external: {e}", exc_info=True)
            return False
        return True
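
Since Py4J reports that no CatalogTable constructor with that exact parameter list exists, the likely cause is a signature mismatch between this 20-argument call and the CatalogTable class shipped with the cluster's Spark version. Below is a hypothetical diagnostic sketch, assuming an active SparkSession named spark, to list the constructors the JVM actually exposes:

    # Hypothetical diagnostic: print the arity and full signature of every
    # CatalogTable constructor in the running JVM, for comparison with the
    # 20-argument call above.
    cls = spark._jvm.java.lang.Class.forName(
        "org.apache.spark.sql.catalyst.catalog.CatalogTable"
    )
    for ctor in cls.getConstructors():
        print(ctor.getParameterCount(), ctor)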
CaryMoore-DB added the bug and needs-triage labels on Jan 16, 2025

CaryMoore-DB commented Jan 16, 2025

We tested SYNC TABLE AS EXTERNAL and that did work. How can we get the code to use SYNC TABLE AS EXTERNAL?
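
A minimal sketch of the command this comment refers to, with hypothetical catalog, schema, and table names; the placement of the AS EXTERNAL clause, and the enableManagedTable setting suggested earlier in the thread, are assumptions to verify against the Databricks SYNC documentation for your runtime.

    # Hypothetical names; verify the AS EXTERNAL placement and the config
    # flag against current Databricks docs before relying on them.
    spark.sql("SET spark.databricks.sync.command.enableManagedTable = true")
    spark.sql("""
        SYNC TABLE my_catalog.my_schema.my_table AS EXTERNAL
        FROM hive_metastore.my_schema.my_table
    """)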
