Cannot read HTTP catalog #301

Closed
3 tasks done
hombit opened this issue Jul 8, 2024 · 0 comments · Fixed by #315
Labels
bug Something isn't working

hombit commented Jul 8, 2024

Bug report

I get a FileNotFoundError when trying to read an HTTP (not HTTPS) catalog with LSDB:

>>> lsdb.read_hipscat('http://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/des/des_dr2/')
FileNotFoundError: epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/des/des_dr2/catalog_info.json

It looks like the protocol part of the URL is stripped because we explicitly check for the https protocol, but not for http, s3, ftp, or anything else:
https://github.com/astronomy-commons/hipscat/blob/97643b635090472bfc10adef904dbcb53103f327/src/hipscat/io/file_io/file_pointer.py#L74
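
As a rough illustration of the suspected cause (a sketch, not hipscat's actual code): fsspec's HTTP filesystem expects the full URL, scheme included, so http and https paths would both need to pass through unchanged, while other protocols may have the scheme removed. The helper name `path_for_filesystem` and the `KEEP_SCHEME` set are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: schemes whose filesystem needs the full URL, scheme included.
KEEP_SCHEME = {"http", "https"}


def path_for_filesystem(url: str) -> tuple[str, str]:
    """Return (protocol, path) for a catalog location (hypothetical helper)."""
    scheme = urlsplit(url).scheme
    if not scheme:
        # No scheme: treat the string as a local path and leave it untouched.
        return "file", url
    if scheme in KEEP_SCHEME:
        # HTTP-family URLs are returned whole; dropping "http://" leaves a
        # host-less string like the one in the error message above.
        return scheme, url
    # Other protocols: drop the "<scheme>:" prefix and the "//" after it.
    return scheme, url[len(scheme) + 1:].removeprefix("//")


print(path_for_filesystem("http://example.org/catalog/"))
print(path_for_filesystem("s3://bucket/catalog/"))
```

Treating http and https as one family avoids a per-protocol special case; whether other protocols should keep or drop their scheme depends on the filesystem implementation and is only assumed here.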

It may be related to this fixed issue:
#215 (comment)

Full traceback
InvalidURL                                Traceback (most recent call last)
File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/implementations/http.py:422, in HTTPFileSystem._info(self, url, **kwargs)
    420 try:
    421     info.update(
--> 422         await _file_info(
    423             self.encode_url(url),
    424             size_policy=policy,
    425             session=session,
    426             **self.kwargs,
    427             **kwargs,
    428         )
    429     )
    430     if info.get("size") is not None:

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/implementations/http.py:831, in _file_info(url, session, size_policy, **kwargs)
    830 elif size_policy == "get":
--> 831     r = await session.get(url, allow_redirects=ar, **kwargs)
    832 else:

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/aiohttp/client.py:548, in ClientSession._request(self, method, str_or_url, params, data, json, cookies, headers, skip_auto_headers, auth, allow_redirects, max_redirects, compress, chunked, expect100, raise_for_status, read_until_eof, proxy, proxy_auth, timeout, verify_ssl, fingerprint, ssl_context, ssl, server_hostname, proxy_headers, trace_request_ctx, read_bufsize, auto_decompress, max_line_size, max_field_size)
    546         proxy, proxy_auth = get_env_proxy_for_url(url)
--> 548 req = self._request_class(
    549     method,
    550     url,
    551     params=params,
    552     headers=headers,
    553     skip_auto_headers=skip_headers,
    554     data=data,
    555     cookies=all_cookies,
    556     auth=auth,
    557     version=version,
    558     compress=compress,
    559     chunked=chunked,
    560     expect100=expect100,
    561     loop=self._loop,
    562     response_class=self._response_class,
    563     proxy=proxy,
    564     proxy_auth=proxy_auth,
    565     timer=timer,
    566     session=self,
    567     ssl=ssl if ssl is not None else True,
    568     server_hostname=server_hostname,
    569     proxy_headers=proxy_headers,
    570     traces=traces,
    571     trust_env=self.trust_env,
    572 )
    574 # connection timeout

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/aiohttp/client_reqrep.py:327, in ClientRequest.__init__(self, method, url, params, headers, skip_auto_headers, data, cookies, auth, version, compress, chunked, expect100, loop, response_class, proxy, proxy_auth, timer, session, ssl, proxy_headers, traces, trust_env, server_hostname)
    326 self.update_version(version)
--> 327 self.update_host(url)
    328 self.update_headers(headers)

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/aiohttp/client_reqrep.py:401, in ClientRequest.update_host(self, url)
    400 if not url.raw_host:
--> 401     raise InvalidURL(url)
    403 # basic auth info

InvalidURL: epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/des/des_dr2/catalog_info.json

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
Cell In[3], line 1
----> 1 catalog = lsdb.read_hipscat('http://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/des/des_dr2/')

File ~/projects/lincc-frameworks/lsdb/src/lsdb/loaders/hipscat/read_hipscat.py:79, in read_hipscat(path, catalog_type, search_filter, columns, margin_cache, dtype_backend, storage_options, **kwargs)
     76 config_args = {field.name: kwd_args[field.name] for field in dataclasses.fields(HipscatLoadingConfig)}
     77 config = HipscatLoadingConfig(**config_args)
---> 79 catalog_type_to_use = _get_dataset_class_from_catalog_info(path, storage_options=storage_options)
     81 if catalog_type is not None:
     82     catalog_type_to_use = catalog_type

File ~/projects/lincc-frameworks/lsdb/src/lsdb/loaders/hipscat/read_hipscat.py:93, in _get_dataset_class_from_catalog_info(base_catalog_path, storage_options)
     91 base_catalog_dir = hc.io.get_file_pointer_from_path(base_catalog_path)
     92 catalog_info_path = hc.io.paths.get_catalog_info_pointer(base_catalog_dir)
---> 93 catalog_info = BaseCatalogInfo.read_from_metadata_file(catalog_info_path, storage_options=storage_options)
     94 catalog_type = catalog_info.catalog_type
     95 if catalog_type not in dataset_class_for_catalog_type:

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/hipscat/catalog/dataset/base_catalog_info.py:60, in BaseCatalogInfo.read_from_metadata_file(cls, catalog_info_file, storage_options)
     47 @classmethod
     48 def read_from_metadata_file(
     49     cls, catalog_info_file: FilePointer, storage_options: Union[Dict[Any, Any], None] = None
     50 ) -> Self:
     51     """Read catalog info from the `catalog_info.json` metadata file
     52
     53     Args:
   (...)
     58         A CatalogInfo object with the data from the `catalog_info.json` file
     59     """
---> 60     metadata_keywords = file_io.load_json_file(catalog_info_file, storage_options=storage_options)
     61     catalog_info_keywords = {}
     62     for field in dataclasses.fields(cls):

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/hipscat/io/file_io/file_io.py:115, in load_json_file(file_pointer, encoding, storage_options)
    113 json_dict = None
    114 file_system, file_pointer = get_fs(file_pointer, storage_options)
--> 115 with file_system.open(file_pointer, "r", encoding=encoding) as json_file:
    116     json_dict = json.load(json_file)
    118 return json_dict

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/spec.py:1286, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1278     mode = mode.replace("t", "") + "b"
   1280     text_kwargs = {
   1281         k: kwargs.pop(k)
   1282         for k in ["encoding", "errors", "newline"]
   1283         if k in kwargs
   1284     }
   1285     return io.TextIOWrapper(
-> 1286         self.open(
   1287             path,
   1288             mode,
   1289             block_size=block_size,
   1290             cache_options=cache_options,
   1291             compression=compression,
   1292             **kwargs,
   1293         ),
   1294         **text_kwargs,
   1295     )
   1296 else:
   1297     ac = kwargs.pop("autocommit", not self._intrans)

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/spec.py:1298, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1296 else:
   1297     ac = kwargs.pop("autocommit", not self._intrans)
-> 1298     f = self._open(
   1299         path,
   1300         mode=mode,
   1301         block_size=block_size,
   1302         autocommit=ac,
   1303         cache_options=cache_options,
   1304         **kwargs,
   1305     )
   1306     if compression is not None:
   1307         from fsspec.compression import compr

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/implementations/http.py:361, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
    359 kw["asynchronous"] = self.asynchronous
    360 kw.update(kwargs)
--> 361 size = size or self.info(path, **kwargs)["size"]
    362 session = sync(self.loop, self.set_session)
    363 if block_size and size:

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File ~/.virtualenvs/lsdb/lib/python3.11/site-packages/fsspec/implementations/http.py:435, in HTTPFileSystem._info(self, url, **kwargs)
    432     except Exception as exc:
    433         if policy == "get":
    434             # If get failed, then raise a FileNotFoundError
--> 435             raise FileNotFoundError(url) from exc
    436         logger.debug("", exc_info=exc)
    438 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys/des/des_dr2/catalog_info.json
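
The InvalidURL at the root of this traceback is consistent with the scheme having been dropped: without `http://`, a URL parser sees no host at all. A minimal standard-library sketch of that effect follows; aiohttp does its own URL parsing, so this only illustrates the idea and is not the code path in the traceback.

```python
from urllib.parse import urlsplit

full = (
    "http://epyc.astro.washington.edu/~lincc-frameworks/"
    "hipscat_surveys/des/des_dr2/catalog_info.json"
)
# Simulate the suspected behavior: the scheme prefix is removed.
stripped = full.removeprefix("http://")

# With the scheme, the host is recognized.
print(urlsplit(full).hostname)
# Without it, the whole string is parsed as a path and no host is found.
print(urlsplit(stripped).hostname)
```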

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.