🐝 Update pandas to 2.2.2 #3312

Marigold · 2024-09-18T19:13:00Z

Needed by #3305

owidbot · 2024-09-18T19:15:08Z

Quick links (staging server):

Site	Admin	Wizard

Login: ssh owid@staging-site-update-pandas

chart-diff: ✅

No charts for review.

data-diff:

= Dataset garden/artificial_intelligence/2024-09-09/epoch_aggregates_domain
  = Table epoch_aggregates_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                                                                                                             ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                                                                                                             ^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                                                                                                             ^
+       +   Describes the specific area, application, or field in which an AI system is designed to operate. An AI system can operate in more than one domain, thus contributing to the count for multiple domains. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                                                                                                             ^
= Dataset garden/artificial_intelligence/2024-09-09/epoch_compute_intensive_countries
  = Table epoch_compute_intensive_countries
    ~ Column cumulative_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                           ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                           ^
    ~ Column yearly_count (changed metadata)
-       -   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                           ^
+       +   Refers to the location of the primary organization with which the authors of a large-scale AI systems are affiliated. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                           ^
= Dataset garden/artificial_intelligence/2024-09-09/epoch_compute_intensive_domain
  = Table epoch_compute_intensive_domain
    ~ Column cumulative_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                ^
    ~ Column yearly_count (changed metadata)
-       -   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 10 September 2024.
        ?                                                                                                                                                                ^
+       +   Describes the specific area, application, or field in which a large-scale AI model is designed to operate. The 2024 data is incomplete and was last updated 18 September 2024.
        ?                                                                                                                                                                ^
= Dataset garden/demography/2023-09-27/survivor_percentiles
  ~ Table survivor_percentiles (changed metadata)
+     + title: Human Mortality Database
+     + description: |-
+     +   The Human Mortality Database (HMD) contains original calculations of death rates and life tables for national populations (countries or areas), as well as the input data used in constructing those tables. The input data consist of death counts from vital statistics, plus census counts, birth counts, and population estimates from various sources.
+     + 
+     + 
+     +   # Scope and basic principles
+     + 
+     +   New data series to this collection. However, the database is limited by design to populations where death registration and census data are virtually complete, since this type of information is required for the uniform method used to reconstruct historical data series. As a result, the countries and areas included here are relatively wealthy and for the most part highly industrialized.
+     + 
+     +   The main goal of the Human Mortality Database is to document the longevity revolution of the modern era and to facilitate research into its causes and consequences. As much as possible, the authors of the database have followed four guiding principles: comparability, flexibility, accessibility, reproducibility.
+     + 
+     + 
+     +   # Computing death rates and life tables
+     + 
+     +   Their process for computing mortality rates and life tables can be described in terms of six steps, corresponding to six data types that are available from the HMD. Here is an overview of the process:
+     + 
+     +   1. Births. Annual counts of live births by sex are collected for each population over the longest possible time period. These counts are used mainly for making population estimates at younger ages.
+     +   2. Deaths. Death counts are collected at the finest level of detail available. If raw data are aggregated, uniform methods are used to estimate death counts by completed age (i.e., age-last-birthday at time of death), calendar year of death, and calendar year of birth.
+     +   3. Population size. Annual estimates of population size on January 1st are either obtained from another source or are derived from census data plus birth and death counts.
+     +   4. Exposure-to-risk. Estimates of the population exposed to the risk of death during some age-time interval are based on annual (January 1st) population estimates, with a small correction that reflects the timing of deaths within the interval.
+     +   5. Death rates. Death rates are always a ratio of the death count for a given age-time interval divided by an estimate of the exposure-to-risk in the same interval.
+     +   6. Life tables. To build a life table, probabilities of death are computed from death rates. These probabilities are used to construct life tables, which include life expectancies and other useful indicators of mortality and longevity.
+     + 
+     + 
+     +   # Corrections to the data
+     + 
+     +   The data presented here have been corrected for gross errors (e.g., a processing error whereby 3,800 becomes 38,000 in a published statistical table would be obvious in most cases, and it would be corrected). However, the authors have not attempted to correct the data for systematic age misstatement (misreporting of age) or coverage errors (over- or under-enumeration of people or events).
+     + 
+     +   Some available studies assess the completeness of census coverage or death registration in the various countries, and more work is needed in this area. However, in developing the database thus far, the authors did not consider it feasible or desirable to attempt corrections of this sort, especially since it would be impossible to correct the data by a uniform method across all countries.
+     + 
+     + 
+     +   # Age misreporting
+     + 
+     +   Populations are included here if there is a well-founded belief that the coverage of their census and vital registration systems is relatively high, and thus, that fruitful analyses by both specialists and non-specialists should be possible with these data. Nevertheless, there is evidence of both age heaping (overreporting ages ending in "0" or "5") and age exaggeration in these data.
+     + 
+     +   In general, the degree of age heaping in these data varies by the time period and population considered, but it is usually no burden to scientific analysis. In most cases, it is sufficient to analyze data in five-year age groups in order to avoid the false impressions created by this particular form of age misstatement.
+     + 
+     +   Age exaggeration, on the other hand, is a more insidious problem. The authors' approach is guided by the conventional wisdom that age reporting in death registration systems is typically more reliable than in census counts or official population estimates. For this reason, the authors derive population estimates at older ages from the death counts themselves, employing extinct cohort methods. Such methods eliminate some, but certainly not all, of the biases in old-age mortality estimates due to age exaggeration.
+     + 
+     + 
+     +   # Uniform set of procedures
+     + 
+     +   A key goal of this project is to follow a uniform set of procedures for each population. This approach does not guarantee the cross-national comparability of the data. Rather, it ensures only that the authors have not introduced biases by the authors' own manipulations. The desire of the authors for uniformity had to face the challenge that raw data come in a variety of formats (for example, 1-year versus 5-year age groups). The authors' general approach to this problem is that the available raw data are used first to estimate two quantities: 1) the number of deaths by completed age, year of birth, and year of death; and 2) population estimates by single years of age on January 1 of each year. For each population, these calculations are performed separately by sex. From these two pieces of information, they compute death rates and life tables in a variety of age-time configurations.
+     + 
+     +   It is reasonable to ask whether a single procedure is the best method for treating the data from a variety of populations. Here, two points must be considered. First, the authors' uniform methodology is based on procedures that were developed separately, though following similar principles, for various countries and by different researchers. Earlier methods were synthesized by choosing what they considered the best among alternative procedures and by eliminating superficial inconsistencies. The second point is that a uniform procedure is possible only because the authors have not attempted to correct the data for reporting and coverage errors. Although some general principles could be followed, such problems would have to be addressed individually for each population.
+     + 
+     +   Although the authors adhere strictly to a uniform procedure, the data for each population also receive significant individualized attention. Each country or area is assigned to an individual researcher, who takes responsibility for assembling and checking the data for errors. In addition, the person assigned to each country/area checks the authors' data against other available sources. These procedures help to assure a high level of data quality, but assistance from database users in identifying problems is always appreciated!
= Dataset garden/demography/2023-10-04/gini_le
  = Table gini_le
    ~ Dim location
+       + New values: 1626 / 212193 (0.77%)
           year    sex     location
           1776 female East Germany
           1857 female East Germany
           1898   male East Germany
           1926 female East Germany
           1985 female East Germany
    ~ Dim year
+       + New values: 1626 / 212193 (0.77%)
              location    sex  year
          East Germany female  1776
          East Germany female  1857
          East Germany   male  1898
          East Germany female  1926
          East Germany female  1985
    ~ Dim sex
+       + New values: 1626 / 212193 (0.77%)
              location  year    sex
          East Germany  1776 female
          East Germany  1857 female
          East Germany  1898   male
          East Germany  1926 female
          East Germany  1985 female
    ~ Column life_expectancy_gini (new data)
+       + New values: 1626 / 212193 (0.77%)
              location  year    sex  life_expectancy_gini
          East Germany  1776 female                   NaN
          East Germany  1857 female                   NaN
          East Germany  1898   male                   NaN
          East Germany  1926 female                   NaN
          East Germany  1985 female                   NaN
= Dataset garden/insee/2024-04-26/relative_poverty_france
  = Table relative_poverty_france
    ~ Dim country
+       + New values: 1 / 34 (2.94%)
           year  spell country
           1975      1  France
-       - Removed values: 1 / 34 (2.94%)
           year  spell country
           1975   <NA>  France
    ~ Dim year
+       + New values: 1 / 34 (2.94%)
          country  spell  year
           France      1  1975
-       - Removed values: 1 / 34 (2.94%)
          country  spell  year
           France   <NA>  1975
    ~ Dim spell
+       + New values: 1 / 34 (2.94%)
          country  year  spell
           France  1975      1
-       - Removed values: 1 / 34 (2.94%)
          country  year  spell
           France  1975   <NA>
    ~ Column headcount_ratio_40_median (new data, changed data)
+       + New values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_40_median
           France  1975      1                        5.8
-       - Removed values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_40_median
           France  1975   <NA>                        5.8
    ~ Column headcount_ratio_50_median (new data, changed data)
+       + New values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_50_median
           France  1975      1                       10.6
-       - Removed values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_50_median
           France  1975   <NA>                       10.6
    ~ Column headcount_ratio_60_median (new data, changed data)
+       + New values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_60_median
           France  1975      1                       17.0
-       - Removed values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_60_median
           France  1975   <NA>                       17.0
    ~ Column headcount_ratio_70_median (new data, changed data)
+       + New values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_70_median
           France  1975      1                       23.9
-       - Removed values: 1 / 34 (2.94%)
          country  year  spell  headcount_ratio_70_median
           France  1975   <NA>                       23.9
= Dataset garden/who/2023-11-01/who_statins
  = Table who_statins
2024-09-19 08:09:10 [error    ] Traceback (most recent call last):

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/__init__.py", line 514, in loads
    return _default_decoder.decode(s)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 386, in decode
    obj, end = self.raw_decode(s)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/simplejson/decoder.py", line 416, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())

simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/owid/etl/etl/datadiff.py", line 429, in cli
    lines = future.result()

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)

  File "/home/owid/etl/etl/datadiff.py", line 422, in func
    differ.summary()

  File "/home/owid/etl/etl/datadiff.py", line 260, in summary
    self._diff_tables(self.ds_a, self.ds_b, table_name)

  File "/home/owid/etl/etl/datadiff.py", line 122, in _diff_tables
    table_a = future_a.result()

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()

  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)

  File "/home/owid/etl/etl/datadiff.py", line 843, in get_table_with_retry
    return ds[table_name]

  File "/home/owid/etl/etl/datadiff.py", line 284, in __getitem__
    return tables.load()

  File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 312, in load
    return self.iloc[0].load()  # type: ignore

  File "/home/owid/etl/lib/catalog/owid/catalog/catalogs.py", line 363, in load
    return Table.read(uri)

  File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 179, in read
    table = cls.read_feather(path, **kwargs)

  File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 365, in read_feather
    cls._add_metadata(df, path, **kwargs)

  File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 337, in _add_metadata
    metadata = cls._read_metadata(path)

  File "/home/owid/etl/lib/catalog/owid/catalog/tables.py", line 399, in _read_metadata
    return cast(Dict[str, Any], requests.get(metadata_path).json())

  File "/home/owid/etl/.venv/lib/python3.10/site-packages/requests/models.py", line 978, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


⚠ Found errors, create an issue please
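
Note on the traceback above: `_read_metadata` calls `.json()` directly on the HTTP response, so an empty or non-JSON body (for example an HTML error page) surfaces only as "Expecting value: line 1 column 1 (char 0)", with no hint of which URL failed. The following is a minimal sketch of a more defensive parse, not the code in `owid/catalog/tables.py`. `MetadataError` and `parse_metadata` are hypothetical names introduced here for illustration.

```python
import json
from typing import Any, Dict


class MetadataError(RuntimeError):
    """Hypothetical error type: a metadata response that cannot be used."""


def parse_metadata(status_code: int, body: str, url: str) -> Dict[str, Any]:
    """Decode a metadata response body, naming the URL when decoding fails."""
    # Fail early on HTTP errors instead of decoding an error page as JSON.
    if status_code != 200:
        raise MetadataError(f"{url}: unexpected HTTP status {status_code}")
    # An empty body is one way to get "Expecting value: line 1 column 1 (char 0)".
    if not body.strip():
        raise MetadataError(f"{url}: empty response body")
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError as exc:
        raise MetadataError(f"{url}: body is not valid JSON ({exc.msg})") from exc
    # Table metadata is expected to be a JSON object (assumption).
    if not isinstance(parsed, dict):
        raise MetadataError(f"{url}: expected a JSON object, got {type(parsed).__name__}")
    return parsed


# With requests this could be called as (assumed usage, not executed here):
#   resp = requests.get(metadata_path)
#   metadata = parse_metadata(resp.status_code, resp.text, metadata_path)
```

Keeping the parser independent of the HTTP client makes it easy to exercise with canned responses. Whether the failure here was an empty body, an error status, or a transient network issue is not visible from the log.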

Legend: +New  ~Modified  -Removed  =Identical
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
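
To make the "New values" and "Removed values" lines above easier to read, here is a minimal sketch of how such counts can be derived by comparing index keys between two versions of a table. It is an illustration under assumptions, not the implementation in `etl/datadiff.py`. `index_diff` and `summarize` are hypothetical helpers, `None` stands in for `<NA>`, and the denominator is assumed to be the table's row count.

```python
from typing import Hashable, Iterable, Set, Tuple

Key = Tuple[Hashable, ...]


def index_diff(keys_a: Iterable[Key], keys_b: Iterable[Key]) -> Tuple[Set[Key], Set[Key]]:
    """Return (new, removed) index keys when going from table A to table B."""
    a, b = set(keys_a), set(keys_b)
    return b - a, a - b


def summarize(count: int, total: int) -> str:
    """Format a count the way the diff output above does."""
    return f"{count} / {total} ({count / total:.2%})"


# The 1975 row is taken from the relative_poverty_france diff above
# (spell changed from <NA> to 1); the 1979 row is a made-up unchanged
# row added only for illustration.
remote = [("France", 1975, None), ("France", 1979, 1)]
local = [("France", 1975, 1), ("France", 1979, 1)]

new, removed = index_diff(remote, local)
print("new:", new)          # the key with spell == 1
print("removed:", removed)  # the key with spell == None

# Reproduces the shares printed in the log.
print(summarize(1, 34))
print(summarize(1626, 212193))
```

Because `spell` is part of the index, a single changed value appears once as a removed key and once as a new key, and is repeated under every dimension and column of the table. That is why one changed row accounts for all of the relative_poverty_france entries above.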

Edited: 2024-09-19 08:09:14 UTC
Execution time: 1030.42 seconds

Marigold · 2024-09-30T09:32:33Z

Verified that this doesn't introduce any discrepancies. Closing and adding it to #3236.

github-actions bot assigned Marigold Sep 18, 2024

Marigold marked this pull request as ready for review September 18, 2024 19:14

Marigold added 2 commits September 19, 2024 09:49

🐝 Update pandas to 2.2.2

b72e407

wip

7e2dd3b

Marigold force-pushed the update-pandas branch from 2eb9d9c to 7e2dd3b Compare September 19, 2024 07:49

Marigold closed this Sep 30, 2024
