Problems with column types #6

Open
iipr opened this issue Apr 26, 2019 · 0 comments

After playing around a bit with gastrodon, I think I have bumped into some problems regarding column types. To reproduce:

Preliminaries

from gastrodon import RemoteEndpoint, inline
import pandas as pd

prefixes = inline("""
    @prefix : <http://dbpedia.org/resource/> .
    @prefix dbp: <http://dbpedia.org/ontology/> .
    @prefix pr: <http://dbpedia.org/property/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
""").graph
endpoint = RemoteEndpoint(
    "http://dbpedia.org/sparql/"
    ,default_graph="http://dbpedia.org"
    ,prefixes=prefixes
    ,base_uri="http://dbpedia.org/resource/"
)

Error with dates

endpoint.select("""
SELECT DISTINCT ?personName ?bDay
WHERE {
    ?person a dbp:Person .
    ?person foaf:name ?personName .
    ?person dbp:birthDate ?bDay .
    }
    LIMIT 10
""")

Output:

Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 502, in select
    frame=self._dataframe(result)
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 397, in _dataframe
    column[key] = self._normalize_column_type(column[key])
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in _normalize_column_type
    return [None if x==None else int(x) for x in column]
  File "/opt/conda/lib/python3.6/site-packages/gastrodon/__init__.py", line 376, in <listcomp>
    return [None if x==None else int(x) for x in column]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'datetime.date'

Issue (casting floats)

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 1
""")

Output:

    starName  mass
0  61 Cygni     0

Expected output:

    starName  mass
0  61 Cygni   0.63

(see this)

Possible cause

I believe both problems come from _normalize_column_type:

  1. datetime.date values are not handled, so calling int(x) on an x that is a datetime.date raises the TypeError shown above.
  2. If all elements in the column are floats, they are silently cast to int, as shown in the issue above.
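Both symptoms can be reproduced without an endpoint. The sketch below only reuses the list comprehension visible in the traceback; the function name is hypothetical, and whatever condition gastrodon checks before applying the cast is not reproduced here.

```python
import datetime


def normalize_like_traceback(column):
    # Hypothetical wrapper around the expression shown in the traceback
    # (gastrodon/__init__.py, line 376); not gastrodon's actual function.
    return [None if x == None else int(x) for x in column]


# Floats are truncated without any warning: int(0.63) is 0.
print(normalize_like_traceback([0.63, None]))

# int() does not accept a datetime.date, so the cast raises TypeError.
try:
    normalize_like_traceback([datetime.date(1990, 1, 1)])
except TypeError as err:
    print("TypeError:", err)
```

The first call returns `[0, None]`, which matches the `mass` value of 0 in the output above.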

My question now is: is it really necessary to normalize the columns?
pandas is usually smart enough to infer column types and cast when needed.
If I skip the _normalize_column_type() call in the code, the mass in the stars example above is no longer cast to int, and when a column also contains strings, pandas stores it with the object dtype:

endpoint.select("""
SELECT DISTINCT ?starName ?mass
WHERE {
    ?star a dbp:Star .
    ?star foaf:name ?starName .
    ?star pr:mass ?mass
} LIMIT 100
""").head()

_.mass.dtype

Output:

      starName          mass
0     61 Cygni          0.63
1     61 Cygni           0.7
2  70 Virginis          1.12
3  70 Virginis  >7.49 ± 0.61
4      Albireo           3.2

dtype('O')
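For comparison, here is a minimal sketch of what pandas infers on its own when no normalization is applied. The values are made up for illustration (the masses are copied from the output above, the date is arbitrary) and the series are built by hand rather than from a query result.

```python
import datetime

import pandas as pd

# All floats: pandas keeps them as floats.
floats = pd.Series([0.63, 0.7, 1.12])

# Floats mixed with a string: pandas falls back to the object dtype.
mixed = pd.Series([0.63, 0.7, ">7.49 ± 0.61"])

# datetime.date values (with a missing entry) are stored as objects too.
dates = pd.Series([datetime.date(1990, 1, 1), None])

print(floats.dtype)  # float64
print(mixed.dtype)   # object
print(dates.dtype)   # object
```

None of these cases needs an explicit cast, and none of them loses the fractional part of the masses.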

Python 3.6.6
gastrodon 0.9.3
pandas 0.23.4
iipr added a commit to iipr/gastrodon that referenced this issue Apr 29, 2019