Add a section to read and parse data into a DataFrame #610

Open
anuradhawick opened this issue Apr 11, 2024 · 1 comment
How could the content be improved?

The following section introduces how data can be processed using loops:

Automating data processing using For Loops

I believe it would also be advantageous to have a similar section in the following:

Reading CSV Data Using Pandas

Here we can briefly introduce Python generators as well. For example, consider a CSV file whose columns are name, age, and location. We can parse this data into a DataFrame using a generator. Imagine that location is a comma-separated string field and we want to read latitude and longitude separately.

| name | age | location |
| --- | --- | --- |
| John | 50 | 123341,123321 |
| Emily | 25 | 321321,123321 |
| Wick | 35 | 123341,654789 |
| Raj | 40 | 987789,123321 |
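
```python
import csv
import pandas as pd


def transform_lines(csv_path):
    """Yield one parsed row at a time, splitting location into latitude and longitude."""
    # Assumes the location field is quoted in the file, e.g. "123341,123321",
    # so that csv.reader keeps it together as a single field.
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row: name, age, location
        for name, age, location in reader:
            lat, lng = location.split(",")
            yield [name, int(age), float(lat), float(lng)]


lines = transform_lines("./data.csv")
# Pass the column names here so the header does not end up as a data row.
df = pd.DataFrame(lines, columns=["Name", "Age", "Latitude", "Longitude"])

print(df.head())
```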

This is especially useful for large datasets, where loading all of the data in text form consumes a lot of memory.
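
To make the "one row at a time" behaviour concrete, here is a minimal sketch that uses only the standard library. The sample data, the `SAMPLE_CSV` name, and the `parse_rows` helper are hypothetical and only for illustration; the sketch also assumes the location field is quoted in the file so that it stays a single field.

```python
import csv
import io

# Hypothetical sample data mirroring the table above; the location field is
# quoted so that csv.reader keeps "lat,lng" together as one field.
SAMPLE_CSV = '''name,age,location
John,50,"123341,123321"
Emily,25,"321321,123321"
'''


def parse_rows(handle):
    """Yield one parsed row at a time from an open CSV file handle."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for name, age, location in reader:
        lat, lng = location.split(",")
        yield [name, int(age), float(lat), float(lng)]


rows = parse_rows(io.StringIO(SAMPLE_CSV))  # nothing has been read yet
first = next(rows)      # reads and parses only the first data row
second = next(rows)     # reads and parses only the second data row
remaining = list(rows)  # no rows are left, so this is an empty list

print(first)
print(second)
print(remaining)
```

Each call to `next` runs the function body only as far as the next `yield`, so rows are parsed on demand rather than all at once.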

Contributor

quist00 commented Apr 19, 2024

Hi, @anuradhawick
Thanks for taking the time to suggest this modification. It definitely addresses an issue that many researchers will likely encounter at some point. That said, what are your thoughts on whether this is a good match for potentially absolute beginners? I fear that if someone is brand new to all this, there is a lot of automagical stuff introduced by the yield keyword that might be a bridge too far for some to wrap their heads around. It might be a better fit for the instructor notes. Also, there is the potential for a community-based rewrite (https://carpentries.slack.com/archives/C03LE48AY/p1711535383742769), so I will likely table any major changes like this until that is settled one way or another. If you wanted to do a PR to put it into the instructor section prior to that, though, I would be happy to consider it.
