Data Importation in Python

Reading flat files

  • Flat files are basic text files containing records, that is, table data, without structured relationships.

  • This is in contrast to a relational database, in which columns of distinct tables can be related.

  • Examples of flat files: .txt, .csv (Excel files are not flat files, as they can contain many sheets)

  • To import:
    file = open('filename.txt', mode='r')
    text = file.read()
    file.close() # close() must actually be called, or the file handle stays open

  • You can avoid having to close the file by using a context manager, defined with the with statement:
    with open('filename.txt', 'r') as file:
        print(file.readline()) # to display one line
        print(file.read()) # to display the rest of the text
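  • Large files can also be processed line by line without loading everything into memory; a small sketch ('filename.txt' is a placeholder name):
    with open('filename.txt') as file:
        for line in file: # the file object iterates over its lines
            print(line.rstrip()) # strip the trailing newline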

Importing flat files using NumPy

  • If you want to import a flat file, assign it to a variable, and all the data is numerical, you can use the numpy package to import the data as a numpy array
  • Numpy arrays are the Python standard for storing numerical data, and other packages, e.g. scikit-learn for ML, are built to work with them

Importing using np.loadtxt()

  • Syntax:
    import numpy as np
    data = np.loadtxt('filename.txt', delimiter=',', skiprows=1, usecols=[0, 2])
  • skiprows is used when there is a string header, and usecols selects the columns you want, e.g. [0, 2] for the 1st and 3rd columns
  • To read all the data as strings: data = np.loadtxt('filename.txt', delimiter=',', dtype=str)
  • np.loadtxt() performs well for basic cases, but tends to break when columns have mixed datatypes
  • The best library for such situations is pandas, the Python standard for importing flat files as dataframes (a short sketch follows)
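  • A minimal pandas sketch of the same import ('filename.csv' is a placeholder name):
    import pandas as pd
    df = pd.read_csv('filename.csv') # each column gets its own inferred dtype
    print(df.head()) # inspect the first five rows
    data_array = df.values # convert to a numpy array if needed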

Importing Other File Types

  • Pickle files

    • Are Python objects (e.g. ML models, lists) that have been serialized (converted to a sequence of bytes, or bytestream)
    • To import:
      import pickle
      with open('filename.pkl', 'rb') as file:
          data = pickle.load(file)
      print(data)
    • 'rb' specifies that the file is opened read-only (r) and in binary mode (b)
  • Excel Spreadsheets

    • Commonly used file type
    • To import:
      import pandas as pd
      data = pd.ExcelFile('filename.xlsx') # read the Excel file into an ExcelFile object
      print(data.sheet_names) # print the sheets in the file
      df1 = data.parse('sheet1') # create a dataframe using the sheet name
      df2 = data.parse(0) # create a dataframe using the sheet index
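    • pandas can also read one sheet directly in a single call (a sketch; 'filename.xlsx' and 'sheet1' are placeholder names):
      df = pd.read_excel('filename.xlsx', sheet_name='sheet1')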
  • SAS (Statistical Analysis System) Files

    • Mostly used in business analytics and biostatistics
    • Most common SAS files have extensions .sas7bdat (dataset files) and .sas7bcat (catalog files)
    • To import the dataset files:
      from sas7bdat import SAS7BDAT
      with SAS7BDAT('filename.sas7bdat') as file:
          df_sas = file.to_data_frame()
  • Stata (Statistics + Data) Files

    • Mostly used in academic social sciences research, such as economics and epidemiology
    • Have extension .dta
    • To import:
      data = pd.read_stata('filename.dta')
  • HDF5 (Hierarchical Data Format version 5) Files

    • Used for storing large quantities of numerical data (100s of GBs/TBs).
    • HDF5 itself can scale up to exabytes
    • To import:
      import h5py
      data = h5py.File('filename.hdf5', 'r')
    • To access the structure of the file, you use the same method as for a dictionary:
      for key in data.keys():
          print(key) # the results are HDF groups, which are analogous to directories
    • To find the data in one HDF group:
      for key in data['group1'].keys():
          print(key) # returns the columns in the group
    • To access the values as a numpy array:
      print(np.array(data['group1']['column1']))
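    • h5py.File also works as a context manager, which closes the file for you; a minimal sketch reusing the hypothetical 'filename.hdf5', 'group1', and 'column1' names above:
      import numpy as np
      import h5py
      with h5py.File('filename.hdf5', 'r') as data:
          column1 = np.array(data['group1']['column1']) # copy the dataset out before the file closes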
  • MATLAB (Matrix Laboratory) Files

    • MATLAB is a numerical computing environment, and the industry standard in engineering and science
    • Data is saved as .mat files
    • Data is read and written using SciPy's methods: scipy.io.loadmat() and scipy.io.savemat()
    • To import:
      import scipy.io
      mat = scipy.io.loadmat('filename.mat')
    • The variable mat is a dictionary, with keys corresponding to MATLAB variable names, and values corresponding to objects assigned to variables
    • To access the keys: print(mat.keys())
    • To access one value: print(mat['key1'])

Working with Relational Databases in Python

  • Create a SQLite database engine using sqlalchemy:
    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///DatabaseName.sqlite')
  • To get the table names: engine.table_names()
  • To run SQL queries:
    con = engine.connect()
    results = con.execute('SQL Query')
    df = pd.DataFrame(results.fetchall())
    df.columns = results.keys() # defines the column names in df
    con.close()
  • To save the hassle of closing the connection, you can use a context manager:
    with engine.connect() as con:
        results = con.execute('SQL Query')
        df = pd.DataFrame(results.fetchall()) # or .fetchmany(10) to import 10 rows instead of all
        df.columns = results.keys()
  • You can run a SQL query in a single line of code using pandas:
    df = pd.read_sql_query('SQL Query', engine)
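  • Putting it together, a minimal end-to-end sketch ('DatabaseName.sqlite' and 'TableName' are placeholder names):
    from sqlalchemy import create_engine
    import pandas as pd
    engine = create_engine('sqlite:///DatabaseName.sqlite')
    df = pd.read_sql_query('SELECT * FROM TableName', engine)
    print(df.head())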

Importing Data From the Web

Reading Flat Files from the Web

  • Syntax:
    from urllib.request import urlretrieve
    import pandas as pd
    urlretrieve(url, 'path_to_save.csv') # save the file locally; url is a string holding the file's web address
    df = pd.read_csv(url, sep=';') # read directly without saving

Reading Non-flat Files from the Web

  • Syntax:
    xls = pd.read_excel(url, sheet_name=None) # sheet_name=None imports all sheets as a dict of dataframes
    for k, v in xls.items(): print(k) # display the sheet names
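  • To pull one sheet out of the resulting dict (assuming the file has a sheet named 'sheet1'):
    df = xls['sheet1']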

Performing HTTP Requests using urllib

  • Syntax:
    from urllib.request import urlopen, Request
    request = Request(url) # Packages the request
    response = urlopen(request) # Sends the request and catches the response
    html = response.read() # Extract the response
    response.close()
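  • urlopen() also works as a context manager, so the response is closed automatically:
    with urlopen(request) as response:
        html = response.read()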

Performing HTTP Requests using requests (High Level)

  • Syntax:
    import requests
    r = requests.get(url)
    text = r.text

Scraping the web in Python: BeautifulSoup

  • Syntax:
    from bs4 import BeautifulSoup
    import requests
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser') # naming the parser avoids a bs4 warning
    print(soup.prettify()) # print the parsed HTML, nicely indented
    for link in soup.find_all('a'): print(link.get('href')) # print all hyperlinks in the page

Interacting with APIs to import data from the web

  • An API is a set of protocols and routines for building and interacting with software applications

Loading data using JSON

  • Was invented for real-time server-to-browser communication

  • Syntax:
    import json
    with open('file.json', 'r') as json_file:
        json_data = json.load(json_file) # returns a dict

  • To pull data from an API using requests:
    r = requests.get(url)
    json_data = r.json()
    for key, value in json_data.items(): print(key + ':', value)

  • To initialize several counter variables at once using list unpacking (here counting tweets that mention each candidate; a sketch of word_in_text follows below):
    [clinton, trump, sanders, cruz] = [0, 0, 0, 0] # initialize the four tweet counters
    for index, row in df.iterrows():
        clinton += word_in_text('clinton', row['text'])
        trump += word_in_text('trump', row['text'])
        sanders += word_in_text('sanders', row['text'])
        cruz += word_in_text('cruz', row['text'])

    • To collect the counts into a list:
      y = [clinton, trump, sanders, cruz] # counter list
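  • word_in_text is not defined in these notes; a minimal sketch of such a helper, assuming a simple case-insensitive match is intended:
    import re
    def word_in_text(word, text):
        """Return 1 if word occurs in text (case-insensitive), else 0."""
        return 1 if re.search(word, text, re.IGNORECASE) else 0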