Data Importation in Python

Reading flat files

  • Flat files are basic text files containing records, that is, table data, without structured relationships.

  • This is in contrast to a relational database, in which columns of distinct tables can be related.

  • Examples of flat files: .txt, .csv (Excel files are not flat files, as they can contain many sheets)

  • To import:
    file = open('filename.txt', mode='r')
    text = file.read()
    file.close() # close() must actually be called, or the file handle stays open

  • You can avoid having to close the file by using a context manager, defined with the with statement:
    with open('filename.txt', 'r') as file:
        print(file.readline()) # to display one line
        print(file.read()) # to display the rest of the text
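  • Large files can also be processed line by line without loading everything into memory; a small sketch ('filename.txt' is a placeholder name):
    with open('filename.txt') as file:
        for line in file: # the file object iterates over its lines
            print(line.rstrip()) # strip the trailing newline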

Importing flat files using NumPy

  • If you want to import a flat file, assign it to a variable, and all the data is numerical, you can use the numpy package to import the data as a numpy array
  • Numpy arrays are the Python standard for storing numerical data, and other packages, e.g. scikit-learn for ML, are built to work with them

Importing using np.loadtxt()

  • Syntax:
    import numpy as np
    data = np.loadtxt('filename.txt', delimiter=',', skiprows=1, usecols=[0, 2])
  • skiprows is used when there is a string header, and usecols selects the columns you want, e.g. [0, 2] for the 1st and 3rd columns
  • To read all the data as strings: data = np.loadtxt('filename.txt', delimiter=',', dtype=str)
  • np.loadtxt() performs well for basic cases, but tends to break when columns have mixed datatypes
  • The best library for such situations is pandas, the Python standard for importing flat files as dataframes (a short sketch follows)
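  • A minimal pandas sketch of the same import ('filename.csv' is a placeholder name):
    import pandas as pd
    df = pd.read_csv('filename.csv') # each column gets its own inferred dtype
    print(df.head()) # inspect the first five rows
    data_array = df.values # convert to a numpy array if needed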

Importing Other File Types

  • Pickle files

    • Are Python objects (e.g. ML models, lists) that have been serialized (converted to a sequence of bytes, or bytestream)
    • To import:
      import pickle
      with open('filename.pkl', 'rb') as file:
          data = pickle.load(file)
      print(data)
    • 'rb' specifies that the file is opened read-only (r) and in binary mode (b)
  • Excel Spreadsheets

    • Commonly used file type
    • To import:
      import pandas as pd
      data = pd.ExcelFile('filename.xlsx') # read the Excel file into an ExcelFile object
      print(data.sheet_names) # print the sheets in the file
      df1 = data.parse('sheet1') # create a dataframe using the sheet name
      df2 = data.parse(0) # create a dataframe using the sheet index
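    • pandas can also read one sheet directly in a single call (a sketch; 'filename.xlsx' and 'sheet1' are placeholder names):
      df = pd.read_excel('filename.xlsx', sheet_name='sheet1')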
  • SAS (Statistical Analysis System) Files

    • Mostly used in business analytics and biostatistics
    • Most common SAS files have extensions .sas7bdat (dataset files) and .sas7bcat (catalog files)
    • To import the dataset files:
      from sas7bdat import SAS7BDAT
      with SAS7BDAT('filename.sas7bdat') as file:
          df_sas = file.to_data_frame()
  • Stata (Statistics + Data) Files

    • Mostly used in academic social sciences research, such as economics and epidemiology
    • Have extension .dta
    • To import:
      data = pd.read_stata('filename.dta')
  • HDF5 (Hierarchical Data Format version 5) Files

    • Used for storing large quantities of numerical data (100s of GBs/TBs).
    • HDF5 itself can scale up to exabytes
    • To import:
      import h5py
      data = h5py.File('filename.hdf5', 'r')
    • To access the structure of the file, you use the same method as for a dictionary:
      for key in data.keys():
          print(key) # the results are HDF groups, which are analogous to directories
    • To find the data in one HDF group:
      for key in data['group1'].keys():
          print(key) # returns the columns in the group
    • To access the values as a numpy array:
      print(np.array(data['group1']['column1']))
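    • h5py.File also works as a context manager, which closes the file for you; a minimal sketch reusing the hypothetical 'filename.hdf5', 'group1', and 'column1' names above:
      import numpy as np
      import h5py
      with h5py.File('filename.hdf5', 'r') as data:
          column1 = np.array(data['group1']['column1']) # copy the dataset out before the file closes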
  • MATLAB (Matrix Laboratory) Files

    • MATLAB is a numerical computing environment, and the industry standard in engineering and science
    • Data is saved as .mat files
    • Data is read and written using SciPy's methods: scipy.io.loadmat() and scipy.io.savemat()
    • To import:
      import scipy.io
      mat = scipy.io.loadmat('filename.mat')
    • The variable mat is a dictionary, with keys corresponding to MATLAB variable names, and values corresponding to objects assigned to variables
    • To access the keys: print(mat.keys())
    • To access one value: print(mat['key1'])

Working with Relational Databases in Python

  • Create a SQLite database engine using sqlalchemy:
    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///DatabaseName.sqlite')
  • To get the table names: engine.table_names()
  • To run SQL queries:
    con = engine.connect()
    results = con.execute('SQL Query')
    df = pd.DataFrame(results.fetchall())
    df.columns = results.keys() # defines the column names in df
    con.close()
  • To save the hassle of closing the connection, you can use a context manager:
    with engine.connect() as con:
        results = con.execute('SQL Query')
        df = pd.DataFrame(results.fetchall()) # or .fetchmany(10) to import 10 rows instead of all
        df.columns = results.keys()
  • You can run a SQL query in a single line of code using pandas:
    df = pd.read_sql_query('SQL Query', engine)
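  • Putting it together, a minimal end-to-end sketch ('DatabaseName.sqlite' and 'TableName' are placeholder names):
    from sqlalchemy import create_engine
    import pandas as pd
    engine = create_engine('sqlite:///DatabaseName.sqlite')
    df = pd.read_sql_query('SELECT * FROM TableName', engine)
    print(df.head())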

Importing Data From the Web

Reading Flat Files from the Web

  • Syntax:
    from urllib.request import urlretrieve
    import pandas as pd
    urlretrieve(url, 'path_to_save.csv') # save the file locally; url is a string holding the file's web address
    df = pd.read_csv(url, sep=';') # read directly without saving

Reading Non-flat Files from the Web

  • Syntax:
    xls = pd.read_excel(url, sheet_name=None) # sheet_name=None imports all sheets as a dict of dataframes
    for k, v in xls.items(): print(k) # display the sheet names
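  • To pull one sheet out of the resulting dict (assuming the file has a sheet named 'sheet1'):
    df = xls['sheet1']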

Performing HTTP Requests using urllib

  • Syntax:
    from urllib.request import urlopen, Request
    request = Request(url) # Packages the request
    response = urlopen(request) # Sends the request and catches the response
    html = response.read() # Extract the response
    response.close()
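  • urlopen() also works as a context manager, so the response is closed automatically:
    with urlopen(request) as response:
        html = response.read()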

Performing HTTP Requests using requests (High Level)

  • Syntax:
    import requests
    r = requests.get(url)
    text = r.text

Scraping the web in Python: BeautifulSoup

  • Syntax:
    from bs4 import BeautifulSoup
    import requests
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser') # naming the parser avoids a bs4 warning
    print(soup.prettify()) # print the parsed HTML, nicely indented
    for link in soup.find_all('a'): print(link.get('href')) # print all hyperlinks in the page

Interacting with APIs to import data from the web

  • An API is a set of protocols and routines for building and interacting with software applications

Loading data using JSON

  • Was invented for real-time server-to-browser communication

  • Syntax:
    import json
    with open('file.json', 'r') as json_file:
        json_data = json.load(json_file) # returns a dict

  • To pull data from an API using requests:
    r = requests.get(url)
    json_data = r.json()
    for key, value in json_data.items(): print(key + ':', value)

  • To initialize several counter variables at once using list unpacking (here counting tweets that mention each candidate; a sketch of word_in_text follows below):
    [clinton, trump, sanders, cruz] = [0, 0, 0, 0] # initialize the four tweet counters
    for index, row in df.iterrows():
        clinton += word_in_text('clinton', row['text'])
        trump += word_in_text('trump', row['text'])
        sanders += word_in_text('sanders', row['text'])
        cruz += word_in_text('cruz', row['text'])

    • To collect the counts into a list:
      y = [clinton, trump, sanders, cruz] # counter list
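  • word_in_text is not defined in these notes; a minimal sketch of such a helper, assuming a simple case-insensitive match is intended:
    import re
    def word_in_text(word, text):
        """Return 1 if word occurs in text (case-insensitive), else 0."""
        return 1 if re.search(word, text, re.IGNORECASE) else 0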