H1B VISA PREDICTION

Developers: Han Yang and Monica Ramos

Goal

To predict the approval rate for H1B visa applications in the United States by job industry sector, using available datasets from the four most recent fiscal years.


Tool

  • Python
  • PostgreSQL

Data Source

  1. H1B VISA Datasets
  • H-1B_Disclosure_RAW_Data_FY15.csv
  • H-1B_Disclosure_RAW_Data_FY16.csv
  • H-1B_Disclosure_RAW_Data_FY17.csv
  • H-1B_Disclosure_RAW_Data_FY18.csv

(The NAICS_CODE column in the FY17 and FY18 files was manually renamed to NAIC_CODE so that the column name is identical across all datasets; a programmatic alternative is sketched after this list.)

  2. NAICS Codes
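
A minimal sketch of doing that column rename in pandas instead of editing the files by hand (overwriting the raw files in place is an assumption about the intended workflow):

import pandas as pd

# Normalize the column name in the FY17/FY18 files to match FY15/FY16
for path in ['../H1B_Data/H-1B_Disclosure_RAW_Data_FY17.csv',
             '../H1B_Data/H-1B_Disclosure_RAW_Data_FY18.csv']:
    df = pd.read_csv(path, encoding="ISO-8859-1")
    df = df.rename(columns={'NAICS_CODE': 'NAIC_CODE'})
    df.to_csv(path, index=False)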

Web Scraping of NAICS Codes

Import dependencies

from bs4 import BeautifulSoup as bs
from splinter import Browser
import pandas as pd

Visit web page

# Launch Chrome through Splinter; the chromedriver path is machine-specific
executable_path={'executable_path':'/usr/local/bin/chromedriver'}
browser=Browser('chrome',**executable_path, headless=False)
url='https://www.naics.com/search-naics-codes-by-industry/'
browser.visit(url)
html=browser.html

The table on this page links to a full list of codes for each industry category. Those links were collected so that each code table could be scraped in turn.

soup=bs(html,'html.parser')
# Each 'noWrap' cell holds a link to that category's full code table
results=soup.find_all('td',{'class':'noWrap'})
code_url_list=[result.a['href'] for result in results]
browser.quit()
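
As an aside, if the landing page renders without JavaScript, the same links could likely be collected without a browser driver. A sketch using requests in place of Splinter (untested against the live site):

import requests
from bs4 import BeautifulSoup as bs

# Fetch the page directly; no chromedriver needed for static HTML
response = requests.get('https://www.naics.com/search-naics-codes-by-industry/')
soup = bs(response.text, 'html.parser')
code_url_list = [td.a['href'] for td in soup.find_all('td', {'class': 'noWrap'})]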

Use Pandas to scrape the full code table for each category and concatenate them into a single table (three rows whose codes contained '-' were meaningless and were deleted manually; a programmatic alternative is sketched after the code block).

table_list=[]
for url in code_url_list:
    tables=pd.read_html(url)  # read_html returns every table on the page
    df=tables[0]              # the code table is the first one
    df_new=df[['Codes','Titles']]
    table_list.append(df_new)
big_df=pd.concat(table_list)
big_df.to_csv('../H1B_Data/NAICS_CODE.csv',index=False, header=True)
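
A sketch of removing those '-' rows programmatically instead of by hand, applied before saving (this assumes the meaningless rows are the ones whose codes literally contain a '-', e.g. NAICS sector ranges like 31-33):

# Drop range-style placeholder rows before saving
big_df = big_df[~big_df['Codes'].astype(str).str.contains('-')]
big_df.to_csv('../H1B_Data/NAICS_CODE.csv', index=False, header=True)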

H1B VISA Datasets Cleaning

Import dependencies

import pandas as pd

Define a function to process each dataset individually, removing unnecessary columns, then concatenate the results into a single file.

data_list=['../H1B_Data/H-1B_Disclosure_RAW_Data_FY15.csv',
           '../H1B_Data/H-1B_Disclosure_RAW_Data_FY16.csv',
           '../H1B_Data/H-1B_Disclosure_RAW_Data_FY17.csv',
           '../H1B_Data/H-1B_Disclosure_RAW_Data_FY18.csv']

def clean(data):
    df=pd.read_csv(data, encoding="ISO-8859-1")
    # Keep only the columns needed downstream
    new_df=df[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMPLOYER_NAME","EMPLOYER_CITY","EMPLOYER_STATE", "EMPLOYER_COUNTRY", "NAIC_CODE"]]
    # Restrict to cases filed by US-based employers
    us_df=new_df[new_df["EMPLOYER_COUNTRY"]=="UNITED STATES OF AMERICA"]
    return us_df

H1B_df_list=[]
for data in data_list:
    h1b_df=clean(data)
    H1B_df_list.append(h1b_df)
all_H1B_df=pd.concat(H1B_df_list)
# Remove commas from employer names
all_H1B_df['EMPLOYER_NAME']=all_H1B_df['EMPLOYER_NAME'].str.replace(',','')
all_H1B_df.to_csv('../H1B_Data/Pre_Cleaned_H1B_Data.csv',index=False, header=True)

Keep the columns that are relevant to employers, drop duplicate employers, and reset the index twice so it becomes an EMP_ID column that keys the employer table in the database; export the result to employer.csv.

H1B_no_missing=all_H1B_df.dropna(how='any')
employer_df=H1B_no_missing[["EMPLOYER_NAME","EMPLOYER_CITY","EMPLOYER_STATE"]]

employer_df.drop_duplicates(subset="EMPLOYER_NAME",keep="first",inplace=True)
# Reset once to discard the old index, then again to materialize it as a column
reset_employer_df=employer_df.reset_index(inplace=False, drop=True)
new_reset_employer_df=reset_employer_df.reset_index(inplace=False, drop=False)

new_employer_df=new_reset_employer_df.rename(columns={'index':'EMP_ID'})
new_employer_df.head(10)
new_employer_df.to_csv('../H1B_Data/employer.csv',index=False)

Keep the columns that are relevant to employees, and add EMP_ID to them as a foreign key for the purpose of data normalization.

employee_df=H1B_no_missing[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMPLOYER_NAME","NAIC_CODE"]]
# Attach each employer's EMP_ID by joining on the employer name
merge_df=pd.merge(employee_df,new_employer_df,on="EMPLOYER_NAME")
new_employee_df=merge_df[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMP_ID","NAIC_CODE"]]
new_employee_df.head(10)
employee_no_missing_df=new_employee_df.dropna(how='any')
employee_no_missing_df['NAIC_CODE']=employee_no_missing_df['NAIC_CODE'].astype(int)

Since NAICS codes are updated periodically, codes from earlier years might not appear in the table obtained through web scraping, so only records whose NAICS codes match that table are kept.

naics_df=pd.read_csv('../H1B_Data/NAICS_CODE.csv')
# The inner merge drops records whose codes are absent from the scraped table
naics_employee_merge=pd.merge(employee_no_missing_df,naics_df,left_on='NAIC_CODE',right_on='Codes')
update_employee=naics_employee_merge[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMP_ID","NAIC_CODE"]]
update_employee.to_csv('../H1B_Data/employee.csv',index=False)
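
A quick sanity check before loading (a sketch; the assertion simply confirms the merge left no codes that the foreign key constraint defined in the next section would reject):

# Every NAIC_CODE in employee.csv must exist in NAICS_CODE.csv,
# otherwise the foreign key constraint below will fail on load
assert update_employee['NAIC_CODE'].isin(naics_df['Codes']).all()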

PostgreSQL Database

[ERD: entity relationship diagram of the three tables]

All datasets (NAICS_CODE.csv, employee.csv and employer.csv) were stored in a PostgreSQL database for further analysis.

CREATE TABLE "NAICS_CODE" (
    "CODES" DEC   NOT NULL,
    "TITLES" VARCHAR   NOT NULL,
    CONSTRAINT "pk_NAICS_CODE" PRIMARY KEY (
        "CODES"
     )
);

CREATE TABLE "EMPLOYEE" (
    "CASE_NUMBER" VARCHAR   NOT NULL,
    "CASE_STATUS" VARCHAR  NOT NULL,
    "DECISION_DATE" DATE   NOT NULL,
    "EMP_ID" INT   NOT NULL,
    "NAIC_CODE" DEC   NOT NULL
);

CREATE TABLE "EMPLOYER" (
    "EMP_ID" INT   NOT NULL,
    "EMPLOYER_NAME" VARCHAR   NOT NULL,
    "EMPLOYER_CITY" VARCHAR  NOT NULL,
    "EMPLOYER_STATE" VARCHAR  NOT NULL,
    CONSTRAINT "pk_EMPLOYER" PRIMARY KEY (
        "EMP_ID"
     )
);

ALTER TABLE "EMPLOYEE" ADD CONSTRAINT "fk_EMPLOYEE_NAIC_CODE" FOREIGN KEY("NAIC_CODE")
REFERENCES "NAICS_CODE" ("CODES");

ALTER TABLE "EMPLOYEE" ADD CONSTRAINT "fk_EMPLOYEE_EMP_ID" FOREIGN KEY("EMP_ID")
REFERENCES "EMPLOYER" ("EMP_ID");
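
The load step itself is not shown in the repository; below is a minimal sketch using pandas and SQLAlchemy, assuming a local database (the connection string, database name h1b_db, and credentials are placeholders). Parent tables are loaded before EMPLOYEE so the foreign keys are satisfied.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust user, password and database name
engine = create_engine('postgresql://postgres:password@localhost:5432/h1b_db')

# Load parent tables first, then EMPLOYEE, to satisfy the foreign keys;
# append into the tables created by the DDL above rather than replacing them
for csv_path, table in [('../H1B_Data/NAICS_CODE.csv', 'NAICS_CODE'),
                        ('../H1B_Data/employer.csv', 'EMPLOYER'),
                        ('../H1B_Data/employee.csv', 'EMPLOYEE')]:
    df = pd.read_csv(csv_path)
    # Quoted identifiers are case-sensitive, so uppercase the CSV headers
    # (e.g. 'Codes'/'Titles' -> 'CODES'/'TITLES')
    df.columns = [c.upper() for c in df.columns]
    df.to_sql(table, engine, if_exists='append', index=False)

Once loaded, the schema supports the stated goal directly. For example, the approval rate by industry title could be computed as follows (assuming 'CERTIFIED' marks an approved case in CASE_STATUS):

SELECT n."TITLES",
       AVG(CASE WHEN e."CASE_STATUS" = 'CERTIFIED' THEN 1.0 ELSE 0 END) AS approval_rate
FROM "EMPLOYEE" e
JOIN "NAICS_CODE" n ON e."NAIC_CODE" = n."CODES"
GROUP BY n."TITLES"
ORDER BY approval_rate DESC;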
