H1B VISA PREDICTION

Developer: Han Yang and Monica Ramos

Goal

To predict the approval rate of obtaining an H1B VISA in United States of America based on job industry sectors through availabe datasets of the most recent 4 fiscal years.

Tool

Python
postgreSQL

Data Source

H1B VISA Datasets

H-1B_Disclosure_RAW_Data_FY15.csv
H-1B_Disclosure_RAW_Data_FY16.csv
H-1B_Disclosure_RAW_Data_FY17.csv
H-1B_Disclosure_RAW_Data_FY18.csv

(NAICS_CODE in FY17 and FY18 were manually changed to NAIC_CODE to make it identical across all datasets.)

NAICS Codes

Web Scraping of NAICS Codes

Install dependencies

from bs4 import BeautifulSoup as bs
from splinter import Browser
import pandas as pd

Visit web page

executable_path={'executable_path':'/usr/local/bin/chromedriver'}
browser=Browser('chrome',**executable_path, headless=False)
url='https://www.naics.com/search-naics-codes-by-industry/'
browser.visit(url)
html=browser.html

The table on the page provides links to tables of a full list of codes for each category. Those links were obtained in order to access those tables.

soup=bs(html,'html.parser')
results=soup.find_all('td',{'class':'noWrap'})
code_url_list=[result.a['href'] for result in results]
browser.quit()

Use Pandas to scrape the full code table for each category and concatdenate them into a single tables (3 rows of code containing '-' were meaningless and deleted manually).

table_list=[]
for url in code_url_list:
    tables=pd.read_html(url) 
    df=tables[0] 
    df_new=df[['Codes','Titles']] 
    table_list.append(df_new)
big_df=pd.concat([*table_list])
big_df.to_csv('../H1B_Data/NAICS_CODE.csv',index=False, header=True)

H1B VISA Datasets Cleaning

Install dependencies

import pandas as pd

Define a function to process each dataset respectively to remove unnecessary columns and concatdenate them into a single file.

data_list=['../H1B_Data/H-1B_Disclosure_RAW_Data_FY15.csv','../H1B_Data/H-1B_Disclosure_RAW_Data_FY16.csv','../H1B_Data/H-1B_Disclosure_RAW_Data_FY17.csv','../H1B_Data/H-1B_Disclosure_RAW_Data_FY18.csv']

def clean(data):
    df=pd.read_csv(data, encoding="ISO-8859-1")
    new_df=df[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMPLOYER_NAME","EMPLOYER_CITY","EMPLOYER_STATE", "EMPLOYER_COUNTRY", "NAIC_CODE"]]
    us_df=new_df[new_df["EMPLOYER_COUNTRY"]=="UNITED STATES OF AMERICA"]
    return us_df

H1B_df_list=[]
for data in data_list:
    h1b_df=clean(data)
    H1B_df_list.append(h1b_df)
all_H1B_df=pd.concat([*H1B_df_list]) 
all_H1B_df['EMPLOYER_NAME']=all_H1B_df['EMPLOYER_NAME'].str.replace(',','') 
all_H1B_df.to_csv('../H1B_Data/Pre_Cleaned_H1B_Data.csv',index=False, header=True)

Keep columns that are relevant to employers, reset index twice to rename it EMP_ID as a foreign key in database and export to employer.csv.

H1B_no_missing=all_H1B_df.dropna(how='any')
employee_df=H1B_no_missing[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMPLOYER_NAME","NAIC_CODE"]]

employer_df.drop_duplicates(subset="EMPLOYER_NAME",keep="first",inplace=True)
reset_employer_df=employer_df.reset_index(inplace=False, drop=True)
new_reset_employer_df=reset_employer_df.reset_index(inplace=False, drop=False)

new_employer_df=new_reset_employer_df.rename(columns={'index':'EMP_ID'})
new_employer_df.head(10)
new_employer_df.to_csv('../H1B_Data/employer.csv',index=False)

Keep columns that are relevant to employees, add EMP_ID to it as a foreign key for the popurse of data normaliztion.

employer_df=H1B_no_missing[["EMPLOYER_NAME","EMPLOYER_CITY","EMPLOYER_STATE"]]
merge_df=pd.merge(employee_df,new_employer_df,on="EMPLOYER_NAME")
new_employee_df=merge_df[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMP_ID","NAIC_CODE"]]
new_employee_df.head(10)
employee_no_missing_df=new_employee_df.dropna(how='any')
employee_no_missing_df['NAIC_CODE'].astype(int)

Since NAICS codes are updated yearly, codes of previous years might not be in the table was obtained through web scraping. Only keep those records whose NAICS codes are compatible with our table.

nacis_df=pd.read_csv('../H1B_Data/NAICS_CODE.csv')
nac_employee_merge=pd.merge(employee_no_missing_df,nacis_df,left_on='NAIC_CODE',right_on='Codes')
update_employee=nac_employee_merge[["CASE_NUMBER","CASE_STATUS", "DECISION_DATE", "EMP_ID","NAIC_CODE"]]
update_employee.to_csv('../H1B_Data/employee.csv',index=False)

postgreSQL DataBase

All datasets (NAICS_CODE.csv, employee.csv and employer.csv) were stored into PostgresSQL database for further analysis.

CREATE TABLE "NAICS_CODE" (
    "CODES" DEC   NOT NULL,
    "TITLES" VARCHAR   NOT NULL,
    CONSTRAINT "pk_NAICS_CODE" PRIMARY KEY (
        "CODES"
     )
);

CREATE TABLE "EMPLOYEE" (
    "CASE_NUMBER" VARCHAR   NOT NULL,
    "CASE_STATUS" VARCHAR  NOT NULL,
    "DECISION_DATE" DATE   NOT NULL,
    "EMP_ID" INT   NOT NULL,
    "NAIC_CODE" DEC   NOT NULL
);

CREATE TABLE "EMPLOYER" (
    "EMP_ID" INT   NOT NULL,
    "EMPLOYER_NAME" VARCHAR   NOT NULL,
    "EMPLOYER_CITY" VARCHAR  NOT NULL,
    "EMPLOYER_STATE" VARCHAR  NOT NULL,
    CONSTRAINT "pk_EMPLOYER" PRIMARY KEY (
        "EMP_ID"
     )
);

ALTER TABLE "EMPLOYEE" ADD CONSTRAINT "fk_EMPLOYEE_NAIC_CODE" FOREIGN KEY("NAIC_CODE")
REFERENCES "NAICS_CODE" ("CODES");

ALTER TABLE "EMPLOYEE" ADD CONSTRAINT "fk_EMPLOYEE_EMP_ID" FOREIGN KEY("EMP_ID")
REFERENCES "EMPLOYER" ("EMP_ID");

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.gitignore		.gitignore
CleaningData.ipynb		CleaningData.ipynb
ERD.png		ERD.png
H1B_DB.sql		H1B_DB.sql
NAICS_Code_Scrapping.ipynb		NAICS_Code_Scrapping.ipynb
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

H1B VISA PREDICTION

Developer: Han Yang and Monica Ramos

Goal

Tool

Data Source

Web Scraping of NAICS Codes

H1B VISA Datasets Cleaning

postgreSQL DataBase

About

Releases

Packages

Languages

hanyang2019/Project_ETL

Folders and files

Latest commit

History

Repository files navigation

H1B VISA PREDICTION

Developer: Han Yang and Monica Ramos

Goal

Tool

Data Source

Web Scraping of NAICS Codes

H1B VISA Datasets Cleaning

postgreSQL DataBase

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages