Personal Data Extractor

This project is designed to automate the extraction of personal data from LinkedIn profiles, GitHub repositories, and resumes, and then generate a structured JSON file with the extracted information.

The project is written in Python and uses Selenium and BeautifulSoup for scraping, along with supporting libraries such as webdriver-manager, python-dotenv, and PyMuPDF.

Before you begin, ensure you have the following installed:

Python 3.x
pip (Python package installer)
Google Chrome Browser
ChromeDriver (automatically managed by webdriver_manager)

Installation

Clone the Repository

git clone https://github.com/yourusername/RAG-Personal-Data-Extractor.git
cd RAG-Personal-Data-Extractor

Install Required Python Packages

Install all the required Python packages using pip:

pip install -r requirements.txt

If you don't have a requirements.txt file, you can manually install the required packages:

pip install selenium beautifulsoup4 webdriver-manager python-dotenv pymupdf
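
To confirm that the packages above are importable before running the project, a quick check along these lines may help. It is a minimal sketch and not part of the repository: the helper name missing_modules is hypothetical, and the mapping lists the module name each pip package is normally imported under (for example, beautifulsoup4 imports as bs4 and pymupdf as fitz).

```python
import importlib.util

# pip package name -> module name used in `import` statements.
PACKAGES = {
    "selenium": "selenium",
    "beautifulsoup4": "bs4",
    "webdriver-manager": "webdriver_manager",
    "python-dotenv": "dotenv",
    "pymupdf": "fitz",
}


def missing_modules(packages):
    """Return the pip names whose module cannot be located (hypothetical helper)."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(PACKAGES)
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```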

Environment Setup

Create a .env file in the project root directory and add your credentials:

[email protected]
LINKEDIN_PASSWORD=your_password
GITHUB_ACCESS_TOKEN=your_github_access_token
LINKEDIN_URL=https://www.linkedin.com/in/your-profile-url/
GOOGLE_API_KEY=your_google_api_key
MONGO_URI=your_mongodb_uri
MONGO_DB_NAME=your_database_name
MONGO_CL_NAME=your_collection_name
USER_NAME=your_name
USER_EMAIL=your_email

Replace [email protected] with your LinkedIn email.
Replace your_password with your LinkedIn password.
Replace your_github_access_token with your GitHub Personal Access Token.
Replace https://www.linkedin.com/in/your-profile-url/ with your LinkedIn profile URL.
Replace your_google_api_key with your Google API key.
Replace your_mongodb_uri with your MongoDB connection string.
Replace your_database_name with your MongoDB database name.
Replace your_collection_name with your MongoDB collection name for basic details.
Replace your_name with your name.
Replace your_email with your email.
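
Before running the scrapers, it can be useful to check that the .env values are actually set. The following is a minimal sketch under stated assumptions, not the project's own configuration code: missing_settings and REQUIRED_SETTINGS are hypothetical names, and the list covers only the variable names spelled out in full above. If python-dotenv is installed, load_dotenv() reads the .env file into the environment first.

```python
import os

# Variable names taken from the .env example; extend the list as needed.
REQUIRED_SETTINGS = [
    "LINKEDIN_PASSWORD",
    "GITHUB_ACCESS_TOKEN",
    "LINKEDIN_URL",
    "GOOGLE_API_KEY",
    "MONGO_URI",
    "MONGO_DB_NAME",
    "MONGO_CL_NAME",
    "USER_NAME",
    "USER_EMAIL",
]


def missing_settings(env):
    """Return the required names that are absent or blank in `env` (hypothetical helper)."""
    return [
        name
        for name in REQUIRED_SETTINGS
        if not str(env.get(name, "")).strip()
    ]


if __name__ == "__main__":
    try:
        from dotenv import load_dotenv  # provided by python-dotenv

        load_dotenv()  # reads .env from the working directory
    except ImportError:
        print("python-dotenv is not installed; checking the current environment only.")

    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All listed settings are present.")
```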

Add Your Resume

Place your resume file named Resume.pdf in the ./resources/ directory of the project. This file will be used for resume parsing.

Running the Project

Execute the main Python script to start the data extraction process:

python main.py

The script will:

Log in to your LinkedIn account and scrape the profile data.
Fetch all repositories from your GitHub account and extract README content.
Parse your resume from a PDF file located in ./resources/Resume.pdf.
Combine the extracted data into a structured JSON file and save it as final_data.json.

Review the Output

The extracted data will be printed in the terminal and saved in a JSON file named final_data.json in the project root directory.
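
The combine-and-save step can be sketched roughly as follows. This is an illustration only, not the code in data_processing.py: the function names and the top-level keys (linkedin, github, resume) are assumptions, and the real final_data.json may be organized differently.

```python
import json
from pathlib import Path


def build_final_data(linkedin, github, resume):
    """Merge the three extraction results under assumed top-level keys."""
    return {"linkedin": linkedin, "github": github, "resume": resume}


def save_final_data(data, path="final_data.json"):
    """Write `data` as indented UTF-8 JSON and return the output path."""
    path = Path(path)
    path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path


if __name__ == "__main__":
    # Placeholder values standing in for real scraper output.
    sample = build_final_data(
        linkedin={"headline": "placeholder"},
        github=[{"name": "placeholder-repo", "readme": "placeholder"}],
        resume={"text": "placeholder"},
    )
    print(json.dumps(sample, indent=2))
```

Calling save_final_data(sample) would then write the file to the current working directory.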

Project Structure

├── linkedin_scraper.py         # LinkedIn scraping script
├── github_scraper.py           # GitHub scraping script
├── resume_parser.py            # Resume parsing script
├── user_input.py               # Script for handling user input
├── data_processing.py          # Script for processing and generating final JSON
├── main.py                     # Main script to run the project
├── .env                        # Environment variables file
├── README.md                   # Project documentation
├── requirements.txt            # Python package dependencies
└── resources/
    └── Resume.pdf              # Your resume file for parsing

Troubleshooting

403 Error When Fetching GitHub Data:

Ensure your GitHub Personal Access Token has the necessary permissions to read your repositories.
Check your API rate limits on GitHub.
If you don't have a personal access token, create one at https://github.com/settings/tokens.
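
To see whether a 403 comes from an exhausted rate limit, the token can be tested against GitHub's rate_limit endpoint. The sketch below uses only the standard library and is not taken from github_scraper.py; the helper names are hypothetical, and it assumes the token is available as GITHUB_ACCESS_TOKEN in the environment.

```python
import json
import os
import urllib.error
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token):
    """Build an authenticated request for the rate limit endpoint."""
    return urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def summarize_core_limit(payload):
    """Summarize the core REST API quota from the endpoint's JSON payload."""
    core = payload["resources"]["core"]
    return f"{core['remaining']} of {core['limit']} core requests remaining"


if __name__ == "__main__":
    token = os.environ.get("GITHUB_ACCESS_TOKEN")
    if not token:
        print("GITHUB_ACCESS_TOKEN is not set.")
    else:
        try:
            request = build_rate_limit_request(token)
            with urllib.request.urlopen(request, timeout=10) as response:
                print(summarize_core_limit(json.load(response)))
        except urllib.error.URLError as exc:
            # An HTTP 401 here usually points to an invalid or expired token.
            print(f"Rate limit check failed: {exc}")
```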

LinkedIn Scraping Issues:

If Selenium is not able to log in to LinkedIn, ensure that the email and password in the .env file are correct.
If LinkedIn profile data is not being scraped correctly, verify the profile URL and the page structure for any changes.
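
A rough way to verify the value of LINKEDIN_URL is to check that it has the same shape as the example in the .env file. The function below is a heuristic sketch with a hypothetical name; it assumes profiles live under https://www.linkedin.com/in/ and does not confirm that the profile exists or that the page structure is unchanged.

```python
from urllib.parse import urlsplit


def looks_like_profile_url(url):
    """Return True if `url` has the form https://www.linkedin.com/in/<slug>/."""
    parts = urlsplit(url.strip())
    segments = parts.path.strip("/").split("/")
    return (
        parts.scheme == "https"
        and parts.hostname in ("www.linkedin.com", "linkedin.com")
        and len(segments) == 2
        and segments[0] == "in"
        and segments[1] != ""
    )


if __name__ == "__main__":
    for candidate in (
        "https://www.linkedin.com/in/your-profile-url/",
        "https://www.linkedin.com/feed/",
    ):
        print(candidate, looks_like_profile_url(candidate))
```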

PDF Extraction Issues:

Ensure that the pymupdf package is installed correctly and that your resume file is located in the ./resources/ directory.
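
The file location can be checked independently of pymupdf. This sketch (the name resume_problems is hypothetical) confirms that ./resources/Resume.pdf exists and begins with the %PDF- marker that PDF files normally start with. On case-sensitive file systems, resume.pdf and Resume.pdf are different files.

```python
from pathlib import Path


def resume_problems(path="./resources/Resume.pdf"):
    """Return a list of problems found with the resume file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    with path.open("rb") as handle:
        header = handle.read(5)
    if header != b"%PDF-":
        problems.append(f"{path} does not start with the %PDF- marker")
    return problems


if __name__ == "__main__":
    issues = resume_problems()
    print("\n".join(issues) if issues else "Resume file looks fine.")
```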

MongoDB Connection Issues:

Ensure that your MongoDB connection string (MONGO_URI) is correct and includes the proper username and password.
Verify that the user has the necessary permissions to access the database and collections.
Check MongoDB logs for more detailed error messages if authentication fails.
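
A common cause of authentication failures is a password containing characters such as @ or / that were not percent-encoded in the connection string. The sketch below, using only the standard library, checks that MONGO_URI carries credentials and shows how to encode them. The helper names are hypothetical, the host is a placeholder, and the check assumes a deployment that requires authentication.

```python
from urllib.parse import quote_plus, urlsplit


def uri_problems(uri):
    """Return basic problems found in a MongoDB connection string."""
    parts = urlsplit(uri)
    problems = []
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        problems.append("scheme should be mongodb:// or mongodb+srv://")
    if not parts.username:
        problems.append("username is missing")
    if not parts.password:
        problems.append("password is missing")
    return problems


def build_uri(user, password, host, scheme="mongodb+srv"):
    """Build a connection string with percent-encoded credentials."""
    return f"{scheme}://{quote_plus(user)}:{quote_plus(password)}@{host}/"


if __name__ == "__main__":
    # cluster.example.net is a placeholder host, not a real deployment.
    uri = build_uri("your_user", "p@ss/word", "cluster.example.net")
    print(uri)
    print(uri_problems(uri) or "No obvious problems found.")
```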

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, feel free to create a pull request or open an issue on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
