This project is designed to automate the extraction of personal data from LinkedIn profiles, GitHub repositories, and resumes, and then generate a structured JSON file with the extracted information.
The project uses Python, Selenium, BeautifulSoup, and various other libraries to accomplish this.
Before you begin, ensure you have the following installed:
- Python 3.x
- pip (Python package installer)
- Google Chrome Browser
- ChromeDriver (automatically managed by webdriver_manager)
Clone the Repository
git clone https://github.com/yourusername/RAG-Personal-Data-Extractor.git
cd RAG-Personal-Data-Extractor
Install Required Python Packages
Install all the required Python packages using pip:
pip install -r requirements.txt
If you don't have a requirements.txt file, you can manually install the required packages:
pip install selenium beautifulsoup4 webdriver-manager python-dotenv pymupdf
Environment Setup
Create a .env file in the project root directory and add your credentials:
[email protected]
LINKEDIN_PASSWORD=your_password
GITHUB_ACCESS_TOKEN=your_github_access_token
LINKEDIN_URL=https://www.linkedin.com/in/your-profile-url/
GOOGLE_API_KEY=your_google_api_key
MONGO_URI=your_mongodb_uri
MONGO_DB_NAME=your_database_name
MONGO_CL_NAME=your_collection_name
USER_NAME=your_name
USER_EMAIL=your_email
- Replace
[email protected]
with your LinkedIn email. - Replace
your_password
with your LinkedIn password. - Replace
your_github_access_token
with your GitHub Personal Access Token. - Replace
https://www.linkedin.com/in/your-profile-url/
with your LinkedIn profile URL. - Replace
your_google_api_key
with your Google API key. - Replace
your_mongodb_uri
with your MongoDB connection string. - Replace
your_database_name
with your MongoDB database name. - Replace
your_collection_name
with your MongoDB collection name for basic details. - Replace
your_name
with your name. - Replace
your_email
with your email.
Place your resume file named Resume.pdf
in the ./resources/
directory of the project. This file will be used for resume parsing.
Execute the main Python script to start the data extraction process:
python main.py
The script will:
- Log in to your LinkedIn account and scrape the profile data.
- Fetch all repositories from your GitHub account and extract README content.
- Parse your resume from a PDF file located in ./resources/Resume.pdf.
- Combine the extracted data into a structured JSON file and save it as final_data.json.
- Review the Output
The extracted data will be printed in the terminal and saved in a JSON file named final_data.json in the project root directory.
├── linkedin_scraper.py # LinkedIn scraping script
├── github_scraper.py # GitHub scraping script
├── resume_parser.py # Resume parsing script
├── user_input.py # Script for handling user input
├── data_processing.py # Script for processing and generating final JSON
├── main.py # Main script to run the project
├── .env # Environment variables file
├── README.md # Project documentation
├── requirements.txt # Python package dependencies
└── resources/
└── Resume.pdf # Your resume file for parsing
403 Error When Fetching GitHub Data:
- Ensure your GitHub Personal Access Token has the necessary permissions to read your repositories.
- Check your API rate limits on GitHub.
- If you don't have a personal access token, go to https://github.com/settings/tokens
LinkedIn Scraping Issues:
- If Selenium is not able to log in to LinkedIn, ensure that the email and password in the .env file are correct.
- If LinkedIn profile data is not being scraped correctly, verify the profile URL and the page structure for any changes.
PDF Extraction Issues:
- Ensure that the pymupdf package is installed correctly and that your resume file is located in the ./resources/ directory.
MongoDB Connection Issues:
- Ensure that your MongoDB connection string (
MONGO_URI
) is correct and includes the proper username and password. - Verify that the user has the necessary permissions to access the database and collections.
- Check MongoDB logs for more detailed error messages if authentication fails.
Contributions are welcome! If you find any issues or have suggestions for improvements, feel free to create a pull request or open an issue on GitHub.
This project is licensed under the MIT License. See the LICENSE file for details.