diff --git a/MetaDataScraper.egg-info/PKG-INFO b/MetaDataScraper.egg-info/PKG-INFO new file mode 100644 index 0000000..f378879 --- /dev/null +++ b/MetaDataScraper.egg-info/PKG-INFO @@ -0,0 +1,89 @@ +Metadata-Version: 2.1 +Name: MetaDataScraper +Version: 1.0.2 +Summary: A module designed to automate the extraction of follower counts and post details from a public Facebook page. +Author-email: Ishan Surana +Project-URL: Homepage, https://metadatascraper.readthedocs.io/en/latest/ +Classifier: Programming Language :: Python :: 3 +Classifier: License :: OSI Approved :: Apache Software License +Classifier: Operating System :: Microsoft :: Windows +Requires-Python: >=3.10 +Description-Content-Type: text/markdown +License-File: LICENCE +Requires-Dist: selenium==4.1.0 +Requires-Dist: webdriver-manager==4.0.1 + +[![Licence](https://badgen.net/github/license/ishan-surana/MetaDataScraper?color=DC143C)](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) [![Python](https://img.shields.io/badge/python-%3E=3.10-slateblue.svg)](https://www.python.org/downloads/release/python-3119/) [![Wheel](https://img.shields.io/badge/wheel-yes-FF00C9.svg)](https://files.pythonhosted.org/packages/02/80/c53d5e8439361c913e23b6345e85e748a7ac7e82e22cb9f7cd9ec77d5d52/MetaDataScraper-1.0.0-py3-none-any.whl) [![Latest](https://badgen.net/github/release/ishan-surana/MetaDataScraper?label=latest+release&color=green)](https://pypi.org/project/MetaDataScraper/1.0.0/) [![Releases](https://badgen.net/github/releases/ishan-surana/MetaDataScraper?color=orange)](https://github.com/ishan-surana/MetaDataScraper/releases) [![Stars](https://badgen.net/github/stars/ishan-surana/MetaDataScraper?color=yellow)](https://github.com/ishan-surana/MetaDataScraper/stargazers) [![Forks](https://badgen.net/github/forks/ishan-surana/MetaDataScraper?color=dark)](https://github.com/ishan-surana/MetaDataScraper/forks) 
[![Issues](https://badgen.net/github/issues/ishan-surana/MetaDataScraper?color=800000)](https://github.com/ishan-surana/MetaDataScraper/issues) [![PRs](https://badgen.net/github/prs/ishan-surana/MetaDataScraper?color=C71585)](https://github.com/ishan-surana/MetaDataScraper/pulls) [![Last commit](https://badgen.net/github/last-commit/ishan-surana/MetaDataScraper?color=blue)](https://github.com/ishan-surana/MetaDataScraper/commits/main/) ![Downloads](https://img.shields.io/github/downloads/ishan-surana/MetaDataScraper/total) [![Workflow](https://github.com/ishan-surana/MetaDataScraper/actions/workflows/python-publish.yml/badge.svg)](https://github.com/ishan-surana/MetaDataScraper/blob/main/.github/workflows/python-publish.yml) [![PyPI](https://d25lcipzij17d.cloudfront.net/badge.svg?id=py&r=r&ts=1683906897&type=6e&v=1.0.0&x2=0)](https://pypi.org/project/MetaDataScraper/) [![Maintained](https://img.shields.io/badge/maintained-yes-cyan)](https://github.com/ishan-surana/MetaDataScraper/pulse) [![OS](https://img.shields.io/badge/OS-Windows-FF0000)](https://www.microsoft.com/software-download/windows11) [![Documentation Status](https://readthedocs.org/projects/metadatascraper/badge/?version=latest)](https://metadatascraper.readthedocs.io/en/latest/?badge=latest) + +# MetaDataScraper + +MetaDataScraper is a Python package designed to automate the extraction of information such as follower counts, post details, and post interactions from a public Facebook page, returned in the form of lists. It uses Selenium WebDriver for web automation and scraping. +The module provides two classes: `LoginlessScraper` and `LoggedInScraper`. The `LoginlessScraper` class does not require any authentication or API keys to scrape data; however, it cannot access some Facebook pages. +The `LoggedInScraper` class overcomes this limitation by using the credentials of a Facebook account (the user's) to log in and scrape the data. 
+ +## Installation + +You can install MetaDataScraper using pip: + +``` +pip install MetaDataScraper +``` + +Make sure you have Python 3.10 or later and pip installed. + +## Usage + +To use MetaDataScraper, follow these steps: + +1. Import the `LoginlessScraper` or the `LoggedInScraper` class: + + ```python + from MetaDataScraper import LoginlessScraper, LoggedInScraper + ``` + +2. Initialize the scraper with the Facebook page ID: + + ```python + page_id = "your_target_page_id" + # Option 1: scrape without logging in + scraper = LoginlessScraper(page_id) + # Option 2: log in with a Facebook account for broader page access + email = "your_facebook_email" + password = "your_facebook_password" + scraper = LoggedInScraper(page_id, email, password) + ``` + +3. Scrape the Facebook page to retrieve information: + + ```python + result = scraper.scrape() + ``` + +4. Access the scraped data from the result dictionary: + + ```python + print(f"Followers: {result['followers']}") + print(f"Post Texts: {result['post_texts']}") + print(f"Post Likes: {result['post_likes']}") + print(f"Post Shares: {result['post_shares']}") + print(f"Is Video: {result['is_video']}") + print(f"Video Links: {result['video_links']}") + ``` + +## Features + +- **Automated Extraction**: Automatically fetches follower counts, post texts, likes, shares, and video links from Facebook pages. +- **Comprehensive Data Retrieval**: Retrieves detailed information about each post, including text content, interaction metrics (likes, shares), and multimedia (e.g., video links). +- **Flexible Handling**: Adapts to diverse post structures and various types of multimedia content present on Facebook pages, like post texts or reels. +- **Enhanced Access with Logged-In Scraper**: Overcomes limitations faced by anonymous scraping (loginless) by utilizing Facebook account credentials for broader page access. +- **Headless Operation**: Executes scraping tasks in headless mode, ensuring seamless and non-intrusive data collection without displaying a browser interface. 
+- **Scalability**: Supports scaling to handle large volumes of data extraction efficiently, suitable for monitoring multiple Facebook pages simultaneously. +- **Dependency Management**: Utilizes Selenium WebDriver for robust web automation and scraping capabilities, compatible with Python 3.10+ environments. +- **Ease of Use**: Simplifies the process with straightforward initialization and method calls, facilitating quick integration into existing workflows. + +## Dependencies + +- selenium +- webdriver_manager + +## License + +This project is licensed under the Apache Software License Version 2.0 - see the [LICENCE](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) file for details. diff --git a/MetaDataScraper.egg-info/SOURCES.txt b/MetaDataScraper.egg-info/SOURCES.txt new file mode 100644 index 0000000..1718c3a --- /dev/null +++ b/MetaDataScraper.egg-info/SOURCES.txt @@ -0,0 +1,10 @@ +LICENCE +README.md +pyproject.toml +MetaDataScraper/FacebookScraper.py +MetaDataScraper/__init__.py +MetaDataScraper.egg-info/PKG-INFO +MetaDataScraper.egg-info/SOURCES.txt +MetaDataScraper.egg-info/dependency_links.txt +MetaDataScraper.egg-info/requires.txt +MetaDataScraper.egg-info/top_level.txt \ No newline at end of file diff --git a/MetaDataScraper.egg-info/dependency_links.txt b/MetaDataScraper.egg-info/dependency_links.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/MetaDataScraper.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/MetaDataScraper.egg-info/requires.txt b/MetaDataScraper.egg-info/requires.txt new file mode 100644 index 0000000..a45c98f --- /dev/null +++ b/MetaDataScraper.egg-info/requires.txt @@ -0,0 +1,2 @@ +selenium==4.1.0 +webdriver-manager==4.0.1 diff --git a/MetaDataScraper.egg-info/top_level.txt b/MetaDataScraper.egg-info/top_level.txt new file mode 100644 index 0000000..e39f09c --- /dev/null +++ b/MetaDataScraper.egg-info/top_level.txt @@ -0,0 +1 @@ +MetaDataScraper diff --git 
a/MetaDataScraper/FacebookScraper.py b/MetaDataScraper/FacebookScraper.py index 6d917e6..89ab1da 100644 --- a/MetaDataScraper/FacebookScraper.py +++ b/MetaDataScraper/FacebookScraper.py @@ -1,13 +1,12 @@ +import time +import logging from selenium import webdriver from selenium.webdriver.chrome.service import Service from selenium.webdriver.common.by import By -from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.keys import Keys -from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC -import time -import logging +from webdriver_manager.chrome import ChromeDriverManager logging.getLogger().setLevel(logging.CRITICAL) class LoginlessScraper: @@ -471,7 +470,7 @@ def __scroll_to_top(self): def __get_xpath_constructor(self): """Constructs the XPath for locating posts on the Facebook page.""" - xpath_return_script = r""" + _xpath_return_script = r""" var iterator = document.evaluate('.//*[@aria-label="Like"]', document); var firstelement = iterator.iterateNext() var firstpost = firstelement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement @@ -509,79 +508,79 @@ def __get_xpath_constructor(self): } return xpath_first """ - xpath_constructor = self.driver.execute_script(xpath_return_script) - split_xpath = xpath_constructor.split('[') - split_index = split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div') - self.xpath_first = '['.join(split_xpath[:split_index])+'[' - self.xpath_last = '['+'['.join(split_xpath[split_index+1:]) - self.xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div' - if len(self.driver.find_element(By.XPATH, xpath_constructor).find_elements(By.TAG_NAME, 'video')): - self.xpath_last = '/'.join(self.xpath_last.split('/')[:3]) + _xpath_constructor = self.driver.execute_script(_xpath_return_script) + 
_split_xpath = _xpath_constructor.split('[') + _split_index = _split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div') + self._xpath_first = '['.join(_split_xpath[:_split_index])+'[' + self._xpath_last = '['+'['.join(_split_xpath[_split_index+1:]) + self._xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div' + if len(self.driver.find_element(By.XPATH, _xpath_constructor).find_elements(By.TAG_NAME, 'video')): + self._xpath_last = '/'.join(self._xpath_last.split('/')[:3]) def __extract_post_details(self): """Extracts details of posts including text, likes, shares, and video links.""" - c = 1 - error_count = 0 + _c = 1 + _error_count = 0 while True: - xpath = self.xpath_first + str(c) + self.xpath_identifier_addum + self.xpath_last - if not self.driver.find_elements(By.XPATH, xpath): - error_count += 1 - if error_count < 3: - print('Error extracting post', c, '\b. Count', error_count,'Retrying extraction...', end='\r') + _xpath = self._xpath_first + str(_c) + self._xpath_identifier_addum + self._xpath_last + if not self.driver.find_elements(By.XPATH, _xpath): + _error_count += 1 + if _error_count < 3: + print('Error extracting post', _c, '\b. 
Count', _error_count,'Retrying extraction...', end='\r') time.sleep(5) self.driver.execute_script("window.scrollBy(0, +40);") continue break - error_count = 0 + _error_count = 0 print(" "*100, end='\r') - print("Extracting data of post", c, end='\r') - self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, xpath)[0]) - post_components = self.driver.find_element(By.XPATH, xpath).find_elements(By.XPATH, './*') - if len(post_components) > 2: - post_text = '\n'.join(post_components[2].text.split('\n')) - if post_components[3].text.split('\n')[0] == 'All reactions:': - post_likes = post_components[3].text.split('\n')[1] - if len(post_components[3].text.split('\n')) > 4: - post_shares = post_components[3].text.split('\n')[4].split(' ')[0] - elif len(post_components) > 4 and post_components[4].text.split('\n')[0] == 'All reactions:': - post_likes = post_components[4].text.split('\n')[1] - if len(post_components[4].text.split('\n')) > 4: - post_shares = post_components[4].text.split('\n')[4].split(' ')[0] + print("Extracting data of post", _c, end='\r') + self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, _xpath)[0]) + _post_components = self.driver.find_element(By.XPATH, _xpath).find_elements(By.XPATH, './*') + if len(_post_components) > 2: + _post_text = '\n'.join(_post_components[2].text.split('\n')) + if _post_components[3].text.split('\n')[0] == 'All reactions:': + _post_like = _post_components[3].text.split('\n')[1] + if len(_post_components[3].text.split('\n')) > 4: + _post_share = _post_components[3].text.split('\n')[4].split(' ')[0] + elif len(_post_components) > 4 and _post_components[4].text.split('\n')[0] == 'All reactions:': + _post_like = _post_components[4].text.split('\n')[1] + if len(_post_components[4].text.split('\n')) > 4: + _post_share = _post_components[4].text.split('\n')[4].split(' ')[0] else: - post_likes = 0 - post_shares = 0 - self.post_texts.append(post_text) 
- self.post_likes.append(post_likes if post_likes else 0) - self.post_shares.append(post_shares if post_shares else 0) - elif len(post_components) == 2: + _post_like = 0 + _post_share = 0 + self.post_texts.append(_post_text) + self.post_likes.append(_post_like if _post_like else 0) + self.post_shares.append(_post_share if _post_share else 0) + elif len(_post_components) == 2: try: - post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text + _post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text except: - print("Some error occurred while extracting post", c, ". Skipping post...", end='\r') - c += 1 + print("Some error occurred while extracting post", _c, ". Skipping post...", end='\r') + _c += 1 continue - post_likes = post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text - post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text + _post_like = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text + _post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text self.post_texts.append('') - self.post_likes.append(post_likes if post_likes else 0) - self.post_shares.append(post_shares if post_shares else 0) - elif len(post_components) == 1: - post_text = post_components[0].text.split('\n')[0] - post_likes = post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text - post_shares = post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text - self.post_texts.append(post_text) - self.post_likes.append(post_likes if post_likes else 0) - self.post_shares.append(post_shares if post_shares else 0) - if len(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0: - if 'reel' in self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'): - self.video_links.append('https://www.facebook.com' + 
self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href')) + self.post_likes.append(_post_like if _post_like else 0) + self.post_shares.append(_post_share if _post_share else 0) + elif len(_post_components) == 1: + _post_text = _post_components[0].text.split('\n')[0] + _post_like = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text + _post_share = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text + self.post_texts.append(_post_text) + self.post_likes.append(_post_like if _post_like else 0) + self.post_shares.append(_post_share if _post_share else 0) + if len(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0: + if 'reel' in self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'): + self.video_links.append('https://www.facebook.com' + self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href')) else: - self.video_links.append(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href')) + self.video_links.append(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href')) self.is_video.append(True) else: self.is_video.append(False) self.video_links.append('') - c += 1 + _c += 1 self.post_likes = [int(i) if str(i).isdigit() else 0 for i in self.post_likes] self.post_shares = [int(i) if str(i).isdigit() else 0 for i in self.post_shares] diff --git a/dist/MetaDataScraper-1.0.2-py3-none-any.whl b/dist/MetaDataScraper-1.0.2-py3-none-any.whl new file mode 100644 index 0000000..89e0b70 Binary files /dev/null and b/dist/MetaDataScraper-1.0.2-py3-none-any.whl differ diff --git a/dist/metadatascraper-1.0.2.tar.gz b/dist/metadatascraper-1.0.2.tar.gz new file mode 100644 index 0000000..79fed33 Binary files /dev/null and b/dist/metadatascraper-1.0.2.tar.gz differ diff 
--git a/pyproject.toml b/pyproject.toml index 0f90d07..9184099 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "MetaDataScraper" -version = "1.0.1" +version = "1.0.2" authors = [ { name="Ishan Surana", email="ishansurana1234@gmail.com" }, ]
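
Reviewer note: the lists returned by `scrape()` (per the README in this diff) are index-aligned, so post *i*'s text, likes, shares, video flag, and video link all share index *i*. A minimal consumption sketch of that contract, using a mocked `result` dict in place of a real `scrape()` call (the keys and value shapes are taken from the README; the sample values are invented):

```python
# Mocked stand-in for result = LoggedInScraper(page_id, email, password).scrape().
# Sample values below are illustrative only.
result = {
    "followers": "1.2K",
    "post_texts": ["Hello world", ""],
    "post_likes": [10, 0],
    "post_shares": [2, 0],
    "is_video": [False, True],
    "video_links": ["", "https://www.facebook.com/reel/123"],
}

# Zip the parallel lists into one record per post.
posts = [
    {"text": text, "likes": likes, "shares": shares, "video": link if is_vid else None}
    for text, likes, shares, is_vid, link in zip(
        result["post_texts"],
        result["post_likes"],
        result["post_shares"],
        result["is_video"],
        result["video_links"],
    )
]

for i, post in enumerate(posts, start=1):
    print(f"Post {i}: {post['likes']} likes, {post['shares']} shares")
```

Because `__extract_post_details` appends to all five lists on every iteration (including the empty-string placeholders for non-video posts), the `zip` never silently truncates.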