Minor code changes and build files added
ishan-surana committed Jul 2, 2024
1 parent 7b4298b commit 57f73f4
Showing 9 changed files with 162 additions and 60 deletions.
89 changes: 89 additions & 0 deletions MetaDataScraper.egg-info/PKG-INFO
@@ -0,0 +1,89 @@
Metadata-Version: 2.1
Name: MetaDataScraper
Version: 1.0.2
Summary: A module designed to automate the extraction of follower counts and post details from a public Facebook page.
Author-email: Ishan Surana <[email protected]>
Project-URL: Homepage, https://metadatascraper.readthedocs.io/en/latest/
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: selenium==4.1.0
Requires-Dist: webdriver-manager==4.0.1

[![Licence](https://badgen.net/github/license/ishan-surana/MetaDataScraper?color=DC143C)](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) [![Python](https://img.shields.io/badge/python-%3E=3.10-slateblue.svg)](https://www.python.org/downloads/release/python-3119/) [![Wheel](https://img.shields.io/badge/wheel-yes-FF00C9.svg)](https://files.pythonhosted.org/packages/02/80/c53d5e8439361c913e23b6345e85e748a7ac7e82e22cb9f7cd9ec77d5d52/MetaDataScraper-1.0.0-py3-none-any.whl) [![Latest](https://badgen.net/github/release/ishan-surana/MetaDataScraper?label=latest+release&color=green)](https://pypi.org/project/MetaDataScraper/1.0.0/) [![Releases](https://badgen.net/github/releases/ishan-surana/MetaDataScraper?color=orange)](https://github.com/ishan-surana/MetaDataScraper/releases) [![Stars](https://badgen.net/github/stars/ishan-surana/MetaDataScraper?color=yellow)](https://github.com/ishan-surana/MetaDataScraper/stargazers) [![Forks](https://badgen.net/github/forks/ishan-surana/MetaDataScraper?color=dark)](https://github.com/ishan-surana/MetaDataScraper/forks) [![Issues](https://badgen.net/github/issues/ishan-surana/MetaDataScraper?color=800000)](https://github.com/ishan-surana/MetaDataScraper/issues) [![PRs](https://badgen.net/github/prs/ishan-surana/MetaDataScraper?color=C71585)](https://github.com/ishan-surana/MetaDataScraper/pulls) [![Last commit](https://badgen.net/github/last-commit/ishan-surana/MetaDataScraper?color=blue)](https://github.com/ishan-surana/MetaDataScraper/commits/main/) ![Downloads](https://img.shields.io/github/downloads/ishan-surana/MetaDataScraper/total) [![Workflow](https://github.com/ishan-surana/MetaDataScraper/actions/workflows/python-publish.yml/badge.svg)](https://github.com/ishan-surana/MetaDataScraper/blob/main/.github/workflows/python-publish.yml) [![PyPI](https://d25lcipzij17d.cloudfront.net/badge.svg?id=py&r=r&ts=1683906897&type=6e&v=1.0.0&x2=0)](https://pypi.org/project/MetaDataScraper/) [![Maintained](https://img.shields.io/badge/maintained-yes-cyan)](https://github.com/ishan-surana/MetaDataScraper/pulse) [![OS](https://img.shields.io/badge/OS-Windows-FF0000)](https://www.microsoft.com/software-download/windows11) [![Documentation Status](https://readthedocs.org/projects/metadatascraper/badge/?version=latest)](https://metadatascraper.readthedocs.io/en/latest/?badge=latest)

# MetaDataScraper

MetaDataScraper is a Python package that automates the extraction of information such as follower counts, post details, and interactions from a public Facebook page, returning the results as lists. It uses Selenium WebDriver for web automation and scraping.
The module provides two classes: `LoginlessScraper` and `LoggedInScraper`. The `LoginlessScraper` class requires no authentication or API keys, but it cannot access some Facebook pages.
The `LoggedInScraper` class overcomes this drawback by logging in with the credentials of a Facebook account (the user's) before scraping.

## Installation

You can install MetaDataScraper using pip:

```
pip install MetaDataScraper
```

Make sure you have Python 3.10 or later and pip installed.

## Usage

To use MetaDataScraper, follow these steps:

1. Import the `LoginlessScraper` or the `LoggedInScraper` class:

```python
from MetaDataScraper import LoginlessScraper, LoggedInScraper
```

2. Initialize the scraper with the Facebook page ID, using whichever class suits your needs:

```python
page_id = "your_target_page_id"

# Option 1: scrape without logging in (some pages may be inaccessible)
scraper = LoginlessScraper(page_id)

# Option 2: log in with a Facebook account for broader page access
email = "your_facebook_email"
password = "your_facebook_password"
scraper = LoggedInScraper(page_id, email, password)
```

3. Scrape the Facebook page to retrieve information:

```python
result = scraper.scrape()
```

4. Access the scraped data from the result dictionary:

```python
print(f"Followers: {result['followers']}")
print(f"Post Texts: {result['post_texts']}")
print(f"Post Likes: {result['post_likes']}")
print(f"Post Shares: {result['post_shares']}")
print(f"Is Video: {result['is_video']}")
print(f"Video Links: {result['video_links']}")
```
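
The post details are parallel lists with one entry per post, so they can be indexed together. A minimal sketch using only the keys shown above:

```python
for i, text in enumerate(result['post_texts']):
    print(f"Post {i + 1}: {text[:50]}")  # first 50 characters of the post text
    print(f"  Likes: {result['post_likes'][i]}, Shares: {result['post_shares'][i]}")
    if result['is_video'][i]:
        print(f"  Video link: {result['video_links'][i]}")
```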

## Features

- **Automated Extraction**: Automatically fetches follower counts, post texts, likes, shares, and video links from Facebook pages.
- **Comprehensive Data Retrieval**: Retrieves detailed information about each post, including text content, interaction metrics (likes, shares), and multimedia (e.g., video links).
- **Flexible Handling**: Adapts to diverse post structures and various types of multimedia content present on Facebook pages, like post texts or reels.
- **Enhanced Access with Logged-In Scraper**: Overcomes limitations faced by anonymous scraping (loginless) by utilizing Facebook account credentials for broader page access.
- **Headless Operation**: Executes scraping tasks in headless mode, ensuring seamless and non-intrusive data collection without displaying a browser interface.
- **Scalability**: Supports scaling to handle large volumes of data extraction efficiently, suitable for monitoring multiple Facebook pages (see the sketch after this list).
- **Dependency Management**: Utilizes Selenium WebDriver for robust web automation and scraping capabilities, compatible with Python 3.x environments.
- **Ease of Use**: Simplifies the process with straightforward initialization and method calls, facilitating quick integration into existing workflows.
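
To monitor several pages, one straightforward approach is to construct a scraper per page and collect each result. A rough sketch, assuming the hypothetical page IDs below and the `LoginlessScraper` usage shown earlier:

```python
from MetaDataScraper import LoginlessScraper

page_ids = ["page_one", "page_two", "page_three"]  # hypothetical page IDs
results = {}

for page_id in page_ids:
    scraper = LoginlessScraper(page_id)  # one scraper instance per page
    results[page_id] = scraper.scrape()

for page_id, result in results.items():
    print(f"{page_id}: {result['followers']} followers, {len(result['post_texts'])} posts")
```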

## Dependencies

- selenium (4.1.0)
- webdriver-manager (4.0.1)

## License

This project is licensed under the Apache Software License Version 2.0 - see the [LICENSE](https://github.com/ishan-surana/MetaDataScraper/blob/main/LICENCE) file for details.
10 changes: 10 additions & 0 deletions MetaDataScraper.egg-info/SOURCES.txt
@@ -0,0 +1,10 @@
LICENCE
README.md
pyproject.toml
MetaDataScraper/FacebookScraper.py
MetaDataScraper/__init__.py
MetaDataScraper.egg-info/PKG-INFO
MetaDataScraper.egg-info/SOURCES.txt
MetaDataScraper.egg-info/dependency_links.txt
MetaDataScraper.egg-info/requires.txt
MetaDataScraper.egg-info/top_level.txt
1 change: 1 addition & 0 deletions MetaDataScraper.egg-info/dependency_links.txt
@@ -0,0 +1 @@

2 changes: 2 additions & 0 deletions MetaDataScraper.egg-info/requires.txt
@@ -0,0 +1,2 @@
selenium==4.1.0
webdriver-manager==4.0.1
1 change: 1 addition & 0 deletions MetaDataScraper.egg-info/top_level.txt
@@ -0,0 +1 @@
MetaDataScraper
117 changes: 58 additions & 59 deletions MetaDataScraper/FacebookScraper.py
@@ -1,13 +1,12 @@
import time
import logging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import logging
from webdriver_manager.chrome import ChromeDriverManager
logging.getLogger().setLevel(logging.CRITICAL)

class LoginlessScraper:
@@ -471,7 +470,7 @@ def __scroll_to_top(self):

def __get_xpath_constructor(self):
"""Constructs the XPath for locating posts on the Facebook page."""
xpath_return_script = r"""
_xpath_return_script = r"""
var iterator = document.evaluate('.//*[@aria-label="Like"]', document);
var firstelement = iterator.iterateNext()
var firstpost = firstelement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement.parentElement
@@ -509,79 +508,79 @@ def __get_xpath_constructor(self):
}
return xpath_first
"""
xpath_constructor = self.driver.execute_script(xpath_return_script)
split_xpath = xpath_constructor.split('[')
split_index = split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div')
self.xpath_first = '['.join(split_xpath[:split_index])+'['
self.xpath_last = '['+'['.join(split_xpath[split_index+1:])
self.xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div'
if len(self.driver.find_element(By.XPATH, xpath_constructor).find_elements(By.TAG_NAME, 'video')):
self.xpath_last = '/'.join(self.xpath_last.split('/')[:3])
_xpath_constructor = self.driver.execute_script(_xpath_return_script)
_split_xpath = _xpath_constructor.split('[')
_split_index = _split_xpath.index('1]/div/div/div/div/div/div/div/div/div/div/div')
self._xpath_first = '['.join(_split_xpath[:_split_index])+'['
self._xpath_last = '['+'['.join(_split_xpath[_split_index+1:])
self._xpath_identifier_addum = ']/div/div/div/div/div/div/div/div/div/div/div'
if len(self.driver.find_element(By.XPATH, _xpath_constructor).find_elements(By.TAG_NAME, 'video')):
self._xpath_last = '/'.join(self._xpath_last.split('/')[:3])

def __extract_post_details(self):
"""Extracts details of posts including text, likes, shares, and video links."""
c = 1
error_count = 0
_c = 1
_error_count = 0
while True:
xpath = self.xpath_first + str(c) + self.xpath_identifier_addum + self.xpath_last
if not self.driver.find_elements(By.XPATH, xpath):
error_count += 1
if error_count < 3:
print('Error extracting post', c, '\b. Count', error_count,'Retrying extraction...', end='\r')
_xpath = self._xpath_first + str(_c) + self._xpath_identifier_addum + self._xpath_last
if not self.driver.find_elements(By.XPATH, _xpath):
_error_count += 1
if _error_count < 3:
print('Error extracting post', _c, '\b. Count', _error_count,'Retrying extraction...', end='\r')
time.sleep(5)
self.driver.execute_script("window.scrollBy(0, +40);")
continue
break
error_count = 0
_error_count = 0
print(" "*100, end='\r')
print("Extracting data of post", c, end='\r')
self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, xpath)[0])
post_components = self.driver.find_element(By.XPATH, xpath).find_elements(By.XPATH, './*')
if len(post_components) > 2:
post_text = '\n'.join(post_components[2].text.split('\n'))
if post_components[3].text.split('\n')[0] == 'All reactions:':
post_likes = post_components[3].text.split('\n')[1]
if len(post_components[3].text.split('\n')) > 4:
post_shares = post_components[3].text.split('\n')[4].split(' ')[0]
elif len(post_components) > 4 and post_components[4].text.split('\n')[0] == 'All reactions:':
post_likes = post_components[4].text.split('\n')[1]
if len(post_components[4].text.split('\n')) > 4:
post_shares = post_components[4].text.split('\n')[4].split(' ')[0]
print("Extracting data of post", _c, end='\r')
self.driver.execute_script("arguments[0].scrollIntoView();", self.driver.find_elements(By.XPATH, _xpath)[0])
_post_components = self.driver.find_element(By.XPATH, _xpath).find_elements(By.XPATH, './*')
if len(_post_components) > 2:
_post_text = '\n'.join(_post_components[2].text.split('\n'))
if _post_components[3].text.split('\n')[0] == 'All reactions:':
_post_like = _post_components[3].text.split('\n')[1]
if len(_post_components[3].text.split('\n')) > 4:
_post_share = _post_components[3].text.split('\n')[4].split(' ')[0]
elif len(_post_components) > 4 and _post_components[4].text.split('\n')[0] == 'All reactions:':
_post_like = _post_components[4].text.split('\n')[1]
if len(_post_components[4].text.split('\n')) > 4:
_post_share = _post_components[4].text.split('\n')[4].split(' ')[0]
else:
post_likes = 0
post_shares = 0
self.post_texts.append(post_text)
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
elif len(post_components) == 2:
_post_like = 0
_post_share = 0
self.post_texts.append(_post_text)
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
elif len(_post_components) == 2:
try:
post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
_post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
except:
print("Some error occurred while extracting post", c, ". Skipping post...", end='\r')
c += 1
print("Some error occurred while extracting post", _c, ". Skipping post...", end='\r')
_c += 1
continue
post_likes = post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text
post_shares = post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
_post_like = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Like"]').text
_post_share = _post_components[1].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append('')
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
elif len(post_components) == 1:
post_text = post_components[0].text.split('\n')[0]
post_likes = post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text
post_shares = post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append(post_text)
self.post_likes.append(post_likes if post_likes else 0)
self.post_shares.append(post_shares if post_shares else 0)
if len(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0:
if 'reel' in self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'):
self.video_links.append('https://www.facebook.com' + self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'))
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
elif len(_post_components) == 1:
_post_text = _post_components[0].text.split('\n')[0]
_post_like = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Like"]').text
_post_share = _post_components[0].find_element(By.XPATH, './/*[@aria-label="Share"]').text
self.post_texts.append(_post_text)
self.post_likes.append(_post_like if _post_like else 0)
self.post_shares.append(_post_share if _post_share else 0)
if len(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'video')) > 0:
if 'reel' in self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'):
self.video_links.append('https://www.facebook.com' + self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[0].get_attribute('href'))
else:
self.video_links.append(self.driver.find_elements(By.XPATH, xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href'))
self.video_links.append(self.driver.find_elements(By.XPATH, _xpath)[0].find_elements(By.TAG_NAME, 'a')[4].get_attribute('href'))
self.is_video.append(True)
else:
self.is_video.append(False)
self.video_links.append('')
c += 1
_c += 1

self.post_likes = [int(i) if str(i).isdigit() else 0 for i in self.post_likes]
self.post_shares = [int(i) if str(i).isdigit() else 0 for i in self.post_shares]
Binary file added dist/MetaDataScraper-1.0.2-py3-none-any.whl
Binary file not shown.
Binary file added dist/metadatascraper-1.0.2.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "MetaDataScraper"
version = "1.0.1"
version = "1.0.2"
authors = [
{ name="Ishan Surana", email="[email protected]" },
]
