Adding WebSites

A WebSite object corresponds to one or more WebPage objects. The WebSite corresponding to a given WebPage can be found as follows:

from sefaria.model.webpage import WebPage, WebSite

# webpage is an existing WebPage object
domain = WebPage.domain_for_url(webpage.url)
site_data = WebPage.site_data_for_domain(domain)
website = WebSite().load(site_data)
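
To make the lookup above more concrete, here is a minimal standalone sketch of how a URL might be mapped to a site record through its domains list. It does not reproduce Sefaria's code: the helper names ending in `_sketch`, the in-memory `SITES` list, the example URL, and the choice to treat subdomains as matches are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical in-memory stand-in for the WebSite records in the database.
SITES = [
    {
        "name": "Torah In Motion",
        "domains": ["torahinmotion.org", "torahinmotionorg.e.civicrm.ca"],
    },
]


def domain_for_url_sketch(url):
    """Return the host part of a URL (assumed behavior, not Sefaria's code)."""
    return urlparse(url).netloc


def site_data_for_domain_sketch(domain, sites=SITES):
    """Return the first site record whose domains list covers the domain."""
    for site in sites:
        for site_domain in site["domains"]:
            # Assumption: a subdomain of a listed domain also counts as a match.
            if domain == site_domain or domain.endswith("." + site_domain):
                return site
    return None


domain = domain_for_url_sketch("https://www.torahinmotion.org/some/page")
site = site_data_for_domain_sketch(domain)
print(site["name"] if site else None)
```

For the real behavior, consult `domain_for_url` and `site_data_for_domain` in `sefaria/model/webpage.py`.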

A WebSite requires name, domains, and is_whitelisted. It can also have the following optional attributes: bad_urls, normalization_rules, title_branding, initial_title_branding, exclude_from_tracking, and whitelist_selectors. exclude_from_tracking and whitelist_selectors are only relevant for linker v1 and v2.

name is what displays in the WebPages sidebar. domains is a list of all domains corresponding to the WebSite with the specified name. is_whitelisted must be set to True in order for the WebSite's WebPages to appear in the Sefaria sidebar. bad_urls is a list of regular expressions specifying URLs that match one of the domains but that we nevertheless don't want to save in our database or show in the sidebar.

To understand normalization_rules, see normalize_url() in sefaria/model/webpage.py. In normalize_url(), the URL of an incoming WebPage is normalized by global rules that apply to all incoming WebPages, and it can be further normalized by any additional rules listed in the WebSite object's normalization_rules.

When WebPage data is received by the server, the incoming dictionary has a title field. title_branding and initial_title_branding are used to normalize this title. See clean_title in sefaria/model/webpage.py to understand how they are applied.
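
As a rough illustration of what title normalization could look like, the sketch below strips a site's branding from the start or end of a page title. This is not the actual `clean_title` code. The function name `strip_branding`, the separator list, the sample page title, and the reading of initial_title_branding as "the branding appears at the start of the title" are all assumptions.

```python
def strip_branding(title, site_name, title_branding=(), initial_title_branding=False):
    """Remove a site's branding from a page title (illustrative sketch only)."""
    # Assumption: the site name itself is treated as a brand string too.
    brands = [site_name, *title_branding]
    separators = ["-", "|", ":"]  # hypothetical separator set
    for brand in brands:
        for sep in separators:
            if initial_title_branding:
                # Branding assumed to lead the title, e.g. "BRAND | Page".
                prefix = f"{brand} {sep} "
                if title.startswith(prefix):
                    return title[len(prefix):]
            else:
                # Branding assumed to trail the title, e.g. "Page - BRAND".
                suffix = f" {sep} {brand}"
                if title.endswith(suffix):
                    return title[:-len(suffix)]
    return title


print(strip_branding(
    "TORAH IN MOTION | Parshat Noach",  # made-up page title
    "Torah In Motion",
    title_branding=["TORAH IN MOTION"],
    initial_title_branding=True,
))
```

The matching here is case-sensitive, which is one reason a site might list an extra spelling such as "TORAH IN MOTION" in title_branding. Whether the real implementation behaves this way should be checked against `clean_title`.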

ChatGPT table

Attribute Name	Is Required	Description
name	Yes	Displays in the WebPages sidebar.
domains	Yes	List of all domains corresponding to the WebSite with the specified name.
is_whitelisted	Yes	Must be set to True in order for the WebSite's WebPages to appear in the Sefaria sidebar.
bad_urls	No	List of regular expressions matching URLs that we don't want to save in our database or show in the sidebar.
normalization_rules	No	See `normalize_url()` in `sefaria/model/webpage.py`. In `normalize_url()`, the URL of an incoming WebPage is normalized based on global rules that are applied to all incoming WebPages, and it can be normalized by additional rules if they are specified in the WebSite object's `normalization_rules` list.
title_branding	No	Used for normalizing the title field when WebPage data is received by the server.
initial_title_branding	No	Used for normalizing the title field when WebPage data is received by the server.
exclude_from_tracking	No	Only relevant for linker v1 and v2.
whitelist_selectors	No	Only relevant for linker v3. List of CSS selectors that should be included in the page content when searching for citations. This should be used when you see some parts of the page are not included by default.
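
The normalization_rules row above describes global rules combined with per-site rules. A minimal sketch of that idea follows, assuming each rule name maps to a small URL-rewriting function. Only the rule name "remove www" comes from this page. The "remove hash" rule, the `GLOBAL_RULES` list, the function names, and the order of application (global rules first) are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_www(url):
    """Drop a leading 'www.' from the host."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    return urlunsplit(parts._replace(netloc=netloc))


def remove_hash(url):
    """Drop the '#fragment' part of the URL."""
    return urlunsplit(urlsplit(url)._replace(fragment=""))


# Rule name -> function. "remove hash" is an illustrative, assumed rule.
RULES = {"remove www": remove_www, "remove hash": remove_hash}

# Hypothetical set of rules applied to every incoming URL.
GLOBAL_RULES = ["remove hash"]


def normalize_sketch(url, site_rules=()):
    """Apply the global rules, then any rules listed for the site."""
    for rule_name in [*GLOBAL_RULES, *site_rules]:
        url = RULES[rule_name](url)
    return url


print(normalize_sketch("https://www.torahinmotion.org/page#top", ["remove www"]))
```

The set of rule names that Sefaria actually supports is defined in `normalize_url()`, so check there before adding a rule to a WebSite.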

Here is an example of a WebSite in the database:

{
    "name" : "Torah In Motion",
    "domains" : [
        "torahinmotion.org",
        "torahinmotionorg.e.civicrm.ca"
    ],
    "is_whitelisted" : true,
    "bad_urls" : [
        "torahinmotionorg\\.e\\.civicrm\\.ca\\/store"
    ],
    "normalization_rules" : [
        "remove www"
    ],
    "title_branding" : [
        "TORAH IN MOTION"
    ],
    "initial_title_branding" : true
}
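
To see how the bad_urls entry in this example would exclude a URL, here is a small sketch that tests URLs against the pattern. The doubled backslashes in the record are JSON escaping, so the regular expression itself is `torahinmotionorg\.e\.civicrm\.ca\/store`. The helper name `is_bad_url`, the sample URLs, and the use of `re.search` (a match anywhere in the URL) are assumptions, not necessarily how Sefaria performs the check.

```python
import re

# The pattern from the example record, written as a Python raw string.
bad_urls = [r"torahinmotionorg\.e\.civicrm\.ca\/store"]


def is_bad_url(url, patterns=bad_urls):
    """Return True if any bad_urls pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Made-up URLs for illustration.
print(is_bad_url("https://torahinmotionorg.e.civicrm.ca/store/item"))
print(is_bad_url("https://torahinmotion.org/article"))
```

Trying new patterns this way against a few known URLs before saving them to a WebSite helps avoid excluding pages by accident.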

To add a WebSite in the CLI:

from sefaria.model.webpage import WebSite

w = WebSite()
w.name = "Torah In Motion"  # required attribute
w.domains = ["torahinmotion.org", "torahinmotionorg.e.civicrm.ca"]  # required attribute
w.is_whitelisted = True  # required attribute
w.save()  # writes the new WebSite to the database
