Cellar GitHub workflows #10

Closed
wants to merge 70 commits into from
5fc8012
Update github-actions.yml
venvis May 1, 2024
78155aa
Update github-actions.yml
venvis May 1, 2024
20f4804
Update github-actions.yml
venvis May 1, 2024
85e8f38
Update github-actions.yml
venvis May 1, 2024
0bff0fa
Update github-actions.yml
venvis May 1, 2024
fd8625d
Update github-actions.yml
venvis May 1, 2024
99a15dd
Update github-actions.yml
venvis May 1, 2024
8b55bbc
Change lxml to html.parser
venvis May 8, 2024
19fa708
Change lxml to html.parser
venvis May 8, 2024
8e2da71
Remove lxml dependency
venvis May 8, 2024
fd75fa8
Trial with 3.10
venvis May 8, 2024
62f0ff3
Trial with 3.12
venvis May 8, 2024
1778293
Add badge for passing tests
venvis May 8, 2024
e030547
install_requires version checks
venvis May 8, 2024
b0170dd
Include Pypi automated publishing
venvis May 8, 2024
c02112a
Include pypi dependencies
venvis May 8, 2024
817c467
Modified github actions to work on pypi, testpypi, and do github-release
May 15, 2024
13b648f
Removed if condition
May 15, 2024
a7765eb
Changed path of setup.py file
May 15, 2024
398eec7
Removed Ruff Linting. Code needs to be properly linted separately
May 15, 2024
f754ac1
Installed setuptools in githubactions
May 15, 2024
b9658f7
Updated the package installation
shashankmc May 22, 2024
81b5a0e
Modified comment and provided input dir for cibuildwheels
shashankmc May 22, 2024
70b266e
Removed input_dir argument with package_dir for cibuildwheels
shashankmc May 22, 2024
7c29fd6
Attempt at fixing the setup.py file location
shashankmc May 22, 2024
39db7a6
Create pyproject.toml
shashankmc May 22, 2024
fff9486
Updated with pyproject.toml file as config file for build
shashankmc May 22, 2024
8d9dd15
Update dependencies for cellar extractor
venvis May 22, 2024
99ad9d3
Removed lxml dependency from toml file
shashankmc May 22, 2024
eeb79d9
Updated path of toml file
shashankmc May 22, 2024
817e856
Moved pyproject to root and removed rechtspraak directory
May 22, 2024
c746af6
Additional details in toml file
shashankmc May 22, 2024
0ab9367
Removed redundant readme
shashankmc May 22, 2024
a2a5be0
Removed dynamic version
shashankmc May 22, 2024
115e75b
Added pypacibuildwheels to cater to multiple platforms
May 23, 2024
408908e
Removed various pythoin versions
shashankmc Jun 5, 2024
f7b4bf7
Added cibuild options for build wheels
shashankmc Jun 6, 2024
4449c16
Removed cibuildwheel config and moved it to workflow file
shashankmc Jun 6, 2024
447f17c
Added cibuildwheel config referencing matplotlib repo
shashankmc Jun 6, 2024
7f343d1
Added build install
shashankmc Jun 6, 2024
99a335c
correct setup.py path in build sdist
shashankmc Jun 6, 2024
517fce4
Path correction of distribution tar file for cibuildwheel
shashankmc Jun 6, 2024
711a1cc
Path correction of distribution files when created
shashankmc Jun 6, 2024
5b902a6
variable name SDIST_NAME update
shashankmc Jun 6, 2024
13ea215
Trying to find the right path of tar file
shashankmc Jun 6, 2024
3057066
Change cbuildhweel to buildwheel
venvis Jul 2, 2024
799bc31
Remove cbuild
venvis Jul 2, 2024
fa69709
Remove cbuild
venvis Jul 2, 2024
8e6d984
Providing the right path for setup.py file during build
shashankmc Jul 4, 2024
f0ce7a1
Removed build for multiple platforms as it was causing duplicate file…
shashankmc Jul 4, 2024
65358c0
Update version to 1.1.0
venvis Jul 5, 2024
5d57d17
Update version to 1.1.0
venvis Jul 5, 2024
0c55355
Update version to 1.1.0
venvis Jul 5, 2024
4f58203
Update readme.md location
venvis Jul 5, 2024
3a431b9
Update versions to 1.1.1
venvis Jul 5, 2024
32ee28f
Update versions to 1.1.1
venvis Jul 5, 2024
92b5067
Update versions to 1.1.2
venvis Jul 6, 2024
46b5175
Update versions to 1.1.2
venvis Jul 6, 2024
4454747
Update versions to 1.1.1
venvis Jul 6, 2024
31ca046
Update versions to 1.1.1
venvis Jul 6, 2024
66c5ad1
Update to version 1.1.1
venvis Jul 6, 2024
a9ea1d6
Update setup.py
venvis Jul 6, 2024
44495b5
Update versions to 1.1.2
venvis Jul 8, 2024
05c6f82
Update versions to 1.1.2
venvis Jul 8, 2024
803dfa8
Update versions to 1.1.2
venvis Jul 8, 2024
a89bc3f
Update versions to 1.1.2
venvis Jul 8, 2024
e4814ee
Update versions to 1.1.3
venvis Jul 8, 2024
c87613f
Update README.md
venvis Jul 8, 2024
809458e
Update versions to 1.1.3
venvis Jul 8, 2024
f573a19
Update versions to 1.1.3
venvis Jul 8, 2024
155 changes: 132 additions & 23 deletions .github/workflows/github-actions.yml
@@ -1,32 +1,141 @@
-name: Extraction Libraries
-run-name: ${{ github.actor }} is testing out Extraction Libraries using GitHub Actions 🚀
-on: [push]
+name: Build, Test, Lint & Upload to TestPypi and Pypi for Cellar_Extractor
+on:
+  push:
+    branches: [ cellar ]
+  pull_request:
+    branches: [ cellar ]
+
 jobs:
-  Explore-Extraction-Libraries:
-    runs-on: ubuntu-latest
+  test:
+    name: Test on ${{ matrix.os }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: ['3.9', '3.10', '3.11', '3.12']
+
     steps:
-      - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."
-      - run: echo "🐧 This job is now running on a ${{ runner.os }} server hosted by GitHub!"
-      - run: echo "🔎 The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}."
-      - name: Check out repository code
-        uses: actions/checkout@v3
-      - name: Set up Python 3.9
-        uses: actions/setup-python@v4
+      - name: Check out the repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: ${{ matrix.python-version }}
+
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install -e cellar/
-          # pip install echr-extractor
-      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
-      - run: echo "🖥️ The workflow is now ready to test your code on the runner."
-      - name: List files in the repository
+          pip install setuptools wheel
+          pip install -r requirements.txt
+
+      - name: Install package for testing
         run: |
-          ls ${{ github.workspace }}
-      - run: echo "🍏 This job's status is ${{ job.status }}."
-      - name: Test with pytest
+          pip install -e cellar/
+
+      - name: Run tests with pytest
         run: |
-          pip install pytest
-          pip install pytest-cov
+          pip install pytest pytest-cov
           pytest tests.py --doctest-modules --junitxml=junit/test-results.xml --cov=com --cov-report=xml --cov-report=html
+
+  build:
+    needs: test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          python-version: '3.9'
+      - run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel
+      - run: python cellar/setup.py sdist bdist_wheel
+      - uses: actions/upload-artifact@v4
+        with:
+          name: universal-wheels
+          path: |
+            dist/*.whl
+            dist/*.tar.gz
+          if-no-files-found: error
+
+  testpypi-publish:
+    name: Publish to TestPyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: testpypi
+      url: https://test.pypi.org/project/cellar-extractor/
+    permissions:
+      id-token: write
+
+    steps:
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: dist
+
+      - name: Publish distribution to TestPyPi
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          repository-url: https://test.pypi.org/legacy/
+          packages-dir: dist/*
+
+  pypi-publish:
+    name: Publish to PyPI
+    needs:
+      - testpypi-publish
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/project/cellar-extractor/
+    permissions:
+      id-token: write
+
+    steps:
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: dist
+
+      - name: Publish distribution to PyPi
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages_dir: dist/*/
+
+  github-release:
+    name: Sign the Python distribution with Sigstore and upload them to GitHub Releases
+    needs:
+      - pypi-publish
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write
+      contents: write
+
+    steps:
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: dist
+
+      - name: Sign the Python distribution with Sigstore
+        uses: sigstore/[email protected]
+        with:
+          inputs: >-
+            ./dist/**/*.whl
+            ./dist/**/*.tar.gz
+
+      - name: Create Github release
+        env:
+          GITHUB_TOKEN: ${{ github.token }}
+        run: >-
+          gh release create
+          '${{ github.ref_name }}'
+          --repo '${{ github.repository }}'
+          --notes ""
+
+      - name: Upload artifact signatures to Github release
+        env:
+          GITHUB_TOKEN: ${{ github.token }}
+        run: >-
+          gh release upload
+          '${{ github.ref_name }}' ./dist/**/*
+          --repo '${{ github.repository }}'
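
The build job uploads `dist/*.whl` and `dist/*.tar.gz` as one artifact named `universal-wheels`, while the later jobs download into `dist` and then refer to `./dist/**/*.whl`. A minimal Python sketch of why the two patterns differ, assuming that `actions/download-artifact@v4` without a `name` places each artifact in a subdirectory named after it (that layout is an assumption here, and the file names are hypothetical):

```python
import glob
import os
import tempfile


def make_downloaded_layout(base):
    """Create base/dist/universal-wheels/<files> and return the dist path.

    Assumption: this mirrors the layout left behind by the download step.
    """
    dist = os.path.join(base, "dist")
    artifact_dir = os.path.join(dist, "universal-wheels")
    os.makedirs(artifact_dir)
    # Hypothetical file names, for illustration only.
    for name in ("cellar_extractor-1.1.3-py3-none-any.whl",
                 "cellar_extractor-1.1.3.tar.gz"):
        with open(os.path.join(artifact_dir, name), "w") as handle:
            handle.write("")
    return dist


def find_distributions(dist, recursive):
    """Return wheel and sdist paths under dist, relative to dist, sorted."""
    if recursive:
        patterns = ["**/*.whl", "**/*.tar.gz"]  # like ./dist/**/*.whl
    else:
        patterns = ["*.whl", "*.tar.gz"]        # like dist/*.whl
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(dist, pattern), recursive=recursive))
    return sorted(os.path.relpath(p, dist).replace(os.sep, "/") for p in found)


with tempfile.TemporaryDirectory() as tmp:
    dist_dir = make_downloaded_layout(tmp)
    print(find_distributions(dist_dir, recursive=False))
    print(find_distributions(dist_dir, recursive=True))
```

Under that assumption the flat pattern finds nothing after the download, so any step that expects the files directly inside `dist` would need the nested path instead.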
189 changes: 186 additions & 3 deletions README.md
@@ -1,3 +1,186 @@
# extraction_libraries
Python library for extracting caselaw data from Cellar.
Full documentation available at [cellar-extractor](https://pypi.org/project/cellar-extractor/).
## Cellar extractor
This library contains functions and classes to get cellar case law data from eurlex.

## Version
Python 3.9 onwards *

## Tests
![Workflow Status](https://github.com/maastrichtlawtech/extraction_libraries/actions/workflows/github-actions.yml/badge.svg)


## Contributors

<!-- readme: contributors,gijsvd -start -->
<table>
<tr>
<td align="center">
<a href="https://github.com/pranavnbapat">
<img src="https://avatars.githubusercontent.com/u/7271334?v=4" width="100;" alt="pranavnbapat"/>
<br />
<sub><b>Pranav Bapat</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/Cloud956">
<img src="https://avatars.githubusercontent.com/u/24865274?v=4" width="100;" alt="Cloud956"/>
<br />
<sub><b>Piotr Lewandowski</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/shashankmc">
<img src="https://avatars.githubusercontent.com/u/3445114?v=4" width="100;" alt="shashankmc"/>
<br />
<sub><b>shashankmc</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/gijsvd">
<img src="https://avatars.githubusercontent.com/u/31765316?v=4" width="100;" alt="gijsvd"/>
<br />
<sub><b>gijsvd</b></sub>
</a>
</td>
<td align="center">
<a href="https://github.com/venvis">
<img src="https://avatars.githubusercontent.com/venvis" width="100;" alt="venvis"/>
<br />
<sub><b>venvis</b></sub>
</a>
</td>
</tr>
</table>
<!-- readme: contributors,gijsvd -end -->

## How to install?
<code>pip install cellar-extractor</code>

## What are the functions?
<ol>
<li><code>get_cellar</code></li>
Gets all the ECLI data from the eurlex SPARQL endpoint and returns it in CSV or JSON format, either in memory or as a saved file.
<br>
<li><code>get_cellar_extra</code></li>
Gets all the ECLI data from the eurlex SPARQL endpoint and, on top of that, scrapes the eurlex websites to acquire
the full text, keywords, case law directory code and eurovoc identifiers. Users who have an eurlex account with access to the eurlex webservices can also
pass their webservices login credentials to the method in order to extract data about works citing the work and works
cited by the work. The full text is returned as a JSON file and the rest of the data as a CSV. Both can be returned in memory or as saved files.
<li><code>get_nodes_and_edges_lists</code></li>
Gets two list objects, one for the nodes and one for the edges of the citations within the passed dataframe.
Allows the creation of a network graph of the citations. Can only be returned in memory.
<li><code>filter_subject_matter</code></li>
Returns a dataframe containing only the cases whose subject matter column contains a given phrase.
<li><code>Analyzer</code></li>
A class whose instance, when called, returns a list of all the text contained within the operative part of each Court of Justice of the European Union (CJEU, formerly known as the European Court of Justice (ECJ)) judgement (English only).
<li><code>Writing</code></li>
A class which writes the text of the operative part of each European case law case (English only) into csv, json and txt files (generated upon initialization).<br>
The <code>Writing</code> class has three functions:<br><br>
<ul>
<li><code>to_csv()</code> - Writes the operative part along with celex id into a csv file</li>
<li><code>to_json()</code> - Writes the operative part along with celex id into a json file</li>
<li><code>to_txt()</code> - Writes the operative part along with celex id into a txt file</li>
</ul>
<br>
</ol>
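
`filter_subject_matter` keeps only the cases whose subject matter contains a phrase, and the parameter notes describe the match as case insensitive. A minimal sketch of that kind of filter on plain dictionaries rather than a DataFrame; the function name `filter_by_phrase` and the column name `subject_matter` are hypothetical and are not taken from the library:

```python
def filter_by_phrase(rows, phrase, column="subject_matter"):
    """Keep rows whose `column` contains `phrase`, ignoring case.

    `rows` is a list of dicts standing in for DataFrame rows (an assumption
    made so the sketch needs no third-party packages).
    """
    needle = phrase.casefold()
    return [row for row in rows
            if needle in str(row.get(column, "")).casefold()]


cases = [
    {"celex": "case-1", "subject_matter": "Free movement of goods"},
    {"celex": "case-2", "subject_matter": "Competition"},
]
print(filter_by_phrase(cases, "FREE MOVEMENT"))
```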

## What are the parameters?
<ol>
<li><code>get_cellar</code></li>
<strong>Parameters:</strong>
<ul>
<li><strong>max_ecli: int, optional, default 100</strong></li>
Maximum number of ECLIs to retrieve.
<li><strong>sd: date, optional, default '2022-05-01'</strong></li>
Start of the last-modification date range (yyyy-mm-dd).
<li><strong>ed: date, optional, default current date</strong></li>
End of the last-modification date range (yyyy-mm-dd).
<li><strong>save_file: ['y', 'n'], optional, default 'y'</strong></li>
Save data in a data folder, or return in-memory.
<li><strong>file_format: ['csv', 'json'], optional, default 'csv'</strong></li>
Returns the data as a JSON/dictionary, or as a CSV/Pandas Dataframe object.
</ul>
<li><code>get_cellar_extra</code></li>
<ul>
<li><strong>max_ecli: int, optional, default 100</strong></li>
Maximum number of ECLIs to retrieve.
<li><strong>sd: date, optional, default '2022-05-01'</strong></li>
Start of the last-modification date range (yyyy-mm-dd).
<li><strong>ed: date, optional, default current date</strong></li>
End of the last-modification date range (yyyy-mm-dd).
<li><strong>save_file: ['y', 'n'], optional, default 'y'</strong></li>
Save the full text of cases as a JSON file and the rest of the data as a CSV file, or return them
in memory as a dictionary and a Pandas Dataframe object.
<li><strong>threads: int, optional, default 10</strong></li>
Extracting the additional data takes a lot of time. The use of multi-threading can cut down this time.
Even with this, the method may take a couple of minutes for a couple of hundred cases. A maximum of
10 threads is recommended, as this method may also affect the device's internet connection.
<li><strong>username: string, optional, default empty string</strong></li>
The username to the eurlex webservices.
<li><strong>password: string, optional, default empty string</strong></li>
The password to the eurlex webservices.
<br>
</ul>
<li><code>get_nodes_and_edges_lists</code></li>
<ul>
<li><strong>df: DataFrame object, required, default None</strong></li>
DataFrame of cellar metadata acquired from the get_cellar_extra method with eurlex webservice credentials passed.
This method will only work on dataframes with citations data.
<li><strong>only_local: boolean, optional, default False</strong></li>
Flag for nodes and edges generation. If set to True, the network created will only include nodes and edges between
cases exclusively inside the given dataframe.
</ul>
<li><code>filter_subject_matter</code></li>
<ul>
<li><strong>df: DataFrame object, required, default None</strong></li>
DataFrame of cellar metadata acquired from any of the cellar extraction methods listed above.
<li><strong>phrase: string, required, default None</strong></li>
The phrase which has to be present in the subject matter of cases. Case insensitive.
</ul>
<li><code>Analyzer</code></li>
<ul>
<li><strong>celex id: str, required</strong></li>
<li>Passed to the constructor when initializing the class.</li>
</ul>
<li><code>Writing</code></li>
<ul>
<li><strong>celex id: str, required</strong></li>
<li>Passed to the constructor when initializing the class.</li>
</ul>

</ol>
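
The `only_local` flag of `get_nodes_and_edges_lists` decides whether cited cases outside the given dataframe appear in the network. A minimal sketch of that idea, assuming the citation data can be reduced to a mapping from each case to the cases it cites; the function name, the input shape and the format of the returned lists are hypothetical and may differ from the library:

```python
def build_nodes_and_edges(citations, only_local=False):
    """Build node and edge lists from a {case_id: [cited_case_id, ...]} dict.

    With only_local=True, edges that point to a case which is not itself a
    key of `citations` are dropped, so the network stays inside the dataset.
    """
    local = set(citations)
    nodes = set(local)
    edges = []
    for source, targets in citations.items():
        for target in targets:
            if only_local and target not in local:
                continue  # cited case is outside the given data
            nodes.add(target)
            edges.append((source, target))
    return sorted(nodes), edges


sample = {"A": ["B", "X"], "B": []}
print(build_nodes_and_edges(sample))
print(build_nodes_and_edges(sample, only_local=True))
```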


## Examples
```python
import cellar_extractor as cell

# Below are examples for in-file saving:

cell.get_cellar(save_file='y', max_ecli=200, sd='2022-01-01', file_format='csv')
cell.get_cellar_extra(max_ecli=100, sd='2022-01-01', threads=10)

# Below are examples for in-memory saving:

df = cell.get_cellar(save_file='n', file_format='csv', sd='2022-01-01', max_ecli=1000)
df, full_text = cell.get_cellar_extra(save_file='n', max_ecli=100, sd='2022-01-01', threads=10)
```
<p>Call the initialized instance of the class; it returns a list as its value.</p>

```python
import cellar_extractor as cell

celex_id = "<CELEX id>"  # placeholder: replace with the CELEX id of a judgement
instance = cell.Analyzer(celex_id)
output_list = instance()
print(output_list)  # prints the operative part of the case as a list
```
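
Calling an instance like a function works in Python when the class defines `__call__`. A minimal sketch of that pattern with a hypothetical stand-in class; `OperativePartStub` is not part of the library and does no network access, it only illustrates why `instance()` can return a list:

```python
class OperativePartStub:
    """Hypothetical stand-in that mimics a callable instance."""

    def __init__(self, celex_id):
        # The id is stored when the instance is created.
        self.celex_id = celex_id

    def __call__(self):
        # A real implementation would fetch and parse the judgement here.
        return [f"operative part of {self.celex_id}"]


instance = OperativePartStub("<CELEX id>")
output_list = instance()  # invokes __call__
print(output_list)
```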


<p>The Writing class also takes a celex id through its constructor upon initialization and writes the content of the operative part into different files, depending on the function called.</p>

```python
import cellar_extractor as cell

celex_id = "<CELEX id>"  # placeholder: replace with the CELEX id of a judgement
instance = cell.Writing(celex_id)
output = instance.to_csv()   # for csv
output = instance.to_txt()   # for txt
output = instance.to_json()  # for json
```
6 changes: 5 additions & 1 deletion cellar/README.md
@@ -2,7 +2,11 @@
 This library contains two functions to get cellar case law data from eurlex.

 ## Version
-Python 3.9
+Python 3.9 onwards *

+## Tests
+![Workflow Status](https://github.com/maastrichtlawtech/extraction_libraries/actions/workflows/github-actions.yml/badge.svg)
+
+
 ## Contributors

2 changes: 1 addition & 1 deletion cellar/cellar_extractor/json_to_csv.py
@@ -76,7 +76,7 @@ def json_to_csv(json_data):
     # Making commas as the only value separator in the dataset
     value = re.sub(r",", ";", str(value))
     # Remove HTML tags
-    value = BeautifulSoup(value, "lxml").text
+    value = BeautifulSoup(value, "html.parser").text

     for j in [j for j, x in enumerate(COLS) if x == title]:
         data[j] = value
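
The hunk above swaps the `lxml` backend for Python's built-in `html.parser`, so tag stripping no longer needs the extra dependency. A minimal sketch of the same two cleaning steps using only the standard library; it uses `html.parser.HTMLParser` directly instead of BeautifulSoup, and the names `TextCollector` and `clean_value` are hypothetical:

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # character references are converted by default
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_value(value):
    """Replace commas with semicolons, then drop HTML tags."""
    # Same order as in the hunk: commas first, tags second.
    value = re.sub(r",", ";", str(value))
    collector = TextCollector()
    collector.feed(value)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)


print(clean_value("<p>Free movement, <b>goods</b></p>"))
```

This is a stand-in for illustration; BeautifulSoup's `.text` may treat edge cases such as malformed markup differently.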