
Commit 85af8a3

README update for motifs, can't get them to work in local... because I was editing the heracles file?
1 parent 9fffa9a commit 85af8a3

File tree: 2 files changed (+40, -8 lines)


README.md

+34-1
@@ -74,10 +74,43 @@ Python 3.10.11 ('graphml') /opt/anaconda3/envs/graphml/bin/python
 
 Note: the Python version is set to `3.10.11` because Jupyter Stacks have not been updated more recently.
 
-## Network Motifs
+## Knowledge Graph Construction in PySpark
+
+We build a knowledge graph from the [Stack Exchange Archive](https://archive.org/details/stackexchange) for the network motif section of the course.
+
+### Docker Exec Commands
+
+To run a bash shell in the Jupyter container, type:
+
+```bash
+docker exec -it jupyter bash
+```
+
+Once you're there, you can run the following commands to download and prepare the data for the course.
+
+First, download the data:
+
+```bash
+graphml_class/stats/download.py stats.meta
+```
+
+Then you will need to convert the data from XML to Parquet:
+
+```bash
+spark-submit --packages "com.databricks:spark-xml_2.12:0.18.0" graphml_class/stats/xml_to_parquet.py
+```
+
+The course covers knowledge graph construction in PySpark in [graphml_class/stats/graph.py](graphml_class/stats/graph.py):
+
+```bash
+spark-submit graphml_class/stats/graph.py
+```
+
+## Network Motifs with GraphFrames
 
 This course now covers [network motifs](https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/network-motif#:~:text=A%20network%20motif%20is%20a,multiple%20times%20within%20a%20network.) in property graphs (frequent patterns of structure) using PySpark / [GraphFrames](https://graphframes.github.io/graphframes/docs/_site/index.html) (see [motif.py](https://github.com/Graphlet-AI/graphml-class/blob/main/graphml_class/stats/motif.py), no notebook yet).
 It supports directed motifs, not undirected ones. All the 4-node motifs are outlined below. Note that GraphFrames can also filter the
 paths returned by its `f.find()` method using any Spark `DataFrame` filter, enabling temporal and complex property graph motifs.
 
 <center><img src="images/illustration-of-directed-graphlets-a-The-40-two-to-four-node-directed-graphlets-G0.png" alt="All 4-node directed network motifs"></center>
+
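The README paragraph above notes that motif matches returned by `f.find()` can be filtered with ordinary Spark `DataFrame` predicates. The following is a minimal, illustrative sketch of that idea on a toy property graph, not the code in motif.py; the GraphFrames package coordinate, the vertex/edge columns, and the temporal filter are all assumptions for the example.

```python
# Minimal sketch of GraphFrames motif finding with a DataFrame filter.
# Toy data only - the real graph, columns, and patterns live in motif.py.
from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = (
    SparkSession.builder.appName("motif-sketch")
    # Assumed package coordinate; match it to your Spark/Scala version,
    # or pass it via `spark-submit --packages ...` instead.
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
    .getOrCreate()
)

# GraphFrames expects an `id` column on vertices and `src`/`dst` on edges.
vertices = spark.createDataFrame(
    [("q1", "Question"), ("a1", "Answer"), ("u1", "User")],
    ["id", "Type"],
)
edges = spark.createDataFrame(
    [("u1", "q1", "Asked", 1), ("u1", "a1", "Posted", 2), ("a1", "q1", "Answers", 3)],
    ["src", "dst", "relationship", "Timestamp"],
)
g = GraphFrame(vertices, edges)

# A directed 3-node motif: each named element becomes a struct column
# in the result DataFrame (a, e1, b, e2, c).
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")

# Any DataFrame filter applies to the matches - here a temporal constraint
# that the first edge precedes the second, plus a vertex property constraint.
filtered = paths.filter(
    (F.col("e1.Timestamp") < F.col("e2.Timestamp")) & (F.col("c.Type") == "Question")
)
filtered.show(truncate=False)

spark.stop()
```

The same pattern extends to the 4-node motifs pictured above: longer `find()` patterns combined with richer filters on vertex and edge attributes.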

graphml_class/stats/graph.py

+6-7
@@ -12,11 +12,7 @@
 os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
 
 # Setup PySpark to use the GraphFrames jar package from maven central
-os.environ["PYSPARK_SUBMIT_ARGS"] = (
-    "--driver-memory 4g pyspark-shell "
-    "--executor-memory 4g pyspark-shell "
-    "--driver-java-options='-Xmx4g -Xms4g' "
-)
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g --executor-memory 4g"
 
 # Show all the rows in a pd.DataFrame
 pd.set_option("display.max_columns", None)
@@ -43,8 +39,8 @@
     # Lets the Id:(Stack Overflow int) and id:(GraphFrames ULID) coexist
     .config("spark.sql.caseSensitive", True)
     # Single node mode - 128GB machine
-    .config("spark.driver.memory", "16g")
-    .config("spark.executor.memory", "8g")
+    .config("spark.driver.memory", "4g")
+    .config("spark.executor.memory", "4g")
     .getOrCreate()
 )
 sc: SparkContext = spark.sparkContext
@@ -388,3 +384,6 @@ def add_missing_columns(df, all_cols):
 
 # Write to disk and back again
 relationships_df.write.mode("overwrite").parquet(EDGES_PATH)
+
+spark.stop()
+print("Spark stopped.")
