Commit d8307a2

tech overview doc
1 parent d9f8d99 commit d8307a2

114 files changed, +9169 -4439 lines changed

double.png

66.5 KB

fusus/about/tech.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+"""
+.. include:: ../docs/about/tech.md
+"""

fusus/book.py

Lines changed: 12 additions & 1 deletion
@@ -105,6 +105,8 @@
     showImage,
     splitext,
     getNbPath,
+    getNbLink,
+    dh,
 )
 from .clean import reborder
 from .page import Page
@@ -715,7 +717,9 @@ def plainText(self, pages=None):
         indent = tm.indent

         C = self.C
+        cd = self.cd
         textDir = C.textDir
+
         if not os.path.exists(textDir):
             os.makedirs(textDir, exist_ok=True)

@@ -729,7 +733,8 @@ def plainText(self, pages=None):

         page = None

-        path = f"{textDir}/{pagesDesc}.html"
+        fileName = f"{pagesDesc}.html"
+        path = f"{textDir}/{fileName}"

         doc = """\
 <html>
@@ -834,6 +839,12 @@ def plainText(self, pages=None):
         with open(path, "w") as f:
             f.write(doc.replace("«body»", "\n".join(body)))
         info(f"written to {path}")
+        showPath = unexpanduser(f"{cd}{path}")
+        nbLink = getNbLink(showPath, fileName)
+        if nbLink is None:
+            dh(showPath)
+        else:
+            dh(f"""{nbLink} (local file: {showPath})""")


 def main():
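
Note on the last `fusus/book.py` hunk: the added lines show a notebook link when one can be built and fall back to the plain local path otherwise. The implementations of `getNbLink` and `dh` are not part of this diff, so the following is only a minimal sketch of that fallback pattern. `describe_output` and the `get_link` parameter are hypothetical names introduced for illustration.

```python
def describe_output(show_path, file_name, get_link):
    """Return the text to display for a written file.

    get_link is a hypothetical stand-in for getNbLink: it is assumed to
    return a link string, or None when no notebook link can be built.
    """
    nb_link = get_link(show_path, file_name)
    if nb_link is None:
        # No link available: show only the local path.
        return show_path
    # Link available: show it, and keep the local path for reference.
    return f"{nb_link} (local file: {show_path})"
```

Passing the link builder as a parameter keeps the sketch independent of any notebook environment; in the commit itself the result is handed to `dh` rather than returned.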

fusus/convert.py

Lines changed: 54 additions & 9 deletions
@@ -45,6 +45,44 @@

 See `fusus.convert.HELP`.

+Here are concrete commands for known works:
+
+---
+
+``` sh
+python3 -m fusus.convert tsv fususa
+python3 -m fusus.convert tsv fususl
+```
+
+This will run the OCR pipeline for the Affifi/Lakhnawi editions respectively,
+and deliver TSV data as the result.
+
+---
+
+``` sh
+python3 -m fusus.convert tf fususa 0.5
+python3 -m fusus.convert tf fususl 0.5
+```
+
+This will convert the TSV data to TF and deliver the tf files in version 0.5,
+for the Affifi and Lakhnawi editions respectively.
+
+---
+
+``` sh
+python3 -m fusus.convert tf fususa 0.5 loadonly
+python3 -m fusus.convert tf fususl 0.5 loadonly
+```
+
+This will load the TF data in version 0.5.
+The first time it loads, some extra computations will be performed, and
+a binary version of the tf files will be generated, which Text-Fabric
+will use on subsequent loads.
+
+---
+
+See also `fusus.convert`.
+
 ## Load TSV

 The function `loadTsv` loads TSV data into memory.
@@ -76,26 +114,33 @@

 --help: print this text and exit

-"source" : a work (given as keyword or as path to its work directory)
-Examples:
+source : a work (given as keyword or as path to its work directory)
+Examples:
 fususl (Fusus Al Hikam in Lakhnawi edition)
 fususa (Fusus Al Hikam in Affifi edition)
 any commentary by its keyword
 ~/github/myorg/myrepo/mydata
 mydir/mysubdir
-"pages" : page specification, only process these pages; default: all pages
-Examples:
+
+pages : page specification, only process these pages; default: all pages
+Examples:
 50
 50,70
 50-70,91,92,300-350
-"ocr" : assume the work is in the OCR pipeline
-"noocr" : assume the work is not in the OCR pipeline
-(it is then a text extract from a pdf)
+
+ocr : assume the work is in the OCR pipeline
+noocr : assume the work is not in the OCR pipeline
+(it is then a text extract from a pdf)

 For tf only:

-"load" : loads the generated TF; if missing this step is not performed
-"loadOnly": does not generate TF; loads previously generated TF
"loadOnly": does not generate TF; loads previously generated TF
137+
versiontf : loads the generated TF; if missing this step is not performed
138+
Examples:
139+
0.4
140+
3.7.2
141+
142+
load : loads the generated TF; if missing this step is not performed
143+
loadOnly : does not generate TF; loads previously generated TF
99144
"""
100145
"""Help"""
101146
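
Note on the `pages` argument in the help text above: a specification such as `50-70,91,92,300-350` combines single pages and ranges. The parser that fusus uses is not shown in this diff, so the following is only a minimal sketch of how such a specification could be expanded. It assumes that ranges are inclusive; `parse_pages` is a hypothetical name.

```python
def parse_pages(spec):
    """Expand a page specification into a sorted list of page numbers.

    Assumes comma-separated items, each either a single page ("50")
    or an inclusive range ("50-70").
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                start, end = end, start  # accept reversed ranges
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

With these assumptions, `parse_pages("50,70")` gives the two pages 50 and 70, and overlapping items are merged because the pages are collected in a set.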

fusus/docs/about/howto.md

Lines changed: 6 additions & 2 deletions
@@ -24,8 +24,10 @@

 *Page by page in a notebook*

-* [do](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/example/do.ipynb)
-  Run the pipeline in a notebook;
+* [do example](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/example/do.ipynb)
+  Run the pipeline in a notebook on the examples;
+* [do Affifi](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/Affifi/do.ipynb)
+  Run the pipeline in a notebook on the Affifi edition of the Fusus;
 * [inspect](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/example/inspect.ipynb)
   Inspect intermediate results in a notebook.
 * [ocr](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/example/ocr.ipynb)
@@ -57,6 +59,8 @@
 * `fusus.lakhnawi`
   PDF reverse engineering.

+* [drilldown](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/Lakhnawi/drilldown.ipynb)
+  Narrow down to specific pages and lines and see what text is extracted from which portion.
 * [pages](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/Lakhnawi/pages.ipynb)
   Work with pages, follow line division, extract text and save to disk.
 * [characters](https://nbviewer.jupyter.org/github/among/fusus/blob/master/notebooks/Lakhnawi/characters.ipynb)
