CATEGORY_LV1 is not as described in README.md #15

Open
JanYanisa opened this issue Jul 24, 2021 · 4 comments

@JanYanisa

Hello there,
According to README.md, CATEGORY_LV1 is described as follows: "Level-1 expenditure categories consist only of งบบุคลากร (personnel), งบดำเนินงาน (operations), งบลงทุน (investment), งบเงินอุดหนุน (subsidies), and งบรายจ่ายอื่น (other expenditures)."
But this is what I got when using pandas in Python:

df.groupby(["CATEGORY_LV1"])[["ITEM_ID"]].agg(["count"]) 

[image: screenshot of the groupby output]

So I would like to report the issue here, and I hope it can be fixed soon.
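A side note on the count above: pandas drops missing group keys by default, so rows whose CATEGORY_LV1 is NaN would not appear in that `groupby` output. The following is only a minimal sketch on made-up rows (the `ITEM_ID` values and the tiny DataFrame are invented for illustration, not taken from the dataset) showing how `value_counts(dropna=False)` keeps them visible.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "ITEM_ID": ["a.1", "a.2", "a.3", "a.4"],
    "CATEGORY_LV1": ["งบลงทุน", "งบลงทุน", None, "งบบุคลากร"],
})

# groupby drops NaN keys by default, so the row with a missing
# category is silently left out of this count.
grouped = df.groupby(["CATEGORY_LV1"])[["ITEM_ID"]].agg(["count"])

# value_counts(dropna=False) keeps NaN as its own bucket.
counts = df["CATEGORY_LV1"].value_counts(dropna=False)

print(grouped)
print(counts)
```

Comparing the two totals is a quick way to tell whether any rows have no category at all.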

@JanYanisa
Author

In addition, these are the ITEM_ID values of the rows whose CATEGORY_LV1 is not one of "งบบุคลากร", "งบดำเนินงาน", "งบลงทุน", "งบเงินอุดหนุน", "งบรายจ่ายอื่น":

2022.3.1.0
2022.3.1.1
2022.3.1.2
2022.3.1.3
2022.3.1.4
2022.3.1.5
2022.3.1.6
2022.3.1.7
2022.3.1.8
2022.3.1.9
2022.3.1.10
2022.3.2.472
2022.3.2.473
2022.3.2.474
2022.3.2.475
2022.3.2.476
2022.3.2.477
2022.3.2.478
2022.3.2.479
2022.3.2.480
2022.3.2.481
2022.3.2.482
2022.3.2.483
2022.3.2.484
2022.3.2.485
2022.3.2.486
2022.3.2.520
2022.3.3(1).431
2022.3.3(1).432
2022.3.3(2).584
2022.3.3(2).585
2022.3.3(2).586
2022.3.3(2).587
2022.3.3(2).588
2022.3.3(2).589
2022.3.3(2).590
2022.3.3(2).887
2022.3.3(2).888
2022.3.3(2).889
2022.3.3(2).890
2022.3.3(2).891
2022.3.3(2).892
2022.3.3(2).893
2022.3.3(4).957
2022.3.3(4).958
2022.3.3(4).959
2022.3.3(4).960
2022.3.3(4).961
2022.3.3(4).962
2022.3.3(4).963
2022.3.3(4).964
2022.3.3(4).965
2022.3.3(4).966
2022.3.3(4).967
2022.3.3(4).968
2022.3.3(4).969
2022.3.3(4).970
2022.3.3(4).971
2022.3.3(4).972
2022.3.3(4).973
2022.3.3(4).974
2022.3.3(4).975
2022.3.3(4).976
2022.3.3(4).977
2022.3.3(4).978
2022.3.3(4).979
2022.3.3(4).980
2022.3.3(4).981
2022.3.3(4).982
2022.3.3(4).983
2022.3.3(4).984
2022.3.3(4).985
2022.3.3(4).986
2022.3.3(4).987
2022.3.3(4).988
2022.3.7.891
2022.3.7.892
2022.3.7.893
2022.3.7.1091
2022.3.7.1092
2022.3.7.1093
2022.3.8.252
2022.3.8.253
2022.3.8.254
2022.3.8.442
2022.3.8.443
2022.3.8.444
2022.3.8.445
2022.3.8.446
2022.3.8.447
2022.3.8.448
2022.3.8.449
2022.3.8.450
2022.3.8.451
2022.3.8.452
2022.3.10.1489
2022.3.10.1490
2022.3.10.1491
2022.3.10.1492
2022.3.10.1493
2022.3.10.1494
2022.3.10.1495
2022.3.10.1496
2022.3.10.1497
2022.3.10.1498
2022.3.12.124
2022.3.12.125
2022.3.12.126
2022.3.12.127
2022.3.12.128
2022.3.12.129
2022.3.12.197
2022.3.12.1667
2022.3.12.1718
2022.3.13(2).79
2022.3.13(2).187
2022.3.14.501
2022.3.16(2).665
2022.3.16(2).666
2022.3.16(2).667

@tee4cute
Member

Hi @JanYanisa, thanks for reporting the issue here!

I've done some investigation and flagged the errors shown in your image above with three categories:

  • S for Syntactic Errors.
  • E for Exceptional cases.
  • O for OCR Errors.

as shown in the image below.

[image 1: the reported values flagged S, E, or O]

Syntactic Error (S)

A syntactic error means that the data is written incorrectly in the source PDFs. For example, "ค่าครุภัณฑ์ ที่ดินและสิ่งก่อสร้าง" (equipment, land, and construction costs), as shown in the image below:

[image 2: the entry as written in the source PDF]

You can see that this error comes from the Government Budgetary Office itself! This is one kind of syntactic error where we MUST CORRECT THE DATA BY HAND.

Exceptional Cases (E)

These cases come from "งบกลาง" (central fund) entries. We'll update the documentation to state the exceptions more clearly. Thanks for pointing this out :)

OCR Error (O)

These errors are produced by the OCR tool used in this project, the Google Cloud Vision API. Here, too, there is nothing we can do except edit the data by hand.

Summary

Some syntactic errors, such as wrong indentation and bulleting, can be resolved by adding logic to the compiler so that it tolerates them. We plan to make the compiler more robust in this way and then run it over all of the source PDFs to regenerate the output file. Hence, the errors that must be edited by hand will be the last to be fixed.

I'll leave this issue open and keep you updated as we make further releases.

Cheers ;)

tee4cute added a commit that referenced this issue Jul 25, 2021
Add exceptional cases explanation to `CATEGORY_LV1` for "งบกลาง" reported by this issue [#15].
@JanYanisa
Author

Thanks for the response!

There are also problematic rows that do not show up in the image above because their CATEGORY_LV1 is NaN (ITEM_ID: 2022.3.16(2).665, 2022.3.16(2).666, 2022.3.16(2).667).
So, as a double check, I recommend the following pandas snippet to list the problematic rows in CATEGORY_LV1:

con3 = df["CATEGORY_LV1"] == "งบบุคลากร"
con4 = df["CATEGORY_LV1"] == "งบดำเนินงาน"
con5 = df["CATEGORY_LV1"] == "งบลงทุน"
con6 = df["CATEGORY_LV1"] == "งบเงินอุดหนุน"
con7 = df["CATEGORY_LV1"] == "งบรายจ่ายอื่น"
df[["ITEM_ID","CATEGORY_LV1"]][~(con3 | con4 | con5 | con6 | con7)]
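The same check can be written more compactly with `Series.isin`. This is only a sketch: the helper name `invalid_category_rows` and the sample rows are made up for illustration. Like the snippet above, it also returns rows whose CATEGORY_LV1 is NaN, because `isin` evaluates to False for missing values.

```python
import pandas as pd

# The five values listed in README.md.
VALID_CATEGORY_LV1 = [
    "งบบุคลากร",
    "งบดำเนินงาน",
    "งบลงทุน",
    "งบเงินอุดหนุน",
    "งบรายจ่ายอื่น",
]


def invalid_category_rows(df):
    """Return ITEM_ID and CATEGORY_LV1 for rows outside the documented values."""
    # isin() is False for NaN, so rows with a missing category are kept too.
    mask = ~df["CATEGORY_LV1"].isin(VALID_CATEGORY_LV1)
    return df.loc[mask, ["ITEM_ID", "CATEGORY_LV1"]]


# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "ITEM_ID": ["b.1", "b.2", "b.3"],
    "CATEGORY_LV1": ["งบลงทุน", "something else", None],
})
print(invalid_category_rows(sample))
```

Keeping the valid values in one list means the check does not need a new condition variable if the documented set ever changes.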

@tee4cute
Member

@napatswift I think we should write a message to DEBUG_LOG that clearly identifies CATEGORY_LV1 errors in future releases.

Valid CATEGORY_LV1 values:

  • If the entry belongs to "งบกลาง", CATEGORY_LV1 can be any value.
  • Otherwise, it must be one of the five values stated in the documentation: "งบบุคลากร", "งบดำเนินงาน", "งบลงทุน", "งบเงินอุดหนุน", "งบรายจ่ายอื่น".
  • If neither condition holds, produce a log message.

Since this is a minor change (it only produces a log message), I think we can wait for other major changes and then include it in the next release.
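The rule above could be sketched roughly as follows. This is not the project's compiler code, only an illustration under assumptions: the thread does not say how "งบกลาง" entries are identified or how DEBUG_LOG is written, so the caller passes a boolean mask (`is_central_fund`, a hypothetical name) and Python's standard `logging` module stands in for DEBUG_LOG.

```python
import logging

import pandas as pd

# Hypothetical logger name; a stand-in for however DEBUG_LOG is produced.
logger = logging.getLogger("category_lv1_check")

# The five values stated in the documentation.
VALID_CATEGORY_LV1 = [
    "งบบุคลากร",
    "งบดำเนินงาน",
    "งบลงทุน",
    "งบเงินอุดหนุน",
    "งบรายจ่ายอื่น",
]


def log_invalid_category_lv1(df, is_central_fund):
    """Log and return the ITEM_IDs whose CATEGORY_LV1 breaks the rule.

    is_central_fund is a boolean Series aligned with df, True for "งบกลาง"
    entries. How those entries are detected is left to the caller (assumption).
    """
    # Invalid = not a central-fund entry AND not one of the five values.
    bad = ~is_central_fund & ~df["CATEGORY_LV1"].isin(VALID_CATEGORY_LV1)
    rows = df.loc[bad, ["ITEM_ID", "CATEGORY_LV1"]]
    for item_id, value in rows.itertuples(index=False):
        logger.warning("Invalid CATEGORY_LV1 %r for ITEM_ID %s", value, item_id)
    return rows["ITEM_ID"].tolist()


# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "ITEM_ID": ["x.1", "x.2", "x.3"],
    "CATEGORY_LV1": ["งบลงทุน", "anything", "bad value"],
})
central = pd.Series([False, True, False])
print(log_invalid_category_lv1(sample, central))
```

Passing the mask in, rather than hard-coding a column name, keeps the sketch independent of how the output file marks "งบกลาง" entries.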
