Norway? #6
That is strange. I'm not sure why Norway shows up. Is there no mention of Norway in the report? Also, how long is the report? One thing to note is that these models were trained on abstracts, and performance tends to degrade a bit with very long texts. We actually found that performance nearly matched that on abstracts when we automatically summarized longer texts down to abstract length (for example, using sumy).
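As a rough illustration of shrinking a long report to abstract length before tagging, here is a minimal frequency-based extractive summarizer that uses only the Python standard library. It is a stand-in for what a library such as sumy does, not sumy's API; the function name `summarize`, the tiny stopword list, and the sentence-splitting regex are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was",
             "for", "on", "with", "at", "by"}


def summarize(text, max_sentences=5):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower())
                  if w not in STOPWORDS]
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentence indices, then restore document order.
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

The summary, rather than the full report, would then be sent for keywording, so the input length is closer to the abstracts the models were trained on.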
No, and I've seen it in quite a few others... so it is relating Norway to something. So summarise, and keyword the summary as well?
Sumy link says not found, btw.
I think Anthony was suggesting to use 'sumy' to summarize large documents down to abstract length. It is weird that "Norway" is one of the predicted tags. Are there place names, proper names, or acronyms in the text that could relate to Norway?
It didn't look like it, and I am somewhat familiar with Norwegian and its place names. It came up sometimes in some purely South Australian things too. The above had a Canadian address in it. Might have to run it on some chunks to narrow it down.
That library does look handy, thanks.
Checked a few of them... maybe 20% of docs had it come up in this sample of 600-odd. Checked a few just then for the existence of 'norway' in the text: not there, so it is something related.
Took the example above and kept shrinking it to find the 'Norwayish' bit. As I did so, the Norway probability declined, finally down to 0.018. Shrunk it further and then got 0.025 for Sweden. The only proper nouns were two company names and SOW as an acronym. So, interesting.
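The manual shrinking described above can be sketched as a loop that scores fixed-size chunks and ranks them by how strongly a given tag is predicted. The `predict` argument is a hypothetical callable standing in for whatever calls the tagging service and returns a `{keyword: probability}` dict; it is not part of the project's API, and the chunk size of 200 words is an arbitrary choice.

```python
def chunk_words(text, size=200):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def rank_chunks(text, predict, keyword="norway", size=200):
    """Score each chunk for `keyword`; return (probability, index, chunk), highest first.

    `predict` is a hypothetical callable: text -> {keyword: probability}.
    """
    scored = []
    for index, chunk in enumerate(chunk_words(text, size)):
        probability = predict(chunk).get(keyword, 0.0)
        scored.append((probability, index, chunk))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Given that the probability fell as the text shrank, no single chunk may score highly on its own; in that case the signal is probably spread across the document rather than tied to one passage.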
I'm not sure what the particular reason is for Norway appearing. I think we would have to manually inspect the model for that individual word and see how it is weighting term importances. Could be a case of overfitting. When you say shrink, do you mean you are using automated summarization?
Just saw this tool, might be useful for debugging this type of problem https://github.com/pair-code/lit
Anthony, no, I was taking sections of the text to try to see which ones indicated Norway. Justin, saw that one yesterday, will have to investigate. Maybe a higher proportion of the abstracts mentioning Norway are exploration-geoscience-y than others, so a longer document like that picks up more similarities.
Tomorrow I should be able to do a 'Norway' count on a fair set of documents.
7000-odd reports (some from mangled OCR). Norway keyword percentage: 82.1
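A count like this could be computed along the following lines. This is a minimal sketch, assuming each response has already been parsed from JSON and that each item is the object that directly holds `sti_keywords` (a list of lists of `keyword`/`probability`/`unstemmed` entries, as in the output pasted in this thread). The function name and the default threshold of 0.5 are choices made for the example.

```python
def keyword_percentage(responses, keyword="norway", threshold=0.5):
    """Percentage of responses in which `keyword` is predicted at or above `threshold`."""
    if not responses:
        return 0.0
    hits = 0
    for response in responses:
        # "sti_keywords" is a list of lists, one inner list per input text.
        for keyword_list in response.get("sti_keywords", []):
            if any(k["keyword"] == keyword and k["probability"] >= threshold
                   for k in keyword_list):
                hits += 1
                break  # count each response at most once
    return 100.0 * hits / len(responses)
```

Running the same function with `keyword="stability"` or other frequent tags gives a baseline for judging whether 'norway' is unusually common or just one of several over-predicted terms.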
Interesting. There must be a really strong correlation between the type of content you're feeding in and Norway in the training data. I'll try to take a closer look at the Norway model at some point and see if it leads to any insights.
Thanks! Haven't had time yet to try any of the techniques above, or get to the summarisation comparison. Hopefully this week.
Stability is more common than norway, too, although that at least seems more likely.
Random report summary picked out: Apart from fair dealing for the purposes of study, research, criticism or review as permitted under the Copyright Act, no part may be reproduced without written permission of the Chief Executive of Primary Industries and Resources South Australia, GPO Box 1671, Adelaide, SA 5001.
"probability_threshold": 0.5,
and the keywords from the whole report:
,"sti_keywords":[[
{"keyword":"detection","probability":1.0,"unstemmed":"DETECTION"},
{"keyword":"frequency","probability":1.0,"unstemmed":"FREQUENCIES"},
{"keyword":"norway","probability":1.0,"unstemmed":"NORWAY"},
{"keyword":"simulation","probability":1.0,"unstemmed":"SIMULATION"},
{"keyword":"stability","probability":1.0,"unstemmed":"STABILITY"},
{"keyword":"velocity","probability":1.0,"unstemmed":"VELOCITY"},
{"keyword":"wind (meteorology)","probability":1.0,"unstemmed":"WIND (METEOROLOGY)"},
{"keyword":"drainage","probability":0.9999985098838806,"unstemmed":"DRAINAGE"},
{"keyword":"boundary","probability":0.9999971389770508,"unstemmed":"BOUNDARIES"},
{"keyword":"binders (materials)","probability":0.9999895691871643,"unstemmed":"BINDERS (MATERIALS)"},
{"keyword":"spectrum","probability":0.9999833106994629,"unstemmed":"SPECTRA"},
{"keyword":"topography","probability":0.9999693632125854,"unstemmed":"TOPOGRAPHY"},
{"keyword":"turbulence","probability":0.9999633431434631,"unstemmed":"TURBULENCE"},
{"keyword":"gas","probability":0.9999365210533142,"unstemmed":"GASES"},
{"keyword":"gravitation","probability":0.9999074935913086,"unstemmed":"GRAVITATION"},
{"keyword":"scattering","probability":0.9998400807380676,"unstemmed":"SCATTERING"},
{"keyword":"coast","probability":0.9998008012771606,"unstemmed":"COASTS"},
{"keyword":"geology","probability":0.9997913837432861,"unstemmed":"GEOLOGY"},
{"keyword":"circulation","probability":0.9997692704200745,"unstemmed":"CIRCULATION"},
{"keyword":"horizontal orientation","probability":0.9996856451034546,"unstemmed":"HORIZONTAL ORIENTATION"},
{"keyword":"survey","probability":0.9996093511581421,"unstemmed":"SURVEYS"},
{"keyword":"deposition","probability":0.9995365738868713,"unstemmed":"DEPOSITION"},
{"keyword":"sediment","probability":0.9993331432342529,"unstemmed":"SEDIMENTS"},
{"keyword":"crystal","probability":0.9993000030517578,"unstemmed":"CRYSTALS"},
{"keyword":"plasmas (physics)","probability":0.9992559552192688,"unstemmed":"PLASMAS (PHYSICS)"},
{"keyword":"adult","probability":0.9988338947296143,"unstemmed":"ADULTS"},
{"keyword":"rock","probability":0.9987765550613403,"unstemmed":"ROCKS"},
{"keyword":"prediction","probability":0.9986652731895447,"unstemmed":"PREDICTIONS"},
{"keyword":"error","probability":0.99862140417099,"unstemmed":"ERRORS"},
{"keyword":"canada","probability":0.9981276392936707,"unstemmed":"CANADA"},
{"keyword":"position (location)","probability":0.9980089068412781,"unstemmed":"POSITION (LOCATION)"},
{"keyword":"perturbation","probability":0.9977721571922302,"unstemmed":"PERTURBATION"},
{"keyword":"geophysic","probability":0.9969445466995239,"unstemmed":"GEOPHYSICS"},
{"keyword":"image","probability":0.9956791996955872,"unstemmed":"IMAGES"},
{"keyword":"ion","probability":0.9956188201904297,"unstemmed":"IONS"},
{"keyword":"loss","probability":0.9947844743728638,"unstemmed":"LOSSES"},
{"keyword":"wake","probability":0.9905879497528076,"unstemmed":"WAKES"},
{"keyword":"projectile","probability":0.9886454939842224,"unstemmed":"PROJECTILES"},
{"keyword":"resolution","probability":0.987578272819519,"unstemmed":"RESOLUTION"},
{"keyword":"x ray","probability":0.9852748513221741,"unstemmed":"X RAYS"},
{"keyword":"water","probability":0.9840564727783203,"unstemmed":"WATER"},
{"keyword":"longitude","probability":0.9819511771202087,"unstemmed":"LONGITUDE"},
{"keyword":"injection","probability":0.9818744659423828,"unstemmed":"INJECTION"},
{"keyword":"test","probability":0.9723952412605286,"unstemmed":"TESTS"},
{"keyword":"mollusk","probability":0.963683009147644,"unstemmed":"MOLLUSKS"},
{"keyword":"attitude (inclination)","probability":0.956942081451416,"unstemmed":"ATTITUDE (INCLINATION)"},
{"keyword":"deformation","probability":0.9378564357757568,"unstemmed":"DEFORMATION"},
{"keyword":"united kingdom","probability":0.9181330800056458,"unstemmed":"UNITED KINGDOM"},
{"keyword":"sensitivity analysi","probability":0.9107515811920166,"unstemmed":"SENSITIVITY ANALYSIS"},
{"keyword":"australia","probability":0.9103879928588867,"unstemmed":"AUSTRALIA"},
{"keyword":"crystallinity","probability":0.8907729387283325,"unstemmed":"CRYSTALLINITY"},
{"keyword":"mathematic","probability":0.870129406452179,"unstemmed":"MATHEMATICS"},
{"keyword":"excitation","probability":0.8667937517166138,"unstemmed":"EXCITATION"},
{"keyword":"thicknes","probability":0.8121394515037537,"unstemmed":"THICKNESS"},
{"keyword":"measurement","probability":0.8025984764099121,"unstemmed":"MEASUREMENT"},
{"keyword":"reliability","probability":0.7956236600875854,"unstemmed":"RELIABILITY"},
{"keyword":"drying","probability":0.7842564582824707,"unstemmed":"DRYING"},
{"keyword":"angular resolution","probability":0.7671957612037659,"unstemmed":"ANGULAR RESOLUTION"},
{"keyword":"male","probability":0.6826151609420776,"unstemmed":"MALES"},
{"keyword":"wall","probability":0.6692981719970703,"unstemmed":"WALLS"},
{"keyword":"oscillation","probability":0.6411301493644714,"unstemmed":"OSCILLATIONS"},
{"keyword":"female","probability":0.6219513416290283,"unstemmed":"FEMALES"},
{"keyword":"shape","probability":0.49701258540153503,"unstemmed":"SHAPES"},
{"keyword":"biodynamic","probability":0.48482751846313477,"unstemmed":"BIODYNAMICS"},
{"keyword":"target","probability":0.4754076600074768,"unstemmed":"TARGETS"},
{"keyword":"heating","probability":0.46355319023132324,"unstemmed":"HEATING"},
{"keyword":"europe","probability":0.4513147473335266,"unstemmed":"EUROPE"},
{"keyword":"infrared radiation","probability":0.36240354180336,"unstemmed":"INFRARED RADIATION"},
{"keyword":"analysis (mathematics)","probability":0.357046902179718,"unstemmed":"ANALYSIS (MATHEMATICS)"}]],
"topic_probabilities":[[
{"keyword":"geoscience","probability":0.9999999999995617,"unstemmed":"geosciences"},
{"keyword":"life science","probability":0.9999999998044267,"unstemmed":"life sciences"},
{"keyword":"space science","probability":5.952435167211817e-05,"unstemmed":"space sciences"},
{"keyword":"physic","probability":4.797571174203055e-07,"unstemmed":"physics"},
{"keyword":"aeronautic","probability":3.452614189217303e-07,"unstemmed":"aeronautics"},
{"keyword":"social and information science","probability":2.775078370610512e-07,"unstemmed":"social and information sciences"},
{"keyword":"engineering","probability":2.1643729127246794e-07,"unstemmed":"engineering"},
{"keyword":"astronautic","probability":3.961110196099098e-08,"unstemmed":"astronautics"},
{"keyword":"chemistry and material","probability":6.182643119277685e-10,"unstemmed":"chemistry and materials"},
{"keyword":"mathematical and computer science","probability":4.1405143328909e-10,"unstemmed":"mathematical and computer sciences"},
{"keyword":"general","probability":2.2434030242077953e-19,"unstemmed":"general"}]],
"topic_threshold":1.0},"service_version":"unspecified","status":"okay"}
I made a graph out of what I have done; here are the edge counts for 'norway': [('stability', {'weight': 12740}),
and coal
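How the graph was built is not stated in the thread. One plausible reading, offered only as an assumption, is a per-document co-occurrence count between predicted keywords, which yields edge weights in the same `(neighbour, {'weight': n})` shape quoted above. The function name and input format here are illustrative.

```python
from collections import Counter


def cooccurrence_edges(keyword_sets, target="norway"):
    """Count how often each other keyword appears alongside `target`.

    `keyword_sets` is an iterable of per-document keyword lists.
    Returns (neighbour, {"weight": count}) pairs, heaviest first.
    """
    counts = Counter()
    for keywords in keyword_sets:
        unique = set(keywords)  # ignore duplicates within one document
        if target in unique:
            counts.update(unique - {target})
    return [(name, {"weight": weight}) for name, weight in counts.most_common()]
```

Keywords that co-occur with 'norway' almost every time it appears are candidates for whatever shared signal the models are picking up.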
One area where Northern Hemisphere-dominated keywords could be a problem: would there be differences between petroleum and gas, if they call them all gas?
Anecdotally, 'norway' seems to be popular, even, for example, when looking at SA exploration reports. Or here:
This was from a proposal to do predictive analytics on a bunch of geoscience stuff with some dubious analogies - from a Canadian company.
"sti_keywords": [
[
{
"keyword": "education",
"probability": 0.9999927878379822,
"unstemmed": "EDUCATION"
},
{
"keyword": "prediction",
"probability": 0.9999526143074036,
"unstemmed": "PREDICTIONS"
},
{
"keyword": "stability",
"probability": 0.9999324679374695,
"unstemmed": "STABILITY"
},
{
"keyword": "frequency",
"probability": 0.9998124837875366,
"unstemmed": "FREQUENCIES"
},
{
"keyword": "simulation",
"probability": 0.9997246265411377,
"unstemmed": "SIMULATION"
},
{
"keyword": "detection",
"probability": 0.9996479153633118,
"unstemmed": "DETECTION"
},
{
"keyword": "organization",
"probability": 0.999604344367981,
"unstemmed": "ORGANIZATIONS"
},
{
"keyword": "information system",
"probability": 0.9994807243347168,
"unstemmed": "INFORMATION SYSTEMS"
},
{
"keyword": "test",
"probability": 0.9952819347381592,
"unstemmed": "TESTS"
},
{
"keyword": "management method",
"probability": 0.9947834014892578,
"unstemmed": "MANAGEMENT METHODS"
},
{
"keyword": "document",
"probability": 0.9927297234535217,
"unstemmed": "DOCUMENTS"
},
{
"keyword": "sensitivity analysi",
"probability": 0.9906492829322815,
"unstemmed": "SENSITIVITY ANALYSIS"
},
{
"keyword": "survey",
"probability": 0.9885730743408203,
"unstemmed": "SURVEYS"
},
{
"keyword": "measurement",
"probability": 0.9878258109092712,
"unstemmed": "MEASUREMENT"
},
{
"keyword": "on-line system",
"probability": 0.9819966554641724,
"unstemmed": "ON-LINE SYSTEMS"
},
{
"keyword": "probability theory",
"probability": 0.9808198809623718,
"unstemmed": "PROBABILITY THEORY"
},
{
"keyword": "decision making",
"probability": 0.9807706475257874,
"unstemmed": "DECISION MAKING"
},
{
"keyword": "statistical analysi",
"probability": 0.9802524447441101,
"unstemmed": "STATISTICAL ANALYSIS"
},
{
"keyword": "commerce",
"probability": 0.9727779626846313,
"unstemmed": "COMMERCE"
},
{
"keyword": "wind (meteorology)",
"probability": 0.9680668115615845,
"unstemmed": "WIND (METEOROLOGY)"
},
{
"keyword": "ranking",
"probability": 0.966843843460083,
"unstemmed": "RANKING"
},
{
"keyword": "norway",
"probability": 0.9654883742332458,
"unstemmed": "NORWAY"
},
{
"keyword": "turbulence",
"probability": 0.9554965496063232,
"unstemmed": "TURBULENCE"
},
{
"keyword": "environment",
"probability": 0.9523953795433044,
"unstemmed": "ENVIRONMENTS"
},
{
"keyword": "artificial intelligence",
"probability": 0.9461550712585449,
"unstemmed": "ARTIFICIAL INTELLIGENCE"
},
{
"keyword": "spectrum",
"probability": 0.9431440830230713,
"unstemmed": "SPECTRA"
},
{
"keyword": "supplying",
"probability": 0.9292123317718506,
"unstemmed": "SUPPLYING"
},
{
"keyword": "thesis",
"probability": 0.9177210330963135,
"unstemmed": "THESES"
},
{
"keyword": "industry",
"probability": 0.8984866738319397,
"unstemmed": "INDUSTRIES"
},
{
"keyword": "research management",
"probability": 0.8902236223220825,
"unstemmed": "RESEARCH MANAGEMENT"
},
{
"keyword": "logistic",
"probability": 0.8653072118759155,
"unstemmed": "LOGISTICS"
},
{
"keyword": "marketing",
"probability": 0.8555718660354614,
"unstemmed": "MARKETING"
},
{
"keyword": "material",
"probability": 0.8554256558418274,
"unstemmed": "MATERIALS"
},
{
"keyword": "exploration",
"probability": 0.8326903581619263,
"unstemmed": "EXPLORATION"
},
{
"keyword": "finance",
"probability": 0.7999643683433533,
"unstemmed": "FINANCE"
},
{
"keyword": "procedure",
"probability": 0.7985092997550964,
"unstemmed": "PROCEDURES"
},
{
"keyword": "plasmas (physics)",
"probability": 0.7585477828979492,
"unstemmed": "PLASMAS (PHYSICS)"
},
{
"keyword": "management system",
"probability": 0.7525011897087097,
"unstemmed": "MANAGEMENT SYSTEMS"
},
{
"keyword": "personnel",
"probability": 0.7381218075752258,
"unstemmed": "PERSONNEL"
},
{
"keyword": "quality control",
"probability": 0.6919220685958862,
"unstemmed": "QUALITY CONTROL"
},
{
"keyword": "binders (materials)",
"probability": 0.682461678981781,
"unstemmed": "BINDERS (MATERIALS)"
},
{
"keyword": "statistic",
"probability": 0.6598255634307861,
"unstemmed": "STATISTICS"
},
{
"keyword": "technology transfer",
"probability": 0.6495755910873413,
"unstemmed": "TECHNOLOGY TRANSFER"
},
{
"keyword": "economic",
"probability": 0.6489657163619995,
"unstemmed": "ECONOMICS"
},
{
"keyword": "environment management",
"probability": 0.6489007472991943,
"unstemmed": "ENVIRONMENT MANAGEMENT"
},
{
"keyword": "reliability",
"probability": 0.6394709348678589,
"unstemmed": "RELIABILITY"
},
{
"keyword": "methodology",
"probability": 0.5842968225479126,
"unstemmed": "METHODOLOGY"
},
{
"keyword": "life (durability)",
"probability": 0.560967743396759,
"unstemmed": "LIFE (DURABILITY)"
},
{
"keyword": "task",
"probability": 0.5498309135437012,
"unstemmed": "TASKS"
},
{
"keyword": "feedback",
"probability": 0.5493220686912537,
"unstemmed": "FEEDBACK"
},
{
"keyword": "embedding",
"probability": 0.5298750996589661,
"unstemmed": "EMBEDDING"
},
{
"keyword": "model",
"probability": 0.5171043276786804,
"unstemmed": "MODELS"
},
{
"keyword": "image",
"probability": 0.5064711570739746,
"unstemmed": "IMAGES"
},
{
"keyword": "market research",
"probability": 0.5063830018043518,
"unstemmed": "MARKET RESEARCH"
}
]
],
"topic_probabilities": [
[
{
"keyword": "social and information science",
"probability": 0.9999999967051496,
"unstemmed": "social and information sciences"
},
{
"keyword": "life science",
"probability": 0.9997857836273429,
"unstemmed": "life sciences"
},
{
"keyword": "mathematical and computer science",
"probability": 0.9995673040968835,
"unstemmed": "mathematical and computer sciences"
},
{
"keyword": "geoscience",
"probability": 0.6036738988592714,
"unstemmed": "geosciences"
},
{
"keyword": "aeronautic",
"probability": 0.006360614090616023,
"unstemmed": "aeronautics"
},
{
"keyword": "astronautic",
"probability": 0.0010483669327786213,
"unstemmed": "astronautics"
},
{
"keyword": "engineering",
"probability": 0.0006828873398622611,
"unstemmed": "engineering"
},
{
"keyword": "chemistry and material",
"probability": 0.0006251699807542186,
"unstemmed": "chemistry and materials"
},
{
"keyword": "space science",
"probability": 0.0004059099588450474,
"unstemmed": "space sciences"
},
{
"keyword": "physic",
"probability": 0.000008485140273070631,
"unstemmed": "physics"
},
{
"keyword": "general",
"probability": 4.569450859931134e-8,
"unstemmed": "general"
}
]
],
"topic_threshold": 1