etemesi-ke/tgnews

Hello there

Building

Download the data files from here.

git clone [whatever this repository is]

Place the data files in the same folder as the files fetched from git.

Then run:

./make-solution.sh

which will:

  • Install protobuf-compiler
  • Check whether cargo (the Rust package manager) exists; if not, download it and prompt you to install it
  • Run cargo build --release with target-cpu=native
  • Copy the binary from ./tgnews/target/release/tgnews to this directory
  • Clean the build files

If you want to do it yourself:

apt install protobuf-compiler

cargo build --release

cp ./target/release/tgnews ./

Then run the binary with the required arguments.

How things work

All modules spawn 16 worker threads to handle processing (this can be changed with the --threads command-line argument).
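For illustration, here is a minimal sketch of what a shared-queue pool like that could look like; the channel-based design and every name in it (`process_files` included) are assumptions for the example, not tgnews's actual code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical worker pool: `n_threads` would come from --threads.
fn process_files(paths: Vec<String>, n_threads: usize) {
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));

    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let rx = Arc::clone(&rx);
        handles.push(thread::spawn(move || loop {
            // Hold the lock only while receiving, not while processing.
            let msg = rx.lock().unwrap().recv();
            match msg {
                Ok(path) => println!("processing {}", path),
                Err(_) => break, // channel closed: no more work
            }
        }));
    }

    for p in paths {
        tx.send(p).unwrap();
    }
    drop(tx); // close the channel so the workers exit

    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    process_files(vec!["a.html".into(), "b.html".into()], 16);
}
```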

Languages

whatlang is used to provide language detection on the body of the article. Whatlang utilizes trigrams to classify languages.

Passing the whole article results in a lot of time spent calculating trigrams, but it allows me to set a high accuracy threshold (a threshold of 1.0).
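A rough sketch of that gate, assuming whatlang's `detect`/`Info` API (treat the exact method names as an assumption; only the 1.0 threshold comes from the description above):

```rust
use whatlang::{detect, Lang};

// Only fully confident detections pass the 1.0 threshold.
fn article_language(body: &str) -> Option<Lang> {
    let info = detect(body)?;
    if info.confidence() >= 1.0 {
        Some(info.lang())
    } else {
        None
    }
}

fn main() {
    let body = "The quick brown fox jumps over the lazy dog, as foxes tend to do.";
    match article_language(body) {
        Some(Lang::Eng) => println!("english"),
        Some(Lang::Rus) => println!("russian"),
        Some(other) => println!("unsupported: {:?}", other),
        None => println!("not confident enough"),
    }
}
```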

News

A set of regex-based filters matches the titles that were indicated to be evil and flags those articles as not news.
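Something in this spirit, using the regex crate's `RegexSet`; the patterns below are made up for illustration, the real filters live in the source:

```rust
use regex::RegexSet;

// Illustrative "evil title" filters; not the actual tgnews patterns.
fn is_not_news(title: &str) -> bool {
    // RegexSet checks all patterns in a single pass over the title.
    let filters = RegexSet::new(&[
        r"(?i)horoscope",
        r"(?i)crossword",
        r"(?i)recipe of the day",
    ])
    .unwrap();
    filters.is_match(title)
}

fn main() {
    assert!(is_not_news("Your daily horoscope for March"));
    assert!(!is_not_news("Parliament passes new budget"));
    println!("ok");
}
```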

Categories

Broken into 2:

  • Manual Classification

I had time to investigate the URLs on the websites, and with a bit of wrangling and common knowledge we can easily see that some websites prepend certain keywords for certain categories. E.g. with a URL like https://news.com/tech/how-not-to-classify-articles we can assume with high confidence that it is a technology article.

A full list of the URL keywords used can be found in src/server/categories/classifier.rs

Note: when using this, we need to remove the last part of the URL since it can be totally misleading. For example, https://news.com/food/I-love-politics might be thought of as politics but is in fact food (I had no better example). A small sketch of the idea follows.
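A minimal sketch of that idea with a tiny made-up keyword table (the real list is in src/server/categories/classifier.rs; `category_from_url` is a hypothetical name):

```rust
// Drop the last path segment (the slug), then match the remaining
// segments against category keywords.
fn category_from_url(url: &str) -> Option<&'static str> {
    let path = url.splitn(2, "//").nth(1)?; // strip the scheme
    let mut segments: Vec<&str> = path.split('/').skip(1).collect(); // skip the host
    segments.pop(); // drop the slug, e.g. "I-love-politics"

    for seg in segments {
        match seg {
            "tech" | "technology" => return Some("technology"),
            "sport" | "sports" => return Some("sports"),
            "business" | "economy" => return Some("economy"),
            "science" => return Some("science"),
            _ => {}
        }
    }
    None
}

fn main() {
    assert_eq!(
        category_from_url("https://news.com/tech/how-not-to-classify-articles"),
        Some("technology")
    );
    // The misleading slug is ignored, so this is (correctly) not politics.
    assert_eq!(category_from_url("https://news.com/food/I-love-politics"), None);
    println!("ok");
}
```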

  • A FastText model

Done by a FastText model trained on some stolen freely available data from here. The model has 7 labels, each corresponding to one of the categories listed.

Text is cleaned by removing stopwords, lowercasing it, removing punctuation and removing newlines (normalizing the whole article to one line), and then the cleaned body is passed to the classifier; a sketch of the cleaning follows.

Everything classified with a confidence of less than 0.48 is dropped (i.e. there were no distinct features in the article, or the article is weird).
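Here is what that cleaning pipeline could look like; the stopword list is a tiny illustrative subset and the function name is made up:

```rust
use std::collections::HashSet;

// Lowercase, strip punctuation, drop stopwords, collapse to one line.
fn clean_for_classifier(body: &str) -> String {
    let stopwords: HashSet<&str> = ["the", "a", "an", "is", "of", "and", "to"]
        .iter()
        .copied()
        .collect();

    body.to_lowercase()
        .split_whitespace() // also eats newlines, normalizing to one line
        .map(|w| w.trim_matches(|c: char| c.is_ascii_punctuation()))
        .filter(|w| !w.is_empty() && !stopwords.contains(w))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let body = "The quick, brown Fox\njumps over the lazy Dog.";
    assert_eq!(clean_for_classifier(body), "quick brown fox jumps over lazy dog");
    println!("ok");
}
```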

Threads

Haaa, dissimilarity matrices are amazing.

Okay, repeat steps 1, 2, 3:

  • Detect language, is it news?, categorize; the normal stuff.

  • Put them into their respective categories, plus one master category.

  • If there are fewer than 5000 files (or 7000, idk) we cluster them all at once.

  • If more than 7000, I guess we cluster them based on their respective category, i.e. those of society are clustered in one group, economy in another, and so on and so forth.

  • Still, if a group contains more than 10000 articles (still not sure about these numbers) we sort them by published time, split them into batches and cluster the batches.

  • Clustering is done by some very weird SLINK algorithm which returns labels for files whose title dissimilarities are low.

  • Okay, a little bit about the clustering: each document pulls the documents which have the lowest dissimilarity to itself.

  • So for example, say a document A is dissimilar to B by 0.12 and A is dissimilar to C by 0.15, but B is dissimilar to C by 0.05. A won't pull B, B will pull C, so we end up with something looking like

[
  {
    "title": "A",
    "articles": ["A"]
  },
  {
    "title": "B",
    "articles": ["B", "C"]
  }
]
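My reading of that pulling rule as a runnable sketch; this is an interpretation, not the actual SLINK implementation: a document keeps its nearest neighbour only when the relation is mutual, which reproduces the A/B/C example above.

```rust
// Pair up documents whose nearest-neighbour relation is mutual.
fn cluster(names: &[&str], dist: &dyn Fn(usize, usize) -> f64) -> Vec<Vec<String>> {
    let n = names.len();
    // nearest[i] = index of the document closest to i
    let mut nearest = vec![0usize; n];
    for i in 0..n {
        nearest[i] = (0..n)
            .filter(|&j| j != i)
            .min_by(|&a, &b| dist(i, a).partial_cmp(&dist(i, b)).unwrap())
            .unwrap();
    }
    let mut clusters: Vec<Vec<String>> = Vec::new();
    let mut taken = vec![false; n];
    for i in 0..n {
        if taken[i] {
            continue;
        }
        let mut group = vec![names[i].to_string()];
        taken[i] = true;
        let j = nearest[i];
        // i pulls j only if i is also j's nearest document.
        if !taken[j] && nearest[j] == i {
            group.push(names[j].to_string());
            taken[j] = true;
        }
        clusters.push(group);
    }
    clusters
}

fn main() {
    // A=0, B=1, C=2, with the dissimilarities from the example above.
    let d = |i: usize, j: usize| -> f64 {
        match (i.min(j), i.max(j)) {
            (0, 1) => 0.12,
            (0, 2) => 0.15,
            (1, 2) => 0.05,
            _ => 0.0,
        }
    };
    println!("{:?}", cluster(&["A", "B", "C"], &d)); // [["A"], ["B", "C"]]
}
```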

This pulling behaviour:

  1. Prevents the creation of really, really ginormous clusters (like 500 articles), because I've never met a person who reads 500 articles
  2. Pushes similar news into small, meaningful clusters

  • The title is determined by weird stuff, but mainly influenced by the ALEXA_PAGERANK I borrowed from here, so titles for large threads should always be from known sources. Not the best, but it's honest work.

  • Then boom, print that stuff like crazy.

  • Is it fast?

  • Well yes, but actually no.
  • It's compiled with optimizations so it's okay.
  • Jokes, it's pretty fast.

Server

I was partially sane and insane here.

The server chosen was Rocket because of its ease of use, and I defaulted to its async 0.5.0-dev branch.

Async works magic

  1. PUTting of articles

  • File: /src/server/upload.rs
  • The article is preprocessed and the following things are extracted from it: url, title, body and published_time
  • The language and category of the file are then determined, and the file is stored
  • For storing we use sled, a persistent key-value store; the key is the article filename and the value is a protobuf representation of a TDocument (defined in src/server/enums.rs)
  • All dirty buffers in sled are flushed every 5 minutes using a global worker thread pool

  2. DELETEing an article

  • File: /src/server/delete_article.rs
  • Just deletes the article from the database
  • When removing an article, sled returns the old value if it existed, so whether we get a Some(Article) or None determines our return type: the former returns a 204 code and the latter a 404, in compliance with the specifications (see the sled sketch after the note below)

NOTE: When an article is deleted, it is still stored in the clustering index and only gets removed some minutes later, because deleting articles from the index was expensive, and keeping clusters updated in real time is quite frankly mind-numbing (though one of the contestants did manage to do that; check out his medium article).

I wanted to maintain my sanity, so I made a compromise.
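The sled calls behind PUT and DELETE boil down to this; the key/value layout matches the description above (filename to serialized TDocument bytes), with the protobuf serialization stubbed out:

```rust
fn main() -> sled::Result<()> {
    let db = sled::open("articles.db")?;

    // PUT: store the (already serialized) protobuf bytes under the filename.
    let doc_bytes: Vec<u8> = b"serialized TDocument goes here".to_vec();
    db.insert("longticlename.html", doc_bytes)?;

    // DELETE: `remove` returns the old value if the key existed, which is
    // exactly the Some/None that drives the 204-vs-404 response.
    match db.remove("longticlename.html")? {
        Some(_) => println!("existed -> 204"),
        None => println!("missing -> 404"),
    }

    // Flush dirty buffers to disk (done every 5 minutes by a background
    // worker in tgnews).
    db.flush()?;
    Ok(())
}
```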

  3. Clustering and returning

Okay, a request just came in:

PUT /upload/longticlename.html

  1. Detect language.

  • English or Russian? No? Okay, tell the client we can't handle this.

  2. Is it news?

  • No? Tell the client we only speak news.

  3. Categorize it.

  • Low detection confidence? Slowly drop it... slowly...

  4. Passed all the above?

  • Get its Alexa rating, its title, its url, its language and every bit of info needed to present it as a /proto/server_files.proto, and then represent it as such.
  • Add it to the cluster.
  • Asynchronously send it to sled (a KV database) to store it.

Okay, another request came in:

GET /threads/?lang_code='en'&category=science&period=3200

I parse that, and then we:

  1. Look at the category. Do I know it? Ya. Do I have it? Ya. So lemme fetch it.
  2. I have the category cluster; now we pass it to the time filter and see how many articles actually made the time cut (meaning some clusters will be empty).
  3. Remove empty clusters and, for those remaining, sort them by size (not the best metric, but it gets the job done).
  4. Send them back like crazy. A sketch of steps 2 and 3 follows.
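A sketch of that filter-then-sort step under simplified stand-in types (the struct and function names here are invented for the example):

```rust
#[derive(Debug, Clone)]
struct Article {
    title: String,
    published_time: u64, // unix seconds
}

// Drop articles outside the requested period, drop clusters that end up
// empty, then sort the survivors by size, biggest threads first.
fn threads_response(
    mut clusters: Vec<Vec<Article>>,
    now: u64,
    period: u64, // seconds, from ?period=3200
) -> Vec<Vec<Article>> {
    for cluster in clusters.iter_mut() {
        cluster.retain(|a| now.saturating_sub(a.published_time) <= period);
    }
    clusters.retain(|c| !c.is_empty()); // some clusters will be empty
    clusters.sort_by(|a, b| b.len().cmp(&a.len()));
    clusters
}

fn main() {
    let now = 1_000_000;
    let clusters = vec![
        vec![Article { title: "old".into(), published_time: now - 10_000 }],
        vec![
            Article { title: "fresh".into(), published_time: now - 100 },
            Article { title: "fresher".into(), published_time: now - 10 },
        ],
    ];
    // Only the "fresh"/"fresher" cluster survives a 3200-second period.
    println!("{:?}", threads_response(clusters, now, 3200));
}
```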

Okay, a server reload was needed.

Okay, the server is up:

  1. Got any articles in the DBASE?
  2. Read them and put them into threads (sketched below).
  3. Await further commands.
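The reload path amounts to walking every key in the sled database and rebuilding the in-memory threads; `rebuild_thread` below is a stand-in for the deserialize-and-cluster step:

```rust
// Hypothetical stand-in for deserializing the TDocument and re-clustering.
fn rebuild_thread(filename: &[u8], doc_bytes: &[u8]) {
    println!(
        "re-indexing {} ({} bytes)",
        String::from_utf8_lossy(filename),
        doc_bytes.len()
    );
}

fn main() -> sled::Result<()> {
    let db = sled::open("articles.db")?;
    // Iterate every stored (filename, serialized TDocument) pair.
    for entry in db.iter() {
        let (key, value) = entry?;
        rebuild_thread(&key, &value);
    }
    // ...then await further commands.
    Ok(())
}
```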

And I believe that is all.

About

Telegram Data Clustering Competition
