Processing Watchlist Internet with Miller #1266

T145 · 2023-04-14T21:41:02Z


I'm trying to process this document with Miller: https://www.watchlist-internet.at/index.php?id=120
I have the following command so far:

mlr --csv --ifs ';' -N cut -f 1 watchlist

This works, but the list contains some odd records (edge cases) that need to be handled, like these:

andleben-leymann.de#http://landleben-leymann.de
b-elektronikfirma.com/index.php?
dailymetcon.it/M
www.vidamon.at/
www.wasserspender-deutschland.de/index.html

In the first example, the URL has a unique domain I'd like to include in the output. As for the others, simply splitting on "/" would be sufficient. But how specifically can I handle the edge cases in the same Miller command?

aborruso · 2023-04-15T07:21:43Z


I think this is less a Miller task than a regular-expression task. Just as an example, if you run

echo "andleben-leymann.de#http://landleben-leymann.de
b-elektronikfirma.com/index.php?
dailymetcon.it/M
www.vidamon.at/
www.wasserspender-deutschland.de/index.html" | grep -oE '\b(https?://|www\.)\S+'

you get

http://landleben-leymann.de
www.vidamon.at/
www.wasserspender-deutschland.de/index.html

So you mostly need to find the best regex, or a set of regexes, to apply.
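As a rough sketch of what such a set of regexes could look like, the following chains three sed substitutions to reduce each entry to a bare host name. The function name normalize_hosts is made up for illustration, and the choice to keep only the part after a "#" is an assumption about what the original poster wants for the first edge case.

```sh
#!/bin/sh
# Hypothetical helper (name not from the thread): reduce each line to a host name.
# Assumption: when a line contains "#", only the part after it is kept.
normalize_hosts() {
  # 1) drop everything up to and including the last "#"
  # 2) drop a leading http:// or https:// scheme
  # 3) drop the first "/" and everything after it
  sed -E \
    -e 's/^.*#//' \
    -e 's#^https?://##' \
    -e 's#/.*$##'
}

printf '%s\n' \
  'andleben-leymann.de#http://landleben-leymann.de' \
  'b-elektronikfirma.com/index.php?' \
  'dailymetcon.it/M' \
  'www.vidamon.at/' \
  'www.wasserspender-deutschland.de/index.html' | normalize_hosts
```

One limitation: the first substitution is greedy, so an entry such as example.com/page#section would be reduced to "section". If the real list contains fragments like that, the pattern needs tightening.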

In Miller, for example, starting from this input

"1-proof.cyou";02.03.2023;"Finanzbetrug"
"10xcryptotrade.uk";22.11.2022;"Finanzbetrug"
"11trikots.com";14.12.2021;"Betrügerische Shops"
"andleben-leymann.de#http://landleben-leymann.de";17.07.2020;"Betrügerische Shops"
"b-elektronikfirma.com/index.php?";17.07.2020;"Betrügerische Shops"
"dailymetcon.it/M";21.02.2019;"Betrügerische Shops"
"www.vidamon.at/";17.07.2020;"Betrügerische Shops"
"www.wasserspender-deutschland.de/index.html";17.07.2020;"Betrügerische Shops"

you could run

mlr --csv --implicit-csv-header --ifs ";" put '$extract=regextract($1,"\b(https?://|www\.)\S+")' then unsparsify input.csv

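to get the following (shown here pretty-printed; as written the command emits CSV, and replacing --csv with --icsv --opprint --barred produces this table view)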

+-------------------------------------------------+------------+---------------------+---------------------------------------------+
| 1                                               | 2          | 3                   | extract                                     |
+-------------------------------------------------+------------+---------------------+---------------------------------------------+
| 1-proof.cyou                                    | 02.03.2023 | Finanzbetrug        |                                             |
| 10xcryptotrade.uk                               | 22.11.2022 | Finanzbetrug        |                                             |
| 11trikots.com                                   | 14.12.2021 | Betrügerische Shops |                                             |
| andleben-leymann.de#http://landleben-leymann.de | 17.07.2020 | Betrügerische Shops | http://landleben-leymann.de                 |
| b-elektronikfirma.com/index.php?                | 17.07.2020 | Betrügerische Shops |                                             |
| dailymetcon.it/M                                | 21.02.2019 | Betrügerische Shops |                                             |
| www.vidamon.at/                                 | 17.07.2020 | Betrügerische Shops | www.vidamon.at/                             |
| www.wasserspender-deutschland.de/index.html     | 17.07.2020 | Betrügerische Shops | www.wasserspender-deutschland.de/index.html |
+-------------------------------------------------+------------+---------------------+---------------------------------------------+
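If the goal is to stay inside a single Miller command, one possible approach is to chain Miller's sub function on the first field before cutting it. This is only a sketch: the wrapper name watchlist_hosts is hypothetical, it assumes Miller 6, and it assumes that only the domain after a "#" should be kept, which may not match what the original poster wants.

```sh
#!/bin/sh
# Hypothetical wrapper (name not from the thread). Assumes Miller 6 and a
# semicolon-separated, headerless file like the one shown in the thread.
# The three sub() calls, in order:
#   1) keep only what follows a "#" (assumption about the desired output)
#   2) drop a leading http:// or https:// scheme
#   3) drop the first "/" and everything after it
watchlist_hosts() {
  mlr --csv --ifs ';' -N put '
    $1 = sub($1, "^.*#", "");
    $1 = sub($1, "^https?://", "");
    $1 = sub($1, "/.*$", "");
  ' then cut -f 1 "$@"
}

# Usage: watchlist_hosts watchlist
```

Run against the sample input above, this should leave plain domains untouched and trim the edge cases down to their host part. The same caveat as with any greedy pattern applies: an entry with a "#" fragment that is not a URL would lose its domain.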
