Feature request: output <table> elements as tables #17
Comments
That's an interesting idea. I have been using a bookmarklet to extract tables from webpages as CSV files. The default output format for tq is still plain text, but I am pretty sure that most people who script tq use the JSON options, as they provide the ability to properly consume the extracted data, for example with jq.
We could define a flag to enable table parsing and throw an error if the selected element is not a table. The output could be JSON lines by default: a JSON array of strings per line. This would be the bare minimum functionality, as one can always pipe to another command for formatting. jq can do this, for example: https://stackoverflow.com/questions/39139107/how-to-format-a-json-string-as-a-table-using-jq Although that might be too intricate. An extra switch to format the output can also be added. |
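A minimal sketch of the "one JSON array of strings per line" output described above. It is not tq's code: to stay dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, and the names `TableRows` and `table_to_json_lines` are hypothetical. It assumes a single, non-nested table.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell is a single clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_json_lines(html):
    """Return the table as JSON lines: one JSON array of strings per row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in parser.rows)


sample = "<table><tr><th>Team</th><th>P</th></tr><tr><td>Team A</td><td>80</td></tr></table>"
print(table_to_json_lines(sample))
```

Each output line can then be consumed independently by jq or any other JSON-aware tool, which is the point of choosing JSON lines over a single large document.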
I checked prettytable. What a nice little library! No dependencies. To the point. Posting the link in here for future me :) |
Ahaha, maybe we have been using the same bookmarklet ^^ Glad you like the idea :) |
Ok, let's get the ball rolling on this one.

UI
Activate this feature with a CLI flag. Perhaps ''-T'' or ''--table''. I'm tempted to support only the latter, as ''-t'' is already taken for text, and having ''-T'' and ''-t'' do two very different things may be difficult to remember. Although there's already ''-J'' and ''-j''.
It might be useful to include an extra flag to omit the table headers.

Behavior
Is there any use in being able to select the HTML inside table cells? I am not sure many people, if any, would have a use for this. Supporting innerText only feels like the way to go. Which is to say that, if we do choose ''-T'', it would imply ''-Tt''.
Select just one table element. If more than one matches, pick the first. This is not how tq behaves otherwise, but I don't see much of a use case for extracting many tables at once. Now that there is support for fancier CSS selectors, it's possible to use for example ''nth-match'' to get the desired table.
How strict should we be with selection?

Output
This is the trickiest. Pipeline composability is an important goal. This is a command line tool in the tradition of classic Unix principles. ASCII art formats such as the one from pretty-table are suboptimal for this purpose, so I think that they should not be the default. Perhaps a format that outputs the text of each row on a single line could be useful for simple tables that contain numerical data, in the sense that they can be easily processed with awk or similar. But JSON lines feels like the least brittle in my opinion.

What do you think @Lucas-C ? |
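To make the flag proposal concrete, here is a rough `argparse` sketch. It is only an illustration under assumptions, not tq's actual argument handling: `-T`/`--table` is the flag proposed above, `--no-header` is a hypothetical name for the "omit the table headers" option, and `build_parser` and `apply_implications` are invented helper names.

```python
import argparse


def build_parser():
    """Build a toy parser mirroring the flags discussed in this thread."""
    parser = argparse.ArgumentParser(prog="tq-sketch")
    parser.add_argument("selector", help="CSS selector to match")
    parser.add_argument("-t", "--text", action="store_true",
                        help="output the text of the matched elements")
    parser.add_argument("-T", "--table", action="store_true",
                        help="proposed: treat the match as a table")
    # Hypothetical flag name; the thread only says a header switch might be useful.
    parser.add_argument("--no-header", action="store_true",
                        help="hypothetical: omit the table header row")
    return parser


def apply_implications(args):
    """Table mode works on cell text only, so -T implies -t."""
    if args.table:
        args.text = True
    return args


args = apply_implications(build_parser().parse_args(["-T", "table.standings"]))
print(args.table, args.text, args.no_header)
```

Since argparse option strings are case sensitive, `-t` and `-T` can coexist technically; whether that is easy to remember is the open question raised above.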
IMHO:
|
Examples of possible test data: Premier League standings. Dow Jones top movers. Plenty of per-country data on GDP. Wikipedia: plenty of clean tables. |
Example of using lynx to render HTML:
$ curl -s "https://www.espn.com/soccer/table/_/league/eng.1" | tq ".Table__Scroller > table:nth-child(1)" | lynx -stdin -dump
[1]GP [2]W [3]D [4]L [5]F [6]A [7]GD [8]P
34 25 5 4 71 24 +47 80
33 19 10 4 64 35 +29 67
34 19 6 9 61 39 +22 63
34 17 10 7 53 31 +22 61
34 17 7 10 55 44 +11 58
34 16 8 10 60 38 +22 56
33 15 9 9 55 39 +16 54
33 15 7 11 45 42 +3 52
34 14 7 13 46 37 +9 49
33 14 6 13 48 38 +10 48
34 14 5 15 50 52 -2 47
34 11 9 14 33 46 -13 42
33 10 8 15 34 56 -22 38
34 8 13 13 35 39 -4 37
33 10 7 16 41 59 -18 37
34 9 9 16 31 47 -16 36
34 9 9 16 36 56 -20 36
34 5 12 17 25 45 -20 27
34 5 11 18 31 65 -34 26
34 5 2 27 18 60 -42 17
References
1. file:///soccer/standings/_/league/ENG.1/sort/gamesplayed/dir/desc
2. file:///soccer/standings/_/league/ENG.1/sort/wins/dir/desc
3. file:///soccer/standings/_/league/ENG.1/sort/ties/dir/desc
4. file:///soccer/standings/_/league/ENG.1/sort/losses/dir/asc
5. file:///soccer/standings/_/league/ENG.1/sort/pointsfor/dir/desc
6. file:///soccer/standings/_/league/ENG.1/sort/pointsagainst/dir/asc
7. file:///soccer/standings/_/league/ENG.1/sort/pointdifferential/dir/desc
8. file:///soccer/standings/_/league/ENG.1/sort/points/dir/desc
I think this is good enough for my use case. |
Yes, such usage has always worked, but it is brittle if the data contains
spaces or, generally speaking, whatever is used as a separator.
What we are discussing is treating tables as a special case so they can be
output in a reliably parseable format. Specifically JSON.
The HTML is already parsed with BeautifulSoup, so we have access to the data
element by element.
I forgot about this ticket. I guess I haven't needed this lately.
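The brittleness mentioned above can be shown in a few lines. This is a sketch with a made-up row, not real standings data, and the four helper names are hypothetical: a cell that contains the separator breaks a whitespace-delimited format, while a JSON array survives the round trip.

```python
import json


def encode_plain(row):
    """Join cells with a space, as a plain-text dump would."""
    return " ".join(row)


def decode_plain(line):
    """Split on whitespace, as awk does by default."""
    return line.split()


def encode_json(row):
    """Serialize one row as a JSON array of strings."""
    return json.dumps(row, ensure_ascii=False)


def decode_json(line):
    """Parse one JSON line back into a list of strings."""
    return json.loads(line)


row = ["Team A", "34", "+47"]  # made-up row; the first cell contains a space

print(decode_plain(encode_plain(row)))  # the first cell is split into two fields
print(decode_json(encode_json(row)))    # the row comes back unchanged
```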
|
What do you think of the idea?
Maybe this could be enabled through a CLI flag.
It could be done relatively easily using tabulate, PrettyTable or Pylsy.