[html rowspan] Rowspan is not handled. #1308

Open
frosencrantz opened this issue Feb 27, 2022 · 11 comments

Comments

@frosencrantz
Contributor

Small description
HTML table loading doesn't handle rowspan properly.

Expected result
The data in the rowspan column is duplicated on the spanned rows.

Actual result with screenshot

https://asciinema.org/a/qotdqplkmpKJxPUQUC5kaGHEd

Animation shows html files, how w3m renders the data, and how Visidata shows data.
The columns with the errors have this exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/visidata/wrappers.py", line 108, in wrapply
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 273, in getValue
    return self.calcValue(row)
  File "/usr/lib/python3.9/site-packages/visidata/column.py", line 235, in calcValue
    return (self.getter)(self, row)
  File "/usr/lib/python3.9/site-packages/visidata/loaders/html.py", line 103, in <lambda>
    self.addColumn(Column(name, getter=lambda c,r,i=colnum: r[i][0]))
IndexError: list index out of range
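
For illustration, the exception can be reproduced without VisiData. This is only a minimal sketch: it assumes each parsed row is a list with one entry per `<td>` actually present in that `<tr>`, and that the first element of each entry is the cell text, as the `r[i][0]` getter in the traceback suggests. The `rows` data, `make_getter`, and `render` are hypothetical names made up to mirror 3x3-rowspan.html, not code from the loader.

```python
# Hypothetical stand-in for the rows built from 3x3-rowspan.html:
# one entry per <td> actually present in each <tr>.
rows = [
    [("1.1",), ("1.2",), ("1.3",)],  # first <tr>: three cells
    [("2.2",), ("2.3",)],            # second <tr>: no cell for the spanned column
    [("3.2",), ("3.3",)],            # third <tr>: no cell for the spanned column
]

def make_getter(colnum):
    # Same shape as the lambda in the traceback: index the row by column number.
    return lambda row, i=colnum: row[i][0]

getters = [make_getter(i) for i in range(3)]

def render(row):
    """Evaluate every column getter, recording exceptions instead of raising."""
    out = []
    for get in getters:
        try:
            out.append(get(row))
        except IndexError as e:
            out.append("!" + type(e).__name__)
    return out

table = [render(row) for row in rows]
for line in table:
    print(line)
```

Under these assumptions, the short rows show their values shifted one column to the left, and the last column raises IndexError because the row has only two entries.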

Steps to reproduce with sample data and a .vd

I would expect the data in columns 2 and 3 to be the same for both of these tables.

Regular 3x3.html (works)

<table >
        <tbody>
                <tr>
                        <td>1.1 </td>
                        <td>1.2 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.1 </td>
                        <td> 2.2 </td>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.1 </td>
                        <td> 3.2 </td>
                        <td> 3.3 </td>
                </tr>
        </tbody>
</table>

With row span 3x3-rowspan.html (breaks):

<table >
        <tbody>
                <tr>
                        <td rowspan=3>1.1 </td>
                        <td>1.2 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.2 </td>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.2 </td>
                        <td> 3.3 </td>
                </tr>
        </tbody>
</table>
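
The expected result (the spanned value duplicated on each spanned row) could be sketched roughly as follows. This is not how the html loader works; it assumes the cells have already been extracted as hypothetical `(text, rowspan)` pairs, and `expand_rowspans` is an illustrative name.

```python
def expand_rowspans(rows):
    """Expand rows of (text, rowspan) cells into a rectangular grid,
    copying each spanning cell's text into every row it covers."""
    grid = []
    pending = {}  # column index -> (text, number of rows still to fill)
    for cells in rows:
        out = []
        queue = list(cells)
        col = 0
        # Keep going while this row has cells left or a span reaches this far right.
        while queue or any(c >= col for c in pending):
            if col in pending:
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
            elif queue:
                text, rowspan = queue.pop(0)
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            else:
                out.append(None)  # gap: no cell and no span covers this column
            col += 1
        grid.append(out)
    return grid

# Cells of 3x3-rowspan.html, with whitespace stripped.
rowspan_rows = [
    [("1.1", 3), ("1.2", 1), ("1.3", 1)],
    [("2.2", 1), ("2.3", 1)],
    [("3.2", 1), ("3.3", 1)],
]
expanded = expand_rowspans(rowspan_rows)
for row in expanded:
    print(row)
```

With this expansion, columns 2 and 3 line up with the regular 3x3 table, and column 1 repeats 1.1 on all three rows.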

Additional context
VisiData version: latest from the develop branch.

I have data sources I try to use with Visidata that make use of rowspan to format html tables.

There is code in the html loader to handle rowspan for column headers.

@frosencrantz
Contributor Author

I realized later that the animation was missing the w3m output. Maybe next week I can show that. Still, I think the Visidata output shows that in rows with no cell of their own in the rowspan column, the values are shifted left.

@frosencrantz
Contributor Author

Here is the w3m-only screenshot. The first one shows the table with rowspan, and the second example without rowspan.

https://asciinema.org/a/SE31VbT4U156d9s1VBFJO7Rxs

That first column that uses rowspan should touch all rows that are spanned, not just the first row.

@anjakefala
Collaborator

Thank you for providing the two nearly identical sets of sample data, one with rowspan and one without. It helps with seeing the problem much more clearly!

@anjakefala
Collaborator

Rowspans used to be at least partially handled! I think they were not handled in the way that is expected, and the logic needs to be adjusted.

But they were not resulting in an Exception. This is the change where the Cell-Exceptions started: 8a663b8

@anjakefala
Collaborator

anjakefala commented Mar 12, 2022

One thing to note:

VisiData expects that the rowspan attribute is in a <th> tag! Is rowspan being in the first <td> (instead of there being an explicit <th>) a realistic scenario?

I.e. this is what VisiData is expecting for rowspan:

                 <tr>
                         <th rowspan=3>1.1 </th>
                         <th>1.2 </th>
                         <th>1.3 </th>
                 </tr>

Edit: It seems like rowspan could be expected in <td>. So this issue has two parts:

  • Do we want to handle the rowspan scenario for <td>, and if so, what will that look like?
  • rowspan in <th> does the right structuring in the Sheet, but still ends up with within-cell Exceptions that we will need to handle.

@frosencrantz
Contributor Author

I'm not sure if you are asking me. I do think the rowspan should be handled for <td> cells, since it has implications for how other rows/cols are aligned. Compared to how browsers present the data, Visidata incorrectly displays the data, and misses the intended structure.

@frosencrantz
Contributor Author

Hi @anjakefala

I looked a little more at this issue. It reminds me how odd html tables are as a data format. And when I look for live examples, I find worse examples.

Here is a simple example file I created that shows the difference between header and data rows. The html is basically the same for the header as for the body (except for swapping th/td and thead/tbody).

You can see that w3m formats them the same, but visidata has a different view. For the header rows, it looks like visidata is doing the expected thing by flattening the values into one header row. For the body rows it shows this bug where colspan/rowspan are ignored.

https://asciinema.org/a/7GI0SKWYPecD8hcq1utN6RrPU

<table border>
        <thead>
                <tr>
                        <th rowspan=2 colspan=2>1.1 </th>
                        <th>1.3 </th>
                </tr>
                <tr>
                        <th> 2.3 </th>
                </tr>
                <tr>
                        <th> 3.1 </th>
                        <th colspan=2> 3.2 </th>
                </tr>

        </thead>
        <tbody>
                <tr>
                        <td rowspan=2 colspan=2>1.1 </td>
                        <td>1.3 </td>
                </tr>
                <tr>
                        <td> 2.3 </td>
                </tr>
                <tr>
                        <td> 3.1 </td>
                        <td colspan=2> 3.2 </td>
                </tr>
        </tbody>
</table>
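
As a rough sketch of how body rows with both colspan and rowspan could be flattened into a rectangular grid, the following uses only the standard library `html.parser`. It is not the VisiData loader. `CellCollector` and `to_grid` are hypothetical names, span values are assumed to be plain integers, and, as a simplification, spans are not clipped at the end of the table or row group.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan, colspan) tuples."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None

def to_grid(rows):
    """Copy every cell into all the slots it spans; uncovered slots stay None."""
    slots = {}
    for y, cells in enumerate(rows):
        x = 0
        for text, rowspan, colspan in cells:
            while (x, y) in slots:  # skip slots already claimed by earlier spans
                x += 1
            for dy in range(rowspan):
                for dx in range(colspan):
                    slots[(x + dx, y + dy)] = text
            x += colspan
    width = 1 + max((sx for sx, _ in slots), default=-1)
    height = 1 + max((sy for _, sy in slots), default=-1)
    return [[slots.get((sx, sy)) for sx in range(width)] for sy in range(height)]

# The <tbody> rows from the example above.
body_html = """
<tr><td rowspan=2 colspan=2>1.1 </td><td>1.3 </td></tr>
<tr><td> 2.3 </td></tr>
<tr><td> 3.1 </td><td colspan=2> 3.2 </td></tr>
"""
collector = CellCollector()
collector.feed(body_html)
body_grid = to_grid(collector.rows)
for row in body_grid:
    print(row)
```

Duplicating the text into every spanned slot is one possible design choice. Leaving the extra slots empty would be another, and which is preferable for a Sheet is an open question in this issue.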

@frosencrantz
Contributor Author

FYI: I found a tool that claims to handle reading tables with a colspan/rowspan: https://github.com/rocheio/wiki-table-scrape

It works if you have only one of the types of spans, but my simple tests suggested it doesn't properly handle both types of spans for the same cell. It looks like it misses the lower right corner of a colspan=2 rowspan=2 cell.

@frosencrantz
Contributor Author

VisiData has an alternate way to read html with pandas, so I tried that, but I found a new bug: #1986

The pandas read_html function returns a list of DataFrames, while the other read_* functions return a single DataFrame.

@frosencrantz
Contributor Author

The pandas reader also seems to have issues with some of the tables I want to be able to read.

Here is a deep dive of how to parse html tables including algorithms:

https://html.spec.whatwg.org/multipage/tables.html#table-processing-model

@frosencrantz
Contributor Author

One takeaway from reading this is that a table is modeled as a 2-D grid of slots, very much like VisiData. Some slots can be empty, or they can be occupied by one or more cells (e.g. TD/TH). A cell occupies the slot where it is first placed, and may occupy more, but only to the right and down, because of colspan/rowspan:

A table consists of cells aligned on a two-dimensional grid of slots with coordinates (x, y). The grid is finite, and is either empty or has one or more slots. If the grid has one or more slots, then the x coordinates are always in the range 0 ≤ x < xwidth, and the y coordinates are always in the range 0 ≤ y < yheight. If one or both of xwidth and yheight are zero, then the table is empty (has no slots). Tables correspond to table elements.
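
The slot model in the quoted paragraph can be sketched as a small data structure. This is a toy reading of that paragraph only, not the spec's full table-forming algorithm. `SlotGrid`, `place`, and `cells_at` are hypothetical names, and the caller is assumed to supply each cell's anchor coordinates.

```python
from collections import defaultdict

class SlotGrid:
    """Toy model of the slot grid: (x, y) -> list of cells covering that slot."""
    def __init__(self):
        self.slots = defaultdict(list)
        self.xwidth = 0   # valid x coordinates are 0 <= x < xwidth
        self.yheight = 0  # valid y coordinates are 0 <= y < yheight

    def place(self, cell, x, y, colspan=1, rowspan=1):
        """Anchor `cell` at (x, y) and extend it to the right and down only."""
        for sy in range(y, y + rowspan):
            for sx in range(x, x + colspan):
                self.slots[(sx, sy)].append(cell)
        self.xwidth = max(self.xwidth, x + colspan)
        self.yheight = max(self.yheight, y + rowspan)

    def cells_at(self, x, y):
        """Return the cells covering slot (x, y); an empty list is an empty slot."""
        if not (0 <= x < self.xwidth and 0 <= y < self.yheight):
            raise IndexError((x, y))
        return list(self.slots.get((x, y), []))

# The body rows of the colspan/rowspan example earlier in this thread.
slot_grid = SlotGrid()
slot_grid.place("1.1", 0, 0, colspan=2, rowspan=2)
slot_grid.place("1.3", 2, 0)
slot_grid.place("2.3", 2, 1)
slot_grid.place("3.1", 0, 2)
slot_grid.place("3.2", 1, 2, colspan=2)
print(slot_grid.xwidth, slot_grid.yheight)
print(slot_grid.cells_at(1, 1))
```

Storing a list per slot keeps room for the case where more than one cell covers the same slot. A grid with no cells placed has xwidth and yheight of zero, matching the empty table in the quote.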
