Row Functions¶

Functions rowall, rowany, rowcount, rowfirst, rowlast, rowmax, rowmean, rowmin, rowsd, rowsum are functions that aggregate across rows instead of columns and return a single column. These functions are equivalent to Pandas aggregation functions with parameter (axis=1).

These functions make it easy to compute rowwise aggregations - for instance, you may want the sum of columns A, B, C and D. You could say : f.A + f.B + f.C + f.D. Rowsum makes it easier - dt.rowsum(f['A':'D']).

Rowall, Rowany¶

These work only on Boolean expressions - rowall checks if all the values in the row are True, while rowany checks if any value in the row is True. It is similar to Pandas’ all or any with a parameter of (axis=1). A single Boolean column is returned.

from datatable import dt, f, by

df = dt.Frame({'A': [True, True], 'B': [True, False]})
df

    A       B
0   1       1
1   1       0

# rowall :
df[:, dt.rowall(f[:])]

    C0
0   1
1   0

# rowany :
df[:, dt.rowany(f[:])]

    C0
0   1
1   1

The single boolean column that is returned can be very handy when filtering in the i section.

Filter for rows where at least one cell is greater than 0

df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
               'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
               'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
               'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})

df

    a       b       c       d       e       f
0   0       0       0       0       0       0
1   0       0       0       0       0       1
2   0       0       0       0       0       0
3   0       0       0       0       0       0
4   0       0       0       0       0       0
5   0       0       5       0       0       0
6   1       0       0       0       0       0
7   0       0       0       0       0       0
8   0       0       0       1       0       0
9   1       0       0       0       0       0
10  0       0       0       0       0       0

df[dt.rowany(f[:] > 0), :]


    a       b       c       d       e       f
0   0       0       0       0       0       1
1   0       0       5       0       0       0
2   1       0       0       0       0       0
3   0       0       0       1       0       0
4   1       0       0       0       0       0

Filter for rows where all the cells are 0

df[dt.rowall(f[:] == 0), :]


    a       b       c       d       e       f
 0       0       0       0       0       0
 0       0       0       0       0       0
 0       0       0       0       0       0
 0       0       0       0       0       0
 0       0       0       0       0       0
 0       0       0       0       0       0

Filter for rows where all the columns’ values are the same

df = dt.Frame("""Name  A1   A2  A3  A4
                 deff  0    0   0   0
                 def1  0    1   0   0
                 def2  0    0   0   0
                 def3  1    0   0   0
                 def4  0    0   0   0""")

# compare the first integer column with the rest,
# use rowall to find rows where all is True
# and filter with the resulting boolean
df[dt.rowall(f[1]==f[1:]), :]

    Name    A1      A2      A3      A4
0   deff    0       0       0       0
1   def2    0       0       0       0
2   def4    0       0       0       0

Filter for rows where the values are increasing

df = dt.Frame({"A": [1, 2, 6, 4],
               "B": [2, 4, 5, 6],
               "C": [3, 5, 4, 7],
               "D": [4, -3, 3, 8],
               "E": [5, 1, 2, 9]})

df

    A       B       C       D       E
0   1       2       3       4       5
1   2       4       5       −3      1
2   6       5       4       3       2
3   4       6       7       8       9

df[dt.rowall(f[1:] >= f[:-1]), :]

    A       B       C       D       E
0   1       2       3       4       5
1   4       6       7       8       9

Rowfirst, Rowlast¶

These look for the first and last non-missing value in a row respectively.

df = dt.Frame({'A':[1, None, None, None],
               'B':[None, 3, 4, None],
               'C':[2, None, 5, None]})
df

    A       B       C
0   1       NA      2
1   NA      3       NA
2   NA      4       5
3   NA      NA      NA

# rowfirst :
df[:, dt.rowfirst(f[:])]

    C0
0   1
1   3
2   4
3   NA

# rowlast :
df[:, dt.rowlast(f[:])]

    C0
0   2
1   3
2   5
3   NA

Get rows where the last value in the row is greater than the first value in the row

df = dt.Frame({'a': [50, 40, 30, 20, 10],
               'b': [60, 10, 40, 0, 5],
               'c': [40, 30, 20, 30, 40]})

df

    a       b       c
0   50      60      40
1   40      10      30
2   30      40      20
3   20      0       30
4   10      5       40

df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]

    a       b       c
0   20      0       30
1   10      5       40

Rowmax, Rowmin¶

These get the maximum and minimum values per row, respectively.

df = dt.Frame({"C": [2, 5, 30, 20, 10],
               "D": [10, 8, 20, 20, 1]})

df

    C       D
0   2       10
1   5       8
2   30      20
3   20      20
4   10      1

# rowmax
df[:, dt.rowmax(f[:])]

    C0
0   10
1   8
2   30
3   20
4   10

# rowmin
df[:, dt.rowmin(f[:])]

    C0
0   2
1   5
2   20
3   20
4   1

Find the difference between the maximum and minimum of each row

df = dt.Frame("""Value1  Value2  Value3  Value4
                    5       4      3        2
                    4       3      2        1
                    3       3      5        1""")

df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
df

    Value1  Value2  Value3  Value4  max_min
        5       4       3       2       3
        4       3       2       1       3
        3       3       5       1       4

Rowsum, Rowmean, Rowcount, Rowsd¶

rowsum and rowmean get the sum and mean of rows respectively; rowcount counts the number of non-missing values in a row, while rowsd aggregates a row to get the standard deviation

Get the count, sum, mean and standard deviation for each row

df = dt.Frame("""ORD  A   B   C    D
                198  23  45  NaN  12
                138  25  NaN NaN  62
                625  52  36  49   35
                457  NaN NaN NaN  82
                626  52  32  39   45""")

df[:, dt.update(rowcount = dt.rowcount(f[:]),
                rowsum = dt.rowsum(f[:]),
                rowmean = dt.rowmean(f[:]),
                rowsd = dt.rowsd(f[:])
                )]

df

    ORD     A       B       C       D       rowcount  rowsum  rowmean  rowsd
0   198     23      45      NA      12          4       278     69.5   86.7583
1   138     25      NA      NA      62          3       225     75     57.6108
2   625     52      36      49      35          5       797     159.4  260.389
3   457     NA      NA      NA      82          2       539     269.5  265.165
4   626     52      32      39      45          5       794     158.8  261.277

Find rows where the number of nulls is greater than 3

df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
               'state': ["state1", "state2", "state3", "state4"],
               '2005': [144, 205, 123, None],
               '2006': [173, 211, 123, 124],
               '2007': [None, None, None, None],
               '2008': [None, 206, None, None],
               '2009': [None, None, 124, 123],
               '2010': [128, 273, None, None]})

df

    city    state   2005    2006    2007    2008    2009    2010
0   city1   state1  144     173     NA      NA      NA      128
1   city2   state2  205     211     NA      206     NA      273
2   city3   state3  123     123     NA      NA      124     NA
3   city4   state4  NA      124     NA      NA      123     NA

# get columns that are null, then sum on the rows
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]

    city    state   2005    2006    2007    2008    2009    2010
0   city4   state4  NA      124     NA      NA      123     NA

Rowwise sum of the float columns

df = dt.Frame("""ID   W_1       W_2     W_3
                 1    0.1       0.2     0.3
                 1    0.2       0.4     0.5
                 2    0.3       0.3     0.2
                 2    0.1       0.3     0.4
                 2    0.2       0.0     0.5
                 1    0.5       0.3     0.2
                 1    0.4       0.2     0.1""")

df[:, dt.update(sum_floats = dt.rowsum(f[float]))]

    ID      W_1     W_2     W_3     sum_floats
0   1       0.1     0.2     0.3     0.6
1   1       0.2     0.4     0.5     1.1
2   2       0.3     0.3     0.2     0.8
3   2       0.1     0.3     0.4     0.8
4   2       0.2     0       0.5     0.7
5   1       0.5     0.3     0.2     1
6   1       0.4     0.2     0.1     0.7

More Examples¶

Divide columns A, B, C, D by the total column, square it and sum rowwise

df = dt.Frame({'A': [2, 3],
               'B': [1, 2],
               'C': [0, 1],
               'D': [1, 0],
               'total': [4, 6]})
df

    A       B       C       D       total
0   2       1       0       1       4
1   3       2       1       0       6

df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
df

    A       B       C       D       total   result
0   2       1       0       1       4       0.375
1   3       2       1       0       6       0.388889

Get the row sum of the COUNT columns

df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3
                    A    1           0       1       1
                    A    2           1       1       2
                    A    3           3       0       0""")

columns = [f[column] for column in df.names if column.startswith("COUNT")]
df[:, update(total = dt.rowsum(columns))]
df
    USER    OBSERVATION     COUNT.1 COUNT.2 COUNT.3 total
0     A         1             0         1       1     2
1     A         2             1         1       2     4
2     A         3             3         0       0     3

Sum selected columns rowwise

df = dt.Frame({'location' : ("a","b","c","d"),
               'v1' : (3,4,3,3),
               'v2' : (4,56,3,88),
               'v3' : (7,6,2,9),
               'v4':  (7,6,1,9),
               'v5' : (4,4,7,9),
               'v6' : (2,8,4,6)})

df
    location        v1      v2      v3      v4      v5      v6
0       a           3       4       7       7       4       2
1       b           4       56      6       6       4       8
2       c           3       3       2       1       7       4
3       d           3       88      9       9       9       6

df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]

    x1      x2
0   14      13
1   66      18
2   8       12
3   100     24