Row Functions¶
rowall
, rowany
, rowcount
, rowfirst
, rowlast
, rowmax
, rowmean
, rowmin
, rowsd
, rowsum
are functions that aggregate across rows instead of columns and return a single column. These functions are equivalent to Pandas aggregation functions with parameter (axis=1)
.f.A + f.B + f.C + f.D
. Rowsum makes it easier - dt.rowsum(f['A':'D'])
.Rowall, Rowany¶
These work only on Boolean expressions - rowall
checks if all the values in the row are True
, while rowany
checks if any value in the row is True. It is similar to Pandas’ all or any with a parameter of (axis=1)
. A single Boolean column is returned.
from datatable import dt, f, by
df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
A B
0 1 1
1 1 0
# rowall :
df[:, dt.rowall(f[:])]
C0
0 1
1 0
# rowany :
df[:, dt.rowany(f[:])]
C0
0 1
1 1
The single boolean column that is returned can be very handy when filtering in the i
section.
Filter for rows where at least one cell is greater than 0
df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0], 'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0], 'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}) df a b c d e f 0 0 0 0 0 0 0 1 0 0 0 0 0 1 2 0 0 0 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 0 5 0 0 5 0 0 0 6 1 0 0 0 0 0 7 0 0 0 0 0 0 8 0 0 0 1 0 0 9 1 0 0 0 0 0 10 0 0 0 0 0 0 df[dt.rowany(f[:] > 0), :] a b c d e f 0 0 0 0 0 0 1 1 0 0 5 0 0 0 2 1 0 0 0 0 0 3 0 0 0 1 0 0 4 1 0 0 0 0 0
Filter for rows where all the cells are 0
df[dt.rowall(f[:] == 0), :] a b c d e f 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 0 5 0 0 0 0 0 0
Filter for rows where all the columns’ values are the same
df = dt.Frame("""Name A1 A2 A3 A4 deff 0 0 0 0 def1 0 1 0 0 def2 0 0 0 0 def3 1 0 0 0 def4 0 0 0 0""") # compare the first integer column with the rest, # use rowall to find rows where all is True # and filter with the resulting boolean df[dt.rowall(f[1]==f[1:]), :] Name A1 A2 A3 A4 0 deff 0 0 0 0 1 def2 0 0 0 0 2 def4 0 0 0 0
Filter for rows where the values are increasing
df = dt.Frame({"A": [1, 2, 6, 4], "B": [2, 4, 5, 6], "C": [3, 5, 4, 7], "D": [4, -3, 3, 8], "E": [5, 1, 2, 9]}) df A B C D E 0 1 2 3 4 5 1 2 4 5 −3 1 2 6 5 4 3 2 3 4 6 7 8 9 df[dt.rowall(f[1:] >= f[:-1]), :] A B C D E 0 1 2 3 4 5 1 4 6 7 8 9
Rowfirst, Rowlast¶
These look for the first and last non-missing value in a row respectively.
df = dt.Frame({'A':[1, None, None, None],
'B':[None, 3, 4, None],
'C':[2, None, 5, None]})
df
A B C
0 1 NA 2
1 NA 3 NA
2 NA 4 5
3 NA NA NA
# rowfirst :
df[:, dt.rowfirst(f[:])]
C0
0 1
1 3
2 4
3 NA
# rowlast :
df[:, dt.rowlast(f[:])]
C0
0 2
1 3
2 5
3 NA
Get rows where the last value in the row is greater than the first value in the row
df = dt.Frame({'a': [50, 40, 30, 20, 10], 'b': [60, 10, 40, 0, 5], 'c': [40, 30, 20, 30, 40]}) df a b c 0 50 60 40 1 40 10 30 2 30 40 20 3 20 0 30 4 10 5 40 df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :] a b c 0 20 0 30 1 10 5 40
Rowmax, Rowmin¶
These get the maximum and minimum values per row, respectively.
df = dt.Frame({"C": [2, 5, 30, 20, 10],
"D": [10, 8, 20, 20, 1]})
df
C D
0 2 10
1 5 8
2 30 20
3 20 20
4 10 1
# rowmax
df[:, dt.rowmax(f[:])]
C0
0 10
1 8
2 30
3 20
4 10
# rowmin
df[:, dt.rowmin(f[:])]
C0
0 2
1 5
2 20
3 20
4 1
Find the difference between the maximum and minimum of each row
df = dt.Frame("""Value1 Value2 Value3 Value4 5 4 3 2 4 3 2 1 3 3 5 1""") df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))] df Value1 Value2 Value3 Value4 max_min 5 4 3 2 3 4 3 2 1 3 3 3 5 1 4
Rowsum, Rowmean, Rowcount, Rowsd¶
rowsum
and rowmean
get the sum and mean of rows respectively; rowcount
counts the number of non-missing values in a row, while rowsd
aggregates a row to get the standard deviation
Get the count, sum, mean and standard deviation for each row
df = dt.Frame("""ORD A B C D 198 23 45 NaN 12 138 25 NaN NaN 62 625 52 36 49 35 457 NaN NaN NaN 82 626 52 32 39 45""") df[:, dt.update(rowcount = dt.rowcount(f[:]), rowsum = dt.rowsum(f[:]), rowmean = dt.rowmean(f[:]), rowsd = dt.rowsd(f[:]) )] df ORD A B C D rowcount rowsum rowmean rowsd 0 198 23 45 NA 12 4 278 69.5 86.7583 1 138 25 NA NA 62 3 225 75 57.6108 2 625 52 36 49 35 5 797 159.4 260.389 3 457 NA NA NA 82 2 539 269.5 265.165 4 626 52 32 39 45 5 794 158.8 261.277
Find rows where the number of nulls is greater than 3
df = dt.Frame({'city': ["city1", "city2", "city3", "city4"], 'state': ["state1", "state2", "state3", "state4"], '2005': [144, 205, 123, None], '2006': [173, 211, 123, 124], '2007': [None, None, None, None], '2008': [None, 206, None, None], '2009': [None, None, 124, 123], '2010': [128, 273, None, None]}) df city state 2005 2006 2007 2008 2009 2010 0 city1 state1 144 173 NA NA NA 128 1 city2 state2 205 211 NA 206 NA 273 2 city3 state3 123 123 NA NA 124 NA 3 city4 state4 NA 124 NA NA 123 NA # get columns that are null, then sum on the rows # and finally filter where the sum is greater than 3 df[dt.rowsum(dt.isna(f[:])) > 3, :] city state 2005 2006 2007 2008 2009 2010 0 city4 state4 NA 124 NA NA 123 NA
Rowwise sum of the float columns
df = dt.Frame("""ID W_1 W_2 W_3 1 0.1 0.2 0.3 1 0.2 0.4 0.5 2 0.3 0.3 0.2 2 0.1 0.3 0.4 2 0.2 0.0 0.5 1 0.5 0.3 0.2 1 0.4 0.2 0.1""") df[:, dt.update(sum_floats = dt.rowsum(f[float]))] ID W_1 W_2 W_3 sum_floats 0 1 0.1 0.2 0.3 0.6 1 1 0.2 0.4 0.5 1.1 2 2 0.3 0.3 0.2 0.8 3 2 0.1 0.3 0.4 0.8 4 2 0.2 0 0.5 0.7 5 1 0.5 0.3 0.2 1 6 1 0.4 0.2 0.1 0.7
More Examples¶
Divide columns
A
,B
,C
,D
by the total column, square it and sum rowwisedf = dt.Frame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]}) df A B C D total 0 2 1 0 1 4 1 3 2 1 0 6 df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))] df A B C D total result 0 2 1 0 1 4 0.375 1 3 2 1 0 6 0.388889
Get the row sum of the
COUNT
columnsdf = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3 A 1 0 1 1 A 2 1 1 2 A 3 3 0 0""") columns = [f[column] for column in df.names if column.startswith("COUNT")] df[:, update(total = dt.rowsum(columns))] df USER OBSERVATION COUNT.1 COUNT.2 COUNT.3 total 0 A 1 0 1 1 2 1 A 2 1 1 2 4 2 A 3 3 0 0 3
Sum selected columns rowwise
df = dt.Frame({'location' : ("a","b","c","d"), 'v1' : (3,4,3,3), 'v2' : (4,56,3,88), 'v3' : (7,6,2,9), 'v4': (7,6,1,9), 'v5' : (4,4,7,9), 'v6' : (2,8,4,6)}) df location v1 v2 v3 v4 v5 v6 0 a 3 4 7 7 4 2 1 b 4 56 6 6 4 8 2 c 3 3 2 1 7 4 3 d 3 88 9 9 9 6 df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}] x1 x2 0 14 13 1 66 18 2 8 12 3 100 24