Row Functions

Functions rowall(), rowany(), rowcount(), rowfirst(), rowlast(), rowmax(), rowmean(), rowmin(), rowsd(), rowsum() are functions that aggregate across rows instead of columns and return a single column. These functions are equivalent to pandas aggregation functions with parameter (axis=1).

These functions make it easy to compute rowwise aggregations – for instance, you may want the sum of columns A, B, C and D. You could say: f.A + f.B + f.C + f.D. Rowsum makes it easier – dt.rowsum(f['A':'D']).

rowall, rowany

These work only on boolean expressions – rowall() checks if all the values in the row are True, while rowany() checks if any value in the row is True. It is similar to pandas’ all or any with parameter (axis=1). A single boolean column is returned:

from datatable import dt, f, by df = dt.Frame({'A': [True, True], 'B': [True, False]}) df
AB
bool8bool8
011
110
# rowall : df[:, dt.rowall(f[:])]
C0
bool8
01
10
# rowany : df[:, dt.rowany(f[:])]
C0
bool8
01
11

The single boolean column that is returned can be very handy when filtering in the i section.

Filter for rows where at least one cell is greater than 0:

df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0], 'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0], 'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}) df
abcdef
int8int8int32int8int8int8
0000000
1000001
2000000
3000000
4000000
5005000
6100000
7000000
8000100
9100000
10000000
df[dt.rowany(f[:] > 0), :]
abcdef
int8int8int32int8int8int8
0000001
1005000
2100000
3000100
4100000

Filter for rows where all the cells are 0:

df[dt.rowall(f[:] == 0), :]
abcdef
int8int8int32int8int8int8
0000000
1000000
2000000
3000000
4000000
5000000

Filter for rows where all the columns’ values are the same:

df = dt.Frame("""Name A1 A2 A3 A4 deff 0 0 0 0 def1 0 1 0 0 def2 0 0 0 0 def3 1 0 0 0 def4 0 0 0 0""") # compare the first integer column with the rest, # use rowall to find rows where all is True # and filter with the resulting boolean df[dt.rowall(f[1]==f[1:]), :]
NameA1A2A3A4
str32bool8bool8bool8bool8
0deff0000
1def20000
2def40000

Filter for rows where the values are increasing:

df = dt.Frame({"A": [1, 2, 6, 4], "B": [2, 4, 5, 6], "C": [3, 5, 4, 7], "D": [4, -3, 3, 8], "E": [5, 1, 2, 9]}) df
ABCDE
int32int32int32int32int32
012345
1245-31
265432
346789
df[dt.rowall(f[1:] >= f[:-1]), :]
ABCDE
int32int32int32int32int32
012345
146789

rowfirst, rowlast

These look for the first and last non-missing value in a row respectively:

df = dt.Frame({'A':[1, None, None, None], 'B':[None, 3, 4, None], 'C':[2, None, 5, None]}) df
ABC
int8int32int32
01NA2
1NA3NA
2NA45
3NANANA
# rowfirst : df[:, dt.rowfirst(f[:])]
C0
int32
01
13
24
3NA
# rowlast : df[:, dt.rowlast(f[:])]
C0
int32
02
13
25
3NA

Get rows where the last value in the row is greater than the first value in the row:

df = dt.Frame({'a': [50, 40, 30, 20, 10], 'b': [60, 10, 40, 0, 5], 'c': [40, 30, 20, 30, 40]}) df
abc
int32int32int32
0506040
1401030
2304020
320030
410540
df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
abc
int32int32int32
020030
110540

rowmax, rowmin

These get the maximum and minimum values per row, respectively:

df = dt.Frame({"C": [2, 5, 30, 20, 10], "D": [10, 8, 20, 20, 1]}) df
CD
int32int32
0210
158
23020
32020
4101
# rowmax df[:, dt.rowmax(f[:])]
C0
int32
010
18
230
320
410
# rowmin df[:, dt.rowmin(f[:])]
C0
int32
02
15
220
320
41

Find the difference between the maximum and minimum of each row:

df = dt.Frame("""Value1 Value2 Value3 Value4 5 4 3 2 4 3 2 1 3 3 5 1""") df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))] df
Value1Value2Value3Value4max_min
int32int32int32int32int32
054323
143213
233514

rowsum, rowmean, rowcount, rowsd

rowsum() and rowmean() get the sum and mean of rows respectively; rowcount() counts the number of non-missing values in a row, while rowsd() aggregates a row to get the standard deviation.

Get the count, sum, mean and standard deviation for each row:

df = dt.Frame("""ORD A B C D 198 23 45 NaN 12 138 25 NaN NaN 62 625 52 36 49 35 457 NaN NaN NaN 82 626 52 32 39 45""") df[:, dt.update(rowcount = dt.rowcount(f[:]), rowsum = dt.rowsum(f[:]), rowmean = dt.rowmean(f[:]), rowsd = dt.rowsd(f[:]) )] df
ORDABCDrowcountrowsumrowmeanrowsd
int32float64float64float64int32int32float64float64float64
01982345NA12427869.586.7583
113825NANA6232257557.6108
2625523649355797159.4260.389
3457NANANA822539269.5265.165
4626523239455794158.8261.277

Find rows where the number of nulls is greater than 3:

df = dt.Frame({'city': ["city1", "city2", "city3", "city4"], 'state': ["state1", "state2", "state3", "state4"], '2005': [144, 205, 123, None], '2006': [173, 211, 123, 124], '2007': [None, None, None, None], '2008': [None, 206, None, None], '2009': [None, None, 124, 123], '2010': [128, 273, None, None]}) df
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city1state1144173NANANA128
1city2state2205211NA206NA273
2city3state3123123NANA124NA
3city4state4NA124NANA123NA
# get columns that are null, then sum on the rows # and finally filter where the sum is greater than 3 df[dt.rowsum(dt.isna(f[:])) > 3, :]
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city4state4NA124NANA123NA

Rowwise sum of the float columns:

df = dt.Frame("""ID W_1 W_2 W_3 1 0.1 0.2 0.3 1 0.2 0.4 0.5 2 0.3 0.3 0.2 2 0.1 0.3 0.4 2 0.2 0.0 0.5 1 0.5 0.3 0.2 1 0.4 0.2 0.1""") df[:, dt.update(sum_floats = dt.rowsum(f[float]))] df
IDW_1W_2W_3sum_floats
int32float64float64float64float64
010.10.20.30.6
110.20.40.51.1
220.30.30.20.8
320.10.30.40.8
420.200.50.7
510.50.30.21
610.40.20.10.7

More Examples

Divide columns A, B, C, D by the total column, square it and sum rowwise:

df = dt.Frame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]}) df
ABCDtotal
int32int32int8int8int32
021014
132106
df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))] df
ABCDtotalresult
int32int32int8int8int32float64
0210140.375
1321060.388889

Get the row sum of the COUNT columns:

df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3 A 1 0 1 1 A 2 1 1 2 A 3 3 0 0""") columns = [f[column] for column in df.names if column.startswith("COUNT")] df[:, update(total = dt.rowsum(columns))] df
USEROBSERVATIONCOUNT.1COUNT.2COUNT.3total
str32int32int32bool8int32int32
0A10112
1A21124
2A33003

Sum selected columns rowwise:

df = dt.Frame({'location' : ("a","b","c","d"), 'v1' : (3,4,3,3), 'v2' : (4,56,3,88), 'v3' : (7,6,2,9), 'v4': (7,6,1,9), 'v5' : (4,4,7,9), 'v6' : (2,8,4,6)}) df
locationv1v2v3v4v5v6
str32int32int32int32int32int32int32
0a347742
1b4566648
2c332174
3d3889996
df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
x1x2
int32int32
01413
16618
2812
310024