Row Functions¶

Functions rowall(), rowany(), rowcount(), rowfirst(), rowlast(), rowmax(), rowmean(), rowmin(), rowsd(), rowsum() are functions that aggregate across rows instead of columns and return a single column. These functions are equivalent to pandas aggregation functions with parameter (axis=1).

These functions make it easy to compute rowwise aggregations – for instance, you may want the sum of columns A, B, C and D. You could say: f.A + f.B + f.C + f.D. Rowsum makes it easier – dt.rowsum(f['A':'D']).

rowall, rowany¶

These work only on boolean expressions – rowall() checks if all the values in the row are True, while rowany() checks if any value in the row is True. It is similar to pandas’ all or any with parameter (axis=1). A single boolean column is returned:

from datatable import dt, f, by

df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
AB
bool8bool8
011
110
2 rows × 2 columns
# rowall :
df[:, dt.rowall(f[:])]
C0
bool8
01
10
2 rows × 1 column
# rowany :
df[:, dt.rowany(f[:])]
C0
bool8
01
11
2 rows × 1 column

The single boolean column that is returned can be very handy when filtering in the i section.

Filter for rows where at least one cell is greater than 0:

df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
               'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
               'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
               'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
df
abcdef
int8int8int32int8int8int8
0000000
1000001
2000000
3000000
4000000
5005000
6100000
7000000
8000100
9100000
10000000
11 rows × 6 columns
df[dt.rowany(f[:] > 0), :]
abcdef
int8int8int32int8int8int8
0000001
1005000
2100000
3000100
4100000
5 rows × 6 columns

Filter for rows where all the cells are 0:

df[dt.rowall(f[:] == 0), :]
abcdef
int8int8int32int8int8int8
0000000
1000000
2000000
3000000
4000000
5000000
6 rows × 6 columns

Filter for rows where all the columns’ values are the same:

df = dt.Frame("""Name  A1   A2  A3  A4
                 deff  0    0   0   0
                 def1  0    1   0   0
                 def2  0    0   0   0
                 def3  1    0   0   0
                 def4  0    0   0   0""")

# compare the first integer column with the rest,
# use rowall to find rows where all is True
# and filter with the resulting boolean
df[dt.rowall(f[1]==f[1:]), :]
NameA1A2A3A4
str32bool8bool8bool8bool8
0deff0000
1def20000
2def40000
3 rows × 5 columns

Filter for rows where the values are increasing:

df = dt.Frame({"A": [1, 2, 6, 4],
               "B": [2, 4, 5, 6],
               "C": [3, 5, 4, 7],
               "D": [4, -3, 3, 8],
               "E": [5, 1, 2, 9]})
df
ABCDE
int32int32int32int32int32
012345
1245-31
265432
346789
4 rows × 5 columns
df[dt.rowall(f[1:] >= f[:-1]), :]
ABCDE
int32int32int32int32int32
012345
146789
2 rows × 5 columns

rowfirst, rowlast¶

These look for the first and last non-missing value in a row respectively:

df = dt.Frame({'A':[1, None, None, None],
               'B':[None, 3, 4, None],
               'C':[2, None, 5, None]})
df
ABC
int8int32int32
01NA2
1NA3NA
2NA45
3NANANA
4 rows × 3 columns
# rowfirst :
df[:, dt.rowfirst(f[:])]
C0
int32
01
13
24
3NA
4 rows × 1 column
# rowlast :
df[:, dt.rowlast(f[:])]
C0
int32
02
13
25
3NA
4 rows × 1 column

Get rows where the last value in the row is greater than the first value in the row:

df = dt.Frame({'a': [50, 40, 30, 20, 10],
               'b': [60, 10, 40, 0, 5],
               'c': [40, 30, 20, 30, 40]})
df
abc
int32int32int32
0506040
1401030
2304020
320030
410540
5 rows × 3 columns
df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
abc
int32int32int32
020030
110540
2 rows × 3 columns

rowmax, rowmin¶

These get the maximum and minimum values per row, respectively:

df = dt.Frame({"C": [2, 5, 30, 20, 10],
               "D": [10, 8, 20, 20, 1]})
df
CD
int32int32
0210
158
23020
32020
4101
5 rows × 2 columns
# rowmax
df[:, dt.rowmax(f[:])]
C0
int32
010
18
230
320
410
5 rows × 1 column
# rowmin
df[:, dt.rowmin(f[:])]
C0
int32
02
15
220
320
41
5 rows × 1 column

Find the difference between the maximum and minimum of each row:

df = dt.Frame("""Value1  Value2  Value3  Value4
                    5       4      3        2
                    4       3      2        1
                    3       3      5        1""")

df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
df
Value1Value2Value3Value4max_min
int32int32int32int32int32
054323
143213
233514
3 rows × 5 columns

rowsum, rowmean, rowcount, rowsd¶

rowsum() and rowmean() get the sum and mean of rows respectively; rowcount() counts the number of non-missing values in a row, while rowsd() aggregates a row to get the standard deviation.

Get the count, sum, mean and standard deviation for each row:

df = dt.Frame("""ORD  A   B   C    D
                198  23  45  NaN  12
                138  25  NaN NaN  62
                625  52  36  49   35
                457  NaN NaN NaN  82
                626  52  32  39   45""")

df[:, dt.update(rowcount = dt.rowcount(f[:]),
                rowsum = dt.rowsum(f[:]),
                rowmean = dt.rowmean(f[:]),
                rowsd = dt.rowsd(f[:])
                )]
df
ORDABCDrowcountrowsumrowmeanrowsd
int32float64float64float64int32int32float64float64float64
01982345NA12427869.586.7583
113825NANA6232257557.6108
2625523649355797159.4260.389
3457NANANA822539269.5265.165
4626523239455794158.8261.277
5 rows × 9 columns

Find rows where the number of nulls is greater than 3:

df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
               'state': ["state1", "state2", "state3", "state4"],
               '2005': [144, 205, 123, None],
               '2006': [173, 211, 123, 124],
               '2007': [None, None, None, None],
               '2008': [None, 206, None, None],
               '2009': [None, None, 124, 123],
               '2010': [128, 273, None, None]})
df
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city1state1144173NANANA128
1city2state2205211NA206NA273
2city3state3123123NANA124NA
3city4state4NA124NANA123NA
4 rows × 8 columns
# get columns that are null, then sum on the rows
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city4state4NA124NANA123NA
1 row × 8 columns

Rowwise sum of the float columns:

df = dt.Frame("""ID   W_1       W_2     W_3
                 1    0.1       0.2     0.3
                 1    0.2       0.4     0.5
                 2    0.3       0.3     0.2
                 2    0.1       0.3     0.4
                 2    0.2       0.0     0.5
                 1    0.5       0.3     0.2
                 1    0.4       0.2     0.1""")

df[:, dt.update(sum_floats = dt.rowsum(f[float]))]
df
IDW_1W_2W_3sum_floats
int32float64float64float64float64
010.10.20.30.6
110.20.40.51.1
220.30.30.20.8
320.10.30.40.8
420.200.50.7
510.50.30.21
610.40.20.10.7
7 rows × 5 columns

More Examples¶

Divide columns A, B, C, D by the total column, square it and sum rowwise:

df = dt.Frame({'A': [2, 3],
               'B': [1, 2],
               'C': [0, 1],
               'D': [1, 0],
               'total': [4, 6]})
df
ABCDtotal
int32int32int8int8int32
021014
132106
2 rows × 5 columns
df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
df
ABCDtotalresult
int32int32int8int8int32float64
0210140.375
1321060.388889
2 rows × 6 columns

Get the row sum of the COUNT columns:

df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3
                    A    1           0       1       1
                    A    2           1       1       2
                    A    3           3       0       0""")

columns = [f[column] for column in df.names if column.startswith("COUNT")]
df[:, update(total = dt.rowsum(columns))]
df
USEROBSERVATIONCOUNT.1COUNT.2COUNT.3total
str32int32int32bool8int32int32
0A10112
1A21124
2A33003
3 rows × 6 columns

Sum selected columns rowwise:

df = dt.Frame({'location' : ("a","b","c","d"),
               'v1' : (3,4,3,3),
               'v2' : (4,56,3,88),
               'v3' : (7,6,2,9),
               'v4':  (7,6,1,9),
               'v5' : (4,4,7,9),
               'v6' : (2,8,4,6)})

df
locationv1v2v3v4v5v6
str32int32int32int32int32int32int32
0a347742
1b4566648
2c332174
3d3889996
4 rows × 7 columns
df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
x1x2
int32int32
01413
16618
2812
310024
4 rows × 2 columns

	a	b	c	d	e	f
	int8	int8	int32	int8	int8	int8
0	0	0	0	0	0	0
1	0	0	0	0	0	1
2	0	0	0	0	0	0
3	0	0	0	0	0	0
4	0	0	0	0	0	0
5	0	0	5	0	0	0
6	1	0	0	0	0	0
7	0	0	0	0	0	0
8	0	0	0	1	0	0
9	1	0	0	0	0	0
10	0	0	0	0	0	0

	ORD	A	B	C	D	rowcount	rowsum	rowmean	rowsd
	int32	float64	float64	float64	int32	int32	float64	float64	float64
0	198	23	45	NA	12	4	278	69.5	86.7583
1	138	25	NA	NA	62	3	225	75	57.6108
2	625	52	36	49	35	5	797	159.4	260.389
3	457	NA	NA	NA	82	2	539	269.5	265.165
4	626	52	32	39	45	5	794	158.8	261.277

	city	state	2005	2006	2007	2008	2009	2010
	str32	str32	int32	int32	void	int32	int32	int32
0	city1	state1	144	173	NA	NA	NA	128
1	city2	state2	205	211	NA	206	NA	273
2	city3	state3	123	123	NA	NA	124	NA
3	city4	state4	NA	124	NA	NA	123	NA