Row Functions¶
Functions rowall()
, rowany()
, rowcount()
, rowfirst()
,
rowlast()
, rowmax()
, rowmean()
, rowmin()
, rowsd()
,
rowsum()
are functions that aggregate across rows instead of columns and
return a single column. These functions are equivalent to pandas aggregation
functions with parameter (axis=1)
.
These functions make it easy to compute rowwise aggregations – for instance,
you may want the sum of columns A
, B
, C
and D
. You could say:
f.A + f.B + f.C + f.D
. Rowsum makes it easier – dt.rowsum(f['A':'D'])
.
rowall, rowany¶
These work only on boolean expressions – rowall()
checks if all the
values in the row are True
, while rowany()
checks if any value in the
row is True
. It is similar to pandas’ all or any
with parameter (axis=1)
. A single boolean column is returned:
from datatable import dt, f, by
df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
A | B | ||
---|---|---|---|
bool8 | bool8 | ||
0 | 1 | 1 | |
1 | 1 | 0 |
# rowall :
df[:, dt.rowall(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 0 |
# rowany :
df[:, dt.rowany(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 1 |
The single boolean column that is returned can be very handy when filtering in the i
section.
Filter for rows where at least one cell is greater than 0:
df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
df
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | 1 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | |
5 | 0 | 0 | 5 | 0 | 0 | 0 | |
6 | 1 | 0 | 0 | 0 | 0 | 0 | |
7 | 0 | 0 | 0 | 0 | 0 | 0 | |
8 | 0 | 0 | 0 | 1 | 0 | 0 | |
9 | 1 | 0 | 0 | 0 | 0 | 0 | |
10 | 0 | 0 | 0 | 0 | 0 | 0 |
df[dt.rowany(f[:] > 0), :]
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 1 | |
1 | 0 | 0 | 5 | 0 | 0 | 0 | |
2 | 1 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 1 | 0 | 0 | |
4 | 1 | 0 | 0 | 0 | 0 | 0 |
Filter for rows where all the cells are 0:
df[dt.rowall(f[:] == 0), :]
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
Filter for rows where all the columns’ values are the same:
df = dt.Frame("""Name A1 A2 A3 A4
deff 0 0 0 0
def1 0 1 0 0
def2 0 0 0 0
def3 1 0 0 0
def4 0 0 0 0""")
# compare the first integer column with the rest,
# use rowall to find rows where all is True
# and filter with the resulting boolean
df[dt.rowall(f[1]==f[1:]), :]
Name | A1 | A2 | A3 | A4 | ||
---|---|---|---|---|---|---|
str32 | bool8 | bool8 | bool8 | bool8 | ||
0 | deff | 0 | 0 | 0 | 0 | |
1 | def2 | 0 | 0 | 0 | 0 | |
2 | def4 | 0 | 0 | 0 | 0 |
Filter for rows where the values are increasing:
df = dt.Frame({"A": [1, 2, 6, 4],
"B": [2, 4, 5, 6],
"C": [3, 5, 4, 7],
"D": [4, -3, 3, 8],
"E": [5, 1, 2, 9]})
df
A | B | C | D | E | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | 5 | |
1 | 2 | 4 | 5 | -3 | 1 | |
2 | 6 | 5 | 4 | 3 | 2 | |
3 | 4 | 6 | 7 | 8 | 9 |
df[dt.rowall(f[1:] >= f[:-1]), :]
A | B | C | D | E | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | 5 | |
1 | 4 | 6 | 7 | 8 | 9 |
rowfirst, rowlast¶
These look for the first and last non-missing value in a row respectively:
df = dt.Frame({'A':[1, None, None, None],
'B':[None, 3, 4, None],
'C':[2, None, 5, None]})
df
A | B | C | ||
---|---|---|---|---|
int8 | int32 | int32 | ||
0 | 1 | NA | 2 | |
1 | NA | 3 | NA | |
2 | NA | 4 | 5 | |
3 | NA | NA | NA |
# rowfirst :
df[:, dt.rowfirst(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 3 | |
2 | 4 | |
3 | NA |
# rowlast :
df[:, dt.rowlast(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 2 | |
1 | 3 | |
2 | 5 | |
3 | NA |
Get rows where the last value in the row is greater than the first value in the row:
df = dt.Frame({'a': [50, 40, 30, 20, 10],
'b': [60, 10, 40, 0, 5],
'c': [40, 30, 20, 30, 40]})
df
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 50 | 60 | 40 | |
1 | 40 | 10 | 30 | |
2 | 30 | 40 | 20 | |
3 | 20 | 0 | 30 | |
4 | 10 | 5 | 40 |
df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 20 | 0 | 30 | |
1 | 10 | 5 | 40 |
rowmax, rowmin¶
These get the maximum and minimum values per row, respectively:
df = dt.Frame({"C": [2, 5, 30, 20, 10],
"D": [10, 8, 20, 20, 1]})
df
C | D | ||
---|---|---|---|
int32 | int32 | ||
0 | 2 | 10 | |
1 | 5 | 8 | |
2 | 30 | 20 | |
3 | 20 | 20 | |
4 | 10 | 1 |
# rowmax
df[:, dt.rowmax(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 10 | |
1 | 8 | |
2 | 30 | |
3 | 20 | |
4 | 10 |
# rowmin
df[:, dt.rowmin(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 2 | |
1 | 5 | |
2 | 20 | |
3 | 20 | |
4 | 1 |
Find the difference between the maximum and minimum of each row:
df = dt.Frame("""Value1 Value2 Value3 Value4
5 4 3 2
4 3 2 1
3 3 5 1""")
df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
df
Value1 | Value2 | Value3 | Value4 | max_min | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 5 | 4 | 3 | 2 | 3 | |
1 | 4 | 3 | 2 | 1 | 3 | |
2 | 3 | 3 | 5 | 1 | 4 |
rowsum, rowmean, rowcount, rowsd¶
rowsum()
and rowmean()
get the sum and mean of rows respectively;
rowcount()
counts the number of non-missing values in a row, while
rowsd()
aggregates a row to get the standard deviation.
Get the count, sum, mean and standard deviation for each row:
df = dt.Frame("""ORD A B C D
198 23 45 NaN 12
138 25 NaN NaN 62
625 52 36 49 35
457 NaN NaN NaN 82
626 52 32 39 45""")
df[:, dt.update(rowcount = dt.rowcount(f[:]),
rowsum = dt.rowsum(f[:]),
rowmean = dt.rowmean(f[:]),
rowsd = dt.rowsd(f[:])
)]
df
ORD | A | B | C | D | rowcount | rowsum | rowmean | rowsd | ||
---|---|---|---|---|---|---|---|---|---|---|
int32 | float64 | float64 | float64 | int32 | int32 | float64 | float64 | float64 | ||
0 | 198 | 23 | 45 | NA | 12 | 4 | 278 | 69.5 | 86.7583 | |
1 | 138 | 25 | NA | NA | 62 | 3 | 225 | 75 | 57.6108 | |
2 | 625 | 52 | 36 | 49 | 35 | 5 | 797 | 159.4 | 260.389 | |
3 | 457 | NA | NA | NA | 82 | 2 | 539 | 269.5 | 265.165 | |
4 | 626 | 52 | 32 | 39 | 45 | 5 | 794 | 158.8 | 261.277 |
Find rows where the number of nulls is greater than 3:
df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
'state': ["state1", "state2", "state3", "state4"],
'2005': [144, 205, 123, None],
'2006': [173, 211, 123, 124],
'2007': [None, None, None, None],
'2008': [None, 206, None, None],
'2009': [None, None, 124, 123],
'2010': [128, 273, None, None]})
df
city | state | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | ||
---|---|---|---|---|---|---|---|---|---|
str32 | str32 | int32 | int32 | void | int32 | int32 | int32 | ||
0 | city1 | state1 | 144 | 173 | NA | NA | NA | 128 | |
1 | city2 | state2 | 205 | 211 | NA | 206 | NA | 273 | |
2 | city3 | state3 | 123 | 123 | NA | NA | 124 | NA | |
3 | city4 | state4 | NA | 124 | NA | NA | 123 | NA |
# get columns that are null, then sum on the rows
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]
city | state | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | ||
---|---|---|---|---|---|---|---|---|---|
str32 | str32 | int32 | int32 | void | int32 | int32 | int32 | ||
0 | city4 | state4 | NA | 124 | NA | NA | 123 | NA |
Rowwise sum of the float columns:
df = dt.Frame("""ID W_1 W_2 W_3
1 0.1 0.2 0.3
1 0.2 0.4 0.5
2 0.3 0.3 0.2
2 0.1 0.3 0.4
2 0.2 0.0 0.5
1 0.5 0.3 0.2
1 0.4 0.2 0.1""")
df[:, dt.update(sum_floats = dt.rowsum(f[float]))]
df
ID | W_1 | W_2 | W_3 | sum_floats | ||
---|---|---|---|---|---|---|
int32 | float64 | float64 | float64 | float64 | ||
0 | 1 | 0.1 | 0.2 | 0.3 | 0.6 | |
1 | 1 | 0.2 | 0.4 | 0.5 | 1.1 | |
2 | 2 | 0.3 | 0.3 | 0.2 | 0.8 | |
3 | 2 | 0.1 | 0.3 | 0.4 | 0.8 | |
4 | 2 | 0.2 | 0 | 0.5 | 0.7 | |
5 | 1 | 0.5 | 0.3 | 0.2 | 1 | |
6 | 1 | 0.4 | 0.2 | 0.1 | 0.7 |
More Examples¶
Divide columns A
, B
, C
, D
by the total
column, square it
and sum rowwise:
df = dt.Frame({'A': [2, 3],
'B': [1, 2],
'C': [0, 1],
'D': [1, 0],
'total': [4, 6]})
df
A | B | C | D | total | ||
---|---|---|---|---|---|---|
int32 | int32 | int8 | int8 | int32 | ||
0 | 2 | 1 | 0 | 1 | 4 | |
1 | 3 | 2 | 1 | 0 | 6 |
df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
df
A | B | C | D | total | result | ||
---|---|---|---|---|---|---|---|
int32 | int32 | int8 | int8 | int32 | float64 | ||
0 | 2 | 1 | 0 | 1 | 4 | 0.375 | |
1 | 3 | 2 | 1 | 0 | 6 | 0.388889 |
Get the row sum of the COUNT
columns:
df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3
A 1 0 1 1
A 2 1 1 2
A 3 3 0 0""")
columns = [f[column] for column in df.names if column.startswith("COUNT")]
df[:, update(total = dt.rowsum(columns))]
df
USER | OBSERVATION | COUNT.1 | COUNT.2 | COUNT.3 | total | ||
---|---|---|---|---|---|---|---|
str32 | int32 | int32 | bool8 | int32 | int32 | ||
0 | A | 1 | 0 | 1 | 1 | 2 | |
1 | A | 2 | 1 | 1 | 2 | 4 | |
2 | A | 3 | 3 | 0 | 0 | 3 |
Sum selected columns rowwise:
df = dt.Frame({'location' : ("a","b","c","d"),
'v1' : (3,4,3,3),
'v2' : (4,56,3,88),
'v3' : (7,6,2,9),
'v4': (7,6,1,9),
'v5' : (4,4,7,9),
'v6' : (2,8,4,6)})
df
location | v1 | v2 | v3 | v4 | v5 | v6 | ||
---|---|---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | int32 | int32 | ||
0 | a | 3 | 4 | 7 | 7 | 4 | 2 | |
1 | b | 4 | 56 | 6 | 6 | 4 | 8 | |
2 | c | 3 | 3 | 2 | 1 | 7 | 4 | |
3 | d | 3 | 88 | 9 | 9 | 9 | 6 |
df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
x1 | x2 | ||
---|---|---|---|
int32 | int32 | ||
0 | 14 | 13 | |
1 | 66 | 18 | |
2 | 8 | 12 | |
3 | 100 | 24 |