Name mangling¶
Column names in a Frame
satisfy several invariants:
they are all non-empty strings;
within a single Frame column names must be unique;
no column name may contain characters from the ASCII C0 control block. This set of forbidden characters includes: the NULL character
\0
, TAB character\t
, newline\n
, and similar.
If the user makes an attempt to create a Frame that would violate some of these assumptions, then instead of failing we will attempt to mangle the provided names, forcing them to satisfy the above requirements.
Given a list of column names requested by the user, the following algorithm is used:
First, we check all the non-empty names in the list, from left to right. If a name contains characters in the range
\x00-\x1F
, then every run of 1 or more such characters is replaced with a single dot.Once the special characters are removed from the name, we check it against the set of names that were already encountered. If the current name hasn’t been seen before, then we add it to the final list of names and proceed to consider the next name in the list. However, if the name was seen before, then it goes into the deduplication stage.
When a name needs to be deduplicated, we do the following:
If the name ends with a number, then split it into two parts: the
stem
and the numeric suffix. Letcount
be the value of the numeric suffix plus 1;If the name does not end with a number, then append a dot (
.
) to the name and consider this thestem
. For thecount
variable, take the value of optiondt.options.frame.name_auto_index
.Concatenate
stem
andcount
, and check whether this name has been seen before. If it was, then incrementcount
by 1, and repeat this step.Use
stem + count
as this column’s final name. Continue processing other columns.
Finally, re-scan the list of column names once again, this time replacing all the empty names. For each empty name we proceed exactly as in (3), using
dt.options.frame.name_auto_prefix
as thestem
, anddt.options.frame.name_auto_index
as the initialcount
.
Examples¶
The default value of dt.options.frame.name_auto_prefix
is "C"
, and the
default value of dt.options.frame.name_auto_index
is 0
. This means that
if no column names are given, they will be named as C0, C1, C2, ...
:
>>> dt.Frame([[]] * 5).names
('C0', 'C1', 'C2', 'C3', 'C4')
If the column names contain duplicates, then they will gain a numeric suffix (or reuse the existing suffix, if any):
>>> dt.Frame(names=["A", "A", "A"]).names
('A', 'A.0', 'A.1')
>>> dt.Frame(names=["R3"] * 4).names
('R3', 'R4', 'R5', 'R6')
If some of the column names are given, while others are missing, then the
missing names will be filled as C0, C1, ...
:
>>> dt.Frame(names=["A", None, "B", None]).names
('A', 'C0', 'B', 'C1')
When replacing the missing names, explicitly given names will have a higher precedence and tend to retain their names:
>>> dt.Frame(names=["A", None, "C0", "C1"]).names
('A', 'C2', 'C0', 'C1')
However, deduplication of the existing names happen from left to right, which may affect the subsequent columns:
>>> dt.Frame(names=["A1", "A1", "A2", "A3"]).names
('A1', 'A2', 'A3', 'A4')