Column names in a Frame satisfy several invariants:
they are all non-empty strings;
within a single Frame, column names must be unique;
no column name may contain characters from the ASCII C0 control block. This set of forbidden characters includes the NULL character \0, the TAB character \t, the newline character \n, and similar.
If the user attempts to create a Frame that would violate any of these invariants, then instead of failing we mangle the provided names, forcing them to satisfy the above requirements.
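As an illustration, the three invariants can be checked with a few lines of plain Python. This is only a sketch: `invariant_violations` is a hypothetical helper written for this page, not a datatable function.

```python
def invariant_violations(names):
    """Return a list of human-readable problems found in `names`.

    Hypothetical helper: checks the three invariants listed above.
    """
    problems = []
    seen = set()
    for i, name in enumerate(names):
        # Invariant 1: every name is a non-empty string.
        if not isinstance(name, str) or name == "":
            problems.append(f"column {i}: name is not a non-empty string")
            continue
        # Invariant 3: no characters from the C0 control block (\x00-\x1F).
        if any(ord(ch) <= 0x1F for ch in name):
            problems.append(f"column {i}: {name!r} contains a control character")
        # Invariant 2: names are unique within the list.
        if name in seen:
            problems.append(f"column {i}: {name!r} is a duplicate")
        seen.add(name)
    return problems
```

An empty result means the list already satisfies the invariants and would need no mangling.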
Given a list of column names requested by the user, the following algorithm is used:
1. First, we check all the non-empty names in the list, from left to right. If a name contains characters in the range \x00-\x1F, then every run of 1 or more such characters is replaced with a single dot.
2. Once the special characters are removed from the name, we check it against the set of names that were already encountered. If the current name hasn't been seen before, then we add it to the final list of names and proceed to consider the next name in the list. However, if the name was seen before, then it goes into the deduplication stage.
3. When a name needs to be deduplicated, we do the following:
   - If the name ends with a number, then split it into two parts: the stem and the numeric suffix. Let count be the value of the numeric suffix plus 1;
   - If the name does not end with a number, then append a dot (.) to the name and consider this the stem. For the count variable, take the value of option dt.options.frame.names_auto_index;
   - Concatenate the stem and the count, and check whether this name has been seen before. If it was, then increment count by 1, and repeat this step;
   - Use stem + count as this column's final name. Continue processing other columns.
4. Finally, re-scan the list of column names once again, this time replacing all the empty names. For each empty name we proceed exactly as in (3), using the auto-naming prefix as the stem and dt.options.frame.names_auto_index as the initial count.
The default prefix is "C", and the default initial count is 0. This means that if no column names are given, they will be named C0, C1, C2, ...:
dt.Frame([[]] * 5).names
If the column names contain duplicates, then they will gain a numeric suffix (or reuse the existing suffix, if any):
dt.Frame(names=["A", "A", "A"]).names
dt.Frame(names=["R3"] * 4).names
If some of the column names are given, while others are missing, then the missing names will be filled as C0, C1, ...:
dt.Frame(names=["A", None, "B", None]).names
When replacing the missing names, explicitly given names will have a higher precedence and tend to retain their names:
dt.Frame(names=["A", None, "C0", "C1"]).names
However, deduplication of the existing names happens from left to right, which may affect the subsequent columns:
dt.Frame(names=["A1", "A1", "A2", "A3"]).names
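To make the behaviour in these examples easier to trace, the mangling steps described above can be sketched in plain Python. This is a minimal illustration of the algorithm as written on this page, not datatable's actual implementation. The function name `mangle_names` and its parameters `auto_prefix` and `auto_index` are hypothetical stand-ins for the prefix and initial-count settings.

```python
import re

# One or more consecutive C0 control characters (\x00-\x1F).
_CONTROL_RUN = re.compile(r"[\x00-\x1F]+")
# A (possibly empty) stem followed by a trailing run of digits.
_NUMERIC_TAIL = re.compile(r"(.*?)(\d+)")


def mangle_names(names, auto_prefix="C", auto_index=0):
    """Sketch of the name-mangling steps; empty names are None or ""."""
    seen = set()
    result = [None] * len(names)

    def claim(stem, count):
        # Increment count until stem + count has not been seen before.
        while f"{stem}{count}" in seen:
            count += 1
        name = f"{stem}{count}"
        seen.add(name)
        return name

    # First pass: non-empty names, from left to right.
    for i, name in enumerate(names):
        if not name:
            continue
        name = _CONTROL_RUN.sub(".", name)
        if name not in seen:
            seen.add(name)
            result[i] = name
            continue
        # Deduplication stage: pick the stem and the starting count.
        m = _NUMERIC_TAIL.fullmatch(name)
        if m:
            stem, count = m.group(1), int(m.group(2)) + 1
        else:
            stem, count = name + ".", auto_index
        result[i] = claim(stem, count)

    # Second pass: fill in the empty names.
    for i, name in enumerate(names):
        if not name:
            result[i] = claim(auto_prefix, auto_index)
    return tuple(result)


print(mangle_names(["A", "A", "A"]))           # ('A', 'A.0', 'A.1')
print(mangle_names(["A", None, "C0", "C1"]))   # ('A', 'C2', 'C0', 'C1')
print(mangle_names(["A1", "A1", "A2", "A3"]))  # ('A1', 'A2', 'A3', 'A4')
```

The printed tuples are what this sketch returns. In the last call the duplicate "A1" claims "A2" before the original "A2" is reached, which is why the later columns are renamed in turn.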