.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24
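
Both tables for ``u22:k10000:r3``, the idmapping and its inverse, follow from
the same three numbers. The following userspace C sketch generates the pairs
and checks that ids keep their relative order. It is only an illustration:
``struct idmap``, ``idmap_pairs()`` and ``idmap_preserves_order()`` are names
invented here and are not kernel interfaces.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of one idmapping u:k:r; not a kernel type. */
    struct idmap {
        uint32_t u; /* first id in the upper idmapset U */
        uint32_t k; /* first id in the lower idmapset K */
        uint32_t r; /* number of ids mapped */
    };

    /*
     * Write the pairs generated by the idmapping into upper[] and lower[].
     * Both arrays must have room for m->r entries. Reading a pair as
     * upper[i] -> lower[i] gives the idmapping, reading it as
     * lower[i] -> upper[i] gives its inverse.
     */
    static void idmap_pairs(const struct idmap *m, uint32_t *upper,
                            uint32_t *lower)
    {
        for (uint32_t i = 0; i < m->r; i++) {
            upper[i] = m->u + i;
            lower[i] = m->k + i;
        }
    }

    /*
     * Check that ids keep their relative order on both sides. This only
     * fails if u + r or k + r wraps around the 32-bit id space.
     */
    static bool idmap_preserves_order(const struct idmap *m)
    {
        for (uint32_t i = 1; i < m->r; i++) {
            if (m->u + i <= m->u + (i - 1) || m->k + i <= m->k + (i - 1))
                return false;
        }
        return true;
    }

For ``u22:k10000:r3`` the arrays hold ``22, 23, 24`` and ``10000, 10001,
10002``, and the order check succeeds.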

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity. After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id - k + u = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id - u + k = n
 u1100 - u500 + k30000 = k30600
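
The two formulas, together with the range check we skipped above, fit into a
few lines of userspace C. This is a minimal sketch and not kernel code: it
assumes an idmapping consists of a single ``u:k:r`` triple, and ``struct
idmap``, ``map_down()``, ``map_up()`` and ``UNMAPPED`` are names made up for
this illustration.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical model of a single idmapping u:k:r; not a kernel type. */
    struct idmap {
        uint32_t u; /* first id in the upper idmapset */
        uint32_t k; /* first id in the lower idmapset */
        uint32_t r; /* number of ids mapped */
    };

    /* Stand-in for an unmapped id, i.e. (uid_t)-1 or (gid_t)-1. */
    #define UNMAPPED UINT32_MAX

    /* Map down (left to right): id - u + k, if id lies in [u, u + r). */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return UNMAPPED;
        return id - m->u + m->k;
    }

    /* Map up (right to left): id - k + u, if id lies in [k, k + r). */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return UNMAPPED;
        return id - m->k + m->u;
    }

With ``u0:k20000:r10000``, ``map_up()`` turns ``21000`` into ``1000``, and with
``u500:k30000:r10000``, ``map_down()`` turns ``1100`` into ``30600``, matching
the two calculations above. An id outside the range, such as ``499`` in the
second idmapping, yields ``UNMAPPED``.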

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written to or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used. As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` fields.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.
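
The ``stat()`` scenario above can be replayed in userspace with a small model.
The sketch below is an illustration under simplifying assumptions, not the
kernel's implementation: each idmapping is a single ``u:k:r`` triple, and
``struct idmap``, ``map_down()``, ``map_up()``, ``crossmap_to_caller()`` and
``UNMAPPED`` are hypothetical names standing in for the real ``make_kuid()``
and ``from_kuid()`` machinery.

.. code-block:: c

    #include <stdint.h>

    struct idmap { uint32_t u, k, r; }; /* hypothetical model of u:k:r */
    #define UNMAPPED UINT32_MAX         /* stands in for (uid_t)-1 */

    /* Models make_kuid(): map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return UNMAPPED;
        return id - m->u + m->k;
    }

    /* Models from_kuid(): map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return UNMAPPED;
        return id - m->k + m->u;
    }

    /*
     * Crossmap an on-disk owner into the id the caller would see: map down
     * in the filesystem's idmapping, then up in the caller's idmapping.
     */
    static uint32_t crossmap_to_caller(const struct idmap *fs,
                                       const struct idmap *caller,
                                       uint32_t disk_id)
    {
        uint32_t kid = map_down(fs, disk_id); /* what i_uid would hold */

        if (kid == UNMAPPED)
            return UNMAPPED;
        return map_up(caller, kid);
    }

For a filesystem idmapping of ``u0:k20000:r10000``, a caller idmapping of
``u3000:k20000:r10000`` and an on-disk owner of ``u1000`` the function returns
``4000``, in line with the text above.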

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.
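
Remapping is the mirror image of crossmapping: up first, then down. A minimal
userspace sketch, assuming single-triple idmappings and using the made-up
names ``struct idmap``, ``map_down()``, ``map_up()``, ``remap_kid()`` and
``UNMAPPED`` (none of which exist in the kernel), could look like this:

.. code-block:: c

    #include <stdint.h>

    struct idmap { uint32_t u, k, r; }; /* hypothetical model of u:k:r */
    #define UNMAPPED UINT32_MAX         /* stands in for an unmapped id */

    /* Models make_kuid(): map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return UNMAPPED;
        return id - m->u + m->k;
    }

    /* Models from_kuid(): map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return UNMAPPED;
        return id - m->k + m->u;
    }

    /*
     * Remap a kernel id from one idmapping into another via the userspace
     * idmapsets: invert "from", then apply "to".
     */
    static uint32_t remap_kid(const struct idmap *from, const struct idmap *to,
                              uint32_t kid)
    {
        uint32_t uid = map_up(from, kid);

        if (uid == UNMAPPED)
            return UNMAPPED;
        return map_down(to, uid);
    }

Remapping ``11000`` from ``u0:k10000:r10000`` to ``u0:k20000:r10000`` gives
``21000``, and swapping the two idmappings takes ``21000`` back to ``11000``.
If the target idmapping were ``u0:k20000:r200`` instead, ``u1000`` would not be
mapped in it and the result would be ``UNMAPPED``.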

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space, the userspace
idmapset indicates a userspace id. So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = u21000
                             ~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.
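
The effect of distinct types can be reproduced in plain C by wrapping the
kernel id in a struct, so that it cannot be passed where an integer is
expected. The sketch below assumes this wrapping approach for illustration;
``model_uid_t``, ``model_kuid_t``, ``model_make_kuid()`` and
``model_from_kuid()`` are hypothetical stand-ins, not the kernel's
definitions.

.. code-block:: c

    #include <stdint.h>

    typedef uint32_t model_uid_t;                  /* role of uid_t */
    typedef struct { uint32_t val; } model_kuid_t; /* role of kuid_t */

    struct idmap { uint32_t u, k, r; }; /* hypothetical model of u:k:r */

    /* Map a userspace id down; only accepts the userspace id type. */
    static model_kuid_t model_make_kuid(const struct idmap *m, model_uid_t uid)
    {
        model_kuid_t kuid = { UINT32_MAX }; /* unmapped by default */

        if (uid >= m->u && uid - m->u < m->r)
            kuid.val = uid - m->u + m->k;
        return kuid;
    }

    /* Map a kernel id up; only accepts the kernel id type. */
    static model_uid_t model_from_kuid(const struct idmap *m, model_kuid_t kuid)
    {
        if (kuid.val >= m->k && kuid.val - m->k < m->r)
            return kuid.val - m->k + m->u;
        return UINT32_MAX; /* unmapped */
    }

    /*
     * Given "model_kuid_t kuid" and "model_uid_t uid", both of these calls
     * are rejected at compile time because a struct and an integer are not
     * interchangeable:
     *
     *   model_make_kuid(&second, kuid);
     *   model_from_kuid(&second, uid);
     */

Valid translations still work as before: mapping ``1000`` down in
``u0:k10000:r10000`` yields a ``model_kuid_t`` holding ``11000``, and mapping
that value up again returns ``1000``.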

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we described in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

From an implementation point of view it's worth mentioning how idmappings are
represented. All idmappings are taken from the corresponding user namespace.

- caller's idmapping (usually taken from ``current_user_ns()``)
- filesystem's idmapping (``sb->s_user_ns``)
- mount's idmapping (``mnt_idmap(vfsmnt)``)

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.
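
The two-step check used in Examples 1 and 2 can be modelled in userspace as
follows. This is a sketch under simplifying assumptions rather than what
``fsuidgid_has_mapping()`` actually does internally: idmappings are single
``u:k:r`` triples, and ``struct idmap``, ``map_down()``, ``map_up()``,
``creation_allowed()`` and ``UNMAPPED`` are names invented for this example.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct idmap { uint32_t u, k, r; }; /* hypothetical model of u:k:r */
    #define UNMAPPED UINT32_MAX         /* written as u-1 in the examples */

    /* Models make_kuid(): map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return UNMAPPED;
        return id - m->u + m->k;
    }

    /* Models from_kuid(): map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return UNMAPPED;
        return id - m->k + m->u;
    }

    /*
     * Returns true if the creation request would be allowed and stores the
     * userspace id that would land on disk in *disk_id.
     */
    static bool creation_allowed(const struct idmap *caller,
                                 const struct idmap *fs,
                                 uint32_t caller_id, uint32_t *disk_id)
    {
        /* Step 1: map the caller's id down in the caller's idmapping. */
        uint32_t kid = map_down(caller, caller_id);

        if (kid == UNMAPPED)
            return false;

        /* Step 2: the kernel id must map up in the filesystem's idmapping. */
        uint32_t id = map_up(fs, kid);

        if (id == UNMAPPED)
            return false;

        *disk_id = id;
        return true;
    }

With the idmappings of Example 1 the function returns true and reports
``1000`` as the id that lands on disk. With those of Example 2 it returns
false because ``k11000`` lies outside the filesystem's kernel idmapset.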

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has mainly two
consequences.

First, that we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible. But this is a use-case that is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1
| 486 | Example 4 |
| 487 | ~~~~~~~~~ |
| 488 | |
| 489 | :: |
| 490 | |
| 491 | file id: u1000 |
| 492 | caller idmapping: u0:k10000:r10000 |
| 493 | filesystem idmapping: u0:k0:r4294967295 |
| 494 | |
| 495 | In order to report ownership to userspace the kernel uses the crossmapping |
| 496 | algorithm introduced in a previous section: |
| 497 | |
| 498 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 499 | idmapping:: |
| 500 | |
| 501 | make_kuid(u0:k0:r4294967295, u1000) = k1000 |
| 502 | |
| 503 | 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| 504 | |
| 505 | from_kuid(u0:k10000:r10000, k1000) = u-1 |
| 506 | |
| 507 | The crossmapping algorithm fails in this case because the kernel id in the |
| 508 | filesystem idmapping cannot be mapped up to a userspace id in the caller's |
| 509 | idmapping. Thus, the kernel will report the ownership of this file as the |
| 510 | overflowid. |
| 511 | |
| 512 | Example 5 |
| 513 | ~~~~~~~~~ |
| 514 | |
| 515 | :: |
| 516 | |
| 517 | file id: u1000 |
| 518 | caller idmapping: u0:k10000:r10000 |
| 519 | filesystem idmapping: u0:k20000:r10000 |
| 520 | |
| 521 | In order to report ownership to userspace the kernel uses the crossmapping |
| 522 | algorithm introduced in a previous section: |
| 523 | |
| 524 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 525 | idmapping:: |
| 526 | |
| 527 | make_kuid(u0:k20000:r10000, u1000) = k21000 |
| 528 | |
| 529 | 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| 530 | |
| 531 | from_kuid(u0:k10000:r10000, k21000) = u-1 |
| 532 | |
| 533 | Again, the crossmapping algorithm fails in this case because the kernel id in |
| 534 | the filesystem idmapping cannot be mapped to a userspace id in the caller's |
| 535 | idmapping. Thus, the kernel will report the ownership of this file as the |
| 536 | overflowid. |
| 537 | |
| 538 | Note how in the last two examples things would be simple if the caller would be |
| 539 | using the initial idmapping. For a filesystem mounted with the initial |
| 540 | idmapping it would be trivial. So we only consider a filesystem with an |
| 541 | idmapping of ``u0:k20000:r10000``: |
| 542 | |
| 543 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 544 | idmapping:: |
| 545 | |
| 546 | make_kuid(u0:k20000:r10000, u1000) = k21000 |
| 547 | |
| 548 | 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| 549 | |
| 550 | from_kuid(u0:k0:r4294967295, k21000) = u21000 |
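
Examples 4 and 5 differ from the creation case in that a failed translation is
not an error: the owner is simply reported as the overflowid. The userspace
sketch below models this substitution. It is illustrative only: idmappings are
single ``u:k:r`` triples, the overflowid is passed in as a parameter rather
than looked up, and ``struct idmap``, ``map_down()``, ``map_up()``,
``reported_owner()`` and ``UNMAPPED`` are hypothetical names.

.. code-block:: c

    #include <stdint.h>

    struct idmap { uint32_t u, k, r; }; /* hypothetical model of u:k:r */
    #define UNMAPPED UINT32_MAX         /* written as u-1 in the examples */

    /* Models make_kuid(): map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return UNMAPPED;
        return id - m->u + m->k;
    }

    /* Models from_kuid(): map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return UNMAPPED;
        return id - m->k + m->u;
    }

    /*
     * Owner reported to the caller for a file stored with disk_id: the
     * crossmapped id, or overflowid if either translation step fails.
     */
    static uint32_t reported_owner(const struct idmap *fs,
                                   const struct idmap *caller,
                                   uint32_t disk_id, uint32_t overflowid)
    {
        uint32_t kid = map_down(fs, disk_id);
        uint32_t id = (kid == UNMAPPED) ? UNMAPPED : map_up(caller, kid);

        return (id == UNMAPPED) ? overflowid : id;
    }

For both Example 4 and Example 5 the function returns whatever value was
passed as ``overflowid``. With the caller switched to the initial idmapping
and a filesystem idmapping of ``u0:k20000:r10000`` it returns ``21000``.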

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section where the caller's idmapping
and the filesystem's idmapping are incompatible cause various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping:  u0:k10000:r10000
 container2 idmapping:  u0:k20000:r10000
 filesystem idmapping:  u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the usecases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.
| 637 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 638 | Filesystem types vs idmapped mount types |
| 639 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 640 | |
With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership, separate
types have been introduced.
| 648 | |
| 649 | If a uid or gid has been generated using the filesystem or caller's idmapping |
| 650 | then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid |
| 651 | has been generated using a mount idmapping then we will be using the dedicated |
| 652 | ``vfsuid_t`` and ``vfsgid_t`` types. |
| 653 | |
| 654 | All VFS helpers that generate or take uids and gids as arguments use the |
| 655 | ``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler |
| 656 | to catch errors that originate from conflating filesystem and VFS uids and gids. |
| 657 | |
The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types, similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::
| 661 | |
| 662 | uid_t <--> kuid_t <--> vfsuid_t |
| 663 | gid_t <--> kgid_t <--> vfsgid_t |
| 664 | |
| 665 | Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type, |
| 666 | e.g., during ``stat()``, or store ownership information in a shared VFS object |
| 667 | based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can |
| 668 | use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers. |
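
The separation between filesystem-wide and mount-specific ids can be modelled
outside the kernel. The following is a minimal Python sketch, not kernel code:
the classes ``Kuid`` and ``Vfsuid`` and the function ``store_owner()`` are
hypothetical stand-ins chosen for illustration, and the runtime ``isinstance``
check plays the role that the C compiler plays for the real ``kuid_t`` and
``vfsuid_t`` types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kuid:
    """Stand-in for kuid_t: a filesystem-wide ("global") kernel uid."""
    val: int


@dataclass(frozen=True)
class Vfsuid:
    """Stand-in for vfsuid_t: a uid as seen through one idmapped mount."""
    val: int


def vfsuid_into_kuid(vfsuid: Vfsuid) -> Kuid:
    """Commit a mount-specific id as a filesystem-wide id (value unchanged)."""
    return Kuid(vfsuid.val)


def store_owner(inode: dict, owner: Kuid) -> None:
    """Hypothetical setter: shared VFS objects only accept a Kuid."""
    if not isinstance(owner, Kuid):
        raise TypeError("expected Kuid, got " + type(owner).__name__)
    inode["i_uid"] = owner


inode = {}
vfsuid = Vfsuid(11000)
try:
    store_owner(inode, vfsuid)                # conflating the two types is rejected
except TypeError as err:
    rejected = str(err)
store_owner(inode, vfsuid_into_kuid(vfsuid))  # explicit conversion is accepted
```

In this model the conversion keeps the numeric value and only changes the type,
which matches the ``k11000 = vfsuid_into_kuid(v11000)`` notation used in this
document.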
| 669 | |
To illustrate why these helpers exist, consider what happens when we change
ownership of an inode from an idmapped mount. After we have generated a
``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping, we later commit to
this ``vfsuid_t`` or ``vfsgid_t`` becoming the new filesystem-wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. This can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.
| 677 | |
Note that whenever a shared VFS object, e.g., a cached ``struct inode`` or a
cached ``struct posix_acl``, stores ownership information, a filesystem or
"global" ``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via
``vfsuid_t`` and ``vfsgid_t`` is specific to an idmapped mount.
| 682 | |
We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent filesystem idmappings from being misused
to generate ``vfsuid_t`` or ``vfsgid_t`` types, or mount idmappings from being
misused to generate ``kuid_t`` or ``kgid_t`` types, filesystem idmappings and
mount idmappings are different types as well.
| 689 | |
| 690 | All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require |
| 691 | a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing |
| 692 | a filesystem or caller idmapping will cause a compilation error. |
| 693 | |
| 694 | Similar to how we prefix all userspace ids in this document with ``u`` and all |
| 695 | kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount |
| 696 | idmapping will be written as: ``u0:v10000:r10000``. |
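
The ``u:k:r`` notation, including the ``v`` prefix for mount idmappings, is
simple enough to parse mechanically. The sketch below is illustrative only: the
``Idmap`` tuple and ``parse_idmap()`` are hypothetical names, not kernel or
userspace APIs.

```python
import re
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset (the "u" field)
    lower: int   # first id in the lower idmapset (the "k" or "v" field)
    count: int   # number of ids mapped (the "r" field)
    kind: str    # "k" for caller/filesystem idmappings, "v" for mount idmappings


_PATTERN = re.compile(r"u(\d+):([kv])(\d+):r(\d+)")


def parse_idmap(text: str) -> Idmap:
    """Parse the notation used in this document, e.g. u0:v10000:r10000."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an idmapping: {text!r}")
    upper, kind, lower, count = match.groups()
    return Idmap(int(upper), int(lower), int(count), kind)


mount = parse_idmap("u0:v10000:r10000")
filesystem = parse_idmap("u0:k20000:r10000")
```

Keeping the ``k``/``v`` prefix in the parsed value makes it possible to tell a
mount idmapping apart from a filesystem or caller idmapping later on.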
| 697 | |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 698 | Remapping helpers |
| 699 | ~~~~~~~~~~~~~~~~~ |
| 700 | |
| 701 | Idmapping functions were added that translate between idmappings. They make use |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 702 | of the remapping algorithm we've introduced earlier. We're going to look at: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 703 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 704 | - ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()`` |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 705 | |
  The ``i_*id_into_vfs*id()`` functions translate the filesystem's kernel ids
  into VFS ids in the mount's idmapping::

   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
   from_kuid(filesystem, kid) = uid

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(mount, uid) = kuid
| 714 | |
| 715 | - ``mapped_fsuid()`` and ``mapped_fsgid()`` |
| 716 | |
| 717 | The ``mapped_fs*id()`` functions translate the caller's kernel ids into |
| 718 | kernel ids in the filesystem's idmapping. This translation is achieved by |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 719 | remapping the caller's VFS ids using the mount's idmapping:: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 720 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 721 | /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */ |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 722 | from_kuid(mount, kid) = uid |
| 723 | |
| 724 | /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ |
| 725 | make_kuid(filesystem, uid) = kuid |
| 726 | |
- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

  Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type, or
  store such ownership in a shared VFS object, these helpers turn the VFS id
  into a ``kuid_t`` or ``kgid_t``, as described in the previous section.

Note that ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` invert each other.
Consider the following idmappings::
| 733 | |
| 734 | caller idmapping: u0:k10000:r10000 |
| 735 | filesystem idmapping: u0:k20000:r10000 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 736 | mount idmapping: u0:v10000:r10000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 737 | |
| 738 | Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id |
Randy Dunlap | 622d6f19 | 2022-08-31 17:28:28 -0700 | [diff] [blame] | 739 | to ``k21000`` according to its idmapping. This is what is stored in the |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 740 | inode's ``i_uid`` and ``i_gid`` fields. |
| 741 | |
| 742 | When the caller queries the ownership of this file via ``stat()`` the kernel |
| 743 | would usually simply use the crossmapping algorithm and map the filesystem's |
| 744 | kernel id up to a userspace id in the caller's idmapping. |
| 745 | |
| 746 | But when the caller is accessing the file on an idmapped mount the kernel will |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 747 | first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel |
| 748 | id into a VFS id in the mount's idmapping:: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 749 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 750 | i_uid_into_vfsuid(k21000): |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 751 | /* Map the filesystem's kernel id up into a userspace id. */ |
| 752 | from_kuid(u0:k20000:r10000, k21000) = u1000 |
| 753 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 754 | /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */ |
| 755 | make_kuid(u0:v10000:r10000, u1000) = v11000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 756 | |
| 757 | Finally, when the kernel reports the owner to the caller it will turn the |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 758 | VFS id in the mount's idmapping into a userspace id in the caller's |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 759 | idmapping:: |
| 760 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 761 | k11000 = vfsuid_into_kuid(v11000) |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 762 | from_kuid(u0:k10000:r10000, k11000) = u1000 |
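
The read path above can be replayed with a small model. This is a sketch under
simplifying assumptions: an idmapping is a single contiguous range, ``None``
stands for an id that has no mapping, and the functions share names with the
kernel helpers but not their signatures.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset
    lower: int   # first id in the lower idmapset
    count: int   # number of ids mapped


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Map an upper id down; None models an unmapped id."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def from_kuid(m: Idmap, kid: int) -> Optional[int]:
    """Map a lower id up; None models an unmapped id."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


def i_uid_into_vfsuid(filesystem: Idmap, mount: Idmap, kid: int) -> Optional[int]:
    """Filesystem kernel id -> userspace id -> VFS id in the mount's idmapping."""
    uid = from_kuid(filesystem, kid)
    return None if uid is None else make_kuid(mount, uid)


caller = Idmap(0, 10000, 10000)       # u0:k10000:r10000
filesystem = Idmap(0, 20000, 10000)   # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)        # u0:v10000:r10000

i_uid = make_kuid(filesystem, 1000)                    # 21000, stored in the inode
vfsuid = i_uid_into_vfsuid(filesystem, mount, i_uid)   # 11000
reported = from_kuid(caller, vfsuid)                   # 1000, what stat() reports
plain = from_kuid(caller, i_uid)                       # None without the idmapped mount
```

The last line shows why the mount idmapping is needed here: ``k21000`` has no
mapping in the caller's idmapping, so without the remapping step the owner
could not be reported as ``u1000``.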
| 763 | |
| 764 | We can test whether this algorithm really works by verifying what happens when |
| 765 | we create a new file. Let's say the user is creating a file with ``u1000``. |
| 766 | |
| 767 | The kernel maps this to ``k11000`` in the caller's idmapping. Usually the |
| 768 | kernel would now apply the crossmapping, verifying that ``k11000`` can be |
| 769 | mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't |
| 770 | be mapped up in the filesystem's idmapping directly this creation request |
| 771 | fails. |
| 772 | |
But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()`` thereby translating the caller's kernel id, taken
as the VFS id ``v11000``, into a kernel id in the filesystem's idmapping by way
of the mount's idmapping::

 mapped_fsuid(v11000):
   /* Map the VFS id up into a userspace id in the mount's idmapping. */
   from_kuid(u0:v10000:r10000, v11000) = u1000

   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
   make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

 from_kuid(u0:k20000:r10000, k21000) = u1000
| 789 | |
As we can see, we end up with an invertible and therefore information
preserving algorithm. A file created from ``u1000`` on an idmapped mount will
also be reported as being owned by ``u1000`` and vice versa.
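
The invertibility claim can be checked exhaustively for the idmappings used in
this walkthrough. The sketch below is a model, not kernel code: ``remap()`` is
a hypothetical helper that maps an id up through one idmapping and down through
another, and the two wrappers only borrow the names of the kernel helpers.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset
    lower: int   # first id in the lower idmapset
    count: int   # number of ids mapped


def remap(src: Idmap, dst: Idmap, lower_id: int) -> Optional[int]:
    """Map lower_id up through src, then down through dst (None if unmapped)."""
    if not (src.lower <= lower_id < src.lower + src.count):
        return None
    upper_id = src.upper + (lower_id - src.lower)
    if not (dst.upper <= upper_id < dst.upper + dst.count):
        return None
    return dst.lower + (upper_id - dst.upper)


def mapped_fsuid(mount: Idmap, filesystem: Idmap, vfsuid: int) -> Optional[int]:
    """VFS id in the mount's idmapping -> kernel id in the filesystem's idmapping."""
    return remap(mount, filesystem, vfsuid)


def i_uid_into_vfsuid(filesystem: Idmap, mount: Idmap, kuid: int) -> Optional[int]:
    """Kernel id in the filesystem's idmapping -> VFS id in the mount's idmapping."""
    return remap(filesystem, mount, kuid)


filesystem = Idmap(0, 20000, 10000)   # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)        # u0:v10000:r10000

# Every VFS id covered by the mount's idmapping survives the round trip.
round_trips = all(
    i_uid_into_vfsuid(filesystem, mount, mapped_fsuid(mount, filesystem, v)) == v
    for v in range(10000, 20000)
)
```

Ids outside the mapped range yield ``None`` in this model, so the round trip is
only defined for ids that both idmappings cover.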
| 793 | |
| 794 | Let's now briefly reconsider the failing examples from earlier in the context |
| 795 | of idmapped mounts. |
| 796 | |
| 797 | Example 2 reconsidered |
| 798 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 799 | |
| 800 | :: |
| 801 | |
| 802 | caller id: u1000 |
| 803 | caller idmapping: u0:k10000:r10000 |
| 804 | filesystem idmapping: u0:k20000:r10000 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 805 | mount idmapping: u0:v10000:r10000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 806 | |
| 807 | When the caller is using a non-initial idmapping the common case is to attach |
| 808 | the same idmapping to the mount. We now perform three steps: |
| 809 | |
| 810 | 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| 811 | |
| 812 | make_kuid(u0:k10000:r10000, u1000) = k11000 |
| 813 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 814 | 2. Translate the caller's VFS id into a kernel id in the filesystem's |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 815 | idmapping:: |
| 816 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 817 | mapped_fsuid(v11000): |
| 818 | /* Map the VFS id up into a userspace id in the mount's idmapping. */ |
| 819 | from_kuid(u0:v10000:r10000, v11000) = u1000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 820 | |
| 821 | /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| 822 | make_kuid(u0:k20000:r10000, u1000) = k21000 |
| 823 | |
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::
| 826 | |
| 827 | from_kuid(u0:k20000:r10000, k21000) = u1000 |
| 828 | |
| 829 | So the ownership that lands on disk will be ``u1000``. |
| 830 | |
| 831 | Example 3 reconsidered |
| 832 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 833 | |
| 834 | :: |
| 835 | |
| 836 | caller id: u1000 |
| 837 | caller idmapping: u0:k10000:r10000 |
| 838 | filesystem idmapping: u0:k0:r4294967295 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 839 | mount idmapping: u0:v10000:r10000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 840 | |
| 841 | The same translation algorithm works with the third example. |
| 842 | |
| 843 | 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| 844 | |
| 845 | make_kuid(u0:k10000:r10000, u1000) = k11000 |
| 846 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 847 | 2. Translate the caller's VFS id into a kernel id in the filesystem's |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 848 | idmapping:: |
| 849 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 850 | mapped_fsuid(v11000): |
| 851 | /* Map the VFS id up into a userspace id in the mount's idmapping. */ |
| 852 | from_kuid(u0:v10000:r10000, v11000) = u1000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 853 | |
| 854 | /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| 855 | make_kuid(u0:k0:r4294967295, u1000) = k1000 |
| 856 | |
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000
| 861 | |
| 862 | So the ownership that lands on disk will be ``u1000``. |
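
The three creation steps used in Examples 2 and 3 can be sketched as a single
function. This is a simplified model: ``owner_on_disk()`` is a hypothetical
name, idmappings are single contiguous ranges, and ``None`` stands for a
request that fails because an id has no mapping.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset
    lower: int   # first id in the lower idmapset
    count: int   # number of ids mapped


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Map an upper id down; None models an unmapped id."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def from_kuid(m: Idmap, kid: int) -> Optional[int]:
    """Map a lower id up; None models an unmapped id."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


def owner_on_disk(caller: Idmap, filesystem: Idmap, uid: int,
                  mount: Optional[Idmap] = None) -> Optional[int]:
    """Return the userspace id written to disk, or None if creation fails."""
    kid = make_kuid(caller, uid)                      # step 1
    if kid is not None and mount is not None:         # step 2: mapped_fsuid()
        via_mount = from_kuid(mount, kid)
        kid = None if via_mount is None else make_kuid(filesystem, via_mount)
    return None if kid is None else from_kuid(filesystem, kid)   # step 3


caller = Idmap(0, 10000, 10000)        # u0:k10000:r10000
mount = Idmap(0, 10000, 10000)         # u0:v10000:r10000
fs_example2 = Idmap(0, 20000, 10000)   # u0:k20000:r10000
fs_example3 = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295

ex2_with_mount = owner_on_disk(caller, fs_example2, 1000, mount)   # 1000
ex2_without = owner_on_disk(caller, fs_example2, 1000)             # None
ex3_with_mount = owner_on_disk(caller, fs_example3, 1000, mount)   # 1000
ex3_without = owner_on_disk(caller, fs_example3, 1000)             # 11000
```

In this model, dropping the mount idmapping makes the Example 2 request fail
and makes Example 3 write ``u11000`` instead of ``u1000``.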
| 863 | |
| 864 | Example 4 reconsidered |
| 865 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 866 | |
| 867 | :: |
| 868 | |
| 869 | file id: u1000 |
| 870 | caller idmapping: u0:k10000:r10000 |
| 871 | filesystem idmapping: u0:k0:r4294967295 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 872 | mount idmapping: u0:v10000:r10000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 873 | |
| 874 | In order to report ownership to userspace the kernel now does three steps using |
| 875 | the translation algorithm we introduced earlier: |
| 876 | |
| 877 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 878 | idmapping:: |
| 879 | |
| 880 | make_kuid(u0:k0:r4294967295, u1000) = k1000 |
| 881 | |
2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 890 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 891 | 3. Map the VFS id up into a userspace id in the caller's idmapping:: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 892 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 893 | k11000 = vfsuid_into_kuid(v11000) |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 894 | from_kuid(u0:k10000:r10000, k11000) = u1000 |
| 895 | |
Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as owned
by ``u1000`` according to the mount's idmapping.
| 900 | |
| 901 | Example 5 reconsidered |
| 902 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 903 | |
| 904 | :: |
| 905 | |
| 906 | file id: u1000 |
| 907 | caller idmapping: u0:k10000:r10000 |
| 908 | filesystem idmapping: u0:k20000:r10000 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 909 | mount idmapping: u0:v10000:r10000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 910 | |
| 911 | Again, in order to report ownership to userspace the kernel now does three |
| 912 | steps using the translation algorithm we introduced earlier: |
| 913 | |
| 914 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 915 | idmapping:: |
| 916 | |
| 917 | make_kuid(u0:k20000:r10000, u1000) = k21000 |
| 918 | |
2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k21000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k20000:r10000, k21000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 927 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 928 | 3. Map the VFS id up into a userspace id in the caller's idmapping:: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 929 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 930 | k11000 = vfsuid_into_kuid(v11000) |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 931 | from_kuid(u0:k10000:r10000, k11000) = u1000 |
| 932 | |
Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now owned by
``u1000`` according to the mount's idmapping.
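
The reporting steps used in Examples 4 and 5 follow the same pattern and can be
sketched as one function. As before this is a simplified model rather than
kernel code: ``reported_owner()`` is a hypothetical name and ``None`` stands
for an id that cannot be mapped.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset
    lower: int   # first id in the lower idmapset
    count: int   # number of ids mapped


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Map an upper id down; None models an unmapped id."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def from_kuid(m: Idmap, kid: int) -> Optional[int]:
    """Map a lower id up; None models an unmapped id."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


def reported_owner(caller: Idmap, filesystem: Idmap, disk_uid: int,
                   mount: Optional[Idmap] = None) -> Optional[int]:
    """Return the userspace id the caller sees, or None if it is unmapped."""
    kid = make_kuid(filesystem, disk_uid)             # step 1
    if kid is not None and mount is not None:         # step 2: i_uid_into_vfsuid()
        uid = from_kuid(filesystem, kid)
        kid = None if uid is None else make_kuid(mount, uid)
    return None if kid is None else from_kuid(caller, kid)        # step 3


caller = Idmap(0, 10000, 10000)        # u0:k10000:r10000
mount = Idmap(0, 10000, 10000)         # u0:v10000:r10000
fs_example4 = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295
fs_example5 = Idmap(0, 20000, 10000)   # u0:k20000:r10000

ex4_with_mount = reported_owner(caller, fs_example4, 1000, mount)  # 1000
ex4_without = reported_owner(caller, fs_example4, 1000)            # None
ex5_with_mount = reported_owner(caller, fs_example5, 1000, mount)  # 1000
ex5_without = reported_owner(caller, fs_example5, 1000)            # None
```

Without the mount idmapping neither ``k1000`` nor ``k21000`` falls inside the
caller's range ``k10000`` to ``k19999``, which is why both lookups return
``None`` in this model.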
| 937 | |
| 938 | Changing ownership on a home directory |
| 939 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 940 | |
We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem, or both use a non-initial
idmapping. A wide range of use cases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. The consequence is, as we have seen, that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem isn't working because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.
| 949 | |
| 950 | As we've seen above idmapped mounts provide a solution to this by remapping the |
| 951 | caller's or filesystem's idmapping according to the mount's idmapping. |
| 952 | |
| 953 | Aside from containerized workloads, idmapped mounts have the advantage that |
| 954 | they also work when both the caller and the filesystem use the initial |
| 955 | idmapping which means users on the host can change the ownership of directories |
| 956 | and files on a per-mount basis. |
| 957 | |
| 958 | Consider our previous example where a user has their home directory on portable |
| 959 | storage. At home they have id ``u1000`` and all files in their home directory |
| 960 | are owned by ``u1000`` whereas at uni or work they have login id ``u1125``. |
| 961 | |
| 962 | Taking their home directory with them becomes problematic. They can't easily |
| 963 | access their files, they might not be able to write to disk without applying |
| 964 | lax permissions or ACLs and even if they can, they will end up with an annoying |
| 965 | mix of files and directories owned by ``u1000`` and ``u1125``. |
| 966 | |
Idmapped mounts solve this problem. A user can create an idmapped mount for
their home directory on their work computer or their computer at home
depending on what ownership they would prefer to end up on the portable storage
itself.
| 971 | |
Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::
| 977 | |
| 978 | caller id: u1125 |
| 979 | caller idmapping: u0:k0:r4294967295 |
| 980 | filesystem idmapping: u0:k0:r4294967295 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 981 | mount idmapping: u1000:v1125:r1 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 982 | |
| 983 | 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| 984 | |
| 985 | make_kuid(u0:k0:r4294967295, u1125) = k1125 |
| 986 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 987 | 2. Translate the caller's VFS id into a kernel id in the filesystem's |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 988 | idmapping:: |
| 989 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 990 | mapped_fsuid(v1125): |
| 991 | /* Map the VFS id up into a userspace id in the mount's idmapping. */ |
| 992 | from_kuid(u1000:v1125:r1, v1125) = u1000 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 993 | |
| 994 | /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| 995 | make_kuid(u0:k0:r4294967295, u1000) = k1000 |
| 996 | |
3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
   filesystem's idmapping::
| 999 | |
| 1000 | from_kuid(u0:k0:r4294967295, k1000) = u1000 |
| 1001 | |
| 1002 | So ultimately the file will be created with ``u1000`` on disk. |
| 1003 | |
| 1004 | Now let's briefly look at what ownership the caller with id ``u1125`` will see |
| 1005 | on their work computer: |
| 1006 | |
| 1007 | :: |
| 1008 | |
| 1009 | file id: u1000 |
| 1010 | caller idmapping: u0:k0:r4294967295 |
| 1011 | filesystem idmapping: u0:k0:r4294967295 |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 1012 | mount idmapping: u1000:v1125:r1 |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 1013 | |
| 1014 | 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| 1015 | idmapping:: |
| 1016 | |
| 1017 | make_kuid(u0:k0:r4294967295, u1000) = k1000 |
| 1018 | |
2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u1000:v1125:r1, u1000) = v1125
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 1027 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 1028 | 3. Map the VFS id up into a userspace id in the caller's idmapping:: |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 1029 | |
Christian Brauner | 5d3ca59 | 2023-03-06 15:11:42 +0100 | [diff] [blame] | 1030 | k1125 = vfsuid_into_kuid(v1125) |
Christian Brauner | ad19607 | 2021-07-27 12:44:16 +0200 | [diff] [blame] | 1031 | from_kuid(u0:k0:r4294967295, k1125) = u1125 |
| 1032 | |
So ultimately the file will be reported to the caller as belonging to
``u1125``, which is the caller's userspace id on their workstation in our
example.
| 1035 | |
| 1036 | The raw userspace id that is put on disk is ``u1000`` so when the user takes |
| 1037 | their home directory back to their home computer where they are assigned |
| 1038 | ``u1000`` using the initial idmapping and mount the filesystem with the initial |
| 1039 | idmapping they will see all those files owned by ``u1000``. |
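
Both home directory walkthroughs can be replayed with the single-id mount
idmapping. The sketch below is a model under the same simplifying assumptions
as the rest of this document's notation: ``create_on_work_machine()`` and
``stat_on_work_machine()`` are hypothetical names, and ``None`` stands for an
id that the minimal idmapping does not cover.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int   # first id in the upper idmapset
    lower: int   # first id in the lower idmapset
    count: int   # number of ids mapped


def make_kuid(m: Idmap, uid: int) -> Optional[int]:
    """Map an upper id down; None models an unmapped id."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def from_kuid(m: Idmap, kid: int) -> Optional[int]:
    """Map a lower id up; None models an unmapped id."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


INITIAL = Idmap(0, 0, 4294967295)   # caller and filesystem: u0:k0:r4294967295
WORK_MOUNT = Idmap(1000, 1125, 1)   # mount: u1000:v1125:r1


def create_on_work_machine(uid: int) -> Optional[int]:
    """Userspace id that lands on disk when `uid` creates a file."""
    vfsuid = make_kuid(INITIAL, uid)
    if vfsuid is None:
        return None
    via_mount = from_kuid(WORK_MOUNT, vfsuid)         # mapped_fsuid(), first half
    if via_mount is None:
        return None
    kid = make_kuid(INITIAL, via_mount)               # mapped_fsuid(), second half
    return None if kid is None else from_kuid(INITIAL, kid)


def stat_on_work_machine(disk_uid: int) -> Optional[int]:
    """Userspace id reported for a file owned by `disk_uid` on disk."""
    kid = make_kuid(INITIAL, disk_uid)
    uid = None if kid is None else from_kuid(INITIAL, kid)
    vfsuid = None if uid is None else make_kuid(WORK_MOUNT, uid)  # i_uid_into_vfsuid()
    return None if vfsuid is None else from_kuid(INITIAL, vfsuid)


on_disk = create_on_work_machine(1125)   # 1000
seen_at_work = stat_on_work_machine(1000)  # 1125
other_owner = stat_on_work_machine(0)    # None: only u1000 is covered by the mount
```

Because the mount idmapping has a range of one, only the single id pair
``u1000``/``v1125`` is translated in this model; files owned by any other id
have no mapping through this mount.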