是否有算法来确定符号链接是否循环?

是否有算法来确定符号链接是否循环?

如果 Unix 系统遇到包含符号链接循环或太多符号链接的路径,通常会出错,因为它们在一次路径查找中遍历的符号链接数量受到限制。但是有没有一种方法可以真正决定给定的路径是否解析为某些内容或包含循环,即使它包含的链接数量多于 UNIX 愿意遵循的链接数量?或者这是一个形式上无法判定的问题?如果可以决定,是否可以在合理的时间/内存量内决定(例如,无需访问文件系统上的所有文件)?

一些例子:

a/b/c/d
where a/b is a symlink to ../e
and e is a symlink to f
and f is a symlink to a/b

a/b/c/d
where a/b/c is a symlink to ../c

a/b/c/d
where a/b/c is a symlink to ../c/d

a/b/c/d
where a/b/c is a symlink to /a/b/e
where a/b/e is a symlink to /a/b/f
where a/b/f is a symlink to /a/b/g

编辑:

为了澄清,我不是在询问在文件系统中查找循环,而是在询问一种决策算法,该算法决定给定路径是否解析为确定的文件/目录,或者是否根本不解析。例如,在以下系统中,存在循环,但给定路径仍然可以正常解析:

/ -- a -- b
where b is a symlink to /a

该目录树显然有一个循环,但路径a/b/b/b/b/b仍然可以很好地解析为/a.

答案1

我不完全明白你在问什么。如果我不知道更多,我认为您是在问是否有一种方法可以在处理文件时检测到这一点。我不相信这是可能的。

我能想到的唯一方法是进行查找,在其中专门开始查找目录树中的特定分支。

例子

$ tree 
.
`-- a
    `-- b
        |-- c
        |   `-- d
        |       `-- e -> ../../../../a/b
        `-- e -> e

5 directories, 1 file

find命令将检测此循环,但不会真正告诉您有关它的全部信息。

$ find -L . -mindepth 15
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

我任意选择了 15 个级别,以阻止find.但是-mindepth,如果您不关心显示的目录树,则可以删除该开关 ( )。该find命令仍然检测到循环并停止:

$ find -L . 
.
./a
./a/b
./a/b/c
./a/b/c/d
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

顺便说一句,如果你想覆盖MAXSYMLINKSLinux(较新的 3.x 版本内核)上的默认值 40,你可以看到这个 U&L Q&A,标题为:如何增加 MAXSYMLINKS

使用符号链接命令

FTP 站点维护人员可以使用一个名为 的工具,symlinks该工具将有助于暴露由符号链接引起的过长或悬空树的问题。

在某些情况下,该symlinks工具也可用于删除违规链接。

例子

$ symlinks -srv a
lengthy:  /home/saml/tst/99159/a/b/c/d/e -> ../../../../a/b
dangling: /home/saml/tst/99159/a/b/e -> e

glibc 库

glibc 库似乎提供了一些与此相关的 C 函数,但我并不完全了解它们的作用或如何实际使用它们。所以我只能简单地向你指出它们。

手册页man symlink显示了名为 的函数的函数定义symlink()。描述是这样的:

symlink() 创建一个名为 newpath 的符号链接,其中包含字符串 oldpath。

错误之一指出该函数返回:

ELOOP 解析 newpath 时遇到太多符号链接。

我还将引导您访问手册页,man path_resolution其中讨论了 Unix 如何确定磁盘上项目的路径。具体来说就是这一段。

If  the component is found and is a symbolic link (symlink), we first 
resolve this symbolic link (with the current lookup directory as starting 
lookup directory).  Upon error, that error is returned.  If the result is 
not a directory, an ENOTDIR error is returned.  If the resolution of the 
symlink is successful and returns a directory, we set the current lookup
directory to that directory, and go to the next component.  Note that the 
resolution process here involves recursion.  In order  to  protect  the 
kernel against stack overflow, and also to protect against denial of 
service, there are limits on the maximum recursion depth, and on the maximum 
number of symbolic links followed.  An ELOOP error is returned  when  the
maximum is exceeded ("Too many levels of symbolic links").

答案2

好吧,经过一番思考,我想我有一个明确的解决方案。

关键的见解是,如果路径中的每个链接都解析为某些内容,则整个路径都会解析。或者反过来,如果路径无法解析,则必须有一个需要遍历但无法解析的特定符号链接。

之前在考虑这个问题时,我使用了一种从根开始遍历路径元素的算法,当它遇到符号链接时,它会用符号链接的内容替换该路径元素,然后继续遍历。由于这种方法不记得它当前正在解析哪个符号链接,因此无法检测它何时处于非解析循环中。

如果算法跟踪当前正在解析哪个符号链接(或者在递归链接的情况下是哪个符号链接),它可以检测是否正在尝试再次递归地解析仍忙于解析的链接。

算法:

initialize `location` to the current working directory
initialize `link_contents` to the path we want to resolve
initialize `active_symlinks` to the empty set

def resolve_symlink(location, link_contents, active_symlinks) :
    loop forever:
        next_location = location / [first element of link_contents]
        see if next_location is a symlink.
        if so:
            if next_location in active_symlinks: abort, we have a loop
            location = resolve_symlink(location, readlink(next_location), active_symlinks ∪ {next_location})
        else:
            location = next_location
        strip first element of link_contents
        if link_contents is empty: 
            return location

编辑:

我在 python 中有一个有效的实现https://bitbucket.org/JanKanis/python-inotify/src/853ed903e870cbfa283e6ce7a5e41aeffe16d4e7/inotify/pathresolver.py?at=pathwatcher

编辑2:Bitbucket 停止托管 Mercurial 存储库。这是完整的文件:

# pathresolver.py - This module contains an iterator that iterates over all
# elements of a path including any symlinks. 

# Copyright 2012-2013 Jan Kanis

# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.


"""
This module contains a few functions that help traverse paths with
symlinks.

`resolve_symlinks` is a generator that yields pairs of paths, with one
yield for each path element that is traversed to reach the final
target of a path. This includes path elements and paths from symlinks.

`resolve_path` is a wrapper around `resolve_symlinks` that takes a
single path as an argument and sets up the other arguments to
`resolve_symlinks`.

`get_symlinkmax` is a function that determines the maximum number of
symlinks the system will traverse in a path.

Note: this module forms part of python-inotify, but is considered an
internal module. As such there are no stability guarantees regarding
the api's of these functions.
"""


__author__ = "Jan Kanis"


import sys, os, errno
import tempfile, shutil
from pathlib import PosixPath


_curdir = PosixPath('.')
_root = PosixPath('/')
_parentdir = PosixPath('..')


def resolve_path(path):
    '''Resolve the symlinks in path, yielding all filesystem locations that are traversed.

    The yielded value is a tuple, of which the first element is a symlink-free
    path, and the second element is a path relative to the first element that
    has not yet been traversed. This second element may contain more symlinks.
    
    The resolution implementation will follow an unbounded number of symlinks
    but will still detect symlink loops if they prevent a path from resolving.

    path can be given as a string or as a pathlib object. The yielded values
    are pathlib.PosixPath objects.

    '''
    linkcache = {}
    linkcounter = [0]
    for p in resolve_symlink(_curdir, PosixPath(path), set(),
                                  linkcache, linkcounter):
        yield p


def resolve_symlink(location, link_contents, active_links, known_links, linkcounter):
    '''Recursively resolve a symlink (or another path) to the file or
    directory it ultimately points to. This function handles an
    unlimited number of symlinks, and correctly detects symlink
    loops. All path parameters should be given as pathlib.PosixPath
    instances.

    location: The directory in which the currently to be resolved link resides.

    link_contents: The path stored in the symlink as returned by readlink().

    active_links: a set of symlinks that is currently being resolved.

    linkcache: a dictionary of link location -> resolved target paths. This
    cache prevents this function from having to resolve the same symlink
    twice. (Note that having to traverse the same symlink multiple times
    does not necessarily mean that the path does not resolve to anything.)

    linkcounter: A list containing a single number. (We use a list so that the
    value can be passed by reference.) This number is updated to indicate the
    total number of symlinks that has been traversed.

    '''

    while True:
        if link_contents.is_absolute():
            location = _root
            link_contents = link_contents.relative()

        yield location, link_contents
        if link_contents == _curdir:
            return

        if link_contents.parts[0] == '..':
            # We need to choose here if we allow traversing of a path above
            # the root or above the current directory. Going above CWD
            # should be allowed as long as we don't go above / by doing
            # so. The OS allows going to /.. (which just ends up at /
            # again), so for consistency with that we also allow it,
            # although a path that requires us to do this is probably a bug
            # somewhere.
            if all(p in ('/', '..') for p in location.parts):
                location = location['..']
            else:
                location = location.parent()
            # Strip the first part of link_contents off
            link_contents = link_contents.parts[1:]
            continue

        try:
            nextpath = location[link_contents.parts[0]]
            newlink = PosixPath(os.readlink(str(nextpath)))
        except OSError as e:
            if e.errno == errno.EINVAL:
                # The entry is not a symbolic link, assume it is a normal file
                # or directory. If it is a file and we need it to be a
                # directory that will be detected the next time through the
                # loop in the os.errno.ENOTDIR check. Checking it here would be
                # possible, but keeping the number of system calls at one per
                # loop makes reasoning about race conditions easier.
                location = nextpath
                link_contents = link_contents.parts[1:]
                continue
            if e.errno == errno.ENOENT:
                # The entry does not exist
                raise FileNotFoundError(nextpath)
            if e.errno == errno.ENOTDIR:
                # At this point we know the path is not valid, but we can not
                # determine with certainty what is wrong. If there were no
                # concurrent modifications we can safely make an is_dir()
                # call. If concurrent modifications did happen the is_dir()
                # check may succeed erroneously but we can't detect all
                # concurrent modifications anyway. If the check fails
                # (erroneously or not) that indicates a concurrent modification
                # so we fall through.
                if not location.is_dir():
                    raise NotADirectoryError(location)
                # We should not be able to get here, unless there is a bug
                # or some relevant part of the file system was changed
                # concurrently while we were resolving this link.
                raise ConcurrentFilesystemModificationError(nextpath)
            if e.errno == errno.ELOOP:
                # Can only happen if a path component was changed concurrently
                raise ConcurrentFilesystemModificationError(nextpath)
            # For other OSErrors (such as in case of EACCESS) we propagate to
            # the caller. 
            raise

        # It is a symlink!
        if nextpath in active_links:
            raise SymlinkLoopError(nextpath)

        link_contents = link_contents.parts[1:]
        # We have not yet attempted traversing this symlink during the
        # current call or any of its parents.
        if nextpath in known_links:
            # known_links stores fully resolved paths, so we don't need to
            # traverse the cached path and can just continue our traversal from
            # there.
            location = known_links[nextpath]
            continue
        
        # An unknown link, resolve it recursively
        linkcounter[0] += 1
        # Don't yield the very last result of this recursive call immediately,
        # we still want to process that further. 
        lastloc, lastlink = None, None
        for loc, link in resolve_symlink(location, newlink,
                          active_links.union((nextpath,)), known_links, linkcounter):
            if lastloc:
                yield lastloc, lastlink[link_contents]
            lastloc, lastlink = loc, link
        # The last yielded location is the final resolution of the symlink. The
        # last yielded link_contents is always '.' so we can ignore that.
        known_links[nextpath] = loc
        location = loc
        continue


_symlinkmax = None
def get_symlinkmax():
  '''Returns the maximum number of symlinks that this system will traverse in
  resolving a file path.

  '''
  global _symlinkmax
  if _symlinkmax is not None:
    return _symlinkmax

  try:
    tempdir = tempfile.mkdtemp(prefix='inotify-symlinkmax-tmpdir-')
    open(tempdir+'/testfile', 'w').close()

    target = 'testfile'
    for i in range(1, 60):
      name = 'l'+str(i)
      os.symlink(target, tempdir+'/'+name)
      target = name

      try:
        open(tempdir+'/'+name).close()
      except IOError as e:
        if e.errno == errno.ELOOP:
          _symlinkmax = i - 1
          break
        raise
    
  finally:
    if tempdir:
      shutil.rmtree(tempdir)
  return _symlinkmax



class InvalidPathError (OSError):
    def __init__(self, msg, path, errno=None, *args):
        self.filename = path
        self.errno = errno
        if errno:
            self.strerror = os.strerror(errno)
        OSError.__init__(self, msg, *args)

class SymlinkLoopError (InvalidPathError):
    def __init__(self, path, *args):
        msg = "Path not valid: The symlink at '{}' forms a symlink loop".format(path)
        InvalidPathError.__init__(self, msg, path, errno=errno.ELOOP, *args)

class ConcurrentFilesystemModificationError (InvalidPathError):
    def __init__(self, path, *args):
        msg = "Path not valid: A concurrent change was detected while traversing '{}'".format(path)
        InvalidPathError.__init__(self, msg, path, errno=None, *args)


# To be Python 2 and 3 compatible and also inherit from the right exception
# types in python 3, we need some boilerplate.

fnf_msg = "Path not valid: '{}' does not exist"
nad_msg = "Path not valid: '{}' is not a directory"

if sys.version_info >= (3, 3):
    class FileNotFoundError (InvalidPathError, FileNotFoundError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, fnf_msg.format(path), path,
                                          errno=ENOENT, *args)

    class NotADirectoryError (InvalidPathError, NotADirectoryError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, nad_msg.format(path), path,
                                          errno=errno.ENOTDIR, *args)

else:
    class FileNotFoundError (InvalidPathError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, fnf_msg.format(path), path,
                                          errno=errno.ENOENT, *args)

    class NotADirectoryError (InvalidPathError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, nad_msg.format(path), path,
                                          errno=errno.ENOTDIR, *args)

答案3

在静态系统上(即没有发生任何变化时),是的,有一个算法。符号链接的数量是有限的,因此它们构成了有限的图,检测循环是一个有限的过程。

在实时系统上,无法检测循环,因为符号链接在循环检测器运行时可能会发生变化。读取每个符号链接是原子的,但跟随符号链接则不是。如果在内核进行遍历时某些符号链接不断变化,它可能最终会出现一条涉及不同链接的无限路径。

答案4

据我通过查看当前的 Linux 内核源代码可知,内核所做的只是记录它所遵循的链接数量,如果该数字大于某个数字,则会出错。看namei.c 中的第 1330 行对于评论和nested_symlink()功能。 ELOOP 宏(在这种情况下从系统调用返回的错误号read(2))出现在该文件中的多个位置,因此它可能不像计算随后的链接那么简单,但这肯定是它的样子。

有许多算法可以在链表中查找“循环”(Floyd循环检测算法)或在有向图。我不清楚您必须执行哪一项才能检测特定路径中的实际“循环”或“循环”。无论如何,算法可能需要很长时间才能运行,因此我猜测仅计算所遵循的符号链接的数量就可以实现 90% 的目标。

相关内容