Inside the Linux Kernel Architecture: The Virtual File System (Introduction)

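In Linux, "everything is a file". Linux supports many file systems, such as Ext2/3/4 and XFS. To support all of these file system types well, Linux abstracts a virtual file system (VFS) layer that adapts flexibly to each concrete file system implementation. Its basic architecture is as follows: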


Figure: virtual file system architecture

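As the figure shows, every virtual file system operation must run in kernel mode. Access to system storage and external devices is extremely complex, so these operations cannot be handed to user code; otherwise the system would be very unstable.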

File system types

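  1. Disk-based file systems
    The classic way to store files on non-volatile media. These are the file systems most of us know, such as Ext2/3/4 and FAT.
  2. Virtual file systems
    Generated inside the kernel, these give user applications a way to communicate with the kernel. The best-known example is the proc file system. It needs no storage on any kind of hardware: all of its information lives in memory, and a process's entries disappear when the process exits.
  3. Network file systems
    These give access to data on other computers. The local system call still enters the kernel through the VFS, but the file system client forwards the request over the network and the actual storage operation is carried out on the remote machine. Some network file systems are implemented in user space and mounted through FUSE, while others, such as NFS, have an in-kernel client.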

The common file model

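The VFS defines a set of methods and abstractions, together with a uniform view of the objects (files) in a file system, yet the implementations behind that view can differ completely. What it offers is a common superset: many of the operations it provides are not needed by some file systems, for example the writepage operation in the proc file system.
When working with files, kernel space and user space use different objects. In user space a file is identified by a "file descriptor", an integer usually abbreviated FD. A descriptor is only valid within one process, so two different processes can use the same FD number for different files. The kernel-space structure behind an FD is struct file, whose key member is an address_space; the address_space is the structure that actually exchanges data with the underlying device. The structure that manages a file's metadata is the inode, which stores the link count, access times, version, backing device, owning superblock and so on, but not the file name. The name is stored in struct dentry, because names exist to index and manage inodes, which is exactly the job of the dentry; dentries in turn are reached through the super_block.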
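The difference between an FD and the struct file behind it can be observed from user space. The sketch below is only an illustration under stated assumptions: the function name fd_shares_offset_demo and the temporary path under /tmp are made up for this example. It relies on the POSIX behavior that dup() yields a second descriptor for the same open file (shared offset), while a second open() yields an independent open file on the same inode.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if a dup()'ed descriptor shares its offset with the original
 * while a separately open()'ed descriptor does not; -1 on setup failure.
 * The function name and the /tmp path are hypothetical. */
int fd_shares_offset_demo(void)
{
    char path[] = "/tmp/vfs_fd_demo_XXXXXX";
    int result = -1;
    int fd1 = mkstemp(path);            /* create and open a temp file */
    if (fd1 < 0)
        return -1;

    int fd2 = dup(fd1);                 /* second fd, same open file */
    int fd3 = open(path, O_RDONLY);     /* new open file, same inode */

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "hello", 5) == 5) {
        off_t shared = lseek(fd2, 0, SEEK_CUR);    /* moved with fd1 */
        off_t separate = lseek(fd3, 0, SEEK_CUR);  /* still at the start */
        result = (shared == 5 && separate == 0);
    }

    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return result;
}
```

A caller could check the behavior with `assert(fd_shares_offset_demo() == 1);`.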
Next we look at each structure and how they relate, and then at what happens in Linux between opening a file and writing to it.

VFS structure

Figure: VFS structure

inode

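The inode manages a file's metadata, including permissions, access information, link information and storage device information. Its operations mainly deal with links and permissions. The data structure is shown below:
For background, see inode.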

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
    ...
    const struct inode_operations   *i_op; // inode operations, specific to the concrete file system
    struct super_block  *i_sb; // owning superblock
    struct address_space    *i_mapping; // address space: the part that actually exchanges data with the device
        ...
    /* Stat data, not accessed from path walking */
    unsigned long       i_ino; // inode number
    /*
     * Filesystems may only read i_nlink directly.  They shall use the
     * following functions for modification:
     *
     *    (set|clear|inc|drop)_nlink
     *    inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t           i_rdev;
    loff_t          i_size;
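    struct timespec64   i_atime; // time of last access
    struct timespec64   i_mtime; // time of last data modification
    struct timespec64   i_ctime; // time of last inode (metadata) change, not creation time
    spinlock_t          i_lock; /* i_blocks, i_bytes, maybe i_size */
    unsigned short      i_bytes; // bytes used beyond the full 512-byte units counted in i_blocks
    u8                  i_blkbits;       // log2 of the block size
    u8                  i_write_hint;
    blkcnt_t            i_blocks; // space used by the file, in 512-byte units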

#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t      i_size_seqcount;
#endif

    /* Misc */
    unsigned long       i_state;
    struct rw_semaphore i_rwsem;

    unsigned long       dirtied_when;   /* jiffies of first dirtying */
    unsigned long       dirtied_time_when;

    struct hlist_node   i_hash;
    struct list_head    i_io_list;  /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
    struct bdi_writeback    *i_wb;      /* the associated cgroup wb */

    /* foreign inode detection, see wbc_detach_inode() */
    int         i_wb_frn_winner;
    u16         i_wb_frn_avg_time;
    u16         i_wb_frn_history;
#endif
    struct list_head    i_lru;      /* inode LRU list */
    struct list_head    i_sb_list;
    struct list_head    i_wb_list;  /* backing dev writeback list */
    union {
        struct hlist_head   i_dentry; // one inode can be referenced by several dentries (hard links)
        struct rcu_head i_rcu;
    };
    atomic64_t  i_version;
    atomic_t        i_count;
    atomic_t        i_dio_count;
    atomic_t        i_writecount;
#ifdef CONFIG_IMA
    atomic_t        i_readcount; /* struct files open RO */
#endif
    const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
    struct file_lock_context    *i_flctx;
    struct address_space    i_data;
    struct list_head    i_devices;
    union {
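        struct pipe_inode_info  *i_pipe; // pipe
        struct block_device *i_bdev; // block device
        struct cdev     *i_cdev;  // character device
        char            *i_link; // cached symlink target, for symbolic links
        unsigned        i_dir_seq; // sequence counter for directories, used by parallel lookups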
    };
    __u32           i_generation;
#ifdef CONFIG_FSNOTIFY
    __u32           i_fsnotify_mask; /* all events this inode cares about */
    struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif

#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    struct fscrypt_info *i_crypt_info;
#endif
    void            *i_private; /* fs or device private pointer */
} __randomize_layout;
struct inode_operations {
    struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up the name held in the dentry inside the directory inode
    const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target of a symbolic link so the VFS can follow it
    int (*permission) (struct inode *, int);
    struct posix_acl * (*get_acl)(struct inode *, int);

    int (*readlink) (struct dentry *, char __user *,int);

    int (*create) (struct inode *,struct dentry *, umode_t, bool);
    int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
    int (*unlink) (struct inode *,struct dentry *); // remove a hard link
    int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
    int (*mkdir) (struct inode *,struct dentry *,umode_t); // create a directory named by the dentry with the given mode, and create its inode
    int (*rmdir) (struct inode *,struct dentry *); // remove a directory
    int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a special file (device node, FIFO or socket)
    int (*rename) (struct inode *, struct dentry *,
            struct inode *, struct dentry *, unsigned int); // move the file given by old_dentry from the directory old_dir to the directory new_dir, under the name given by new_dentry
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
              u64 len);
    int (*update_time)(struct inode *, struct timespec64 *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
               struct file *, unsigned open_flag,
               umode_t create_mode); 
    int (*tmpfile) (struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
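
Because the inode holds the metadata but not the name, the link operation only adds a second name for the same inode. The user-space sketch below illustrates this with link() and stat(); the function name hard_link_shares_inode and the temporary directory under /tmp are assumptions for this example, and it presumes the file system behind /tmp supports hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if two names resolve to one inode whose link count is 2,
 * 0 if they do not, and -1 on setup failure. Names are hypothetical. */
int hard_link_shares_inode(void)
{
    char dir[] = "/tmp/vfs_inode_demo_XXXXXX";
    char a[64], b[64];
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    int fd = open(a, O_CREAT | O_WRONLY, 0600);   /* first name */
    if (fd >= 0) {
        struct stat sa, sb;
        close(fd);
        if (link(a, b) == 0 &&                    /* second name */
            stat(a, &sa) == 0 && stat(b, &sb) == 0)
            result = (sa.st_dev == sb.st_dev &&
                      sa.st_ino == sb.st_ino &&   /* same inode number */
                      sa.st_nlink == 2);          /* two names point at it */
    }

    unlink(b);
    unlink(a);
    rmdir(dir);
    return result;
}
```

A caller could check the behavior with `assert(hard_link_shares_inode() == 1);`.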

dentry

A dentry mainly manages a file name and links a directory entry to all of its child entries.

dentry state

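A dentry can be in one of three states: used, unused or negative.
used: associated with a valid inode and currently in use (reference count greater than 0)
unused: associated with a valid inode, but the reference count is 0; the dentry has not actually been freed yet and stays in the cache
negative: no inode is associated, either because the file was deleted or because the name never existed on the storage device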
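
The three states can be written down as a small decision function. This is a simplified model, not kernel code: struct toy_dentry, its fields and classify are hypothetical stand-ins for d_inode and the reference count kept in d_lockref.

```c
#include <assert.h>
#include <stddef.h>

enum dentry_state { DENTRY_USED, DENTRY_UNUSED, DENTRY_NEGATIVE };

/* Hypothetical stand-in for the two dentry fields that decide the state. */
struct toy_dentry {
    const void  *inode;     /* NULL means no inode is attached */
    unsigned int refcount;  /* number of current users */
};

enum dentry_state classify(const struct toy_dentry *d)
{
    assert(d != NULL);
    if (d->inode == NULL)
        return DENTRY_NEGATIVE;               /* name without an inode */
    return d->refcount > 0 ? DENTRY_USED      /* valid inode, in use */
                           : DENTRY_UNUSED;   /* valid inode, only cached */
}
```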

dentry cache

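Looking up the dentry for a path would be expensive if every step had to go to disk, so an LRU cache speeds up the lookup. For example, to find the dentry for /usr/bin/java we first need the dentry for /, then /usr, then /usr/bin, and so on until the end of the path. Every dentry touched along the way is cached so that the next lookup is faster. The lookup process itself is analyzed in detail in the look_up discussion.
For more details, see dentry.
Its data structure is as follows: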

struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;       /* protected by d_lock */
    seqcount_t d_seq;       /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;      /* Where the name belongs to - NULL is
                     * negative */
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */

    /* Ref lookup also touches following */
    struct lockref d_lockref;   /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;   /* The root of the dentry tree */
    unsigned long d_time;       /* used by d_revalidate */
    void *d_fsdata;         /* fs-specific data */

    union {
        struct list_head d_lru;     /* LRU list */
        wait_queue_head_t *d_wait;  /* in-lookup ones only */
    };
    struct list_head d_child;   /* child of parent list */
    struct list_head d_subdirs; /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;  /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash;  /* only for in-lookup ones */
        struct rcu_head d_rcu;
    } d_u;
} __randomize_layout;
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int); // check whether the dentry is still valid
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the dentry name
    int (*d_compare)(const struct dentry *, // compare file names
            unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
                     // called when the last reference is dropped; returning 0 keeps the dentry cached as unused, nonzero discards it
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *); // called when the dentry loses its inode; the default just calls iput()
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;
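
The benefit of caching every dentry touched during a path walk can be sketched with a toy model. Everything here is hypothetical (struct toy_dcache, toy_path_walk, the fixed-size table), there is no LRU eviction, and the real dcache hashes on parent and name rather than storing path prefixes; the sketch only counts how many components would have needed a disk lookup.

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_SLOTS 16
#define TOY_PATH_MAX    128

/* Hypothetical toy cache: remembers path prefixes already resolved. */
struct toy_dcache {
    char prefix[TOY_CACHE_SLOTS][TOY_PATH_MAX];
    int  used;
};

static int toy_cached(const struct toy_dcache *c, const char *prefix)
{
    for (int i = 0; i < c->used; i++)
        if (strcmp(c->prefix[i], prefix) == 0)
            return 1;
    return 0;
}

/* Walk an absolute path component by component. Returns the number of
 * components that were not cached (each would cost a disk lookup). */
int toy_path_walk(struct toy_dcache *c, const char *path)
{
    char prefix[TOY_PATH_MAX];
    size_t len = strlen(path);
    int misses = 0;

    assert(path[0] == '/' && len < TOY_PATH_MAX);
    for (size_t i = 1; i <= len; i++) {
        if (i < len && path[i] != '/')
            continue;                 /* still inside a component */
        if (path[i - 1] == '/')
            continue;                 /* empty component, e.g. "//" */
        memcpy(prefix, path, i);      /* prefix ends at this component */
        prefix[i] = '\0';
        if (!toy_cached(c, prefix)) {
            misses++;
            if (c->used < TOY_CACHE_SLOTS)
                strcpy(c->prefix[c->used++], prefix);
        }
    }
    return misses;
}
```

With an empty cache, walking /usr/bin/java misses on /usr, /usr/bin and /usr/bin/java; a second walk of the same path misses on nothing, and /usr/bin/ls then misses only on its last component.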

super_block

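The superblock manages the parameters of the actual file system behind a mount point, including the block size, the maximum file size the file system can handle, the file system type and the backing storage device. (Note: in the overall structure figure above, the superblock has a files member that points to all open files, but no such field appears in the data structure below. The member used to be consulted by the umount logic to make sure all files were closed. I have not yet found how newer kernels handle this and will add it once I do.)
Superblock management mostly lives in the mount logic, which will be analyzed in detail when we reach the mount-related modules. The main job of the superblock is to manage inodes.
For details, see superblock.
Its data structure is as follows: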

struct super_block {
    struct list_head    s_list;     /* Keep this first */
    dev_t           s_dev;      /* search index; _not_ kdev_t */
    unsigned char       s_blocksize_bits; // log2(block size in bytes)
    unsigned long       s_blocksize; // block size in bytes
    loff_t          s_maxbytes; /* Max file size */
    struct file_system_type *s_type; // file system type
    const struct super_operations   *s_op; // superblock operations
    const struct dquot_operations   *dq_op;
    const struct quotactl_ops   *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long       s_flags;
    unsigned long       s_iflags;   /* internal SB_I_* flags */
    unsigned long       s_magic;
    struct dentry       *s_root; // root dentry; path lookups inside this file system start here
    struct rw_semaphore s_umount;
    int         s_count;
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    void                    *s_security;
#endif
    const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    const struct fscrypt_operations *s_cop;
#endif
    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info     *s_mtd;
    struct hlist_node   s_instances;
    unsigned int        s_quota_types;  /* Bitmask of supported quota types */
    struct quota_info   s_dquot;    /* Diskquota specific options */

    struct sb_writers   s_writers;

    /*
     * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
     * s_fsnotify_marks together for cache efficiency. They are frequently
     * accessed and rarely modified.
     */
    void            *s_fs_info; /* Filesystem private info */

    /* Granularity of c/m/atime in ns (cannot be worse than a second) */
    u32         s_time_gran;
#ifdef CONFIG_FSNOTIFY
    __u32           s_fsnotify_mask;
    struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif

    char            s_id[32];   /* Informational name */
    uuid_t          s_uuid;     /* UUID */

    unsigned int        s_max_links;
    fmode_t         s_mode;

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */

    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    const struct dentry_operations *s_d_op; /* default d_op for dentries */

    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;

    struct shrinker s_shrink;   /* per-sb shrinker handle */

    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;

    /* Pending fsnotify inode refs */
    atomic_long_t s_fsnotify_inode_refs;

    /* Being remounted read-only */
    int s_readonly_remount;

    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;

    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;

    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru     s_dentry_lru; // dentry cache (LRU)
    struct list_lru     s_inode_lru; // inode cache (LRU)
    struct rcu_head     rcu;
    struct work_struct  destroy_work;

    struct mutex        s_sync_lock;    /* sync serialisation lock */

    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;

    /* s_inode_list_lock protects s_inodes */
    spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head    s_inodes;   /* all inodes */

    spinlock_t      s_inode_wblist_lock;
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode for this sb
    void (*destroy_inode)(struct inode *); // free an inode of this sb
    void (*dirty_inode) (struct inode *, int flags); // mark the inode dirty
    int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back
    int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped; decides whether to delete it
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);  // release the sb on unmount
    int (*sync_fs)(struct super_block *sb, int wait); 
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); // query file system statistics
    int (*remount_fs) (struct super_block *, int *, char *); // remount
    void (*umount_begin) (struct super_block *); // mainly used by NFS
    // reporting helpers (mount options, device name, path, statistics)
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                  struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                    struct shrink_control *);
};

address_space

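As described above, the superblock manages inodes, the dentry manages file names, the name-to-inode mapping and directories, and the inode manages file metadata. But who handles the actual exchange between a file and a storage device such as a disk? How does written data get from memory back to disk, and how is data read from disk into memory? That is the main job of address_space: it synchronizes data between memory and the backing device. How it works in detail is covered in the discussion of the memory cache.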

struct address_space {
    struct inode        *host; // owning inode, used to reach the file metadata
    struct xarray       i_pages; // cached pages of the file
    gfp_t           gfp_mask; // memory allocation flags for the pages
    atomic_t        i_mmap_writable; // count of VM_SHARED mappings
    struct rb_root_cached   i_mmap; // tree of private and shared mmap mappings
    struct rw_semaphore i_mmap_rwsem;
    unsigned long       nrpages; // number of cached pages
    unsigned long       nrexceptional;
    pgoff_t         writeback_index; // writeback starts here
    const struct address_space_operations *a_ops; // address space operations
    unsigned long       flags; // error bits and flags
    errseq_t        wb_err; // most recent writeback error
    spinlock_t      private_lock;
    struct list_head    private_list;
    void            *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc); // write one page back
    int (*readpage)(struct file *, struct page *); // read one page into memory

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *); // write dirty pages back

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page); // mark a page dirty

    /*
     * Reads in the requested pages. Unlike ->readpage(), this is
     * PURELY used for read-ahead!.
     */
    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages);

    int (*write_begin)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned flags,
                struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned copied,
                struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned int, unsigned int);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
    /*
     * migrate the contents of a page to the specified target. If
     * migrate_mode is MIGRATE_ASYNC, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    bool (*isolate_page)(struct page *, isolate_mode_t);
    void (*putback_page)(struct page *);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, unsigned long,
                    unsigned long);
    void (*is_dirty_writeback) (struct page *, bool *, bool *);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
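
The division of labor between a buffered write and a later write-back can be modeled in a few lines. This is a toy model only: struct toy_mapping, toy_write, toy_writeback and the tiny page size are hypothetical, and the real page cache is far more involved. It shows that a write only touches cached pages and marks them dirty, and that the backing store changes only when the dirty pages are written back.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 8
#define TOY_NR_PAGES  4

/* Hypothetical toy mapping: cached pages, dirty flags, and a "disk". */
struct toy_mapping {
    char pages[TOY_NR_PAGES][TOY_PAGE_SIZE];
    int  dirty[TOY_NR_PAGES];
    char disk[TOY_NR_PAGES][TOY_PAGE_SIZE];
};

/* Buffered write: copy into the cached pages and mark them dirty.
 * The disk is not touched. */
void toy_write(struct toy_mapping *m, size_t pos, const char *buf, size_t len)
{
    assert(pos + len <= (size_t)TOY_NR_PAGES * TOY_PAGE_SIZE);
    for (size_t i = 0; i < len; i++) {
        size_t idx = (pos + i) / TOY_PAGE_SIZE;
        m->pages[idx][(pos + i) % TOY_PAGE_SIZE] = buf[i];
        m->dirty[idx] = 1;
    }
}

/* Write-back: flush every dirty page to the disk and clear its flag.
 * Returns the number of pages written. */
int toy_writeback(struct toy_mapping *m)
{
    int written = 0;

    for (int i = 0; i < TOY_NR_PAGES; i++) {
        if (!m->dirty[i])
            continue;
        memcpy(m->disk[i], m->pages[i], TOY_PAGE_SIZE);
        m->dirty[i] = 0;
        written++;
    }
    return written;
}
```

Writing 11 bytes at offset 6 dirties pages 0, 1 and 2 in this model; the first toy_writeback call flushes three pages and a second call finds nothing left to do.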

file

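As mentioned earlier, a process in user space sees an integer fd, while the corresponding kernel structure is file. Every user-space operation on an fd is turned by a system call into an operation on a file.
For more details, see file.
The data structures are as follows: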

struct task_struct {
       ...
    /* Filesystem information: */
    struct fs_struct        *fs; // root & pwd path

    /* Open file information: */
    struct files_struct     *files; // opened files

    /* Namespaces: */
    struct nsproxy          *nsproxy;
        ...
};
/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
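    atomic_t count; // reference count: number of tasks sharing this table
    bool resize_in_progress; // set while the fd table is being resized
    wait_queue_head_t resize_wait;

    struct fdtable __rcu *fdt; // pointer to the fd table currently in use
    struct fdtable fdtab; // embedded initial fd table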
  /*
   * written part on a separate cache line in SMP
   */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    unsigned int next_fd; // where the search for the next free fd starts
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    unsigned long full_fds_bits_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // initial array of open files
};
struct fdtable {
    unsigned int max_fds; // current capacity of this table; it grows on demand, bounded by the open file limit (ulimit -n)
    struct file __rcu **fd;      /* current fd array */
    unsigned long *close_on_exec;
    unsigned long *open_fds;  // bitmap of fds in use
    unsigned long *full_fds_bits;
    struct rcu_head rcu;
};
struct file {
    union {
        struct llist_node   fu_llist;
        struct rcu_head     fu_rcuhead;
    } f_u;
    struct path     f_path;  // path
    struct inode        *f_inode;    /* cached value */
    const struct file_operations    *f_op; // file operations
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t      f_lock;
    enum rw_hint        f_write_hint;
    atomic_long_t   f_count;
    unsigned int        f_flags;
    fmode_t         f_mode;
    struct mutex        f_pos_lock;
    loff_t          f_pos; // current file position
    struct fown_struct  f_owner; // owner to notify for asynchronous I/O signals (SIGIO)
    const struct cred   *f_cred;
    struct file_ra_state    f_ra;
    u64         f_version;
#ifdef CONFIG_SECURITY
    void            *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void            *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head    f_ep_links;
    struct list_head    f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping; // address space
    errseq_t        f_wb_err;
} __randomize_layout;
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int); // move the file position
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
    int (*iterate) (struct file *, struct dir_context *);
    int (*iterate_shared) (struct file *, struct dir_context *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *); // map the file into virtual memory
    unsigned long mmap_supported_flags;
    int (*open) (struct inode *, struct file *); // called when a file on this inode is opened
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int); 
    int (*flock) (struct file *, int, struct file_lock *); // lock a file
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
              loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
    unsigned (*mmap_capabilities)(struct file *);
#endif
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
            loff_t, size_t, unsigned int);
    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                   struct file *file_out, loff_t pos_out,
                   loff_t len, unsigned int remap_flags);
    int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;
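
The fd table bookkeeping (the open_fds bitmap and next_fd) is visible from user space through one guaranteed behavior: open() returns the lowest-numbered descriptor that is not currently open. The sketch below assumes a single-threaded process and that /dev/null exists; the function name lowest_fd_is_reused is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the number of a closed descriptor is handed out again by
 * the next open(), 0 if not, -1 on setup failure.
 * Assumes no other thread opens or closes files concurrently. */
int lowest_fd_is_reused(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);

    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return -1;
    }

    close(a);                               /* slot a becomes free */
    int c = open("/dev/null", O_RDONLY);    /* lowest free slot is a */
    int reused = (c == a);

    if (c >= 0)
        close(c);
    close(b);
    return reused;
}
```

A caller could check the behavior with `assert(lowest_fd_is_reused() == 1);`.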

The virtual file system in practice

By now we have a basic picture of the VFS architecture, but it is still too vague for a deep understanding. So let us walk through a file operation in pseudocode to describe how the Linux kernel reads and writes files, using a write as the example:
Goal: starting from nothing, write hello world into the file /testmount/testdir/testfile1.txt
The basic sequence of system calls is 1. mkdir 2. creat 3. open 4. write
The calls made for mkdir are:
rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
rootInode->i_op->mkdir(rootInode, testDirDentry, 0777);
testDirInode = testDirDentry->d_inode;
The calls made for creat are:
testFileDentry = dentry("testfile1.txt");
testDirInode->i_op->create(testDirInode, testFileDentry, 0777, true);
testFileInode = testFileDentry->d_inode;
The calls made for the open system call are:
testFileInode->i_fop->open(testFileInode, testFile);
The calls made for the write system call are:
pos = 0;
testFile->f_op->write(testFile, "hello world", len, &pos);
The detailed flow:

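  1. Suppose we have a block disk device /dev/sda and format it as an Ext2 file system. How a block device is formatted will be described in the device management chapter.
  2. We mount the disk on the /testmount directory. The kernel's mount code then registers the corresponding superblock. How mounting works is left for a later installment.
  3. We want to write the file /testmount/testdir/testfile1.txt, so the kernel first has to find the dentries along the full path, and the missing inodes have to be created when they do not exist.
    3.1 From the full path, find the superblock of the matching mount point. Here the most precise match is the sb mounted on /testmount.
    3.2 With the sb found, take its root dentry and the inode behind it, and read the directory contents from disk through the inode's address_space. For a directory, the stored content is the list of all child entries, from which the children of the root dentry are built. No entry for testdir is found, so a "no such directory" error is reported. The user then creates the directory, and the new entry is written back to the device behind the inode. In the same way, testfile1.txt has to be created under /testdir and written back to the device behind the /testdir inode.
  4. With the inode found, we open the file with the open system call. The process allocates a file descriptor with the help of next_fd in files_struct, a file object is created, and inode->i_fop->open(inode, file) is called. The address_space of the inode is passed on to the file, and the user-space fd is then associated with that file object.
  5. After the file is open, all further reads and writes go through the fd. In the kernel this means operating on the file through the corresponding file structure. To write hello world, for example, file->f_op->write is called.
    What file->f_op->write really does is copy the bytes into the memory pages of the address_space. The address_space then picks a suitable time to write them back to disk. This is the caching we usually talk about, and the fsync system call can force the data to be synchronized to storage. The f_op functions carry the __user annotation, which says that the data comes from a user-space address. The data ultimately has to be written into the page memory of the kernel's address_space cache, which is the kernel copy people often mention. This later led to the well-known zero-copy sendfile, which copies data directly between two fds and only touches page data inside the kernel, without a detour through the user address space.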
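
Seen from user space, the flow above is just four system calls plus an optional fsync. The sketch below is a minimal illustration under stated assumptions: a temporary directory created under /tmp stands in for /testmount (no disk is formatted or mounted), and the function name write_hello_world is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if "hello world" was written and read back, -1 otherwise. */
int write_hello_world(void)
{
    const char msg[] = "hello world";
    char base[] = "/tmp/testmount_XXXXXX";   /* stand-in for /testmount */
    char dir[64], file[128], buf[32] = {0};
    int ok = -1;

    if (mkdtemp(base) == NULL)
        return -1;
    snprintf(dir, sizeof dir, "%s/testdir", base);
    snprintf(file, sizeof file, "%s/testfile1.txt", dir);

    if (mkdir(dir, 0777) == 0) {                      /* 1. mkdir */
        int fd = creat(file, 0644);                   /* 2. creat */
        if (fd >= 0) {
            close(fd);
            fd = open(file, O_RDWR);                  /* 3. open */
        }
        if (fd >= 0) {
            ssize_t n = write(fd, msg, strlen(msg));  /* 4. write to the page cache */
            if (n == (ssize_t)strlen(msg) &&
                fsync(fd) == 0 &&                     /* force write-back */
                pread(fd, buf, sizeof buf - 1, 0) == n)
                ok = (strcmp(buf, msg) == 0) ? 0 : -1;
            close(fd);
        }
        unlink(file);
        rmdir(dir);
    }
    rmdir(base);
    return ok;
}
```

Each numbered comment corresponds to one step of the system call sequence named earlier; the fsync call is the explicit synchronization mentioned in step 5.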

Conclusion

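That completes the basic VFS flow. Mounting of the super_block and the concrete read and write operations of address_space will be added later. address_space will be covered in detail together with the page cache and the buffer cache, because that part is particularly complex and tied to the concrete file system implementation; it will be presented alongside the Ext2 file system.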
