Hadoop (3.3.4) – HDFS Operations

Apache Hadoop 3.3.4 – Overview

01.appendToFile

hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hdfs dfs -appendToFile /root/tmp/202302/02/1.txt hdfs://192.168.88.161:8020/tmp/test20230202/1.txt

02.cat

-ignoreCrc  Disable checksum verification.
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4
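
A hedged example of skipping checksum verification with -ignoreCrc; the path reuses the illustrative one above:

hadoop fs -cat -ignoreCrc /user/hadoop/file4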

03.checksum

-v  Display the block size of the file.
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///etc/hosts
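
A hedged example combining -v with the path above to also print the file's block size:

hadoop fs -checksum -v hdfs://nn1.example.com/file1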

04.chgrp

Change the group association of files. The user must be the owner of the file, or a superuser. Additional information is in the Permissions Guide.

-R  Change the group association recursively through the directory structure.
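
A minimal sketch; the group name and path below are hypothetical:

hadoop fs -chgrp -R hadoop /user/hadoop/dir1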

05.chmod

-R  Change the permissions recursively.
hdfs dfs -chmod -R 777 /tmp/tmp

06.chown

-R  Change the owner recursively.
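
A minimal sketch; the user, group, and path below are hypothetical:

hdfs dfs -chown -R hadoop:hadoop /tmp/tmp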

07.copyFromLocal

Upload files from the local filesystem to HDFS; identical to -put.
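
A minimal sketch, mirroring the -put examples later in this document (paths are illustrative):

hadoop fs -copyFromLocal localfile /user/hadoop/hadoopfile
hadoop fs -copyFromLocal -f localfile /user/hadoop/hadoopfile   # -f overwrites an existing destination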

08.copyToLocal

Download files from HDFS to the local filesystem; identical to -get.
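
A minimal sketch, mirroring -get (paths are illustrative):

hadoop fs -copyToLocal /user/hadoop/hadoopfile ./localfile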

09.count

Count the number of directories, files and bytes under the paths that match the specified file pattern, and get quota and usage. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.

-q  Show quotas (the -q and -u options control which columns the output contains).
-u  Limit the output to showing quotas and usage only.
-v  Display a header line.
-x  Exclude snapshots from the result calculation. Without -x (the default), the result is calculated from all INodes, including all snapshots under the given path. -x is ignored if -u or -q is given.
-h  Show sizes in a human-readable format (B, K, M, G).
-e  Show the erasure coding policy.
-s  Show the snapshot count for each directory.
hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hadoop fs -count -q -h -v hdfs://nn1.example.com/file1
hadoop fs -count -u hdfs://nn1.example.com/file1
hadoop fs -count -u -h hdfs://nn1.example.com/file1
hadoop fs -count -u -h -v hdfs://nn1.example.com/file1
hadoop fs -count -e hdfs://nn1.example.com/file1
hadoop fs -count -s hdfs://nn1.example.com/file1

10.test

Test whether a file or directory exists in HDFS.

Option  Description
-d  Return 0 if the path is a directory, otherwise return 1.
-e  Return 0 if the path exists, otherwise return 1.
-f  Return 0 if the path is a file, otherwise return 1.
-s  Return 0 if the file at the path is larger than zero bytes, otherwise return 1.
-z  Return 0 if the file at the path is zero bytes in size, otherwise return 1.
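
The result is reported only through the exit status, so it is usually combined with a shell check; a minimal sketch (paths are hypothetical):

hadoop fs -test -e /user/hadoop/file1 && echo "exists"
hadoop fs -test -d /user/hadoop/dir1; echo $?   # 0 if it is a directory, otherwise 1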

11.getmerge

# Merge the files under an HDFS directory and download them as a single local file
hdfs dfs -getmerge hdfs://ip:port/tmp/tmp ./value.txt

12.expunge

Empty the trash.
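
A minimal invocation:

hadoop fs -expunge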

13.skipTrash

Delete immediately without moving the files to trash; this is an option of -rm rather than a standalone command.
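
A minimal sketch; the path is hypothetical:

hadoop fs -rm -r -skipTrash /tmp/tmp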

14.report

Show the total HDFS capacity and current usage.

hdfs dfsadmin -report

15.distcp

Option  Description
-append  Reuse existing data in the target files and append new data to them where possible, instead of overwriting.
-async  Run DistCp asynchronously; the command launches the job and returns without blocking on its completion.
-atomic  Commit all changes or none.
-bandwidth <arg>  Specify the bandwidth per map, in MB/second.
-delete  Delete files that exist in the target but not in the source; the deleted files go to the HDFS trash if it is enabled.
-diff  Use a snapshot diff report to identify the differences between the source and the target.
-f  A file containing the list of paths to copy.
-filelimit <arg>  (Deprecated!) Limit the number of files copied to <= n.
-filters <arg>  A file containing patterns of paths to exclude from the copy.
-i  Ignore failures during the copy.
-log <arg>  Directory on HDFS where the DistCp execution log is saved.
-m <arg>  Maximum number of simultaneous maps used for the copy (the default is 20).
-mapredSslConf <arg>  SSL configuration file, to be used with hftps://.
-numListstatusThreads <arg>  Number of threads used to build the file listing (at most 40); increase it when the directory structure is complex.
-overwrite  Unconditionally overwrite target files, even if they exist.
-p <arg>  Preserve source file attributes (rbugpcaxt: replication, block size, user, group, permissions, checksum type, ACLs, XATTRs, timestamps).
-sizelimit <arg>  (Deprecated!) Limit the total number of bytes copied to <= n.
-skipcrccheck  Skip CRC checks between the source and target paths.
-strategy <arg>  Copy strategy. The default, uniformsize, balances the total size copied by each map; dynamic lets faster maps copy more files, which can improve performance.
-tmp <arg>  Intermediate work path to be used for the atomic commit.
-update  Copy only files that are missing from the target or that differ from the source; files whose name and size already match are skipped.
hadoop distcp -i  -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

hadoop distcp -i -update -delete -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp

16.find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern
    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print
    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression

    expression -and expression

    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

hadoop fs -find / -name test -print

17.ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

  • -C: Display the paths of files and directories only.
  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -q: Print ? instead of non-printable characters.
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
  • -e: Display the erasure coding policy of files and directories only.

For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

hadoop fs -ls /user/hadoop/file1
hadoop fs -ls -e /ecdir

18.mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path URIs as arguments and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

19.mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

20.put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [-t <thread count>] [-q <thread pool queue size>] [ - | <localsrc> ...] <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

  • -p : Preserves access and modification times, ownership and the permissions (assuming the permissions can be propagated across filesystems).
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.
  • -t <thread count> : Number of threads to be used, default is 1. Useful when uploading directories containing more than 1 file.
  • -q <thread pool queue size> : Thread pool queue size to be used, default is 1024. It takes effect only when the thread count is greater than 1.

Examples:

hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile   # reads the input from stdin
hadoop fs -put -t 5 localdir hdfs://nn.example.com/hadoop/hadoopdir
hadoop fs -put -t 10 -q 2048 localdir1 localdir2 hdfs://nn.example.com/hadoop/hadoopdir

21.rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
  • The -safely option will require safety confirmation before deleting a directory with a total number of files greater than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories. Expect a delay when walking over a large directory recursively to count the number of files to be deleted before the confirmation.

Example:

hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

22.rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

hadoop fs -rmdir /user/hadoop/emptydir

23.tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

hadoop fs -tail pathname
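
A sketch of following appended data with -f, reusing the placeholder path above:

hadoop fs -tail -f pathname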

24.touch

Usage: hadoop fs -touch [-a] [-m] [-t TIMESTAMP] [-c] URI [URI ...]

Updates the access and modification times of the file specified by the URI to the current time. If the file does not exist, then a zero length file is created at URI with current time as the timestamp of that URI.

  • Use -a option to change only the access time
  • Use -m option to change only the modification time
  • Use -t option to specify timestamp (in format yyyyMMdd:HHmmss) instead of current time
  • Use -c option to not create file if it does not exist

The timestamp format is as follows:

  • yyyy: four-digit year (e.g. 2018)
  • MM: two-digit month of the year (e.g. 08 for the month of August)
  • dd: two-digit day of the month (e.g. 01 for the first day of the month)
  • HH: two-digit hour of the day using 24-hour notation (e.g. 23 stands for 11 pm, 11 stands for 11 am)
  • mm: two-digit minutes of the hour
  • ss: two-digit seconds of the minute

e.g. 20180809:230000 represents August 9th 2018, 11pm.

Example:

hadoop fs -touch pathname
hadoop fs -touch -m -t 20180809:230000 pathname
hadoop fs -touch -t 20180809:230000 pathname
hadoop fs -touch -a pathname

25.touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

hadoop fs -touchz pathname

26.help

# Show the help text for the ls command
hadoop fs -help ls

27.Convert an fsimage file to XML

hdfs oiv -p XML -i <fsimage file> -o <output file path>
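
A hedged, concrete example; the fsimage filename and paths below are hypothetical (fsimage files live under the NameNode's dfs.namenode.name.dir/current directory):

hdfs oiv -p XML -i /data/hadoop/dfs/name/current/fsimage_0000000000000000025 -o /tmp/fsimage.xml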

28.Convert an edits file to XML

hdfs oev -p XML -i <edits file> -o <output file path>
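
A hedged, concrete example; the edits filename and paths below are hypothetical:

hdfs oev -p XML -i /data/hadoop/dfs/name/current/edits_0000000000000000001-0000000000000000012 -o /tmp/edits.xml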

29.Check supported native compression libraries

hadoop checknative

Command collection: working with object stores

The Hadoop FileSystem shell works with Object Stores such as Amazon S3, Azure ABFS and Google GCS.

# Create a directory
hadoop fs -mkdir s3a://bucket/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket/datasets/

# touch a file
hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched

Unlike a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. As many of the filesystem shell operations use renaming as the final stage in operations, skipping that stage can avoid long delays.

In particular, the put and copyFromLocal commands should both have the -d options set for a direct upload.

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/

# Upload a file from under the user's home directory in the local filesystem.
# Note it is the shell expanding the "~", not the hadoop fs command
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/

# create a file from stdin
# the special "-" source means "use stdin"
echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

Objects can be downloaded and viewed:

# copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket/datasets/

# copy a file from the object store to the cluster filesystem.
hadoop fs -get wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt /examples

# print the object
hadoop fs -cat wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# print the object, unzipping it if necessary
hadoop fs -text wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

## download log files into a local file
hadoop fs -getmerge wasb://yourcontainer@youraccount.blob.core.windows.net/logs\* log.txt

Commands which list many files tend to be significantly slower than when working with HDFS or other filesystems.

hadoop fs -count s3a://bucket/
hadoop fs -du s3a://bucket/

Other slow commands include find, mv, cp and rm.

Find

This can be very slow on a large store with many directories under the path supplied.

# enumerate all files in the object store's container.
hadoop fs -find s3a://bucket/ -print

# remember to escape the wildcards to stop the shell trying to expand them first
hadoop fs -find s3a://bucket/datasets/ -name \*.txt -print

Rename

The time to rename a file depends on its size.

The time to rename a directory depends on the number and size of all files beneath that directory.

hadoop fs -mv s3a://bucket/datasets s3a://bucket/historical

If the operation is interrupted, the object store will be in an undefined state.

Copy

hadoop fs -cp s3a://bucket/datasets s3a://bucket/historical

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and the bandwidth in both directions between the local computer and the object store.

The further the computer is from the object store, the longer the copy takes.

Deleting objects

The rm command will delete objects and directories full of objects. If the object store is eventually consistent, fs ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.

If the filesystem client is configured to copy files to a trash directory, this will be in the bucket; the rm operation will then take time proportional to the size of the data. Furthermore, the deleted files will continue to incur storage costs.

To avoid this, use the -skipTrash option.

hadoop fs -rm -skipTrash s3a://bucket/dataset

Data moved to the .Trash directory can be purged using the expunge command. As this command only works with the default filesystem, it must be configured to make the default filesystem the target object store.

hadoop fs -expunge -D fs.defaultFS=s3a://bucket/