How can I split a large text file into smaller files with an equal number of lines?

powerit 2023. 4. 13. 21:13

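I have a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it into 10 files of 200k lines each, or 100 files of 20k lines each (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering whether there is some ninja way to do it with Bash and Unix utilities (as opposed to manually looping and counting or partitioning lines).

Have a look at the split command: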

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

This creates files of 200000 lines each, named xaa xab xac ...
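A quick way to check this behaviour against the numbers in the question is sketched below. The scratch directory, the `bigfile.txt` name, and the `part_` prefix are made up for the demo; `seq` only serves to generate 2,000,000 numbered lines.

```shell
# Hypothetical demo: file names and paths are illustrative only.
workdir=$(mktemp -d)

# Generate a 2,000,000-line test file, one number per line.
seq 1 2000000 > "$workdir/bigfile.txt"

# 200,000 lines per piece; output goes to part_aa, part_ab, ...
split -l 200000 "$workdir/bigfile.txt" "$workdir/part_"

# Count the pieces and the lines in the first one.
piece_count=$(ls "$workdir"/part_* | wc -l)
first_piece_lines=$(wc -l < "$workdir/part_aa")
echo "pieces: $piece_count, lines in part_aa: $first_piece_lines"
```

Passing an explicit prefix such as `part_` keeps the pieces easy to find and delete afterwards, which the default `x` prefix does not.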

Another option is to split by output file size (still breaking only at line ends):

 split -C 20m --numeric-suffixes input_filename output_prefix

This creates files such as output_prefix00 output_prefix01 output_prefix02 ... (numeric suffixes start at 00), each at most 20 megabytes in size.
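To see that `-C` respects both the size cap and line boundaries, a small experiment along these lines may help. It assumes GNU split; the 4k limit and the `input.txt` / `out_` names are chosen only to keep the demo small.

```shell
# Hypothetical demo; file names and the 4k limit are illustrative only.
workdir=$(mktemp -d)
seq 1 5000 > "$workdir/input.txt"

# At most 4096 bytes of whole lines per piece, numeric suffixes (out_00, out_01, ...)
split -C 4k --numeric-suffixes "$workdir/input.txt" "$workdir/out_"

for piece in "$workdir"/out_*; do
    size=$(wc -c < "$piece")
    # $(...) strips a trailing newline, so an empty result means the piece ends with one
    if [ -z "$(tail -c 1 "$piece")" ]; then
        echo "$(basename "$piece"): $size bytes, ends on a line boundary"
    else
        echo "$(basename "$piece"): $size bytes, ends mid-line"
    fi
done
```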

Use the split command:

split -l 200000 mybigfile.txt

Yes, there is a split command. It splits a file by lines or by bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
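The difference between the decimal and binary multipliers in that help text is easy to miss. The sketch below, which assumes GNU split and uses made-up file names, splits the same 4096-byte file with `1K` and with `1kB` and counts the pieces.

```shell
# Hypothetical demo; blob.bin and the bin_/dec_ prefixes are illustrative only.
workdir=$(mktemp -d)
head -c 4096 /dev/zero > "$workdir/blob.bin"      # 4096 zero bytes

split -b 1K  "$workdir/blob.bin" "$workdir/bin_"  # K  = 1024 bytes per piece
split -b 1kB "$workdir/blob.bin" "$workdir/dec_"  # kB = 1000 bytes per piece

binary_pieces=$(ls "$workdir"/bin_* | wc -l)
decimal_pieces=$(ls "$workdir"/dec_* | wc -l)
echo "1K pieces: $binary_pieces, 1kB pieces: $decimal_pieces"
```

With 1024-byte pieces the file divides evenly into four; with 1000-byte pieces a fifth, 96-byte piece holds the remainder.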

Split the file "file.txt" into files of 10,000 lines each:

split -l 10000 file.txt

To split a large text file into smaller files of 1000 lines each:

split <file> -l 1000

To split a large binary file into smaller files of 10M each:

split <file> -b 10M

To join the split files back into a single file:

cat x* > <file>
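Rejoining works because the shell expands `x*` in sorted order, which matches the order in which split wrote the pieces. A round trip can be verified roughly as follows; the file names are placeholders, and the rejoined file is deliberately given a name that does not match the `x*` glob.

```shell
# Hypothetical round-trip check; all file names are illustrative only.
workdir=$(mktemp -d)
seq 1 25000 > "$workdir/original.txt"

# Split into 1000-line pieces named xaa, xab, ...
split -l 1000 "$workdir/original.txt" "$workdir/x"

# Concatenate the pieces in glob (sorted) order.
cat "$workdir"/x* > "$workdir/rejoined.txt"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$workdir/original.txt" "$workdir/rejoined.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
```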

Split a file so that each piece has 10 lines (except the last one):

split -l 10 filename

Split a file into 5 files of equal size (except the last one). The split is by bytes, so lines may be cut:

split -n 5 filename

Split a file into pieces of 512 bytes each (except the last one; use 512k for kilobytes and 512m for megabytes):

split -b 512 filename

Split a file into pieces of at most 512 bytes each without breaking lines:

split -C 512 filename

Use split:

It splits a file into fixed-size pieces, creating output files that contain consecutive sections of INPUT (standard input if none is given or if INPUT is '-').

Syntax split [options] [INPUT [PREFIX]]

Use sed:

sed -n '1,100p' filename > output.txt

Here 1 and 100 are the first and last line numbers to capture into output.txt.
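Since sed extracts one range at a time, it suits pulling out a single slice rather than splitting a whole file. A small wrapper might look like the sketch below; `extract_lines` is a hypothetical helper name, and the added `q` command only makes sed stop reading once the last wanted line has been printed.

```shell
# extract_lines START END FILE: print lines START..END of FILE (hypothetical helper)
extract_lines() {
    # 'p' prints the range; 'q' at line END stops sed from reading the rest of the file
    sed -n "${1},${2}p;${2}q" "$3"
}

# Example: lines 3 to 5 of a ten-line sample file (prints 3, 4 and 5, one per line)
sample=$(mktemp)
seq 1 10 > "$sample"
extract_lines 3 5 "$sample"
```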

split (from GNU coreutils, since version 8.8 of 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. generates four files (output.a{a,b,c,d}) with the same number of bytes, but lines may be broken in the middle.

To keep full lines (that is, to split at line boundaries), this should work:

split -n l/4 input output.
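The two CHUNKS forms can be compared side by side. The sketch below assumes GNU split 8.8 or later and uses made-up names (`input`, `bytes.`, `lines.`); it reports for each piece whether it ends on a newline.

```shell
# Hypothetical comparison of "-n 4" and "-n l/4"; names are illustrative only.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/input"

split -n 4   "$workdir/input" "$workdir/bytes."   # equal byte counts, lines may be cut
split -n l/4 "$workdir/input" "$workdir/lines."   # whole lines only

for piece in "$workdir"/bytes.* "$workdir"/lines.*; do
    # $(...) strips a trailing newline, so an empty result means the piece ends with one
    if [ -z "$(tail -c 1 "$piece")" ]; then
        ending="ends on a newline"
    else
        ending="cut mid-line"
    fi
    echo "$(basename "$piece"): $(wc -c < "$piece") bytes, $ending"
done
```

Note that `l/4` balances the pieces by bytes while keeping lines whole, so the line counts per piece are only approximately equal when line lengths vary.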

Related answer: https://stackoverflow.com/a/19031247

You can also use AWK:

awk -v c=1 '{ print > (c ".txt") } NR % 200000 == 0 { close(c ".txt"); ++c }' largefile

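If you just want to split into files of x lines each, the answers about split are fine. But I am curious why no one paid attention to the requirements:

"without having to count them" -> use wc + cut
"with the remainder in an extra file" -> split does this by default

I can't do it without "wc + cut", but this is what I use:

split -l "$(( $(wc -l "$filename" | cut -d ' ' -f1) / chunks ))" "$filename"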

This is easy to add as a function in your .bashrc, so you can simply call it with the file name and the number of chunks:

split -l "$(( $(wc -l "$1" | cut -d ' ' -f1) / $2 ))" "$1"

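If you want just x chunks with no remainder in an extra file, adjust the formula to add (chunks - 1) to the line count before dividing, which rounds the lines-per-file figure up. I use this approach because I usually want x files rather than x lines per file:

split -l "$(( ($(wc -l "$1" | cut -d ' ' -f1) + $2 - 1) / $2 ))" "$1"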

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it.
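The rounding-up idea from this answer could be packaged as a shell function roughly like the one below. The name `split_into_chunks` and the optional prefix argument are assumptions made for this sketch, not part of the original answer; it reads the file through standard input so that wc prints only the number.

```shell
# split_into_chunks FILE CHUNKS [PREFIX]  (hypothetical helper, not from the original answer)
split_into_chunks() {
    local file=$1 chunks=$2 prefix=${3:-x}
    local total per_file

    # Reading from stdin makes wc print the count without the file name.
    total=$(wc -l < "$file")

    # Ceiling division: the remainder is absorbed instead of spilling into an extra file.
    per_file=$(( (total + chunks - 1) / chunks ))

    # split rejects a line count of 0, so bail out on an empty file.
    [ "$per_file" -gt 0 ] || return 1

    split -l "$per_file" "$file" "$prefix"
}

# Example: a 103-line file in 10 chunks gives 11 lines per file.
demo_dir=$(mktemp -d)
seq 1 103 > "$demo_dir/demo.txt"
split_into_chunks "$demo_dir/demo.txt" 10 "$demo_dir/chunk_"
```

Because the lines-per-file figure is rounded up, this produces at most CHUNKS files; for some inputs (for example 9 lines in 4 chunks) it produces fewer.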

For HDFS: take the small files and split them into pieces of a suitable size.

This method breaks lines in the middle:

split -b 125m compact.file -d -a 3 compact_prefix

I try to split into pieces of about 128 MB each.

# Split into pieces of about 128 MB, depending on whether the size unit is M or G. Please test before use.

beginsize=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $1 }')
sizeunit=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $2 }')
if [ "$sizeunit" = "G" ]; then
    # 1 G is 8 pieces of 128 M
    res=$(printf "%.f" "$(echo "scale=5; $beginsize * 8" | bc)")
else
    # Size assumed to be in M; printf rounds to the nearest integer. Ref: http://blog.csdn.net/naiveloafer/article/details/8783518
    res=$(printf "%.f" "$(echo "scale=5; $beginsize / 128" | bc)")
fi
echo "$res"
# Split into $res files with a numeric suffix. Ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name="${compact_file}_"
echo "compact_file_name: $compact_file_name"
split -n l/"$res" "$basedir/$compact_file" -d -a 3 "$basedir/$compact_file_name"

The following example splits the file "toSplit.txt" into smaller files of 200 lines each, named "splited00.txt", "splited01.txt", ..., "splited25.txt":

split -l 200 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited
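The file names listed above imply 26 pieces, which is what a 5,200-line input yields at 200 lines per file. The sketch below reproduces that with a generated file; it assumes a GNU split recent enough to support `--additional-suffix`, and the scratch directory is made up for the demo.

```shell
# Hypothetical demo: 5200 lines at 200 per file gives 26 pieces (00 to 25).
workdir=$(mktemp -d)
seq 1 5200 > "$workdir/toSplit.txt"

split -l 200 --numeric-suffixes --additional-suffix=".txt" \
    "$workdir/toSplit.txt" "$workdir/splited"

# List the first few pieces and remember the name of the last one.
ls "$workdir" | grep '^splited' | head -n 3
last_piece=$(ls "$workdir" | grep '^splited' | tail -n 1)
echo "last piece: $last_piece"
```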

Source URL: https://stackoverflow.com/questions/2016894/how-can-i-split-a-large-text-file-into-smaller-files-with-an-equal-number-of-lin
