My main aim is to split records into files according to the ID of each record. There are over 15 billion records right now, and that number will certainly grow, so I need a scalable solution using Amazon EMR. I have already got this working for a smaller dataset of around 900 million records.
The input files are in CSV format, and one of the fields needs to become the file name in the output. Say the input contains the following records:
awesomeID1, somedetail1, somedetail2
awesomeID1, somedetail3, somedetail4
awesomeID2, somedetail5, somedetail6
There should then be two output files, one named awesomeID1.dat and the other awesomeID2.dat, each containing the records for its respective ID.
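For context, the job keys each record by its first CSV field so that records can be grouped per ID. A minimal sketch of such a mapper (old mapred API; class and variable names here are just illustrative, not my exact code) looks like this:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (ID, full record) pairs, where the ID is the first CSV field.
public class IdExtractingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Text id = new Text();

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String record = line.toString();
        int comma = record.indexOf(',');
        if (comma < 0) {
            return; // skip malformed lines
        }
        id.set(record.substring(0, comma).trim());
        output.collect(id, line); // key = ID, value = the whole record
    }
}
```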
Size of the input: 600 GB in total (size of the gzipped files) per month, with each file around 2-3 GB. I need to process around 6 months or more at a time, so the total data size would be 6 × 600 GB (about 3.6 TB compressed).
Previously I was getting a "Too many open files" error when I was using FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> to write to S3 according to the ID value. Then, as I have explained here, instead of writing every file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files.
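For reference, a simplified version of that output format (the real class may differ in details) derives the output file name from the ID key like this:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each (ID, record) pair into a file named after the ID,
// e.g. awesomeID1.dat, awesomeID2.dat, ...
public class FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // One output file per ID
        return key.toString() + ".dat";
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Drop the key from the output line; only the record itself is written
        return null;
    }
}
```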
But now, with the increased amount of data, I am getting the following message from S3, after which it skips writing the file in question: "Please reduce your request rate." I also have to run the job on a cluster of 200 m1.xlarge machines, which then takes around 2 hours, so it is very costly too!
I would like a scalable solution that will not fail if the amount of data increases again in the future.