calling spark-ec2 from within an EC2 instance: ssh


In order to run Amplab's training exercises, I've created a keypair in us-east-1, installed the training scripts (git clone git://github.com/amplab/training-scripts.git -b ampcamp4), and set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, following the instructions at http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html
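
For completeness, the setup boils down to something like the following (the credential values shown are placeholders, and the key path is simply where I keep mine):

    git clone git://github.com/amplab/training-scripts.git -b ampcamp4
    export AWS_ACCESS_KEY_ID=AKIA................      # placeholder, not my real key
    export AWS_SECRET_ACCESS_KEY=....................  # placeholder
    chmod 600 ~/.ssh/myspark.pem                       # ssh refuses keys with looser permissions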

Now running

 ./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1  -k myspark --copy launch try1

generates the following messages:

 johndoe@ip-some-instance:~/projects/spark/training-scripts$ ./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1  -k myspark --copy launch try1
 Setting up security groups...
 Searching for existing cluster try1...
 Latest Spark AMI: ami-19474270
 Launching instances...
 Launched 5 slaves in us-east-1b, regid = r-0c5e5ee3
 Launched master in us-east-1b, regid = r-316060de
 Waiting for instances to start up...
 Waiting 120 more seconds...
 Copying SSH key /home/johndoe/.ssh/myspark.pem to master...
 ssh: connect to host ec2-54-90-57-174.compute-1.amazonaws.com port 22: Connection refused
 Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned  non-zero exit status 255, sleeping 30
 ssh: connect to host ec2-54-90-57-174.compute-1.amazonaws.com port 22: Connection refused
 Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
 ...
 ...
 subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com '/root/spark/bin/stop-all.sh'' returned non-zero exit status 127

where ec2-54-90-57-174.compute-1.amazonaws.com is the master instance and root is the login user. I've tried -u ec2-user and increasing -w all the way up to 600, but I get the same error.
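
For reference, the variations I tried looked roughly like this (same cluster name, only the SSH user and the wait time changed):

    ./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark -u ec2-user --copy launch try1
    ./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark -w 600 --copy launch try1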

I can see the master and slave instances in us-east-1 when I log into the AWS console, and I can actually ssh into the Master instance from the 'local' ip-some-instance shell.
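
For example, a direct connection like this one succeeds from the ip-some-instance shell (same key and host name as in the log above, assuming root is the right login user):

    ssh -i ~/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com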

My understanding is that the spark-ec2 script takes care of defining the master/slave security groups (which ports are open, and so on), and I shouldn't have to tweak these settings. That said, the master and slaves all accept inbound connections on port 22 (Port: 22, Protocol: tcp, Source: 0.0.0.0/0 in the ampcamp3-slaves/masters security groups).
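
In case it helps, this is roughly how I checked the inbound rules from the command line (assuming the AWS CLI is installed and configured; the group names are the ones shown in my console):

    aws ec2 describe-security-groups --region us-east-1 \
        --filters Name=group-name,Values=ampcamp3-masters,ampcamp3-slaves \
        --query 'SecurityGroups[].IpPermissions'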

I'm at a loss here, and would appreciate any pointers before I spend all my R&D funds on EC2 instances.... Thanks.

1 Answer
Animai°情兽

This is most likely caused by SSH taking a long time to start up on the instances, causing the 120-second timeout to expire before the machines could be logged into. You should be able to run

./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1  -k myspark --copy launch --resume try1

(with the --resume flag) to continue from where things left off without launching new instances. This issue will be fixed in Spark 1.2.0, where we have a new mechanism that intelligently checks the SSH status rather than relying on a fixed timeout. We're also addressing the root causes behind the long SSH startup delay by building new AMIs.
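
Until then, the idea behind the fix can be approximated with a small shell loop that retries a trivial SSH command until it succeeds, instead of sleeping for a fixed interval (the host name, key path, and retry count below are placeholders, not what spark-ec2 itself will use):

    MASTER=ec2-54-90-57-174.compute-1.amazonaws.com   # placeholder: your master's public DNS name
    for attempt in $(seq 1 30); do
        # try a no-op command over SSH; success means sshd is accepting connections
        if ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
               -i ~/.ssh/myspark.pem root@"$MASTER" true 2>/dev/null; then
            echo "SSH is up after $attempt attempt(s)"
            break
        fi
        echo "SSH not ready yet, retrying in 10 seconds..."
        sleep 10
    done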
