Setting Up a Local JetBrains IDEA + Spark Development Environment on Windows

Tags: #spark# #big-data# #programming#  Date: 2018/07/26 09:46:27  Author: 小木

Install the JDK, Scala, and Python, and verify each from the command line. Note that this machine runs Spark 2.3.1 built against Hadoop 2.7, which requires JDK 1.8 and Scala 2.11; the Python used here is 3.1. The exact patch version does not matter, but the major version numbers must match.

1. Install and verify the JDK
2. Install and verify Scala
3. Install and verify Python
4. Download Spark from http://spark.apache.org/downloads.html . I downloaded spark-2.3.1-bin-hadoop2.7.tgz; you can also get it from the Huawei Cloud mirror at https://mirrors.huaweicloud.com/ .

5. After the download completes, extract the archive twice (the .tgz unpacks to a .tar, which unpacks to the folder) to get the spark-2.3.1-bin-hadoop2.7 folder, then copy that folder to a directory of your choice. Note: the path must not contain spaces, or you will run into errors.
6. Configure the global environment variable SPARK_HOME and set it to D:\ProgramFiles\spark-2.3.1-bin-hadoop2.7 (my install path), then add D:\ProgramFiles\spark-2.3.1-bin-hadoop2.7\bin to PATH. Open cmd and run spark-shell to enter Spark's interactive shell and start coding. Note: if you hit an error like "Failed to locate the winutils binary in the hadoop binary path", it is because winutils is missing. The fix is to download the corresponding Hadoop release, set HADOOP_HOME for it, then download the matching winutils.exe from https://github.com/steveloughran/winutils/ and copy it into Hadoop's bin directory. You can then confirm the shell works with a quick job, as sketched below.
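As a quick sanity check (my own addition, not part of the original steps), paste a small job into spark-shell; sc is the SparkContext that the shell creates for you:

// Paste into spark-shell: a tiny job to confirm local execution works.
// `sc` is the SparkContext that spark-shell pre-creates.
val nums = sc.parallelize(1 to 1000)        // distribute a small range of integers
val total = nums.map(_ * 2).sum()           // run a simple map + sum job
println(s"sum of doubled values = $total")  // should print 1001000.0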

7. Writing Spark programs in IDEA is also straightforward: create a new Scala project in IDEA, then add Maven support to it.
Add the following to the project's pom.xml. Note that scala.version must match both the Scala version you installed and the version this Spark build requires (a small sanity-check program follows the pom.xml below).

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>dufei</groupId>
  <artifactId>spark_ml</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <scala.version>2.11.8</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.3.1</version>
      <scope>compile</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.1</version>
    </dependency>
  </dependencies>

</project>
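Before moving on to a real program, a minimal sketch like the following (my own addition; the object name and appName are arbitrary) can confirm that the Maven dependencies resolve and that Spark starts inside IDEA:

import org.apache.spark.sql.SparkSession

// Minimal check that the spark-sql dependency resolves and a local session starts.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Version Check")
      .master("local[*]")
      .getOrCreate()
    println(s"Running Spark version ${spark.version}")
    spark.stop()
  }
}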

8. Write a Spark program, for example one that uses the KMeans method from MLlib (the RDD-based API). The code is as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansTest {

  def main(args: Array[String]): Unit = {

    // Run Spark locally, using all available cores
    val masterURL = "local[*]"

    val conf = new SparkConf().setAppName("KMeans Test").setMaster(masterURL)
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Load and parse the data: one whitespace-separated numeric vector per line (see sample below)
    val data = sc.textFile("file:/d:/data/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // Cluster the data into two classes using KMeans
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println(s"Within Set Sum of Squared Errors = $WSSSE")

    // Save the trained model and load it back to confirm it round-trips
    clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
    val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

    sc.stop()
  }
}
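For reference, the program expects a plain-text file with one whitespace-separated numeric vector per line. The path in the code points at the small sample that ships with the Spark distribution as data/mllib/kmeans_data.txt, which looks like this (copy it to d:\data\kmeans_data.txt or adjust the path):

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2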

Click Run and the program will execute.
