Setting Up a Local JetBrains IDEA + Spark Development Environment on Windows
Install the JDK, Scala, and Python, and verify each from the command line. Note: this machine runs Spark 2.3.1 built for Hadoop 2.7, which requires JDK 1.8 and Scala 2.11; the Python used here is 3.1. The minor version numbers do not matter much, but the major versions must match.
1. Install and test the JDK (e.g., run java -version in cmd)
2. Install and test Scala (e.g., run scala -version)
3. Install and test Python (e.g., run python --version)
4. Download Spark from http://spark.apache.org/downloads.html. I downloaded spark-2.3.1-bin-hadoop2.7.tgz; you can also get it from the mirror provided by Huawei Cloud: https://mirrors.huaweicloud.com/
5. After downloading, extract the archive twice (.tgz to .tar, then .tar to a folder) to get the spark-2.3.1-bin-hadoop2.7 folder, and copy that folder to a directory of your choice. (Note: the path must not contain spaces, or you will run into errors.)
6. Configure the global environment variables: set SPARK_HOME to D:\ProgramFiles\spark-2.3.1-bin-hadoop2.7 (that is my path), and add D:\ProgramFiles\spark-2.3.1-bin-hadoop2.7\bin to PATH. Then open cmd and run spark-shell to enter Spark's interactive shell and start coding. Note: if you hit an error like "Failed to locate the winutils binary in the hadoop binary path", it is caused by a missing winutils. The fix is to download the corresponding Hadoop distribution, configure HADOOP_HOME, download the matching winutils.exe from https://github.com/steveloughran/winutils/, and copy it into Hadoop's bin directory.
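As a quick sanity check that the shell works, you can paste a short snippet at the spark-shell prompt. This is a minimal sketch of my own (not from the original post); sc is the SparkContext that spark-shell creates for you automatically:

// Run inside spark-shell; sc is provided by the shell
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())   // should print 5050.0 if Spark is working

If the expected sum is printed, the standalone installation is working.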
7. Writing Spark programs in IDEA is also straightforward: create a new Scala project in IDEA, then add Maven support to the project.
Add the following to the Maven pom.xml. Note that scala.version must match both the Scala version you installed and the version Spark requires.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>dufei</groupId>
    <artifactId>spark_ml</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version>2.11.8</scala.version>
    </properties>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.3.1</version>
            <scope>compile</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
    </dependencies>
</project>
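Before moving on to the MLlib example, it can help to confirm that the dependencies resolve and that Spark starts in local mode from inside IDEA. Below is a minimal sketch; the object name SparkSetupCheck is my own choice, not part of the original setup:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal program to verify that the Maven dependencies resolve
// and that Spark can start in local mode from inside IDEA.
object SparkSetupCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Setup Check").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    println(s"Spark ${sc.version} started, default parallelism = ${sc.defaultParallelism}")
    sc.stop()
  }
}

If this prints the Spark version (2.3.1 here), the project is wired up correctly.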
8. Write a Spark program, for example using the KMeans method from MLlib (the RDD-based API). The code, wrapped in a runnable object with its imports, is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansTest {
  def main(args: Array[String]): Unit = {
    val masterURL = "local[*]"
    val conf = new SparkConf().setAppName("KMeans Test").setMaster(masterURL)
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Load and parse the data: one space-separated vector per line
    val data = sc.textFile("file:/d:/data/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // Cluster the data into two classes using KMeans
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)
    println(s"Within Set Sum of Squared Errors = $WSSSE")

    // Save and load model
    clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
    val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

    sc.stop()
  }
}
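The program reads d:/data/kmeans_data.txt, which you need to create yourself. The parsing code expects one space-separated numeric vector per line; a small file like the following (modeled on the kmeans_data.txt that ships with Spark under data/mllib) is enough to try it out:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

With this input and numClusters = 2, the two cluster centers should end up near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1).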
Click Run and the program will execute.
