Project Setup

Configure an Existing Project

To add Emma to an existing project, add the emma-language dependency

<!-- Core Emma API and compiler infrastructure -->
<dependency>
    <groupId>org.emmalanguage</groupId>
    <artifactId>emma-language</artifactId>
    <version>0.2.3</version>
</dependency>

and either emma-flink or emma-spark, depending on the desired execution backend.

<!-- Emma backend for Flink -->
<dependency>
    <groupId>org.emmalanguage</groupId>
    <artifactId>emma-flink</artifactId>
    <version>0.2.3</version>
</dependency>

<!-- Emma backend for Spark -->
<dependency>
    <groupId>org.emmalanguage</groupId>
    <artifactId>emma-spark</artifactId>
    <version>0.2.3</version>
</dependency>
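
To confirm that the Emma artifacts actually ended up on the project classpath, something like the following may help. It is only a sketch that relies on the standard Maven dependency plugin and its `includes` filter; the function name `show_emma_deps` is made up for this example and is not part of Emma.

```shell
# Hypothetical helper: print only the org.emmalanguage entries of the dependency tree.
# Run it from the directory that contains the pom.xml you edited.
show_emma_deps() {
  mvn dependency:tree -Dincludes=org.emmalanguage
}
```

Calling `show_emma_deps` should list emma-language plus whichever backend artifact you added.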

Set Up a New Project

To bootstrap a new project org.acme:emma-quickstart from a Maven archetype, use the following command.

mvn archetype:generate -B                  \
    -DartifactId=emma-quickstart           \
    -DgroupId=org.acme                     \
    -Dversion=0.1-SNAPSHOT                 \
    -Dpackage=org.acme.emma                \
    -DarchetypeArtifactId=emma-quickstart  \
    -DarchetypeGroupId=org.emmalanguage    \
    -DarchetypeVersion=0.2.3

Build the project with one of the following commands.

mvn package # without tests
mvn verify  # with tests
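
The run commands later in this guide expect one jar per backend under the module's target directory. A quick check along these lines can tell you whether the build produced them. This is a sketch: `check_jars` is a hypothetical helper name, and the jar paths are simply the ones used in the Flink and Spark commands of this guide.

```shell
# Hypothetical helper: report whether each backend jar has been built.
# Run from the root of the generated emma-quickstart project.
check_jars() {
  rc=0
  for backend in flink spark; do
    jar="emma-quickstart-$backend/target/emma-quickstart-$backend-0.1-SNAPSHOT.jar"
    if [ -f "$jar" ]; then
      echo "found   $jar"
    else
      echo "missing $jar"
      rc=1
    fi
  done
  # Non-zero exit status if at least one jar is missing.
  return "$rc"
}
```

For example, `mvn package && check_jars` stops with a non-zero status if either jar is absent.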

HDFS Setup

If you are not familiar with Hadoop, check the “Getting started with Hadoop” guide.

To run the algorithms on a Flink or Spark cluster, copy the input files to HDFS.

Assuming a variable that points to bin/hdfs

export HDFS=/path/to/hadoop-2.x/bin/hdfs
export HDFS_ADDR="$HOSTNAME:9000"

you can run the following commands.

$HDFS dfs -mkdir -p /tmp/output
$HDFS dfs -mkdir -p /tmp/input
$HDFS dfs -copyFromLocal emma-quickstart-library/src/test/resources/* /tmp/input/.
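
Before submitting a job, it can be worth checking that the input files used by the examples in this guide are really in HDFS. The sketch below uses `hdfs dfs -test -e`, which exits with status 0 when a path exists. The helper name `check_hdfs_inputs` is hypothetical, and the three file paths are taken from the run commands in this guide.

```shell
# Hypothetical helper: verify that the example inputs exist under /tmp/input.
# Assumes $HDFS is exported as shown above.
check_hdfs_inputs() {
  rc=0
  for f in text/jabberwocky.txt \
           graphs/trans-closure/edges.tsv \
           ml/clustering/kmeans/points.tsv; do
    # 'dfs -test -e' returns 0 if the path exists, non-zero otherwise.
    if "$HDFS" dfs -test -e "/tmp/input/$f"; then
      echo "ok      /tmp/input/$f"
    else
      echo "missing /tmp/input/$f"
      rc=1
    fi
  done
  return "$rc"
}
```

If any file is reported as missing, rerun the `-copyFromLocal` step from the project root.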

Running the Examples on Flink

If you are not familiar with Flink, check the “Getting started with Flink” guide.

Assuming a variable that points to bin/flink

export FLINK=/path/to/flink-1.2.x/bin/flink

and a local filesystem path shared between all nodes in your Flink cluster

export CODEGEN=/tmp/emma/codegen

you can run the algorithms in your quickstart project with one of the following commands.

WordCount:

$FLINK run -C "file://$CODEGEN/" \
  emma-quickstart-flink/target/emma-quickstart-flink-0.1-SNAPSHOT.jar \
  word-count \
  hdfs://$HDFS_ADDR/tmp/input/text/jabberwocky.txt \
  hdfs://$HDFS_ADDR/tmp/output/wordcount-output.tsv \
  --codegen "$CODEGEN"

Transitive Closure:

$FLINK run -C "file://$CODEGEN/" \
  emma-quickstart-flink/target/emma-quickstart-flink-0.1-SNAPSHOT.jar \
  transitive-closure \
  hdfs://$HDFS_ADDR/tmp/input/graphs/trans-closure/edges.tsv \
  hdfs://$HDFS_ADDR/tmp/output/trans-closure-output.tsv \
  --codegen "$CODEGEN"

K-Means:

$FLINK run -C "file://$CODEGEN/" \
  emma-quickstart-flink/target/emma-quickstart-flink-0.1-SNAPSHOT.jar \
  k-means 2 4 0.001 10 \
  hdfs://$HDFS_ADDR/tmp/input/ml/clustering/kmeans/points.tsv \
  hdfs://$HDFS_ADDR/tmp/output/kmeans-output.tsv \
  --codegen "$CODEGEN"
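
The three Flink invocations differ only in the algorithm name and its arguments, so the shared parts can be factored into a small shell function. The following is a sketch, not something shipped with Emma: `emma_flink` is a hypothetical name, and it assumes `$FLINK` and `$CODEGEN` are exported as described above.

```shell
# Hypothetical wrapper around the Flink run command used in this guide.
# Everything passed to the function is forwarded as the algorithm name
# followed by its arguments; the classpath and --codegen options are fixed.
emma_flink() {
  "$FLINK" run -C "file://$CODEGEN/" \
    emma-quickstart-flink/target/emma-quickstart-flink-0.1-SNAPSHOT.jar \
    "$@" \
    --codegen "$CODEGEN"
}
```

With this in place, the WordCount job becomes `emma_flink word-count hdfs://$HDFS_ADDR/tmp/input/text/jabberwocky.txt hdfs://$HDFS_ADDR/tmp/output/wordcount-output.tsv`.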

Running the Examples on Spark

If you are not familiar with Spark, check the “Getting started with Spark” guide.

Assuming a variable that points to bin/spark-submit

export SPARK=/path/to/spark-2.1.x/bin/spark-submit

and a Spark master URL

export SPARK_ADDR="$HOSTNAME:7077"

you can run the algorithms in your quickstart project with one of the following commands.

WordCount:

$SPARK --master "spark://$SPARK_ADDR" \
  emma-quickstart-spark/target/emma-quickstart-spark-0.1-SNAPSHOT.jar \
  word-count \
  hdfs://$HDFS_ADDR/tmp/input/text/jabberwocky.txt \
  hdfs://$HDFS_ADDR/tmp/output/wordcount-output.tsv \
  --master "spark://$SPARK_ADDR"

Transitive Closure:

$SPARK --master "spark://$SPARK_ADDR" \
  emma-quickstart-spark/target/emma-quickstart-spark-0.1-SNAPSHOT.jar \
  transitive-closure \
  hdfs://$HDFS_ADDR/tmp/input/graphs/trans-closure/edges.tsv \
  hdfs://$HDFS_ADDR/tmp/output/trans-closure-output.tsv \
  --master "spark://$SPARK_ADDR"

K-Means:

$SPARK --master "spark://$SPARK_ADDR" \
  emma-quickstart-spark/target/emma-quickstart-spark-0.1-SNAPSHOT.jar \
  k-means 2 4 0.001 10 \
  hdfs://$HDFS_ADDR/tmp/input/ml/clustering/kmeans/points.tsv \
  hdfs://$HDFS_ADDR/tmp/output/kmeans-output.tsv \
  --master "spark://$SPARK_ADDR"
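
Note that each Spark command passes `--master` twice: the occurrence before the jar is read by spark-submit itself, while the one after the program arguments is handed to the application. A wrapper function can keep both in sync. This is a sketch under the assumption that `$SPARK` and `$SPARK_ADDR` are exported as described above; `emma_spark` is a hypothetical name.

```shell
# Hypothetical wrapper around the spark-submit command used in this guide.
# Arguments to the function are forwarded as the algorithm name and its
# arguments; the master URL is supplied both to spark-submit and to the program.
emma_spark() {
  "$SPARK" --master "spark://$SPARK_ADDR" \
    emma-quickstart-spark/target/emma-quickstart-spark-0.1-SNAPSHOT.jar \
    "$@" \
    --master "spark://$SPARK_ADDR"
}
```

For example, the K-Means job becomes `emma_spark k-means 2 4 0.001 10 hdfs://$HDFS_ADDR/tmp/input/ml/clustering/kmeans/points.tsv hdfs://$HDFS_ADDR/tmp/output/kmeans-output.tsv`.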