
Commit 30e6dd6

Minor edit
Changed one of the titles to "Guidelines while working with HDInsight workloads"
1 parent 226ce4f commit 30e6dd6

1 file changed: +6 -7 lines changed
articles/data-lake-store/data-lake-store-performance-tuning-guidance.md

Lines changed: 6 additions & 7 deletions
@@ -34,7 +34,7 @@ This section provides general guidance to improve performance when data is copied

* **Source Data** - There are many constraints that can arise from where the source data is coming from. Throughput can be a bottleneck if the source data is on slow spindles or on remote storage with low throughput capabilities. SSDs, preferably on local disk, provide the best performance due to higher disk throughput.

-* **Network** - If you have your source data on VMs, the network connection between the VM and Data Lake Store is important. Use VMs with the largest available NIC to get more network bandwidth.
+* **Network** - If you have your source data on VMs, the network connection between the VM and Data Lake Store is important. Use VMs with the largest available NIC to get more network bandwidth.

* **Cross-region copy** - There is a large network cost inherent to cross-region data I/O, for example running a data ingestion tool on a VM in US East 2 to write data to a Data Lake Store account in US Central. If you’re copying data across regions, you may see reduced performance. We recommend running data ingestion jobs on VMs in the same region as the destination Data Lake Store account to maximize network throughput.

@@ -48,10 +48,10 @@ This section provides general guidance to improve performance when data is copied
| [AdlCopy](data-lake-store-copy-data-azure-storage-blob.md) | Azure Data Lake Analytics units |
| [DistCp](data-lake-store-copy-data-wasb-distcp.md) | -m (mapper) |
| [Azure Data Factory](../data-factory/data-factory-azure-datalake-connector.md) | parallelCopies |
-| [Sqoop](data-lake-store-data-transfer-sql-sqoop.md) | fs.azure.block.size, -m (mapper) |
+| [Sqoop](data-lake-store-data-transfer-sql-sqoop.md) | fs.azure.block.size, -m (mapper) |


-## Guidelines while working with data analysis workloads
+## Guidelines while working with HDInsight workloads

When running analytic workloads against data in Data Lake Store, we recommend HDInsight 3.5 cluster versions for the best performance with Data Lake Store. When your job is I/O intensive, certain parameters can be configured for performance. For example, if the job consists mainly of reads or writes, increasing the concurrency of I/O to and from Azure Data Lake Store can increase performance.
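As a hedged illustration of the `-m` (mapper) knob in the table above, here is a minimal DistCp sketch; the paths, account names, and the mapper count of 100 are placeholder assumptions, not values from this article:

```bash
# Illustrative only: copy from Azure Storage (WASB) to Data Lake Store,
# using -m to set the number of mappers, which controls copy parallelism.
hadoop distcp -m 100 \
    wasb://<container>@<storage-account>.blob.core.windows.net/source/folder \
    adl://<data-lake-store-account>.azuredatalakestore.net/target/folder
```

Sqoop takes the same `-m` flag, and `fs.azure.block.size` can be supplied as a generic Hadoop property (for example, via `-D fs.azure.block.size=<bytes>`).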
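And as a sketch of the concurrency point in the paragraph above: on HDInsight, one common way to raise I/O concurrency is through a job's executor settings. This Spark example and its numbers are illustrative assumptions only, not recommendations from the article:

```bash
# Illustrative only: more executors and cores mean more concurrent
# reads from and writes to Data Lake Store.
spark-submit \
    --num-executors 8 \
    --executor-cores 4 \
    --executor-memory 3g \
    your_job.py    # hypothetical job script
```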

@@ -71,9 +71,9 @@ Azure Data Lake Store is best optimized for performance when there is more concurrency

For example, let’s assume you have a single D3v2 node in your HDInsight cluster that has 12GB of YARN memory and 3GB containers. If you scale your cluster to 2 D3v2 nodes, your YARN memory increases to 24GB, which increases concurrency from 4 to 8 (24GB / 3GB containers = 8).

-![Data Lake Store performance](./media/data-lake-store-performance-tuning-guidance/image-3.png)
-
-3. **Start by setting the number of tasks to match the concurrency you have** – By now, you have already set the container size appropriately to get the maximum amount of concurrency. You should now set the number of tasks to use all of those containers. There are different names for tasks in each workload.
+![Data Lake Store performance](./media/data-lake-store-performance-tuning-guidance/image-3.png)
+
+3. **Start by setting the number of tasks to match the concurrency you have** – By now, you have already set the container size appropriately to get the maximum amount of concurrency. You should now set the number of tasks to use all of those containers. There are different names for tasks in each workload.

You may also want to consider the size of your job. If the job is large, each task may have a large amount of data to process; use more tasks so that no single task processes too much data.
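To make the arithmetic above explicit: concurrency is total YARN memory divided by container size, and step 3 then sizes the task count to match. A small sketch, with variable names of our own choosing:

```bash
# Illustrative only: concurrency = total YARN memory / container size.
nodes=2                      # D3v2 nodes in the cluster
yarn_memory_per_node_gb=12   # YARN memory per D3v2 node
container_size_gb=3          # memory per container
echo $(( nodes * yarn_memory_per_node_gb / container_size_gb ))   # prints 8
```

The resulting number is what you would feed to your workload's task-count setting, for example the mapper count for a MapReduce-style copy job.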

@@ -91,4 +91,3 @@ Azure Data Lake Store is best optimized for performance when there is more concurrency
## See also
* [Overview of Azure Data Lake Store](data-lake-store-overview.md)
* [Get Started with Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-get-started-portal.md)
-
