This section provides general guidance to improve performance when data is copied to Data Lake Store.

* **Source Data** - There are many constraints that can arise from where the source data is located. Throughput can be a bottleneck if the source data sits on slow spindles or on remote storage with low throughput. SSDs, preferably on local disk, provide the best performance because of their higher disk throughput.
* **Network** - If your source data is on VMs, the network connection between the VMs and Data Lake Store is important. Use VMs with the largest available NIC to get more network bandwidth.
* **Cross-region copy** - There is a large network cost inherent in cross-region data I/O, for example, running a data ingestion tool on a VM in US East 2 to write data to a Data Lake Store account in US Central. If you copy data across regions, you may see reduced performance. To maximize network throughput, run data ingestion jobs on VMs in the same region as the destination Data Lake Store account.

| Tool | Parallelism setting |
| ---- | ------------------- |
| [AdlCopy](data-lake-store-copy-data-azure-storage-blob.md) | Azure Data Lake Analytics units |

## Guidelines while working with HDInsight workloads

While running analytic workloads on data in Data Lake Store, we recommend that you use HDInsight 3.5 cluster versions to get the best performance with Data Lake Store. When your job is I/O intensive, you can configure certain parameters for performance. For example, if the job consists mainly of reads or writes, increasing concurrency for I/O to and from Azure Data Lake Store can increase performance.

Azure Data Lake Store is best optimized for performance when there is more concurrency.

For example, let's assume you have a single D3v2 node in your HDInsight cluster, which has 12 GB of YARN memory and 3-GB containers. If you scale your cluster to two D3v2 nodes, your YARN memory increases to 24 GB, and the available concurrency increases from 4 to 8, as the sketch below illustrates.
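
To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The per-node YARN memory and container size are the assumed values from the example above; read the actual figures from your cluster's YARN configuration.

```python
# Concurrency available to a job = total YARN memory / YARN container size.
YARN_MEMORY_PER_NODE_GB = 12  # D3v2 node in the example above (assumed value)
CONTAINER_SIZE_GB = 3         # YARN container size in the example (assumed value)

def concurrency(node_count: int) -> int:
    """Number of containers (concurrent tasks) the cluster can run at once."""
    total_yarn_memory_gb = node_count * YARN_MEMORY_PER_NODE_GB
    return total_yarn_memory_gb // CONTAINER_SIZE_GB

print(concurrency(1))  # 4 -> one D3v2 node
print(concurrency(2))  # 8 -> two D3v2 nodes
```
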
3. **Start by setting the number of tasks to the amount of concurrency you have** – By now, you have already set the container size appropriately to get the maximum amount of concurrency. Now set the number of tasks to use all of those containers. Tasks have different names in each workload.

You may also want to consider the size of your job. If the job is large, each task may have a large amount of data to process. In that case, use more tasks so that no single task processes too much data, as the sketch below illustrates.
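
As a rough sketch of that sizing trade-off, you might pick a task count as follows. The 1-GB-per-task target below is an illustrative assumption, not a documented recommendation; tune it for your workload.

```python
import math

TARGET_GB_PER_TASK = 1  # illustrative assumption, not a documented value

def task_count(job_size_gb: float, concurrency: int) -> int:
    """At least one task per available container; more tasks for large jobs
    so that no single task has to process too much data."""
    tasks_by_size = math.ceil(job_size_gb / TARGET_GB_PER_TASK)
    return max(concurrency, tasks_by_size)

print(task_count(job_size_gb=6, concurrency=8))    # 8   -> keep every container busy
print(task_count(job_size_gb=100, concurrency=8))  # 100 -> keep per-task data small
```
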
## See also
* [Overview of Azure Data Lake Store](data-lake-store-overview.md)
* [Get Started with Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-get-started-portal.md)