Commit 891983b

Merge pull request cityofcapetown#1 from cityofcapetown/wip/unified-assessment
Agreed version of assessment README
2 parents 5922d6b + 5ce8de1 commit 891983b

2 files changed: +55 -35 lines changed

README.md

Lines changed: 55 additions & 35 deletions
@@ -1,3 +1,6 @@
1+
2+
<img src="img/city_emblem.png" alt="City Logo"/>
3+
14
# City of Cape Town - Data Science Unit Code Challenge
25

36
## Purpose
@@ -16,17 +19,16 @@ So, follow common conventions with respect to directory structure and names to m
1619
### Candidates where programming is required (Data Scientist and Engineers)
1720
Requirements and notes:
1821
* Our primary programming languages are `python` and `R`. We will accept code that is packaged in `.py`, `.ipynb`, `.R` and `.Rmd` files.
19-
* Bash scripts are fine for glue. You may develop in any development environment you choose.
22+
* Bash or similar scripting language files are fine for glue. You may develop in any development environment you choose.
2023
* We expect to be able to clone your repo, immediately identify what script to execute from your README file, and execute it to completion with no human interaction.
2124
In order to ensure that our environment has the right libraries or packages, please follow standard python (PEP8) or R guidelines for structure in your code, i.e. place `import` and `library()` commands at the top of your scripts.
2225
* If your repo does not clone and run, we will not attempt to fix it.
2326
* If your analysis makes use of any external data, the data must either be included in the repo, or be downloaded automatically during script execution.
2427

2528
### Candidates where programming is not required (Data Analysts)
26-
*NB* If you prefer, you may submit using the requirements described above.
29+
*Note* If you prefer, you may submit using the requirements described above.
2730

28-
The final output of your analysis should be either a self-contained `html` file, executed `ipynb` file,
29-
Excel document with appropriate formatting and pivot tables, or a PowerBI file.
31+
You can use any tool to produce the output, e.g. Python, R, Excel, Power BI, Tableau, etc. The final deliverable needs to be a PDF report with your analysis.
3032

3133
## How to submit
3234
### Candidates where programming is required (Data Scientist and Engineers)
@@ -45,55 +47,73 @@ NOTE: If you would like to _improve_ the content of this repository, by fixing t
4547
4. Send us an email, with your archived project attached. If it is larger than 10 MB, then share it via a cloud storage service such as DropBox, and include the link in your email.
4648

4749
## Challenge
50+
Follow the steps below, completing those indicated as relevant to the positions for which you are interviewing. If there are any steps that you cannot complete after a reasonable amount of effort, move on to later steps, attempting everything relevant at least once.
4851

49-
For all roles, we expect the challenge response to include what you consider to be role-appropriate testing and validation. For instance, a Data Scientist might want to include MAPE scores or confusion matrices. A Data Engineer may want to include logging and data quality validation tests. A Data Analyst might want to plot histograms of request counts to ensure that outliers aren't overwhelming your analysis.
52+
For all roles, we expect the challenge response to include what you consider to be role-appropriate testing and validation. For example, a Data Scientist might want to include MAPE scores or confusion matrices. A Data Engineer may want to include logging and data quality validation tests. A Data Analyst might want to plot histograms of request counts to ensure that outliers aren't overwhelming your analysis.
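As an illustration of the kind of validation metric mentioned above, here is a minimal sketch of a MAPE calculation. The function name and the choice to skip zero-valued actuals are assumptions made for this example, not requirements of the challenge.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    ASSUMPTION: periods where the actual value is zero are skipped, because
    the percentage error is undefined there. Other conventions exist.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    # Keep only the pairs for which a percentage error can be computed.
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)
```

Weekly request counts per hex are often zero, so a metric that tolerates zeros may be a better fit in practice; the sketch only shows the mechanics.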
5053

51-
### If you are interviewing for a Data Science role
54+
Your code should be well formatted according to generally accepted style guides and include whatever is necessary for a team-mate unfamiliar with it to maintain it.
5255

53-
In the `data` directory you will find two datasets:
54-
* `data/sr.csv` contains 36 months of service request data, where each row is a service request. A service request is a request from one of our residents to undertake significant work.
55-
* `data/sal.zip` contains Shapefiles of City of Cape Town small area layers.
56+
### 0. Setup
57+
We have made the following datasets available (each filename is a link). These are all available in an AWS bucket `cct-ds-code-challenge-input-data`, in the `af-south-1` region, with the object names being the filenames below:
58+
* [`sr.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr.csv.gz) contains 36 months of service request data, where each row is a service request. A service request is a request from one of the residents of the City of Cape Town to undertake significant work. This is an important source of information on service delivery, and our performance thereof. *Note* as indicated by the extension, this file is compressed.
59+
* [`sr_hex.csv.gz`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex.csv.gz) contains the same data as `sr.csv` as well as a column `h3_level8_index`, which contains the appropriate resolution level 8 H3 index for that request. If the request doesn't have a valid geolocation, the index value will be `0`. *Note* as indicated by the extension, this file is compressed.
60+
* [`sr_hex_truncated.csv`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/sr_hex_truncated.csv) is a truncated version of `sr_hex.csv`, containing only 3 months of data.
61+
* [`city-hex-polygons-8.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for the bounds of the City of Cape Town, at resolution level 8.
62+
* [`city-hex-polygons-8-10.geojson`](https://cct-ds-code-challenge-input-data.s3.af-south-1.amazonaws.com/city-hex-polygons-8-10.geojson) contains the [H3 spatial indexing system](https://h3geo.org/) polygons and index values for resolution levels 8, 9 and 10, for the City of Cape Town.
5663

57-
1. *Time series challenge*: Predict the weekly number of expected service requests per small area for the next 4 weeks.
58-
2. *Introspection challenge*: Reshape the data into number of requests per small area in the last 12 months. Augment this data with any data you may find that is relevant. Predict the number of requests from this augmented data. Determine the drivers of requests of that particular type.
59-
3. *Classification challenge*: Classify a small area layer as formal, informal or rural based on the data derived from the service request data. NOTE TO CCT: We'll need to pre-classify the SALs.
64+
*Note* Some of these files are large, so start downloading as soon as possible.
6065

61-
Feel free to use any other data you can find in the public domain.
66+
In some of the tasks below you will be creating datasets that are similar to these; feel free to use them to validate your work.
6267

63-
**The final output of the execution of your code should be a self contained `html` file or executed `ipynb` file that is your report.**
64-
65-
A statistically minded layperson should be able to read this report and follow your analysis without guidance.
68+
### 1. Data Extraction (if applying for a Data Engineering Position)
69+
Use the [AWS S3 SELECT](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html) command to read in the H3 resolution 8 data from `city-hex-polygons-8-10.geojson`. Use the `city-hex-polygons-8.geojson` file to validate your work.
6670

67-
### If you are interviewing for a Data Analyst role
71+
Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used.
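A minimal sketch of how the extraction could be approached with `boto3` is shown below. It assumes that AWS credentials are already configured and that each GeoJSON feature carries a `properties.resolution` field; that field name is an assumption to check against the file. The helper names are hypothetical.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hex_extract")

BUCKET = "cct-ds-code-challenge-input-data"
KEY = "city-hex-polygons-8-10.geojson"
REGION = "af-south-1"


def build_expression(resolution):
    # ASSUMPTION: features expose their H3 level as properties.resolution.
    return (
        "SELECT d.properties, d.geometry "
        "FROM S3Object[*].features[*] d "
        f"WHERE d.properties.resolution = {int(resolution)}"
    )


def parse_event_stream(events):
    """Collect the Records payloads of an S3 Select response into dicts."""
    buffer = bytearray()
    for event in events:
        # A record can be split across events, so join the bytes before parsing.
        if "Records" in event:
            buffer.extend(event["Records"]["Payload"])
    lines = buffer.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def fetch_features(resolution=8):
    import boto3  # third-party; imported here so the helpers above stay stdlib-only

    client = boto3.client("s3", region_name=REGION)
    start = time.perf_counter()
    response = client.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=build_expression(resolution),
        InputSerialization={"JSON": {"Type": "DOCUMENT"}},
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )
    features = parse_event_stream(response["Payload"])
    elapsed = time.perf_counter() - start
    logger.info("Selected %d features in %.2f s", len(features), elapsed)
    return features
```

The result of `fetch_features(8)` could then be compared with the contents of `city-hex-polygons-8.geojson`, for example by checking that both contain the same set of index values.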
6872

69-
In the `data` directory you will find a file called `data/sr_sal.csv`. It contains 36 months of service request data, where each row is a service request. A service request is a request from one of our residents to undertake significant work. The column titled 'SAL' contains the name of the Census 2011 Small Area that the particular request falls under.
73+
### 2. Initial Data Transformation (if applying for a Data Engineering and/or Science Position)
74+
Join the file `city-hex-polygons-8.geojson` to the service request dataset, such that each service request is assigned to a single H3 hexagon. Use the `sr_hex.csv` file to validate your work.
7075

76+
For any requests where the `Latitude` and `Longitude` fields are empty, set the index value to `0`.
77+
78+
Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out. Please also log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used.
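The join, logging and threshold requirements above could be sketched as follows. This is one interpretation, not a reference solution: it counts a record as a failed join when it has coordinates but its H3 index is not among the city's hexagons, and it assigns `0` to those records as well. The function names, the exception class and the 5% default threshold are all hypothetical. The `h3` package is third-party, and the indexer handles both its v3 and v4 function names.

```python
import logging
import time

logger = logging.getLogger("sr_hex_join")


class JoinThresholdExceeded(RuntimeError):
    """Raised when too many located requests fail to match a hexagon."""


def h3_indexer(lat, lon, resolution=8):
    import h3  # third-party; v4 renamed geo_to_h3 to latlng_to_cell

    if hasattr(h3, "latlng_to_cell"):
        return h3.latlng_to_cell(lat, lon, resolution)
    return h3.geo_to_h3(lat, lon, resolution)


def assign_hexes(rows, valid_indexes, to_index=h3_indexer, max_failure_rate=0.05):
    """Add an h3_level8_index field to each service request row (a dict)."""
    start = time.perf_counter()
    rows = list(rows)
    joined, missing_location, failed = [], 0, 0
    for row in rows:
        lat, lon = row.get("Latitude"), row.get("Longitude")
        if lat in (None, "") or lon in (None, ""):
            # Empty coordinates get the index value 0, as the task specifies.
            missing_location += 1
            index = "0"
        else:
            index = to_index(float(lat), float(lon))
            if index not in valid_indexes:
                # ASSUMPTION: a point outside every city hexagon is a failed join.
                failed += 1
                index = "0"
        joined.append({**row, "h3_level8_index": index})
    failure_rate = failed / len(rows) if rows else 0.0
    logger.info(
        "%d rows, %d without location, %d failed to join (%.2f%%), %.2f s",
        len(rows), missing_location, failed, 100 * failure_rate,
        time.perf_counter() - start,
    )
    if failure_rate > max_failure_rate:
        raise JoinThresholdExceeded(
            f"join failure rate {failure_rate:.2%} exceeds {max_failure_rate:.2%}"
        )
    return joined
```

Passing the indexer in as a parameter keeps the join logic testable without the `h3` dependency. For the full dataset a vectorised or spatial-join approach is likely to be faster than a row-by-row loop.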
79+
80+
### 3. Descriptive Analytic Tasks (if applying for a Data Analyst Position)
81+
Please use the `sr_hex_truncated.csv` dataset to address the following.
82+
83+
Please provide the following:
7184
1. Provide a visual answer to the question "which areas and request types should Electricity concentrate on to reduce the overall volume of their requests".
72-
2. Provide a working prototype dashboard for monitoring progress in reducing Electricity service request volume per area and per type.
85+
2. Provide a working prototype dashboard for monitoring progress in reducing Electricity service request volume per area, and per type.
7386

74-
An Executive-level person should be able to read this report and follow your analysis without guidance.
87+
An Executive-level person should be able to read this report and follow your analysis without guidance.
7588

76-
### If you are interviewing for a Data Engineer role
89+
### 4. Predictive Analytic Tasks (if applying for a Data Science Position)
90+
Please use the `sr_hex.csv` dataset, only looking at requests from the `Water and Sanitation Services` department.
7791

78-
In the `data` directory you will find a dataset file, `data/sr.csv`, which contains 36 months of service request data, where each row is a service request. A service request is a request from one of our residents to undertake significant work.
92+
Please choose two of the following:
93+
1. *Time series challenge*: Predict the weekly number of expected service requests per hex for the next 4 weeks.
94+
2. *Introspection challenge*: Reshape the data into number of requests, per type, per hex in the last 12 months. Choose a particular request type, or group of request types. Develop a model that predicts the number of requests of your selected type(s), using the rest of your data. Based upon the model, and any other analysis, determine the drivers of requests of the selected type(s).
95+
3. *Classification challenge*: Classify a hex as formal, informal or rural based on the data derived from the service request data.
7996

80-
We have made two resources available remotely:
81-
* A GeoJSON file that contains the level 8, 9 and 10 resolution hexagons for the City of Cape Town at [this location](insert readonly object url).
82-
* An AWS S3 writeonly bucket at [this location](Insert here).
97+
Feel free to use any other data you can find in the public domain, except for task (3).
8398

84-
1. *Extraction and Load Challenge* - create a script which joins the SR data to the H3 hex data, meeting the following requirements:
85-
1. Uses AWS S3 SELECT syntax to read in H3 resolution 8 data.
86-
2. Joins it to the service request dataset
87-
3. Select a subset of columns (including the H3 index column), and write it to the writeonly bucket. Be sure to name your output file something that is recognizable as your work, and unlikely to collide with the names of others.
88-
* Include logging that lets the executor know how many of the records failed to join, and include a join error threshold above which the script will error out.
99+
**The final output of the execution of your code should be a self-contained `html` file or executed `ipynb` file that is your report.**
100+
101+
A statistically minded layperson should be able to read this report and follow your analysis without guidance.
89102

90-
2. *Transformation challenge* - write a script which anonymises the SR file, but preserves the following resolutions (You may use H3 indexes or lat/lon coordinates for your spatial data):
103+
Please log the time taken to perform the operations described, and within reason, try to optimise latency and computational resources used.
104+
105+
### 5. Further Data Transformations (if applying for a Data Engineering Position)
106+
Write a script which anonymises the `sr_hex.csv` file, but preserves the following resolutions (You may use H3 indexes or lat/lon coordinates for your spatial data):
91107
* location accuracy to within approximately 500m
92108
* temporal accuracy to within 6 hours
93109
* scrubs any columns which may contain personally identifiable information.
94-
95-
Your code should be well formatted according to generally accepted style guides and include whatever is necessary for a team-mate unfamiliar with it to maintain it.
96110

97-
## Contact
111+
We expect that, in the accompanying report, you will justify why this data is now anonymised. Please limit this commentary to fewer than 500 words. If your code is written in a code notebook such as Jupyter notebook or Rmarkdown, you can include this commentary in your notebook.
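One possible shape for the anonymisation step is sketched below, using only the standard library. It floors timestamps to the start of a 6-hour window and drops the columns named by the caller. The column names in the usage note are hypothetical and must be checked against the real file. Whether keeping only the level 8 H3 index satisfies the 500m requirement is something to justify in the report, not something this sketch settles.

```python
from datetime import datetime


def floor_to_window(timestamp, hours=6):
    """Floor a datetime to the start of its window (hours should divide 24)."""
    floored_hour = timestamp.hour - timestamp.hour % hours
    return timestamp.replace(hour=floored_hour, minute=0, second=0, microsecond=0)


def anonymise_row(row, timestamp_fields, pii_fields):
    """Return a copy of row with PII fields removed and timestamps coarsened.

    ASSUMPTION: timestamps are ISO 8601 strings that datetime.fromisoformat
    can parse; empty timestamp values are passed through unchanged.
    """
    cleaned = {}
    for name, value in row.items():
        if name in pii_fields:
            continue  # scrub the column entirely
        if name in timestamp_fields and value:
            value = floor_to_window(datetime.fromisoformat(value)).isoformat()
        cleaned[name] = value
    return cleaned
```

For example, `anonymise_row(row, ["creation_timestamp"], ["reference_number", "Latitude", "Longitude"])` would keep the H3 index as the only location field, assuming those column names exist in the data.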
98112

99-
You can contact riaz.arbi, gordon.inggs and/or colinscott.anthony @ capetown.gov.za
113+
### 6. Data Loading Tasks (if applying for a Data Engineering Position)
114+
Select a subset of columns (including the H3 index column) from the `sr_hex.csv` or the anonymised file created in the task above, and write it to the write-only S3 bucket.
115+
116+
Be sure to name your output file something that is recognisable as your work, and unlikely to collide with the names of others.
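The loading step could look something like the sketch below. The bucket name is not given in this README, so it is left as a parameter. The key-naming scheme and function names are hypothetical, and `boto3` with configured credentials is assumed for the upload itself.

```python
import csv
import io
import uuid


def build_object_key(candidate, suffix="sr_hex_subset.csv", token=None):
    """Build an object name that is recognisable and unlikely to collide."""
    # A short random token keeps reruns and similar names from overwriting each other.
    token = token or uuid.uuid4().hex[:8]
    return f"{candidate}-{token}-{suffix}"


def rows_to_csv_bytes(rows, columns):
    """Serialise only the selected columns of each row (a dict) to CSV bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def upload_subset(rows, columns, bucket, key):
    import boto3  # third-party; imported here so the helpers above stay stdlib-only

    body = rows_to_csv_bytes(rows, columns)
    # put_object only needs write permission, which suits a write-only bucket.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)
```

Because the bucket is write-only, the upload cannot be verified by reading the object back; logging the returned byte count and the object key is one way to leave a trace of what was sent.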
117+
118+
## Contact
119+
You can contact riaz.arbi, gordon.inggs and/or colinscott.anthony @ capetown.gov.za for any questions on the above.

img/city_emblem.png

26.2 KB