Starting a New Data Project? Don’t Start Coding Before You Read This.

Gaurav Kumar
5 min read · Mar 1, 2025

We’ve all been there, staring at the notebook. What now? The hypothesis we started with is going nowhere. How did we even land here? Three weeks of work down the drain.

Here are a few major steps to take care of before you start a notebook.

1. Open Communication with Stakeholders

Who are the beneficiaries of the outcome? Who are you doing this exercise for? Is it an external vendor, your internal team, or a team from another department? Start communicating from day 0. Your goals and vision must be aligned. Technical details aside, what is the expected overall outcome of the entire exercise?

The answer might not be a single pass-or-fail statement. It can be vague, and the end goal may refine itself over the course of the project. But proper communication ensures that you are not wasting time on things that don’t matter, and that you and the stakeholders are on the same page.

a. Set up a weekly or bi-weekly meeting

b. Loop in the stakeholders not just for updates, but also for the ideas that might arise during conversation

If you are the stakeholder yourself, set up a time to check the direction of the project.

2. Define the Problem Statement

The next step is to precisely define the problem or task that you are about to embark on. If the project already has some progress, understanding the problem statement becomes even more crucial: it was left incomplete the first time for a reason.

Try answering the following questions:

a. What is the impact of the problem?

b. What is the severity of the issue? How messed up is the ask from stakeholders? (It can be pretty messed up.)

c. What is the value added by solving the problem? Say you manage to solve it: will your stakeholders be jumping with joy, or just “meh”? (This affects how promptly they will discuss roadblocks.)

You may need to break the problem statement into multiple actionable steps going further, but a precise statement helps guide the direction.

3. Access the Data

There is no good answer here; this step might take up to 80% of your time. Be ready for that, but try to make it an iterative process.

Analyze the internal data available to you, and take help from stakeholders if you need an overview of the data. Start exploring the data in line with the problem statement. Get to know the entire dataset inside and out. Here are some points to keep in mind:

a. What is the coverage of the data?

b. What are the columns and their intuitive meanings? From an overview, which of them are primary?

c. What about null values and outliers? (Can you remove them, or should you correct them?)

d. Which columns can be features for your problem statement?

e. What is the nature of the columns: categorical or continuous?

f. Does the data contain what stakeholders find or deem important?

g. Get into the basic statistics of the data (mean / median) and cross-sectional stats.
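The checklist above can be turned into a quick first pass over the data. The following is only a minimal sketch using the Python standard library; the function name `profile_columns`, the sample rows, and the rule "all numeric values means continuous" are illustrative assumptions, and integer-coded categories would be misclassified by that rule.

```python
import statistics

def profile_columns(rows):
    """Summarise each column of a list-of-dicts dataset (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v is not None and v != ""]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        summary = {
            "coverage": len(present) / len(rows) if rows else 0.0,
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Crude assumption: a column whose present values are all numbers is continuous.
        if present and len(numeric) == len(present):
            summary["kind"] = "continuous"
            summary["mean"] = statistics.mean(numeric)
            summary["median"] = statistics.median(numeric)
        else:
            summary["kind"] = "categorical"
        report[col] = summary
    return report

# Made-up sample rows, for illustration only.
rows = [
    {"age": 34, "city": "Pune", "spend": 120.0},
    {"age": 29, "city": "Delhi", "spend": None},
    {"age": 41, "city": "Pune", "spend": 80.0},
    {"age": None, "city": "", "spend": 100.0},
]
for column, summary in profile_columns(rows).items():
    print(column, summary)
```

A report like this answers points a, c, e and g in one go, and makes it easier to ask stakeholders targeted questions about the columns that look odd.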

Data plotting can come in handy here.

It is visually effective to plot a histogram for each feature to see the scale and distribution of the data. You can also use scatter plots if you find them useful.

I suggest not going for fancy plots at this stage; simpler plots convey as much information as you need for a preliminary data check.

Note: If you feel the data is insufficient, start looking for external data at this point rather than waiting for results from the internal data.

External data procurement, if essential, will take a good amount of time and may extend the project deadline, so it is wise to look for alternate datasets in parallel.
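To illustrate how little is needed for a preliminary look at a distribution, here is a dependency-free sketch that bins a feature into equal-width buckets and prints a text histogram. It is a stand-in for whatever plotting library you normally use in a notebook; the function name `text_histogram` and the sample values are assumptions made for this example.

```python
def text_histogram(values, bins=5, width=30):
    """Return (label, count, bar) tuples for equal-width bins."""
    clean = [v for v in values if v is not None]
    if not clean:
        return []
    low, high = min(clean), max(clean)
    step = (high - low) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in clean:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / step), bins - 1)
        counts[index] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        label = f"{low + i * step:8.1f} to {low + (i + 1) * step:8.1f}"
        bar = "#" * round(width * count / peak)
        lines.append((label, count, bar))
    return lines

# Made-up feature values with one obvious outlier.
spend = [12, 15, 18, 22, 25, 31, 40, 95]
histogram = text_histogram(spend, bins=4)
for label, count, bar in histogram:
    print(f"{label} | {count:2d} {bar}")
```

Even this crude view shows most values bunched in the first bin with a lone value far to the right, which is exactly the kind of scale and outlier question worth settling before modelling.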

4. Plan the Project Pipeline

The approach becomes crucial to the project implementation as well as the delivery. The direction of the approach is mostly guided by the end outcome you would like to generate.

Start from the basic questions:

  1. What output do you want? Classification? Regression?
  2. How will the output be processed in the live setting?
  3. What model complexity can you afford? (Can the model take more than a few milliseconds to generate results?)

Based on the above points, you can decide how the problem needs to be broken down into smaller steps. You can also decide on the tools, techniques, and algorithms to use in the starting phase. At this point, you can start with the Exploratory Data Analysis.
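The latency question in point 3 is easy to check early with a small timing harness. The sketch below assumes a hypothetical `toy_predict` function standing in for a real model and an assumed budget of 5 ms; both are placeholders, and the real budget should come from the stakeholders.

```python
import statistics
import time

def measure_latency_ms(predict, sample, runs=200):
    """Time repeated calls to predict(sample); return median and worst case in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}

def toy_predict(features):
    # Hypothetical stand-in for a real model: a weighted sum with a threshold.
    score = 0.4 * features["age"] + 0.01 * features["spend"]
    return score > 15.0

BUDGET_MS = 5.0  # assumed budget, agree the real number with stakeholders
result = measure_latency_ms(toy_predict, {"age": 34, "spend": 120.0})
print(result, "within budget:", result["median_ms"] <= BUDGET_MS)
```

Reporting both the median and the worst case is a deliberate choice: a model that is fast on average but occasionally slow can still be unacceptable in a live setting.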

5. Define the Metric

How will you know if your project is moving in the right direction? For this, you need to set quantitative goals, be it accuracy, precision, or RMS error.

You may also need to define a base score: the benchmark below which anything becomes unacceptable. This can be a single parameter or a group of parameters, based on the constraints discussed in the previous section.

The metric will also help you communicate the progress of the project to stakeholders at a later stage. The metric can change over the timeline of the project, and if it feels vague at the start, begin with a couple of candidate metrics.
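One common way to get a base score for a classification task is a majority-class baseline: predict the most frequent label for every row and record the accuracy. The sketch below assumes that choice; the labels and the `model_predictions` list are made up for illustration, and your own benchmark may instead come from an existing system or a business threshold.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n):
    """Predict the most common training label for every one of n rows."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n

# Made-up labels, for illustration only.
y_train = ["stay", "stay", "churn", "stay", "stay", "churn", "stay", "stay"]
y_test = ["stay", "churn", "stay", "stay", "churn"]
model_predictions = ["stay", "churn", "stay", "churn", "churn"]  # hypothetical model output

base_score = accuracy(y_test, majority_baseline(y_train, len(y_test)))
model_score = accuracy(y_test, model_predictions)
print(f"baseline={base_score:.2f} model={model_score:.2f} "
      f"beats baseline: {model_score > base_score}")
```

A model that cannot beat this kind of trivial predictor is a signal to revisit the features or the problem framing before investing in more complex algorithms.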

6. Keep Documenting

Good documentation is often overlooked but is vital for collaboration and future reference.

Document:

Your problem statement and objectives.

Data sources and preprocessing steps.

Decisions made during the project and their rationale.

Documentation will help you revert to an earlier checkpoint if you need to abandon a path and look in another direction. It will also help you generate ideas at a later stage, when you look back over the mess you will have created.

7. Iterate

There is a very slim chance that any data project worth doing will be aced on the first attempt. It may take a couple of iterations at any of the above steps, so don’t be discouraged if you can’t get results the first time. The trick is to keep the process smooth and the direction clear so that each iteration feels like progress.

Final Thoughts

In a world where most tasks are getting automated, which may greatly enhance your speed of execution, your job mostly relies on discovering the direction, communicating properly, and navigating challenges. The heavy lifting can be done by machines, but proper delivery, and the responsibility for it, still rests with you, fellow analysts.
