Team and Sponsor Information

Team Members

Tyler J Matteo: Responsible for cleaning and joining multiple data sets, then performing feature analysis and modeling of multidimensional data to predict the main factors affecting lead paint violations. Also responsible for applying the data to generate interactive visualization tools.

Zhiwei Fan: Responsible for the code that produces the landlords’ violation list, and for the time series analysis of the work efficiency of HPD inspectors.

Report writing and slide production were done jointly by both of us.

Research and Analytics Department in the Office of the NY Attorney General:

The Research and Analytics Department is part of the Executive Division and works closely with senior staff and divisions throughout the office to support the OAG’s major initiatives, investigations, and policy development.

Visualization of Results

Feature engineering

By looking at violations and complaints per unit at the census tract level, we were able to identify several trends along demographic and economic lines. For this analysis we focus on violations and complaints per unit built before 1960. The median violation and complaint rates among census tracts in the upper quintiles for percent Latino or African American population were significantly higher than for the city overall, while tracts with a higher white population ranked below the city-wide median. The same holds for economic factors such as median income or the percent of the population with at most a high school education: census tracts with disadvantaged populations have higher rates of violations and complaints, whereas those with highly educated and high-income residents experience fewer violations and complaints than the city as a whole.

Median violations per unit for census tracts separated into quintiles for several key socioeconomic features
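
The quintile comparison described above could be computed along the following lines. This is only a minimal sketch using the Python standard library; the field names (`pct_latino`, `violations`, `units_pre1960`) and the toy values are assumptions made for illustration, not the project's actual schema or data.

```python
from statistics import median

def quintile_medians(tracts, feature, numerator="violations", denominator="units_pre1960"):
    """Median per-unit rate for tracts grouped into quintiles of `feature`.

    Quintile 1 holds the lowest fifth of tracts by `feature`, quintile 5 the highest.
    """
    # Drop tracts with no pre-1960 units to avoid dividing by zero.
    usable = [t for t in tracts if t[denominator] > 0]
    ranked = sorted(usable, key=lambda t: t[feature])
    n = len(ranked)
    groups = {q: [] for q in range(1, 6)}
    for rank, tract in enumerate(ranked):
        quintile = rank * 5 // n + 1  # rank-based bins of near-equal size
        groups[quintile].append(tract[numerator] / tract[denominator])
    return {q: median(rates) for q, rates in groups.items() if rates}

# Hypothetical toy data: ten tracts whose violation rate rises with the feature.
tracts = [
    {"pct_latino": 5 * i, "violations": 2 * i, "units_pre1960": 100}
    for i in range(1, 11)
]
medians = quintile_medians(tracts, "pct_latino")
```

Comparing `medians[5]` with `medians[1]` then shows how far the top quintile sits from the bottom one for a given feature; the same function can be reused for complaints by passing a different `numerator`.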

However, these trends were generally not as visible when looking at the age of housing stock along the same lines. In fact, the average age of buildings in the upper quintiles for white population and median household income was actually older than that of the upper quintiles for African American population or percent with at most a high school diploma. Looking at the total share of housing in each tract built before 1960 shows a similar trend. While the percentage is generally higher in tracts with a higher population of disadvantaged groups, the share differs by only a few percent in most cases. This suggests that disadvantaged groups may not be more likely to live in housing built before lead paint was made illegal; rather, they are simply more likely to live in similar housing that is more poorly cared for or renovated.

Median percent of housing units built before 1960 for census tracts separated into quintiles for several key socioeconomic features

Time series analysis

By observing the monthly response time and the total number of inspections, we can see that the inspectors' workload was significantly affected from March to July 2020. The response time lengthened by an average of 20 days, and the number of inspections per month fell to only about 50% of its previous level.

2015-2021 Monthly Average Response Time

2015-2021 Monthly Total Inspections
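
A monthly aggregation of this kind might look like the sketch below. It assumes that response time means the number of days between a complaint being received and the inspection taking place, and that records are grouped by inspection month; both are assumptions, as are the sample dates, which are invented purely to exercise the code.

```python
from collections import defaultdict
from datetime import date

def monthly_summary(inspections):
    """Return {(year, month): (mean response days, inspection count)}."""
    by_month = defaultdict(list)
    for received, inspected in inspections:
        # Response time is assumed to be days from complaint to inspection.
        by_month[(inspected.year, inspected.month)].append((inspected - received).days)
    return {
        month: (sum(days) / len(days), len(days))
        for month, days in sorted(by_month.items())
    }

# Hypothetical records: (complaint received, inspection date).
records = [
    (date(2019, 4, 1), date(2019, 4, 11)),
    (date(2019, 4, 10), date(2019, 4, 24)),
    (date(2020, 3, 20), date(2020, 4, 19)),
]
summary = monthly_summary(records)
baseline_days, baseline_count = summary[(2019, 4)]
covid_days, covid_count = summary[(2020, 4)]
change_in_days = covid_days - baseline_days      # extra days of delay
inspection_ratio = covid_count / baseline_count  # share of the earlier volume
```

Comparing the same calendar month across years, as done here, is one simple way to keep seasonal effects out of the before-and-after comparison.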

In the next two figures, we can see that inspectors were more likely to be denied entry during the pandemic than usual. However, even when we calculate the hit rate only for inspections where entry was allowed, it is still lower than in previous years. One possible reason is that the apartments that deny access to inspectors have more violations.

2015-2021 Monthly Inspector No-entry Rate

2015-2021 Monthly Inspector Hit Rate When Allowing Entry
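
The two rates plotted above can be separated as in the following sketch. It assumes that "hit rate" means the share of inspections with entry that resulted in a violation; the record keys `entry` and `violation_found` are hypothetical names, and the sample month is invented.

```python
def entry_and_hit_rates(inspections):
    """Return (no-entry rate, hit rate among inspections where entry was gained)."""
    total = len(inspections)
    entered = [i for i in inspections if i["entry"]]
    no_entry_rate = (total - len(entered)) / total if total else 0.0
    hits = sum(1 for i in entered if i["violation_found"])
    # Conditioning on entry keeps denied visits from dragging the hit rate down.
    hit_rate = hits / len(entered) if entered else 0.0
    return no_entry_rate, hit_rate

# Hypothetical month: four inspection attempts, one denied entry.
month = [
    {"entry": True, "violation_found": True},
    {"entry": True, "violation_found": False},
    {"entry": True, "violation_found": False},
    {"entry": False, "violation_found": False},
]
no_entry_rate, hit_rate = entry_and_hit_rates(month)
```

Computing the hit rate only over inspections with entry is what allows the two effects (more denied entries, fewer violations found) to be read independently.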


Our linear regression model of tract-level violation rates shows results that appear to echo our exploratory analysis.

Feature Coefficients For Linear Regression Model of Tract Violations per Unit

There is a strong inverse correlation between the share of white population in a tract and its rate of violations. The same is true for median income. Features associated with the overall age of the housing stock are, however, less important to the model, again echoing our earlier analysis.
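
To illustrate how coefficients like these can be compared across features, the sketch below fits an ordinary least-squares model on standardized features with the Python standard library. It is not the project's actual model: the source does not say which library was used or whether features were standardized, and the feature names (`pct_white`, `building_age`) and values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_coefficients(features, target):
    """Least-squares coefficients for standardized features.

    The target is centered, so the intercept is zero and can be dropped.
    """
    names = list(features)
    cols = [standardize(features[name]) for name in names]
    y = [v - mean(target) for v in target]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return dict(zip(names, solve(xtx, xty)))

# Hypothetical tracts: the rate falls with pct_white and barely moves with age.
pct_white = [10, 30, 50, 70, 90]
building_age = [80, 60, 90, 70, 85]
rate = [0.5 - 0.004 * w + 0.0005 * a for w, a in zip(pct_white, building_age)]
coefs = ols_coefficients({"pct_white": pct_white, "building_age": building_age}, rate)
```

Because every feature is on the same scale after standardization, the sign and magnitude of each coefficient can be compared directly, which is the kind of reading applied to the coefficient chart above.
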

For our classification models, we gathered various performance metrics and documented feature importance for our Random Forest Classifier.

Model          Accuracy  Precision  Recall  F1    AUC ROC  Specificity
Random Forest  0.52      0.52       0.97    0.68  0.51     0.52
Bagging        0.52      0.52       1.0     0.68  0.51     0.52

Random Forest Classification Model Performance

Feature Importance

Both models produced a high number of false positives. This is likely due to the distribution of the features in the dataset overall. Because certain characteristics are overrepresented in the complaint dataset, there is little variation in these features for the model to learn from.
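
For reference, the metrics reported in the table follow from confusion-matrix counts as sketched below. The counts passed in are hypothetical and chosen only to show how a model that labels nearly everything positive behaves; they are not the project's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }

# Hypothetical counts: almost every case is predicted positive.
metrics = classification_metrics(tp=50, fp=45, tn=5, fn=0)
```

With many false positives, recall stays near 1 because few true cases are missed, while precision drifts toward the share of positives in the data and specificity drops.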

Interactive Visualization Tool

We built an interactive visualization tool on top of an open source tool. Sponsors can use its Filter function to display the data they are interested in on the map and to observe correlations between different dimensions of the data. The link is as follows:

Geographical Visualization Tool

Links to Resources

Final Technical Report:

Predicting and Preventing Lead Paint Poisoning

Literature Reference:

Robust and efficient fuzzy match for online data cleaning

Validation of a Machine Learning Model to Predict Childhood Lead Poisoning


Data Sources:

American Community Survey (ACS)

Complaint Problems | NYC Open Data

Housing Maintenance Code Complaints | NYC Open Data

Housing Maintenance Code Violations | NYC Open Data

Multiple Dwelling Registrations | NYC Open Data

Primary Land Use Tax Lot Output (PLUTO) | NYC Open Data

Project Introduction


Peeling lead paint is the most common cause of lead poisoning in young children. Lead dust from peeling paint can land on windowsills, floors, and toys. When children play on the floor and put their hands and toys in their mouths, they can swallow lead dust. Exposure to lead is known to have several harmful effects on children, including developmental impairment, learning disorders, and problems with hearing and speech.

To protect children from lead paint poisoning, NYC Local Law 1 of 2004 requires landlords to use firms certified by the U.S. Environmental Protection Agency when disturbing more than 100 square feet of lead paint, replacing windows, or fixing violations issued by the HPD. It is presumed that these hazards exist in tenant-occupied buildings of three or more units built before January 1st, 1960 where a child under the age of six resides. In February of 2021, this law was expanded to also apply to one- and two-unit buildings. In another push to address this issue, Local Law 31 of 2020 requires all buildings built prior to 1960 to undergo an inspection by an EPA-certified inspector no later than August 9th, 2025. While the city has made progress in addressing lead paint hazards, there is concern that the COVID-19 pandemic has inhibited inspectors’ ability to quickly identify violations.

Problem Statement

Through discussions with our project sponsors, we identified the following key questions, which we aim to answer through a combination of data analysis and predictive modeling:

  • Where are children being exposed to hazardous levels of lead paint?
  • Which individual landlords are responsible for the highest rate of lead-based paint violations?
  • What effect, if any, has COVID-19 had on violation and complaint response and remediation times?