JoulesEye: Energy Expenditure Estimation and Respiration Sensing from Thermal Imagery While Exercising

1IIT Gandhinagar, 2Cornell University, 3Carnegie Mellon University
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technology (ACM-IMWUT 2024)

Abstract

Smartphones and smartwatches have contributed significantly to fitness monitoring by providing real-time statistics, thanks to accurate tracking of physiological indices such as heart rate. However, the estimation of calories burned during exercise is inaccurate and cannot be used for medical diagnosis. In this work, we present JoulesEye, a smartphone thermal camera-based system that can accurately estimate calorie burn by monitoring respiration rate. We evaluated JoulesEye on 54 participants who performed high intensity cycling and running. The mean absolute percentage error (MAPE) of JoulesEye was 5.8%, which is significantly better than the MAPE of 37.6% observed with commercial smartwatch-based methods that only use heart rate. Finally, we show that an ultra-low-resolution thermal camera that is small enough to fit inside a watch or other wearables is sufficient for accurate calorie burn estimation. These results suggest that JoulesEye is a promising new method for accurate and reliable calorie burn estimation.


Main Idea

We used a thermal camera attachment for phones to accurately estimate EE (Figure 1). Breathing causes temperature changes in the nostrils, which result in variations in pixel intensity. We used a classical region-tracking approach, the Channel and Spatial Reliability Tracker (CSRT) [14], to retrieve the pixel intensity in the nostrils. The intensity information over time gives us a proxy for the breathing signal. We also used temperature and heart rate data to improve the results. To extract temperature, we monitored multiple points on the forehead. The forehead is an area of bony prominence where the probability of observing a change in temperature due to a workout is high [11]. While prior work has estimated respiration rate using thermal images [2, 6], those experiments did not include participants moving vigorously while exercising. Motion is also a challenge for wireless-signal-based respiration monitoring, where removing motion artifacts is a longstanding problem. Our algorithms work even when the user is cycling or running vigorously. The sensed respiration rate, temperature and heart rate information from thermal data are fed into a deep learning model to estimate energy expenditure.
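As a concrete illustration of this step, the sketch below follows a nostril ROI across thermal frames with OpenCV's CSRT tracker and records its mean pixel intensity; the video path and initial bounding box are placeholder assumptions, and in practice the ROI would be seeded by landmark detection rather than hand-picked.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("thermal_session.mp4")  # placeholder path
ok, frame = cap.read()
assert ok, "could not read the first frame"

# Initial nostril bounding box (x, y, w, h); illustrative values only.
roi = (310, 420, 40, 30)

# CSRT: Channel and Spatial Reliability Tracker (needs opencv-contrib).
tracker = cv2.TrackerCSRT_create()
tracker.init(frame, roi)

intensity = []  # mean nostril intensity per frame: proxy breathing signal
while True:
    ok, frame = cap.read()
    if not ok:
        break
    ok, (x, y, w, h) = tracker.update(frame)
    if ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        intensity.append(gray[int(y):int(y + h), int(x):int(x + w)].mean())

signal = np.asarray(intensity)  # oscillates with inhalation/exhalation
```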

JoulesEye Overview
Fig. 1. JoulesEye estimates Energy Expenditure (EE) from respiration rate. In a) the participant is riding a stationary cycle with the thermal camera and phone fixed on the handlebar. b) shows a frame of the thermal video. c) shows the respiration rate detection pipeline during motion, combined with a deep learning architecture to predict energy expenditure.

JoulesEye Setup

Our goal is to determine how many calories a person has burned while exercising by measuring the respiration rate. The breathing or respiratory rate is detected from the temperature fluctuations caused by airflow in the nasal cavity. The physical phenomenon is based on the radiative and convective heat transfer components during the breathing cycle, which result in a periodic increase and decrease of the temperature at the tissues around the nasal cavity. These temperature fluctuations are quantifiable in a thermal video as pixel intensity variations of the nostril ROI [16]. In this section, we first describe the setup of our system and then the algorithm for obtaining the temperature and respiration rate. The setup is divided into two major categories of systems: Ground Truth Devices and the JoulesEye System.
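A minimal sketch of how such an intensity trace can be turned into a respiration rate, assuming a fixed frame rate and picking the dominant frequency in a plausible breathing band; the band limits (0.1–1.5 Hz, i.e. 6–90 breaths/min) are our own illustrative choice, not the paper's.

```python
import numpy as np
from scipy.signal import detrend

def respiration_rate_bpm(signal: np.ndarray, fps: float = 8.6) -> float:
    """Estimate breaths per minute from a nostril-ROI intensity trace."""
    x = detrend(signal)                       # remove slow thermal drift
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    spectrum = np.abs(np.fft.rfft(x))
    band = (freqs >= 0.1) & (freqs <= 1.5)    # plausible breathing band
    dominant = freqs[band][np.argmax(spectrum[band])]
    return dominant * 60.0
```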

Ground Truth Devices

First, we discuss the components used to collect ground truth data, followed by the components used in the design of JoulesEye.

Indirect Calorimeter

We used the Fitmate Pro [18], an indirect calorimeter that collects VO2 (volume of oxygen) data during sub-maximal as well as maximal exercise by measuring the volume of oxygen consumed and the volume of carbon dioxide produced. Submaximal exercise is performed at a level below the maximum capacity of an individual. During physical assessment, only sub-maximal exercises should be performed by participants in the absence of a clinical physician [8]. The main components of the Fitmate Pro are:

  • Oxygen Sensor: Measures the oxygen consumption and the carbon dioxide expulsion from the body. The concentration of oxygen and carbon dioxide is directly proportional to the energy expenditure.
  • Flow Sensor: Measures the volume of air breathed in and out by the user.
  • Microprocessor Unit: Analyzes the data from the sensors and calculates the energy expenditure based on proprietary algorithms.
  • Display Screen: It is shown in Figure 2(d). It displays the results of the energy expenditure calculation, including the number of calories burned, in real-time.
  • Mouthpiece or Mask: Attaches to the face and connects to the calorimeter, allowing the measurement of inhaled and exhaled air.

Respiration Belt

The ground truth of respiration rate is available from the indirect calorimeter. We also collected the respiration rate using the Vernier GoDirect [7] respiration belt (Figure 2(e)). We collected respiration rate from two sources because it is impossible to use the calorimeter and JoulesEye simultaneously. Thus, instead of comparing JoulesEye with the gold standard output of the calorimeter, we use the chest belt as the reference measurement of respiration. The belt consists of a flexible, stretchable material that is worn around the chest, and it contains a sensor that detects the pressure changes caused by breathing. It has a measurement range of 0–100 breaths per minute with an error of ±1 breath per minute. It has a sampling rate of 0.1 Hz.

JoulesEye Setup
Fig. 2. JoulesEye is composed of a thermal camera retrofitted to an iPhone as shown in a). JoulesEye can also be used in a smartwatch as shown in b). The camera in b) is a low-resolution (32x24) thermal camera. c) and e) show the ground truth data collection procedure with the indirect calorimeter while running and biking. d) shows a screen grab from the indirect calorimeter recording the energy expenditure during an exercise session.

JoulesEye System

JoulesEye consists of a thermal camera that records a person's face; from the thermal video (Figure 2(a, b)) we retrieve the respiration rate and temperature, which are then used to estimate EE. We used a FLIR One Pro [9], a smartphone-attachment thermal imaging camera. To take thermal videos, the device needs to be attached to an iPhone and connected to the FLIR ONE mobile application. A user can select the video mode and start recording. The thermal video is recorded in real time, showing temperature differences and heat patterns in the scene. The camera has a sampling rate of 8.6 frames per second with a temperature range of -20°C to 120°C. The combined unit of the smartphone and the thermal camera was securely mounted on the handgrip of an ergometer or affixed near the display screen of a treadmill in order to capture thermal video data of the face. We also developed a wristband prototype of JoulesEye, as shown in Figure 2(b).


Data Collection

Volunteers between 18 and 70 years of age without any prior heart ailment were eligible for the study. In total, 54 volunteers participated in an approximately 45-minute study session (Table 1). The entry survey consisted of a questionnaire in which the participants self-declared their age, weight, sex, time of last meal and recent illnesses. Our data collection method followed the best-practice validation protocol mandated by the Network of Physical Activity Assessment (INTERLIVE) [3]. The participants were shown how to wear the indirect calorimeter mask. All participants took part in two back-to-back data collection sessions.

Table 1. Demographic information for the participants
Total participants (n) 54
Participants who performed cycling on ergometer 41
Participants who performed running on treadmill 13
Female (n, %) 24 (44.4%)
Age (in years) (mean, range) 28.4 (25–54)

Session 1: Data Collection Using JoulesEye

Participants cycled on a stationary bike or ran on a treadmill in both sessions. In the first session, the participant exercised for three minutes at high intensity (4–5 miles/h for running and 2.5–3 miles/h for cycling). We limited the high-intensity session to three minutes keeping the participant's comfort in mind. Figure 3(a) shows a frame of the face during this session. The following data are collected during this session:

  • Thermal video data of the upper body with the frame covering the face. This data is later processed to extract respiration rate.
  • Respiration rate from the chest belt.

Thermal Images during Data Collection
Fig. 3. In a) the participant has not donned the indirect calorimeter mask, so the region tracking algorithm is able to keep track of the nostrils (nostril also shown in the inset image). In b) the nostrils are covered by the mask, making respiration detection impossible. We refer to a) as the JoulesEye data collection. During a) we could not collect the indirect calorimeter data in parallel, as the mask would otherwise occlude the nostrils. Here, we use the respiration data from the chest belt as the reference values. Thus, we could quantitatively evaluate the performance of JoulesEye's respiration rate pipeline against ground truth respiration data from the belt. We later used the respiration rate from JoulesEye data to estimate energy expenditure.

Session 2: Data Collection Using Indirect Calorimeter

In this session, the participant donned the indirect calorimeter mask along with the chest belt and performed cycling or running for 15 minutes, comprising High Intensity Interval Training (HIIT). The thermal camera recorded the face of the person during this session as well. Figure 3(b) shows a thermal frame of this session in which the participant has donned the mask. Note that the nostrils are not visible, so this thermal data cannot be used to extract respiration rate. The following data are collected during this session:

  • Thermal data with frame covering the upper body including the face. The nostrils are now occluded by the mask.
  • Respiration rate from the chest belt.
  • Energy Expenditure, volume of exhaled air and respiration rate from the indirect calorimeter.

Additional Data: Heart Rate and Temperature

We evaluated how temperature data and heart rate data affect the energy expenditure estimation. We are interested in heart rate because it is one of the most common proxies for energy expenditure, and combining it with respiration rate can improve the estimates.

Extracting Temperature

Our pilot experiments showed that temperature change occurs in regions of the face with bony prominence, such as the forehead, jawline and nose tip, when a person is cycling on an ergometer or running on a treadmill. It is known that physical activity increases the metabolic rate and generates heat in the body. This increased heat is transmitted through the blood vessels and nerves in the bony regions, leading to an increase in skin temperature in these regions [5]. We extracted temperature information from the forehead.
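A minimal sketch of what forehead-temperature extraction could look like, assuming the thermal frames are already calibrated to °C and the forehead ROI is tracked like the nostril ROI above; averaging a small grid of points rather than a single pixel is our own illustrative choice.

```python
import numpy as np

def forehead_temperature(thermal_frame: np.ndarray, roi) -> float:
    """Average several sample points inside a tracked forehead ROI (°C)."""
    x, y, w, h = roi
    patch = thermal_frame[y:y + h, x:x + w]
    ys = np.linspace(0, h - 1, 3, dtype=int)   # 3x3 grid of sample points
    xs = np.linspace(0, w - 1, 3, dtype=int)
    return float(patch[np.ix_(ys, xs)].mean())
```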

Heart Rate

To make a fair comparison of the energy expenditure (EE) estimates produced by our approach, it was important to compare them with the currently accepted EE estimates produced by smartwatches. The heart rate data from an Apple Watch was collected continuously during cycling and running, providing a continuous measurement of the individual's heart rate. This heart rate data was used as an optional additional input to our approach to estimating energy expenditure, in combination with other physiological signals such as respiration rate and temperature. We also used the energy expenditure data from the Apple Watch as the reference for comparison with the energy expenditure estimates produced by our approach. By comparing both the Apple Watch and our own model against the ground truth, it was possible to make a fair comparison of the accuracy of the energy expenditure estimates.


Modeling

Energy Expenditure (EE) is represented in cal/min, deduced from VO2. We aim to estimate Energy Expenditure (EE) from Respiration Rate (RR). We do this in two phases:

  • We will first estimate the volume of exhaled air (\(v\)) from RR.
  • Next, we will use the estimated volume information (\(v\)) to estimate VO2, the volume of oxygen consumed in a breath.

The inspiration for this two-phased approach comes from the indirect calorimeter, which measures the oxygen concentration (O2) in a breath. The O2 concentration in a breath varies depending on factors like gas exchange efficiency and body composition. By modeling the relationship between the amount of O2 consumed and the volume of exhaled air (\(v\)), we can account for efficiency and body composition. However, \(v\) is not readily available without the indirect calorimeter. Therefore, our first objective is to estimate \(v\) from RR data. Using RR alone to estimate VO2 can lead to inaccuracies because it does not take into account individual differences in lung capacities and breathing patterns. We expect our model to learn these factors to estimate \(v\) from RR alone. Our second model then learns the transfer function and estimates the unmeasured factors that determine VO2 from \(v\).

Predicting Volume from RR

Both our models are an adaptation of the Temporal Convolution Network with residuals (TCN) [4]. TCN leverages causal convolutions and dilation. Causal convolution enforces a unidirectional information flow, while dilation allows the model to capture long-range dependencies of the input. In our work, the model tries to learn a function \(f_1\) that best predicts the volume \(v_t\) at time stamp \(t\) such that

\[ v_t = f_1(v_{t-k:t-1}, RR_{t-k:t}) \]

The model iterates over multiple samples of input and output to learn the function \(f_1\). During prediction, subsequent samples (\(v_{t+1}, v_{t+2}...\)), are predicted autoregressively i.e.

\[ v_{t+1} = f_1(v_{t-k+1:t}, RR_{t-k+1:t+1}) \]

where the predicted volume is used as an input to the next model, which predicts VO2.
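The sketch below illustrates this kind of dilated causal residual block and the autoregressive rollout in PyTorch; the layer widths, kernel size, dropout rate, and the zero-padding of the unknown current-volume slot are our own illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """Dilated causal 1D convolution + ReLU + dropout, with a residual add."""
    def __init__(self, channels: int, kernel: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel - 1) * dilation       # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(0.2)

    def forward(self, x):                        # x: (batch, channels, time)
        y = nn.functional.pad(x, (self.pad, 0))  # no peeking at the future
        return x + self.drop(self.relu(self.conv(y)))

class TCN(nn.Module):
    def __init__(self, in_ch: int, hidden: int = 32, levels: int = 4):
        super().__init__()
        layers = [nn.Conv1d(in_ch, hidden, 1)]
        for i in range(levels):                  # dilations 1, 2, 4, 8
            layers.append(CausalBlock(hidden, dilation=2 ** i))
        layers.append(nn.Conv1d(hidden, 1, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)[:, 0, -1]             # prediction at the last step

@torch.no_grad()
def rollout(f1: TCN, v_hist, rr, k: int):
    """Autoregressive rollout of f1: each predicted volume is fed back in.
    The unknown current-volume slot is zero-padded so both input channels
    share length k+1 (an alignment simplification on our part)."""
    v = list(v_hist)                             # last k observed volumes
    for t in range(k, len(rr)):
        chans = [v[-k:] + [0.0], list(rr[t - k:t + 1])]
        x = torch.tensor(chans, dtype=torch.float32).unsqueeze(0)
        v.append(f1(x).item())                   # v_t, reused at step t+1
    return v[k:]
```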

Predicting VO2 from Volume

The approach to modeling VO2 from volume is similar to the previous modeling approach where we use the TCN network, but this time we only use the volume information to predict VO2, i.e.

\[ vo_{t+p} = f_2(v_{t:t+p-1}) \]

where \(p\) is the number of samples of volume. Therefore, to predict the first sample of VO2, we need in total \(k + p\) samples of respiration rate.

Deep Learning Model Architecture
Fig. 4. We build a deep learning network similar to the Temporal Convolution Network (TCN) with residuals to estimate volume as a function of respiration rate and volume, i.e., \(v_t = f_1(v_{t-k:t-1}, RR_{t-k:t})\). Additionally, we also evaluated the performance of the model with additional covariates, namely heart rate (HR) and temperature (T) collected from the forehead. On using HR and T, the equation becomes \(v_t = f_1(v_{t-k:t-1}, HR_{t-k:t}, T_{t-k:t}, RR_{t-k:t})\). The residual blocks are composed of a 1D dilated causal convolution (the first layer has no dilation), a ReLU activation and dropout [10]. A similar convolution is used to later predict VO2 (calorie or energy expenditure) from volume.

Using Heart Rate (HR) and Temperature Data (T)

During data collection, we retrieved heart rate data from both a chest belt and a smartwatch. Additionally, we obtained approximate temperature data from the thermal readings. To enhance our analysis, we incorporated heart rate (HR) data from the smartwatch and forehead temperature (T) data as additional covariates. These supplementary variables enabled us to evaluate the performance of estimating \(v\) using different combinations of covariates: HR alone, RR alone, RR and HR combined, and RR, HR, and T combined. For example, with an input of RR, HR and T, the equation to estimate \(v\) becomes

\[ v_t = f_1(v_{t-k:t-1}, HR_{t-k:t}, T_{t-k:t}, RR_{t-k:t}) \]

Figure 4 illustrates the corresponding TCN model for this combination of inputs. By dropping one covariate (e.g., using only T and RR, or T and HR) or two (using only RR), we adjusted the input dimension of the model, which required corresponding adaptations of the kernel dimension while keeping the dimensions of the tensors within the residual network unchanged.
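Continuing the TCN sketch above (with its hypothetical `in_ch` parameter), swapping covariate combinations amounts to changing the number of input channels while the residual blocks keep their dimensions:

```python
# Input channels: volume history plus whichever covariates are used.
f1_rr      = TCN(in_ch=2)  # (v, RR)
f1_rr_hr   = TCN(in_ch=3)  # (v, RR, HR)
f1_rr_hr_t = TCN(in_ch=4)  # (v, RR, HR, T)
f2         = TCN(in_ch=1)  # second model: VO2 from volume alone
```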


Results and Discussion

In this section, we first discuss the performance of respiration detection from thermal video when compared to ground truth from a respiration belt. Next, we discuss the performance of energy expenditure estimation from respiration, heart rate, and temperature data.

Result on Estimating Respiration Rate

With the data from the first session, we observed that the error between respiration rate detected from thermal data and the respiration belt is 2.1% (Figure 5(A)). Furthermore, from the data collected in the second session, we quantified the error between the respiration rate from the indirect calorimeter and the respiration belt as 1.68%.

Result on Estimating Respiration Rate
Fig. 5. The data obtained during the first session (A) serves the purpose of quantifying the discrepancy between the respiration signal extracted from the thermal video and the signal obtained from the belt. This quantification holds significance for subsequent insights, as depicted in Figure 6. The data collected during the second session (B) showcases the error in estimating VO2 or EE when employing the respiration signal from an indirect calorimeter or a chest belt. Notably, it is important to recall that using an indirect calorimeter obstructs the view of the nostril from the thermal camera.

Both these numbers (2.1% and 1.68%) are better than previous work [1], which uses electrocardiogram and photoplethysmogram signals to calculate respiration rate. Since respiration rate and energy expenditure are on different scales, using MAPE gives us a good idea of how changing one modality impacts the other.
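For reference, the MAPE figure used throughout can be computed as in this minimal sketch:

```python
import numpy as np

def mape(truth, pred) -> float:
    """Mean Absolute Percentage Error, in percent."""
    truth, pred = np.asarray(truth, float), np.asarray(pred, float)
    return 100.0 * float(np.mean(np.abs((truth - pred) / truth)))
```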

JoulesEye EE estimation pipeline
Fig. 6. JoulesEye EE estimation pipeline: The calorimeter’s mask obstructs direct thermal-based respiration retrieval by blocking the camera’s view of the nostrils. To replicate thermal-based respiration, we added noise to the belt-derived respiration signal, introducing a 2.1% error to simulate the difference between the reference respiration from the belt and the thermal video. The resulting noisy respiration rate (RR) signal was then input into the first TCN model for volume estimation. The estimated volume was subsequently passed to the second TCN model to predict VO2 or energy expenditure.

Result on Estimating Energy Expenditure

Using True and Reference Respiration Rate

Figure 5(B) shows the errors in estimating energy expenditure or VO2 from the ground truth respiration rate from the calorimeter and the reference respiration rate from the chest belt. The first TCN model is used to estimate the volume of exhaled air from respiration rate data. The estimated volume of exhaled air is then used as an input to the second TCN model, which estimates energy expenditure or VO2. Figure 5 shows that the best result, a 5% Mean Absolute Percentage Error (MAPE), was obtained when the ground truth respiration rate was input into the model. Using the belt's respiration rate as an input gives an MAPE of 5.2%. To put these numbers into context, we compared the performance of using respiration rate as a predictor versus heart rate and temperature. We also compared with the result obtained from the Apple Watch. In Figure 7, the following inputs are shown:

  • True HR: This is the heart rate obtained from the indirect calorimeter chest band for heart rate.
  • Estimated HR: This is the heart rate obtained using Apple Smartwatch.
  • True RR: This is the respiration rate obtained from indirect calorimeter.
  • Estimated RR: This is the RR generated by adding noise to the RR from the respiration belt.
  • Estimated RR and HR: This means the estimated RR data and Apple Watch HR data.
  • Estimated RR, HR and T: This means the estimated RR data, Apple Watch HR data and the temperature data collected during session 2.

Percentage Error in Estimating Energy Expenditure
Fig. 7. Comparison of true and estimated HR/RR with MAPE analysis.

Using Proxy Respiration Rate

We described earlier why the respiration rate obtained from thermal data is not available alongside the ground truth energy expenditure. However, from the comparison of Session 1 data (Figure 5), we know that the error between the respiration rate obtained from thermal data and the belt is 2.1%. Thus, we can take the belt respiration data from Session 2 (Figure 5) and add noise to it to generate new respiration data with an error of 2.1%. We use this respiration data to predict energy expenditure as shown in Figure 6. Mathematically,

\[NoisyRR_i = BeltRR_i + \epsilon_i\]

\[\epsilon_i \sim N(\mu = 0.44, \sigma = 0.35)\]

where NoisyRR is the noisy respiration data and BeltRR is the respiration rate from the belt. We refer to this noisy respiration data as a proxy for the estimated respiration data. The mean and standard deviation were chosen so that the error between the noisy respiration rate and the ground truth respiration rate is 2.1%. A quantile-quantile probability plot confirmed that the respiration data from the belt follows a normal distribution, and hence we chose to generate the noise from a normal distribution.
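A minimal sketch of this proxy-signal construction, assuming per-sample Gaussian noise with the reported parameters:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def proxy_rr(belt_rr) -> np.ndarray:
    """Add N(mu=0.44, sigma=0.35) noise to the belt signal so the proxy
    deviates by ~2.1% MAPE, mimicking thermal-derived respiration."""
    eps = rng.normal(loc=0.44, scale=0.35, size=np.shape(belt_rr))
    return np.asarray(belt_rr, float) + eps
```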

Energy Estimation Heart Rate vs Respiration Rate
Fig. 8. Error Trends in estimation of EE using HR and RR while Cycling and Running

We compare the estimates from Apple Watch heart rate data and the estimates from the estimated respiration rate in Figure 8. For the cycling activity, the energy expenditure estimates from the heart rate data are relatively inaccurate, as apparent from the noisy data in the lower portion of Figure 8(a). The same trend is not observed in Figure 8(b), where respiration rate is used as the predictor. It is important to note that, for each participant, we updated the demographic information in the Apple Health app before data collection.

Effect of Occlusion

Figure 9 (a) demonstrates that when three or more frames are consecutively occluded, the respiration rate estimation exhibits a high Mean Absolute Error of 20.1, while one or two frame occlusions result in a significantly lower error rate. The occurrence of prolonged occlusion, lasting three or more frames, leads to the loss of nostril tracking by the Region of Interest (ROI) tracker, causing alterations in the mean intensity signal, as illustrated in Figure 9 (b). As a consequence, this deviation in the intensity signal adversely affects the accuracy of respiration estimation. However, in such instances, we activate the RGB camera to re-establish the tracking of nostrils through landmark detection. By continuously tracking the nostril with the RGB camera, we can successfully retrieve the correct mean intensity signal, as shown in Figure 9 (c).
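A sketch of this fallback logic under our assumptions: the OpenCV-style `update`/`init` tracker interface follows the earlier nostril-tracking sketch, and `detect_nostrils_rgb` is a hypothetical stand-in for the RGB landmark detector, which the text does not specify.

```python
def track_with_fallback(tracker, thermal_frame, rgb_frame,
                        occluded_frames: int, max_occluded: int = 2):
    """Track the nostril ROI; after a long occlusion, re-seed from RGB."""
    ok, box = tracker.update(thermal_frame)
    if ok:
        return box, 0                          # tracking fine, reset counter
    if occluded_frames + 1 > max_occluded:
        # Three or more occluded frames: the ROI tracker has lost the
        # nostrils, so re-detect them in the RGB frame and re-initialize.
        box = detect_nostrils_rgb(rgb_frame)   # hypothetical helper
        tracker.init(thermal_frame, box)
        return box, 0
    return None, occluded_frames + 1           # short occlusion, keep waiting
```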

Effect of Occlusion
Fig. 9. Impact of Occlusion and RGB-based nostril tracking enhancement.

Discussion

According to the literature [12, 13, 17], respiration rate information helps explain body composition or adiposity, which is an important determinant of EE. Body composition plays a significant role in determining energy expenditure because each type of tissue in the body requires a different amount of energy to maintain. Muscle tissue is more metabolically active than fat tissue, meaning it requires more energy to sustain. To analyse whether body composition affects the energy expenditure estimates, we split our data into people with normal and overweight Body Mass Index (BMI). Table 2 shows the MAPE of energy expenditure from the Apple Watch and the MAPE of the energy expenditure estimates from respiration rate.

Table 2. The EE estimation error of the Apple Watch is higher for people with overweight Body Mass Index (BMI) and relatively lower for people with normal BMI.

                           All Participants   Normal BMI   Overweight BMI
Error (Apple Watch)        37.6%              29.7%        51.8%
Error (JoulesEye) with RR  5.8%               5.2%         6.9%

Another reason why respiration rate explains the change in EE can be deduced from Figure 10, which shows that heart rate, respiration rate and EE are well correlated; however, the correlation between heart rate and energy expenditure is lower (Pearson correlation = 0.78) than that between respiration rate and EE (0.93). Figure 10 suggests that the high-frequency information in the EE signal is captured by the respiration rate and not the heart rate. The heart rate signal is smoother, showing none of the frequent changes observed in the respiration rate and EE signals.
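The correlations in Figure 10 are standard Pearson coefficients; a minimal helper for reproducing such numbers:

```python
import numpy as np

def pearson(a, b) -> float:
    """Pearson correlation coefficient between two equal-length signals."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

# e.g. pearson(heart_rate, ee) -> ~0.78, pearson(respiration_rate, ee) -> ~0.93
```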

Correlation between Calorie, Respiration Rate and Heart Rate
Fig. 10. Correlation between Calorie, Respiration Rate and Heart Rate.

Result with Reduced Video Resolution

The FLIR thermal camera needs to be attached to an iPhone, and its video recordings are saved at 1440x1080 pixel resolution without access to any raw data. For JoulesEye to be practical, however, we envision that a smartwatch might come with a low-resolution thermal camera. The primary advantages of a low-resolution thermal camera are reduced power consumption and fewer privacy concerns. As shown in Figure 11(b), we designed a 32x24 pixel resolution thermal imaging system based on the MLX90640. It also has an RGB camera beside it. The RGB camera helps initially locate the nostrils, and thereafter the CSRT algorithm keeps track of the nostrils. We evaluated our low-resolution thermal system for respiration rate detection on 5 participants. These participants were asked to run on a treadmill at 4 miles per hour for a minute with the constraint that they look into the JoulesEye smartwatch thermal camera by extending their hand, akin to looking into a smartwatch.
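As an illustration of reading the 32x24 sensor, here is a minimal capture loop using the Adafruit CircuitPython MLX90640 driver; this is our assumption of a plausible software stack, not necessarily the firmware the prototype used.

```python
import board
import busio
import numpy as np
import adafruit_mlx90640

# I2C bus at the high clock rate the MLX90640 supports.
i2c = busio.I2C(board.SCL, board.SDA, frequency=800_000)
mlx = adafruit_mlx90640.MLX90640(i2c)
mlx.refresh_rate = adafruit_mlx90640.RefreshRate.REFRESH_4_HZ

frame = [0.0] * 768                    # 32x24 temperatures, in °C
mlx.getFrame(frame)                    # blocks until a full frame is read
thermal = np.array(frame).reshape(24, 32)
```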

JoulesEye Smartwatch
Fig. 11. In a) we show what future smartwatches could look like. In b) we show our first prototype wristband thermal camera, which is composed of a low-resolution thermal camera and an RGB camera.

When compared to the ground truth respiration rate collected via the belt, we observed that the MAPE of estimating respiration rate is 8.1%. This high error arose because we were not able to achieve a high frame rate for the thermal camera. The current frame rate is 3 frames per second, which is fine for slow or no movement but causes a dithering effect when there is too much movement from the participant. We repeated the procedure (discussed earlier) of adding noise to the respiration belt data so that the new data has an error of 8.1%. Using this data, we obtained an energy expenditure estimation error of 15.4%. While 15.4% is higher than the 12% error obtained from the watch's heart rate data alone (using our algorithms, not the Apple Watch; Figure 7), combining this respiration rate data with heart rate data reduces the error to 10.1%. This shows that even though the frame rate of the wristband prototype is low, leveraging thermal data together with heart rate data from the smartwatch estimates energy expenditure more accurately than heart rate data alone. The results are summarised in Table 3.

Table 3. Estimation of EE using a low-resolution thermal camera in combination with heart rate data yields an error of 10.1%, showcasing its superiority over using heart rate data alone. These results demonstrate that even with a very low-resolution thermal camera, EE estimation can be enhanced.

Table (a): The error (MAPE) in RR estimation varies with changes in thermal video resolution.
  Resolution              Error on estimated RR
  1080p thermal camera    2.1%
  24p thermal camera      8.1%

Table (b): Reduced thermal video resolution leads to increased error (MAPE) in EE estimation.
  Input Data                      Error on estimated EE
  RR from 1080p thermal           5.4%
  RR from 24p thermal             15.4%
  RR from 1080p thermal and HR    5.3%
  RR from 24p thermal and HR      10.1%

Impact of Changing Time Resolution

Our JoulesEye results shown in Figure 7 are based on an input window of 90 s, where 60 s is required to estimate the first sample of volume and a further 30 s of data is required to estimate the first sample of VO2. Figure 12 shows how the percentage error changes when we gradually decrease the input chunk length for respiration rate estimation. We observe that using 15 s of respiration data is enough to predict energy expenditure with better performance than heart rate alone. This implies that after exercising, a user would have to look into the watch for 15 s + 30 s for the model to predict her energy expenditure. We believe further work is needed to reduce this interval and make the system even more practical.

Impact of changing time resolution
Fig. 12. Using 60 s of respiration rate data gives us the best performance in estimating energy expenditure. 30 s of respiration data is enough to predict energy expenditure with better performance than heart rate alone.

Limitations and Future Work

We now discuss the limitations of our present work and plans for addressing them in the future.

  • Smartphone/Smartwatch Integration: Our objective is to retrofit a smartphone/smartwatch with a low-resolution thermal camera [15]. Although we prototyped JoulesEye, the engineering challenge of obtaining a higher frame rate remains unsolved. Our initial results are promising, but our system is not yet real time, meaning the video processing and deep learning pipeline must be run after data recording.
  • Usability of Smartwatch Prototype: Although we developed a prototype smartwatch for JoulesEye, we did not conduct any usability study with it. Currently, a usability study would not yield the desired results, as each participant would need to continuously look into the watch for at least 45 seconds to obtain any energy expenditure estimate. Such an extended duration of glancing at the watch is impractical. Further research is required to significantly reduce this time interval, allowing a quick glance at the watch to provide accurate energy expenditure values.
  • Uncertainty in Estimation: Our current methods for estimating energy expenditure give a point estimate. In the future, we plan to incorporate uncertainty into our estimation. Incorporating such uncertainty will be particularly important as the various sensing modalities are affected differently by external conditions. For example, the algorithms for heart rate estimation will likely not suffer when the surroundings are dark, but the algorithms that estimate nostril position from RGB will. Thus, in the future, we plan to implement a principled uncertainty-based approach, where uncertainties in the different parts of the pipeline (estimating respiration rate and temperature; estimating energy expenditure using the machine learning model) are considered while estimating energy expenditure.