Research on the Selection of the Most Popular Product Categories in TikTok Based on Linear Regression

In the e-commerce industry, live streaming has become an increasingly popular way to promote and sell products. With the rise of social media platforms such as Facebook, Instagram, and TikTok, more and more businesses use live streaming to engage with customers and boost sales. This study investigates the relationship between sales and the number of live viewers in order to identify the most popular partitions (product categories) for live sales on TikTok e-commerce, using simple linear regression together with deseasonalization and variable transformation to optimize the models. The study improves the accuracy and interpretability of regression models in the e-commerce industry, focusing specifically on the relationship between real-time viewership and Gross Merchandise Volume (GMV). The findings indicate that removing seasonality and applying a log-log transformation yield more reliable and accurate models, ultimately helping businesses better understand and optimize their sales strategies.


Introduction
TikTok e-commerce refers to e-commerce trading activities conducted on the TikTok platform. This form of trade combines the characteristics of social media and e-commerce and provides users with channels for displaying and purchasing goods through short videos, live broadcasts, and other formats. TikTok e-commerce has formed a unique business model in China and is an emerging force in the Chinese e-commerce market.
The development of TikTok e-commerce benefits from the widespread popularity of the TikTok platform and the high engagement of its users. The platform has a huge user base that includes many young users with a strong appetite for new things and fashion trends. At the same time, the platform provides an easy and enjoyable shopping experience, allowing users to shop while browsing short videos and live broadcasts [1].
The business model of TikTok e-commerce is mainly divided into two types: interest e-commerce and shelf e-commerce. Interest e-commerce has two main parts: displaying and promoting goods through live broadcasts, and displaying and promoting goods through short videos. In live streaming, sellers demonstrate the features and usage of products and interact with users to guide them toward placing orders. In short videos, sellers show the appearance and features of a product and add purchase links to the videos to guide users toward placing orders.
The rise of TikTok e-commerce has changed the business model of traditional e-commerce to a certain extent, making the display and sale of goods more personalized and emotional and giving users a more attractive and interactive shopping experience. At the same time, TikTok e-commerce provides sellers with a more open and flexible sales platform, giving small and micro enterprises and individual sellers the opportunity to trade and promote on the platform.
In the e-commerce industry, understanding the relationship between real-time viewership and GMV is essential for businesses to optimize their sales strategies. This study examines the use of simple linear regression, deseasonalization, and variable transformation techniques to model this relationship more accurately and interpretably. Through this investigation, it provides businesses with more reliable models on which to base their sales strategies [2].
The existing literature on TikTok is relatively rich. It covers the rise of the live broadcast model [3], the use of TikTok in the UK, and TikTok in the field of digital marketing [4][5][6], including its operation process, the impact of data flow, and the rise of the interest e-commerce model. Research on TikTok users has analyzed the impact of TikTok's media on the consumption intention and behavior of different age groups [7,8]. In addition, TikTok has an impact on the humanities and society [9], and it plays a role in social education, social science, and international public relations. This literature reflects, to some extent, the popularity of TikTok in recent years and its promotion of online shopping. Building on it, this article studies live streaming sales with the aim of finding popular categories suited to live streaming sales. This not only supplements and expands the literature but also helps to provide some direction for the development of live streaming sales.

Data
The data used in this study come from TikTok data websites, including but not limited to the grey dolphin and flying melon. The data indicators include total live sales, the number of live sales, the number of viewers in the live broadcast room, host information, traffic data, and the live broadcast sales category. Some subjective consumer factors, such as viewing type preferences, viewing duration, and purchase intention, are difficult to quantify and are treated as uncontrollable external factors in this report.

Simple Linear Regression
Simple linear regression is a basic statistical method used to study the linear relationship between an independent variable and a dependent variable. The method first assumes a linear relationship between the two variables, that is, that the value of the dependent variable can be predicted linearly from the value of the independent variable [10,11].
The univariate linear regression model can be expressed as $Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y$ is the dependent variable, $X$ is the independent variable, $\beta_0$ and $\beta_1$ are the regression coefficients, and $\varepsilon$ is a random error term. $\beta_0$ represents the value of $Y$ at $X = 0$, $\beta_1$ represents the rate at which $Y$ changes with $X$, and $\varepsilon$ represents the part of $Y$ that cannot be explained by the model. The goal is to estimate $\beta_0$ and $\beta_1$ and to evaluate and improve the goodness of fit of the model [12].
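For reference, the standard least-squares estimates for this model can be written as follows. The notation is introduced here for illustration and is not taken from the original text: $(x_i, y_i)$, $i = 1, \dots, n$, denote the observations, and $\bar{x}$ and $\bar{y}$ denote their sample means.

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
```

Here $R^2$ is the share of the variation in $Y$ that the fitted line accounts for, which is the goodness-of-fit measure reported later in the analysis.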
The analysis process of univariate linear regression mainly includes the following steps:
1. Data collection: collect data on the independent and dependent variables, and check whether the data meet the assumptions of the linear regression model.
2. Model hypothesis test: test the assumptions of the model, including linearity, normality of the errors, constant variance (homoscedasticity), and independence.
3. Model fitting: use the least squares method to estimate the parameters of the model, and calculate statistical indicators such as goodness of fit and residual error to evaluate how well the model fits.
4. Parameter estimation: use the estimated parameter values to predict the dependent variable for new values of the independent variable, and calculate indicators such as prediction error and confidence intervals to evaluate the accuracy of the predictions.
5. Model diagnosis: examine the residuals of the model to identify deviations and outliers, and check whether the model meets the remaining assumptions.
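The fitting and evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `fit_simple_ols` and the viewer and sales figures are hypothetical.

```python
from math import fsum


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = fsum(x) / n
    y_bar = fsum(y) / n
    sxx = fsum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = fsum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    # Goodness of fit: share of the variation in y explained by the line.
    sse = fsum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = fsum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - sse / sst if sst > 0 else float("nan")
    return b0, b1, r_squared


# Hypothetical illustration data (not from the study).
viewers = [1000, 2000, 3000, 4000, 5000]
sales = [5200, 9900, 15100, 19800, 25000]
b0, b1, r2 = fit_simple_ols(viewers, sales)
print(f"intercept={b0:.2f}, slope={b1:.4f}, R^2={r2:.4f}")
```

The slope returned here plays the role of the regression coefficient described above, and the residuals used to compute R-squared are the same quantities examined in the model diagnosis step.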
The simple linear regression method is widely used in fields such as the social sciences, natural sciences, and business to study linear relationships between dependent and independent variables. It also has limitations: for example, it cannot handle nonlinear relationships or complex multivariate relationships. Therefore, additional methods are adopted in this report to optimize the fit of the model.

Deseasonalization
Removing seasonality is an important step in time series analysis that helps to better understand and predict time series data. Seasonality refers to periodic fluctuations in time series data, usually caused by fixed seasonal factors such as holidays, weather, and business cycles. Removing it separates seasonal from non-seasonal factors, which makes the trend and volatility of the data easier to understand. In this study, live streaming sales are clearly a time series, so the influence of seasonal factors must be considered in order to optimize the model. Deseasonalization can be combined with univariate linear regression, which allows the linear relationship between the independent and dependent variables to be evaluated more accurately while eliminating the influence of seasonality.
The following are some common methods that combine deseasonalization and univariate linear regression:
1. Seasonal difference regression method: this method first uses seasonal differencing to remove seasonal effects and then uses univariate linear regression to analyze the linear relationship between the independent and dependent variables. It helps to evaluate that relationship more accurately, without interference from seasonal influences.
2. Seasonal moving average regression method: this method first uses a moving average to remove seasonal effects and then uses univariate linear regression to analyze the linear relationship between the independent and dependent variables. It helps to better understand the relationship between the variables while eliminating the influence of seasonality.
3. Seasonal trend model method: this method uses a seasonal trend model to analyze the relationship between the independent and dependent variables while accounting for seasonality. The model decomposes the time series into three parts (seasonal, trend, and random components) and then uses univariate linear regression to analyze the trend and non-seasonal parts.
However, the limitations of univariate linear regression itself remain unresolved: regression on a single independent variable is insufficient to fit complex data trends perfectly, and further exploration is needed to fit such trends accurately.
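The first two methods listed in this section can be sketched as follows. This is a minimal illustration, assuming a daily series with a 7-day period; the function names and the toy series are hypothetical, and the moving average shown handles odd periods only.

```python
def seasonal_difference(series, period=7):
    """Subtract the value one period earlier: d[t] = y[t] - y[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def centered_moving_average(series, period=7):
    """Centered moving average for an odd period; edge values are None."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    half = period // 2
    smoothed = [None] * len(series)
    for t in range(half, len(series) - half):
        smoothed[t] = sum(series[t - half:t + half + 1]) / period
    return smoothed


# Hypothetical daily series: linear trend plus a weekly pattern that sums to zero.
weekly_pattern = [3, -1, -2, 0, 1, -3, 2]
series = [100 + 2 * t + weekly_pattern[t % 7] for t in range(21)]

diffed = seasonal_difference(series)        # weekly pattern cancels out
smoothed = centered_moving_average(series)  # weekly pattern averages out
print(diffed[:3], smoothed[3:6])
```

In both cases the weekly pattern is removed before any regression is run: differencing leaves only the change in trend over one week, while the moving average leaves an estimate of the trend itself.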

Transformations of Simple Linear Regression
Simple linear regression is a commonly used statistical method for studying the linear relationship between an independent variable and a dependent variable. In practice, however, the relationship between the two variables may not be linear; it may take the form of a curve, an exponential, a logarithm, and so on. Transformed models such as polynomial regression, logarithmic regression, power function regression, and inverse function regression can conform better to the actual situation and so describe the relationship between the independent and dependent variables more accurately. In a univariate linear regression model, the regression coefficient represents the degree of influence of the independent variable on the dependent variable, so these transformations can also explain that influence better and thus predict the value of the dependent variable more accurately. For example, in logarithmic regression, taking the logarithm of the independent and dependent variables brings the data closer to a normal distribution, which allows the dependent variable to be predicted more accurately [13,14].
Transformation methods:
1. Polynomial regression: polynomial regression extends the univariate linear regression model to higher-order terms. A polynomial function of the independent variable is used to fit the dependent variable, in order to better describe the relationship between the two. For example, the quadratic polynomial regression model can be expressed as $Y = \beta_0 + \beta_1 X + \beta_2 X^2$.
2. Logarithmic regression: logarithmic regression extends the univariate linear regression model to the logarithmic scale. The logarithm of the independent and/or dependent variable is taken to better describe the relationship between them. For example, a logarithmic regression model can be represented as $\ln Y = \beta_0 + \beta_1 \ln X$.
3. Power function regression: power function regression extends the univariate linear regression model to a power function. A power function of the independent variable is used to fit the dependent variable. For example, the power function regression model can be expressed as $Y = \beta_0 X^{\beta_1}$.
4. Inverse function regression: inverse function regression extends the univariate linear regression model to an inverse function. The inverse of the independent variable is used to fit the dependent variable. For example, an inverse function regression model can be represented as $Y = \beta_0 + \beta_1 \frac{1}{X}$.
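Several of these models can be fitted with the same least-squares routine once the variables are transformed. The sketch below is illustrative only, with hypothetical function names and noise-free toy data. It fits the power function model by regressing ln Y on ln X, and the inverse function model by regressing Y on 1/X. Note that the multiplicative constant of the power model equals the exponential of the intercept of the corresponding log-log fit.

```python
from math import exp, fsum, log


def fit_line(u, v):
    """Least-squares fit of v = a + b * u; returns (a, b)."""
    n = len(u)
    u_bar = fsum(u) / n
    v_bar = fsum(v) / n
    b = (fsum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
         / fsum((ui - u_bar) ** 2 for ui in u))
    return v_bar - b * u_bar, b


def fit_power(x, y):
    """Fit Y = c * X**b1 by regressing ln Y on ln X (requires x > 0, y > 0)."""
    a, b1 = fit_line([log(xi) for xi in x], [log(yi) for yi in y])
    return exp(a), b1


def fit_inverse(x, y):
    """Fit Y = b0 + b1 / X by regressing Y on 1/X (requires x != 0)."""
    return fit_line([1.0 / xi for xi in x], y)


# Hypothetical noise-free data generated from known coefficients.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
c, b1 = fit_power(x, [3.0 * xi ** 1.5 for xi in x])
b0_inv, b1_inv = fit_inverse(x, [2.0 + 5.0 / xi for xi in x])
print(f"power:   Y = {c:.3f} * X^{b1:.3f}")
print(f"inverse: Y = {b0_inv:.3f} + {b1_inv:.3f} / X")
```

Because each transformed model is still linear in its coefficients, the fit statistics of simple linear regression carry over, but they then describe the fit on the transformed scale rather than on the original one.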

Simple linear Regression Analysis
In this study, the dependent variable is sales and the independent variable is the number of live viewers. To conduct the analysis, data on the number of live viewers and sales were collected for a sample of live streams from several e-commerce data platforms, and simple linear regression was used to model the relationship between the two variables. During model fitting, 3 live partitions that performed well in terms of slope were selected from the 11 live partitions. In the simple linear regression formula $y = \beta_0 + \beta_1 x + \varepsilon$, $y$ represents the dependent variable (sales), $x$ the independent variable (the number of live viewers), $\beta_0$ the intercept (which has no meaningful interpretation in this case), $\beta_1$ the slope (the rate of change in sales with respect to the number of live viewers), and $\varepsilon$ the random error term with a mean of zero.
After analyzing the data, the apparel partition was found to have a strong positive relationship between sales and the number of live viewers. The model for the apparel partition had a high R-squared value of 0.942, as shown in Table 1, indicating that 94.2% of the variation in sales was explained by the number of live viewers. Although the models for the jewelry and beauty partitions performed well in terms of slope, their R-squared values were relatively low, indicating that only part of the variation in sales could be explained by the model: the jewelry partition had an R-squared value of 44.94% and the beauty partition 47.69% (Figures 1-3).
To handle outliers, a residual analysis was performed, which identified several high-leverage points and outliers. Values that did not meet the criteria for standardized residuals were removed, which improved the R-squared values of the simple linear regression model in all three live partitions. However, a nonlinear trend remained visible in the residuals-versus-fits plot for the apparel partition after the outliers were processed, which indicates a limitation of the simple linear regression model, as shown in Figures 4-6. The results suggest that there is a strong positive relationship between sales and the number of live viewers in the e-commerce industry. However, the limitations of the simple linear regression model should be kept in mind. A high R-squared value indicates a strong relationship between the dependent and independent variables, but it does not necessarily indicate that the model generalizes well. Outliers and nonlinear trends can also affect the accuracy of the model.
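The outlier screening described above can be illustrated with a short sketch. The paper does not state the cut-off applied to the standardized residuals, so the threshold of 2 used here is an assumption, as are the function name and the toy data. Leverage is computed with the usual formula for simple regression, $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$.

```python
from math import fsum, sqrt


def standardized_residuals(x, y):
    """Fit y = b0 + b1 * x and return (b0, b1, r), where r[i] is the
    internally standardized residual e_i / (s * sqrt(1 - h_i))."""
    n = len(x)
    x_bar = fsum(x) / n
    y_bar = fsum(y) / n
    sxx = fsum((xi - x_bar) ** 2 for xi in x)
    b1 = fsum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = sqrt(fsum(e * e for e in resid) / (n - 2))  # residual standard error
    if s == 0:
        return b0, b1, [0.0] * n  # perfect fit: no residual to standardize
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    r = [e / (s * sqrt(1.0 - h)) for e, h in zip(resid, leverage)]
    return b0, b1, r


THRESHOLD = 2.0  # assumed cut-off; the study does not state its criterion

# Hypothetical data: y = 2x with one contaminated observation at index 4.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi for xi in x]
y[4] = 30.0

_, _, r = standardized_residuals(x, y)
flagged = [i for i, ri in enumerate(r) if abs(ri) > THRESHOLD]

# Refit after dropping the flagged observations.
x_clean = [xi for i, xi in enumerate(x) if i not in flagged]
y_clean = [yi for i, yi in enumerate(y) if i not in flagged]
b0_clean, b1_clean, _ = standardized_residuals(x_clean, y_clean)
print("flagged indices:", flagged, "slope after removal:", round(b1_clean, 4))
```

Removing points in this way raises R-squared almost by construction, so the improvement reported above is best read together with the residual plots rather than on its own.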

Seasonal Decomposition Analysis
To remove seasonality from the data, two methods of seasonal decomposition were used: the additive model (T+S+C+I) and the multiplicative model (T*S*C*I), with a seasonal length of 7 days because of the short time span of the data. The performance of the two models was then compared in terms of accuracy and interpretability.
First, in data selected for showing significant changes after seasonality was removed, volatility decreased significantly between February 6th and March 2nd. Removing seasonality eliminated some residual variation that could not be handled by univariate linear regression but could be quantified through time-series analysis.
Secondly, removing seasonality increased the comparability of data between different time periods within the same region.This was especially important when comparing real-time viewership and GMV data across different time periods.
Lastly, removing seasonality simplified the structure of models and reduced the risk of overfitting, thus improving the generalization ability of models on unknown data.Although a linear regression with a sinusoidal function could also process the fluctuations, it was still inferior to the model with seasonality removed, which was simpler and more accurate.
Furthermore, the multiplicative model generally performed better than the additive model on the selected dataset, as shown in Figure 7 and Figure 8. This is because the multiplicative model assumes a multiplicative relationship between trend and seasonality, meaning that the influence of seasonality changes with the trend over time. In contrast, the additive model assumes an additive relationship between trend and seasonality, meaning that the influence of seasonality remains constant and changes in trend only lead to an overall upward or downward movement. Therefore, the multiplicative model was more suitable for processing data spanning two consecutive months with a magnitude of tens of thousands, and it can provide more accurate trend predictions for univariate linear regression, thereby improving the fit of the regression. These results suggest that removing seasonality from the data is both feasible and reliable. Processing seasonal data can improve the accuracy and interpretability of models, provide a more reliable basis for subsequent analysis, and simplify the structure of models while reducing the risk of overfitting.
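The difference between the two decompositions can be made concrete with a simplified classical decomposition. The sketch below is an assumption-laden illustration rather than the procedure used in the study: it estimates the trend with a centered 7-day moving average, folds the cyclical component C into that trend, and leaves the irregular component I in the adjusted series. The function name `decompose` and the toy series are hypothetical.

```python
def decompose(series, period=7, model="multiplicative"):
    """Simplified classical decomposition for an odd period.
    Returns (trend, seasonal_indices, seasonally_adjusted_series)."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    if model not in ("additive", "multiplicative"):
        raise ValueError("model must be 'additive' or 'multiplicative'")
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Step 1: trend as a centered moving average (None at the edges).
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period

    # Step 2: detrend, then average the detrended values per day of the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(half, n - half):
        if model == "additive":
            buckets[t % period].append(series[t] - trend[t])
        else:
            buckets[t % period].append(series[t] / trend[t])
    raw = [sum(b) / len(b) for b in buckets]

    # Step 3: normalize indices (sum to 0 if additive, average to 1 if
    # multiplicative), then remove them from the original series.
    centre = sum(raw) / period
    if model == "additive":
        seasonal = [v - centre for v in raw]
        adjusted = [series[t] - seasonal[t % period] for t in range(n)]
    else:
        seasonal = [v / centre for v in raw]
        adjusted = [series[t] / seasonal[t % period] for t in range(n)]
    return trend, seasonal, adjusted


# Hypothetical series whose weekly swing grows with the level.
factors = [1.2, 0.8, 1.0, 0.9, 1.1, 1.3, 0.7]
gmv = [(10000 + 500 * t) * factors[t % 7] for t in range(28)]
_, mult_idx, mult_adj = decompose(gmv, model="multiplicative")
_, add_idx, add_adj = decompose(gmv, model="additive")
print([round(v, 2) for v in mult_idx])
```

When the size of the weekly swing scales with the level of the series, as in this toy example, the multiplicative indices describe it with one set of ratios, whereas the additive indices can only capture an average swing. This mirrors the reasoning given above for preferring the multiplicative model.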

Variable Transformation Analysis
Thus, to improve the accuracy and interpretability of the regression models, variable transformation was applied to the deseasonalized data. After exploring different variable transformations, it was found that log x (number of people) and log y (GMV in different departments) provided the best goodness of fit on average among all the transformations, while also being easy to explain, with an intuitive coefficient to compare. The output results were stable and performed better than the models using raw data, as shown by the comparison.
In the fitting process, the models were evaluated using R-squared and p-values, and residual analysis was conducted to identify any abnormal patterns in the data. The models performed well in terms of accuracy and interpretability. However, an obvious residual pattern was observed that occurred multiple times. After consideration, the log-log transformation was kept, and several possible real-world interpretations of the residual pattern were proposed, including changes in the needs of the target audience, significant variance in popularity among hosts, and fluctuations in the unit price of goods caused by changing marketing methods. There may be some non-linear patterns in the data that could be addressed by changing the bases of the logarithms, but this would complicate the interpretability of the later analysis. The log-log transformation was preferred because it suited the data set, which had predictor and response variables of large orders of magnitude (10^8) and a relatively large distance between them (10^2-10^3/10^8), with some non-linear patterns inside. This transformation provided a more interpretable model with higher accuracy and precision, as shown in Figure 9 and Figure 10.
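The reason the log-log coefficient is easy to compare across partitions can be shown with a short derivation. Here $X$ stands for the number of viewers and $Y$ for GMV, and $\beta_0$, $\beta_1$ are the coefficients of the log-log model; this is the standard elasticity interpretation, stated for illustration.

```latex
\ln Y = \beta_0 + \beta_1 \ln X
\;\Longleftrightarrow\;
Y = e^{\beta_0} X^{\beta_1},
\qquad
\beta_1 = \frac{d \ln Y}{d \ln X} = \frac{dY / Y}{dX / X}.
```

In other words, a 1% increase in the number of viewers is associated with approximately a $\beta_1$% change in GMV, regardless of the scale of either variable.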

Discussion
Initially, a simple linear regression model was fitted to the data, which provided a good R-squared value. However, non-linear trends were found in the residuals, indicating the presence of a stable fluctuation component. To eliminate the influence of this component, seasonality was removed using a multiplicative model with a 7-day seasonal length, which significantly improved the R-squared value and explained the underlying trend of the data. The variables were then transformed by taking the log of x and y, which further improved the R-squared value and the model's explanatory power.
Based on the final model, the Baby care and pet, Fresh, Local Service, Clothes and Underwear, and Food and Drinks partitions have the highest slopes and are the most suitable for livestreaming sales, as shown in Figure 11. An analysis of the reasons behind this found that emotional connection, trust, selling non-standard products, and official support are important factors in livestreaming. For example, emotional connection is important for the Baby care and pet, Fresh, and Food and Drinks partitions, while the livestreaming environment is conducive to displaying goods for Clothes and Underwear. However, the final model did have two problems, as shown by the example of the makeup partition. Firstly, there was a weak correlation between the sales amount and the number of current viewers but a strong correlation with the number of units sold, which is attributed to the non-linear relationship between the sales amount and the price of makeup products during the shopping festival. Secondly, there was an influential point in the data, which occurred on March 8th during the shopping festival and showed a different relationship from the other data.
Based on the modeling analysis, the Baby care and pet, Fresh, Local Service, Clothes and Underwear, and Food and Drinks partitions are recommended as the most suitable for livestreaming sales.
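The selection rule used in this discussion, ranking partitions by the slope of the final model, can be sketched as follows. The partition names and figures are placeholders rather than results from the study, the helper `loglog_slope` is hypothetical, and the deseasonalization step is omitted for brevity.

```python
from math import fsum, log


def loglog_slope(viewers, gmv):
    """Slope of the least-squares line of ln(gmv) on ln(viewers)."""
    lx = [log(v) for v in viewers]
    ly = [log(g) for g in gmv]
    n = len(lx)
    lx_bar = fsum(lx) / n
    ly_bar = fsum(ly) / n
    return (fsum((a - lx_bar) * (b - ly_bar) for a, b in zip(lx, ly))
            / fsum((a - lx_bar) ** 2 for a in lx))


# Placeholder data: three made-up partitions with known log-log slopes.
viewers = [100.0, 1000.0, 10000.0]
data = {
    "partition_a": (viewers, [v ** 1.2 for v in viewers]),
    "partition_b": (viewers, [50.0 * v ** 0.8 for v in viewers]),
    "partition_c": (viewers, [5.0 * v for v in viewers]),
}

slopes = {name: loglog_slope(x, y) for name, (x, y) in data.items()}
ranking = sorted(slopes, key=slopes.get, reverse=True)
for name in ranking:
    print(f"{name}: slope = {slopes[name]:.3f}")
```

A ranking of this kind compares partitions only on how strongly GMV responds to viewership. It says nothing about the quality of each fit, which is why the R-squared values and residual patterns discussed above still need to be checked for each partition.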

Conclusion
This paper determined the most suitable partitions for livestreaming sales in the rapidly developing e-commerce market. Simple linear regression was used to study which product categories are most suitable for TikTok live broadcast sales. The conclusions are as follows. Firstly, the transformations of simple linear regression and the deseasonalization method used in this report do optimize the fit of the model. Secondly, the fitted models show that maternal and infant products and fresh products have significant advantages in live sales, and their future development is very promising.
This study provides valuable insights into improving regression modeling in the e-commerce industry, specifically focusing on the relationship between real-time viewership and GMV. By employing deseasonalization and variable transformation techniques, businesses can develop more accurate and interpretable models to optimize their sales strategies, ultimately contributing to increased revenue and success in the competitive e-commerce market.
However, the model in this study still has shortcomings. Some factors could not be obtained as quantifiable indicator data, yet they undoubtedly also have a certain degree of impact on live sales. Therefore, the suggestions made here are for reference only.

Table 1. Some Typical Partitions