Vietnam National University, Hanoi
Lecturer’s name: Do Trung Tuan
Group’s name: Group 6
May, 2023
Contents
Figure
1 PROJECT PROPOSAL
1.1 Team member list
1.2 Team name: Group 6
1.3 Work division – Contribution
2 INTRODUCTION
3 PROBLEM STATEMENT
4 GETTING THE DATA
5 EXPLORATORY DATA ANALYSIS
5.1 Preprocess the datasets
5.2 Understanding Dataset Features
6 DESCRIPTIVE STATISTICS
6.1 Statistical numbers
6.2 Let's separate numerical and categorical variables for easy analysis
7 REGRESSION ANALYSIS
7.1 Bangladesh’s crop production (BGD)
7.2 China’s crop production (CHN)
7.3 Japan’s crop production (JPN)
7.4 Korea’s crop production (KOR)
7.5 Thailand’s crop production (THA)
7.6 India’s crop production (IND)
7.7 Iran’s crop production (IRN)
8 DECISION TREE
8.1 Decision tree in text form
8.2 Decision tree using the scikit-learn library in Python
9 CONCLUSION REMARK
10 REFERENCE
Figure
Figure 1 Overview of the raw data
Figure 2 Remove
Figure 3 Drop the columns contain relevant info and all the possible feature
Figure 4 Identify unique value for each feature
Figure 5 Define filtered data frame
Figure 6 Code snippet
Figure 7 Statistical numbers
Figure 8 Production plot by subject
Figure 9 Total production for each kind, grouped by country
Figure 10 Code snippet
Figure 11 Code snippet
Figure 12 Boxplot for numerical columns
Figure 13 Linear Regression Models for Each Type of Production
Figure 14 Production and prediction crop production of BGD
Figure 15 Production and prediction crop production of CHN
Figure 16 Production and prediction crop production of JPN
Figure 17 Production and prediction crop production of KOR
Figure 18 Production and prediction crop production of THA
Figure 19 Production and prediction crop production of IND
Figure 20 Production and prediction crop production of IRN
Figure 21 Below are the steps involved in creating a decision tree
Figure 22 A decision tree in text form
Figure 23 Code snippet
Figure 24 A decision tree using the scikit-learn library in Python
1 PROJECT PROPOSAL
1.1 Team member list
1.2 Team name: Group 6
1.3 Work division – Contribution
2 Reading and analyzing results then Writing report
2 INTRODUCTION
Our team's project name is "Predicting Crop Yields in Selected Asian Countries Using Machine Learning".
Agriculture plays a crucial role in sustaining economies and ensuring food security for nations worldwide. In Asia, where agriculture is a significant sector, accurately predicting crop yields is imperative. By combining Exploratory Data Analysis (EDA) with machine learning techniques such as regression analysis and decision trees, it becomes possible to use historical data to forecast crop production for the years 2026-2028. This report explores the potential of these methods for predicting crop yields in selected Asian countries, thereby enabling policymakers and stakeholders to make informed decisions and implement effective strategies to address potential food shortages or surpluses.
Machine learning techniques have gained considerable attention and recognition due to their ability to analyze vast amounts of data and identify meaningful patterns and relationships. EDA, as an initial step, allows us to understand the data's structure and to identify missing values, outliers, and relationships between variables. By conducting a comprehensive EDA on historical agricultural datasets, we can gain valuable insights into the factors that influence crop yields, such as temperature, precipitation, soil composition, and cultivation practices.
Regression analysis offers a statistical approach to modeling the relationship between these influential factors and crop yields. By fitting regression models to historical data, we can estimate the relationship and quantify the impact of each variable on crop production. This knowledge can then be used to predict future yields based on projected values of the input variables.
Furthermore, decision trees provide a powerful framework for predicting crop yields by constructing a tree-like model of decisions and their potential consequences. Decision tree algorithms can consider multiple variables simultaneously and create a tree structure that maps out different scenarios, leading to different yield outcomes. By training decision tree models on historical data, we can create predictive models capable of estimating crop yields for future years based on specified input conditions.
In conclusion, EDA, regression analysis, and decision trees together offer a promising approach to predicting crop yields in selected Asian countries for the period 2026-2028. These methods can provide valuable insights into how policymakers can allocate resources effectively, implement suitable policies, and support farmers in making informed decisions. By leveraging the power of data and machine learning, we can strive for a more sustainable and resilient agricultural future in Asia.
3 PROBLEM STATEMENT
This research explores agricultural data and employs data mining techniques and machine learning algorithms to ascertain optimal crop yields, offering valuable insights into crop production.
Furthermore, leveraging food data spanning the past 35 years, this study enables the prediction of food production for the upcoming three-year period.
The dataset consists of over 1000 data points collected from seven randomly selected countries in Asia. It covers four major agricultural crops, namely rice, wheat, soybean, and maize, over the period from 1990 to 2025. This dataset allows for a detailed analysis of the trends and patterns in crop production across these countries over a significant time frame. By exploring this data, we can gain insights into agricultural productivity in Asia and make informed predictions about future crop yields using machine learning techniques.
4 GETTING THE DATA
We use yield data for four crops (rice, wheat, soybean, and maize) for seven randomly selected Asian countries. At the national level, forecasts are made throughout the year.
5 EXPLORATORY DATA ANALYSIS
In this step, we use standard machine learning and analytics techniques to process, clean, analyze, visualize, and model our data. We perform these tasks in Python, using Jupyter Notebook as our development environment. The analysis relies on several statistical libraries, which are detailed in the "Preprocess the datasets" section. The code for this step can be found in the Python file named "exploratory_data_analysis.py". The raw data is stored in "crop_production.csv".
5.1 Preprocess the datasets
To begin our analysis, we load the necessary dependencies and configure the settings for our analysis. We import the following libraries:
Pandas: used for data manipulation and analysis
Seaborn: used for data visualization
NumPy: used for numerical computations
Scikit-learn (sklearn): used for machine learning tasks
After loading the dependencies, we load our data into a DataFrame and examine its structure by printing the first 5 rows and the last 5 rows. This gives us a quick overview of the data. Here is the code snippet for loading the dependencies and printing the data:
Figure 1 Overview of the raw data
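The snippet shown in Figure 1 is not reproduced in the text, so the following is only a minimal sketch of the loading step, not the exact code used in the project. It assumes the file name "crop_production.csv" and the column names mentioned in this section. The small inline sample, with invented values, is there only so that the sketch runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for "crop_production.csv": the column names follow
# the report, but every row and value below is invented for illustration.
SAMPLE_CSV = """Index,LOCATION,Indicator,SUBJECT,MEASURE,Frequency,TIME,Value,Flag Codes
0,BGD,CROP,RICE,TONNE,A,1990,100.0,
1,BGD,CROP,RICE,TONNE,A,1991,102.0,
2,BGD,CROP,WHEAT,TONNE,A,1990,50.0,
3,CHN,CROP,RICE,TONNE,A,1990,900.0,
4,CHN,CROP,MAIZE,TONNE,A,1990,700.0,
5,JPN,CROP,SOYBEAN,TONNE,A,1990,20.0,
"""


def load_crop_data(source):
    """Read the crop CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real file this would be: df = load_crop_data("crop_production.csv")
df = load_crop_data(io.StringIO(SAMPLE_CSV))

print(df.head(5))  # first 5 rows
print(df.tail(5))  # last 5 rows
print(df.shape)    # (number of rows, number of columns)
```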
Having reviewed the raw data, we proceed to a deeper analysis. Our target is the "Value" column in the DataFrame, so we identify a list of possible features to consider. As a first step, we drop the 'Index', 'Indicator', 'Frequency', and 'Flag Codes' columns; the 'Index' column, for example, merely duplicates the pandas index.
Figure 2 Remove
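As a hedged illustration of this step, the sketch below drops the four columns named above with `DataFrame.drop`. The miniature DataFrame is hypothetical (invented values), and the column spellings are assumed to match the report's text.

```python
import pandas as pd

# Hypothetical miniature of the raw data; all values are invented.
df = pd.DataFrame({
    "Index": [0, 1, 2],
    "LOCATION": ["BGD", "CHN", "JPN"],
    "Indicator": ["CROP"] * 3,
    "SUBJECT": ["RICE", "WHEAT", "MAIZE"],
    "MEASURE": ["TONNE"] * 3,
    "Frequency": ["A"] * 3,
    "TIME": [1990, 1990, 1990],
    "Value": [100.0, 900.0, 20.0],
    "Flag Codes": [None] * 3,
})

# Columns the report removes in the first cleaning step.
columns_to_drop = ["Index", "Indicator", "Frequency", "Flag Codes"]
df = df.drop(columns=columns_to_drop)

print(df.columns.tolist())
```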
We have observed that the data features "LOCATION", "SUBJECT", and "TIME" are suitable and of sufficient quality for further statistical analysis.
Therefore, we will filter and focus solely on these features.
We use df.head(5) to display the first rows of the result; the argument sets how many rows are shown.
Figure 3 Drop the columns contain relevant info and all the possible feature
Next, we examine each feature and list all the unique values it contains. This helps us understand the distinct categories present in each feature.
During this analysis, we identify columns that are empty or contain only one unique value. These columns do not provide meaningful information for our analysis, so we remove them from the DataFrame.
Figure 4 Identify unique value for each feature
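One way to carry out this check is sketched below, under stated assumptions: the sample frame is hypothetical, and the rule "at most one distinct non-missing value" is our reading of "empty or one unique value". Which columns are actually flagged depends on the real data.

```python
import pandas as pd

# Hypothetical sample; "Frequency" is constant and "Flag Codes" is empty
# purely to illustrate the rule. All values are invented.
df = pd.DataFrame({
    "LOCATION": ["BGD", "BGD", "CHN", "CHN"],
    "SUBJECT": ["RICE", "WHEAT", "RICE", "WHEAT"],
    "Frequency": ["A", "A", "A", "A"],
    "TIME": [1990, 1990, 1991, 1991],
    "Value": [100.0, 50.0, 900.0, 300.0],
    "Flag Codes": [None, None, None, None],
})

# List the unique values of every column.
for column in df.columns:
    print(column, df[column].unique())

# A column with zero or one distinct non-missing value carries no information.
n_unique = df.nunique(dropna=True)
uninformative = n_unique[n_unique <= 1].index.tolist()

df_reduced = df.drop(columns=uninformative)
print("Dropped:", uninformative)
```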
Figure 5 Define filtered data frame
Now, with the selected features "LOCATION", "SUBJECT", "MEASURE", and "TIME", along with the "Value" column, we can form a filtered DataFrame to proceed to the next steps of our analysis.
By following these steps, we ensure that we have a clean and focused dataset, ready for further analysis and modeling.
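A minimal sketch of forming the filtered DataFrame is shown here. The name `df_filtered` is the one used later in the report; the sample rows are hypothetical, and the extra "Frequency" column is included only to show that unselected columns fall away.

```python
import pandas as pd

# Hypothetical sample rows; all values are invented.
df = pd.DataFrame({
    "LOCATION": ["BGD", "CHN", "JPN"],
    "SUBJECT": ["RICE", "WHEAT", "MAIZE"],
    "MEASURE": ["TONNE", "TONNE", "TONNE"],
    "Frequency": ["A", "A", "A"],
    "TIME": [1990, 1991, 1992],
    "Value": [100.0, 900.0, 20.0],
})

# Keep only the selected features plus the target column.
selected_columns = ["LOCATION", "SUBJECT", "MEASURE", "TIME", "Value"]
df_filtered = df[selected_columns].copy()

print(df_filtered.head(5))
```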
5.2 Understanding Dataset Features
Upon inspecting the raw dataset and examining several data rows, we can gain valuable insights into the different columns and their corresponding features:
LOCATION: This column represents the geographic location and is classified by country code. In the given dataset, we have data from seven distinct countries: Bangladesh (BGD), China (CHN), Japan (JPN), South Korea (KOR), Thailand (THA), Indonesia (IDN), and Iran (IRN). Each country code corresponds to a specific location where agricultural production data was recorded.
SUBJECT: This column indicates the type of agricultural production. The dataset includes four main categories: "RICE", "WHEAT", "SOYBEAN", and "MAIZE". These categories represent different crops or agricultural products.
TIME: This column records the time period for the data. In the dataset, the TIME feature is expressed in years. Each entry in the TIME column corresponds to a specific year during which the agricultural production data was collected.
Value: This column represents the actual value of agricultural production. It contains numeric values that quantify the production quantity or other relevant metrics associated with the specific agricultural subject and location.
By examining the unique values in each column, we gain a better understanding of the distinct locations, subject categories, and time periods covered by the dataset. This information helps us identify the key components and characteristics of the data, enabling us to perform more targeted analysis and draw meaningful conclusions about agricultural production trends across different countries and crops.
6 DESCRIPTIVE STATISTICS
6.1 Statistical numbers
Since our data is primarily grouped by the "SUBJECT" feature, with unique values 'RICE', 'WHEAT', 'SOYBEAN', and 'MAIZE', we calculate various statistical measures for these categories. Specifically, we calculate the mean, median, correlation, maximum, and minimum values for each category. This analysis gives us insight into the characteristics and variations within each subject's data.
The following code snippet demonstrates how we perform these calculations and presents the overall results:
Figure 6 Code snippet
Figure 7 Statistical numbers
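Because the snippet in Figure 6 is an image, the following is only a sketch of how such per-subject statistics could be computed with pandas. The sample data is hypothetical. The report does not say which correlation is meant, so the correlation between TIME and Value within each subject is used here as an assumption.

```python
import pandas as pd

# Hypothetical sample with two crops and two countries; values are invented.
df_filtered = pd.DataFrame({
    "LOCATION": ["BGD"] * 4 + ["CHN"] * 4,
    "SUBJECT": ["RICE", "RICE", "WHEAT", "WHEAT"] * 2,
    "TIME": [1990, 1991, 1990, 1991] * 2,
    "Value": [100.0, 110.0, 40.0, 50.0, 900.0, 920.0, 300.0, 310.0],
})

# Mean, median, maximum, and minimum of "Value" for each subject.
stats = df_filtered.groupby("SUBJECT")["Value"].agg(["mean", "median", "max", "min"])
print(stats)

# Assumed reading of "correlation": Pearson correlation between year and
# value, computed separately for each subject.
corr = {
    subject: group["TIME"].corr(group["Value"])
    for subject, group in df_filtered.groupby("SUBJECT")
}
print(corr)
```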
Next, we plot the data with the main focus on the SUBJECT feature. Overall, this code generates a line plot that visualizes the values of each subject over time. The x-axis represents the time values, the y-axis represents the corresponding values of the subjects, and each subject is drawn as a differently colored line.
plt.figure(figsize=(12,6)): Sets the size of the figure to 12 inches in width and 6 inches in height, giving the plot a 2:1 aspect ratio.
sns.lineplot(data=df_filtered, x='TIME', y='Value', hue='SUBJECT'): Creates a line plot using the lineplot function from Seaborn. The data parameter specifies the DataFrame df_filtered containing the data to be plotted. The x parameter specifies the column to be plotted on the x-axis, which is 'TIME'. The y parameter specifies the column to be plotted on the y-axis, which is 'Value'. The hue parameter specifies the column that distinguishes the subjects, which is 'SUBJECT'. This results in multiple lines on the plot, each representing a different subject.
plt.title("Line Plot by Subject"): Sets the title of the plot to "Line Plot by Subject".
Figure 14 Production and prediction crop production of BGD
The purpose of this code snippet is to predict crop yields for the upcoming years using a linear regression model.
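The original snippet is not included in the text, so the sketch below shows only one plausible way to set up such a forecast: a separate linear trend of Value against TIME for each crop in one country, fitted with scikit-learn's LinearRegression and evaluated for 2026 to 2028. The helper name `forecast_country` and the synthetic, perfectly linear sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def forecast_country(df, country, future_years=(2026, 2027, 2028)):
    """Fit one linear trend (Value ~ TIME) per crop for a single country."""
    rows = []
    country_df = df[df["LOCATION"] == country]
    for subject, group in country_df.groupby("SUBJECT"):
        model = LinearRegression()
        model.fit(group[["TIME"]], group["Value"])
        future = pd.DataFrame({"TIME": list(future_years)})
        predictions = model.predict(future)
        for year, value in zip(future_years, predictions):
            rows.append({
                "LOCATION": country,
                "SUBJECT": subject,
                "TIME": year,
                "Prediction": float(value),
            })
    return pd.DataFrame(rows)


# Synthetic data for 1990-2025 with exact linear trends (invented values).
years = np.arange(1990, 2026)
df_filtered = pd.concat([
    pd.DataFrame({"LOCATION": "BGD", "SUBJECT": "RICE", "TIME": years,
                  "Value": 100.0 + 2.0 * (years - 1990)}),
    pd.DataFrame({"LOCATION": "BGD", "SUBJECT": "WHEAT", "TIME": years,
                  "Value": 50.0 + 0.5 * (years - 1990)}),
], ignore_index=True)

forecast = forecast_country(df_filtered, "BGD")
print(forecast)
```

Calling the same helper with another country code would repeat the procedure for that country. A straight-line trend is a simple baseline and extrapolates whatever slope the historical data shows.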
From the graph above, we can see that Bangladesh's production of all four crops is forecast to grow between 2026 and 2028. Bangladesh's domestic agricultural output is not enough to meet domestic consumption demand, so the country imports food from abroad, and Vietnam is one of the countries it has chosen to cooperate with. According to Bangladesh's Minister of Food, the country's rice production is insufficient to supply its 170 million people, so Bangladesh still needs to import rice, with the main suppliers being India, Vietnam, and Myanmar. For the Bangladesh market, VINAFOOD II has been the main supplier of rice under the MOU for many years: it supplied 450,000 tons in 2011, 250,000 tons in 2017, 52,500 tons of white rice in 2021, and 230,000 tons of rice in 2022. In that spirit, Bangladesh has agreed to extend the MOU on rice trade with Vietnam for another five years.
Using similar code and extracting the corresponding rows from the data file, we can predict the agricultural yields of the remaining six countries.
7.2 China’s crop production (CHN)
We forecast China's (CHN) production of the four crops for the period between 2026 and 2028.
Figure 15 Production and prediction crop production of CHN
From the forecast chart, it can be seen that China's production of all four crops is forecast to increase between 2026 and 2028. In 2022, China's total food production reached 686.55 million tons, up 3.7 million tons (0.5%) from 2021, setting a new record and keeping production above 650 million tons for the eighth consecutive year. According to data released by the National Bureau of Statistics of China on December 12, the country's food production increased in all three harvests of the year. In terms of main foods, production of wheat and maize both increased slightly, rice decreased by 2%, and