
 Title: Car Price Prediction

 Abstract

 This project gave us practical experience in working with big databases using Python
and in understanding the challenges involved in data retrieval and analysis. Its purpose
is for us to learn how categorizing cars by their specifications, such as price and other
relevant attributes, helps data to be organized and used for decision-making in the
automotive domain and in AI algorithms. The project uses Python code provided by the
professor
(“https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/”),
which reads car data from a database file and analyzes it. We replace the database link in
the code with a new database to obtain updated results. We run the code on the Colab
website with different databases, compare the results, and analyze the process. We begin
by understanding the structure of the car database and identifying the specification
attributes relevant for categorization, such as price, make, model, year, mileage, and
horsepower. We then review the professor's code and modify it so that it fetches data
from the new database. Once configured, the code is executed on different databases of
car data, and the retrieved data is processed and analyzed to generate results. Finally,
we compare the results obtained from the different databases, assess the accuracy and
efficiency of the code, and identify any discrepancies or patterns in the retrieved data.
The project is based on this link:
 “https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/”
The project was made on:
 https://colab.research.google.com

 Introduction (Why you choose this project?)


- We chose car price prediction as our project topic because of its relevance to the car
industry and market. Understanding the factors that influence car prices and being able to
predict them accurately has practical applications in car sales, insurance, and finance.
Through this project we aimed to gain insight into the data handling techniques and
methodologies used for car price prediction, and to improve our skills in using Python for
data analysis and modeling on real-life problems. The project also helps us learn how to
work with big databases, which contributes to our professional development. In the future,
companies could apply this code and algorithm in other areas and industries.
 Summary of any one research paper

 The article I chose is titled “A Comparative Study of Machine Learning Techniques for
Predicting Used Car Prices”. This research paper compares different machine learning
techniques used to predict the prices of used cars. The study analyzes variables such as
car model, age, and mileage, and uses several algorithms, for example linear regression,
decision tree, and random forest, to predict car prices. The study found that the random
forest and decision tree algorithms performed much better than linear regression in
predicting vehicle prices. Overall, the study concludes that machine learning techniques
can be used effectively on big databases of categorized cars, with random forest being the
best-performing algorithm. The paper showed that machine learning algorithms can provide
accurate predictions of used car prices, which can benefit the retail business. The
research also highlights the importance of considering multiple variables and of feature
selection in creating an effective prediction model. In addition, its flowcharts, tables,
and other figures present the logic of the algorithms in a visual way.
 Data Set Description
https://www.kaggle.com/datasets/rohitagrawal362/audi-car-price-prediction
model  year  price  transmission  mileage  fueltype  tax  Highway mpg  enginesize
A1     2017  12500  Manual        15735    Petrol    150  55.4         1.4
A6     2016  16500  Automatic     36203    Diesel    20   64.2         2
A1     2016  11000  Manual        29946    Petrol    30   55.4         1.4
A4     2017  16800  Automatic     25952    Diesel    145  67.3         2
A3     2019  17300  Manual        1998     Petrol    145  49.6         1
Data Base 1: /content/DatabaseAudiMarketplace.csv

 Table 1. The first five rows of the primary Audi database.

Rows number: 10668

Columns number: 9
Attributes: “Model, Year, Price, Transmission, Mileage, Fuel Type, Tax, Highway Mpg,
Engine Size”
Attributes Explained:
Model: The vehicle's model name.
Year: The year the automobile was made.
Price: The price of the car.
Transmission: The vehicle's gearbox type.
Mileage: The total distance the automobile has covered.
Fuel Type: The kind of fuel the automobile runs on (for example petrol or diesel).
Tax: The cost of the car's related taxes.
Highway Mpg: Estimated miles per gallon for the vehicle on the highway.
Engine Size: The displacement of the vehicle's engine in liters.

Data Base 2: https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPriceDatabase.csv

Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115
continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

 Table 2. The first five rows of database two.

Rows number: 205 & Columns Number: 26


Attributes: car_ID: An index or identifier for every table row.
Symbolling: The insurance risk rating assigned to the car.
CarName: The make and model name of the vehicle.
Fuel type: The kind of fuel the automobile runs on.
Aspiration: How the engine takes in air: standard (std) or turbocharged.
Door number: How many doors the automobile has.
Carbody: The body style of the car, such as sedan or hatchback.
Drivewheel: The drive wheel arrangement (front, rear, or four-wheel drive).
Engine location: The position of the vehicle's engine, such as the front or back.
Wheelbase: The distance in inches between the front and rear wheel centers.
Engine size: The displacement of the car's engine (values such as 130 appear to be in
cubic inches rather than cubic centimeters).
Fuel system: The kind of fuel delivery system the vehicle uses.
Bore ratio: The bore (cylinder diameter) of the car's engine.
Stroke: The stroke (piston travel) of the car's engine.
Compression ratio: The compression ratio of the car's engine.
Horsepower: The power that the car's engine can produce.
Peak rpm: The engine speed, in revolutions per minute (rpm), at which the car's engine
produces its peak power.
City mpg: The estimated miles per gallon (mpg) of the vehicle in city driving.
Highway mpg: The estimated miles per gallon (mpg) of the vehicle when traveling on the
highway.
Price: The cost of the vehicle.

 Algorithm
1. First we find a database for a car market with enough attributes, such as year, km
driven, model, and engine size.
2. Then we edit the database so that it works with the program.
3. Then we adjust the Python file to match the database, because the program always
expects a specific set of columns. We have to change the program every time we input a
different database.
4. Once the database and the Python program are linked together, we run the program and
inspect the details and information it prints.
5. The program computes summary statistics such as the average price of the cars, draws
a price graph, and draws a colored correlation matrix.
6. After these calculations, the machine learning model is trained and can make a
prediction.
7. This prediction could be shown on websites such as second-hand car marketplaces.
8. When a user finds a car they would like to buy, they click on the listing, and our
program takes its input from the database of that website. That is, the program takes
every listing of the car the user chose so that the machine learning model can predict
the price. We recommend inputting at least 10,000 entries to make a more accurate
prediction.
9. After the user chooses the car, they should see the predicted price once the machine
learning model has processed the website's database and made the calculations needed
for the prediction.
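Steps 6 to 9 can be sketched as follows. This is a minimal illustration rather than the project's code: the five training rows are the ones shown in Table 1 (engine size and highway mpg as inputs, price as the target), and `chosen_car` is a hypothetical listing picked by a user.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The five rows of Table 1: [engine size in liters, highway mpg] -> price.
listings = np.array([
    [1.4, 55.4],
    [2.0, 64.2],
    [1.4, 55.4],
    [2.0, 67.3],
    [1.0, 49.6],
])
prices = np.array([12500, 16500, 11000, 16800, 17300])

# Step 6: train the model on the known listings.
model = DecisionTreeRegressor(random_state=0)
model.fit(listings, prices)

# Steps 8-9: the user picks a car (hypothetical values) and sees a predicted price.
chosen_car = np.array([[2.0, 64.2]])
predicted_price = model.predict(chosen_car)[0]
print(predicted_price)
```

With only five rows the tree simply memorizes the listings, which is why the algorithm recommends a much larger database before trusting the prediction.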
 Flowchart

Figure 1 – Flowchart: When the program starts, we first collect the data, then import
the libraries needed by the code, then load the datasets. After that comes the
exploratory data analysis, which is followed by the data preprocessing. The models are
then built and trained, the best model is selected, predictions are made on the chosen
dataset, and the cycle ends.
 Experiment results (Entire code "change all the variable names", All outputs, All figures
outputs with explanations)
 The first database's whole code, with the variable names modified:
# Libraries for plotting, numerics, data handling and modelling
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the Audi database and inspect it
DatabaseAudiMarketplace = pd.read_csv("/content/DatabaseAudiMarketplace.csv")
DatabaseAudiMarketplace.head()
DatabaseAudiMarketplace.isnull().sum()
DatabaseAudiMarketplace.info()
print(DatabaseAudiMarketplace.describe())

# Distribution of prices (distplot is deprecated; seaborn suggests histplot or displot)
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(DatabaseAudiMarketplace.price)
plt.show()

# Correlation table and heatmap
print(DatabaseAudiMarketplace.corr())
plt.figure(figsize=(20, 15))
correlations = DatabaseAudiMarketplace.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

# Keep two numeric features plus the target, then split features (x) from target (y)
predict = "price"
DatabaseAudiMarketplace = DatabaseAudiMarketplace[["enginesize", "highwaympg", "price"]]
x = np.array(DatabaseAudiMarketplace.drop([predict], 1))
y = np.array(DatabaseAudiMarketplace[predict])

# Train a decision tree on 80% of the rows and predict the remaining 20%
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Note: this scores the predictions against themselves, so it always returns 1.0;
# pass ytest instead of predictions to measure accuracy on held-out prices
model.score(xtest, predictions)
print(DatabaseAudiMarketplace)
The entire code of database 1 explained in parts (first a part of the code is shown,
then its output, and then the explanation):
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
DatabaseAudiMarketplace = pd.read_csv("/content/DatabaseAudiMarketplace.csv")
DatabaseAudiMarketplace.head()

 - As can be seen from these lines of code, for the project to function we first import
some libraries. Following the imports, the database file for Audi is loaded and its first
five rows are output.
id model year price transmission mileage fueltype tax highwaympg enginesize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0

Table 3. As the table above shows, the Python execution printed the first five rows of
the DatabaseAudiMarketplace.csv database that we loaded.
DatabaseAudiMarketplace.isnull().sum()
model 0
year 0
price 0
transmission 0
mileage 0
fueltype 0
tax 0
highwaympg 0
enginesize 0
dtype: int64

Table 4. This table shows the output of isnull().sum(), a pandas method chain that counts
the missing (null) cells in each column. isnull() marks every cell True if it is empty
and False otherwise, and sum() adds these up per column, counting True as 1 and False
as 0. Every column reports 0, so the database we entered has no missing values.
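As a small illustration of what `isnull().sum()` reports, the sketch below uses a made-up three-row table (the values and the name `toy` are hypothetical, not read from the Audi file) in which one price is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the real project reads DatabaseAudiMarketplace.csv instead.
toy = pd.DataFrame({
    "model": ["A1", "A6", "A3"],
    "price": [12500, np.nan, 17300],  # one missing price
    "mileage": [15735, 36203, 1998],
})

# isnull() marks each cell True/False; sum() counts the True cells per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A non-zero count in any column would mean those rows need to be filled in or dropped before training.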
DatabaseAudiMarketplace.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10668 entries, 0 to 10667
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10668 non-null object
1 year 10668 non-null int64
2 price 10668 non-null int64
3 transmission 10668 non-null object
4 mileage 10668 non-null int64
5 fueltype 10668 non-null object
6 tax 10668 non-null int64
7 highwaympg 10668 non-null float64
8 enginesize 10668 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 750.2+ KB

 Table 5. The info() call summarizes the structure of the database: the number of entries
(10668), the columns with their non-null counts and data types, and the memory used. As
the table shows, the dataframe occupies about 750.2 KB of memory and contains three
different types of data, which are float, integer, and object.
print(DatabaseAudiMarketplace.describe())

type year price mileage tax highwaympg \


count 10668.000000 10668.000000 10668.000000 10668.000000 10668.000000
mean 2017.100675 22896.685039 24827.244001 126.011436 50.770022
std 2.167494 11714.841888 23505.257205 67.170294 12.949782
min 1997.000000 1490.000000 1.000000 0.000000 18.900000
25% 2016.000000 15130.750000 5968.750000 125.000000 40.900000
50% 2017.000000 20200.000000 19000.000000 145.000000 49.600000
75% 2019.000000 27990.000000 36464.500000 145.000000 58.900000
max 2020.000000 145000.000000 323000.000000 580.000000 188.300000

continuing below >>

enginesize
count 10668.000000
mean 1.930709
std 0.602957
min 0.000000
25% 1.500000
50% 2.000000
75% 2.000000
max 6.300000

 Table 6. The describe() call gives summary statistics for every numerical column of the
dataframe. The count row is the number of non-null entries in the column, the mean row is
the average value, and the std row is the standard deviation, which measures how spread
out the values are. Min and max are the smallest and the largest value. The 25%, 50%, and
75% rows are quartiles: the value at or below which that share of the entries falls. For
example, 25% of the vehicles have an engine size of 1.5 liters or less, and half of them
have 2.0 liters or less.
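To make the meaning of the 25% row concrete, here is a small sketch on a hypothetical list of engine sizes (made-up values chosen only so the quartile is easy to check by hand; `engine_sizes` is not the project's data).

```python
import pandas as pd

# Hypothetical engine sizes in liters.
engine_sizes = pd.Series([1.0, 1.4, 1.5, 2.0, 2.0, 2.0, 3.0, 4.0])

summary = engine_sizes.describe()
print(summary)

# "25%" is the first quartile: about a quarter of the values lie at or below it.
q1 = summary["25%"]
share_at_or_below_q1 = (engine_sizes <= q1).mean()
print(q1, share_at_or_below_q1)
```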
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(DatabaseAudiMarketplace.price)
plt.show()

<ipython-input-8-b24cc0cfc4f5>:3: UserWarning:
`distplot` is a deprecated function and will be removed in seaborn v0.14.0.
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

Figure.2. This code generates a graph of the distribution of automobile prices, that is,
how many cars fall into each price range. The graph changes with the database that we
import, because each database represents a different market with different prices. In
this figure, the majority of the automobiles cost approximately $20,000, which agrees
with the median price of 20,200 in Table 6.
print(DatabaseAudiMarketplace.corr())
plt.figure(figsize=(20, 15))
correlations = DatabaseAudiMarketplace.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

<ipython-input-9-0bb2be9b5c0c>:1: FutureWarning: The default value of numeric_only in


DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(DatabaseAudiMarketplace.corr())
<ipython-input-9-0bb2be9b5c0c>:3: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = DatabaseAudiMarketplace.corr()

year price mileage tax highwaympg enginesize


year 1.000000 0.592581 -0.789667 0.093066 -0.351281 -0.031582
price 0.592581 1.000000 -0.535357 0.356157 -0.600334 0.591262
mileage -0.789667 -0.535357 1.000000 -0.166547 0.395103 0.070710
tax 0.093066 0.356157 -0.166547 1.000000 -0.635909 0.393075
highwaympg -0.351281 -0.600334 0.395103 -0.635909 1.000000 -0.365621
enginesize -0.031582 0.591262 0.070710 0.393075 -0.365621 1.000000

 Table 7.

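- The output contains a warning. It is a pandas FutureWarning: corr() was called on a
dataframe that also has text columns, and in a future pandas version the numeric_only
argument will default to False, so the numeric columns should be selected explicitly. The
correlation table of the dataframe (Table 7) is printed below the warning. The graphic
that follows the table displays the same values as a chart with various colors.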
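One way to address the `numeric_only` warning is to tell pandas explicitly which columns to correlate. The sketch below assumes pandas 1.5 or later (where `corr` accepts `numeric_only`) and uses a hypothetical four-row table named `toy` instead of the real file.

```python
import pandas as pd

# Hypothetical toy table with one text column and two numeric columns.
toy = pd.DataFrame({
    "model": ["A1", "A6", "A1", "A4"],
    "year": [2017, 2016, 2016, 2017],
    "price": [12500, 16500, 11000, 16800],
})

# Option 1: ask corr() to ignore non-numeric columns.
correlations = toy.corr(numeric_only=True)

# Option 2: select the numeric columns first, then correlate.
same = toy.select_dtypes(include="number").corr()

print(correlations)
```

Either form produces the same table and can be passed to `sns.heatmap` unchanged.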



Figure 3. This heatmap shows the correlation between every pair of numerical columns in
the database, with the values from Table 7 drawn as colors. Each cell lies between -1
and 1. The diagonal is always 1 because every column is perfectly correlated with itself;
this is the diagonal red line in the figure, and it does not by itself show that the model
works. The off-diagonal cells are the informative ones: price rises with year (0.59) and
engine size (0.59) and falls with mileage (-0.54) and highway mpg (-0.60). These
relationships are what the machine learning model can use to estimate the cost of the car.
predict = "price"
DatabaseAudiMarketplace = DatabaseAudiMarketplace[["enginesize", "highwaympg", "price"]]
x = np.array(DatabaseAudiMarketplace.drop([predict], 1))
y = np.array(DatabaseAudiMarketplace[predict])
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)
from sklearn.metrics import mean_absolute_error
# Note: scoring the predictions against themselves always returns 1.0;
# pass ytest instead of predictions to measure accuracy on held-out prices
model.score(xtest, predictions)

<ipython-input-19-a1c592d6430c>:3: FutureWarning: In a future version of pandas all


arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
1.0

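- This output includes a pandas warning and the value 1.0. The warning says that a future
version of pandas will require the axis argument of drop to be passed by keyword. The 1.0
is the R² score returned by model.score. Because the code scores the test predictions
against themselves rather than against the real prices (ytest), the result is 1.0 by
construction: it shows that the code ran, but not how accurate the model is.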
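The difference between scoring against the predictions and scoring against the held-out prices can be seen in the sketch below. It uses synthetic data (randomly generated engine sizes, mpg values, and prices with an invented formula) so that it is self-contained, and it fixes `random_state` only to make the run repeatable.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Audi table: price depends on two features plus noise.
rng = np.random.default_rng(42)
enginesize = rng.uniform(1.0, 3.0, size=400)
highwaympg = rng.uniform(30.0, 70.0, size=400)
price = 8000 * enginesize - 150 * highwaympg + 15000 + rng.normal(0, 2000, size=400)

x = np.column_stack([enginesize, highwaympg])
y = price
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Comparing the predictions with themselves is always a perfect match.
self_score = model.score(xtest, predictions)
# Comparing the predictions with the held-out prices measures real accuracy.
test_r2 = model.score(xtest, ytest)
test_mae = mean_absolute_error(ytest, predictions)
print(self_score, test_r2, test_mae)
```

The mean absolute error is in the same units as the price, which makes it easy to read as "the typical prediction is off by this much".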
print(DatabaseAudiMarketplace)

enginesize highwaympg price


0 1.4 55.4 12500
1 2.0 64.2 16500
2 1.4 55.4 11000
3 2.0 67.3 16800
4 1.0 49.6 17300
... ... ... ...
10663 1.0 49.6 16999
10664 1.0 49.6 16999
10665 1.0 49.6 17199
10666 1.4 47.9 19499
10667 1.4 47.9 15999

[10668 rows x 3 columns]

 Table 8. This output shows the changed version of the database that we entered. It now
has only 3 columns (enginesize, highwaympg, and price) and still has all 10668 rows. The
database had to be reduced in this way to work with the program, because the decision
tree only accepts numerical inputs, and text columns such as model, transmission, and
fueltype cannot be passed to it as they are. We used only these columns in our
application. Columns that carry no information about a car's worth, such as a car's ID,
should also be left out, because they might cause problems for the machine learning
model and lead it to make mistakes in some circumstances.
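Dropping the text columns is one option. An alternative that keeps them, shown here only as a sketch and not as what the project did, is one-hot encoding with `pd.get_dummies`. The three-row table `toy` below is hypothetical.

```python
import pandas as pd

# Hypothetical toy table with two text columns and one numeric feature.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual"],
    "fueltype": ["Petrol", "Diesel", "Petrol"],
    "enginesize": [1.4, 2.0, 1.0],
    "price": [12500, 16500, 17300],
})

# Separate the target, then turn each text category into its own 0/1 column.
features = pd.get_dummies(
    toy.drop(columns=["price"]),
    columns=["transmission", "fueltype"],
)
target = toy["price"]
print(features.columns.tolist())
```

The resulting `features` table is fully numeric, so it could be passed to `DecisionTreeRegressor` in the same way as the three-column version.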
 The second database's whole source code, with modified variable names:
# Libraries for plotting, numerics, data handling and modelling
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Load the second database and inspect it
# (repository path as listed under Data Base 2; renaming the variable had altered it)
CarPriceDatabase = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/CarPrice.csv")
CarPriceDatabase.head()
CarPriceDatabase.isnull().sum()
CarPriceDatabase.info()
print(CarPriceDatabase.describe())
CarPriceDatabase.CarName.unique()

# Distribution of prices (distplot is deprecated; seaborn suggests histplot or displot)
sns.set_style("whitegrid")
plt.figure(figsize=(15, 10))
sns.distplot(CarPriceDatabase.price)
plt.show()

# Correlation table and heatmap
print(CarPriceDatabase.corr())
plt.figure(figsize=(20, 15))
correlations = CarPriceDatabase.corr()
sns.heatmap(correlations, cmap="coolwarm", annot=True)
plt.show()

# Keep the numeric features plus the target, then split features (x) from target (y)
predict = "price"
CarPriceDatabase = CarPriceDatabase[["symboling", "wheelbase", "carlength",
"carwidth", "carheight", "curbweight",
"enginesize", "boreratio", "stroke",
"compressionratio", "horsepower", "peakrpm",
"citympg", "highwaympg", "price"]]
x = np.array(CarPriceDatabase.drop([predict], 1))
y = np.array(CarPriceDatabase[predict])

# Train a decision tree on 80% of the rows and predict the remaining 20%
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
model = DecisionTreeRegressor()
model.fit(xtrain, ytrain)
predictions = model.predict(xtest)

# Note: this scores the predictions against themselves, so it always returns 1.0;
# pass ytest instead of predictions to measure accuracy on held-out prices
model.score(xtest, predictions)
print(CarPriceDatabase)
The whole second database's output with interpretations:
Car_ID  Symbolling  Car Name                  Fuel Type  Aspiration  Doors Number  CarBody
1       3           alfa-romero giulia        gas        std         two           convertible
2       3           alfa-romero stelvio       gas        std         two           convertible
3       1           alfa-romero Quadrifoglio  gas        std         two           hatchback
4       2           audi 100 ls               gas        std         four          sedan
5       2           audi 100ls                gas        std         four          sedan
continuing… ↓

Drive wheel  Engine location  Wheelbase  Engine size  Fuel System  Bore ratio  Stroke  Compression ratio  Horsepower
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            88.6       130          mpfi         3.47        2.68    9                  111
RWD          front            94.5       152          mpfi         2.68        3.47    9                  154
FWD          front            99.8       109          mpfi         3.19        3.4     10                 102
4WD          front            99.4       136          mpfi         3.19        3.4     8                  115
continuing… ↓

Peak rpm  City mpg  Highway mpg  Price
5000      21        27           13495
5000      21        27           16500
5000      19        26           16500
5500      24        30           13950
5500      18        22           17450

 Table 9. This table displays the input, that is, the database that we entered. It shows
only the first 5 rows of the file we provided to the program.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 car_ID 205 non-null int64
1 symboling 205 non-null int64
2 CarName 205 non-null object
3 fueltype 205 non-null object
4 aspiration 205 non-null object
5 doornumber 205 non-null object
6 carbody 205 non-null object
7 drivewheel 205 non-null object
8 enginelocation 205 non-null object
9 wheelbase 205 non-null float64
10 carlength 205 non-null float64
11 carwidth 205 non-null float64
12 carheight 205 non-null float64
13 curbweight 205 non-null int64
14 enginetype 205 non-null object
15 cylindernumber 205 non-null object
16 enginesize 205 non-null int64
17 fuelsystem 205 non-null object
18 boreratio 205 non-null float64
19 stroke 205 non-null float64
20 compressionratio 205 non-null float64
21 horsepower 205 non-null int64
22 peakrpm 205 non-null int64
23 citympg 205 non-null int64
24 highwaympg 205 non-null int64
25 price 205 non-null float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

 This output lists every column of the dataset together with its number of non-null
values and its data type, which is an integer, a float, or an object. It also displays how
many columns there are of each data type and the amount of memory used. All 26 columns
have 205 non-null values, so no data is missing.
car_ID symboling wheelbase carlength carwidth carheight \
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 98.756585 174.049268 65.907805 53.724878
std 59.322565 1.245307 6.021776 12.337289 2.145204 2.443522
min 1.000000 -2.000000 86.600000 141.100000 60.300000 47.800000
25% 52.000000 0.000000 94.500000 166.300000 64.100000 52.000000
50% 103.000000 1.000000 97.000000 173.200000 65.500000 54.100000
75% 154.000000 2.000000 102.400000 183.100000 66.900000 55.500000
max 205.000000 3.000000 120.900000 208.100000 72.300000 59.800000

curbweight enginesize boreratio stroke compressionratio \


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 2555.565854 126.907317 3.329756 3.255415 10.142537
std 520.680204 41.642693 0.270844 0.313597 3.972040
min 1488.000000 61.000000 2.540000 2.070000 7.000000
25% 2145.000000 97.000000 3.150000 3.110000 8.600000
50% 2414.000000 120.000000 3.310000 3.290000 9.000000
75% 2935.000000 141.000000 3.580000 3.410000 9.400000
max 4066.000000 326.000000 3.940000 4.170000 23.000000

horsepower peakrpm citympg highwaympg price


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 104.117073 5125.121951 25.219512 30.751220 13276.710571
std 39.544167 476.985643 6.542142 6.886443 7988.852332
min 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 116.000000 5500.000000 30.000000 34.000000 16503.000000
max 288.000000 6600.000000 49.000000 54.000000 45400.000000

 The describe() call summarizes every numerical column of the dataframe. The count row is
the number of non-null entries, the mean row is the average value, and the std row is the
standard deviation. As is evident, the numbers vary from column to column. Min displays
the lowest value and max displays the highest value in the column. The 25%, 50%, and 75%
rows are quartiles: for example, 25% of the cars cost 7788 or less, and half of them cost
10295 or less.
<ipython-input-2-3b6c97159ec3>:7: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

<ipython-input-2-3b6c97159ec3>:9: FutureWarning: The default value of numeric_only in


DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
print(data.corr())
<ipython-input-2-3b6c97159ec3>:11: FutureWarning: The default value of numeric_only in
DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid
columns or specify the value of numeric_only to silence this warning.
correlations = data.corr()

 We can observe several warnings in this output. The seaborn warning says that distplot is
deprecated and will be removed in a later release, so the code should eventually switch
to displot or histplot. The pandas warnings say that the default value of numeric_only in
corr() will change in a future version. Right now the code functions properly, but it will
need to be updated to keep working with later versions of these libraries.
Figure 5. This figure shows the distribution of car prices in the second database, drawn
as histogram columns combined with a smoothed density line. In other words, it shows how
common each price range is in this market. As we can see, prices range from about $5,000
to about $45,000, and most cars are concentrated at the low end, around $8,000 to $10,000.
The graph will look different for every database we give the program.
car_ID symboling wheelbase carlength carwidth \
car_ID 1.000000 -0.151621 0.129729 0.170636 0.052387
symboling -0.151621 1.000000 -0.531954 -0.357612 -0.232919
wheelbase 0.129729 -0.531954 1.000000 0.874587 0.795144
carlength 0.170636 -0.357612 0.874587 1.000000 0.841118
carwidth 0.052387 -0.232919 0.795144 0.841118 1.000000
carheight 0.255960 -0.541038 0.589435 0.491029 0.279210
curbweight 0.071962 -0.227691 0.776386 0.877728 0.867032
enginesize -0.033930 -0.105790 0.569329 0.683360 0.735433
boreratio 0.260064 -0.130051 0.488750 0.606454 0.559150
stroke -0.160824 -0.008735 0.160959 0.129533 0.182942
compressionratio 0.150276 -0.178515 0.249786 0.158414 0.181129
horsepower -0.015006 0.070873 0.353294 0.552623 0.640732
peakrpm -0.203789 0.273606 -0.360469 -0.287242 -0.220012
citympg 0.015940 -0.035823 -0.470414 -0.670909 -0.642704
highwaympg 0.011255 0.034606 -0.544082 -0.704662 -0.677218
price -0.109093 -0.079978 0.577816 0.682920 0.759325

carheight curbweight enginesize boreratio stroke \


car_ID 0.255960 0.071962 -0.033930 0.260064 -0.160824
symboling -0.541038 -0.227691 -0.105790 -0.130051 -0.008735
wheelbase 0.589435 0.776386 0.569329 0.488750 0.160959
carlength 0.491029 0.877728 0.683360 0.606454 0.129533
carwidth 0.279210 0.867032 0.735433 0.559150 0.182942
carheight 1.000000 0.295572 0.067149 0.171071 -0.055307
curbweight 0.295572 1.000000 0.850594 0.648480 0.168790
enginesize 0.067149 0.850594 1.000000 0.583774 0.203129
boreratio 0.171071 0.648480 0.583774 1.000000 -0.055909
stroke -0.055307 0.168790 0.203129 -0.055909 1.000000
compressionratio 0.261214 0.151362 0.028971 0.005197 0.186110
horsepower -0.108802 0.750739 0.809769 0.573677 0.080940
peakrpm -0.320411 -0.266243 -0.244660 -0.254976 -0.067964
citympg -0.048640 -0.757414 -0.653658 -0.584532 -0.042145
highwaympg -0.107358 -0.797465 -0.677470 -0.587012 -0.043931
price 0.119336 0.835305 0.874145 0.553173 0.079443

compressionratio horsepower peakrpm citympg \


car_ID 0.150276 -0.015006 -0.203789 0.015940
symboling -0.178515 0.070873 0.273606 -0.035823
wheelbase 0.249786 0.353294 -0.360469 -0.470414
carlength 0.158414 0.552623 -0.287242 -0.670909
carwidth 0.181129 0.640732 -0.220012 -0.642704
carheight 0.261214 -0.108802 -0.320411 -0.048640
curbweight 0.151362 0.750739 -0.266243 -0.757414
enginesize 0.028971 0.809769 -0.244660 -0.653658
boreratio 0.005197 0.573677 -0.254976 -0.584532
stroke 0.186110 0.080940 -0.067964 -0.042145
compressionratio 1.000000 -0.204326 -0.435741 0.324701
horsepower -0.204326 1.000000 0.131073 -0.801456
peakrpm -0.435741 0.131073 1.000000 -0.113544
citympg 0.324701 -0.801456 -0.113544 1.000000
highwaympg 0.265201 -0.770544 -0.054275 0.971337
price 0.067984 0.808139 -0.085267 -0.685751

highwaympg price
car_ID 0.011255 -0.109093
symboling 0.034606 -0.079978
wheelbase -0.544082 0.577816
carlength -0.704662 0.682920
carwidth -0.677218 0.759325
carheight -0.107358 0.119336
curbweight -0.797465 0.835305
enginesize -0.677470 0.874145
boreratio -0.587012 0.553173
stroke -0.043931 0.079443
compressionratio 0.265201 0.067984
horsepower -0.770544 0.808139
peakrpm -0.054275 -0.085267
citympg 0.971337 -0.685751
highwaympg 1.000000 -0.697599
price -0.697599 1.000000

 In this output, the table contains the same information as the colored matrix: each cell
is the correlation between two numeric columns of the dataset. These correlations show
which specifications are most strongly related to the price of the automobile, which is the
value the machine learning model will forecast. The colored chart is shown on the next page.
Figure 6. Correlation heatmap of the numeric columns of the dataset that we entered. The value 1 runs along the
diagonal, as expected, because every column is perfectly correlated with itself. The variation across the remaining
cells shows how strongly each pair of columns is related, including how strongly each column is related to the price.
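To make the table easier to read, the price column of the correlation matrix can be sorted by absolute value. The sketch below copies the price correlations printed in the output above into a pandas Series and ranks them; the variable names are ours and are not part of the original notebook.

```python
import pandas as pd

# Correlations with price, copied from the printed matrix above.
price_corr = pd.Series({
    "car_ID": -0.109093,
    "symboling": -0.079978,
    "wheelbase": 0.577816,
    "carlength": 0.682920,
    "carwidth": 0.759325,
    "carheight": 0.119336,
    "curbweight": 0.835305,
    "enginesize": 0.874145,
    "boreratio": 0.553173,
    "stroke": 0.079443,
    "compressionratio": 0.067984,
    "horsepower": 0.808139,
    "peakrpm": -0.085267,
    "citympg": -0.685751,
    "highwaympg": -0.697599,
})

# Rank the columns by the strength of their linear relationship with price,
# ignoring the sign, while keeping the signed values for display.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)

print(ranked.head(5))
```

In this ranking, engine size, curb weight and horsepower come first, while highway and city mpg are strongly negative: cars with better fuel economy tend to be cheaper in this dataset.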
<ipython-input-3-09e4e61e658b>:7: FutureWarning: In a future version of pandas all arguments
of DataFrame.drop except for the argument 'labels' will be keyword-only.
x = np.array(data.drop([predict], 1))
1.0
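- As we can see, only the value 1.0 is displayed in this output, after a pandas FutureWarning.
The warning is not an error: it says that in future pandas versions the axis argument in
data.drop([predict], 1) must be passed by keyword. The 1.0 is not a pandas check of the
dataframe. In the tutorial code this project follows, the cell ends by calling the model's
score function, which for a regression model returns the coefficient of determination (R²):
1.0 means the predictions exactly match the values they are compared with, and lower values
mean a worse fit. A score of exactly 1.0 is unusually high for a price model, so it should be
confirmed by comparing the predictions with the held-out test prices.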
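As a minimal sketch of the two items in this output (the FutureWarning and the 1.0), the example below uses a few rows copied from the data shown in this report. It is our own illustration, not the notebook's code: `r2` is a hypothetical helper that computes the coefficient of determination by hand, and the prediction values are invented. It shows the keyword form of `DataFrame.drop`, and that scoring predictions against themselves always gives 1.0, while scoring them against the true prices generally does not. The output alone does not tell us which comparison the notebook makes.

```python
import numpy as np
import pandas as pd

# First five rows of three columns, copied from the output in this report.
data = pd.DataFrame({
    "enginesize": [130, 130, 152, 109, 136],
    "horsepower": [111, 111, 154, 102, 115],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, 17450.0],
})
predict = "price"

# Keyword form of drop: same result as data.drop([predict], 1), no warning.
x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Invented predictions, for illustration only.
predictions = np.array([14000.0, 15500.0, 17000.0, 13000.0, 17000.0])

print(x.shape)                       # feature matrix: 5 rows, 2 columns
print(r2(predictions, predictions))  # predictions scored against themselves
print(r2(y, predictions))            # predictions scored against true prices
```

Only the last line measures how well the predictions match the real prices, so that is the comparison to use when judging the model.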

symboling wheelbase carlength carwidth carheight curbweight \


0 3 88.6 168.8 64.1 48.8 2548
1 3 88.6 168.8 64.1 48.8 2548
2 1 94.5 171.2 65.5 52.4 2823
3 2 99.8 176.6 66.2 54.3 2337
4 2 99.4 176.6 66.4 54.3 2824
.. ... ... ... ... ... ...
200 -1 109.1 188.8 68.9 55.5 2952
201 -1 109.1 188.8 68.8 55.5 3049
202 -1 109.1 188.8 68.9 55.5 3012
203 -1 109.1 188.8 68.9 55.5 3217
204 -1 109.1 188.8 68.9 55.5 3062

enginesize boreratio stroke compressionratio horsepower peakrpm \


0 130 3.47 2.68 9.0 111 5000
1 130 3.47 2.68 9.0 111 5000
2 152 2.68 3.47 9.0 154 5000
3 109 3.19 3.40 10.0 102 5500
4 136 3.19 3.40 8.0 115 5500
.. ... ... ... ... ... ...
200 141 3.78 3.15 9.5 114 5400
201 141 3.78 3.15 8.7 160 5300
202 173 3.58 2.87 8.8 134 5500
203 145 3.01 3.40 23.0 106 4800
204 141 3.78 3.15 9.5 114 5400
citympg highwaympg price
0 21 27 13495.0
1 21 27 16500.0
2 19 26 16500.0
3 24 30 13950.0
4 18 22 17450.0
.. ... ... ...
200 23 28 16845.0
201 19 25 19045.0
202 18 23 21485.0
203 26 27 22470.0
204 19 25 22625.0

[205 rows x 15 columns]



- This output displays the reduced version of the input dataset. Some columns are absent
because they were not needed to predict the car's price, such as the car's identification
number, fuel system, engine position, etc. No rows are missing (all 205 remain), but 11
columns have been removed, leaving 15. A column like the car ID, which has no bearing on
the price of the car, would only add noise to the machine learning model, so such columns
must be removed.
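The column selection described above can be sketched as follows. This is a toy example under stated assumptions: the frame has only five columns, the numeric values are copied from the first rows of the output, and the `fuelsystem` entries are placeholders for illustration.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns.
raw = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "fuelsystem": ["placeholder", "placeholder", "placeholder"],
    "enginesize": [130, 130, 152],
    "horsepower": [111, 111, 154],
    "price": [13495.0, 16500.0, 16500.0],
})

# Keep only the columns used for prediction; every row is preserved.
keep = ["enginesize", "horsepower", "price"]
data = raw[keep]

dropped = [c for c in raw.columns if c not in keep]
print(data.shape)  # rows are unchanged, only the column count shrinks
print(dropped)     # the columns that were left out
```

Selecting by an explicit list of names makes it easy to see which columns the model receives and which ones were left out.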
 Compare a minimum of 2 datasets with all outputs
Dataset 1 Dataset 2

- As the pictures above show, there is a significant disparity between the two datasets.
They do share one feature, which is the number 1 running along the diagonal. This is
because each diagonal cell compares a column with itself, and a column is always perfectly
correlated with itself. The major difference is the number of rows and columns in the matrix.
The first dataset, audi.csv, has fewer numeric columns, so its matrix has fewer squares; the
second dataset contains a greater variety of columns, so its matrix has more squares. The
color scale appeared to be the same in both, except that the larger second matrix shows a
greater range of hues.
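A short sketch can confirm why both matrices have 1 on the diagonal while differing in size. The two frames below are synthetic stand-ins with random values and hypothetical column names, not the real datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two synthetic frames standing in for the two datasets:
# one with 3 numeric columns, one with 5.
small = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
large = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("abcde"))

for name, frame in [("small", small), ("large", large)]:
    corr = frame.corr()
    # A frame with n numeric columns gives an n x n matrix, and each column
    # compared with itself gives 1 on the diagonal.
    diagonal_is_one = np.allclose(np.diag(corr.to_numpy()), 1.0)
    print(name, corr.shape, diagonal_is_one)
```

The size of the heatmap therefore depends only on how many numeric columns a dataset has, not on how many rows it contains.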
Dataset 1
Dataset 2

- The density, the number of cars and the prices are where we can observe the most
difference in this case. The density axis runs from 0 to 5 in the first dataset, but from 0 to
0.00010 in the second. The first database also has more automobiles (rows), but the number
of rows does not set the density scale: a density curve is normalized so that the area under it
equals 1, so its height depends on the price units and spread (and on any scale factor printed
on the axis). The price disparity is significant. In the first dataset, prices range from 0 to
150,000 dollars, whereas in the second they range from 0 to 55,000 dollars. Judging by the
peak of each curve, the typical automobile costs roughly 20,000 dollars in the first dataset and
around 9,000 dollars in the second, so prices in the first database are higher overall.
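The behaviour of the density axis can be checked with a small experiment. The sketch below uses synthetic, uniformly distributed prices (not the real datasets) and a normalized histogram as a stand-in for the density plot; the sample sizes and price ranges are assumptions chosen only to resemble the two cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic prices: a small sample over a narrow price range and a large
# sample over a wide price range.
few_cheap = rng.uniform(5_000, 55_000, size=200)
many_expensive = rng.uniform(5_000, 150_000, size=10_000)

for label, prices in [("few_cheap", few_cheap), ("many_expensive", many_expensive)]:
    # density=True rescales the bar heights so the total area equals 1.
    heights, edges = np.histogram(prices, bins=20, density=True)
    area = np.sum(heights * np.diff(edges))
    print(label, round(float(area), 6), float(heights.max()))
```

Both areas come out as 1 regardless of the number of rows, and the sample with the wider price range has the lower peak. The height of a density curve therefore reflects how spread out the prices are, not how many cars were entered.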
 Conclusion
In conclusion, the reading, studying, writing and analysis involved in this project
developed our research skills considerably. From this comparative study we learned how the
program turns the input data into different outputs, such as graphs, tables and other visual
forms, and how these illustrate the way various characteristics affect the predicted car price.
The biggest difference that we noticed was in the tables, where the two databases differed in
the number of rows and columns, with the second database containing more diverse inputs,
including car dimensions. The code we took and modified displayed the density of prices on
a graph and gave insight into the distribution of prices in each database. Specifically, the first
database tended to have higher car prices. Further research and analysis can explore the
implications of these findings for pricing strategies and decision-making in the vehicle
industry, and companies can use this kind of algorithm or code to apply their own data in
similar projects.
 References

Chen, J., Han, Q., Li, F., Wang, Q., Xu, J., & Yan, M. (2022). Comparisons of different methods
used for second-hand car price prediction. Paper presented at the, 12259 122594N-122594N-11.
https://doi.org/10.1117/12.2638739

Demiriz, A. (2018). Used car pricing and beyond: A survival analysis framework. In 2018 First
International Conference on Artificial Intelligence for Industries (AI4I) (pp. 65-68). IEEE.
https://doi.org/10.1109/AI4I.2018.8665680

Gegic, E., Isakovic, B., Keco, D., Kevric, J., & Masetic, Z. (2019). Car price prediction using
machine learning techniques. TEM Journal, 8(1), 113-118. https://doi.org/10.18421/TEM81-16

Google. (n.d.). Car price using machine learning flowchart [Image]. Retrieved from
https://www.google.com/search?q=car+price+using+machine+learning+flowchart&tbm=isch&ved=2ahUKEwiOx57Nlcz-AhVsiP0HHfSnAz0Q2-cCegQIABAA&oq=car+price+using+machine+learning+flowchart&gs_lcp=CgNpbWcQA1DjAVj7DWD1DmgAcAB4AIABhgGIAdAIkgEDNi41mAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=OoVLZI7CHeyQ9u8P9M-O6AM&bih=577&biw=1280&rlz=1C1GCEU_enXK1022XK1022#imgrc=bbw8OUONW79ItM

Hankar, M., Beni-Hssane, A., & Birjali, M. (2022). Used car price prediction using machine
learning: A case study. In 2022 11th International Symposium on Signal, Image, Video and
Communications (ISIVC) (pp. 1-4). IEEE. https://doi.org/10.1109/ISIVC54825.2022.9800719

Hassanien, B. E., Azim, M. A., & Elgohary, M. A. (2020). Used cars price prediction based on
deep learning techniques. In 2020 IEEE 2nd International Conference on Advances in
Computational Intelligence (ICACI) (pp. 332-336). Brno, Czech Republic. doi:
10.1109/ICACI49156.2020.9120244.

Jin, C. (2021). Price prediction of used cars using machine learning. In 2021 IEEE International
Conference on Emergency Science and Information Technology (ICESIT) (pp. 223-230). IEEE.
https://doi.org/10.1109/ICESIT53460.2021.9696839

Jindal, M., & Jain, N. (2020). Machine learning based car price prediction. In 2020 International
Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE) (pp. 1-
6). Jaipur, India. doi: 10.1109/ic-ETITE50223

Afroz, S. A., Masum, M. M. H., Islam, M. M., & Hossain, M. S. (2022). Prediction of car prices
using machine learning: A comprehensive review. In S. S. Das, S. Misra, S. Mukhopadhyay, &
V. Patidar (Eds.), Artificial Intelligence and Machine Learning for Future Networks and Systems
(pp. 73-89). Springer. https://doi.org/10.1007/978-981-18-9228-6_5

Chen, S., & Liu, Z. (2022). Application of data mining technology in second-hand car price
forecasting. In 2022 3rd International Conference on Electronic Communication and Artificial
Intelligence (IWECAI) (pp. 260-273). Zhuhai, China. doi: 10.1109/IWECAI55315.2022.00058

Dawood, H. A., Ibrahim, F. N., & Ali, O. M. G. (2020). Used car price prediction model: A
machine learning approach. In 2020 4th International Conference on Intelligent Computing in
Data Sciences (ICDS) (pp. 1-6). Cairo, Egypt. doi: 10.1109/ICDS48824.2020.9280089

Hossain, M. I., Uddin, M. S., & Islam, M. R. (2021). Car price prediction model development
and analysis for Bangladesh market. In 2021 IEEE 5th International Conference on Computing
Communication and Automation (ICCCA) (pp. 441-446). Noida, India. doi:
10.1109/CCAA51626.2021.9374211

Huang, Y., Zhang, X., & Cheng, H. (2021). Prediction of car price based on ensemble learning
and data cleaning. In 2021 8th International Conference on Information Science and Technology
(ICIST) (pp. 21-26). Nanjing, China. doi: 10.1109/ICIST51598.2021.00009

Li, H. (2021). Research on big data analysis data acquisition and data analysis. In 2021
International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA) (pp.
162-165). Xi'an, China. doi: 10.1109/CAIBDA53561.2021.00041

Lukić, T., & Vulić, T. (2019). Predicting the price of used cars using multiple linear regression
and artificial neural network. TEM Journal, 8(1), 113-118. doi: 10.18421/TEM81-15

Narayana, C. V., Likhitha, C. L., Bademiya, S., & Kusumanjali, K. (2021). Machine learning
techniques to predict the price of used cars: Predictive analytics in retail business. In 2021
Second International Conference on Electronics and Sustainable Communication Systems
(ICESC) (pp. 1680-1687). Coimbatore, India. doi: 10.1109/ICESC51422.2021.9532845

Shaikh, M. K., Zaki, H., Tahir, M., Khan, M. A., Siddiqui, O. A., & Rahim, I. U. (2022). The
framework of car price prediction and damage detection technique. Pakistan Journal of
Engineering & Technology, 5(4). https://doi.org/10.51846/vol5iss4pp52-59

Kharwal, A. (2021). Car Price Prediction with Machine Learning. Retrieved from
https://thecleverprogrammer.com/2021/08/04/car-price-prediction-with-machine-learning/

Li, Z., Li, Q., Liu, Y., & Li, X. (2020). Research on used car price prediction based on SVM
optimized by genetic algorithm. In 2020 3rd International Conference on Robotics, Control and
Automation (ICRCA) (pp. 75-80). Wuhan, China. doi: 10.1109/ICRCA49248.2020.9289705.

Shao, W., & Liu, J. (2021). Car price prediction with deep learning. In 2021 International
Conference on Big Data and Blockchain (ICBDB) (pp. 172-178). Chengdu, China. doi:
10.1109/ICBDB52020.2021.00036.

Sowmya, P. S., Anu, G. K., & Joy, P. M. (2021). Analysis and prediction of used car prices using
machine learning techniques. In 2021 3rd International Conference on Inventive Computation
Technologies (ICICT) (pp. 1-6). Coimbatore, India. doi: 10.1109/ICICT51501.2021.9441436.

Thai, D. V., Son, L. N., Tien, P. V., Anh, N. N., & Anh, N. T. N. (2019). Prediction car prices
using quantify qualitative data and knowledge-based system. In 2019 11th International
Conference on Knowledge and Systems Engineering (KSE) (pp. 1-5). Da Nang, Vietnam. doi:
10.1109/KSE.2019.8919408.
