Yelp is a web application created to connect people with great local businesses. It is useful for finding restaurants, bars, and various types of services. The site features reviews of an active and well-informed local community.

In this post, I'll describe a scraping script that extracts information about the restaurants in San Francisco from Yelp.

Also, I present a data cleaning process and data analysis of restaurants by location.

Scraping data from yelp.com using requests and BeautifulSoup

In this scraping script, I will take the following information for each San Francisco's restaurant:

name
neighborhood
address
phone
reviews

Below you see the script used to scrape the data from Yelp. Check the comments in each line to get an understanding of how it works.

# Import packages
import requests
from bs4 import BeautifulSoup

# Request header
headers = {"User-Agent": "Chrome/68.0.3440.106"}

# Request parameters
parameters = {
    'find_desc': 'Restaurants',
    'find_loc':'San Francisco CA',
    'start':'0'}

# Yelp search URL
url='https://www.yelp.com/search'

# Get the number of pages return by request (Ex: 'Page 1 of 34') 
# Each page shows 30 items
response = requests.get(url, params=parameters)
# Create BeautifulSoup Object
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the number of pages
string_pages = soup.findAll('div', attrs={'class':'page-of-pages arrange_unit arrange_unit--fill'})[0].text
number_of_pages = int(string_pages.split('of ')[1])

# Variable that stores the data
data = []

# Start page
pagination = 0

# For each pagination extract the items
for i in range(number_of_pages):

    # Produces the request and saves the response
    response=requests.get(url, params=parameters)
    # Create a Soup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # count 30 pages
    #count = 1

    print('Extract data ... page {} from {}'.format(int(pagination/30)+1, number_of_pages))

    # Extract the info
    for a in soup.findAll('div', attrs={'class':'media-story'}):

        # Validate if the info exists
        try:
            address = a.find('address').text.strip()
        except:
            address = ''

        try:
            name = a.find('a', attrs={'class':'biz-name js-analytics-click'}).text
        except:
            name = ''

        try:
            reviews = a.find('span', attrs={'class':'review-count rating-qualifier'}).text.strip().split()[0]
        except:
            reviews = ''

        try:
            phone = a.find('span', attrs={'class':'biz-phone'}).text.strip()
        except:
            phone = ''

        try: 
            neighborhood = a.find('span', attrs={'class':'neighborhood-str-list'}).text.strip()
        except:
            neighborhood = ''

        # Saves info in a dictionary
        info = {
        'name' : name,
        'reviews' : reviews,
        'phone' : phone,
        'address' : address,
        'neighborhood' : neighborhood
        }
        # saves dictionary in the data variable
        data.append(info)

        # Each page shows 30 items, but the Soup Object find more (garbage)
        # To avoid some errors in processing, This code guarantees that only 30 items will be scraped
        #count += 1
        #if count == 30: break

    # Increment the start item ID (pagination)
    pagination+=30
    parameters['start'] = pagination

Extract data ... page 1 from 34
Extract data ... page 2 from 34
Extract data ... page 3 from 34
Extract data ... page 4 from 34
Extract data ... page 5 from 34
Extract data ... page 6 from 34
Extract data ... page 7 from 34
Extract data ... page 8 from 34
Extract data ... page 9 from 34
Extract data ... page 10 from 34
Extract data ... page 11 from 34
Extract data ... page 12 from 34
Extract data ... page 13 from 34
Extract data ... page 14 from 34
Extract data ... page 15 from 34
Extract data ... page 16 from 34
Extract data ... page 17 from 34
Extract data ... page 18 from 34
Extract data ... page 19 from 34
Extract data ... page 20 from 34
Extract data ... page 21 from 34
Extract data ... page 22 from 34
Extract data ... page 23 from 34
Extract data ... page 24 from 34
Extract data ... page 25 from 34
Extract data ... page 26 from 34
Extract data ... page 27 from 34
Extract data ... page 28 from 34
Extract data ... page 29 from 34
Extract data ... page 30 from 34
Extract data ... page 31 from 34
Extract data ... page 32 from 34
Extract data ... page 33 from 34
Extract data ... page 34 from 34

Here is some part of the data extract:

data[:5]

[{'name': 'Little Baobab',
  'reviews': '544',
  'phone': '(415) 643-3558',
  'address': '3380 19th St',
  'neighborhood': 'Mission'},
 {'name': 'The House',
  'reviews': '4354',
  'phone': '(415) 986-8612',
  'address': '1230 Grant Ave',
  'neighborhood': 'North Beach/Telegraph Hill'},
 {'name': 'Fog Harbor Fish House',
  'reviews': '4517',
  'phone': '(415) 421-2442',
  'address': 'Pier 39',
  'neighborhood': "Fisherman's Wharf"},
 {'name': 'Gary Danko',
  'reviews': '4874',
  'phone': '(415) 749-2060',
  'address': '800 N Point St',
  'neighborhood': 'Russian Hill'},
 {'name': 'Marlowe',
  'reviews': '2485',
  'phone': '(415) 777-1413',
  'address': '500 Brannan St',
  'neighborhood': 'Mission Bay'}]

Now let's convert the data to a DataFrame object:

from pandas import DataFrame

restaurants = DataFrame(data)

The results are:

# Show the ten firsts items
restaurants.head(10)

	address	name	neighborhood	phone	reviews
0	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
1	1230 Grant Ave	The House	North Beach/Telegraph Hill	(415) 986-8612	4354
2	Pier 39	Fog Harbor Fish House	Fisherman's Wharf	(415) 421-2442	4517
3	800 N Point St	Gary Danko	Russian Hill	(415) 749-2060	4874
4	500 Brannan St	Marlowe	Mission Bay	(415) 777-1413	2485
5	3870 17th St	Frances	Castro	(415) 621-3870	1416
6	974 Valencia St	Loló	Mission	(415) 643-5656	1868
7	1382 9th Ave	Ushi Taro	Inner Sunset	(415) 702-6014	39
8	652 Polk St	Brenda’s French Soul Food	Tenderloin	(415) 345-8100	9545
9	1782 Haight St	What The Cluck SF	The Haight	(415) 702-6777	53

Data Cleaning

Before saves the data into a file, let's check if there is any null, empty or duplicate value:

# Check if there is any null value in DataFrame
restaurants.isnull().any().any()

False

# Check if there is any empty value in DataFrame
no_address = len(restaurants[restaurants['address'] == ''])
no_name = len(restaurants[restaurants['name'] == ''])
no_neighborhood = len(restaurants[restaurants['neighborhood'] == ''])
no_phone = len(restaurants[restaurants['phone'] == ''])
no_reviews = len(restaurants[restaurants['reviews'] == ''])

print('Column \t\t empty values')
print('address \t ', no_address)
print('name \t\t ', no_name)
print('neighborhood \t ', no_neighborhood)
print('phone \t\t ', no_phone)
print('reviews \t ', no_reviews)

Column          empty values
address       122
name           99
neighborhood       106
phone           119
reviews       99

As presented above, there is a lot of empty values in the data. This happens because the Yelp site does not have all the information for some restaurants.

Let's remove these items:

restaurants = restaurants[restaurants['address'] != '']
restaurants = restaurants[restaurants['name'] != '']
restaurants = restaurants[restaurants['neighborhood'] != '']
restaurants = restaurants[restaurants['phone'] != '']
restaurants = restaurants[restaurants['reviews'] != '']

# Check if there is any duplicate value in DataFrame
restaurants[restaurants.duplicated()]

	address	name	neighborhood	phone	reviews
34	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
68	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
102	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
136	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
170	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
204	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
238	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
272	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
273	600 Guerrero St	Tartine Bakery & Cafe	Mission	(415) 487-2600	7295
306	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
340	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
374	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
408	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
409	903 Cortland Ave	Nute’s	Bernal Heights	(415) 260-0192	127
442	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
476	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
510	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
544	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
578	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
579	3673 Sacramento St	Magic Flute Ristorante	Presidio Heights	(415) 922-1225	510
612	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
646	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
680	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
681	3435 Mission St	El Buen Comer	Bernal Heights	(415) 817-1542	232
714	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
748	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
782	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
816	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
850	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
884	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
918	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
952	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
986	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
1020	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
1054	3380 19th St	Little Baobab	Mission	(415) 643-3558	544
1088	3380 19th St	Little Baobab	Mission	(415) 643-3558	544

So there are various duplicated items. Most of them are about the Little Baobab restaurant, this happens because Yelp can promote some restaurants (Advertisement), so these promoted restaurants are shown on all pages, and the scraping script captures them several times.

Let's remove duplicate items:

restaurants.drop_duplicates(inplace=True)

For a better presentation, let's change the order of the columns in our dataset.

restaurants = restaurants[['name', 'neighborhood', 'address', 'phone', 'reviews']]
restaurants.head(5)

	name	neighborhood	address	phone	reviews
0	Little Baobab	Mission	3380 19th St	(415) 643-3558	544
1	The House	North Beach/Telegraph Hill	1230 Grant Ave	(415) 986-8612	4354
2	Fog Harbor Fish House	Fisherman's Wharf	Pier 39	(415) 421-2442	4517
3	Gary Danko	Russian Hill	800 N Point St	(415) 749-2060	4874
4	Marlowe	Mission Bay	500 Brannan St	(415) 777-1413	2485

In the end, we have 949 restaurants in our dataset.

The last step is to convert the columns to the correct type.

restaurants.dtypes

name            object
neighborhood    object
address         object
phone           object
reviews         object
dtype: object

We can see that the reviews column is treated as a string type (object), let's convert it to a numeric type (int):

restaurants['reviews'] = restaurants['reviews'].astype(int)
restaurants.dtypes

name            object
neighborhood    object
address         object
phone           object
reviews          int32
dtype: object

Finally, let's save the data into a CSV file:

restaurants.to_csv('datasets/yelp_restaurants_sanFranscisco.csv', index=False)

Data Analysis

Now I present a simple analysis of the restaurant's data.

Top 10 reviewed restaurants

Let's see the ten restaurants with more reviews:

import matplotlib.pyplot as plt
import numpy as np
% matplotlib inline

# Get top 10 restaurants
top10 = restaurants[['name', 'reviews']].sort_values('reviews',ascending=False).head(10).sort_values('reviews')

# Plot 
top10.plot.barh(legend=False)
plt.yticks(np.arange(10), top10.name.values)
plt.ylabel('RESTAURANT')
plt.xlabel('REVIEWS')
plt.show()

output_26_0

Brenda's French Soul Food is the restaurant with more reviews.

Neighborhood with more restaurants

Let's take the 10 neighborhoods with more restaurants in San Francisco.

top_neighbor = restaurants['neighborhood'].value_counts().head(10)
top_neighbor.plot.barh(figsize=(7,5))
plt.xlabel('NUMBER OF RESTAURANTS')
plt.ylabel('NEIGHBORHOOD')
plt.show()

output_29_0

The Mission is a neighborhood with more restaurants.

But if we consider all restaurants, the proportion (%) is:

(restaurants['neighborhood'].value_counts(normalize=True).head(10)*100).plot.barh(figsize=(7,5))
plt.xlabel('PROPORTION (%)')
plt.ylabel('NEIGHBORHOOD')
plt.show()

output_31_0