Inflated actual response times due to HTML-to-JSON parsing
in progress
Illia Zub
Today we deployed a speedup for our parsers that decreased parsing time from 1.5 seconds to 1 second for large HTML documents.
I'm glad to have the opportunity to improve the performance of Nokogiri, a widely used open-source library. Nokogiri v1.11.0 includes our patch to libxml2 that speeds up xmlStrlen. More details are in the Nokogiri release description.
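If the xmlStrlen speedup comes from avoiding per-byte counting in favor of an optimized length routine (an assumption on our part, not a description of the actual C patch), a minimal Python analogy illustrates the kind of win involved:

```python
import time

def manual_len(s):
    # count characters one at a time, like a naive byte-by-byte
    # string-length loop
    n = 0
    for _ in s:
        n += 1
    return n

s = "x" * 5_000_000

t0 = time.perf_counter()
slow = manual_len(s)
t1 = time.perf_counter()
fast = len(s)  # optimized built-in, analogous to a tuned libc strlen
t2 = time.perf_counter()

assert slow == fast == 5_000_000
print(f"loop: {t1 - t0:.3f}s, built-in: {t2 - t1:.6f}s")
```

On CPython the built-in is faster by orders of magnitude because the length is tracked in C; the same principle applies when a parser delegates string-length work to an optimized routine instead of looping in its own code.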
The next step is to speed up libxml2 even further or to create a Nokogiri-compatible Ruby wrapper around html5ever. A proof of concept shows a 60x performance increase.
Illia Zub
Yesterday we sped up JSON parsing by about 600 ms (from 3.9 seconds to 3.3 seconds) on large HTML responses such as Google Shopping with num=500. I expected a greater speedup. We still have other optimizations to make.
Joel Hull
Illia Zub thank you for the update. It will be great to see what you are able to do to make the parsing faster! :)
Illia Zub
Joel Hull: We deployed another performance improvement to our parsers this Monday. It decreased parsing time from 3-4 seconds to 1.8-2.3 seconds for large response pages like Google Shopping with num=500 and Google organic results with inline products. In rare cases, parsing time was up to 40 seconds; this is also fixed now.
The current performance bottleneck is the Nokogiri at_css method when the HTML size is 1.5 MB or more. We will address this with one of the following options:
* submit a PR to libxml2 with a performance improvement (Nokogiri uses this library to parse XML and HTML)
Illia Zub
in progress
Joel Hull Thanks for reporting this! We're planning to deploy a speedup of the JSON response this week.
PS. We submitted a PR to nokogiri (the underlying library we use for HTML parsing): https://github.com/sparklemotion/nokogiri/pull/2100. I assume this change will speed up HTML parsing for thousands of programmers. I'm excited about this. Thank you :-)
Joel Hull
You can reproduce the issue using the Python package google-search-results version 1.8.3 and the following code snippet:
import time
from serpapi import GoogleSearchResults
API_KEY = "add key here"
params = {
    "q": "lawn mower",
    "api_key": API_KEY,
    "location": "Austin, Texas",
    "device": "desktop",
    "hl": "en",
    "gl": "us",
    "num": "500",  # number of results
    "start": "0",  # pagination offset
    "tbm": "shop",
    "tbs": "p_ord:rv",  # last 24h
    "output": "json",
}
start = time.time()
search = GoogleSearchResults(params)
retval = search.get_dict()
etime = time.time() - start
print(f"search execution etime:{etime:0.3f}s")
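One way to interpret the timing measured above is to split it into server-side processing and everything else (network transfer, client-side JSON decoding). The sketch below assumes the response carries a search_metadata.total_time_taken field reporting server time in seconds; the field name and semantics are an assumption, so check your actual response:

```python
def split_timing(wall_clock_s, response):
    # Split measured wall-clock time into the server-reported portion
    # and the remainder (network transfer, client JSON decoding).
    # "search_metadata.total_time_taken" is an assumed field name.
    server = response.get("search_metadata", {}).get("total_time_taken")
    if server is None:
        return None
    return {
        "server_s": server,
        "other_s": round(max(wall_clock_s - server, 0.0), 3),
    }

# example with a stubbed response
resp = {"search_metadata": {"total_time_taken": 1.8}}
print(split_timing(2.3, resp))  # {'server_s': 1.8, 'other_s': 0.5}
```

Comparing the server-reported time against the wall-clock measurement makes it clearer whether a slow call is due to parsing on the server or to transfer of the large JSON payload.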
Elizabeth Oster
under review