Inflated actual response times due to HTML-to-JSON parsing
in progress
Illia Zub
Today we deployed a speedup for our parsers that decreased parsing time from 1.5 seconds to 1 second for large HTML documents.
I'm glad to have the opportunity to improve the performance of Nokogiri, a widely used open-source library. Nokogiri v1.11.0 includes our patch to libxml2 that speeds up xmlStrlen. More details are in the Nokogiri release description.
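If the xmlStrlen speedup comes from avoiding per-byte counting in favor of an optimized length routine (an assumption on our part, not a description of the actual C patch), a minimal Python analogy illustrates the kind of win involved:

```python
import time

def manual_len(s):
    # count characters one at a time, like a naive byte-by-byte
    # string-length loop
    n = 0
    for _ in s:
        n += 1
    return n

s = "x" * 5_000_000

t0 = time.perf_counter()
slow = manual_len(s)
t1 = time.perf_counter()
fast = len(s)  # optimized built-in, analogous to a tuned libc strlen
t2 = time.perf_counter()

assert slow == fast == 5_000_000
print(f"loop: {t1 - t0:.3f}s, built-in: {t2 - t1:.6f}s")
```

On CPython the built-in is faster by orders of magnitude because the length is tracked in C; the same principle applies when a parser delegates string-length work to an optimized routine instead of looping in its own code.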
The next step is to speed up libxml2 even further or to create a Nokogiri-compatible Ruby wrapper around html5ever. A proof of concept shows a 60x performance increase.
Illia Zub
Yesterday we sped up JSON parsing by about 600 ms (from 3.9 seconds to 3.3 seconds) on large HTML responses such as Google Shopping with num=500. I expected a greater speedup. We still have other optimizations to make.
Joel Hull
Illia Zub thank you for the update. It will be great to see what you are able to do to make the parsing faster! :)
Illia Zub
Joel Hull: We deployed another performance improvement to our parsers this Monday. It decreased parsing time from 3-4 seconds to 1.8-2.3 seconds for large response pages like Google Shopping with num=500 and Google organic results with inline products. In rare cases, parsing time was up to 40 seconds; this is also fixed now.
The current performance bottleneck is the Nokogiri at_css method when the HTML size is 1.5 MB or more. We will address this with one of the following options:
* submit a PR to libxml2 with a performance improvement (Nokogiri uses this library to parse XML and HTML)
Illia Zub
in progress
Joel Hull Thanks for reporting this! We're planning to deploy a speedup of the JSON response this week.
PS. We submitted a PR to nokogiri (the underlying library we use for HTML parsing): https://github.com/sparklemotion/nokogiri/pull/2100. I assume this change will speed up HTML parsing for thousands of programmers. I'm excited about this. Thank you :-)
Joel Hull
You can reproduce the issue using the Python package google-search-results version 1.8.3 and the following code snippet:
import time
from serpapi import GoogleSearchResults
API_KEY = "add key here"
params = {
    "q": "lawn mower",
    "api_key": API_KEY,
    "location": "Austin, Texas",
    "device": "desktop",
    "hl": "en",
    "gl": "us",
    "num": "500",  # number of results
    "start": "0",  # pagination offset
    "tbm": "shop",
    "tbs": "p_ord:rv",  # last 24h
    "output": "json",
}
start = time.time()
search = GoogleSearchResults(params)
retval = search.get_dict()
etime = time.time() - start
print(f"search execution etime:{etime:0.3f}s")
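One way to interpret the timing measured above is to split it into server-side processing and everything else (network transfer, client-side JSON decoding). The sketch below assumes the response carries a search_metadata.total_time_taken field reporting server time in seconds; the field name and semantics are an assumption, so check your actual response:

```python
def split_timing(wall_clock_s, response):
    # Split measured wall-clock time into the server-reported portion
    # and the remainder (network transfer, client JSON decoding).
    # "search_metadata.total_time_taken" is an assumed field name.
    server = response.get("search_metadata", {}).get("total_time_taken")
    if server is None:
        return None
    return {
        "server_s": server,
        "other_s": round(max(wall_clock_s - server, 0.0), 3),
    }

# example with a stubbed response
resp = {"search_metadata": {"total_time_taken": 1.8}}
print(split_timing(2.3, resp))  # {'server_s': 1.8, 'other_s': 0.5}
```

Comparing the server-reported time against the wall-clock measurement makes it clearer whether a slow call is due to parsing on the server or to transfer of the large JSON payload.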
Elizabeth Oster
under review